PediPet
© Claus Thorn Ekstrøm 1998-1999
This document was LATEX'd on Jan 22, 1999
and covers PediPet v0.11g
1 Introduction
PediPet is a small program for assisting the researcher when analyzing genetic data.
Currently PediPet runs on DOS and UNIX (Linux and Solaris for Sparc Stations) and should
be easily portable to almost any other architecture. PediPet is written in C, is free software, and
comes with no guarantee whatsoever; use at your own risk:
your data might go bad, your troubles might increase and your computer might self-combust.
1.1 Features
- Uses standard Linkage pedigree format
- Recoding of alleles
- Single marker allele frequency estimation
- Single marker and multi point Haseman-Elston (see footnote 1)
- Simulation of a single genetic marker and a genetic effect
2 Importing data into PediPet
As of Jan 22, 1999, PediPet reads files in the linkage pedigree file format
containing only marker data.
The data should be a standard ASCII file with each line representing an individual and columns
representing variables. The data file must have at least five columns corresponding to
family id (an identifier for each family), person id (a unique number), id of father (0 if father is not in the dataset), id of mother (0 if mother is not in the dataset) and
sex (male is scored as 1 and female as 2).
Missing genotype data should be scored as two zeroes, and it is not allowed to have information
about just a single allele for a marker - either information on both or none must be present.
For example, the first five columns of a data file could look like this:
1 1 0 0 1 ...
1 2 0 0 2 ...
1 3 1 2 2 ...
1 4 1 2 1 ...
1 5 1 2 1 ...
The linkage program does not require the person id to be a unique number, only that it be
unique within each family. When the option XXXX is set (see section 8), PediPet automatically converts all ids it reads to the form FAM-ID, where FAM
is the family id and ID is the within-family id.
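The file layout described above can be sketched in a few lines of Python. This is only an illustration, not PediPet's own reader (PediPet is written in C). The names `Person` and `parse_pedigree_line` and the flag `unique_ids` are made up for this example, and leaving a parent id of 0 untouched when building FAM-ID identifiers is an assumption.

```python
from typing import NamedTuple


class Person(NamedTuple):
    family: str
    ident: str
    father: str      # "0" means the father is not in the dataset
    mother: str      # "0" means the mother is not in the dataset
    sex: int         # 1 = male, 2 = female
    genotypes: list  # one (allele1, allele2) tuple per marker


def parse_pedigree_line(line, unique_ids=False):
    """Parse one whitespace-separated line of a linkage-style pedigree file."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("need at least five columns")
    family, ident, father, mother = fields[:4]
    sex = int(fields[4])
    alleles = [int(x) for x in fields[5:]]
    if len(alleles) % 2 != 0:
        raise ValueError("marker alleles must come in pairs")
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    for a, b in genotypes:
        # Missing data must be "0 0": both alleles known or both missing.
        if (a == 0) != (b == 0):
            raise ValueError("genotype with only one allele missing")
    if unique_ids:
        # Assumption: the parent code "0" stays "0" and is not prefixed.
        def qualify(x):
            return x if x == "0" else f"{family}-{x}"
        ident, father, mother = qualify(ident), qualify(father), qualify(mother)
    return Person(family, ident, father, mother, sex, genotypes)
```

With `unique_ids=True`, person 3 in family 1 becomes `1-3` and the parents become `1-1` and `1-2`.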
Furthermore, the data must follow these simple rules:
- A person must have either both parents or neither parent present in the dataset. It is not allowed to
have, say, only the father and not the mother. If data are available on a person and a
single parent, the unobserved parent must be included in the dataset to complete the
pedigree (with the right family id and sex but with missing values for all other
observations).
- The file must contain at least 5 columns: family-id, id, father, mother and sex.
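The both-or-neither rule for parents can be checked mechanically. The following is a minimal sketch under the assumption that each individual is held as a `(family, id, father, mother)` tuple of strings, with "0" for an absent parent; the function name `parent_rule_violations` is hypothetical and not part of PediPet.

```python
def parent_rule_violations(people):
    """Return (family, id, reason) tuples for individuals that break the rule."""
    people = list(people)
    known = {(fam, ident) for fam, ident, _, _ in people}
    problems = []
    for fam, ident, father, mother in people:
        # Exactly one parent given: not allowed.
        if (father == "0") != (mother == "0"):
            problems.append((fam, ident, "only one parent listed"))
            continue
        # A listed parent must have a line of its own in the same family.
        for parent in (father, mother):
            if parent != "0" and (fam, parent) not in known:
                problems.append((fam, ident, f"parent {parent} has no line of its own"))
    return problems
```

An empty result means every individual has either both parents or neither parent in the dataset.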
Quantitative traits ...
Missing values for quantitative traits are scored as '.' (the single character period).
3 Consistency checking
PediPet checks consistency of the pedigrees in the dataset by going through the following steps:
- Checking that there is no person in the dataset with unknown sex.
- Checking that fathers are indeed male and mothers are indeed female.
- Checking that spouses are of different sexes.
- Checking that, in each offspring generation, at most 4 different alleles exist among full sibs
if both parents are heterozygous, at most 3 if one parent is heterozygous and the other
is homozygous, and at most 2 if both parents are homozygous for the marker.
Before using any of the routines in the program, the dataset should be checked for inconsistencies, as any inconsistency might cause the various functions to miscalculate, hang or possibly even
make the program core dump (see footnote 2).
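The checks listed above could look roughly like the following. This is a sketch, not PediPet's implementation: the data layout (a dict from id to `(father, mother, sex)`) and all function names are assumptions, and an untyped parent (genotype `0 0`) is assumed to allow the same number of alleles as a heterozygous one.

```python
def sex_errors(people):
    """people: dict mapping id -> (father_id, mother_id, sex); "0" = absent."""
    errors = []
    for ident, (father, mother, sex) in people.items():
        if sex not in (1, 2):
            errors.append(f"{ident}: unknown sex")
        if father in people and people[father][2] != 1:
            errors.append(f"{ident}: father {father} is not male")
        if mother in people and people[mother][2] != 2:
            errors.append(f"{ident}: mother {mother} is not female")
    return errors


def allowed_sib_alleles(father_gt, mother_gt):
    """Upper bound from the rule above: 2, 3 or 4 distinct alleles."""
    def contribution(gt):
        if 0 in gt:
            return 2  # assumption: an untyped parent may be heterozygous
        return len(set(gt))
    return contribution(father_gt) + contribution(mother_gt)


def too_many_sib_alleles(sib_genotypes, father_gt, mother_gt):
    """True if full sibs carry more distinct alleles than the parents allow."""
    seen = {a for gt in sib_genotypes for a in gt if a != 0}
    return len(seen) > allowed_sib_alleles(father_gt, mother_gt)
```

Checking that fathers are male and mothers are female also covers the spouse check whenever a couple appears as the parents of some individual.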
4 Allele frequency estimation
PediPet can estimate the allele frequency at a given marker for general pedigrees
by maximum likelihood as described in [Boehnke, 1991] and [Lange, 1997].
Before calculating the allele frequencies, the dataset should be checked for consistency and
allele numbers should be reduced. The routine is not terribly fast, and choosing a sensible starting point can seriously reduce the required number of iterations.
A good starting point would be the allele frequencies based on data from the founders.
The method assumes that the founder population is in Hardy-Weinberg equilibrium
and uses a not too elegant "brute force and ignorance" approach to maximizing the likelihood.
It works, but could be - and maybe even will be at some later date - sped up.
Since the maximization is only guaranteed to find a local maximum and not a global one, it is generally a good idea to start the routine from several different starting points, and to restart the routine at the maximum found to see if it changes.
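The founder-based starting point mentioned above amounts to simple allele counting. A minimal sketch follows; it is not the maximum likelihood routine itself. The input layout (tuples of father id, mother id and a genotype pair) and the name `founder_allele_frequencies` are assumptions made for the example.

```python
from collections import Counter


def founder_allele_frequencies(people):
    """people: iterable of (father_id, mother_id, (allele1, allele2)).

    Founders are individuals with both parents coded "0". Untyped founders
    (genotype 0 0) are skipped.
    """
    counts = Counter()
    for father, mother, (a, b) in people:
        is_founder = father == "0" and mother == "0"
        if is_founder and a != 0 and b != 0:
            counts[a] += 1
            counts[b] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no typed founders in the dataset")
    return {allele: n / total for allele, n in sorted(counts.items())}
```

The resulting frequencies sum to 1 and can be used as one of several starting points for the maximization.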
EXAMPLE 4.1.
For chromosome 20...
4.1 Improving computational speed
Since allele frequency estimation based on founders alone is pretty simple and very quick, this section
deals with improving the speed of estimating the allele frequencies by ML.
Choosing a good starting point can seriously reduce the number of iterations needed for finding the maximum.
Pedigrees
5 Single marker QTL analysis
This section deals with methods implemented in PediPet to analyze quantitative traits based
on data from a single genetic marker.
5.1 Haseman-Elston
Haseman-Elston is a non-parametric method of searching for QTLs that uses data on either full or half sibs and compares the sibs' differences in a trait.
PediPet calculates the single marker IBD score for all pairs of full sibs and outputs the
relevant data as a simple ASCII file for regression analysis in a standard statistical package like SAS, R or S+. The pairwise IBD scores are calculated using information from the parents when available, as described in [S.A.G.E, 1994].
When calculating IBD scores, the results are fairly sensitive to the population frequency of
the marker alleles. Therefore precise estimates of the population frequency
(from another study or alternatively from a maximum likelihood estimation on the same dataset)
should be entered before calculating the IBD scores.
The program gives a t-test statistic for whether the average IBD score among full sibs differs significantly from 1/2 (the expected IBD score of two full sibs), and for whether the
observed IBD scores for half sibs differ from 1/4.
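To make the two calculations concrete, here is a small sketch of a one-sample t statistic for the mean IBD score and of the classical Haseman-Elston regression, in which the squared trait difference of each sib pair is regressed on the pair's IBD score (a negative slope points towards linkage). PediPet itself only writes the data out for an external package, so this merely stands in for that package. The function names are hypothetical, and the IBD scores are assumed to be given as proportions between 0 and 1.

```python
import math
from statistics import mean, stdev


def ibd_t_statistic(ibd_scores, expected=0.5):
    """One-sample t statistic: is the mean IBD score different from `expected`?"""
    n = len(ibd_scores)
    standard_error = stdev(ibd_scores) / math.sqrt(n)
    return (mean(ibd_scores) - expected) / standard_error


def haseman_elston_slope(trait_pairs, ibd_scores):
    """Least-squares slope of squared sib-pair trait differences on IBD scores."""
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x_bar, y_bar = mean(ibd_scores), mean(y)
    sxx = sum((x - x_bar) ** 2 for x in ibd_scores)
    sxy = sum((x - x_bar) * (v - y_bar) for x, v in zip(ibd_scores, y))
    return sxy / sxx
```

For half sibs the same t statistic would be called with `expected=0.25`.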
5.2 Variance components
The idea behind the variance component approach is more or less the same as for Haseman-Elston:
if a given marker has an effect, two persons who are
"genetically alike" are assumed to be more likely to have the same trait value than two persons who are not
"genetically alike".
For the Haseman-Elston method this was ...
PediPet can estimate variance components for full sibs in the simple mixed model with no fixed effects:
where $\sigma_A^2$ and $\sigma_I^2$ are the variance components, A is the matrix of IBD scores (calculated the same way as for Haseman-Elston) and I is the n×n identity matrix.
and calculate the
The above notation is not correct. Actually it should be y = a + e, where the covariance matrix
of a is A etc.
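One possible reading of the corrected notation, writing the two variance components as $\sigma_A^2$ and $\sigma_I^2$, is given below. It assumes that the effects a and e are independent, which the text does not state explicitly, so it should be taken as a sketch rather than as the model PediPet actually fits.

```latex
y = a + e, \qquad
\operatorname{Cov}(a) = \sigma_A^2 A, \qquad
\operatorname{Cov}(e) = \sigma_I^2 I,
\qquad\text{so that}\qquad
\operatorname{Cov}(y) = \sigma_A^2 A + \sigma_I^2 I .
```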
6 Multipoint QTL analysis
Where the single marker analysis uses the information from a single marker at a time to examine the ....
whether a marker ... influences a trait, multipoint QTL analysis uses the marker data from all
markers on the chromosome to ...
6.1 Haseman-Elston
XXX suggested a method for improving the statistical power of the Haseman-Elston method
and estimating both the QTL position and the effect of the putative QTL
by using information from more than a single marker.
6.2 Variance components
7 Simulating data
PediPet can be used for simulating data from nuclear families of the same size (i.e., each family has a
set of parents and the same number of full sibs).
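The marker part of such a simulation can be sketched as follows: both parents draw two alleles from the population frequencies, and each child inherits one randomly chosen allele from each parent. This is an illustration only. The function names, the id numbering and the random choice of the children's sex are assumptions, and the genetic effect on a trait is left out because the manual does not describe how PediPet simulates it.

```python
import random


def draw_allele(freqs, rng):
    """Draw one allele according to the population frequencies in `freqs`."""
    alleles = list(freqs)
    return rng.choices(alleles, weights=[freqs[a] for a in alleles])[0]


def simulate_nuclear_families(n_families, n_sibs, freqs, seed=None):
    """Return pedigree lines: family, id, father, mother, sex, allele1, allele2."""
    rng = random.Random(seed)
    lines = []
    for fam in range(1, n_families + 1):
        father = (draw_allele(freqs, rng), draw_allele(freqs, rng))
        mother = (draw_allele(freqs, rng), draw_allele(freqs, rng))
        lines.append(f"{fam} 1 0 0 1 {father[0]} {father[1]}")
        lines.append(f"{fam} 2 0 0 2 {mother[0]} {mother[1]}")
        for k in range(n_sibs):
            # Mendelian transmission: one allele from each parent.
            child = (rng.choice(father), rng.choice(mother))
            sex = rng.choice((1, 2))
            lines.append(f"{fam} {k + 3} 1 2 {sex} {child[0]} {child[1]}")
    return lines
```

Passing a fixed `seed` makes the simulated dataset reproducible.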
8 Options
This section describes the various options and their effects when running the program.
All options are written in the file options.pp (a standard ASCII file). Anything on a line after a semicolon ';' is regarded as a comment and is ignored by the program.
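The comment rule can be illustrated with a few lines of Python. Only the semicolon rule comes from this manual; treating each remaining non-empty line as one option, and the name `read_option_lines`, are assumptions made for the sketch.

```python
def read_option_lines(text):
    """Strip ';' comments and blank lines from the contents of an options file."""
    options = []
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop everything after the first ';'
        if line:
            options.append(line)
    return options
```

For instance, a line reading `reducealleles ; recode on import` would be reduced to `reducealleles`.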
reducealleles
Automatically recode all alleles when importing.
Marker alleles are given numbers from 1 and up, with 1 corresponding to the lowest
original marker value, 2 to the second-lowest etc.
Generally alleles should be reduced for most of the analysis functions to work since they
assume alleles are numbered from 1 and up.
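The recoding described for reducealleles can be sketched like this. It is a minimal illustration for a single marker, assuming genotypes are held as pairs of integers and that the missing code 0 is left unchanged; the function name `recode_alleles` is hypothetical.

```python
def recode_alleles(genotypes):
    """Renumber alleles 1, 2, ... in order of their original values.

    Returns the recoded genotypes and the mapping from old to new codes.
    """
    observed = sorted({a for gt in genotypes for a in gt if a != 0})
    code = {allele: i for i, allele in enumerate(observed, start=1)}
    code[0] = 0  # assumption: missing alleles stay coded as 0
    recoded = [(code[a], code[b]) for a, b in genotypes]
    return recoded, code
```

Keeping the mapping makes it possible to translate results back to the original allele labels.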
9 To do
Here's a list of the features I'd like to add to the program besides finishing off the
functions I've already started on. They'll be added in the order I need them myself:
- Reading the linkage parameter file
(I need this badly so this will probably be implemented soon). This will also give me an
easy way to read traits.
- TDT-tests for binary and quantitative traits
- Checking for most probable location of genotyping error in the dataset
- Updating the allele frequency estimation routine to use partial derivatives
References
- [Boehnke, 1991] Boehnke, M. (1991). Allele frequency estimation from data on relatives. Am. J. Hum. Genet., 48:22-25.
- [Lange, 1997] Lange, K. (1997). Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag, New York.
- [Lynch and Walsh, 1998] Lynch, M. and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates.
- [S.A.G.E, 1994] S.A.G.E (1994). Statistical Analysis for Genetic Epidemiology, Release 2.2.
Index (showing section)
- Hardy-Weinberg, 4
- Haseman-Elston, 5.1
  - multipoint, 6.1
- marker
  - missing values, 2
- options
  - reducealleles, 8
- quantitative trait
  - missing values, 2
- simulate
  - nuclear families, 7
Footnotes:
1 To be completely implemented sometime soon. Right now it can more or less only produce output for single marker H-E that is directly usable in any statistical package like SAS/Splus.
2 At a later date this consistency checking will be enhanced and
will include checking for Mendelian inheritance.