PediPet

© Claus Thorn Ekstrøm 1998-1999

This document was LaTeX'd on Jan 22, 1999 and covers PediPet v0.11g

1  Introduction

PediPet is a small program for assisting the researcher when analyzing genetic data.

Currently PediPet runs on DOS and UNIX (Linux and Solaris for SPARC) and should be easily portable to almost any other architecture. PediPet is written in C, is free software, and comes with no guarantee whatsoever. Use it at your own risk: your data might go bad, your troubles might increase, and your computer might self-combust.

1.1  Features

2  Importing data into PediPet

As of Jan 22, 1999, PediPet reads files in the LINKAGE pedigree file format containing only marker data. The data should be a standard ASCII file in which each line represents an individual and each column a variable. The data file must have at least five columns: family id (an identifier for each family), person id (a unique number), id of father (0 if the father is not in the dataset), id of mother (0 if the mother is not in the dataset) and sex (1 for male, 2 for female).

Missing genotype data should be scored as two zeroes. Information about just a single allele of a marker is not allowed: either both alleles or neither must be present.

The data must start with the following five columns (family id, person id, father id, mother id, sex), for example:

1 1 0 0 1 ...
1 2 0 0 2 ...
1 3 1 2 2 ...
1 4 1 2 1 ...
1 5 1 2 1 ...
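The record layout described above can be sketched as a small reader. This is an illustrative Python sketch, not PediPet's own C code; the function name `parse_pedigree_line` and the assumption that the marker columns follow the five id columns as pairs of integer alleles are mine.

```python
def parse_pedigree_line(line):
    """Split one whitespace-separated record into id fields and marker genotypes."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("a record needs at least five columns")
    family, person, father, mother = fields[:4]
    sex = int(fields[4])  # 1 = male, 2 = female
    # Assumption: everything after the sex column is marker data, two alleles per marker.
    alleles = [int(x) for x in fields[5:]]
    if len(alleles) % 2 != 0:
        raise ValueError("marker columns must come in pairs of alleles")
    genotypes = []
    for a, b in zip(alleles[0::2], alleles[1::2]):
        # A missing genotype is two zeroes; a single known allele is not allowed.
        if (a == 0) != (b == 0):
            raise ValueError("a genotype must have both alleles or none")
        genotypes.append(None if a == 0 else (a, b))
    return {"family": family, "person": person, "father": father,
            "mother": mother, "sex": sex, "genotypes": genotypes}
```

A missing genotype is returned as `None`, so later steps can tell "not typed" apart from a real allele pair.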

The LINKAGE program does not require the person id to be unique across the whole dataset, only within each family. When the option XXXX is set (see section 8), PediPet automatically converts every id it reads to the form FAM-ID, where FAM is the family id and ID is the within-family id.
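The FAM-ID conversion amounts to a one-line string operation. The Python sketch below is only an illustration; the name `qualify_id` is hypothetical, and leaving the parent code 0 unchanged is my assumption, since the text does not say how unknown parents are treated.

```python
def qualify_id(family_id, person_id):
    """Return the combined FAM-ID form of a within-family id."""
    # Assumption: 0 means "parent not in the dataset" and is kept as-is.
    if str(person_id) == "0":
        return "0"
    return f"{family_id}-{person_id}"
```

With this, person 3 in family 1 becomes `1-3`, which is unique across families even if another family also has a person 3.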

Furthermore, the data must follow these simple rules:

Quantitative traits ... Missing values for quantitative traits are scored as '.' (a single period).

3  Consistency checking

PediPet checks consistency of the pedigrees in the dataset by going through the following steps:

  1. Checking that there is no person in the dataset with unknown sex.
  2. Checking that fathers are indeed male and mothers are indeed female.
  3. Checking that spouses are of different sexes.
  4. Checking that, in each offspring generation, there are at most 4 different alleles among full sibs if both parents are heterozygous, at most 3 if one parent is heterozygous and the other homozygous, and at most 2 if both parents are homozygous for the marker.
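The steps above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not PediPet's implementation: the record layout (a dict from id to a record with `father`, `mother` and `sex`) and the function names are mine.

```python
def check_sexes(people):
    """Steps 1-3: return a list of problems found in a dict id -> record."""
    problems = []
    for pid, rec in people.items():
        if rec["sex"] not in (1, 2):
            problems.append(f"{pid}: unknown sex")
        # Parents coded 0 are not in the dataset, so .get() returns None for them.
        father = people.get(rec["father"])
        mother = people.get(rec["mother"])
        if father is not None and father["sex"] != 1:
            problems.append(f"{pid}: father {rec['father']} is not male")
        if mother is not None and mother["sex"] != 2:
            problems.append(f"{pid}: mother {rec['mother']} is not female")
    # If every father is male and every mother is female, spouses differ in sex.
    return problems


def sibship_allele_limit(father_gt, mother_gt):
    """Step 4: 4, 3 or 2 alleles depending on how many parents are heterozygous."""
    return len(set(father_gt)) + len(set(mother_gt))


def check_sibship(father_gt, mother_gt, sib_gts):
    """True if the full sibs carry no more distinct alleles than the limit allows."""
    seen = {allele for gt in sib_gts for allele in gt}
    return len(seen) <= sibship_allele_limit(father_gt, mother_gt)
```

Note that step 4 is only a count check. It does not verify that each child's alleles actually come from the parents.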

Before using any of the routines in the program, the dataset should be checked for inconsistencies, since any inconsistency might cause the various functions to miscalculate, hang or possibly even make the program dump core (see footnote 2).

4  Allele frequency estimation

PediPet can estimate the allele frequency at a given marker for general pedigrees by maximum likelihood as described in [Boehnke, 1991] and [Lange, 1997]. Before calculating the allele frequencies, the dataset should be checked for consistency and allele numbers should be reduced. The routine is not terribly fast, and choosing a sensible starting point can substantially reduce the required number of iterations. A good starting point would be the allele frequencies based on data from the founders.
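A founder-based starting point can be obtained by simple allele counting. The Python sketch below is illustrative only; the record layout (dicts with `father`, `mother` and a `genotypes` list, with `None` for a missing genotype) and the function name are assumptions, not PediPet's data structures.

```python
from collections import Counter


def founder_allele_frequencies(people, marker):
    """Count alleles at one marker among founders (both parents coded '0')."""
    counts = Counter()
    for rec in people:
        genotype = rec["genotypes"][marker]
        is_founder = rec["father"] == "0" and rec["mother"] == "0"
        if is_founder and genotype is not None:
            counts.update(genotype)  # adds both alleles of the genotype
    total = sum(counts.values())
    if total == 0:
        return {}  # no typed founders at this marker
    return {allele: n / total for allele, n in counts.items()}
```

Non-founders are skipped because their alleles are copies of founder alleles and would be counted twice.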

The method assumes that the founder population is in Hardy-Weinberg equilibrium and uses a not too elegant "brute force and ignorance" approach to maximizing the likelihood. It works, but could be sped up, and maybe even will be at some later date. Since the maximization is only guaranteed to find a local maximum, not a global one, it is generally a good idea to start the routine from several different starting points, and always to restart it at the maximum found to see whether the result changes.
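Two pieces of this can be sketched in a few lines of Python: the Hardy-Weinberg genotype probabilities for founders, and the habit of keeping the best result over several starting points. This is only an illustration. It covers the founder part of the likelihood, not the full pedigree likelihood, and `local_maximize` is a hypothetical stand-in for whatever maximizer is used, assumed to return a `(point, value)` pair.

```python
import math


def hwe_genotype_probability(genotype, freq):
    """P(genotype) under Hardy-Weinberg: p^2 for homozygotes, 2pq otherwise."""
    a, b = genotype
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def founder_log_likelihood(genotypes, freq):
    """Log-likelihood of independent founder genotypes at one marker."""
    return sum(math.log(hwe_genotype_probability(g, freq)) for g in genotypes)


def best_of_restarts(local_maximize, starts):
    """Run a local maximizer from each start and keep the best (point, value)."""
    results = [local_maximize(start) for start in starts]
    return max(results, key=lambda result: result[1])
```

Restarting at the best point found is then just one more call to `local_maximize` with that point as the start.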

EXAMPLE 4.1. For chromosome 20...

4.1  Improving computational speed

Since allele frequency estimation based on founders alone is pretty simple and very quick, this section deals with improving the speed of estimating the allele frequencies by ML. Choosing a good starting point can seriously reduce the number of iterations needed for finding the maximum.

Pedigrees

5  Single marker QTL analysis

This section deals with methods implemented in PediPet to analyze quantitative traits based on data from a single genetic marker.

5.1  Haseman-Elston

Haseman-Elston is a non-parametric method of searching for QTLs by using data on either full or half sibs and comparing differences for a trait. PediPet calculates the single marker IBD score for all pairs of full sibs and outputs the relevant data as a simple ASCII file for use in regression analysis by a standard statistical package like SAS, R or S+. The pairwise IBD scores are calculated using information from the parents when available, as described in [S.A.G.E, 1994].
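For orientation, the regression that a statistical package would then run in the classical Haseman-Elston method is an ordinary least squares fit of the squared trait difference of each sib pair on its IBD score. The Python sketch below illustrates that fit only. It is not part of PediPet, and the function name and input layout are assumptions.

```python
def haseman_elston_fit(pairs):
    """pairs: list of (ibd_score, trait_sib1, trait_sib2).

    Returns (intercept, slope) of squared trait difference regressed on IBD score.
    """
    x = [ibd for ibd, _, _ in pairs]
    y = [(t1 - t2) ** 2 for _, t1, t2 in pairs]
    n = len(pairs)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx  # undefined if all IBD scores are equal
    return mean_y - slope * mean_x, slope
```

A clearly negative slope, meaning that pairs sharing more alleles IBD have more similar trait values, is what suggests a QTL near the marker.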

When calculating IBD scores, the results are fairly sensitive to the population frequency of the marker alleles. Therefore precise estimates of the population frequency (from another study or alternatively from a maximum likelihood estimation on the same dataset) should be entered before calculating the IBD scores.
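The simplest case, in which both parents are genotyped so that no population frequencies are needed, can be sketched by enumerating the 16 equally likely ways the parents can transmit alleles to two sibs and averaging the IBD count over the transmissions consistent with the sibs' genotypes. This Python sketch is an illustration under that assumption and is not necessarily the procedure of [S.A.G.E, 1994]; the function name is mine.

```python
from itertools import product


def sib_pair_ibd(father, mother, sib1, sib2):
    """Expected proportion of alleles shared IBD by two full sibs.

    Each genotype is a pair of alleles; both parents must be genotyped.
    """
    total_shared = 0
    consistent = 0
    # f1, m1: which paternal/maternal allele (0 or 1) sib 1 received; f2, m2 for sib 2.
    for f1, m1, f2, m2 in product(range(2), repeat=4):
        g1 = sorted((father[f1], mother[m1]))
        g2 = sorted((father[f2], mother[m2]))
        if g1 == sorted(sib1) and g2 == sorted(sib2):
            consistent += 1
            total_shared += (f1 == f2) + (m1 == m2)
    if consistent == 0:
        raise ValueError("sib genotypes are inconsistent with the parents")
    return total_shared / (2 * consistent)
```

When the parents are uninformative (for example both homozygous for the same allele) the result falls back to the prior value 1/2. When a parent is untyped, the enumeration would also have to run over possible parental genotypes weighted by the allele frequencies, which is where the sensitivity described above comes from.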

The program gives a t-test statistic for whether the average IBD score among full sibs differs significantly from 1/2 (the expected IBD score of two full sibs), and for whether the average IBD score among half sibs differs from 1/4.
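The statistic in question is the usual one-sample t statistic. A minimal Python sketch follows, assuming the IBD scores are treated as independent observations; the function name is hypothetical and this is not PediPet's own code.

```python
import math
import statistics


def ibd_t_statistic(scores, expected=0.5):
    """One-sample t statistic for mean IBD score against its expected value.

    Use expected=0.5 for full sibs and expected=0.25 for half sibs.
    """
    n = len(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return (statistics.mean(scores) - expected) / standard_error
```

The result is compared with a t distribution with n - 1 degrees of freedom.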

5.2  Variance components

The idea behind the variance component approach is more or less the same as for Haseman-Elston: if a given marker has an effect, two persons who are "genetically alike" are assumed to be more likely to have the same trait value than two persons who are not "genetically alike". For the Haseman-Elston method this was ...

PediPet can estimate variance components for full sibs in the simple mixed model with no fixed effects:

y = a + e, \qquad \mathrm{Cov}(a) = \sigma_A^2 A, \qquad \mathrm{Cov}(e) = \sigma_I^2 I \qquad (1)

where \sigma_A^2 and \sigma_I^2 are the variance components, A is the matrix of IBD scores (calculated the same way as for Haseman-Elston) and I is the n×n identity matrix, and calculate the ...
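Reading equation (1) as a statement about the covariance of the trait vector, that covariance is the IBD matrix scaled by one variance component plus the identity matrix scaled by the other. The Python sketch below builds that matrix for one sibship. It is illustrative only; the function name is mine, and I assume the IBD matrix has ones on its diagonal.

```python
def sibship_covariance(ibd, var_a, var_i):
    """Return var_a * ibd + var_i * I for an n x n IBD matrix given as nested lists."""
    n = len(ibd)
    return [
        [var_a * ibd[i][j] + (var_i if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
```

Estimating the two variance components then means finding the values that make this matrix fit the observed trait covariation best, for example by maximum likelihood.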

6  Multipoint QTL analysis

Where the single marker analysis uses the information from a single marker at a time to examine the .... whether a marker ... influences a trait, multipoint QTL analysis uses the marker data from all markers on the chromosome to ...

6.1  Haseman-Elston

XXX suggested a method for improving the statistical power of the Haseman-Elston method and estimating both the QTL position and the effect of the putative QTL by using information from more than a single marker.

6.2  Variance components

7  Simulating data

PediPet can be used for simulating data from nuclear families of the same size (i.e. each family has a set of parents and the same number of full sibs).
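How such a simulation can work for a single marker is sketched below in Python. The text does not describe PediPet's own procedure, so this is only an illustration under my assumptions: parental alleles are drawn independently from given allele frequencies (Hardy-Weinberg equilibrium), and each sib receives one randomly chosen allele from each parent.

```python
import random


def simulate_nuclear_family(n_sibs, freq, rng):
    """Simulate marker genotypes for two parents and n_sibs full sibs.

    freq maps allele -> population frequency; rng is a random.Random instance.
    """
    alleles, weights = zip(*sorted(freq.items()))

    def founder_genotype():
        # Two independent draws from the allele frequencies.
        return tuple(rng.choices(alleles, weights=weights, k=2))

    father, mother = founder_genotype(), founder_genotype()
    # Mendelian transmission: one allele from each parent, chosen at random.
    sibs = [(rng.choice(father), rng.choice(mother)) for _ in range(n_sibs)]
    return father, mother, sibs
```

Passing a seeded generator such as `random.Random(1)` makes a simulated dataset reproducible.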

8  Options

This section describes the various options and their effects when running the program. All options are written in the file options.pp (a standard ASCII file). Anything on a line after a semicolon ';' is regarded as a comment and is not used by the program.
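The comment rule can be sketched as follows. This Python sketch only strips comments and blank lines; how PediPet interprets what remains on each line is not covered, and the function name is hypothetical.

```python
def read_option_lines(text):
    """Return the non-empty option lines of an options file, comments removed."""
    lines = []
    for raw in text.splitlines():
        # Everything from the first ';' to the end of the line is a comment.
        content = raw.split(";", 1)[0].strip()
        if content:
            lines.append(content)
    return lines
```

For example, a file containing `reducealleles ; recode on import` followed by a comment-only line yields the single entry `reducealleles`.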

reducealleles
Automatically recode all alleles when importing. Marker alleles are given numbers from 1 and up, with 1 corresponding to the lowest original marker value, 2 to the second-lowest etc. Generally alleles should be reduced for most of the analysis functions to work since they assume alleles are numbered from 1 and up.
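The recoding described for reducealleles can be sketched for one marker as below. This is an illustrative Python sketch, not PediPet's C code; the function name and the use of `None` for a missing genotype are assumptions.

```python
def reduce_alleles(genotypes):
    """Renumber the alleles of one marker as 1, 2, ... in order of original value.

    Returns the recoded genotypes and the old -> new allele mapping.
    """
    observed = sorted({a for gt in genotypes if gt is not None for a in gt})
    recode = {old: new for new, old in enumerate(observed, start=1)}
    recoded = [
        None if gt is None else (recode[gt[0]], recode[gt[1]])
        for gt in genotypes
    ]
    return recoded, recode
```

Keeping the mapping makes it possible to report results in terms of the original allele values afterwards.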

9  To do

Here's a list of the features I'd like to add to the program besides finishing off the functions I've already started on. They'll be added in the order I need them myself:

References

[Boehnke, 1991]
Boehnke, M. (1991). Allele frequency estimation from data on relatives. Am. J. Hum. Genet., 48:22-25.

[Lange, 1997]
Lange, K. (1997). Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag, New York.

[Lynch and Walsh, 1998]
Lynch, M. and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates.

[S.A.G.E, 1994]
S.A.G.E (1994). Statistical Analysis for Genetic Epidemiology, Release 2.2.

Index (showing section)

Hardy-Weinberg, 4
Haseman-Elston, 5.1
     multipoint, 6.1

marker
     missing values, 2

options
     reducealleles, 8

quantitative trait
     missing values, 2

simulate
     nuclear families, 7


Footnotes:

1 To be completely implemented sometime soon. Right now it can more or less only produce output for single marker H-E that is directly usable in a statistical package like SAS or S+.

2 At a later date this consistency checking will be enhanced and will include checking for Mendelian inheritance.

