1 Introduction

PediPet is a small program for assisting the researcher when analyzing genetic data.

Currently PediPet runs on DOS and UNIX (Linux and Solaris for Sparc Stations) and should be easily portable to almost any other architecture. PediPet is written in C, is free software and comes with no guarantee whatsoever; use at your own risk - your data might go bad, your troubles might increase and your computer might self-combust.

1.1 Features

Uses standard Linkage pedigree format
Recoding of alleles
Single marker allele frequency estimation
Single marker and multipoint Haseman-Elston¹
Simulation of a single genetic marker and a genetic effect

2 Importing data into PediPet

As of Jan 22, 1999, PediPet reads files in the Linkage pedigree file format containing only marker data. The data should be a standard ASCII file in which each line represents an individual and each column a variable. The data file must have at least five columns: family id (an identifier for each family), person id (a unique number), id of father (0 if the father is not in the dataset), id of mother (0 if the mother is not in the dataset) and sex (male is scored as 1 and female as 2).

Missing genotype data should be scored as two zeroes. Information on just one allele of a marker is not allowed: either both alleles or neither must be present.

The first five columns of a data file could look like this:

1 1 0 0 1 ...
1 2 0 0 2 ...
1 3 1 2 2 ...
1 4 1 2 1 ...
1 5 1 2 1 ...
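
As an illustration of the layout above, here is a minimal C sketch that reads the five mandatory columns of one line. The type `PedRecord` and the function `parse_ped_line` are hypothetical names chosen for this example and are not taken from the PediPet source; the sketch also assumes that all five columns are integers.

```c
#include <stdio.h>

/* Hypothetical record for the five mandatory columns (not PediPet's own type). */
typedef struct {
    int family_id;
    int person_id;
    int father_id;  /* 0 = father not in the dataset */
    int mother_id;  /* 0 = mother not in the dataset */
    int sex;        /* 1 = male, 2 = female */
} PedRecord;

/* Parse the first five whitespace-separated integer columns of one line.
   Returns 1 on success and 0 if fewer than five columns could be read.
   Any marker columns after the fifth are ignored here. */
int parse_ped_line(const char *line, PedRecord *rec)
{
    return sscanf(line, "%d %d %d %d %d",
                  &rec->family_id, &rec->person_id,
                  &rec->father_id, &rec->mother_id, &rec->sex) == 5;
}
```

A caller would typically read the file line by line with `fgets` and hand each line to `parse_ped_line`, rejecting the file when the function returns 0.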

The Linkage program does not require the person id to be a unique number, only that it is unique within each family. When the option XXXX is set (see section 8), PediPet automatically converts every id it reads to the form FAM-ID, where FAM is the family id and ID is the within-family id.
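
The FAM-ID form can be built with a single formatted write. The sketch below assumes integer ids; `make_unique_id` is a hypothetical helper for illustration, not a PediPet function.

```c
#include <stdio.h>

/* Write the combined id "FAM-ID" into buf.
   Returns 1 on success and 0 if buf is too small to hold the result. */
int make_unique_id(int family_id, int person_id, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%d-%d", family_id, person_id);
    return n >= 0 && (size_t)n < size;
}
```

With family id 12 and person id 3 the buffer would hold the text 12-3.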

Furthermore, the data must follow these simple rules:

A person must have either both parents or neither parent present in the dataset. It is not allowed to have, say, only the father and not the mother. If data are available on a person and a single parent, the unobserved parent must be included in the dataset to complete the pedigree (with the right family id and sex but with missing values for all other observations).
The file must contain at least 5 columns: family-id, id, father, mother and sex.
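
The "both or none" rules for parents and for marker alleles can each be written as a one-line test. The function names below are hypothetical and only sketch the rules as stated in this section; 0 is the missing code in both cases.

```c
/* Returns 1 if both parents are given or both are missing (coded 0). */
int parents_ok(int father_id, int mother_id)
{
    return (father_id == 0) == (mother_id == 0);
}

/* Same rule for a genotype: both alleles typed, or both coded 0. */
int genotype_ok(int allele1, int allele2)
{
    return (allele1 == 0) == (allele2 == 0);
}
```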

Quantitative traits ... Missing values for quantitative traits are scored as `.' (the single character period).

3 Consistency checking

PediPet checks consistency of the pedigrees in the dataset by going through the following steps:

Checking that there is no person in the dataset with unknown sex.
Checking that fathers are indeed male and mothers are indeed female.
Checking that spouses are of different sexes.
In each offspring generation there should exist at most 4 different alleles among full sibs if both parents are heterozygous, at most 3 if one parent is heterozygous and the other is homozygous, and at most 2 if both parents are homozygous for the marker.

Before using any of the routines in the program, the dataset should be checked for inconsistencies, as any inconsistency might cause the various functions to miscalculate, hang or possibly even make the program dump core².
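
The allele-count check in the last step can be sketched as follows. This is an illustration under stated assumptions, not PediPet's implementation: both parents are assumed to be genotyped, sib alleles are stored in a flat array with two entries per sib, 0 means missing, and all function names are hypothetical.

```c
/* Count the distinct non-missing alleles in an array (0 = missing). */
int count_distinct(const int *alleles, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (alleles[i] == 0)
            continue;
        int seen = 0;
        for (int j = 0; j < i; j++) {
            if (alleles[j] == alleles[i]) { seen = 1; break; }
        }
        if (!seen)
            count++;
    }
    return count;
}

/* Upper limit on distinct alleles among full sibs: 2, plus one for
   each heterozygous parent. Both parents must be genotyped. */
int max_sib_alleles(const int father[2], const int mother[2])
{
    return 2 + (father[0] != father[1]) + (mother[0] != mother[1]);
}

/* Returns 1 if the sibship passes the allele-count check, else 0. */
int sibship_ok(const int father[2], const int mother[2],
               const int *sib_alleles, int n_alleles)
{
    return count_distinct(sib_alleles, n_alleles)
           <= max_sib_alleles(father, mother);
}
```

Note that this check only counts alleles; it does not verify that each sib's genotype is compatible with Mendelian inheritance from the parents.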

4 Allele frequency estimation

PediPet can estimate the allele frequency at a given marker for general pedigrees by maximum likelihood as described in [Boehnke, 1991] and [Lange, 1997]. Before calculating the allele frequencies, the dataset should be checked for consistency and allele numbers should be reduced. The routine is not terribly fast, and choosing a sensible starting point can seriously reduce the required number of iterations. A good starting point would be the allele frequencies based on data from the founders.

The method assumes that the founder population is in Hardy-Weinberg equilibrium and uses a not too elegant ``brute force and ignorance'' approach to maximizing the likelihood. It works, but could be sped up, and maybe will be at some later date. Since the maximization is only guaranteed to find a local maximum and not a global one, it is generally a good idea to start the routine from several different starting points and to restart it at the maximum found to see whether the result changes.
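
A founder-based starting point amounts to simple allele counting. The sketch below assumes that founders are the persons whose father id is 0, that alleles have already been recoded to 1..n_alleles with 0 as the missing code, and that genotypes are stored in a flat array with two entries per person. `founder_allele_freq` is a hypothetical name, not a PediPet routine.

```c
/* Estimate allele frequencies by counting alleles among founders.
   father_id[i] == 0 marks person i as a founder.
   alleles[2*i] and alleles[2*i + 1] hold the genotype of person i.
   freq must have room for n_alleles + 1 entries; freq[0] is unused.
   Returns the number of founder alleles counted. */
int founder_allele_freq(const int *father_id, const int *alleles,
                        int n_persons, int n_alleles, double *freq)
{
    int total = 0;

    for (int a = 0; a <= n_alleles; a++)
        freq[a] = 0.0;

    for (int i = 0; i < n_persons; i++) {
        if (father_id[i] != 0)
            continue;                      /* not a founder */
        for (int k = 0; k < 2; k++) {
            int a = alleles[2 * i + k];
            if (a >= 1 && a <= n_alleles) { /* skips missing alleles */
                freq[a] += 1.0;
                total++;
            }
        }
    }

    if (total > 0) {
        for (int a = 1; a <= n_alleles; a++)
            freq[a] /= total;
    }
    return total;
}
```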

EXAMPLE 4.1. For chromosome 20...

4.1 Improving computational speed

Since allele frequency estimation based on founders alone is pretty simple and very quick, this section deals with improving the speed of estimating the allele frequencies by ML. Choosing a good starting point can seriously reduce the number of iterations needed for finding the maximum.

Pedigrees

5 Single marker QTL analysis

This section deals with methods implemented in PediPet to analyze quantitative traits based on data from a single genetic marker.

5.1 Haseman-Elston

Haseman-Elston is a non-parametric method of searching for QTLs by using data on either full or half sibs and comparing differences for a trait. PediPet calculates the single marker IBD score for all pairs of full sibs and outputs the relevant data as a simple ASCII file for use in regression analysis by a standard statistical package like SAS, R or S+. The pairwise IBD scores are calculated using information from the parents when available, as described in [S.A.G.E, 1994].
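
PediPet leaves the regression itself to an external package. To illustrate what that step computes, the sketch below fits the classical Haseman-Elston regression: the squared trait difference of each sib pair is regressed on the pair's IBD score by ordinary least squares, and a clearly negative slope is taken as evidence for linkage. The function name and argument layout are assumptions made for this example.

```c
/* Ordinary least squares fit of (y1 - y2)^2 on the IBD score.
   ibd[i]       : IBD score of sib pair i (between 0 and 1)
   y1[i], y2[i] : trait values of the two sibs in pair i
   Returns 1 on success, or 0 if there are fewer than two pairs
   or the IBD scores show no variation. */
int haseman_elston_fit(const double *ibd, const double *y1,
                       const double *y2, int n,
                       double *intercept, double *slope)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;

    if (n < 2)
        return 0;

    for (int i = 0; i < n; i++) {
        double d = y1[i] - y2[i];
        double sq = d * d;              /* squared trait difference */
        sx  += ibd[i];
        sy  += sq;
        sxx += ibd[i] * ibd[i];
        sxy += ibd[i] * sq;
    }

    double denom = n * sxx - sx * sx;
    if (denom == 0.0)
        return 0;

    *slope = (n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / n;
    return 1;
}
```

A statistical package would additionally report the standard error of the slope, which this sketch omits.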

When calculating IBD scores, the results are fairly sensitive to the population frequency of the marker alleles. Therefore precise estimates of the population frequency (from another study or alternatively from a maximum likelihood estimation on the same dataset) should be entered before calculating the IBD scores.

The program gives a t-test statistic for whether the average IBD score among full sibs differs significantly from ¹/₂ (the expected IBD score of two full sibs) and whether the average IBD score among half sibs differs from ¹/₄.
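
The exact form of the statistic is not stated here. Assuming it is the usual one-sample t statistic, it would read as follows, where the symbols are introduced for this illustration only: n is the number of sib pairs, \hat{\pi}_i the IBD score of pair i, \bar{\pi} their average, and \pi_0 the expected value (1/2 for full sibs, 1/4 for half sibs).

```latex
t = \frac{\bar{\pi} - \pi_0}{s / \sqrt{n}},
\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( \hat{\pi}_i - \bar{\pi} \right)^2
```

Under the null hypothesis and the usual assumptions, t would be compared with a t distribution on n - 1 degrees of freedom.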

5.2 Variance components

The idea behind the variance component approach is more or less the same as for Haseman-Elston: if there is an effect of a given marker, it is assumed that two persons who are ``genetically alike'' are more likely to have the same trait value than two persons who are not ``genetically alike''. For the Haseman-Elston method this was ...

PediPet can estimate variance components for full sibs in the simple mixed model with no fixed effects:

y = a + e,   Var(y) = s_A² A + s_I² I

(1)

where a and e are random effects with covariance matrices s_A² A and s_I² I respectively, s_A² and s_I² are the variance components, A is the matrix of IBD scores (calculated the same way as for Haseman-Elston) and I is the n×n identity matrix.
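
Reading the model as y = a + e with independent random effects, the covariance matrix of the trait vector is s_A² A + s_I² I. The sketch below builds that matrix for given variance components; it does not estimate them. The function name and the row-major storage are assumptions made for this example.

```c
/* Build V = var_a * A + var_e * I for n sibs.
   A and V are n-by-n matrices stored row by row;
   var_a and var_e play the roles of s_A^2 and s_I^2. */
void vc_covariance(const double *A, int n,
                   double var_a, double var_e, double *V)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            V[i * n + j] = var_a * A[i * n + j];
            if (i == j)
                V[i * n + j] += var_e;  /* identity part */
        }
    }
}
```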

6 Multipoint QTL analysis

Where the single marker analysis uses the information from a single marker at a time to examine the .... whether a marker ... influences a trait, multipoint QTL analysis uses the marker data from all markers on the chromosome to ...

6.1 Haseman-Elston

XXX suggested a method for improving the statistical power of the Haseman-Elston method and estimating both the QTL position and the effect of the putative QTL by using information from more than a single marker.

6.2 Variance components

7 Simulating data

PediPet can be used for simulating data from nuclear families of the same size (i.e. each family has a set of parents and the same number of full sibs).
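
How PediPet draws its simulated data is not described here. As a rough illustration of the marker part only, the sketch below gives each sib one randomly chosen allele from each parent, following Mendelian segregation. It uses the standard `rand()` generator, does not simulate any genetic effect on a trait, and the function names are hypothetical.

```c
#include <stdlib.h>

/* Pick one of a parent's two alleles with equal probability. */
static int transmit(const int parent[2])
{
    return parent[rand() / (RAND_MAX / 2 + 1)];
}

/* Simulate marker genotypes for n_sibs full sibs.
   sib_alleles must have room for 2 * n_sibs entries; entry 2*i is
   the paternal allele and entry 2*i + 1 the maternal allele of sib i. */
void simulate_sibs(const int father[2], const int mother[2],
                   int n_sibs, int *sib_alleles)
{
    for (int i = 0; i < n_sibs; i++) {
        sib_alleles[2 * i]     = transmit(father);
        sib_alleles[2 * i + 1] = transmit(mother);
    }
}
```

Calling `srand` with a fixed seed before `simulate_sibs` makes a simulated dataset reproducible.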

8 Options

This section describes the various options and their effects when running the program. All options are written in the file options.pp (which is a standard ASCII file) and anything on a line after a semicolon `;' is regarded as a comment and not used by the program.
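
The comment rule can be sketched in a few lines of C. `strip_option_line` is a hypothetical helper, not taken from PediPet: it cuts a line at the first semicolon and removes trailing whitespace, so that only the option text remains.

```c
#include <ctype.h>
#include <string.h>

/* Remove a trailing comment (everything from the first ';') and any
   trailing whitespace from line, in place. Returns line. */
char *strip_option_line(char *line)
{
    char *semi = strchr(line, ';');
    if (semi != NULL)
        *semi = '\0';

    size_t len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        line[--len] = '\0';

    return line;
}
```

A line that holds only a comment comes back as an empty string and can be skipped by the caller.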

reducealleles
Automatically recode all alleles when importing. Marker alleles are given numbers from 1 and up, with 1 corresponding to the lowest original marker value, 2 to the second-lowest etc. Generally alleles should be reduced for most of the analysis functions to work since they assume alleles are numbered from 1 and up.
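
The recoding described above replaces each allele by its rank among the distinct original values. The sketch below shows one way to do this for a single marker; it is an illustration and not PediPet's own routine. It assumes that 0 is the missing code and is left unchanged, and `reduce_alleles` is a hypothetical name.

```c
#include <stdlib.h>

/* Comparison function for qsort on int values. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Recode the alleles of one marker in place to 1, 2, 3, ... in order
   of their original values. Missing alleles (0) stay 0.
   Returns the number of distinct alleles, or -1 if memory runs out. */
int reduce_alleles(int *alleles, int n)
{
    if (n <= 0)
        return 0;

    int *sorted = malloc((size_t)n * sizeof *sorted);
    if (sorted == NULL)
        return -1;

    int m = 0;                       /* number of non-missing alleles */
    for (int i = 0; i < n; i++)
        if (alleles[i] != 0)
            sorted[m++] = alleles[i];
    qsort(sorted, (size_t)m, sizeof *sorted, cmp_int);

    int u = 0;                       /* number of distinct alleles */
    for (int i = 0; i < m; i++)
        if (u == 0 || sorted[i] != sorted[u - 1])
            sorted[u++] = sorted[i];

    for (int i = 0; i < n; i++) {
        if (alleles[i] == 0)
            continue;
        for (int j = 0; j < u; j++)
            if (sorted[j] == alleles[i]) { alleles[i] = j + 1; break; }
    }

    free(sorted);
    return u;
}
```

For example, the original alleles 104, 0, 98, 104, 110, 98 would be recoded to 2, 0, 1, 2, 3, 1.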

9 To do

Here's a list of the features I'd like to add to the program besides finishing off the functions I've already started on. They'll be added in the order I need them myself:

Reading the linkage parameter file (I need this badly so this will probably be implemented soon). This will also give me an easy way to read traits.
TDT-tests for binary and quantitative traits
Checking for most probable location of genotyping error in the dataset
Updating the allele frequency estimation routine to use partial derivatives

References

[Boehnke, 1991]: Boehnke, M. (1991). Allele frequency estimation from data on relatives. Am. J. Hum. Genet., 48:22-25.
[Lange, 1997]: Lange, K. (1997). Mathematical and Statistical Methods for Genetic Analysis. Springer-Verlag, New York.
[Lynch and Walsh, 1998]: Lynch, M. and Walsh, B. (1998). Genetics and Analysis of Quantitative Traits. Sinauer Associates.
[S.A.G.E, 1994]: S.A.G.E (1994). Statistical Analysis for Genetic Epidemiology, Release 2.2.

Index (showing section)

Hardy-Weinberg, 3
Haseman-Elston, 4
  multipoint, 5

marker
  missing values, 2

options
  reducealleles, 5

quantitative trait
  missing values, 2

simulate
  nuclear families, 5

Footnotes:

¹ To be completely implemented sometime soon. Right now it can more or less only produce output for single marker H-E that is directly usable in a statistical package like SAS or S+.

² At a later date this consistency checking will be enhanced and will include checking for Mendelian inheritance.

File translated from TeX by TtH, version 1.57.