Parallel Genehunter: Implementation of a linkage analysis package for distributed
memory architectures
Michael Moran
CMSC 838T Presentation
May 9, 2003
CMSC 838T – Presentation
Introduction
Goals
• Link genes to specific loci in the genome
• Decrease time and memory requirements through parallelization
Motivation
• Locate genes for specific phenotypes
• Test for inherited diseases and risk factors
• Gene therapy
Talk Overview
• Introduction
• Talk Overview
• Genetic Linkage Problem
• Previous Work
• Parallel Genehunter
• Evaluation
• Observations
Genetic Linkage Problem
Sexual Reproduction
• Offspring are created from two haploid gametes
• Gametes are produced from diploid/polyploid cells during meiosis
www.blc.arizona.edu/courses/181gh/rick/genetics1/
Genetic Linkage Problem
Recombination occurs in two ways
1. Random segregation of chromatids
• 2 x 23 human chromosomes => $2^{23}$ possible haploid combinations
• Genes on different chromosomes recombine with probability 0.5
www.gen.umn.edu/faculty_staff/hatch/1131/
Genetic Linkage Problem
Recombination occurs in two ways
1. Random segregation of chromatids
2. Crossover between homologous pairs of chromosomes
• Genes on the same chromosome recombine with a probability that depends on their distance and location on the chromosome
Genetic Linkage Problem
Given
• This model of recombination
• Data for a particular pedigree (family)
• Phenotype information for each individual
• Genetic markers for each individual
• Recombination frequencies for each pair of markers
Can we apply probabilistic methods to
• Reconstruct the inheritance patterns?
• Link phenotypes to the markers?
Previous Work
Fisher, Haldane, Smith, Morton (1935 - 1955)
• Methods to infer genetic maps using maximum likelihood estimators
Elston, Stewart (1971)
• Genetic Linkage Algorithm
• Linear in pedigree size
• Exponential in number of markers
Lander, Green (1987)
• Genetic Linkage Algorithm
• Linear in number of markers
• Exponential in pedigree size
Previous Work
Genehunter (2001)
• Implementation of Lander & Green
• Analyzes a pedigree containing n non-founders
• The inheritance of a gene by one non-founder can be summarized by two bits
• The entire pedigree's inheritance pattern can be summarized by 2n bits
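The 2n-bit summary can be pictured as a single integer with two bits per non-founder. The sketch below is only an illustration: the function names and the bit order (first non-founder in the highest bits, paternal bit before maternal bit) are assumptions, not taken from Genehunter.

```python
def encode_inheritance(choices):
    """Pack one (paternal_bit, maternal_bit) pair per non-founder into an integer.

    Assumed convention: a bit is 0 when the grandfather's copy was passed on
    and 1 when the grandmother's copy was passed on.
    """
    v = 0
    for paternal_bit, maternal_bit in choices:
        v = (v << 2) | (paternal_bit << 1) | maternal_bit
    return v


def decode_inheritance(v, n):
    """Unpack an integer into n (paternal_bit, maternal_bit) pairs."""
    pairs = []
    for k in range(n - 1, -1, -1):
        two_bits = (v >> (2 * k)) & 0b11
        pairs.append((two_bits >> 1, two_bits & 1))
    return pairs


if __name__ == "__main__":
    v = encode_inheritance([(0, 1), (1, 1)])  # n = 2 non-founders, 4 bits
    print(format(v, "04b"), decode_inheritance(v, 2))
```

With n non-founders there are 2^(2n) such integers, which is why the per-marker probability vectors grow so quickly with pedigree size.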
Previous Work
3 steps of Genehunter:
Step 1: For each marker, calculate the probability of each of the possible inheritance patterns.
• Store the probabilities in a vector of size $2^{2n}$
Example (0: grandfather's chromatid, 1: grandmother's chromatid)
• Pr([0,0]) = 0.5
• Pr([0,1]) = 0.5
• Pr([1,0]) = 0
• Pr([1,1]) = 0
Previous Work
3 steps of Genehunter:
Step 2: For each marker, calculate the probability of each inheritance pattern conditional on all of the markers to the left and to the right.
• For two markers' inheritance vectors, each disagreeing bit requires a crossover event
• The probability of transitioning between inheritance vectors i, j differing in d bits is
$M_{i,j} = \theta^{d} (1 - \theta)^{2n - d}$
where $\theta$ is the recombination fraction between the two markers
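A minimal sketch of this transition probability, assuming it has the form theta^d (1 - theta)^(2n - d), where theta is the recombination fraction between the two markers and d is the number of differing bits. The function name and the integer encoding of inheritance vectors are illustrative assumptions.

```python
def transition_prob(i, j, theta, n):
    """Probability of moving from inheritance vector i to j (both 2n-bit integers).

    d counts the bits where i and j disagree; each disagreement needs a
    crossover (probability theta), each agreement needs none (1 - theta).
    """
    d = bin(i ^ j).count("1")
    return theta ** d * (1.0 - theta) ** (2 * n - d)


if __name__ == "__main__":
    n, theta = 2, 0.1
    row = [transition_prob(0b0101, j, theta, n) for j in range(1 << (2 * n))]
    print(sum(row))  # each row of M should sum to 1, up to rounding
```

Because the value depends only on `i ^ j`, every row of M is a permutation of the same 2^(2n) numbers, which is what makes a convolution-based shortcut possible.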
Previous Work
3 steps of Genehunter:
Step 2: For each marker, calculate the probability of each inheritance pattern conditional on all of the markers to the left and to the right.
• $M_{i,j}$ = probability of transitioning between inheritance vectors i and j
• $P_1$, $P_2$ = probability vectors over every inheritance pattern given markers 1 and 2 respectively
• $P_{2|1} = P_2 \cdot (M P_1)$, where $\cdot$ is the element-wise product
• Calculate the probabilities of each marker's inheritance conditional on all others by Markov chain or FFT convolution
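The update for P2|1 can be sketched directly with a dense matrix-vector product. This is an illustration under stated assumptions, not Genehunter's code: the function name is hypothetical, the product with P2 is taken element-wise, M is assumed to be theta^d (1 - theta)^(2n - d), and the final renormalization is an added convenience.

```python
def condition_on_left(p1, p2, theta, n):
    """Return P2|1 = P2 * (M P1) element-wise, renormalized to sum to 1.

    p1, p2 are lists of length 2**(2n). The cost is O(2**(4n)), which is the
    expense the FFT-based convolution avoids.
    """
    size = 1 << (2 * n)
    out = []
    for i in range(size):
        acc = 0.0
        for j in range(size):
            d = bin(i ^ j).count("1")  # bits that require a crossover
            acc += theta ** d * (1.0 - theta) ** (2 * n - d) * p1[j]
        out.append(p2[i] * acc)
    total = sum(out)  # assumed nonzero for consistent data
    return [x / total for x in out]


if __name__ == "__main__":
    p1 = [0.5, 0.5, 0.0, 0.0]      # the single-non-founder example from step 1
    p2 = [0.25, 0.25, 0.25, 0.25]  # an uninformative second marker
    print(condition_on_left(p1, p2, 0.1, 1))
```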
Previous Work
3 steps of Genehunter:
Step 3: For each marker, calculate the probability of the unknown gene being located at specific locations.
• Hypothesizes that the phenotype has a gene located at a particular location
• By default tries 5 evenly-spaced locations between consecutive pairs of markers
• Calculates $P_D$, the probability of each inheritance pattern based on this phenotype (as in step 1)
• For a location x between markers i and i+1, $p = P_D \cdot P_{x|1...i} \cdot P_{x|i+1...m}$
Space Requirement: $O(2^{2n})$, reduced to $O(2^{2n-f})$ by exploiting the symmetry of f founders
Time Requirement: $O(m 2^{2n})$, reduced to $O(m 2^{2n-f})$ with f founders
Parallel Genehunter
Approach
• Parallelize the 3 Genehunter steps separately
• Divide each $2^{2n}$-sized marker vector evenly among the P processors
• This allows greater distribution of memory than assigning O(m/P) entire vectors to each processor
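A sketch of how such a vector might be split into contiguous blocks, one per processor. The function name and the contiguous-block layout are assumptions for illustration; the slides only say the vector is divided evenly.

```python
def block_range(rank, num_procs, n):
    """Half-open index range [lo, hi) of the 2**(2n) vector owned by `rank`.

    Any remainder is spread over the lowest-numbered ranks, so block sizes
    differ by at most one entry.
    """
    size = 1 << (2 * n)
    base, extra = divmod(size, num_procs)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


if __name__ == "__main__":
    for rank in range(4):
        print(rank, block_range(rank, 4, 3))  # 64 entries over 4 processors
```

Every processor holds the same index range of every marker's vector, so per-processor memory drops by roughly a factor of P regardless of how many markers there are.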
Parallel Genehunter
Parallelization of step 1
For each marker, calculate the probability of each of the possible inheritance patterns
• Each processor calculates the probabilities for a particular $2^{2n}/P$ inheritance patterns for every marker
Parallel Genehunter
Parallelization of step 2
For each marker, calculate the probability of each inheritance pattern conditional on all of the markers to the left and to the right
FFT convolution
• As in serial Genehunter, the $2^{2n} \times 2^{2n}$ matrix-vector multiplication is replaced by FFT-based convolution:
1. 2 forward 1D FFTs on $2^{2n}$-length vectors
2. element-by-element multiplication
3. inverse FFT
• Each 1D FFT is equivalent to a 2D FFT on a $P \times 2^{2n}/P$ matrix
• There are well-known distributed algorithms for this FFT using all-to-all communication
Element-wise product
• The product in $P_{2|1} = P_2 \cdot (M P_1)$ is trivially parallelized: each processor has the same portion of each vector
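The three convolution steps can be sketched serially. The slides do not say which transform is used; because the transition probability depends only on which bits of two inheritance vectors differ, this sketch assumes a Walsh-Hadamard transform (a Fourier transform over bit vectors). All names are illustrative, and the kernel form theta^d (1 - theta)^(2n - d) is an assumption.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform; length must be a power of 2."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for k in range(start, start + h):
                x, y = a[k], a[k + h]
                a[k], a[k + h] = x + y, x - y
        h *= 2
    return a


def xor_kernel(theta, n):
    """kernel[x] = theta**d * (1 - theta)**(2n - d), with d = set bits of x."""
    return [theta ** bin(x).count("1") * (1.0 - theta) ** (2 * n - bin(x).count("1"))
            for x in range(1 << (2 * n))]


def xor_convolve(kernel, p):
    """Compute (M p)[i] = sum_j kernel[i ^ j] * p[j] in O(N log N) time."""
    fk, fp = fwht(kernel), fwht(p)            # 1. two forward transforms
    prod = [x * y for x, y in zip(fk, fp)]    # 2. element-by-element product
    return [x / len(p) for x in fwht(prod)]   # 3. inverse = forward / N


if __name__ == "__main__":
    print(xor_convolve(xor_kernel(0.1, 1), [0.5, 0.5, 0.0, 0.0]))
```

In a distributed setting the early butterfly stages touch only entries on the same processor, while the later stages pair entries owned by different processors, which is where the all-to-all exchange comes in.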
Parallel Genehunter
Parallelization of step 3
For each marker, calculate the probability of the unknown gene being located at specific locations
• Computing $P_{x|1...i}$ and $P_{x|i+1...m}$: FFTs parallelized as in step 2
• Final dot product $p = P_D \cdot P_{x|1...i} \cdot P_{x|i+1...m}$ parallelized as in step 2
• Each processor holds the same portion of each vector
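A serial simulation of how the final product might be computed when every processor owns the same slice of all three vectors. It assumes p is the sum over inheritance patterns of the three-way element-wise product, and that the per-processor partial sums are combined with a single reduction; the function names are hypothetical and no real message-passing library is used.

```python
def local_partial(pd, left, right, lo, hi):
    """Partial sum over the index range [lo, hi) owned by one processor."""
    return sum(pd[k] * left[k] * right[k] for k in range(lo, hi))


def simulated_reduce(pd, left, right, num_procs):
    """Split the vectors into equal blocks and add up the partial sums.

    Assumes num_procs divides the vector length. In a real run each partial
    sum would be computed on a different processor and combined by a
    sum-reduction.
    """
    block = len(pd) // num_procs
    partials = [local_partial(pd, left, right, r * block, (r + 1) * block)
                for r in range(num_procs)]
    return partials, sum(partials)


if __name__ == "__main__":
    pd = [1.0, 2.0, 3.0, 4.0]
    left = [1.0, 1.0, 1.0, 1.0]
    right = [0.5, 0.5, 0.5, 0.5]
    print(simulated_reduce(pd, left, right, 2))
```

Only one number per processor is communicated here, in contrast to the all-to-all traffic needed by the transforms.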
Evaluation
Experimental Environment
Input data sets
• 51 family member pedigree
• {19,21,24}-bit data sets (# bits = 2n-f)
Computing Facilities
• Cplant Cluster (Sandia National Laboratories)
• DEC Alpha EV6 processors
• Myrinet connection
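To give a feel for these data-set sizes, the sketch below estimates the storage for one probability vector of 2^bits entries. The 8 bytes per entry (double precision) and the function names are assumptions; the slides do not state the storage format or how many vectors are held at once.

```python
def vector_bytes(bits, bytes_per_entry=8):
    """Bytes needed for one vector of 2**bits entries (8-byte entries assumed)."""
    return (1 << bits) * bytes_per_entry


def per_processor_bytes(bits, num_procs, bytes_per_entry=8):
    """Bytes per processor when the vector is split evenly (rounded up)."""
    entries = -(-(1 << bits) // num_procs)  # ceiling division
    return entries * bytes_per_entry


if __name__ == "__main__":
    for bits in (19, 21, 24):
        mib = vector_bytes(bits) / 2 ** 20
        split = per_processor_bytes(bits, 32) / 2 ** 20
        print(f"{bits} bits: {mib:.0f} MiB per vector, {split:.3f} MiB on each of 32 processors")
```

Under these assumptions each extra bit doubles the storage, so the 24-bit data set needs 32 times the memory of the 19-bit one.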
Observations
Pro: Performs Genehunter computation exactly
Pro: Effective for “multipoint linkage” of phenotypes
Con: Old-fashioned compared to protein-based methods (?)
Pro: Distributes memory requirements
Pro: More computers allows larger feasible inputs
Con: Experiments based on 1 pedigree
Pro: Efficient parallelization up to 32 or 64 processors
Con: Allows pedigrees to grow by only 3 or 4 individuals in equal time
References
Genetic Recombination
Dr. Craig Woodworth. Genetic Recombination in Eukaryotes. Lecture Notes (www.clarkson.edu/class/by214/powerpoint).
Genehunter
K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis Through Reduction of Inheritance Space. American Journal of Human Genetics 68, 2001.
Parallel Genehunter
G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory Architectures. Proceedings of the First IEEE Workshop on High Performance Computational Biology, International Parallel and Distributed Computing Symposium, 2002.