Parallel Genehunter: Implementation of a linkage analysis package for distributed memory architectures Michael Moran CMSC 838T Presentation May 9, 2003


CMSC 838T – Presentation

Introduction

Goals
  • Link genes to specific loci in the genome
  • Decrease time and memory requirements through parallelization

Motivation
  • Locate genes for specific phenotypes
  • Test for inherited diseases and risk factors
  • Gene therapy

Talk Overview

  • Introduction
  • Talk Overview
  • Genetic Linkage Problem
  • Previous Work
  • Parallel Genehunter
  • Evaluation
  • Observations

Genetic Linkage Problem

Sexual Reproduction
  • Offspring are created from two haploid gametes
  • Gametes are produced from diploid/polyploid cells during meiosis

www.blc.arizona.edu/courses/181gh/rick/genetics1/

Genetic Linkage Problem

Recombination occurs in two ways:
  1. Random segregation of chromatids

2 x 23 human chromosomes => $2^{23}$ possible haploid combinations

Genes on different chromosomes recombine with probability 0.5

www.gen.umn.edu/faculty_staff/hatch/1131/

Genetic Linkage Problem

Recombination occurs in two ways:
  1. Random segregation of chromatids
  2. Crossover between homologous pairs of chromosomes

Genes on the same chromosome recombine with a probability that depends on their distance and location on the chromosome

Genetic Linkage Problem

Given
  • This model of recombination
  • Data for a particular pedigree (family)
    • Phenotype information for each individual
    • Genetic markers for each individual
  • Recombination frequencies for each pair of markers

Can we apply probabilistic methods to
  • Reconstruct the inheritance patterns
  • Link phenotypes to the markers

Previous Work

Fisher, Haldane, Smith, Morton (1935 - 1955)
  • Methods to infer genetic maps using maximum likelihood estimators

Elston, Stewart (1971)
  • Genetic linkage algorithm
  • Linear in pedigree size
  • Exponential in number of markers

Lander, Green (1987)
  • Genetic linkage algorithm
  • Linear in number of markers
  • Exponential in pedigree size

Previous Work

Genehunter (2001)
  • Implementation of Lander & Green
  • Analyzes a pedigree containing n non-founders
  • The inheritance of a gene by one non-founder can be summarized by two bits
  • The entire pedigree's inheritance pattern can be summarized by 2n bits

Previous Work

3 steps of Genehunter:
Step 1: For each marker, calculate the probability of each of the possible inheritance patterns.
  • Store the probabilities in a vector of size $2^{2n}$
  • Bit values: 0 = grandfather's chromatid, 1 = grandmother's chromatid
  • Example: Pr([0,0]) = .5, Pr([0,1]) = .5, Pr([1,0]) = 0, Pr([1,1]) = 0
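
To make the Step 1 bookkeeping concrete, here is a minimal Python sketch. It assumes each inheritance pattern is stored as an integer index whose 2n bits are the (paternal, maternal) choices of the n non-founders. The function names and the bit ordering are illustrative assumptions, not Genehunter's actual data layout. The probabilities are the single non-founder example from the slide.

```python
def num_patterns(n_nonfounders: int) -> int:
    """Number of inheritance patterns: 2 bits per non-founder, so 2^(2n)."""
    return 1 << (2 * n_nonfounders)


def pattern_bits(index: int, n_nonfounders: int) -> list[tuple[int, int]]:
    """Decode a vector index into one (paternal, maternal) bit pair per non-founder.

    Bit value 0 = grandfather's chromatid, 1 = grandmother's chromatid.
    Placing non-founder 0 in the two highest bits is an assumption of this sketch.
    """
    total = 2 * n_nonfounders
    bits = [(index >> (total - 1 - k)) & 1 for k in range(total)]
    return [(bits[2 * i], bits[2 * i + 1]) for i in range(n_nonfounders)]


# Slide example: one non-founder, so 2^2 = 4 patterns.
example = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.0, (1, 1): 0.0}
prob_vector = [example[pattern_bits(i, 1)[0]] for i in range(num_patterns(1))]
```

With this layout, the position of an entry in `prob_vector` is the inheritance pattern itself, which is what lets later steps compare two patterns with a bitwise XOR.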

Previous Work

3 steps of Genehunter:
Step 2: For each marker, calculate the conditional probability of each inheritance pattern, conditional on all of the markers to the left and to the right
  • For two markers' inheritance vectors, each disagreeing bit requires a crossover event
  • The probability of transitioning between inheritance vectors i, j differing in d bits is

$M_{i,j} = \theta^{d} (1 - \theta)^{2n - d}$

where $\theta$ is the recombination fraction between the two markers
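
The transition probability can be sketched in a few lines of Python. This assumes inheritance vectors are stored as integer bit patterns, so the number of disagreeing bits d is the population count of their XOR. The function name and the example values of n and theta are hypothetical.

```python
def transition_prob(i: int, j: int, n_nonfounders: int, theta: float) -> float:
    """M[i][j] = theta^d * (1 - theta)^(2n - d), where d = number of differing bits."""
    total_bits = 2 * n_nonfounders
    d = bin(i ^ j).count("1")  # Hamming distance between the two patterns
    return theta ** d * (1.0 - theta) ** (total_bits - d)


# Example values (assumptions for illustration only).
n, theta = 2, 0.1
size = 1 << (2 * n)

# Each row of M sums to (theta + (1 - theta))^(2n) = 1, so M is a stochastic matrix.
row_sums = [
    sum(transition_prob(i, j, n, theta) for j in range(size)) for i in range(size)
]
```

Because the entry depends only on `i ^ j`, multiplying by M is a convolution over bit patterns, which is the property the FFT-based approach relies on.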

Previous Work

3 steps of Genehunter:
Step 2: For each marker, calculate the conditional probability of each inheritance pattern, conditional on all of the markers to the left and to the right
  • $M_{i,j}$ = probability of transitioning between inheritance vectors i and j
  • $P_1$, $P_2$ = probability vectors for every inheritance pattern given markers 1 and 2 respectively
  • $P_{2|1} = P_2 \cdot (M P_1)$
  • Calculate the probabilities of each marker's inheritance conditional on all others by Markov chain or FFT convolution
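
The following Python sketch shows why a transform can replace the matrix-vector product M P1. It rests on two assumptions that the slides do not spell out: that "FFT convolution" means the Fourier transform over bit vectors (the Walsh-Hadamard transform), and that the product with P2 is taken element by element. All function names are hypothetical. Since M[i][j] depends only on `i ^ j`, M is diagonal in the transformed basis, with eigenvalue (1 - 2θ)^w for an index with w set bits.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for k in range(start, start + h):
                x, y = a[k], a[k + h]
                a[k], a[k + h] = x + y, x - y
        h *= 2
    return a


def apply_transition_dense(p1, theta):
    """Reference O(N^2) product M @ p1, with M[i][j] = theta^d (1 - theta)^(bits - d)."""
    size = len(p1)
    bits = size.bit_length() - 1
    out = []
    for i in range(size):
        acc = 0.0
        for j in range(size):
            d = bin(i ^ j).count("1")
            acc += theta ** d * (1 - theta) ** (bits - d) * p1[j]
        out.append(acc)
    return out


def apply_transition_fast(p1, theta):
    """O(N log N) version: transform, scale by (1 - 2*theta)^popcount, transform back."""
    size = len(p1)
    spectrum = fwht(p1)
    scaled = [s * (1 - 2 * theta) ** bin(w).count("1") for w, s in enumerate(spectrum)]
    return [v / size for v in fwht(scaled)]  # the inverse transform is fwht / N


def condition_on_neighbor(p2, p1, theta):
    """P_{2|1} = P_2 * (M P_1), taken element by element (not renormalized here)."""
    carried = apply_transition_fast(p1, theta)
    return [a * b for a, b in zip(p2, carried)]
```

The dense routine is included only as a cross-check. For a 2n-bit problem it needs $2^{2n} \times 2^{2n}$ multiplications, which is what makes the transform-based route attractive.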

Previous Work

3 steps of Genehunter:
Step 3: For each marker, calculate the probability of the unknown gene being located at specific locations
  • Hypothesizes that the phenotype has a gene located at a particular location
  • By default, tries 5 evenly spaced locations between each consecutive pair of markers
  • Calculates $P_D$, the probability of each inheritance pattern based on this phenotype (as in step 1)
  • For a location x between markers i and i+1, $p = P_D \cdot P_{x|1 \ldots i} \cdot P_{x|i+1 \ldots m}$

Space requirement: $O(2^{2n})$, reduced to $O(2^{2n-f})$ by exploiting symmetry of f founders

Time requirement: $O(m 2^{2n})$, reduced to $O(m 2^{2n-f})$ with f founders
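
As a rough illustration of the space bound, the sketch below computes the size of a single probability vector for a given number of bits (2n - f). It assumes 8-byte floating-point entries, which the slides do not state, and the helper name is hypothetical. The bit counts 19, 21, and 24 are the data set sizes used later in the evaluation.

```python
def vector_bytes(bits: int, bytes_per_entry: int = 8) -> int:
    """Memory for one probability vector with 2^bits entries.

    bytes_per_entry = 8 (double precision) is an assumption of this sketch.
    """
    return (1 << bits) * bytes_per_entry


for bits in (19, 21, 24):
    mib = vector_bytes(bits) / 2**20
    print(f"{bits}-bit problem: {mib:g} MiB per vector")

# Adding one non-founder adds 2 bits, so every vector grows by a factor of 4.
growth = vector_bytes(21) // vector_bytes(19)
```

The total footprint is a multiple of this figure, since several such vectors are live at once. That multiple depends on implementation details not given here.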

Parallel Genehunter

Approach
  • Parallelize the 3 Genehunter steps separately
  • Divide each $2^{2n}$-sized marker vector evenly among the P processors
  • This allows greater distribution of memory than assigning O(m/P) entire vectors to each processor
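
A minimal sketch of this even division, assuming a contiguous block layout in which processor `rank` owns one slice of every marker vector. The block layout and the function name are assumptions; the slides only say that the vector is divided evenly.

```python
def local_slice(vector_len: int, num_procs: int, rank: int) -> range:
    """Indices of the inheritance patterns owned by processor `rank`.

    Assumes num_procs divides vector_len, which holds when both are powers of two.
    """
    if vector_len % num_procs != 0:
        raise ValueError("vector length must be divisible by the processor count")
    chunk = vector_len // num_procs
    return range(rank * chunk, (rank + 1) * chunk)


# Example: a 24-bit problem spread over 64 processors.
owned = local_slice(1 << 24, 64, rank=3)
entries_per_proc = len(owned)  # 2^24 / 64 = 2^18 entries of each vector
```

Under this layout, every processor stores 1/P of each of the m vectors, rather than m/P whole vectors, so the per-processor memory shrinks with P even when m is small.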

Parallel Genehunter

Parallelization of step 1: For each marker, calculate the probability of each of the possible inheritance patterns
  • Each processor calculates the probabilities for a particular $2^{2n}/P$ inheritance patterns for every marker

Parallel Genehunter

Parallelization of step 2: For each marker, calculate the conditional probability of each inheritance pattern, conditional on all of the markers to the left and to the right

FFT convolution
  • As in serial Genehunter, the $2^{2n} \times 2^{2n}$ matrix-vector multiplication is replaced by FFT-based convolution:
    1. 2 forward 1D FFTs on $2^{2n}$-length vectors
    2. element-by-element multiplication
    3. inverse FFT
  • Each 1D FFT is equivalent to a 2D FFT on a $P \times 2^{2n}/P$ matrix
  • There are well-known distributed algorithms for this FFT using all-to-all communication

Dot product in $P_{2|1} = P_2 \cdot (M P_1)$
  • Trivially parallelized: each processor has the same portion of each vector
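
The 1D-to-2D equivalence can be illustrated with a small single-process Python sketch. It assumes the transform in question is the Walsh-Hadamard transform over bit patterns, for which the equivalence is exact with no extra correction factors. An ordinary complex FFT would need additional twiddle factors between the two passes. The function names are hypothetical, and the list transposes stand in for the all-to-all communication step of a real distributed implementation.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be a power of two)."""
    a = list(values)
    h = 1
    while h < len(a):
        for start in range(0, len(a), 2 * h):
            for k in range(start, start + h):
                x, y = a[k], a[k + h]
                a[k], a[k + h] = x + y, x - y
        h *= 2
    return a


def fwht_2d(vec, num_procs):
    """Same transform, computed on a num_procs x (N / num_procs) matrix.

    Row r plays the role of the slice held by processor r.
    """
    row_len = len(vec) // num_procs
    rows = [vec[r * row_len:(r + 1) * row_len] for r in range(num_procs)]

    # Pass 1: each processor transforms its own row (no communication).
    rows = [fwht(row) for row in rows]

    # Transpose (all-to-all in a distributed setting), then transform each column.
    cols = [fwht([rows[r][c] for r in range(num_procs)]) for c in range(row_len)]

    # Transpose back to the original row-major layout.
    return [cols[c][r] for r in range(num_procs) for c in range(row_len)]


signal = list(range(16))
matches = fwht_2d(signal, 4) == fwht(signal)
```

The row pass handles the low-order bits of each index and the column pass handles the high-order bits, which is why only the column pass needs data from other processors.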

Parallel Genehunter

Parallelization of step 3: For each marker, calculate the probability of the unknown gene being located at specific locations
  • Computing $P_{x|1 \ldots i}$ and $P_{x|i+1 \ldots m}$: FFTs parallelized as in step 2
  • Final dot product $p = P_D \cdot P_{x|1 \ldots i} \cdot P_{x|i+1 \ldots m}$: parallelized as in step 2
  • Each processor holds the same portion of each vector
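
A minimal single-process sketch of how this final product could be split. It assumes p is the sum over inheritance patterns of the three vectors multiplied entry by entry, and that each processor owns a contiguous slice. The function names are hypothetical, and the final `sum` stands in for a reduction across processors.

```python
def local_partial(pd, left, right, indices):
    """Partial sum over the inheritance patterns owned by one processor."""
    return sum(pd[k] * left[k] * right[k] for k in indices)


def distributed_score(pd, left, right, num_procs):
    """Simulate P processors, each reducing its own slice, then combine the results."""
    chunk = len(pd) // num_procs  # assumes num_procs divides the vector length
    partials = [
        local_partial(pd, left, right, range(r * chunk, (r + 1) * chunk))
        for r in range(num_procs)
    ]
    return sum(partials)  # stand-in for a reduction across processors


# Small integer example so the comparison with the serial sum is exact.
pd = list(range(8))
left = [1] * 8
right = [2] * 8
serial = sum(a * b * c for a, b, c in zip(pd, left, right))
parallel = distributed_score(pd, left, right, num_procs=4)
```

Because all three vectors are partitioned identically, each processor needs only its own data, and the only communication is the final combination of P scalars.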

Evaluation

Experimental Environment

  • Input data sets
    • 51-family-member pedigree
    • {19,21,24}-bit data sets (# bits = 2n-f)
  • Computing facilities
    • Cplant Cluster (Sandia National Laboratories)
    • DEC Alpha EV6 processors
    • Myrinet connection

Evaluation

Runtimes for the 19-, 21-, and 24-bit problems

Observations

Pro: Performs Genehunter computation exactly

Pro: Effective for “multipoint linkage” of phenotypes

Con: Old-fashioned compared to protein-based methods (?)

Pro: Distributes memory requirements

Pro: More computers allows larger feasible inputs

Con: Experiments based on 1 pedigree

Pro: Efficient parallelization up to 32 or 64 processors

Con: Allows pedigrees to grow by only 3 or 4 individuals in equal time (each added non-founder adds 2 bits, so the vectors grow fourfold)

References

Genetic Recombination
  Dr. Craig Woodworth, Genetic Recombination in Eukaryotes, Lecture Notes (www.clarkson.edu/class/by214/powerpoint)

Genehunter
  K. Markianos, M.J. Daly, & L. Kruglyak. Efficient Multipoint Linkage Analysis Through Reduction of Inheritance Space. American Journal of Human Genetics 68, 2001.

Parallel Genehunter
  G. Conant, S. Plimpton, W. Old, A. Wagner, P. Fain, & G. Heffelfinger. Parallel Genehunter: Implementation of a Linkage Analysis Package for Distributed-Memory Architectures, Proceedings of the First IEEE Workshop on High Performance Computational Biology, International Parallel and Distributed Computing Symposium, 2002.

Questions?