56
Efficient and Accurate Clustering for Large-Scale Genetic Mapping V. Strnadová (Neeley) , Aydın Buluç , Jarrod Chapman , Joseph Gonzalez , John Gilbert , Stefanie Jegelka , Daniel Rokhsar , Leonid Oliker * ++ § *,++ * *, ¶ § § ++ *,§, ¶ * Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute

Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Efficient and Accurate Clustering for Large-Scale Genetic Mapping

V. Strnadová (Neeley) , Aydın Buluç , Jarrod Chapman , Joseph Gonzalez ,

John Gilbert , Stefanie Jegelka , Daniel Rokhsar , Leonid Oliker

* ++ § ¶

*,++ * *, ¶ §

§++ *,§, ¶ *

Lawrence Berkeley National Labs, UC Santa Barbara, UC Berkeley, Joint Genome Institute

Page 2: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Motivation

• High-throughput sequencing methods have produced a flood of inexpensive genetic information

• Genetic maps are important to breeding studies but genetic mapping software is prohibitively slow on large data sets

Page 3: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

Data

Page 4: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

𝑚1

𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4𝑚5

𝑚6

𝑚7𝑚10𝑚11𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

cluster

Data

Page 5: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

𝑚1

𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4𝑚5

𝑚6

𝑚7𝑚10𝑚11𝑚12

𝑚13𝑚14

𝑚8

𝑚15

𝑚2

𝑚1

𝑚9

Linkage group 1

Linkage group 2

cluster

Linkage group 1 Linkage group 2

Data

𝑚11

𝑚6

𝑚13

𝑚3

𝑚12

𝑚10

𝑚4

𝑚7

𝑚5𝑚14

Page 6: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Genetic Mapping Problem

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(missing data)

𝒎𝟏

𝒎𝟐

𝒎𝟖𝒎𝟗

𝒎𝟏𝟓

𝒎𝟑

𝒎𝟒𝒎𝟓

𝒎𝟔

𝒎𝟕𝒎𝟏𝟎𝒎𝟏𝟏𝒎𝟏𝟐

𝒎𝟏𝟑𝒎𝟏𝟒

Linkage group 1

Linkage group 2

cluster

Data

𝑚8

𝑚15

𝑚2

𝑚1

𝑚9

Linkage group 1 Linkage group 2

𝑚11

𝑚6

𝑚13

𝑚3

𝑚12

𝑚10

𝑚4

𝑚7

𝑚5𝑚14

Page 7: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Need for Large-Scale Clustering in Genetic Mapping

• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers

• A major bottleneck is the linkage-group-finding phase

• Popular mapping tools all handle this phase the same way, with an 𝑂(𝑀2) clustering algorithm for 𝑀 markers

Page 8: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers

• A major bottleneck is the linkage-group-finding phase

• Popular mapping tools all handle this phase the same way, with an 𝑂(𝑀2) clustering algorithm for 𝑀 markers

Our solution: A fast, scalable clustering algorithm tailored to genetic marker data

The Need for Large-Scale Clustering in Genetic Mapping

Page 9: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Standard Approach to Genetic Marker Clustering

cluster

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

Data

𝑚1

𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4𝑚5

𝑚6

𝑚7𝑚10𝑚11𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

Page 10: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Standard Approach to Genetic Marker Clustering

(1)

𝑚1𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

• (2) Cut all edges below a LOD threshold

• (3) The resulting connected components = linkage groups

Page 11: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Standard Approach to Genetic Marker Clustering

(1)

𝑚1𝑚2

𝑚8𝑚9

𝑚15

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Linkage group 2

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

• (2) Cut all edges below a LOD threshold

• (3) The resulting connected components = linkage groups

Page 12: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:

𝐿𝑂𝐷(𝑚𝑖 , 𝑚𝑗) = log10𝑃(𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

𝑃(𝑛𝑜 𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

Formally,

𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑁𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+𝑁𝑅𝑖𝑗

Where:

𝑅𝑖𝑗 = number of recombinant offspring

𝑁𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅

𝑅+𝑁𝑅

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

Page 13: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:

𝐿𝑂𝐷(𝑚𝑖 , 𝑚𝑗) = log10𝑃(𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

𝑃(𝑛𝑜 𝑙𝑖𝑛𝑘𝑎𝑔𝑒𝑖𝑗)

Formally,

𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Where:

𝑅𝑖𝑗 = number of recombinant offspring

𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅𝑖𝑗

𝑅𝑖𝑗+ 𝑅𝑖𝑗

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

Page 14: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

LOD ScoreCompares the likelihood of obtaining test data if the two markers are indeed linked, to the likelihood of observing the same data purely by chance:

𝑳𝑶𝑫(𝒎𝒊,𝒎𝒋) = 𝐥𝐨𝐠𝟏𝟎(𝟏 − 𝟏 𝟑)

𝟐( 𝟏 𝟑)𝟏

𝟎. 𝟓𝟑= 𝟎. 𝟎𝟕𝟒

Formally,

𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Where:

𝑅𝑖𝑗 = number of recombinant offspring

𝑅𝑖𝑗 = number of nonrecombinant offspring 𝜃𝑖𝑗 = recombination fraction, i.e. 𝑅𝑖𝑗

𝑅𝑖𝑗+ 𝑅𝑖𝑗

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

Page 15: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Standard Approach to Genetic Marker Clustering

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

(2)

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Page 16: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Standard Approach to Genetic Marker Clustering

(1) Compute the similarity between all 𝑂(𝑀2) pairs of markers, producing a complete graph with 𝑀 vertices

• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked

(2) Cut all edges below a LOD threshold

(3) The resulting connected components = linkage groups

(3)

𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6

𝑚1 A B - - A -

𝑚2 A B A A B A

𝑚3 A A - - - B

𝑚4 A - B - B B

𝑚5 B - B A - A

𝑚6 A A B A - -

𝑚7 - - - A B B

𝑚8 A B A B - A

𝑚9 A B - B - -

𝑚10 B B B - A A

𝑚11 A A A A B B

𝑚12 B - A B A -

𝑚13 B B - A A -

𝑚14 - - - B A A

𝑚15 B - - A A B

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Page 17: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Our Approach: The BubbleCluster Algorithm

Primary assumption: Clusters have a “linear structure”

Page 18: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Our Approach: The BubbleCluster Algorithm

Primary assumption: Clusters have a “linear structure”

Key idea: Maintain a set of representative or “sketch” points which reveal the cluster structure

LOD threshold

Page 19: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Input: set of markers, LOD threshold 𝜏, non-missing threshold 𝜂, low-quality threshold 𝑐, cluster size threshold 𝜎

Output: set of clusters C and set of representative points R

Phase I: Build initial set of clusters and set of representative points using high-quality markers (those with at least 𝜂 non-missing entries)

Phase II: Add low quality markers (less than 𝜂 non-missing entries) to intialset of clusters

Phase III: Attempt to merge small clusters with large

The BubbleCluster Algorithm: Overview

Page 20: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

𝒎

Iteration i:find 𝑟𝑀𝐴𝑋 ∶= 𝑟𝑗 for which 𝐿𝑂𝐷(𝑚, 𝑟𝑗) is

maximal;set 𝐶𝑀𝐴𝑋 ≔ 𝐶𝐾 ∈ 𝐶 containing 𝑟𝑀𝐴𝑋

𝐿𝑂𝐷(𝑚, 𝑟𝑗)

𝑟𝑗𝐶1 𝐶2

Page 21: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

If (𝐿𝑂𝐷 𝑚, 𝑟𝑀𝐴𝑋 < 𝑳𝑶𝑫_𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅)

𝒎𝐿𝑂𝐷(𝑚, 𝑟𝑀𝐴𝑋)

𝑟𝑀𝐴𝑋𝐶1 𝐶2

Page 22: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

If (𝐿𝑂𝐷 𝑚, 𝑟𝑀𝐴𝑋 < 𝑳𝑶𝑫_𝒕𝒉𝒓𝒆𝒔𝒉𝒐𝒍𝒅)𝐶 = 𝐶 ∪ {𝑚}

𝒎

𝑟𝑀𝐴𝑋𝐶1 𝐶2

𝐶3

Page 23: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

Else If ( IS_INTERIOR 𝑟𝑀𝐴𝑋 )

𝒎

𝑟𝑀𝐴𝑋

𝐶1𝐶2

𝐿𝑂𝐷(𝑚, 𝑟𝑀𝐴𝑋)

Page 24: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

Else If ( IS_INTERIOR 𝑟𝑀𝐴𝑋 )𝐶𝑀𝐴𝑋 = 𝐶𝑀𝐴𝑋 ∪ {𝑚}

𝒎

𝑟𝑀𝐴𝑋

𝐶1 = 𝐶𝑀𝐴𝑋𝐶2

Page 25: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Else If ( IS_EXTERIOR 𝑚, 𝑟𝑀𝐴𝑋 )

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

The BubbleCluster Algorithm

Page 26: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

Else If ( IS_EXTERIOR 𝑚, 𝑟𝑀𝐴𝑋 )Add 𝑚 to representative points of 𝐶𝑀𝐴𝑋Add 𝑚 to 𝐶𝑀𝐴𝑋

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

Page 27: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Else // 𝑚 is interior to the outer point 𝑟𝑀𝐴𝑋

The BubbleCluster Algorithm

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

Page 28: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Else // 𝑚 is interior to the outer point 𝑟𝑀𝐴𝑋Add 𝑚 to 𝐶𝑀𝐴𝑋

The BubbleCluster Algorithm

𝒎

𝐶2𝐶1 = 𝐶𝑀𝐴𝑋

𝑟𝑀𝐴𝑋

Page 29: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The BubbleCluster Algorithm

𝒎

𝐶1 𝐶2

If 𝑚 has a LOD score above the threshold to two clusters,

Page 30: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

𝐶𝑁𝐸𝑊 = 𝐶1 ∪ 𝐶2

If 𝑚 has a LOD score above the threshold to two clusters, Then merge the clusters and add m to the merged cluster

𝒎

The BubbleCluster Algorithm

Page 31: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

End of Phase IStop when all markers with at least 𝜂 non-missing entries have been processed

Running time: 𝑶 |𝜢| 𝐥𝐨𝐠𝟐 𝚮 + |𝜢||𝑹|where: |𝛨| = size of high-quality marker set,

|𝑅| = size of representative point set

Phase II: add low-quality markers

Phase III: merge small clusters

Page 32: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

BubbleCluster Parameters

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃)𝑁𝑅𝜃𝑅

0.5𝑅+𝑁𝑅

Highest achievable LOD: lim𝑅→0log10

(1 − 𝜃)𝑁𝑅𝜃𝑅

0.5𝑅+𝑁𝑅= 𝑁𝑅 log10 2

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

𝜏1

𝜏2

Page 33: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

𝜏1

𝜏2

BubbleCluster Parameters

Page 34: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Example LOD: log10(1 − 1 3)

2( 1 3)1

0.53= 0.074

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

𝜏1

𝜏2

BubbleCluster Parameters

Page 35: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

LOD threshold

Non-missing threshold: determines how many markers are high-quality

Recall the LOD score: 𝐿𝑂𝐷 = log10(1 − 𝜃𝑖𝑗 )

𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅𝑖𝑗+ 𝑅𝑖𝑗

Highest achievable LOD: lim𝑅𝑖𝑗→0log10

(1 − 𝜃𝑖𝑗) 𝑅𝑖𝑗𝜃𝑖𝑗𝑅𝑖𝑗

0.5𝑅+ 𝑅𝑖𝑗

= 𝑅𝑖𝑗log10 2

𝑚𝑖 A B - - A -

𝑚𝑗 A B A A B A

𝜏1

𝜏2

BubbleCluster Parameters

Page 36: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Evaluation Metric: 𝐹-score

Given a golden standard clustering, the 𝐹-score measures the quality of another clustering by comparing it to the golden standard

Range: 0 – 1

The 𝐹-score is a harmonic mean of precision and recallAn 𝐹-score of 1 indicates perfect precision and perfect recall for every golden

standard cluster

Page 37: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Results: BubbleCluster on Real Data Sets

Dataset Size Time F-Score

Barley 64K 15 sec 0.9993

Switchgrass 113K 8.9 min 0.9745

Switchgrass 548K 1.9 hrs 0.9894

Wheat 1.58M 1.22 hrs N/A *

* Results under review at Genome Biology

Page 38: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Comparison of Clustering Algorithms for Simulated Data

ClusteringMethod

12.5K Markers 25K Markers

F-score Time F-score Time

JoinMap 0.99964 14 min 0.99982 46 min

MSTMap 0.99964 4.5 min 0.99982 20 min

PIC 0.47024 11 sec (+ 4min)

0.60782 44 sec (+ 16.5min)

BubbleCluster 0.99964 6 sec 0.99982 15 sec

Simulated data created with Nicholas Tinker’s Spaghetti Software.

Page 39: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

4.4

9.8

21.0

47.2

96.6

237.8

6.7

14.0

31.7

86.4

198.7

475.1

1

2

4

8

16

32

64

128

256

512

1024

12.5 25 50 100 200 400

1E-5

1E-4

1E-3

1E-2

1E-1

1E+0

Ru

nti

me

(s)

Dataset Size (in thousands of markers)

Erro

r (1

-Fs

core

)

error, 65% missing

error, 35% missing

runtime, 65% missing

runtime, 35% missing

Scaling Results for Bubble Cluster on Simulated Data

Page 40: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Effect of the LOD threshold

LOD threshold 5 10 15 20 25 30

F-Score 0.6225 0.9999 0.9999 0.9999 0.9999 0.9999

Time (s) 48.6 67.0 70.9 82.0 106 170

Fixed missing entry threshold, increasing LOD threshold

200K markers, 300 individuals, 35% missing rate

𝜏1𝜏2 vs.

Page 41: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Effect of the missing data threshold

Fixed LOD threshold, increasing non-missing entry threshold

Non-missing threshold

132 166 172 179 186 192

F-Score 0.9999 0.9999 0.9992 0.9930 0.9610 0.8948

Time (s) 82.0 84.6 82.7 83.0 81.7 82.0

200K markers, 300 individuals, 35% missing rate

𝜏 𝜏𝜏vs.

Page 42: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

ConclusionBy exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data

𝒎

𝐶1 𝐶2

Page 43: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

ConclusionBy exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data

𝒎

𝐶1 𝐶2While remaining highly accurate, we outperform popular existing tools in both runtime and scalability

Clustering Method 12.5K Markers 25K Markers

F-score Time F-score Time

JoinMap 0.99964 14 min 0.99982 46 min

MSTMap 0.99964 4.5 min 0.99982 20 min

PIC 0.47024 11 sec (+ 4 min)

0.60782 44 sec (+ 16.5 min)

BubbleCluster 0.99964 6 sec 0.999982 15 sec

Page 44: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Future Work

• Use representative points as starting point for ordering phase

Page 45: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,
Page 46: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Future Work

• Use representative points as starting point for ordering phase

• Provide a more thorough theoretical analysis of achievable clustering as well as order accuracy given assumptions about error and missing data rates

• Develop efficient and accurate, large-scale genetic mapping software

Page 47: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Thank You

Code for BubbleCluster soon available at: www.ucsb.edu/~veronika

Page 48: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Backup Slides

Page 49: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Choosing LOD and non-missing thresholdGoal: minimize 𝑷(mistake) and maximize 𝑭- score

Let 𝑝 = 𝑃 𝐿𝑂𝐷 𝑚𝑖 , 𝑚𝑗 > 𝐿𝑂𝐷𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 𝑥𝑖 ∈ 𝐶𝑖 , 𝑥𝑗 ∈ 𝐶𝑗 , 𝑖 ≠ 𝑗)

Let 𝜏 = LOD threshold

By definition of the LOD score, 𝑝 =1

10𝜏

Let 𝑛𝑐𝑜𝑚𝑝 = number of LOD comparisons we make

Then, if we want to ensure that (1 − 𝑝)𝑛𝑐𝑜𝑚𝑝< 1 − 𝜀 then we need:

𝜏 > log10(1

1 − (1 − 𝜀) 1𝑛𝑐𝑜𝑚𝑝)

At the same time, we want to include a marker in the high-quality set only if we expect that it will achieve a LOD of 𝜏 or greater with another marker, requiring:

𝑛𝑛𝑚 > 𝜏(1 − 𝜇) log10 2

Where 𝜇 is the missing rate and 𝑛𝑛𝑚 is the number of non-missing entries in the marker

Page 50: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Evaluation Metric for Cluster Quality

F-score (range: 0 to 1)• Given a “golden standard clustering”, the F-score measures the

quality of a clustering as follows:

• The F-score between a golden standard cluster 𝑔 and a test cluster 𝑐is a harmonic mean of precision 𝑃 and recall 𝑅:

𝐹𝑠𝑐𝑜𝑟𝑒 𝑔, 𝑐 =2𝑃𝑅

𝑃 + 𝑅• The overall F-score between the golden standard clustering G and a

test clustering C is a weighted average of the F-scores for each golden standard cluster 𝑔:

𝑜𝑣𝑒𝑟𝑎𝑙𝑙_𝐹𝑠𝑐𝑜𝑟𝑒 𝐺, 𝐶 =1

𝑚

𝑔 ∋ 𝐺

𝑔 ∗ max𝑐 ∋𝐶𝐹𝑠𝑐𝑜𝑟𝑒(𝑔, 𝑐)

Page 51: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

Local Linearity Assumption

Although the LOD score does not obey the triangle inequality, we assume that it does at close ranges and with enough data

𝑳𝑶𝑫(𝒎, 𝒓𝟏)𝑳𝑶𝑫(𝒎, 𝒓𝟐)

𝒎

𝒓𝟏𝒓𝟐𝑳𝑶𝑫(𝒓𝟐, 𝒓𝟏)

𝑳𝑶𝑫 𝒎, 𝒓𝟐 < 𝑳𝑶𝑫 𝒎, 𝒓𝟏&&

𝑳𝑶𝑫 𝒎, 𝒓𝟐 < 𝑳𝑶𝑫 𝒓𝟏, 𝒓𝟐&&

𝑳𝑶𝑫 𝒎, 𝒓𝟏 > 𝑳𝑶𝑫 𝒓𝟏, 𝒓𝟐

𝒎 is a new boundary point

Page 52: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Effect of the LOD threshold

The standard approach to clustering can be viewed as single linkage clustering

Time complexity: 𝑂(𝑀2)

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

Page 53: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Effect of the LOD threshold

A high LOD threshold ensures that the only edges that remain in the completely connected graph are between markers that are extremely likely to be genetically linked

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

Page 54: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Effect of the LOD threshold

Example: LOD threshold = 10

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Page 55: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Effect of the LOD threshold

Example: LOD threshold = 8

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1

Page 56: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,

The Effect of the LOD threshold

Example: LOD threshold = 7

LOD 10

LOD 9

LOD 8

𝑚15𝑚2𝑚1 𝑚8 𝑚9

LOD 7

𝑚4𝑚3 𝑚5 𝑚6 𝑚7 𝑚10 𝑚11 𝑚12 𝑚13 𝑚14

LOD 6

𝑚1𝑚2

𝑚8𝑚9

𝑚15

Linkage group 2

𝑚3

𝑚4 𝑚5𝑚6

𝑚7𝑚10𝑚11

𝑚12

𝑚13𝑚14

Linkage group 1