20
Puzzles in massively parallel genome biology await GPU solutions Oleksiy Karpenko Bioinformatics Program University of Illinois at Chicago Super Computing 2011 Seattle, WA This presentation is funded by Pan-American Advanced Studies Institute of NSF

Puzzles in massively parallel genome biology await GPU solutions

  • Upload
    vudien

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Puzzles in massively parallel genome biology await GPU solutions

Puzzles in massively parallel

genome biology

await GPU solutions

Oleksiy Karpenko

Bioinformatics Program

University of Illinois at Chicago

Super Computing 2011

Seattle WA

This presentation is funded by Pan-American Advanced Studies Institute of NSF

2

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

3

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

4

DNA is the same

but active genes are different

ne

httpwwwbioapsanlgovscihiimgheartjpg httpwwwsciencephotocomimage190555largeF0014577-The_femur-SPLjpg

bone genesheart genes

5

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

bull the non-coding regions host important regulatory sites

6

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 2: Puzzles in massively parallel genome biology await GPU solutions

2

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

3

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

4

DNA is the same

but active genes are different

ne

httpwwwbioapsanlgovscihiimgheartjpg httpwwwsciencephotocomimage190555largeF0014577-The_femur-SPLjpg

bone genesheart genes

5

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

bull the non-coding regions host important regulatory sites

6

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 3: Puzzles in massively parallel genome biology await GPU solutions

3

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

4

DNA is the same

but active genes are different

ne

httpwwwbioapsanlgovscihiimgheartjpg httpwwwsciencephotocomimage190555largeF0014577-The_femur-SPLjpg

bone genesheart genes

5

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

bull the non-coding regions host important regulatory sites

6

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 4: Puzzles in massively parallel genome biology await GPU solutions

4

DNA is the same

but active genes are different

ne

httpwwwbioapsanlgovscihiimgheartjpg httpwwwsciencephotocomimage190555largeF0014577-The_femur-SPLjpg

bone genesheart genes

5

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

bull the non-coding regions host important regulatory sites

6

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 5: Puzzles in massively parallel genome biology await GPU solutions

5

Human genome for computer

scientists

hellipACGTTGGATCGAGACATGACGATGhellip

bull 4 letter alphabet of DNA A C G T

bull ~3 GB letters long

bull 2 are genes coding for proteins

bull 98 formerly known as ldquojunk DNArdquo

bull the non-coding regions host important regulatory sites

6

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 6: Puzzles in massively parallel genome biology await GPU solutions

6

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 7: Puzzles in massively parallel genome biology await GPU solutions

7

1) Transcription factors

httpwwwnaturecomnrijournalv2n9imagesnri887-f2gif

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 8: Puzzles in massively parallel genome biology await GPU solutions

8

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 9: Puzzles in massively parallel genome biology await GPU solutions

9

2) Histone modifications

httpwwwnyasorgimageaxdid=a261ff6e-1e42-442c-8dc6-428a60574df4ampt=634329511969400000

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 10: Puzzles in massively parallel genome biology await GPU solutions

10

3) DNA methylation

httpwwwscqubccawp-contentuploads200608methylation[1]-GIFgif

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 11: Puzzles in massively parallel genome biology await GPU solutions

11

Some known regulatory rules

httpagarizonaeduresearchxwangmilestonesRice_Chipchipjpg

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 12: Puzzles in massively parallel genome biology await GPU solutions

12

Genome site signals detected

individually

hellipACGTTGGATCGAGACATGACGATGhellip

Single transcription

factor binding sites

Single histone modifcation

sites

DNA methylation

sites

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 13: Puzzles in massively parallel genome biology await GPU solutions

13

Different signals for different

tissues

hellipACGTTGGATCGAGACATGACGATGhellip

ne

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 14: Puzzles in massively parallel genome biology await GPU solutions

14

Complexity

Consider a combination of

bull ~2600 transcription factors

bull ~210 unique cell types

bull dozens of histone modifications

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 15: Puzzles in massively parallel genome biology await GPU solutions

15

Genome signals

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 16: Puzzles in massively parallel genome biology await GPU solutions

16

Gene neighborhood matters

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 17: Puzzles in massively parallel genome biology await GPU solutions

17

Look for common patterns

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 18: Puzzles in massively parallel genome biology await GPU solutions

18

Partition the data

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 19: Puzzles in massively parallel genome biology await GPU solutions

19

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

GPU

core

Process in parallel

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology

Page 20: Puzzles in massively parallel genome biology await GPU solutions

20

Take home message

bull Biology as a science is becoming more

computationally oriented

bull Genome biology is massively parallel by

its nature

bull GPUs fit perfectly for solving problems in

genome biology