SNP Allele Designations (Bio::SNP::Inherit)
Christopher Bottoms
BOSC 2010
5 million data “items”
one CPU: 2+ days
eight CPUs: 1-2 days
SNP ID   Sample ID   Base1   Base2
1        1           A       A
1        2           A       A
1        3           A       G
…        …           …       …
1        5000        A       A
2        1           C       C
…        …           …       …
…        …           …       …
1106     5000        GG      GG
“Matrix” data file format
SNP ID   1    2    3    …   5000
SNP1     AA   AA   AG   …   AA
SNP2     CC   GG   GG   …   CG
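The change from one line per SNP/sample pair to one line per SNP can be sketched as a simple pivot. This is a minimal illustration in Python, not the code used in the talk or in Bio::SNP::Inherit; the function name `long_to_matrix` and the `--` placeholder for a missing call are assumptions made for this example, and the sample rows echo the values shown on the slides.

```python
from collections import defaultdict


def long_to_matrix(rows):
    """Pivot (snp_id, sample_id, base1, base2) records into one row per SNP."""
    calls = defaultdict(dict)  # snp_id -> {sample_id: two-letter genotype}
    samples = set()
    for snp_id, sample_id, base1, base2 in rows:
        calls[snp_id][sample_id] = base1 + base2
        samples.add(sample_id)

    sample_ids = sorted(samples)
    matrix = [["SNP ID"] + [str(s) for s in sample_ids]]
    for snp_id in sorted(calls):
        # "--" marks a sample with no call for this SNP (placeholder chosen here)
        genotypes = [calls[snp_id].get(s, "--") for s in sample_ids]
        matrix.append([f"SNP{snp_id}"] + genotypes)
    return matrix


# A few records in the original four-column layout
long_rows = [
    (1, 1, "A", "A"),
    (1, 2, "A", "A"),
    (1, 3, "A", "G"),
    (2, 1, "C", "C"),
    (2, 2, "G", "G"),
    (2, 3, "G", "G"),
]

matrix = long_to_matrix(long_rows)
for row in matrix:
    print("\t".join(row))
```

With the matrix layout, each SNP's genotypes for every sample arrive on a single line, so a program can handle one SNP per read instead of gathering thousands of scattered records.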
Using new data format
12 million data items
one CPU: ~30 min
ID’s file
ID Name Group
1 B73 B73
2 B73xZ1 NAMF1
3 Mo17 Control
4 M100 IBM
5 Bob B73xZ1
“Human Parsed” ID’s file
ID Name Group A (ID) B (ID) AxB (ID)
1 B73 B73
2 B73xZ1 NAMF1
3 Mo17 Control
4 M100 IBM 1 3
5 Bob B73xZ1 1 2
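Once a person has filled in the parent columns, a program can resolve each sample's parents by ID rather than by guessing from names. The sketch below is a hypothetical Python illustration, not the module's actual parser. It assumes a tab-delimited file in which an empty field means "not applicable", and it assumes that the two values on the M100 row belong to the A (ID) and B (ID) columns. The Bob row is left out because the flattened slide text does not show which columns its values occupy.

```python
import csv
import io

# Hypothetical tab-delimited rendering of the "Human Parsed" ID's file.
ID_FILE = (
    "ID\tName\tGroup\tA (ID)\tB (ID)\tAxB (ID)\n"
    "1\tB73\tB73\t\t\t\n"
    "2\tB73xZ1\tNAMF1\t\t\t\n"
    "3\tMo17\tControl\t\t\t\n"
    "4\tM100\tIBM\t1\t3\t\n"
)


def read_id_file(text):
    """Return {sample_id: record}, with parent references as int or None."""
    samples = {}
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):

        def ref(column):
            value = (rec.get(column) or "").strip()
            return int(value) if value else None

        samples[int(rec["ID"])] = {
            "name": rec["Name"],
            "group": rec["Group"],
            "parent_a": ref("A (ID)"),
            "parent_b": ref("B (ID)"),
            "f1": ref("AxB (ID)"),
        }
    return samples


samples = read_id_file(ID_FILE)
m100 = samples[4]
parents = [samples[i]["name"] for i in (m100["parent_a"], m100["parent_b"])]
print(m100["name"], "parents:", parents)
```

Storing parents as IDs keeps the one-time human judgment in the file itself, so the repetitive lookups can be left entirely to the program.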
Lessons learned
Explore other solutions before deciding on parallel processing
File format changes can simplify work
When appropriate, divide work
Humans: Complicated but “once-only” task
Computers: Repetitive boring work
Acknowledgements
Advisors
Mike McMullen
Sherry Flint-Garcia
Hardware support
Arturo Garcia
Funding
National Science Foundation Plant Genome Program
Grant DBI-0820619
USDA-ARS
Programming support
You (CPAN)
You (stackoverflow.com)
You (perlmonks.org)
End