Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)

1©Cloudera,Inc.Allrightsreserved.

11April2016,SwissBigDataUserGroupTomWhite|@tom_e_white

PetascaleAnalyJcsinGenomicswithHadoop


• DataScienceTeamatCloudera

• ApacheHadoopCommiLer,PMCMember,ApacheMember

• Authorof“Hadoop:TheDefiniJveGuide”

AboutMe


Whatisgenomics?


Organism Cell Genome



Referencechromosome

Loca4on

7©Cloudera,Inc.Allrightsreserved.“…decodingtheBookofLife”








>read1TTGGACATTTCGGGGTCTCAGATT>read2

AATGTTGTTAGAGATCCGGGATTT>read3GGATTCCCCGCCGTTTGAGAGCCT>read4AGGTTGGTACCGCGAAAAGCGCAT

Bioinforma4cs!


Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnota4on

Pipelines!


##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Compressedtextfiles(non-spliKable)Semi-structuredPoorlyspecified

Globalsortorder


.fastq .sam .bam .vcf .gvcf .bcf

bwa

.bed .gff .ped

PLINK/SEQ

Avro IDL(schemas)

bedtools

samtoolsbcftools

htslib GATK

File formats(e.g., Parquet

columnar)

ExecutionModel

ApplicationLayer

DataModel

StorageModel

Hadoopdistributedfilesystem

HBasedistributedkey-value

S3cloud

storage

NFS(POSIX)

SNAP, ADAM(bwa, Picard, GATK)

MapReduce(incl. distributed

in-memory)

distributedSQL queries

PLINK/SEQ genomeassembly

distributed graphcomputation (BSP)

CHPC(scheduler)POSIXfilesystem

JavaHPC(Queue)POSIXfilesystem

C++Single-nodeSQLite

It’sfileformatsallthewaydown!


Dedup


/***Mainworkmethod.ReadstheBAMfileonceandcollectssortedinformationabout*the5'endsofbothendsofeachread(orjustoneendinthecaseofpairs).*Thenmakesapassthroughthosedeterminingduplicatesbeforere-readingthe*inputfileandwritingitoutwithduplicationflagssetcorrectly.*/protectedintdoWork(){//buildsomedatastructuresbuildSortedReadEndLists(useBarcodes);generateDuplicateIndexes(useBarcodes);

finalSAMFileWriterout=newSAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader,true,OUTPUT);finalCloseableIterator<SAMRecord>iterator=headerAndIterator.iterator;while(iterator.hasNext()){finalSAMRecordrec=iterator.next();if(!rec.isSecondaryOrSupplementary()){if(recordInFileIndex==nextDuplicateIndex){rec.setDuplicateReadFlag(true);//Nowtryandfigureoutthenextduplicateindexif(this.duplicateIndexes.hasNext()){nextDuplicateIndex=this.duplicateIndexes.next();}else{//Onlyhappensoncewe'vemarkedalltheduplicatesnextDuplicateIndex=-1;}}else{rec.setDuplicateReadFlag(false);}}recordInFileIndex++;if(!this.REMOVE_DUPLICATES||!rec.getDuplicateReadFlag()){out.addAlignment(rec);

Method

Code


@Option(shortName="MAX_FILE_HANDLES",doc="Maximumnumberoffilehandlestokeepopenwhenspilling"+"readendstodisk.Setthisnumberalittlelowerthanthe"+"per-processmaximumnumberoffilethatmaybeopen.This"+"numbercanbefoundbyexecutingthe'ulimit-n'commandon"+"aUnixsystem.")publicintMAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000;

Dedup

Method

Code

PlaSorm



VariantAnnota4on


It’spipelinesallthewaydown!


VariantAnnota4on

Node1


VariantAnnota4on

Node2


VariantAnnota4on

Node3




VariantAnnota4on

Alignment Dedup Recalibrate QC/Filter



Node1


VariantAnnota4on

Node2

Node3



Node4


Node1

Alignment Dedup QC/FilterVariantCalling

VariantAnnota4on

Node2

Node3

Alignment Dedup QC/Filter

Alignment Dedup QC/Filter

Node4

Recalibrate


WhyAreWeSJllDefiningFileFormatsByHand?

•  InsteadofdefiningcustomfileformatsforeachdatatypeandaccesspaLern…

•  ParquetcreatesacompressedformatforeachAvro-defineddatamodel

•  ImprovementsoverexisJngformats•  ~20%forBAM•  ~90%forVCF


YARN-managedHadoopcluster

Sparkexecutors

Par4alsums

Driver

Applica4oncode

ContEstAlgorithm


HadoopprovideslayeredabstracJonsfordataprocessing

HDFS(filesystem)

YARN(resourcemanagement)

MapReduce Impala(SQL) Solr(search) Spark

ADAMquince GATK …

bdg-form

ats(Avro/Parque

t)

Hbase(NoSQL)Kudu(columnstore)


•  HostedatBerkeleyandtheAMPLab

•  Apache2License•  ContributorsfrombothresearchandcommercialorganizaJons

•  CorespaJalprimiJves,variantcalling

•  AvroandParquetfordatamodelsandfileformats

Spark+Genomics=ADAM


ExecuJngqueryinHadoop:interacJveSparkshell(ADAM)

def inDbSnp(g: Genotype): Boolean = true or false!def isDeleterious(g: Genotype): Boolean = g.getPolyPhen!

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()!val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()!val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase”)!val genotypesRDD = sc.adamLoad("path/to/genotypes")!

val filteredRDD = genotypesRDD! .filter(!inDbSnp(_))! .filter(isDeleterious(_))! .filter(isFramingham(_))!val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)!

val maf = joinedRDD! .keyBy(x => (x.getVariant, getPopulation(x)))! .groupByKey()! .map(computeMAF(_))!

maf.saveAsNewAPIHadoopFile("path/to/output")!

applypredicates

loaddata

joindata

group-byaggregate(MAF)

persistdata


ExecuJngqueryinHadoop:distributedSQL

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call) FROM genotypes g INNER JOIN samples s ON g.sample = s.sample INNER JOIN dnase d ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end LEFT OUTER JOIN dbsnp p ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt WHERE s.study = "framingham" p.pos IS NULL AND g.polyphen IN ( "possibly damaging", "probably damaging" ) GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop

applypredicates

“load”andjoindata

group-by

aggregate(UDAF)


ADAMpreliminaryperformance


•  DevelopedbytheBroadInsJtute•  CoreisMITlicense,someproprietarytoolsontop

•  Version4hasbeenre-wriLentouseSpark,nowcompeJJvewithADAMforspeed

•  UsesexisJngbiofileformatsforinputandoutput,butSparkRDDsforintermediatedata

GenomeAnalysisToolkit(GATK)


•  KudufillsgapbetweenHDFSandHBase

•  Fastscansandupdateable•  AddnewannotaJonstogenomicsdata(variants)withoutrewriJngwholedataset

•  Key=genomeposiJon

•  RangeparJJoning

KuduforVariantStores


Acknowledgements

UCBerkeleyMaLMassieFrankNothauMichaelHeuer

TamrTimothyDanford

MSSMJeffHammerbacherRyanWilliams

ClouderaUriLasersonSandyRyzaSeanOwen


Thankyou@[email protected]

Documents

Petascale AnalyJcs in Genomics with Hadoopfiles.meetup.com/18369817/SBDUG18-Petascale_Genomics.pdf · 2016-04-15 · SNAP, ADAM (bwa, Picard, GATK) MapReduce (incl. distributed in-memory)