Chalk Talk Tandy Warnow Departments of Computer Science and Bioengineering University of Illinois at Urbana-Champaign

Slides: tandy.cs.illinois.edu/chalk-talk-cmu.pdf


The Tree of Life: Multiple Challenges

Large datasets: 100,000+ sequences, 10,000+ genes, “Big Data” complexity

•  Large-scale statistical phylogeny estimation
•  Ultra-large multiple-sequence alignment
•  Estimating species trees from incongruent gene trees
•  Supertree estimation
•  Genome rearrangement phylogeny
•  Reticulate evolution
•  Visualization of large trees and alignments
•  Data mining techniques to explore multiple optima

Application areas:
•  metagenomics
•  protein structure and function prediction
•  trait evolution
•  detection of co-evolution
•  systems biology

Techniques:
•  Graph theory (especially chordal graphs)
•  Probability theory and statistics
•  Hidden Markov models
•  Combinatorial optimization
•  Heuristics
•  Supercomputing

Overview

•  Theory: combining probability theory, graph theory, and optimization
•  Simulations: evaluating methods under stochastic models of sequence evolution
•  Biological data analysis: refining methods and enabling discovery
•  Open source software development
•  High performance computing
•  Applications outside biology (e.g., historical linguistics, big data problems in general)

Past Work (highlights)

•  Gene tree estimation (theoretical results under stochastic models of sequence evolution)
•  Multiple sequence alignment on large datasets, and co-estimation of alignments and trees
•  Phylogenetic networks and species trees from multi-locus datasets
•  Genome rearrangement phylogeny
•  Supertree methods
•  Metagenomics
•  Historical linguistics

Future work

Theory, methods, and empirical studies for
•  Genome-scale phylogeny estimation addressing multiple sources for gene tree heterogeneity
•  Microbiome analysis
•  Ultra-large multiple sequence alignment and tree estimation

And applications of these techniques outside biology

Current NSF grants

•  Graph-theoretic methods to improve phylogenomic analyses (joint with Chandra Chekuri and Satish Rao), NSF CCF-1535977
•  Multiple Sequence Alignment: NSF ABI-1458652
•  Metagenomics: joint with Mihai Pop and Bill Gropp. NSF grant III:AF:1513629


Major Areas

•  Phylogenomics: Species tree and network estimation using whole genomes (and gene tree estimation in the context of whole genomes)
•  Multiple Sequence Alignment: Inferring relationships between letters in molecular sequences, especially on very large datasets (up to 1,000,000 sequences)
•  Metagenomics: Analysis of molecular sequences obtained from environmental samples (joint with Mihai Pop and Bill Gropp)
•  Scaling computationally intensive methods to large datasets: Combining discrete math and statistical methods to enable highly accurate analysis of ultra-large datasets (joint with Chandra Chekuri and Satish Rao)

Phylogenomics = Species trees from whole genomes

“Nothing in biology makes sense except in the light of evolution” - Dobzhansky

Phylogenomics

[Figure: sequence alignments for gene 1, gene 2, ..., gene 999, gene 1000 across four species: Orangutan, Gorilla, Chimpanzee, Human]

“gene” here refers to a portion of the genome (not a functional gene)

I’ll use the term “gene” to refer to “c-genes”: recombination-free orthologous stretches of the genome

Gene tree discordance

[Figure: gene trees for gene 1 and gene 1000 on Orangutan, Gorilla, Chimpanzee, and Human]

Incomplete Lineage Sorting (ILS) is a dominant cause of gene tree heterogeneity

Incomplete Lineage Sorting (ILS)

•  Confounds phylogenetic analysis for many groups: Hominids, Birds, Yeast, Animals, Toads, Fish, Fungi, etc.
•  There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS, focused around statistical consistency guarantees (theory) and performance on data.

Main competing approaches

[Figure: genes 1, 2, ..., k feed two pipelines for estimating the species tree: Concatenation, or Analyze separately followed by a Summary Method]

Statistical Consistency

[Figure: plot of error against amount of data]
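One common way to state the property named on this slide is sketched below. This is a standard textbook formulation, not necessarily the wording used in the talk. The symbols are notation introduced here for illustration: $\Phi$ is the estimation method, $D_n$ is a dataset of size $n$ generated under the model, and $T^{*}$ is the true tree.

```latex
% Phi is statistically consistent under a model if the probability of
% returning the true tree T* tends to 1 as the amount of data n grows.
\[
  \lim_{n \to \infty} \Pr\bigl[\Phi(D_n) = T^{*}\bigr] = 1
\]
```

Equivalently, the error probability of the method tends to 0 as the amount of data increases.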


Maximum Quartet Support Species Tree [Mirarab, et al., ECCB, 2014]

• Optimization Problem (NP-Hard): Find the species tree with the maximum number of induced quartet trees shared with the collection of input gene trees

$$\mathrm{Score}(T) = \sum_{t \in \mathcal{T}} |Q(T) \cap Q(t)|$$

where $Q(T)$ is the set of quartet trees induced by $T$, $t$ is a gene tree, and $\mathcal{T}$ is the set of all input gene trees.

• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
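The quartet score can be illustrated with a small brute-force sketch. This is not how ASTRAL computes it; it enumerates every four-taxon subset, which is only practical for tiny trees. The nested-tuple tree encoding and the function names (`leaves`, `clades`, `quartets`, `score`) are assumptions made for this example.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def clades(tree):
    """Leaf sets below every internal node (one side of each bipartition)."""
    if isinstance(tree, str):
        return set()
    result = {leaves(tree)}
    for child in tree:
        result |= clades(child)
    return result

def quartets(tree):
    """Q(tree): the resolved quartet trees ab|cd induced by the tree."""
    sides = clades(tree)
    found = set()
    for a, b, c, d in combinations(sorted(leaves(tree)), 4):
        for left, right in (((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))):
            # ab|cd is induced if some edge separates {a, b} from {c, d}.
            if any((set(left) <= s and not set(right) & s) or
                   (set(right) <= s and not set(left) & s) for s in sides):
                found.add(frozenset([frozenset(left), frozenset(right)]))
    return found

def score(species_tree, gene_trees):
    """Sum over gene trees t of |Q(T) intersect Q(t)|."""
    q_species = quartets(species_tree)
    return sum(len(q_species & quartets(t)) for t in gene_trees)

# Hypothetical toy input on the four taxa used earlier in the talk.
species = (("Human", "Chimpanzee"), ("Gorilla", "Orangutan"))
genes = [
    (("Human", "Chimpanzee"), ("Gorilla", "Orangutan")),
    ((("Human", "Chimpanzee"), "Gorilla"), "Orangutan"),
    (("Human", "Gorilla"), ("Chimpanzee", "Orangutan")),
]
print(score(species, genes))  # 2
```

The first two gene trees are the same unrooted tree written with different rootings, so each shares the single quartet with the candidate species tree; the third disagrees and contributes nothing.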

Constrained MQST (Maximum Quartet Support Tree)

•  Input: Set $\mathcal{T}$ = {t1, t2, …, tk} of unrooted gene trees, with each tree on set S with n species, and set X of allowed bipartitions
•  Output: Unrooted tree T on leaf set S, maximizing the total quartet tree similarity to $\mathcal{T}$, subject to T drawing its bipartitions from X.

Theorems (Mirarab et al., 2014):
•  If X contains the bipartitions from the input gene trees (and perhaps others), then an exact solution to this problem is statistically consistent under the MSC.
•  The constrained MQST problem can be solved in $O(|X|^2 nk)$ time. (We use dynamic programming, and build the unrooted tree from the bottom-up, based on “allowed clades”, i.e., halves of the allowed bipartitions.)
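The smallest constraint set X that satisfies the first theorem is the union of the bipartitions of the input gene trees. The sketch below builds that set only; it does not implement the dynamic program. It assumes a nested-tuple tree encoding and keeps only non-trivial bipartitions; the function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def bipartitions(tree):
    """Non-trivial bipartitions A|B of the leaf set, one per internal edge."""
    taxa = leaf_set(tree)
    found = set()
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue
        stack.extend(node)
        side = leaf_set(node)
        other = taxa - side
        # Skip the root (empty complement) and edges leading to a single leaf.
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset([side, other]))
    return found

def allowed_bipartitions(gene_trees):
    """X: the union of the bipartitions found in the input gene trees."""
    allowed = set()
    for t in gene_trees:
        allowed |= bipartitions(t)
    return allowed

# Hypothetical toy input: two conflicting gene trees on four taxa.
gene_trees = [
    (("Human", "Chimpanzee"), ("Gorilla", "Orangutan")),
    (("Human", "Gorilla"), ("Chimpanzee", "Orangutan")),
]
X = allowed_bipartitions(gene_trees)
print(len(X))  # 2
```

Each half of a bipartition in X is an "allowed clade" in the sense of the slide, which is what the bottom-up dynamic program would combine.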

ASTRAL is fairly robust to HGT+ILS

200 Estimated Gene Trees

Data: Fixed, moderate ILS rate, 50 replicates per HGT rate (1)-(6), 1 model species tree per replicate on 51 taxa, 1000 true gene trees, simulated 1000 bp gene sequences using INDELible [8], 1000 gene trees estimated from GTR simulated sequences using FastTree-2 [7]

[7] Price, Dehal, Arkin 2015
[8] Fletcher, Yang 2009

Davidson et al., RECOMB-CG, BMC Genomics 2015

Tree accuracy when varying the number of species

[Figure: species tree topological error (FN), axis from 4% to 16%, versus number of species (10, 50, 100, 200, 500, 1000), comparing ASTRAL-II and MP-EST]

1000 genes, “medium” levels of recent ILS
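The error measure on the plot can be sketched as follows, reading "FN" as the false-negative (missing branch) rate: the fraction of bipartitions in the true species tree that are absent from the estimated tree. That reading, the set-of-bipartitions input format, and the function names are assumptions made for this illustration.

```python
def bipartition(side_a, side_b):
    """Represent the split A|B as an unordered pair of leaf sets."""
    return frozenset([frozenset(side_a), frozenset(side_b)])

def fn_rate(true_splits, estimated_splits):
    """Fraction of true-tree bipartitions absent from the estimated tree."""
    if not true_splits:
        raise ValueError("the true tree has no internal bipartitions")
    missing = true_splits - estimated_splits
    return len(missing) / len(true_splits)

# Hypothetical five-taxon example: the estimate recovers one of two true splits.
true_splits = {bipartition("ab", "cde"), bipartition("abc", "de")}
estimated_splits = {bipartition("ab", "cde"), bipartition("abd", "ce")}
print(fn_rate(true_splits, estimated_splits))  # 0.5
```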

Contributions (sample)

Methods for estimating species trees from genome-scale data:

•  ASTRAL (Mirarab et al., Bioinformatics 2014, 2015) and ASTRID (Vachaspati and Warnow, BMC Genomics 2015): polynomial time methods that are statistically consistent under the MSC. Both can analyze very large datasets (1000 species and 1000 genes, or more) with high accuracy.
•  Statistical binning (Mirarab et al., Science 2014, Bayzid et al. PLOS One 2015) can reduce gene tree estimation error, and lead to improved species tree estimations (topology, branch lengths, and incidence of false positives)
•  BBCA (Zimmermann et al., BMC Genomics 2014) enables Bayesian co-estimation methods to scale to large numbers of genes
•  DCM-boosting (Bayzid et al., BMC Genomics 2014) enables computationally intensive methods to scale to large numbers of species

Mathematical theory:

•  Roch and Warnow (Systematic Biology 2015), regarding statistical consistency under the MSC given finite length sequences.
•  Uricchio et al., BMC Bioinformatics 2016, number of loci needed to recover all the splits with high probability

Biological data analyses:

•  Avian phylogenomics project (Jarvis, Mirarab et al., Science 2014)
•  Thousand Plant Transcriptome Project (Wickett, Mirarab et al. PNAS 2014)
•  Tarver et al. Genome Biology and Evolution 2016, Mammalian phylogeny


Metagenomic taxonomic identification and phylogenetic profiling

Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Basic Questions

1. What is this fragment? (Classify each fragment as well as possible.)
2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)
3. What are the organisms in this metagenomic sample doing together?


This talk

•  SEPP (PSB 2012): SATé-enabled Phylogenetic Placement, and Ensembles of HMMs (eHMMs)

•  Applications of the eHMM technique to metagenomic abundance classification (TIPP, Bioinformatics 2014)

Phylogenetic Placement

Input: Backbone alignment and tree on full-length sequences, and a set of homologous query sequences (e.g., reads in a metagenomic sample for the same gene)

Output: Placement of query sequences on backbone tree

Phylogenetic placement can be used inside a pipeline, after determining the genes for each of the reads in the metagenomic sample.

Marker-based Taxon Identification

[Figure: fragmentary sequences from some gene, shown with full-length sequences for the same gene, and an alignment and a tree]

Align Sequence

[Figure: backbone tree on S1, S2, S3, S4]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC

Align Sequence

[Figure: backbone tree on S1, S2, S3, S4]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
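A quick sanity check on an extended alignment like the one above: the new row must have the same width as the backbone rows, and removing its gaps must give back the original query. The sketch below checks only those two properties; it does not perform the alignment itself. The function names are hypothetical.

```python
# Backbone alignment and query copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
query = "TAAAAC"
extended_row = "-------T-A--AAAC--------"

def degap(row):
    """Remove gap characters from one alignment row."""
    return row.replace("-", "")

def is_valid_extension(backbone, query, row):
    """True if the row fits the backbone width and spells out the query."""
    widths = {len(r) for r in backbone.values()}
    return len(widths) == 1 and len(row) in widths and degap(row) == query

print(is_valid_extension(backbone, query, extended_row))  # True
```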

Place Sequence

[Figure: backbone tree on S1, S2, S3, S4 with query Q1 added]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Phylogenetic Placement

•  Align each query sequence to backbone alignment
    –  HMMALIGN (Eddy, Bioinformatics 1998)
    –  PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
•  Place each query sequence into backbone tree
    –  Pplacer (Matsen et al., BMC Bioinformatics, 2011)
    –  EPA (Berger and Stamatakis, Systematic Biology 2011)

Note: pplacer and EPA use maximum likelihood

HMMER vs. PaPaRa Alignments

[Figure: comparison plot, x-axis: increasing rate of evolution]

One Hidden Markov Model for the entire alignment?

Or 2 HMMs?

[Figure: HMM 1, HMM 2]

Or 4 HMMs?

[Figure: HMM 1, HMM 2, HMM 3, HMM 4]

SEPP Parameter Exploration

•  Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP
•  10% rule (subset sizes 10% of backbone) had best overall performance

SEPP (10%-rule) on simulated data

[Figure: results plot, x-axis: increasing rate of evolution]


TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014): a marker-based method that only characterizes those reads that map to MetaPhyler’s marker genes

TIPP pipeline
1.  Uses BLAST to assign reads to marker genes
2.  Computes UPP/PASTA reference alignments
3.  Uses reference taxonomies, refined to binary trees using reference alignment
4.  Modifies SEPP by considering statistical uncertainty in the extended alignment and placement within the tree

Abundance Profiling

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: The distribution of the sample at the species-level is:

50% species A
20% species B
15% species C
14% species D
1% species E
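Given one taxon label per classified read, a profile like the example above is just a normalized count. The sketch below shows only that final tallying step, not how TIPP assigns the labels. The function name and the toy read list are made up for this illustration.

```python
from collections import Counter

def abundance_profile(read_labels):
    """Percentage of classified reads assigned to each taxon."""
    counts = Counter(read_labels)
    total = sum(counts.values())
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}

# Hypothetical sample of 100 classified reads matching the slide's example.
reads = (["species A"] * 50 + ["species B"] * 20 + ["species C"] * 15
         + ["species D"] * 14 + ["species E"] * 1)
profile = abundance_profile(reads)
for taxon, percent in sorted(profile.items()):
    print(f"{percent:.0f}% {taxon}")
```

The same function works at any rank (genus, family, and so on) if the labels are mapped to that rank before counting.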

High indel datasets containing known genomes

Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

“Novel” genome datasets

Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

TIPP vs. other abundance profilers

•  TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.
•  All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).
•  Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

SEPP and eHMMs

An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to
•  detect homology between full length sequences and fragmentary sequences
•  add fragmentary sequences into an existing alignment

especially when there are many indels and/or substitutions.

Our Publications using eHMMs

•  S. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258.
•  N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. "TIPP: Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555.
•  N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. "Ultra-large alignments using phylogeny aware profiles." Proceedings RECOMB 2015 and Genome Biology (2015) 16:124.
•  N. Nguyen, M. Nute, S. Mirarab, and T. Warnow. "HIPPI: Highly accurate protein family classification with ensembles of HMMs." BMC Genomics (2016) 17 (Suppl 10):765.

All codes are available in open source form at https://github.com/smirarab/sepp


Computational Phylogenomics

•  NP-hard problems
•  Large datasets
•  Complex statistical estimation problems

•  Metagenomics
•  Protein structure and function prediction
•  Medical forensics
•  Systems biology
•  Population genetics

Future Work - Phylogenomics

•  Better theory, addressing impact of gene tree estimation error and missing data
•  Fast genome-scale phylogenetic tree estimation (high performance computing, statistically-based estimation taking multiple sources of discord into account)
•  Phylogenetic network construction on large datasets (statistical methods within divide-and-conquer framework)
•  Better statistical models of sequence evolution, addressing heterotachy
•  Co-estimation of gene trees and species trees/networks

Future work - Metagenomics

•  Improved marker-based analyses, and addressing gene tree heterogeneity
•  Rigorous methods for detecting novel genes and species
•  High throughput analysis with high sensitivity
•  Metagenome assembly
•  HPC implementations
•  Collaborations with biologists and biomedical researchers

Future work - Multiple Sequence Alignment

•  Improved large-scale MSA (e.g., PASTA and UPP)
•  Extending statistical co-estimation of trees and MSA to large datasets (e.g., Nute and Warnow 2016)
•  Efficient and useful sampling of MSAs
•  MSA estimation in the presence of duplications and rearrangements (e.g., whole genome alignment)
•  Better HMM+phylogeny models that are useful for estimating alignments and trees

Future work - Theory

•  Basic algorithmic challenges:
    –  supertrees
    –  computing trees from distance matrices
    –  using chordal graphs for divide-and-conquer
    –  consensus trees
•  Applied probability:
    –  Trade-off between data quality and quantity (e.g., statistical binning)
    –  Identifiability of tree models with noisy data
    –  Understanding ensembles of HMMs