
EE381V: Genomic Signal Processing and Data Science

Basic Information

•  Instructor: Haris Vikalo
   –  Contact information: [email protected], (512) 232-7922
   –  Office hours: POB 3.110, Tue/Thu 2:00pm-3:00pm
•  Teaching Assistant: Somsubhra Barik
   –  Contact information: [email protected]
   –  Office hours: TBD
•  Electronic course site: Canvas
   –  http://canvas.utexas.edu/
   –  distribution of homework assignments, solutions, and class slides/notes
   –  you should be able to access it if you have a UT EID and are registered
•  Course website: http://users.ece.utexas.edu/~hvikalo/ee381v.html
   –  class notes (mirrored from Canvas) and suggested reading
   –  final project guidelines

Basic Information Cont’d

•  Textbook: none
   –  class notes and reading assignments will be distributed via the course website and Canvas
•  Suggested reading:
   –  R. Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
   –  N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
   –  M. Schena, Microarray Analysis, Wiley, 2003.
•  Grading (tentative):
   –  homeworks (30%), midterm (30%), final project (40%)
•  Homeworks and exams:
   –  4-5 assignments (theory component + programming component)
   –  midterm (take-home)

Prerequisites and Target Audience

•  Final project: either expository (survey) or innovative (research)
   –  up to two students can collaborate on a project
   –  required written documents: (1) proposal and (2) final report
   –  a list of possible projects will be provided shortly
•  Prerequisites:
   –  an undergraduate course in probability
   –  programming experience (Matlab or Python)
   –  no biology background required
•  Target audience:
   –  students specializing in signal processing / machine learning / algorithms / information theory who want to learn about applications in biology/genomics and get exposure to real data
   –  students specializing in computational biology who want to strengthen their knowledge of basic signal processing / machine learning / information theory

Course Description

•  Course description: an exploration of signal processing and data science problems encountered in the analysis of high-throughput genomics data
   –  applications to diagnostics (e.g., viral strain recognition), studies of complex diseases (e.g., cancer), studies of the immune system, phenotype prediction, non-invasive prenatal testing
•  Topics include:
   –  DNA sequencing and sequence alignment
   –  base calling in high-throughput sequencing systems
   –  reference-guided and reference-free (de novo) genome assembly
   –  genotyping and single individual haplotyping (haplotype assembly)
   –  RNA sequencing and ChIP-Seq
   –  DNA microarrays and quantitative polymerase chain reaction systems
   –  modeling and inference for genetic regulatory networks
   –  population haplotyping; phylogeny
   –  future sequencing technologies

Goals for the Term

•  Signal processing and “big data” challenges in genomics
   –  formulating problems, presenting solutions
•  Duality: computation and biology
   –  provide a biology/technology background to motivate a computational task
   –  review relevant computational techniques, derive solutions, analyze them
•  Foundations and frontiers
   –  well-defined conventional problems and general methodologies
   –  contemporary challenges, future research directions, etc.
•  Major themes:
   –  enabling biotechnologies: modeling, algorithms, analysis of performance
   –  cellular systems: computational methods for inferring their structure and understanding how they function

Theme #1: Enabling Technologies

[Figure: one example instrument per technology]
•  DNA sequencing: ABI Prism® 310 Genetic Analyzer
•  DNA microarrays: Affymetrix GeneChip®
•  DNA amplification (QPCR systems): Roche LightCycler®

Theme #1: Enabling Technologies Cont’d

•  Detection and quantification of molecules: high precision (quantitative polymerase chain reaction, QPCR) or high throughput (DNA microarrays)
•  QPCR: high precision (quantifies a small number of DNA molecules)
   –  K. Mullis and F. Faloona, “Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction,” Methods Enzymol. (1987).
   –  in vitro replication (amplification) of DNA molecules
   –  applications to diagnostics (viral and bacterial detection), cancer marker identification, genetic fingerprinting (as in forensics), etc.
•  DNA microarrays: high throughput (screens tens of thousands of molecules)
   –  M. Schena, D. Shalon, R. W. Davis, P. O. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science (1995).
   –  massively parallel biosensor arrays
   –  used for studies of genetic diseases, drug discovery, genotyping (determining the specific genome of an individual), genetic pathway discovery, etc.
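To make the QPCR quantification idea concrete, here is a minimal sketch under an idealized model that is an assumption and does not come from the slides: the copy count grows as N_c = N_0 (1 + E)^c, with a constant per-cycle efficiency E, and the initial amount is inferred from the cycle at which the count crosses a detection threshold. The function names and the threshold value are hypothetical; real amplification curves saturate and the efficiency varies.

```python
import math


def copies_after(n0, cycles, efficiency=1.0):
    """Idealized copy count after `cycles` PCR cycles (assumed exponential model)."""
    return n0 * (1.0 + efficiency) ** cycles


def threshold_cycle(n0, threshold, efficiency=1.0):
    """Fractional cycle at which the copy count reaches the detection threshold."""
    return math.log(threshold / n0) / math.log(1.0 + efficiency)


def initial_copies(ct, threshold, efficiency=1.0):
    """Invert the model: estimate the starting copy number from a threshold cycle."""
    return threshold / (1.0 + efficiency) ** ct


if __name__ == "__main__":
    ct = threshold_cycle(n0=100, threshold=1e9)
    print(f"threshold cycle: {ct:.2f}")
    print(f"recovered initial copies: {initial_copies(ct, 1e9):.1f}")
```

Under this model a sample with fewer starting molecules simply crosses the threshold later, which is why the threshold cycle can serve as a measurement of the initial quantity.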

Theme #1: Enabling Technologies Cont’d

•  QPCR, DNA microarrays: detect/quantify DNA molecules of known structure
•  DNA sequencing systems: identify unknown structure
•  High-throughput sequencing is revolutionizing research and medicine
   –  routine sequencing tasks generate massive amounts of data
   –  computationally challenging “big data” problems

[Figure: sequencing technology timeline]
•  Sanger sequencing: 1977–1990s
•  2nd generation sequencing: since 2007
•  3rd generation sequencing: since 2010

Theme #1: Enabling Technologies Cont’d

•  Dramatic improvement in affordability: [figure]

Theme #2: Cellular Systems

[Figure: DNA double helix; 10 bases = 3.4 nm along the axis, 2 nm in diameter]

•  Information flow in a cell (traditional view: Central Dogma):
   DNA → (transcription) → RNA → (translation) → Protein
•  Information (signal) is carried by molecules.
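The DNA → RNA → protein flow can be mimicked on strings. The sketch below is illustrative only: it assumes the input is the coding strand, uses the standard genetic code, and starts translation at the first AUG. The function names are hypothetical, and biological details such as splicing are ignored.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Coding-strand DNA -> mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """mRNA -> protein, reading codons from the first AUG up to the first stop codon.

    Assumes the sequence contains only A, C, G, U; other symbols raise KeyError.
    """
    rna = rna.upper()
    start = rna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("CCATGAAATGGTGA")
    print(rna, "->", translate(rna))
```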

Theme #2: Cellular Systems

•  The previously mentioned biotechnologies interrupt the information flow and so provide insight into cellular structure and functions
   [Figure labels: Sequences, Mechanisms]
•  Moreover, studying the temporal changes in the information flow gives insight into regulation mechanisms, biological network structure, etc.

Theme #2: Cellular Systems

[Figure: computational tasks grouped by data type]
•  SEQUENCES (DNA, e.g., ACATGCTATACGTGATAAAGAGGATATATATCATAT ATATGATTT): gene finding; sequencing and genome assembly; regulatory motif discovery; comparative genomics; evolutionary theory; database lookup
•  INTERACTIONS: gene expression analysis; cluster discovery; regulatory network inference; emerging network properties; protein network analysis

Signal Processing and Data Science Tasks

•  Data science tasks on sequencing data can be categorized as follows: [figure]
•  To complete those tasks, we rely on a variety of tools:
   –  statistical signal processing and machine learning
   –  combinatorial algorithms
   –  information theory

Example Application #1: Sequence Assembly

•  Sequencing: determining the order of nucleotides in a target DNA string
•  Shotgun sequencing: assemble the target from overlapping short reads
   –  de novo: no side information, only the reads are available
   –  reference-guided: rely on a pre-existing reference sequence
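For the reference-guided case, the basic operation is locating a read within the reference. The following is a minimal sketch, assuming a semi-global dynamic program with unit costs for mismatches and gaps; the scoring and the function name are assumptions made for illustration, and practical read mappers rely on indexing rather than a full dynamic program.

```python
def best_match(read, reference):
    """Return (edit_distance, end) for the best placement of `read` in `reference`.

    Semi-global alignment: reference positions before and after the read are free,
    so only differences inside the aligned window are counted.
    `end` is the (exclusive) reference index where the best alignment ends.
    """
    m, n = len(read), len(reference)
    prev = [0] * (n + 1)              # empty read prefix aligns anywhere at zero cost
    for i in range(1, m + 1):
        curr = [i] + [0] * n          # i read bases with no reference bases: i gaps
        for j in range(1, n + 1):
            substitute = prev[j - 1] + (read[i - 1] != reference[j - 1])
            curr[j] = min(substitute, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    end = min(range(n + 1), key=prev.__getitem__)
    return prev[end], end


if __name__ == "__main__":
    print(best_match("GATTA", "ACGATTACA"))   # exact occurrence
    print(best_match("GATCA", "ACGATTACA"))   # one substitution away
```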

Example Application #1: Sequence Assembly

•  Reference-guided assembly relies on mapping the reads onto a reference; sequence alignment/mapping is a fundamental first step
   –  dynamic programming solutions (Viterbi, forward-backward algorithms)
   –  estimation in Hidden Markov Models (EM algorithm)
   –  data compression concepts (Burrows-Wheeler transform)
•  Reference-free (de novo) assembly
   –  greedy merging + extension of the overlapping fragments
   –  finding an Eulerian path in the de Bruijn graph
   –  conditions for error-free reconstruction
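The Eulerian-path formulation of de novo assembly can be sketched as follows. This is an illustration under strong assumptions that are not stated on the slide: reads are error-free, every k-mer of the target appears in some read, each distinct k-mer is used exactly once (repeat multiplicities are ignored), and an Eulerian path exists. The function names and the choice of k are hypothetical.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build the graph whose edges are the distinct k-mers and whose nodes are (k-1)-mers."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a non-empty graph that has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    # Start at a node with one more outgoing than incoming edge, if there is one.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1), next(iter(graph)))
    edges = {u: list(vs) for u, vs in graph.items()}   # working copy of unused edges
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path: first node, then one new base per step."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], k=4))
```

When the target contains repeated (k-1)-mers the graph can admit several Eulerian paths, which is one reason the conditions for error-free reconstruction matter.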

Example Application #2: Haplotype Assembly

•  In many applications, there are multiple target sequences of interest that cannot be separated prior to sequencing
   –  haplotype assembly, viral quasispecies reconstruction, bacterial communities, immune cell repertoire
•  The simplest case: haplotype assembly for diploids
   –  reconstruct the variable parts of chromosome pairs

Example Application #2: Haplotype Assembly

•  Shotgun sequencing for haplotype assembly: [figure]
•  Data model: short reads obtained by sampling (with replacement) from a complementary pair of binary strings
   –  the task is to reconstruct the pair of strings
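The data model can be simulated in a few lines. This sketch makes simplifying assumptions beyond the slide: each read covers a contiguous window of sites, both strings are equally likely sources, and bits are flipped independently with a given error rate. The function names and parameters are hypothetical.

```python
import random


def complement(h):
    """Bitwise complement of a binary haplotype."""
    return [1 - b for b in h]


def sample_reads(h, num_reads, read_len, error_rate=0.0, rng=None):
    """Draw reads (start, bits) from h or its complement, sampling with replacement.

    Each bit is flipped independently with probability `error_rate` (assumed noise model).
    """
    rng = rng or random.Random()
    pair = (h, complement(h))
    reads = []
    for _ in range(num_reads):
        source = rng.choice(pair)                        # which chromosome copy
        start = rng.randrange(len(h) - read_len + 1)     # window position
        window = source[start:start + read_len]
        bits = [b ^ (rng.random() < error_rate) for b in window]
        reads.append((start, bits))
    return reads


if __name__ == "__main__":
    for read in sample_reads([0, 1, 1, 0, 1, 0, 0, 1], num_reads=5, read_len=3):
        print(read)
```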

Example Application #2: Haplotype Assembly

•  Methods for solving the haplotype assembly problem
   –  (correlation) clustering
   –  communication-theoretic techniques: decoding noisy codewords transmitted over a binary erasure channel
   –  low-rank sparse matrix completion/factorization
•  Analysis of fundamental limits of performance (accuracy, data redundancy)
   –  information-theoretic tools
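As a toy illustration of the clustering view, the sketch below searches exhaustively for the haplotype that minimizes the minimum error correction (MEC) cost, where each read is charged against the closer of the haplotype and its complement. This brute-force search is a stand-in chosen for clarity, not one of the methods listed above; it is exponential in the number of sites. The read format (start, bits) and the function names are assumptions.

```python
from itertools import product


def mismatches(read, h):
    """Number of positions where a read (start, bits) disagrees with haplotype h."""
    start, bits = read
    return sum(b != h[start + i] for i, b in enumerate(bits))


def mec_cost(reads, h):
    """MEC cost: assign each read to the closer of h and its complement, sum the mismatches."""
    hc = [1 - b for b in h]
    return sum(min(mismatches(r, h), mismatches(r, hc)) for r in reads)


def assemble_haplotype(reads, n):
    """Exhaustive search over n-site haplotypes with the first site fixed to 0.

    Fixing the first site loses nothing because h and its complement give the same cost.
    """
    tail = min(product([0, 1], repeat=n - 1),
               key=lambda t: mec_cost(reads, (0,) + t))
    h = [0] + list(tail)
    return h, [1 - b for b in h]


if __name__ == "__main__":
    reads = [(0, [0, 1, 1]), (0, [1, 0, 0]), (1, [0, 0, 1]), (1, [1, 1, 0]),
             (2, [1, 0, 1]), (2, [0, 1, 0]), (2, [1, 0, 0])]   # last read has one error
    print(assemble_haplotype(reads, 5))
```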

Recent Special Issues in EE/CS Community

•  Recent IEEE special issues (can be accessed via IEEE Xplore):
   –  IEEE Signal Processing Magazine, Special Issue on Signal Processing in Genomics and Proteomics, vol. 29, no. 1, January 2012.
   –  IEEE Transactions on Information Theory, Special Issue on Molecular Biology and Neuroscience, vol. 56, no. 2, February 2010.
   –  IEEE Journal of Selected Topics in Signal Processing, Special Issue on Genomic and Proteomic Signal Processing, vol. 2, no. 3, June 2008.
   –  IEEE Signal Processing Magazine, Special Issue on Signal Processing in Genomics, vol. 24, no. 1, January 2007.
   –  IEEE Trans. on Signal Processing, Special Issue on Genomic Signal Processing, vol. 54, no. 6, June 2006.