30
RNA-Seq in Galaxy: Tuxedo protocol Igor Makunin, UQ RCC, QCIF

RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

RNA-Seq inGalaxy:Tuxedoprotocol

IgorMakunin,UQRCC,QCIF

Page 2: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Acknowledgments

GenomicsVirtualLab:gvl.org.auGalaxyfortutorials:galaxy-tut.genome.edu.auGalaxyAustralia:galaxy-aust.genome.edu.au

Contributorsandparticipants:

Page 3: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Planfortoday

Galaxy

DatatypesusedinRNA-Seq analysis

RNA-Seq practical

Galaxyworkflow

Page 4: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

High-throughputsequencing

Bigscalesequencing• 100,000,000ssequences,orreads,perexperiment• sequencingofa(random)library• lowcostpernucleotide

Populartechnologies:• illumina• ion/proton• PacBio

Emergingtechnologies• OxfordNanopore MinION

AnalysisofNGSdataBigdatasetsComputationallyintensiveDedicatedtoolsanddatatypesExtensiveuseofpublicdata

Storage

Computationalresources

Publicdata

Knowledgeandskills

Tools

Page 5: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Galaxy:howdoesitlooklike

WorkingwindowUpload

Historymenu

Page 6: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

GalaxyhistorysystemHistorymenuRefresh

Source:http://galaxyproject.github.io/training-material/topics/introduction/tutorials/galaxy-intro-history/tutorial.html

Page 7: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

PublicGalaxyservers

Advantageoftheregistration:• accesstohistoriesoverlongtime• multiplehistories• abilitytouseGalaxyfromdifferentdevices• biggerquotas(onsomeservers)• ftp

• IndependentregistrationoneveryGalaxyserver

• Differenttools,differentuserpolicy

• DatacanbemovedbetweenGalaxyservers

Galaxyservers:usegalaxy.orgusegalaxy.eu

galaxy-tut.genome.edu.au

galaxy-aust.genome.edu.au

Page 8: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

GalaxyAustralia

Lessjobsonweekends

Jobsperday

galaxy-aust.genome.edu.au

Workernodes:16CPUs,64GBRAM

49TbVolumestorage(userdata)

Designedforagenomescaleresearch>1,600registeredusers

Upto16CPUs60GBRAMperjobUpto12concurrentjobsperuserUpto1Tbperuser

Page 9: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

TuxedoprotocolGVLBasicRNA-Seq GalaxytutorialTrapnell etal.(2012)NatureProtocols

VisualisealignmentwithIGV

FASTQFASTA

GFF BAM

Genomebrowser

Page 10: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

FASTQformat

@SRR3145.19ILLUMINA-C32_FC:3:1:80:12/1TAGCAGCACATCATGGTTTACATCGTATGC+IIHIDIIIIIIIIIIIIIHIHIIIIIDGIB

Namealwaysstartswith@SequenceAlwaysstartswith+;mayhavenameEncodedPhred qualityscore

single-endreads paired-endreads

Terminology: read isasequencewithqualityscorevaluesproducedbyasequencingmachine

Commonoutputformat:FASTQ compressedwithgzip,e.g.SRR3145_1.fq.gz

MultiplereadsinasingleFASTQfileEachreadisdescribedbyfourlines

Page 11: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

FASTQPhred qualityscore

Quality+Offset

39+33=72

ASCII(72):H

Range:~0to~40

Phred 10:accuracy90%Phred 20:accuracy99%Phred 30:accuracy99.9%Phred 40:accuracy99.99%

Valuesareencodedbycharacters

Advantage:asinglecharacterisusedinsteadofatwo-digitnumber

APhredqualityscoreisameasureofthequalityoftheidentificationforeverynucleotide.

@S391ILLUMINA_FC:3:80:12/1TAGCAGCACATCATGGTTTAC+IIHIDIIIIIIIIIIIIIHIH

Page 12: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

ASCIItable

Page 13: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Phred qualityscoreencodingOffset33- SangerOffset64- oldillumina

Source:https://en.wikipedia.org/wiki/FASTQ_format

Qual.=40Offset=3340+33=73ASCII(73):I

Page 14: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

FASTQqualityscoreinGalaxyManyoldillumina datasetshaveaproprietarydataencoding(offset64)CurrentlymostNGSdatasetsusetheSangerencoding(offset33)

GalaxyBydefaultGalaxyassign‘fastq’datatypetouploadedFASTQfiles.Inthiscasetheoffsetisnotspecified,andmanytoolsdonotrecognizethedata

fastqillumina – oldillumina qualityscoreencoding(offset64,illumina 1.3+)fastqsanger – newillumina 1.8+/SangerqualityscoreencodingSometoolsinGalaxynowworkonlywithfastqsanger datatype

Solution:- specifyfastqsanger orfastqillumina datatype duringupload- changetheformatviaAttributes>Datatype- useNGS:QCandmanipulation>FASTQGroomertool

Page 15: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

TuxedoprotocolGVLBasicRNA-Seq GalaxytutorialTrapnell etal.(2012)NatureProtocols

VisualisealignmentwithIGV

FASTQFASTA

GFF BAM

Genomebrowser

Page 16: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

ReferencegenomesGenomeReferenceConsortium:…aconsensusrepresentationofthegenome.

FASTAformat

ThehumanreferencesequenceGRCh37(hg19)containsthemitochondrialgenome,22autosomes,chrX,chrY,9haplotypechromosomes,39unplacedcontigs,and20unlocalized contigs.

Genomesarebig.GRCh38.p10totalnon-Nbases:3,080,585,178

Genomesmayhavemanyassemblyversions(releases,build):mm9,mm10

Usethesameassemblyversionforthereferencesequenceandgeneannotations.

Orderofsequences/contigs mightbeimportantforsometools.

“chr1”and“1”arenotidenticalforsometools.http://hgdownload.cse.ucsc.edu/gbdb/hg19/html/description.html

Page 17: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

GeneannotationsCoordinate-based:linkedtoaparticulargenomeassembly,e.g.,hg19

GFF(GeneralFeatureFormat)formatconsistsofonelineperfeature,eachcontaining9columnsofdata,plusoptionaltrackdefinitionlines.Popularversions:GTF(=GFF2),GFF3tab-separatedfields

##gff-version3ctg123. mRNA 13009000.+.ID=mrna0001;Name=sonichedgehogctg123 .exon 1300 1500.+.ID=exon00001;Parent=mrna0001ctg123. exon 1050 1500.+.ID=exon00002;Parent=mrna0001ctg123.exon 3000 3902.+.ID=exon00003;Parent=mrna0001ctg123. exon 5000 5500.+.ID=exon00004;Parent=mrna0001ctg123.exon 7000 9000.+.ID=exon00005;Parent=mrna0001

http://asia.ensembl.org/info/website/upload/gff3.html

seqid

source

type start end

score

strand

phase'0','1'or'2'

attributes

The first line must be a comment that identifies the version

bothare1-based

Page 18: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

IntervalsCoordinate-based:linkedtoaparticulargenomeassembly,e.g.,hg19

BEDformat,upto12columnsofdata(UCSCTableBrowser),plusoptionaltrackheaderlines.tab-separatedfields

GFF3##gff-version3ctg123. mRNA 13009000.+.ID=mrna0001;Name=sonichedgehog

BEDctg12312999000sonichedgehog .+

chromchromStart

chromEndscore

strandname

1-based

0-based

Page 19: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Aligners

Alignersmapreadstoareferencesequence.

Alignersuseproprietaryindexfilesformapping.bwa index hg19.fa

Galaxy-qld providesindicesforseveralgenomeassemblies(hg19,hg38,mm9,mm10etc.)

Galaxyusersalsocanuseacustomreferencesequenceforalignment.Inthissituationthealignercreatesindicesinatemporaryworkingdirectoryforeveryjob.

ContactGalaxy-qld adminsifyouplantorunmanyalignmentjobswithacustomgenome.Wecanaddgenomeindicestotheserver.

OnlyforBWA Onlyforhg19

Gappedalignment

Page 20: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Alignments:SAMandBAM50xcoverageofthehumangenomewithreadlength100bp:~1,500,000,000readsCompressedsizeofsuchalignmentcanbe>100Gb.

SAM:SequenceAlignment/Map.Plaintextformat.BAM:binary(compressed)versionofthealignmentformat.

SAMcoordinatesare1-basedBAMcoordinatesare0-based

BAMsareindexedforrapidaccess.Usefulforalignmentvisualization.

Itisalwaysgoodtohaveaheader!@HD VN:1.0 SO:queryname@RG ID:igGroup SM:igSmpl LB:igL1 PL:ILLUMINA@SQ SN:chr2L LN:23011544@PG ID:TopHat VN:2.0.14

CL:/mnt/galaxy/tools/tophat/2.0.14/iuc/package_tophat_2_0_14/536f7bb5616d/bin/tophat --num-threads5….

Readgroups Canhandlemultiplesamplesinalignment

Page 21: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

SAMformat

11mandatorycolumnsandoptionalfieldswiththeTAG:TYPE:VALUEformat

Page 22: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

VisualizationofBAMs

AlignmentonIGV

Galaxy servers can act as a track hub

Itispossibletoaddmultipletracks:BAMs,geneannotations,knownvariants…

Page 23: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Genomebrowsers

IntegrativeGenomicsViewer,IGVEfficientgenomeviewerdevelopedbytheBroadInstitute.Installableonpersonalcomputers.Canaddacustomgenome.

UCSCGenomeBrowserAbigserverintheUS.TableBrowserfordataanalysis(intersection)SupportdataexporttoGalaxyCustomsessions(cansaveyourtracks)liftOver toolPublictrackhubs

Page 24: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

RNA-Seq withtheCufflinkspackageGVLBasicRNA-Seq GalaxytutorialTrapnell etal.(2012)NatureProtocols

Visualisealignments

Datamanipulation

D.melanogasterTwoconditionsThreereplicatesDataforchr4

Page 25: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

SetupfortheworkshopGVLwebsite:gvl.org.au

3

RegisteronGalaxy-tut:galaxy-tut.genome.edu.au

BasicGalaxytutorial

Page 26: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Galaxyisaworkflowengine

Selecttoolorinputdataset

Addname,comments

ToolboxNoodle

Input

AGalaxyworkflowisaseriesoftoolsanddatasetactionsthatruninsequenceasabatchoperation

Emailnotification

Page 27: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Galaxyworkflow

Page 28: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

CreateaGalaxyworkflow

Fromhistory

Fromscratch

Page 29: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Exercise

WewillcreateaGalaxyworkflowforRNA-Seq analysiswithoutreplicates:tophat2>Cuffdiff >Filter

Page 30: RNA-Seqin Galaxy: Tuxedo protocol · Public Galaxy servers Advantage of the registration: • access to histories over long time • multiple histories • ability to use Galaxy from

Acknowledgments

GenomicsVirtualLab:gvl.org.auGalaxyfortutorials:galaxy-tut.genome.edu.auGalaxyAustralia:galaxy-aust.genome.edu.au

Contributorsandparticipants: