Analysis of RNA sequencing data sets using the Galaxy environment Dr. Orr Shomroni Dr. Andreas Leha 14-15.09.2017


Analysis of RNA sequencing data sets using the Galaxy environment

Dr. Orr Shomroni
Dr. Andreas Leha

14-15.09.2017


Outline

Day 1 topics:

● Introduction to RNA-seq workflow and Galaxy

● Sequence read formats and quality assessment

● Read alignment to the genome and quantification of expression

Day 2:

● Experiment design

● Analysis of differential expression

● Functional enrichment analysis of differentially expressed (DE) genes


Analysis of RNA sequencing data sets
Part I – from base to count

Dr. Orr Shomroni
Microarray and Deep-sequencing core facility

14.09.2017


https://ycl6.gitbooks.io/rna-seq-data-analysis/rna-seq_analysis_workflow.html

The RNA-seq workflow


● Differentially expressed genes across several conditions of an experiment

● “Simple” – two conditions:
  ● Wild type vs. gene knockout mouse
  ● Healthy person vs. cancer patient
  ● Control vs. treatment with drug

● Complexity can increase arbitrarily:
  ● Many conditions, confounding factors, time course experiments, etc.

RNA-seq workflow I – Hypothesis (a.k.a. the research question)


● Important to ensure (statistical) validity of results

● Depends on the hypothesis:
  ● Cell cultures or animals/patients?
  ● Phenotypic effect mild or severe?
  ● Inclusion of non-coding RNA?
  ● ...

● Affects choice of protocols for culturing, RNA extraction, sample preparation, sequencing, bioinformatics and especially the number of replicates per condition!

→ Involve statistician/bioinformatician from the beginning!

RNA-seq workflow I – Experimental design


www.yeastern.com

RNA-seq workflow I – RNA purification

● RNA extraction
  ● Trizol / ready-to-use kits

● Requires 100 ng to 1 µg of cell material

● RNA integrity number (RIN) – designed to estimate the integrity of total RNA samples

● If RIN is high enough, continue to library preparation


www.abmgood.com

RNA-seq workflow I – library preparation

● Library preparation should be carried out by experienced technicians

● For simple differential expression analysis, we recommend mRNA sequencing

● Cheaper

● RiboZero step to remove rRNA can result in some contamination


www.illumina.com

RNA-seq workflow I – Sequencing

Different technologies exist, but Illumina's sequencing by synthesis (SBS) approach is usually used for RNA-seq
→ cycle-specific fluorescence intensities


RNA-seq workflow I – Sequencing processing

● Post-processing of intensity values

● basecalling: convert sequence of intensities to nucleotide sequences (“reads”)

● demultiplexing: assign reads to samples based on their adapter sequences (“barcodes”)

→ Sample-specific sequence read files

● Fragments can be sequenced from one or both ends
→ “unpaired”/“single-end” vs. “paired-end”

● RNA-seq often run with single-end


biocluster.ucr.edu

RNA-seq workflow II – FASTQ – the sequencing read file format

● “Raw” reads from sample-specific fragments

● Per-base quality information (Phred scores, ASCII-encoded with an offset of 33)
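As a minimal sketch of how the quality line of a FASTQ record can be read, the following Python snippet decodes Phred+33 characters into integer scores. It assumes the common four-line record layout (header, sequence, "+", quality); the read name and quality string are made up for illustration and do not come from the course data.

```python
def decode_phred33(quality_string):
    """Convert a Phred+33 quality string into a list of integer Phred scores."""
    # Each character encodes one score: ASCII code minus the offset 33.
    return [ord(ch) - 33 for ch in quality_string]


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record (header, sequence, separator, quality)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {
        "id": header[1:],
        "sequence": sequence,
        "quality": decode_phred33(quality),
    }


# Hypothetical record, for illustration only.
record = parse_fastq_record(["@example_read/1", "ACGTN", "+", "IIII#"])
print(record["quality"])
```

Here "I" (ASCII 73) stands for a score of 40 and "#" (ASCII 35) for a score of 2, which is why low-quality bases such as "N" often carry characters near the start of the printable ASCII range.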


RNA-seq workflow II – FASTQ processing

Steps towards identifying differential expression of genes between samples:

1) Quality assessment of raw reads

2) Alignment of reads to the genome

3) Quantification of gene expression

QC of Raw Reads

Read Alignment

Quantification

How can I do that on my own?


Galaxy

● Open source, web-based platform for data intensive biomedical research developed at Penn State and Johns Hopkins University

● Many (NGS) bioinformatics tools available as “plug-ins”

● “Container-based” – server runs in a container that can be installed and customized on other systems
→ many instances of Galaxy running worldwide

● User works on “histories” of data and processes, data can be shared with other users

● Galaxy@GWDG: https://galaxy.gwdg.de/


Galaxy – practical I

● Open https://galaxy.gwdg.de/ and log in with your GWDG/course account


Galaxy – practical I

Uploading data into Galaxy – a sandbox example:
● Go to www.ensembl.org
● Click “Downloads”, then “Download data via FTP”
● Click on “GTF” for Human Gene sets
● Download “Homo_sapiens.GRCh38.90.gtf.gz” to your PC
● Go back to Galaxy
● Click “Get Data”, then “Upload File from your computer”
● Choose local file from your PC (check “Download” folder)
● If successful, close the window
● Optional: rename history (click on “unnamed history”)


You should see this: Your history should look like this:


Galaxy practical I

● Uploading data may be time-consuming

● Galaxy allows importing data from public repositories and sharing data with other users

● We shared a data set from a published study:

Published January 2017


Galaxy practical I

● “Shared Data” → “Data Libraries” → “RNA_Seq_CourseData” → “Raw Data”

● 3 control condition samples (“GFP...”), 3 overexpression samples (“PCDH7...”)

● Click any of the files to inspect data

● Add all files to your history; several options:

● Individually open files and click “to History” (slow)

● Mark files in folder view and click “to History” (fast)

● Mark whole folder and click “to History” (fast)

● Import into existing history, go to Main menu and click the eye symbol for one of the samples

Here, we demonstrate that overexpression of PCDH7 potently synergizes with lung cancer drivers, including mutant KRAS and EGFR, inducing transformation of human bronchial epithelial cells (HBEC) and promoting tumorigenesis in vivo.


You should see this:


Zoom in to see FastQ file features

← base quality information

← read length

← read nucleotide sequence


RNA-seq workflow II – essential questions about quality control

● How many reads should I have?

● >=25 million reads required for representative transcriptome profile of model organisms such as human and mouse

● PCR introduces many (uninformative) duplicates

● How good are the reads?

● Assess signal-to-noise ratio of sequencing

● Determine proportion of ambiguous bases (“N”)

● Identify fraction of adapters, contamination, etc.


RNA-seq workflow II – Phred scores reflecting on basecall accuracy

How good are the bases/reads? → Phred scale: logarithmic scale of basecall accuracy

Common threshold for good quality

Phred Quality Score   Probability of Incorrect Basecall   Basecall accuracy
10                    1 in 10                             90%
20                    1 in 100                            99%
30                    1 in 1000                           99.9%
40                    1 in 10000                          99.99%
50                    1 in 100000                         99.999%
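The table follows the standard Phred definition Q = -10 · log10(P), where P is the probability of an incorrect basecall. A small Python sketch of the conversion in both directions (function names are chosen here for illustration):

```python
import math


def phred_to_error_probability(q):
    """Probability of an incorrect basecall for Phred score q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Reproduce the rows of the table: score, error probability, accuracy in percent.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(q, p, 100 * (1 - p))
```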


RNA-seq workflow II – Quality control indices

Further quality indices:
● Distribution of nucleotide frequencies across the sequences
● GC content per sequence
● Fraction of “N”
● Length distribution of sequences
● Sequence duplication level
● Amount of overrepresented sequences and short (6-8 bp) stretches of nucleotides (“k-mers”)
● Adapter content → “trimming” may be required
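Some of these indices are simple enough to compute by hand. The following Python sketch calculates GC content, fraction of “N”, length distribution and a naive duplicate fraction for a list of read sequences. It is an illustration only: the reads are invented, and the duplicate fraction used here (copies beyond the first, divided by all reads) is a simplification and not necessarily the measure FastQC reports.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of a read."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def n_fraction(seq):
    """Fraction of ambiguous bases ("N") in a read."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def summarize_reads(reads):
    """Compute a few simple quality indices over a non-empty list of reads."""
    copies = Counter(reads)
    # Every copy beyond the first occurrence of a sequence counts as a duplicate.
    duplicates = sum(count - 1 for count in copies.values())
    return {
        "mean_gc": sum(gc_content(r) for r in reads) / len(reads),
        "mean_n_fraction": sum(n_fraction(r) for r in reads) / len(reads),
        "length_distribution": dict(Counter(len(r) for r in reads)),
        "duplicate_fraction": duplicates / len(reads),
    }


# Hypothetical reads, for illustration only.
summary = summarize_reads(["ACGT", "ACGT", "GGCC", "NNAT"])
print(summary)
```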


RNA-seq workflow II – FastQC: A quality control tool for high throughput sequence data

Systematically assess quality for NGS samples in Galaxy

→ FastQC

● Open source tool
● Runs on all platforms
● Assesses various quality parameters, including contamination by adapters
● Allows the user to provide contaminant sequences
● Generates intuitively interpretable output and visualization


RNA-seq workflow II – FastQC per base quality scores


Galaxy practical II – Quality control with FastQC

● General Sequencing → Quality Control → FastQC and read the description

● Click “Multiple datasets” and select all FASTQ files from your history

● Click “Execute”


Galaxy practical II – Quality control with FastQC

● Execution calls several instances of the FastQC program, which are “scheduled” by the server
→ execution time depends on file size, number of files, number of users and server load

● After a few minutes you should see FastQC results in your history (hit refresh symbol if not)

● As soon as any job is finished you can inspect the results → choose Webpage, then eye symbol

● Scroll through the Webpage
→ we are here to answer your questions!

● FastQC RawData contains detailed reports


RNA-seq workflow III – Short read alignment

Goal: determine the origin of sequenced reads w.r.t. the genome

http://www.nature.com/nbt/journal/v27/n5/fig_tab/nbt0509-455_F2.html


RNA-seq workflow III – Short read alignment

Sequence alignment: re-arrangement of two or more biological sequences to identify corresponding nucleotides/amino acids

Example:

sequence 1: ACATCGA
sequence 2: ACTAGCTA

possible alignment:

ACATCG--A
AC-TAGCTA


RNA-seq workflow III – Short read alignment

Terminology:
● match: two residues in a position match
● mismatch: residue is substituted by a different residue
● gap: residue(s) is/are inserted or deleted

ACATCG--A
AC-TAGCTA

match

mismatch

insertion

deletion


RNA-seq workflow III – Short read alignment

Quality of an alignment: alignment score = sum of the scores of all aligned positions

Example: position scores: match=+1, mismatch=-1, gap=-1

possibility 1:

A C A T C G - - A
A C - T A G C T A

score: 5*1 + 4*(-1) = 1

possibility 2:

A C A T C - G - - A
A C - T - A G C T A

score: 5*1 + 5*(-1) = 0
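The scoring of the two possibilities can be checked with a few lines of Python. This is a sketch of the simple per-position scheme from the slide (match=+1, mismatch=-1, gap=-1), not of the scoring used by any particular aligner; the function name is chosen for illustration.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Sum per-position scores of two aligned sequences of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap        # insertion or deletion
        elif a == b:
            score += match      # identical residues
        else:
            score += mismatch   # substituted residue
    return score


print(alignment_score("ACATCG--A", "AC-TAGCTA"))    # possibility 1
print(alignment_score("ACATC-G--A", "AC-T-AGCTA"))  # possibility 2
```

Possibility 1 has five matches, one mismatch and three gap positions; possibility 2 has five matches and five gap positions, so the first alignment scores higher.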


RNA-seq workflow III – Short read alignment

Global vs. local alignment:
● Global: align sequences end-to-end
● Local: find optimal placement of (sub)sequence(s) within a longer sequence


RNA-seq workflow III – Short read alignment

Application of sequence alignment:
● Homology detection: identify best match of a sequence to many sequences in a database → e.g. NCBI BLAST

● Identify conserved sites via multiple alignments of related protein sequences → e.g. EMBL-EBI Clustal Omega

● Short read alignment (“mapping”): identify origin of a sequence w.r.t. a genomic reference sequence → e.g. Bowtie, BWA, TopHat, STAR, HISAT, ...


RNA-seq workflow III – Short read alignment

Reference sequence: the complement of DNA sequences (genome) or mRNA sequences (transcriptome) of an organism

● usually provided as (multi-)FASTA file containing one sequence per chromosome/transcript

● completeness and complexity depend on the progress of the organism's genome project:

● human genome (GRCh38.p11): 24 almost complete chromosomal sequences + mitochondrial genome + ~170 orphan regions, ~6% undetermined nucleotides (“N”)

● western clawed frog: 400,000 “scaffolds” with ~12% “N”

Organism                              Assembly           Length (Mb)  Chromosomes                                                          Genes
Human (Homo sapiens)                  GRCh38.p11         3253.85      22 chromosomes, 2 sex chromosomes and non-nuclear mitochondrial DNA  60298
African clawed frog (Xenopus laevis)  Xenopus_laevis_v2  2718.43      18 chromosomes, non-nuclear mitochondrial DNA                        36776


RNA-seq workflow III – Short read alignment

#Organism/Name          SubGroup       Size (Mb)  Chrs  Organelles  Plasmids  Assemblies
Locusta migratoria      Insects        5759.8     -     -           -         1
Orycteropus afer        Mammals        4444.08    -     1           -         1
Chrysochloris asiatica  Mammals        4210.11    -     1           -         1
Parhyale hawaiensis     Other Animals  4023.76    -     -           -         1
Elephantulus edwardii   Mammals        3843.98    -     -           -         1
Apodemus sylvaticus     Mammals        3758.14    -     -           -         1
Dasypus novemcinctus    Mammals        3631.52    -     1           -         1
Procavia capensis       Mammals        3602.18    -     -           -         1
Monodelphis domestica   Mammals        3598.44    9     1           -         1
Carlito syrichta        Mammals        3453.86    -     1           -         1
Myodes glareolus        Mammals        3443.07    -     -           -         1
Pongo abelii            Mammals        3441.24    24    1           -         2
Cervus elaphus          Mammals        3438.62    35    -           -         1
Dryococelus australis   Insects        3416.45    -     -           -         1
Pan paniscus            Mammals        3286.64    24    2           -         1
Choloepus hoffmanni     Mammals        3286.01    -     -           -         1
Loxosceles reclusa      Other Animals  3262.48    -     -           -         1
Homo sapiens            Mammals        3253.85    48    1           -         61


RNA-seq workflow III – Short read alignment

Transcriptome sizes are substantially smaller, e.g. human transcriptome:
● 20,338 coding genes
● 22,521 non-coding genes
  ● 5,363 small non-coding
  ● 14,720 long non-coding
  ● 2,222 misc non-coding

Total number of transcripts can be much higher:
● 200,310 gene transcripts


RNA-seq workflow III – Short read alignment

Goal: determine (optimal) mapping of each sequencing read to reference genome/transcriptome

@SRR2549634.1 SEB9BZKS1:279:C4JALACXX:8:1101:1292:2222/1
NCCCCTTGGTCACCTTGCTTGATTATCGTAGCACCTTTGGGGACGGACTTC

@SRR2549634.2 SEB9BZKS1:279:C4JALACXX:8:1101:1771:2249/1
GTTAGATGCAACTCTTGGCCATAAATCGGCACATTCCTTACCGACTGGACC

@SRR2549634.3 SEB9BZKS1:279:C4JALACXX:8:1101:4645:2229/1
NGAATGGTATGTTGCTGGACCTCAGAAGGATGTTCAAAACCACAGTCAATG

@SRR2549634.4 SEB9BZKS1:279:C4JALACXX:8:1101:4518:2229/1
NTGGATCCTCAAATCCCACCACATCCATCCAAGGATCATGATTAAAAGCGT

@SRR2549634.5 SEB9BZKS1:279:C4JALACXX:8:1101:5231:2241/1
NTGGGTATTCACTGAAAGCTTCAACACACATTGGCTTAGATGGAACGAACT

@SRR2549634.6 SEB9BZKS1:279:C4JALACXX:8:1101:5383:2243/1
TGGGTGTAGACATCTTCAACACCAGCCAATTGCAACAACTTTTTGACAGCT

@SRR2549634.7 SEB9BZKS1:279:C4JALACXX:8:1101:7221:2245/1
TGGAAATGTTGTCCAGAGTTATCTGGATGATCTAACGTGGGGTTATTGTTT

@SRR2549634.8 SEB9BZKS1:279:C4JALACXX:8:1101:8304:2249/1
GCCAGACAGAGGTTTTTCAAATTAGGAAATGTTTGAGCCAATGTGGAAATT

@SRR2549634.9 SEB9BZKS1:279:C4JALACXX:8:1101:9168:2233/1
NCTATTTTCATCATCTGATTGAAAAAAAACATTGAAAATATACTCATCATT

@SRR2549634.10 SEB9BZKS1:279:C4JALACXX:8:1101:9915:2241/1
NGTGGACAAGATTCTTGGAGCCTTACCCTTGTGTGGACCCATACCGAAGTG


RNA-seq workflow III – Short read alignment

● Mapping = always local alignment

● Reads from RNA can span exons
→ “spliced” (gapped) alignment necessary


RNA-seq workflow III – Short read alignment

● Exact alignment of each read to each genome position is very slow
→ efficient algorithms make use of precomputed tables of short word occurrences in the reference sequence (“hashing”, “indexing”)

● Example:

ACATCGAT consists of ACA CAT ATC TCG CGA GAT

words of length 3    occurrences in the genome
AAA                  chr1:12345-12347,...
AAC                  chr3:9876-9874,...
AAG                  ---
AAT                  chrX:81838-81840
…                    …
ACA                  chr13:123-125,...
…                    …
TTT                  chr1:2435-2437,...
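The indexing idea can be sketched in Python: build a table from each word of length k to its positions in the reference, then let the words of a read “vote” for candidate start positions. This is a toy illustration under simplifying assumptions (one short made-up reference, exact word matches, 0-based positions); real aligners use far more compact index structures.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping words of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(reference, k):
    """Map each word of length k to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Count, per reference position, how many words of the read support it."""
    votes = Counter()
    for offset, word in enumerate(kmers(read, k)):
        for pos in index.get(word, []):
            # A word found at pos implies the read would start at pos - offset.
            votes[pos - offset] += 1
    return votes


# Hypothetical reference, for illustration only.
reference = "TTACATCGATGG"
index = build_index(reference, 3)
votes = candidate_positions("ACATCGAT", index, 3)
print(kmers("ACATCGAT", 3))
print(votes.most_common(1))
```

Only positions supported by many words need to be examined with a full alignment, which is what makes the lookup much faster than comparing the read against every genome position.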


RNA-seq workflow III – Short read alignment

Galaxy@GWDG provides three read alignment tools:
● RNA STAR*
  ● Advantage: one of the most sensitive, precise, versatile and fast read alignment programs
  ● Disadvantage: memory-intensive
● HISAT2** – fast and sensitive, can be run on a laptop
● TopHat*** – fast splice junction mapper, uses Bowtie2 and then analyzes the mapping results to identify splice junctions between exons
● Genome indexes precomputed for human and mouse

*Dobin et al., Bioinformatics, 2013
**Kim et al., Nature Methods, 2015
***Kim et al., Genome Biology, 2013


Galaxy practical part III – short read alignment

● Transcriptomics → Mapping → HISAT2

● Select “unpaired reads”

● Choose one(!!!) of the six FASTQ files

● Select “Homo_sapiens...” as a reference genome

● Click “Execute”

● When job is scheduled click on HISAT2 again and read the description

Note: mapping will take a while (~30min.)!


Galaxy practical part III – short read alignment


RNA-seq workflow III – Short read alignment

Visualization of alignments as stacked read sequences:


RNA-seq workflow III – Short read alignment

More flexible: Genome browsers
● Visualization of reads, splice patterns, mutations etc.
● Integration of annotation, public data, known SNPs etc.
● UCSC online genome browser: genome.ucsc.edu
● Downloadable and usable from Galaxy: IGV from Broad Institute* software.broadinstitute.org/software/igv/

*Robinson et al., Nature Biotechnology, 2011


The RNA-seq workflow III – Short read alignment


RNA-seq workflow III – Short read alignment

Read coverage: # of reads matching a position/region
● Allows statements about gene expression level (RNA-seq)
● High coverage helps to identify genomic variants
● Depends on sequencing depth


RNA-seq workflow III – Short read alignment

SAM = Sequence Alignment/Map format
● Human-readable standard format for alignment characterization
● Contains general information on alignment program/parameters and reference sequence used
● One entry per alignment with information on location, quality and more
● BAM = Binary (compressed) version
● samtools: popular tool for SAM/BAM file manipulation
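To make the “one entry per alignment” idea concrete, the following Python sketch splits a single SAM alignment line into its 11 mandatory tab-separated fields and derives two simple properties from them. The example line is invented for illustration (read name, position and CIGAR string are assumptions); header lines, which start with “@”, are not handled here.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Parse one SAM alignment line into a dictionary of its mandatory fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["tags"] = parts[11:]                         # optional fields, e.g. NH:i:1
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)  # flag bit 0x4: read unmapped
    record["is_spliced"] = "N" in record["CIGAR"]       # CIGAR "N": skipped region (intron)
    return record


# Hypothetical alignment line, for illustration only.
example = "\t".join(["example_read", "0", "chr4", "30720000", "60",
                     "20M500N31M", "*", "0", "0", "A" * 51, "I" * 51, "NH:i:1"])
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["is_spliced"])
```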


RNA-seq workflow III – Short read alignment


RNA-seq workflow III – Short read alignment


RNA-seq workflow III – Short read alignment

Several metrics allow statements about the total sample alignment quality:
● Total number of mapped reads (→ coverage) and fraction of reads mapping to the genome...
  ● ...uniquely: evidence for particular gene/transcript
  ● ...multiply: paralogs, CNV, ribosomal RNA, ...
  ● ...not at all: contamination, genomic DNA, ...
● # mismatches
● # novel splice junctions
● ...


RNA-seq workflow III – Short read alignment

Example mapping output:

● Click on the finished job and inspect the mapping statistics

● Click the info icon to access information on the job details, including the version of the software used


Galaxy practical part III – short read alignment

● Start IGV on your system (search on Desktop)
  ● Open “.bat” file

● Choose “Human Hg38” as a reference genome

● Go to the locus field and enter “PCDH7”


Galaxy practical part III – short read alignment

● Shared Data → Data Libraries → RNA_seq_CourseData → Aligned Files

● Import all alignment (“BAM”) files into your history

● Go to main view (“Analyze Data”)

● Click on any of the alignment files from GFP and click “display with IGV local”

● Click on any of the alignment files from PCDH7 and click “display with IGV local”

● Go to IGV, zoom in on the first exon of PCDH7

● Right-click on the data tracks and choose “Collapsed”


RNAseq-workflow IV - quantification of expression

Gene expression quantification

Goal: estimate the gene expression level from counting reads overlapping annotated genes

discoveringthegenome.org


RNAseq-workflow IV – quantification of expression

● Annotations are often available from genome project websites or Ensembl

● Standard format for annotations is the general feature format (GFF) or gene transfer format (GTF)
  ● Tab-delimited files with information on gene structures
  ● 9 fields including flexible “Attributes”
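A GTF line can be taken apart with a few lines of Python. The sketch below assumes the usual nine tab-separated GTF columns (seqname, source, feature, start, end, score, strand, frame, attribute) and the GTF attribute style `key "value";`. The example line is invented for illustration and is not taken from the Ensembl file.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary; attributes become a sub-dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("a GTF feature line has 9 tab-separated fields")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute is a key, a space, and a (usually quoted) value.
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Hypothetical feature line, for illustration only.
example = "\t".join(["chr1", "example_source", "exon", "100", "200", ".", "+", ".",
                     'gene_id "GENE_A"; transcript_id "TX_A1";'])
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attribute"]["gene_id"])
```

Comment lines in real annotation files start with “#” and would need to be skipped before calling the function.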


RNAseq-workflow IV – quantification of expression

● The file we down-/uploaded earlier is an annotation in GTF format for the human genome


RNAseq-workflow IV - quantification of expression

Standard procedure: count number of reads that overlap features (here: exons of a gene) and summarize on meta-feature (here: gene) level
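The counting procedure can be sketched in Python as follows. This is a deliberately naive illustration, not the featureCounts implementation: exons and reads are plain tuples with 1-based inclusive coordinates, strand is ignored, and reads that overlap exons of more than one gene are skipped as ambiguous, which is one possible convention among several. All gene names and coordinates are invented.

```python
from collections import defaultdict


def count_reads_per_gene(exons, reads):
    """Count reads overlapping exons (features) and summarize per gene (meta-feature).

    exons: list of (gene_id, chrom, start, end)
    reads: list of (chrom, start, end)
    """
    counts = defaultdict(int)
    for chrom, r_start, r_end in reads:
        # Genes with at least one exon overlapping the read on the same chromosome.
        hit_genes = {
            gene for gene, e_chrom, e_start, e_end in exons
            if e_chrom == chrom and r_start <= e_end and e_start <= r_end
        }
        # A read hitting several exons of one gene is counted once for that gene;
        # a read hitting several genes is treated as ambiguous and not counted.
        if len(hit_genes) == 1:
            counts[next(iter(hit_genes))] += 1
    return dict(counts)


# Hypothetical annotation and reads, for illustration only.
exons = [("GENE_A", "chr1", 100, 200),
         ("GENE_A", "chr1", 300, 400),
         ("GENE_B", "chr1", 380, 500)]
reads = [("chr1", 150, 200),   # GENE_A
         ("chr1", 190, 310),   # two exons of GENE_A, counted once
         ("chr1", 390, 440),   # overlaps GENE_A and GENE_B, ambiguous
         ("chr1", 450, 500),   # GENE_B
         ("chr2", 100, 150)]   # no annotated feature
gene_counts = count_reads_per_gene(exons, reads)
print(gene_counts)
```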


RNAseq-workflow IV - quantification of expression

Questions and pitfalls when counting mapped reads

● Consider multiply mapped reads?
● Count on gene or exon/transcript level?
● How to count partially mapping reads?
● How to treat overlapping features?
● ...


RNAseq-workflow IV - quantification of expression

● Galaxy@GWDG provides featureCounts* tool for fast and flexible quantification

● Transcriptomics → Counting → featureCounts and read the description

● Click “Multiple datasets” and select all imported alignment files

● Load the annotation file (the GTF file) from your history

● Click “Execute”

Quantification should take between 1 and 10 min.

*Liao et al., Bioinformatics, 2014


Galaxy practical part IV – gene expression quantification

● When any dataset is finished, click on eye symbol

● Copy identifier of a gene with >1000 reads assigned and paste it into Ensembl search window

● Optional: rename files according to alignment input


RNA workflow addendum – Summary of quality from multiple samples

● Quality assessment of 6 samples – easy enough to do one by one
● What about more?

● Solution: MultiQC
  ● Supports summary logs from multiple tools, including FastQC, STAR, Bowtie2, featureCounts, etc.

● Generates a single HTML file, summarizing all results in a single, interactive report


RNA workflow addendum – Summary of quality from multiple samples


Galaxy practical addendum – quality summary (FastQC)


Questions?

Galaxy practical addendum – quality summary