Analysis of RNA sequencing data sets using the Galaxy environment

Dr. Orr Shomroni, Dr. Andreas Leha

14-15.09.2017

Outline

Day 1 topics:

● Introduction to RNA-seq workflow and Galaxy

● Sequence read formats and quality assessment

● Read alignment to the genome and quantification of expression

Day 2:

● Experiment design

● Analysis of differential expression

● Functional enrichment analysis of differentially expressed (DE) genes

Analysis of RNA sequencing data sets
Part I – from base to count

Dr. Orr Shomroni
Microarray and Deep-sequencing core facility

14.09.2017

https://ycl6.gitbooks.io/rna-seq-data-analysis/rna-seq_analysis_workflow.html

The RNA-seq workflow

● Differentially expressed genes across several conditions of an experiment

● “Simple” – two conditions:
  ● Wild type vs. gene knockout mouse
  ● Healthy person vs. cancer patient
  ● Control vs. treatment with drug

● Complexity can increase arbitrarily:
  ● Many conditions, confounding factors, time course experiments, etc.

RNA-seq workflow I – Hypothesis (a.k.a. the research question)

● Important to ensure (statistical) validity of results
● Depends on the hypothesis:
  ● Cell cultures or animals/patients?
  ● Phenotypic effect mild or severe?
  ● Inclusion of non-coding RNA?
  ● ...

● Affects choice of protocols for culturing, RNA extraction, sample preparation, sequencing, bioinformatics and especially the number of replicates per condition!
→ Involve a statistician/bioinformatician from the beginning!

RNA-seq workflow I – Experimental design

www.yeastern.com

RNA-seq workflow I – RNA purification

● RNA extraction
  ● Trizol / ready-to-use kits

● Requires 100 ng to 1 µg of cell material

● RNA integrity number (RIN) – designed to estimate the integrity of total RNA samples

● If RIN is high enough, continue to library preparation

www.abmgood.com

RNA-seq workflow I – library preparation

● Library preparation should be carried out by experienced technicians

● For simple differential expression analysis, we recommend mRNA sequencing

● Cheaper

● RiboZero step to remove rRNA can result in some contamination

www.illumina.com

RNA-seq workflow I – Sequencing

Different technologies exist, but Illumina's sequencing by synthesis (SBS) approach is usually used for RNA-seq
→ cycle-specific fluorescence intensities

RNA-seq workflow I – Sequencing processing

● Post-processing of intensity values

● basecalling: convert sequence of intensities to nucleotide sequences (“reads”)

● demultiplexing: assign reads to samples based on their adapter sequences (“barcodes”)

→ Sample-specific sequence read files

● Fragments can be sequenced from one or both ends
→ “unpaired”/“single-end” vs. “paired-end”

● RNA-seq is often run single-end
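
The demultiplexing step can be sketched in a few lines of Python. This is only an illustration: the barcodes, sample names and the function `demultiplex` are made up, and real demultiplexing software usually also tolerates sequencing errors within the barcode.

```python
def demultiplex(reads, barcode_to_sample):
    """Group read sequences by the sample their barcode belongs to."""
    samples = {}
    for barcode, sequence in reads:
        # Reads with an unknown barcode are collected separately.
        sample = barcode_to_sample.get(barcode, "undetermined")
        samples.setdefault(sample, []).append(sequence)
    return samples


# Hypothetical barcodes and sample names, for illustration only.
barcode_to_sample = {"ATCACG": "GFP_1", "CGATGT": "PCDH7_1"}
reads = [
    ("ATCACG", "ACGTAC"),
    ("CGATGT", "TTGACA"),
    ("ATCACG", "GGCATT"),
    ("NNNNNN", "ACACAC"),
]
by_sample = demultiplex(reads, barcode_to_sample)
for sample, sequences in by_sample.items():
    print(sample, len(sequences))
```

Each key of `by_sample` corresponds to one sample-specific read file.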

biocluster.ucr.edu

RNA-seq workflow II – FASTQ – the sequencing read file format

● “Raw” reads from sample-specific fragments

● Per-base quality information (Phred scores, ASCII-encoded with an offset of 33)
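
A minimal sketch of reading FASTQ records in Python, assuming the common layout of four lines per record (header, sequence, "+" separator, quality string) and an ASCII offset of 33 for the quality characters. The function `parse_fastq` and the example record are illustrative only.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        quality = handle.readline().rstrip("\n")
        # Offset 33: Phred score = ASCII code of the character minus 33
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


example = "@read1\nNCCT\n+\n#5II\n"  # made-up record
records = list(parse_fastq(StringIO(example)))
for read_id, sequence, qualities in records:
    print(read_id, sequence, qualities)
```

Each base gets exactly one quality character, so the sequence and quality lines have the same length.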

RNA-seq workflow II – FASTQ processing

Steps towards identifying differential expression of genes between samples:

1) Quality assessment of raw reads

2) Alignment of reads to the genome

3) Quantification of gene expression

QC of Raw Reads

Read Alignment

Quantification

How can I do that on my own?

Galaxy

● Open source, web-based platform for data intensive biomedical research developed at Penn State and Johns Hopkins University

● Many (NGS) bioinformatics tools available as “plug-ins”

● “Container-based” – server runs in a container that can be installed and customized on other systems
→ many instances of Galaxy running worldwide

● User works on “histories” of data and processes, data can be shared with other users

● Galaxy@GWDG: https://galaxy.gwdg.de/

Galaxy – practical I

● Open https://galaxy.gwdg.de/ and login with your GWDG/course account

Galaxy – practical I

Uploading data into Galaxy – a sandbox example:
● Go to www.ensembl.org
● Click “Downloads”, then “Download data via FTP”
● Click on “GTF” for Human Gene sets
● Download “Homo_sapiens.GRCh38.90.gtf.gz” to your PC
● Go back to Galaxy
● Click “Get Data”, then “Upload File from your computer”
● Choose local file from your PC (check “Download” folder)
● If successful, close the window
● Optional: rename history (click on “unnamed history”)

You should see this: Your history should look like this:

Galaxy practical I

● Uploading data may be time-consuming

● Galaxy allows importing data from public repositories and sharing data with other users

● We shared a data set from a published study:

Published January 2017

Galaxy practical I

● “Shared Data” → “Data Libraries” → “RNA_Seq_CourseData” → “Raw Data”

● 3 control condition samples (“GFP...”), 3 overexpression samples (“PCDH7...”)

● Click any of the files to inspect data

● Add all files to your history; several options:

● Individually open files and click “to History” (slow)

● Mark files in folder view and click “to History” (fast)

● Mark whole folder and click “to History” (fast)

● Import into existing history, go to Main menu and click the eye symbol for one of the samples

Here, we demonstrate that overexpression of PCDH7 potently synergizes with lung cancer drivers, including mutant KRAS and EGFR, inducing transformation of human bronchial epithelial cells (HBEC) and promoting tumorigenesis in vivo.

You should see this:

Zoom in to see FastQ file features

← base quality information

← read length

← read nucleotide sequence

RNA-seq workflow II – essential questions about quality control

● How many reads should I have?

● >=25 million reads required for representative transcriptome profile of model organisms such as human and mouse

● PCR introduces many (uninformative) duplicates

● How good are the reads?

● Assess signal-to-noise ratio of sequencing

● Determine proportion of ambiguous bases (“N”)

● Identify fraction of adapters, contamination, etc.

RNA-seq workflow II – Phred scores reflecting on basecall accuracy

How good are the bases/reads? → Phred scale: logarithmic scale of basecall accuracy

Common threshold for good quality

Phred Quality Score | Probability of Incorrect Basecall | Basecall Accuracy
10 | 1 in 10 | 90%
20 | 1 in 100 | 99%
30 | 1 in 1000 | 99.9%
40 | 1 in 10000 | 99.99%
50 | 1 in 100000 | 99.999%
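
The table follows from the definition Q = -10 · log10(p), where p is the probability of an incorrect basecall. A small Python sketch that reproduces the rows (the function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability of an incorrect basecall for Phred score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):.3f}%")
```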

RNA-seq workflow II – Quality control indices

Further quality indices:
● Distribution of nucleotide frequencies across the sequences
● GC content per sequence
● Fraction of “N”
● Length distribution of sequences
● Sequence duplication level
● Amount of overrepresented sequences and short (6-8 bp) stretches of nucleotides (“k-mers”)
● Adapter content → “trimming” may be required
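
Some of these indices are simple to compute by hand. The sketch below shows the idea on a toy list of reads; it is not how FastQC computes its modules (for example, FastQC reports GC content per sequence and estimates duplication from a subset of reads), and `qc_summary` is a made-up helper.

```python
from collections import Counter


def qc_summary(reads):
    """Compute a few simple quality indices for a list of read sequences."""
    total_bases = sum(len(read) for read in reads)
    base_counts = Counter("".join(reads))
    # A read is counted as a duplicate if an identical sequence was seen before.
    duplicates = len(reads) - len(set(reads))
    return {
        "n_reads": len(reads),
        "gc_fraction": (base_counts["G"] + base_counts["C"]) / total_bases,
        "n_fraction": base_counts["N"] / total_bases,
        "length_distribution": dict(Counter(len(read) for read in reads)),
        "duplicate_fraction": duplicates / len(reads),
    }


reads = ["ACGT", "ACGT", "NNGC", "GGCCAT"]  # toy data
summary = qc_summary(reads)
for name, value in summary.items():
    print(name, value)
```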

RNA-seq workflow II – FastQC: A quality control tool for high throughput sequence data

Systematically assess quality for NGS samples in Galaxy

→ FastQC

● Open source tool
● Runs on all platforms
● Assesses various quality parameters including contamination by adapters
● Allows the user to provide contamination sequences
● Generates intuitively interpretable output and visualization

RNA-seq workflow II – FastQC per base quality scores

Galaxy practical II – Quality control with FastQC

● General Sequencing → Quality Control → FastQC and read the description

● Click “Multiple datasets” and select all FASTQ files from your history

● Click “Execute”

Galaxy practical II – Quality control with FastQC

● Execution calls several instances of the FastQC program, which are “scheduled” by the server
→ execution time depends on file size, number of files, number of users and server load

● After a few minutes you should see FastQC results in your history (hit refresh symbol if not)

● As soon as any job is finished you can inspect the results → choose Webpage, then eye symbol

● Scroll through the Webpage
→ we are here to answer your questions!

● FastQC RawData contains detailed reports

RNA-seq workflow III – Short read alignment

Goal: determine the origin of sequenced reads w.r.t. the genome

http://www.nature.com/nbt/journal/v27/n5/fig_tab/nbt0509-455_F2.html

Sequence alignment: arrangement of two or more biological sequences to identify corresponding nucleotides/amino acids

Example:

sequence 1: ACATCGA
sequence 2: ACTAGCTA

possible alignment:

ACATCG--A
AC-TAGCTA

Terminology:
● match: two residues in a position match
● mismatch: residue is substituted by different residue
● gap: residue(s) is/are inserted or deleted

ACATCG--A
AC-TAGCTA

● mismatch: position 5 (C vs. A)
● gaps (insertion/deletion): position 3 and positions 7-8

Quality of an alignment: alignment score = sum of the scores of all aligned positions

Example: position scores: match=+1, mismatch=-1, gap=-1

possibility 1:
A C A T C G - - A
A C - T A G C T A
score: 5*1 + 4*(-1) = 1

possibility 2:
A C A T C - G - - A
A C - T - A G C T A
score: 5*1 + 5*(-1) = 0
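
The two scores can be checked with a short Python sketch that uses the same position scores (match=+1, mismatch=-1, gap=-1). The function `alignment_score` is illustrative; it scores an alignment that is already given and does not search for the best one.

```python
def alignment_score(row1, row2, match=1, mismatch=-1, gap=-1):
    """Sum the position scores of two aligned rows of equal length ('-' = gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


score1 = alignment_score("ACATCG--A", "AC-TAGCTA")    # possibility 1
score2 = alignment_score("ACATC-G--A", "AC-T-AGCTA")  # possibility 2
print(score1, score2)
```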

Global vs. local alignment:
● Global: align sequences end-to-end
● Local: find optimal placement of (sub)sequence(s) within longer sequence
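
The difference between global and local alignment can be made concrete with the classic dynamic programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). The sketch below only returns the best score, uses the simple linear gap penalty from the example above, and is far too slow for real read mapping; `align_score` is an illustrative name.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best global (default) or local alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalized, so the borders are not zero.
        for i in range(1, rows):
            dp[i][0] = i * gap
        for j in range(1, cols):
            dp[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
            if local:
                # Local: an alignment may start anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            dp[i][j] = cell
    # Global: score of aligning both sequences end-to-end; local: best cell anywhere.
    return best if local else dp[-1][-1]


print(align_score("TTACGTT", "ACG"))              # global, end-to-end
print(align_score("TTACGTT", "ACG", local=True))  # local, best sub-alignment
```

A short sequence placed inside a longer one scores well locally, but is penalized for the unaligned ends when aligned globally.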

Application of sequence alignment:
● Homology detection: identify best match of a sequence to many sequences in a database → e.g. NCBI BLAST
● Identify conserved sites via multiple alignments of related protein sequences → e.g. EMBL-EBI Clustal Omega
● Short read alignment (“mapping”): identify origin of a sequence w.r.t. a genomic reference sequence → e.g. Bowtie, BWA, TopHat, STAR, HISAT, ...

Reference sequence: the complete set (complement) of DNA sequences (genome) or mRNA sequences (transcriptome) of an organism

● usually provided as (multi-)Fasta file containing one sequence per chromosome/transcript

● completeness and complexity depend on how far the organism's genome project has advanced:

● human genome (GRCh38.p11): 24 almost complete chromosomal sequences + mitochondrial genome + ~170 orphan regions, ~6% undetermined nucleotides (“N”)

● western clawed frog: 400,000 “scaffolds” with ~12% “N”

Organism | Assembly | Length (Mb) | Chromosomes | Genes
Human (Homo sapiens) | GRCh38.p11 | 3253.85 | 22 chromosomes, 2 sex chromosomes and non-nuclear mitochondrial DNA |
African clawed frog (Xenopus laevis) | Xenopus_laevis_v2 | 2718.43 | 18 chromosomes, non-nuclear mitochondrial DNA |

#Organism/Name | SubGroup | Size (Mb) | Chrs | Organelles | Plasmids | Assemblies
Locusta migratoria | Insects | 5759.8 | - | - | - | 1
Orycteropus afer | Mammals | 4444.08 | - | 1 | - | 1
Chrysochloris asiatica | Mammals | 4210.11 | - | 1 | - | 1
Parhyale hawaiensis | Other Animals | 4023.76 | - | - | - | 1
Elephantulus edwardii | Mammals | 3843.98 | - | - | - | 1
Apodemus sylvaticus | Mammals | 3758.14 | - | - | - | 1
Dasypus novemcinctus | Mammals | 3631.52 | - | 1 | - | 1
Procavia capensis | Mammals | 3602.18 | - | - | - | 1
Monodelphis domestica | Mammals | 3598.44 | 9 | 1 | - | 1
Carlito syrichta | Mammals | 3453.86 | - | 1 | - | 1
Myodes glareolus | Mammals | 3443.07 | - | - | - | 1
Pongo abelii | Mammals | 3441.24 | 24 | 1 | - | 2
Cervus elaphus | Mammals | 3438.62 | 35 | - | - | 1
Dryococelus australis | Insects | 3416.45 | - | - | - | 1
Pan paniscus | Mammals | 3286.64 | 24 | 2 | - | 1
Choloepus hoffmanni | Mammals | 3286.01 | - | - | - | 1
Loxosceles reclusa | Other Animals | 3262.48 | - | - | - | 1
Homo sapiens | Mammals | 3253.85 | 48 | 1 | - | 61

Transcriptome sizes are substantially smaller, e.g. human transcriptome:
● 20,338 coding genes
● 22,521 non-coding genes
  ● 5,363 small non-coding
  ● 14,720 long non-coding
  ● 2,222 misc non-coding

Total number of transcripts can be much higher:
● 200,310 gene transcripts

Goal: determine (optimal) mapping of each sequencing read to reference genome/transcriptome

@SRR2549634.1 SEB9BZKS1:279:C4JALACXX:8:1101:1292:2222/1
NCCCCTTGGTCACCTTGCTTGATTATCGTAGCACCTTTGGGGACGGACTTC
@SRR2549634.2 SEB9BZKS1:279:C4JALACXX:8:1101:1771:2249/1
GTTAGATGCAACTCTTGGCCATAAATCGGCACATTCCTTACCGACTGGACC
@SRR2549634.3 SEB9BZKS1:279:C4JALACXX:8:1101:4645:2229/1
NGAATGGTATGTTGCTGGACCTCAGAAGGATGTTCAAAACCACAGTCAATG
@SRR2549634.4 SEB9BZKS1:279:C4JALACXX:8:1101:4518:2229/1
NTGGATCCTCAAATCCCACCACATCCATCCAAGGATCATGATTAAAAGCGT
@SRR2549634.5 SEB9BZKS1:279:C4JALACXX:8:1101:5231:2241/1
NTGGGTATTCACTGAAAGCTTCAACACACATTGGCTTAGATGGAACGAACT
@SRR2549634.6 SEB9BZKS1:279:C4JALACXX:8:1101:5383:2243/1
TGGGTGTAGACATCTTCAACACCAGCCAATTGCAACAACTTTTTGACAGCT
@SRR2549634.7 SEB9BZKS1:279:C4JALACXX:8:1101:7221:2245/1
TGGAAATGTTGTCCAGAGTTATCTGGATGATCTAACGTGGGGTTATTGTTT
@SRR2549634.8 SEB9BZKS1:279:C4JALACXX:8:1101:8304:2249/1
GCCAGACAGAGGTTTTTCAAATTAGGAAATGTTTGAGCCAATGTGGAAATT
@SRR2549634.9 SEB9BZKS1:279:C4JALACXX:8:1101:9168:2233/1
NCTATTTTCATCATCTGATTGAAAAAAAACATTGAAAATATACTCATCATT
@SRR2549634.10 SEB9BZKS1:279:C4JALACXX:8:1101:9915:2241/1
NGTGGACAAGATTCTTGGAGCCTTACCCTTGTGTGGACCCATACCGAAGTG

● Mapping = always local alignment
● Reads from RNA can span exons
→ “spliced” (gapped) alignment necessary

● Exact alignment of each read to each genome position is very slow → efficient algorithms make use of precomputed tables of short word occurrences in the reference sequence (“hashing”, “indexing”)

● Example:

ACATCGAT consists of: ACA CAT ATC TCG CGA GAT

words of length 3 | occurrences in the genome
AAA | chr1:12345-12347,...
AAC | chr3:9876-9874,...
AAG | ---
AAT | chrX:81838-81840
… | …
ACA | chr13:123-125,...
… | …
… | chr1:2435-2437,...
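
The indexing idea can be sketched in Python: precompute where every word of length k occurs in the reference, then let the words of a read "vote" for candidate start positions. This is a toy illustration with a made-up reference string; the aligners used in this course rely on far more compact index structures and also handle mismatches and splicing.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All overlapping words of length k, in order of occurrence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(reference, k):
    """Map each word of length k to its 0-based start positions in the reference."""
    index = defaultdict(list)
    for position, word in enumerate(kmers(reference, k)):
        index[word].append(position)
    return index


def candidate_positions(read, index, k):
    """Count, per candidate read start position, how many words of the read support it."""
    votes = defaultdict(int)
    for offset, word in enumerate(kmers(read, k)):
        for position in index.get(word, []):
            votes[position - offset] += 1
    return dict(votes)


words = kmers("ACATCGAT", 3)
reference = "GGACATCGATTT"  # made-up reference sequence
index = build_index(reference, 3)
hits = candidate_positions("ACATCGAT", index, 3)
print(words)
print(hits)
```

Only the candidate positions with many supporting words need to be checked by a full alignment.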

Galaxy@GWDG provides three read alignment tools:
● RNA STAR*
  ● Advantage: one of the most sensitive, precise, versatile and fast read alignment programs
  ● Disadvantage: memory-intensive
● HISAT2**: fast and sensitive, can be run on a laptop
● TopHat***: fast splice junction mapper, uses Bowtie2 and then analyzes the mapping results to identify splice junctions between exons
● Genome indexes are precomputed for human and mouse

*Dobin et al., Bioinformatics, 2013
**Kim et al., Nature Methods, 2015
***Kim et al., Genome Biology, 2013

Galaxy practical part III – short read alignment

● Transcriptomics → Mapping → HISAT2

● Select “unpaired reads”

● Choose one(!!!) of the six FASTQ files

● Select “Homo_sapiens...” as a reference genome

● When job is scheduled click on HISAT2 again and read the description

Note: mapping will take a while (~30min.)!

Visualization of alignments as stacked read sequences:

More flexible: Genome browsers
● Visualization of reads, splice patterns, mutations etc.
● Integration of annotation, public data, known SNPs etc.
● UCSC online genome browser: genome.ucsc.edu
● Downloadable and usable from Galaxy: IGV from Broad Institute* software.broadinstitute.org/software/igv/

*Robinson et al., Nature Biotechnology, 2011

The RNA-seq workflow III – Short read alignment

Read coverage: # of reads matching a position/region
● Allows statements about gene expression level (RNA-seq)
● High coverage helps to identify genomic variants
● Depends on sequencing depth
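
Read coverage per position can be illustrated with a few lines of Python. The sketch assumes alignments are given as 0-based, end-exclusive (start, end) intervals and ignores splicing; the function `coverage` and the toy intervals are made up.

```python
def coverage(alignments, region_start, region_end):
    """Per-position read depth within [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in alignments:
        # Only the part of the read inside the region contributes.
        for position in range(max(start, region_start), min(end, region_end)):
            depth[position - region_start] += 1
    return depth


reads = [(0, 5), (2, 7), (4, 9)]  # toy alignments
depth = coverage(reads, 0, 10)
print(depth)
```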

SAM = Sequence Alignment/Map format
● Human-readable standard format for alignment characterization
● Contains general information on alignment program/parameters and reference sequence used
● One entry per alignment with information on location, quality and more
● BAM = Binary (compressed) version
● samtools: popular tool for SAM/BAM file manipulation
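
Because SAM is plain tab-separated text, a single alignment entry can be inspected without special tools. The sketch below splits one line into the 11 mandatory SAM fields and reads two bits of the FLAG field (0x4 = unmapped, 0x10 = reverse strand). The example line is made up, and for real work samtools is the better choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Parse one SAM alignment line (not a '@' header line) into a dictionary."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    # Optional fields follow the mandatory ones as TAG:TYPE:VALUE.
    tags = {}
    for field in parts[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    record["tags"] = tags
    return record


# Made-up example entry: a read mapped to the reverse strand of chr4.
line = "read1\t16\tchr4\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1"
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```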

Several metrics allow statements about the total sample alignment quality:
● Total number of mapped reads (→ coverage) and fraction of reads mapping to the genome...
  ● ...uniquely: evidence for particular gene/transcript
  ● ...multiply: paralogs, CNV, ribosomal RNA, ...
  ● ...not at all: contamination, genomic DNA, ...
● # mismatches
● # novel splice junctions
● ...

Example mapping output:

● Click on the finished job and inspect the mapping statistics

● Click the info icon to assess information on the job details including version of the software used

● Start IGV on your system (search on Desktop)
  ● Open “.bat” file

● Choose “Human Hg38” as a reference genome

● Go to the locus field and enter “PCDH7”

● Shared Data → Data Libraries → RNA_seq_CourseData → Aligned Files

● Import all alignment (“BAM”) files into your history

● Go to main view (“Analyze Data”)

● Click on any of the alignment files from GFP and click “display with IGV local”

● Click on any of the alignment files from PCDH7 and click “display with IGV local”

● Go to IGV, zoom in on the first exon of PCDH7

● Right-click on the data tracks and choose “Collapsed”

RNAseq-workflow IV - quantification of expression

Gene expression quantification
Goal: estimate the gene expression level from counting reads overlapping annotated genes

discoveringthegenome.org

RNAseq-workflow IV – quantification of expression

● Annotations are often available from genome project websites or Ensembl

● Standard format for annotations is the general feature format (GFF) or gene transfer format (GTF)
  ● Tab-delimited files with information on gene structures
  ● 9 fields including flexible “Attributes”
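
A minimal Python sketch of how one GTF line can be taken apart, assuming the nine tab-separated columns of the GTF format with the attributes written as `key "value";` pairs. The example line and its gene_id are made up.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary; attributes become a nested dict."""
    record = dict(zip(GTF_COLUMNS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if item:
            key, value = item.split(" ", 1)
            attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Made-up example line; the gene_id is a placeholder, not a real identifier.
line = '4\thavana\texon\t100\t200\t.\t+\t.\tgene_id "ENSG_EXAMPLE"; gene_name "PCDH7";'
record = parse_gtf_line(line)
print(record["feature"], record["start"], record["end"], record["attribute"])
```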

RNAseq-workflow IV – quantification of expression

● The file we down-/uploaded earlier is an annotation in GTF format for the human genome

Standard procedure: count number of reads that overlap features (here: exons of a gene) and summarize on meta-feature (here: gene) level
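
The counting procedure can be sketched as follows. This is a simplified illustration, not the featureCounts algorithm: reads and exons are plain 1-based inclusive intervals, a read is assigned to a gene if it overlaps any of its exons, and the labels `__ambiguous` and `__no_feature` are made up for reads that overlap several genes or none.

```python
from collections import Counter


def count_reads_per_gene(reads, exons):
    """Count reads per gene.

    reads: list of (chrom, start, end)
    exons: list of (chrom, start, end, gene_id)
    """
    counts = Counter()
    for chrom, read_start, read_end in reads:
        # Collect all genes with at least one exon overlapping the read.
        genes = {
            gene_id
            for exon_chrom, exon_start, exon_end, gene_id in exons
            if exon_chrom == chrom and read_start <= exon_end and exon_start <= read_end
        }
        if len(genes) == 1:
            counts[genes.pop()] += 1
        elif genes:
            counts["__ambiguous"] += 1
        else:
            counts["__no_feature"] += 1
    return dict(counts)


exons = [("chr1", 100, 200, "geneA"), ("chr1", 300, 400, "geneA"), ("chr1", 380, 500, "geneB")]
reads = [("chr1", 150, 200), ("chr1", 190, 310), ("chr1", 390, 440), ("chr1", 600, 650)]
counts = count_reads_per_gene(reads, exons)
print(counts)
```

The second read overlaps two exons of the same gene and is counted once for that gene, which is the summarization on the meta-feature level.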

Questions and pitfalls when counting mapped reads

● Consider multiply mapped reads?
● Count on gene or exon/transcript level?
● How to count partially mapping reads?
● How to treat overlapping features?
● ...

● Galaxy@GWDG provides featureCounts* tool for fast and flexible quantification

● Transcriptomics → Counting → featureCounts and read the description

● Click “Multiple datasets” and select all imported alignment files

● load the annotation file (the GTF file) from your history

Note: quantification should take between 1 and 10 min.

*Liao et al., Bioinformatics, 2014

Galaxy practical part IV – gene expression quantification

● When any dataset is finished, click on eye symbol

● Copy identifier of a gene with >1000 reads assigned and paste it into Ensembl search window

● Optional: rename files according to alignment input

RNA workflow addendum – Summary of quality from multiple samples

● Quality assessment of 6 samples – easy enough to do one by one
● What about more?

● Solution: MultiQC
● Supports summary logs from multiple software, including FastQC, STAR, Bowtie2, featureCounts, etc.

● Generates a single HTML file, summarizing all results in a single, interactive report

RNA workflow addendum – Summary of quality from multiple samples

Galaxy practical addendum – quality summary (FastQC)

Questions?

Galaxy practical addendum – quality summary
