Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, August 2017
https://abc.med.cornell.edu/
Friederike Dündar with Luce Skrabanek & Ceyda Durmaz
Day 1: Introduction to high-throughput sequencing [many general concepts!]
1. RNA isolation & library preparation
2. Illumina’s sequencing by synthesis
3. raw sequencing reads
• download
• quality control
4. experimental design
RNA-seq is popular, but still developing
Reuter et al. (2015). Mol Cell. Goodwin, McPherson & McCombie (2016). Nat Rev Genet, 17(6), 333–351
“RNA-seq is not a mature technology. It is undergoing rapid evolution of biochemistry of sample preparation; of sequencing platforms; of computational pipelines; and of subsequent analysis methods that include statistical treatments and transcript model building.” ENCODE consortium
“Analysis paralysis”
• basically no generally accepted standard reference
• myriad tools → highly complex & specialized “pipelines”
“The (…) flexibility and seemingly infinite set of options (…) have
hindered its path to the clinic. (…) The fixed nature of probe sets with
microarrays or qRT-PCR offer an accelerated path (…) without the lure
of the latest and newest analysis methods.”
Byron et al., 2016
Byron et al. Nat Rev Genetics (2016)
What to expect from the class
Sample type & quality
Sequencing • Read length • PE vs. SR • Sequencing errors
Experimental design • Controls • No. of replicates • Randomization
Library preparation • Poly-A enrichment vs. ribo minus • Strand information
Bioinformatics • Aligner • Normalization • DE analysis strategy
• Expression quantification • Alternative splicing • De novo assembly needed • mRNAs, small RNAs • ….
Biological question
NOT COVERED: • novel transcript discovery • transcriptome assembly • alternative splicing analysis (see the course notes for references to useful reviews)
RNA-seq workflow overview
Total RNA extraction: cells → RNA
• Fragmentation • mRNA enrichment • Library preparation: RNA → fragments → cDNA with adapters
Sequencing
• Cluster generation • Sequencing by synthesis • Image acquisition
Bioinformatics
Quality control of RNA extraction
28S:18S ratio
avoid degraded RNA junk
RNA-seq library preparation
double-stranded cDNA synthesis followed by the same steps as done for DNA-seq, i.e. end-repair, ligation of dsDNA adapters, and PCR amplification (Fig. 2). A major drawback of this method is that strand information is not preserved. For this reason, many different methods have been developed for strand-specific RNA-seq, which fall into two main classes. One class of methods relies on marking one strand by chemical modification. These modification methods essentially follow the standard protocol with the exception of these marking steps. A well-known example is the dUTP second strand marking method, in which dUTP is incorporated in the second strand, preventing this strand from being amplified by PCR, thus leading to the exclusive amplification of the first strand [22]. The second class relies on attaching different adapters in a known orientation relative to the 5' and 3' ends of the original mRNA. A major representative of this class is the Illumina RNA ligation method, in which adapters are ligated to the RNA in a sequential manner. Other methods of this class are the “not so random” (NSR) method, which relies on first and second-strand cDNA synthesis from degenerate primers tailed with adapter sequences [23] or the SMART method, in which reverse transcription from an adapter-tailed primer is followed by template switching at the 5' end of the RNA template. SMART is based on the fact that reverse transcriptases can add three non-templated C residues to which an oligonucleotide with three terminal G's can bind; this oligonucleotide will then serve as a template for continued cDNA synthesis [24]. A variant is the SMART-RNA ligation (hybrid) method, in which first strand cDNA synthesis is preceded by 3' adapter ligation as in the RNA ligation method.
RNA-seq protocols are technically more challenging than DNA-seq protocols and are often biased procedures. Common types of bias include low complexity (many reads with the same starting point), uneven coverage across different regions of transcription units, and antisense artifacts in the case of stranded libraries.
[Fig. 2 panel labels: rRNA depletion/mRNA enrichment; fragmentation; random priming and reverse transcription; second strand synthesis; end repair, A-addition, adapter ligation; 3’ adapter ligation; 5’ adapter ligation; reverse transcription; PCR]
Fig. 2 – The most common RNA-seq protocols fall in three main classes. (A) Classical Illumina protocol. Random-primed double-stranded cDNA synthesis is followed by adapter ligation and PCR. (B) One class of strand-specific methods relies on marking one strand by chemical modification. The dUTP second strand marking method follows basically the same procedure as the classical protocol except that dUTP is incorporated during second strand cDNA synthesis, preventing this strand from being amplified by PCR. Most current transcriptome library preparation kits follow the dUTP method. (C) The second class of strand-specific methods relies on attaching different adapters in a known orientation relative to the 5' and 3' ends of the RNA transcript. The Illumina ligation method is a well-known example of this class and is based on sequential ligation of two different adapters. Most current small RNA library preparation kits follow the RNA ligation method.
classical Illumina protocol (unstranded)
Van Dijk et al. (2014). Experimental Cell Research, 322(1), 12–20. doi:10.1016/j.yexcr.2014.01.008
sequential ligation of two different adapters
RNA extraction poly(A) enrichment or
ribo-depletion
dUTP stranded library preparation
QC!
RNA-seq workflow overview
Sequencing
Total RNA extraction
RNA
cDNA with adapters
cells
fragments
flowcell with primers
http://informatics.fas.harvard.edu/test-tutorial-page/
Cluster generation
bridge amplification → denaturation → cluster generation → removal of complementary strands → identical fragment copies remain
http://informatics.fas.harvard.edu/test-tutorial-page/
Sequencing by synthesis
Illumina Sequencing Workflow
1. extend 1st base 2. read 3. deblock
repeat for 50 – 100 bp generate base calls
Image from Illumina
labelled dNTP
Typical biases of Illumina sequencing
• sequencing errors
• miscalled bases
• PCR artifacts (library preparation)
• duplicates (due to low amounts of starting material)
• length bias
• GC bias
Figure from Love et al. (2016). Nat Biotech, 34(12). More details & refs in course notes (esp. Table 6).
RNA-seq-specific
sample-specific problems!
General sources of biases (not inherently sample-specific)
inclusion of multi-mapped reads exclusion of multi-mapped reads
• issues with the reference • CNV • mappability
• inappropriate data processing
RAW SEQUENCING READS Let the data wrangling begin!
Bioinformatics workflow of RNA-seq analysis
Images (.tif)
  → Base calling & demultiplexing [Bustard/RTA/OLB, CASAVA]
Raw reads (.fastq), quality control with FASTQC
  → Mapping [STAR]
Aligned reads (.sam/.bam)
  → Counting [featureCounts]
Read count table (.txt)
  → Normalizing [DESeq2, edgeR]
Normalized read count table (.Robj)
  → DE test & multiple testing correction [DESeq2, edgeR, limma]
List of fold changes & statistical values (.Robj, .txt)
  → Filtering [Customized scripts]
Downstream analyses on DE genes
Where are all the reads? International Nucleotide Sequence Database Collaboration
Sequence Read Archive:
• GenBank: http://www.ncbi.nlm.nih.gov/genbank/
• DDBJ: http://www.ddbj.nig.ac.jp/intro-e.html
• ENA: https://www.ebi.ac.uk/ena/
The SRA is the main repository for publicly available DNA and RNA sequencing data of which three instances are maintained world-wide.
Let’s download!
• We will work with a data set submitted by Gierlinski et al.
• they deposited the sequence files with SRA – we will retrieve it via ENA (https://www.ebi.ac.uk/ena/)
• accession number: ERP004763
Useful commands: ls, mkdir, wget, cut, grep, awk
Course notes @ https://chagall.med.cornell.edu
See Section 2 (Raw Data) for download instructions etc.
FASTQ file format = FASTA + quality scores
1 read → 4 lines!
2.2 Storing sequencing reads: FASTQ format
Example data: Throughout the course, we will be working with sequencing reads from the most comprehensive RNA-seq dataset to date that contains mRNA from 48 replicates of two S. cerevisiae populations: wildtype and snf2 knock-out mutants (Gierlinski et al., 2015; Schurch et al., 2015). All 96 samples were sequenced on one flowcell (Illumina HiSeq 2000); each sample was distributed over seven lanes, which means that there are seven technical replicates per sample. The accession number for the entire data set (consisting of 7 x 2 x 48 raw read files) is ERP004763.
? Use the information from the file ERP004763_sample_mapping.tsv to download all FASTQ files related to the biological replicates no. 1 of sample type “SNF2” as well as of sample type “WT”. Try to do it via the command line and make sure to create two folders (e.g., SNF2_rep1 and WT_rep1) of which each should contain seven FASTQ files in the end.
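The selection step of this exercise could be sketched in Python as follows. This is only a sketch under assumptions: the column names (`RunAccession`, `Lane`, `Sample`, `BiolRep`) and the example rows are hypothetical stand-ins for the contents of the mapping file, and the actual download (e.g., with wget) is left out.

```python
import csv
import io

def runs_for_replicate(tsv_text, sample, biol_rep):
    """Return the run accessions of one biological replicate of one sample type.

    Assumes a tab-separated table with a header line containing the
    (hypothetical) columns RunAccession, Sample and BiolRep.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["RunAccession"] for row in reader
            if row["Sample"] == sample and row["BiolRep"] == str(biol_rep)]

# Made-up excerpt of a mapping table (the accessions are placeholders)
example = (
    "RunAccession\tLane\tSample\tBiolRep\n"
    "RUN0001\t1\tWT\t1\n"
    "RUN0002\t2\tWT\t1\n"
    "RUN0003\t1\tSNF2\t1\n"
    "RUN0004\t1\tWT\t2\n"
)
print(runs_for_replicate(example, "WT", 1))
```

Each returned accession would then correspond to one FASTQ file to place into the folder of the respective sample type.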
Currently, raw reads are most commonly stored as FASTQ files. However, details of the file formats may vary widely depending on the sequencing platform, the lab that released the data, or the data repository. For a more comprehensive overview of possible file formats of raw sequencing data, see the NCBI’s file format guide: https://www.ncbi.nlm.nih.gov/books/NBK242622/.
The FASTQ file format was derived from the simple text format for nucleic acid or protein sequences, FASTA. FASTQ bundles the sequence of every single read produced during a sequencing run together with the quality scores. FASTQ files are uncompressed and quite large because they contain the following information for every single sequencing read:
1. @ followed by the read ID and possibly information about the sequencing run
2. sequenced bases
3. + (perhaps followed by the read ID again, or some other description)
4. quality scores for each base of the sequence (ASCII-encoded, see below)
Again: be aware that this is not a strictly defined file format – variations do exist and may cause havoc!
Here’s a real-life example snippet of a FASTQ file downloaded from ENA:
    $ zcat ERR459145.fastq.gz | head
    @ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1
    GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC
    +
    @7<DBADDDBH?DHHI@DH>HHHEGHIIIGGIFFGIBFAAGAFHA'5?B@D
    @ERR459145.2 DHKW5DQ1:219:D0PT7ACXX:2:1101:2652:2237/1
    GCAGCATCGGCCTTTTGCTTCTCTTTGAAGGCAATGTCTTCAGGATCTAAG
    +
    @@;BDDEFGHHHHIIIGBHHEHCCHGCGIGGHIGHGIGIIGHIIAHIIIGI
    @ERR459145.3 DHKW5DQ1:219:D0PT7ACXX:2:1101:3245:2163/1
    TGCATCTGCATGATCTCAACCATGTCTAAATCCAAATTGTCAGCCTGCGCG
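The four-lines-per-read layout can be illustrated with a small Python reader. This is a minimal sketch that assumes strictly four lines per record (no wrapped sequence lines); the function name `read_fastq` is made up for illustration, and real files with other variations of the format would need a more tolerant parser.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples, assuming 4 lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a 4-line FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual

# The first two records of the snippet above
example = (
    "@ERR459145.1 DHKW5DQ1:219:D0PT7ACXX:2:1101:1590:2149/1\n"
    "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC\n"
    "+\n"
    "@7<DBADDDBH?DHHI@DH>HHHEGHIIIGGIFFGIBFAAGAFHA'5?B@D\n"
    "@ERR459145.2 DHKW5DQ1:219:D0PT7ACXX:2:1101:2652:2237/1\n"
    "GCAGCATCGGCCTTTTGCTTCTCTTTGAAGGCAATGTCTTCAGGATCTAAG\n"
    "+\n"
    "@@;BDDEFGHHHHIIIGBHHEHCCHGCGIGGHIGHGIGIIGHIIAHIIIGI\n"
)
records = list(read_fastq(io.StringIO(example)))
print(len(records))  # number of reads = number of lines / 4
```

Note that the quality lines of both records start with "@", so counting lines that begin with "@" would overestimate the number of reads. For a compressed file, the same function could be fed a handle opened with `gzip.open(path, "rt")`.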
! For paired-end (PE) sequencing runs, there will always be two FASTQ files – one for the forward reads, one for the reverse reads. Once you have downloaded the files for a PE run, make sure you understand how the origin of each read (forward or reverse read) is encoded in the read name information as some downstream analysis tools may require you to combine the two files into one.
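How the read origin can be encoded and used when combining the two files may be sketched as follows. The sketch assumes the "/1" and "/2" read-name suffix convention (other conventions exist, e.g. a separate field after a space) and that both files list the fragments in the same order; the helper names and the toy records are hypothetical.

```python
def mate_info(read_name):
    """Split a read name into (fragment_name, mate) for the '/1' and '/2' suffixes."""
    if read_name.endswith("/1"):
        return read_name[:-2], 1
    if read_name.endswith("/2"):
        return read_name[:-2], 2
    return read_name, None  # origin not encoded as a suffix

def interleave(forward, reverse):
    """Merge two equally ordered lists of (name, seq, qual) records into one list."""
    if len(forward) != len(reverse):
        raise ValueError("files contain different numbers of reads")
    merged = []
    for fwd, rev in zip(forward, reverse):
        if mate_info(fwd[0])[0] != mate_info(rev[0])[0]:
            raise ValueError("read names do not match: %s vs %s" % (fwd[0], rev[0]))
        merged.extend([fwd, rev])
    return merged

# Toy records (made up for illustration)
fwd = [("frag1/1", "ACGT", "IIII"), ("frag2/1", "GGCC", "IIII")]
rev = [("frag1/2", "TTAA", "IIII"), ("frag2/2", "CCGG", "IIII")]
merged = interleave(fwd, rev)
print([name for name, _, _ in merged])
```

Checking that the fragment names agree before merging guards against files that were sorted or filtered independently.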
? 1. Count the number of reads stored in a FASTQ file.
2. Extract just the quality scores of the first 10 reads of a FASTQ file.
3. Concatenate the two FASTQ files of a PE run.
© 2015 Applied Bioinformatics Core | Weill Cornell Medical College
1. @Read ID and sequencing run information
2. sequence
3. + (additional description possible)
4. quality scores
Base quality score
base error probability p, e.g. 10^-4 (1e-4)
→ Phred score = -10 x log10(p), e.g.: 40
→ turn score into ASCII symbol (http://www.ascii-code.com/)
→ “FASTQ score”, e.g.: “I” (Phred+33: ASCII code 40 + 33 = 73)
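The conversion chain from error probability to Phred score to ASCII symbol can be written out in a few lines of Python. This is a minimal sketch; the function names are made up, and the default offset of 33 assumes the Sanger / Illumina 1.8+ encoding (older Illumina pipelines used an offset of 64).

```python
import math

def phred_from_error_prob(p):
    """Phred score Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse relation: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def encode_phred(q, offset=33):
    """Turn a Phred score into its ASCII symbol (offset 33 assumed by default)."""
    return chr(int(round(q)) + offset)

def decode_quality(qual, offset=33):
    """Turn a FASTQ quality string back into a list of Phred scores."""
    return [ord(char) - offset for char in qual]

q = phred_from_error_prob(1e-4)
print(round(q), encode_phred(q), encode_phred(q, offset=64))
print(decode_quality("@7<"))
```

With an error probability of 1e-4 the Phred score is 40, which is written as "I" with offset 33 and as "h" with offset 64, so the same symbol means different things depending on the encoding.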
Base quality scores
Indicator.[6] The Illumina manual[7] (page 30) states the following: If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores)... This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses. Also, the quality score encoded as "B" letter may occur internally within reads at least as late as pipeline version 1.6, as shown in the following example:
    @HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
    TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
    +HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
    efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB
An alternative interpretation of this ASCII encoding has been proposed.[8] Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower than the mean observed score of a given run.
Starting in Illumina 1.8, the quality scores have basically returned to the use of the Sanger format (Phred+33).
For raw reads, the range of scores will depend on the technology and the base caller used, but will typically be up to 41 for recent Illumina chemistry. Since the maximum observed quality score was previously only 40, various scripts and tools break when they encounter data with quality values larger than 40. For processed reads, scores may be even higher. For example, quality values of 45 are observed in reads from Illumina's Long Read Sequencing Service (previously Moleculo).
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
0........................26...31.......40
                         -5....0........9.............................40
                               0........9.............................40
                                  3.....9.............................40
0.2......................26...31........41

S - Sanger Phred+33, raw reads typically (0, 40)
X - Solexa Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above).
L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)
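Because the ranges overlap, the encoding of a file can only be guessed from the symbols that actually occur. The following Python sketch applies the ranges listed above for raw reads; the function name is made up, and the thresholds (symbols below ASCII 59 imply an offset of 33, symbols above ASCII 74 imply an offset of 64) are a simplifying assumption that may not hold for processed reads with unusually high scores.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the range of observed symbols.

    Returns None if the observed range is compatible with both offsets.
    """
    codes = [ord(char) for qual in quality_strings for char in qual]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:    # below ';' : only the Phred+33 encodings reach this far down
        return 33
    if highest > 74:   # above 'J' : beyond the typical Phred+33 raw-read range
        return 64
    return None        # ambiguous, more reads are needed

print(guess_offset(["@7<DBADDDBH?DHHI@DH"]))
print(guess_offset(["efcfffffcfeefffc", "BBBBBBBB"]))
print(guess_offset(["ABCDEFGHI"]))
```

In practice the guess becomes more reliable the more reads are inspected, since low-quality symbols are rare in good data.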
Color space
For SOLiD data, the sequence is in color space, except the first position. The quality values are those of the Sanger format. Alignment tools differ in their preferred version of the quality values: some include a quality score (set to 0, i.e. '!') for the leading nucleotide, others do not. The sequence read archive includes this quality score.
also see Table 2 in the course notes image from https://en.wikipedia.org/wiki/FASTQ_format
• each base has a certain error probability (p)
• Phred score = -10 x log10(p)
• Phred scores are ASCII-encoded with an encoding-specific offset, e.g., “I” COULD represent Phred score 40 (Phred+33) or Phred score 9 (Phred+64)
Quality control of raw reads: FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are:
• Import of data from BAM, SAM or FastQ files (any variant)
• Providing a quick overview to tell you in which areas there may be problems
• Summary graphs and tables to quickly assess your data
• Export of results to an HTML based permanent report
• Offline operation to allow automated generation of reports without running the interactive application
$ mat/software/FastQC/fastqc
$ mat/software/anaconda2/bin/multiqc
not specific for RNA-seq data!
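To give an idea of the kind of summary such a quality control check produces, the mean base quality per read position can be computed as sketched below. This is not FastQC's implementation, only an illustration of the principle; the function name is made up and an ASCII offset of 33 is assumed.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score for every read position (offset 33 assumed)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, char in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# Toy quality strings of different lengths (made up for illustration)
print(mean_quality_per_position(["II", "5I!"]))
```

A drop of the mean quality towards the end of the reads is a typical pattern that such a per-position summary makes visible.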
EXPERIMENTAL DESIGN How to avoid spurious signals and drowning in noise
How deep is deep enough?
Goals that require more, longer, and possibly paired-end reads: • quantification of lowly expressed genes • identification of genes with small changes between conditions • investigation of alternative splicing/isoform quantification • identification of novel transcripts, chimeric transcripts • de novo transcriptome assembly
for DGE (logFC ~ 2) in mammals: 20–50 million single reads (SR), 75 bp
Why do we need replicates?
“Samples are our windows to the population, and their statistics are used to estimate those of the population.” Martin Krzywinski & Naomi Altman
Goal: Identify differences in expression for every gene.
…and “differences” should preferably be due to our experiment, not noise!
doi:10.1038/nmeth.2613
Invest in replicates!
• recommended: 6 biological replicates per condition for DGE of strongly changing genes (logFC >= 2) [based on insights from the fairly simple yeast transcriptome]
Outlier fraction
The poor correlations shown by some replicates are the result of a small proportion of genes with atypical read counts. These outliers can be identified by comparing each gene’s expression in an individual replicate with the trimmed mean across all replicates. Specifically, the $n_t$ largest and smallest values are trimmed from the set of replicates for a gene before calculating the mean ($\bar{x}_{g;n_t}$) and standard deviation ($s_{g;n_t}$). Genes are then identified as outliers if $|x_{gi} - \bar{x}_{g;n_t}| > n_s s_{g;n_t}$, where $n_s$ is a constant. Fig. 2b shows the fraction $f_i$ of all genes identified as outliers for each replicate $i$, for $n_t = 3$ and $n_s = 5$. As expected, the anomalous replicates with high outlier fraction in Fig. 2b correspond well with the poorly correlating replicates in Fig. 2a. Increasing $n_t$ and/or decreasing $n_s$ enhances the outlier fraction in replicates already identified as anomalous; for example, reducing the standard deviation limit to $n_s = 3$ boosts the outlier fraction in these replicates by a factor ~2–3.
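The outlier criterion described above can be sketched in Python. This is an illustration rather than the authors' code: the function names and toy values are made up, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the excerpt does not say which estimator was used.

```python
from statistics import mean, stdev

def is_outlier(values, i, n_t=3, n_s=5.0):
    """Test whether replicate i is an outlier for one gene.

    values: expression of the gene in all replicates.
    The n_t largest and n_t smallest values are trimmed before the
    mean and standard deviation are calculated.
    """
    trimmed = sorted(values)[n_t:len(values) - n_t]
    if len(trimmed) < 2:
        raise ValueError("too few replicates left after trimming")
    return abs(values[i] - mean(trimmed)) > n_s * stdev(trimmed)

def outlier_fraction(expression, i, n_t=3, n_s=5.0):
    """Fraction of genes for which replicate i is flagged as an outlier.

    expression: dict mapping gene name -> list of per-replicate values.
    """
    flags = [is_outlier(values, i, n_t, n_s) for values in expression.values()]
    return sum(flags) / len(flags)

# Toy data with 10 replicates (made up for illustration)
expression = {
    "geneA": [8, 9, 10, 11, 12, 13, 14, 15, 16, 200],
    "geneB": [5, 5, 6, 6, 7, 7, 8, 8, 9, 9],
}
print(outlier_fraction(expression, 9))
```

In the toy data, the last replicate is an outlier for geneA only, so half of the genes are flagged for that replicate.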
Gene read depth profiles
The atypical total read counts of outlier genes are, in our case, the result of an atypical, strongly non-uniform, read depth gene profile. An example of the difference of the read depth profiles between clean and “bad” replicates is shown in Fig. 3 for the gene YHR215W. The distribution of reads from the example “bad” replicate (WT replicate 21, red line) is much less uniform than the mean read depth from the other “clean” replicates (black line) with distinct peaks in the distributions that are 50 bp long. These suspicious features are found universally in all outlier genes in “bad” replicates and can be also seen in some (but not all) non-outlier genes in these replicates. It is likely that the cause of this atypical read distribution is uneven priming during the PCR amplification step of the library preparation, prior to the sequencing. The level of gene non-uniformity in each replicate can be quantified with a reduced chi-squared statistic for each gene in each replicate, defined as (…)
Fig. 3. Read depth profiles of YHR215W (PHO12). The black line indicates the mean read counts from all “clean” WT replicates for a given genomic position in the gene YHR215W (the set of “clean” replicates is defined in Section 3.1). The grey lines show the mean read depth plus/minus one standard deviation. The red line illustrates the read depth profile from a single example “bad” replicate (WT replicate 21). The block diagram at the bottom shows the simple gene structure of YHR215W.
[Plot: log2 counts of Gene X in condition 1 vs. condition 2; y-axis range 10.16–10.26]
Gierliński et al. (2015). Bioinformatics, 31(22), 3625–3630. & Schurch et al. (2016) RNA.
Replicates
Technical replicates: the same RNA extraction, split across repeated library preps and/or sequencing lanes
Biological replicates: RNA from an independent growth of cells/tissue, each with its own RNA extraction, library prep, and sequencing lane(s)
also see course notes and Blainey et al. (2014) Nature Methods, 11(9) 879–880.
Gilad & Mizrahi-Man (2015). F1000Research 4:121
“Once we accounted for the batch effect [i.e., mouse and human samples being sequenced on two different machines] (…), the comparative gene expression data no longer clustered by species, and instead, we observed a clear tendency for clustering by tissue.”
Batch effects can happen everywhere
“Overall, our results indicate that there is considerable RNA expression diversity between humans and mice, well beyond what was described previously, likely reflecting the fundamental physiological differences between these two organisms.”
Lin, Lin, and Snyder (2014). PNAS 111:48
ENCODE’s* study design was not optimal
* not just ENCODE: see e.g. Leek et al. (2010) Nat Rev Gen 11(10) 733-739 or Jaffe & Irizarry (2014) Genome Biol 15(R31) 1–9
A very good read (including the reviews and comments that discuss many scientific as well as ethical issues): https://f1000research.com/articles/4-121/v1
Species was confounded with (at least):
• sequencer
• sex
• age
• tissue handling
human data: deceased organ donors
mouse data: 10-week-old littermates
Avoiding bias
Block what you can, randomize what you cannot.
Completely randomized design
Restricted randomized design
Blocked & randomized design
What factors are of interest? Which ones might introduce noise? Which nuisance factors do you absolutely need to account for?
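“Block what you can, randomize what you cannot” can be sketched as a small Python routine that spreads the samples of every condition evenly over the blocks and randomizes the processing order within each block. This is a sketch under assumptions: the function name is made up, and a “block” stands for whatever nuisance unit applies (e.g., an RNA extraction batch or a library prep day).

```python
import random

def blocked_randomized_design(samples_by_condition, n_blocks, seed=0):
    """Assign samples to blocks so that every condition is spread over all blocks.

    samples_by_condition: dict mapping condition -> list of sample names.
    Returns a list of blocks, each a randomly ordered list of
    (condition, sample) tuples.
    """
    rng = random.Random(seed)  # fixed seed keeps the design reproducible
    blocks = [[] for _ in range(n_blocks)]
    for condition, samples in samples_by_condition.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)                 # randomize which sample goes where
        for i, sample in enumerate(shuffled):
            blocks[i % n_blocks].append((condition, sample))  # balance the blocks
    for block in blocks:
        rng.shuffle(block)                    # randomize the order within a block
    return blocks

design = blocked_randomized_design(
    {"WT": ["WT1", "WT2", "WT3", "WT4"], "SNF2": ["SNF2_1", "SNF2_2", "SNF2_3", "SNF2_4"]},
    n_blocks=2,
)
for number, block in enumerate(design, start=1):
    print(number, block)
```

With this layout no block contains samples of only one condition, so a block effect cannot be mistaken for a difference between conditions.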
Krzywinski & Altman (2014) Nature Methods 11(7)
Auer & Doerge (2010). Genetics, 185(2), 405–16.
Make sure the sequencing core multiplexes all samples!
Typical RNA-seq set-up
• keep the technical nuisance factors (harvest date, RNA extraction kit, sequencing date…) to a minimum
• cover only as much of the biological variation as needed (just keep possible restrictions about your conclusions in mind for later)
Summary Day 1
• RNA-seq analysis is not a completely solved issue – but DE analysis on a gene level is decently mature and the field seems to gravitate towards some sort of standard
• no analysis tool can enforce (or replace!) common sense and knowledge about the biology behind the experiment
• crap in, crap out
• more replicates are often better investments than more reads
• FastQC and MultiQC are great tools to detect possible technical nuisance factors