Introduction to Cancer Genomics
Gene expression data analysis – part I
David Gfeller
Computational Cancer Biology
Ludwig Center for Cancer [email protected]
Overview
1. Basic understanding of RNA-Seq data processing.
2. Differential expression.
3. Dimensionality reduction.
Examples of R code
Goals
• Help you understand what can be done with a computer -> programming logic
• Give you some basic idea of how to ask the computer to perform some tasks -> syntax.
• Show you a few examples of gene expression data analysis in R that you could reuse for your projects (see also practical).
Gene expression experiments
• Microarrays: Chip with DNA probes that pair with DNA (reverse-transcribed RNA, i.e. cDNA) in a sample. Intensity is measured as a light signal.
– Very popular from 2000 to 2010.
• RNA-Seq: Directly count how many transcripts (mRNA molecules) originate from each gene in a sample.
– Increasingly replacing microarrays for gene expression analyses.
RNA-Seq
• Gene expression => Quite easy (count the reads)
• Gene fusion => More difficult (especially for new fusion events)
• Splicing => More difficult (especially for poorly annotated isoforms)
(Figure: RNA -> RNA fragmentation -> Reverse transcription -> Adaptors + amplification -> Sequencing (>100M reads: ACCTAG…, CGGTAA…, ATGGCA…, TGGGAC…, TATAGG…) -> Map to reference transcriptome (Gene A, Gene B))
1 - Typical output of RNA-Seq
Raw sequences:
- Fastq format (sequence of the reads + quality information)
- Bam format (compressed)
- Sra format (compressed)
Processed data:
- Counts: Number of reads mapping to each gene/transcript.
How to think about these data in a computer
(Figure: matrix with S samples as rows and N genes as columns)
Sample1: gene1: 254; gene2: 1284; gene3: 7234; …
Sample2: gene1: 5; gene2: 362; gene3: 0; …
Sample3: gene1: 8902; gene2: 2199; gene3: 722; …
Each expression value corresponds to a scalar. Each sample corresponds to a vector. All samples form a matrix (M).
M[s,n] corresponds to the expression of gene “n” in sample “s”.
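As a minimal sketch, the three example samples above could be stored in R as follows. Only the first three genes are used, and the object names (sample1, sample2, sample3) are illustrative assumptions rather than names from the tutorial.

```r
# Expression values of the first three genes, copied from the example above
sample1 <- c(gene1 = 254,  gene2 = 1284, gene3 = 7234)
sample2 <- c(gene1 = 5,    gene2 = 362,  gene3 = 0)
sample3 <- c(gene1 = 8902, gene2 = 2199, gene3 = 722)

# Stack the sample vectors as rows: S = 3 samples x N = 3 genes
M <- rbind(Sample1 = sample1, Sample2 = sample2, Sample3 = sample3)

dim(M)    # number of samples (rows) and of genes (columns)
M[2, 3]   # expression of gene 3 in sample 2
```

Each vector is one sample, and rbind() stacks the vectors so that M[s,n] reads "sample s, gene n", matching the convention used in the rest of the slides.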
Computers like numbers
In R:
- Scalar (numeric)
- Vector (array)
- Matrix (two-dimensional array, e.g. S x N)
Gene expression data are “naturally” numerical, which makes them especially well suited to analysis with computers.
Many other biological objects can be represented as vectors or matrices:
- Protein/DNA sequences <-> vectors of letters/numbers
- Protein structures <-> vectors/matrices of 3D coordinates
- Interactions <-> N x N matrix with 1’s and 0’s
- Image <-> matrix of pixels (1/0 for a two-color image)
- Set of measurements <-> vector of values
How to think about these data in a computer
In R, once you load your data into a matrix (M), you can very easily:
- Print one specific column: M[,2]
- Print one specific row: M[1,]
- Plot two genes against each other to look at their correlation: plot(M[,5], M[,7])
- Perform operations on rows or columns.
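The operations listed above can be tried on a small toy matrix. This is only a sketch: the values are made up, and the names total.reads, gene.means and r are assumptions introduced for illustration.

```r
# Toy matrix: 3 samples (rows) x 4 genes (columns); the values are made up
M <- matrix(c(10, 0, 5, 20,
              12, 1, 4, 18,
              30, 2, 9, 60), nrow = 3, byrow = TRUE)

M[, 2]                      # one column: gene 2 in all samples
M[1, ]                      # one row: all genes in sample 1
total.reads <- rowSums(M)   # operation on rows: total count per sample
gene.means  <- colMeans(M)  # operation on columns: mean count per gene
r <- cor(M[, 1], M[, 4])    # Pearson correlation between gene 1 and gene 4
```

rowSums() and colMeans() are built-in R functions that replace an explicit loop over rows or columns.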
Let’s practice…
• Create an empty directory ‘Tutorial_Gfeller’ and a subdirectory ‘Tutorial_Gfeller/Data’.
• Download the file GSE93722_RAW.tar at:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE93722
• Put it in Tutorial_Gfeller/Data/, uncompress it, then uncompress the zip files it contains. Each of the files corresponds to the gene expression profiling of a melanoma sample.
• Open RStudio.
• Set the working directory (Session -> Set Working Directory) to ‘Tutorial_Gfeller’.
• Create a new R script file (File -> New File -> R Script); this is where you will write your code. Save it in ‘Tutorial_Gfeller’ as ‘file’.R.
Let’s load the data…
• Each GSMxxx corresponds to one sample.
• First have a look at the files in Excel (or any text editor). To start with, we will focus on the ‘expected_count’ column.
• The command to load a file is read.delim():
m1 <- read.delim("Data/GSE93722_RAW/GSM2461003_LAU125.genes.results.txt")
Here m1 is the name of the object that will store the data, and the quoted string is the path to the file to be loaded.
• Then execute the command in the Console (by pasting it, or with command+Enter).
– Now you can look at the elements of m1 (e.g., for the first line, type m1[1,] in the console). Does it correspond to the first line of the file?
– With dim(m1) you can check the dimensions of m1.
Let’s load the data
• Load the other files into m2 (GSM2461007_LAU1255), m3 (GSM2461009_LAU1314) and m4 (GSM2461005_LAU355).
• Build a matrix taking the fifth column in each file:
M <- matrix(nrow=4, ncol=dim(m1)[1]) # Initialize an empty matrix with the correct dimensions
M[1,] <- m1[,5] # In the first row, put the 5th column of m1
• Do the same with m2, m3 and m4 (e.g., M[2,] <- m2[,5]). If you had many files, you would use a loop (see exercises).
• Try to query any entry of your matrix (e.g., M[3,5]). Do you get the expected number?
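The loop mentioned above could look like the following sketch. So that it runs on its own, it first writes two tiny stand-in files to a temporary directory; their names and contents are made up, and the only assumption carried over from the tutorial is that the files are tab-delimited with ‘expected_count’ in the fifth column.

```r
# Create two tiny stand-in files (hypothetical names and values)
tmp <- tempfile("rsem_demo")
dir.create(tmp)
toy <- data.frame(gene_id = c("g1", "g2", "g3"), transcript_id = "t",
                  length = 1000, effective_length = 900,
                  expected_count = c(5, 0, 12))
write.table(toy, file.path(tmp, "GSM_A.genes.results.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
toy$expected_count <- c(7, 3, 0)
write.table(toy, file.path(tmp, "GSM_B.genes.results.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# List the files, then fill one row of M per file
files <- sort(list.files(tmp, pattern = "genes.results.txt$", full.names = TRUE))
first <- read.delim(files[1])
M <- matrix(nrow = length(files), ncol = nrow(first))
for (s in seq_along(files)) {
  m <- read.delim(files[s])
  M[s, ] <- m[, 5]   # 5th column = expected_count
}
```

With the real data, the temporary directory would be replaced by the directory that holds the uncompressed GSM files. The loop assumes that all files list the same genes in the same order.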
Genes have (many) names…
• In these files, we have Ensembl gene IDs… We want to convert them to common gene names.
– We need a file with the mapping (two columns, one for Ensembl IDs, one for gene names).
• Go to: https://www.ensembl.org/biomart/martview/
– Select Ensembl Genes 94, then Human genes.
– In Attributes, select GENE: -> Gene stable ID and EXTERNAL: -> HGNC symbol.
– Click on Results, then ‘Unique results only’, and ‘Go’ to save to a local file (put the file in Tutorial_Gfeller/Data).
Then in R
• Open the file:
mapping <- read.delim("Data/mart_export.txt")
• Use the match() function to find the position in mapping of all the genes for which you have expression data in m1:
i <- match(m1[,1], mapping[,1])
• Then build a vector with the gene names:
gene <- as.character(mapping[i,2])
N <- length(gene)
• Verify that the mapping is correct by checking a few examples.
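To see what match() returns, here is a small self-contained sketch. The identifiers are made up (they are not real Ensembl IDs) and the table is a stand-in for the BioMart export.

```r
# Hypothetical mapping table and expression IDs (made-up identifiers)
mapping <- data.frame(ensembl = c("ENSG_A", "ENSG_B", "ENSG_C"),
                      symbol  = c("EGFR", "CD19", "TP53"),
                      stringsAsFactors = FALSE)
ids <- c("ENSG_C", "ENSG_X", "ENSG_A")   # "ENSG_X" is absent from the mapping

# Position of each ID in the mapping table (NA if the ID is not found)
i <- match(ids, mapping[, 1])
gene <- as.character(mapping[i, 2])

# Number of IDs without a gene name
n.unmapped <- sum(is.na(gene))
```

IDs that are missing from the mapping give NA in both i and gene, so counting the NA values is one simple way to verify the mapping.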
Computers like simple and sequential calculations
• Additions/subtractions and multiplications/divisions
– You need to decompose any problem into a set of “simple” operations.
– You need to tell the computer about every step of your calculations (e.g., loop over all entries in one column).
Example: Find the average expression of a gene (e.g., EGFR) across samples
How to do it on a computer
(Figure: matrix M with the column M[,n] of gene = EGFR highlighted)
1) Have a matrix “M” with all expression values and a vector “gene” with the names of the genes (columns of M).
2) Find the column corresponding to your gene:
n <- which(gene == "EGFR")
3) Initialize a scalar:
av <- 0
4) Go through each element of the column:
S <- dim(M)[1]
for (s in 1:S) { av <- av + M[s,n] }
5) Normalize your value:
av <- av/S
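The five steps above can be checked against R's built-in functions. The sketch below uses a made-up 4 x 3 matrix; av.builtin and av.all are names introduced here for illustration.

```r
# Toy data: 4 samples x 3 genes (values are made up)
M <- matrix(c(1, 10, 100,
              3, 20, 200,
              5, 30, 300,
              7, 40, 400), nrow = 4, byrow = TRUE)
gene <- c("TP53", "EGFR", "CD19")

n <- which(gene == "EGFR")

# Explicit loop, following the steps above
S <- dim(M)[1]
av <- 0
for (s in 1:S) {
  av <- av + M[s, n]
}
av <- av / S

# Built-in equivalents
av.builtin <- mean(M[, n])   # average of one gene
av.all <- colMeans(M)        # averages of all genes at once
```

The loop and mean() give the same number; the loop makes the logic explicit, while the built-in function is shorter and less error-prone.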
How programming languages work
• The exact commands change between programming languages (R, Python, Perl, C, MATLAB), but the logic (“grammar”) remains the same.
• The syntax (“words”) can be learned from many online resources.
• In these two days, we will focus on R, since it is very convenient for graphical visualization of the data.
– R has many built-in functions (e.g., mean()), but it is important to understand the logic.
Computational analyses
• Alignments
• Isoforms (splicing)
• Low complexity regions (repeats)
• Variable regions (TCR, MHC)
• Sequencing errors
• Poorly annotated regions / genomes
(Figure: >100M reads (ACCTAG…, CGGTAA…, ATGGCA…, TGGGAC…, TATAGG…) mapped to the reference transcriptome, Gene A and Gene B)
What else needs to be considered
• Longer genes have more reads.
If you want to compare expression between genes, you need to renormalize by gene length.
(Figure: reads mapped to Gene A and Gene B)
• Different samples can have different total numbers of reads (e.g., different sequencing depth).
If you want to compare expression between samples, you need to renormalize by the total number of reads.
(Figure: reads mapped to Gene A and Gene B in Sample 1 and in Sample 2)
How to do it (naïve way)
N <- dim(M)[2]
M.norm <- matrix(nrow=S, ncol=N) # Initialize an empty matrix
for (s in 1:S) {
  tot <- 0
  for (n in 1:N) {
    tot <- tot + M[s,n] # Compute the sum over row s
  }
  for (n in 1:N) {
    M.norm[s,n] <- M[s,n]/tot # Normalize row s
  }
}
M.norm <- M.norm*1000000 # Avoid having too small numbers
(Figure: matrix M annotated with the values 10’362’093, 12’482’546 and 7’542’733, presumably the total number of reads in each sample)
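The same normalization can be written without explicit loops. This is a sketch on a made-up 2 x 3 matrix, not part of the original tutorial code.

```r
# Toy count matrix: 2 samples (rows) x 3 genes (columns)
M <- matrix(c(10, 30, 60,
               1,  1,  2), nrow = 2, byrow = TRUE)

# Dividing a matrix by a vector with one entry per row divides each row
# by its own total, because R recycles the vector down the columns
M.norm <- M / rowSums(M) * 1e6
```

This works because rowSums(M) has exactly one entry per row of M. The naïve double loop and this one-liner give the same result; the loop shows every step, the vectorized form is shorter and faster on large matrices.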
A few names commonly used
• Raw counts: Number of reads mapping to a gene.
• Scaled counts: Raw counts after renormalization by the total number of counts in the sample.
• Reads Per Kilobase Million (RPKM): Divide by the total number of reads and then by the gene length.
• Transcripts Per Million (TPM): Divide by the gene length and then normalize across all genes (i.e. the sum of TPMs of all genes is the same for all samples).
Multiply by 1’000’000 to have numbers that are easier to read.
Scaled counts vs TPM vs RPKM
• TPM are increasingly used.
• The sum of TPM values over all genes is always equal to 10^6 in each sample.
• Within a sample, the two values (TPM vs RPKM) are equivalent, up to a renormalizing factor.
• Scaled counts are enough to compare the same gene in different samples. TPM/RPKM are required to compare different genes.
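The definitions of RPKM and TPM can be written out in a few lines. This is a sketch with made-up counts and gene lengths; len.kb, rate and tpm.from.rpkm are names introduced here for illustration.

```r
# Toy data: 2 samples (rows) x 3 genes (columns), made-up values
counts <- matrix(c(100, 200, 700,
                    50, 400,  50), nrow = 2, byrow = TRUE)
len.kb <- c(1, 2, 7)   # gene lengths in kilobases (made up)

# RPKM: divide by the total number of reads (in millions),
# then by the gene length (in kilobases)
per.million <- rowSums(counts) / 1e6
rpkm <- sweep(counts / per.million, 2, len.kb, "/")

# TPM: divide by the gene length first,
# then rescale each sample so that it sums to 1e6
rate <- sweep(counts, 2, len.kb, "/")
tpm <- rate / rowSums(rate) * 1e6

# Within each sample, TPM is RPKM rescaled to sum to 1e6
tpm.from.rpkm <- rpkm / rowSums(rpkm) * 1e6
```

The last line illustrates the equivalence "up to a renormalizing factor": within each sample, rescaling the RPKM values so that they sum to 10^6 gives the TPM values.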
Studying expression of some gene in two types of samples
(Figure: column M[,n], with the samples of group G1 in a blue box and those of group G2 in a red box)
1) Define the groups:
G1 <- c(1,2); G2 <- c(3,4)
2) Find the column corresponding to the gene:
n <- which(gene == "CD19")
3) Take the mean over the blue box (G1):
av1 <- 0; for (s in G1) { av1 <- av1 + M.norm[s,n] }; av1 <- av1/length(G1)
4) Take the mean over the red box (G2):
av2 <- 0; for (s in G2) { av2 <- av2 + M.norm[s,n] }; av2 <- av2/length(G2)
5) Compare expression.
6) With more samples you can do statistics (t-test, boxplot,… see exercises).
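With more samples per group, the comparison could be sketched as follows. The expression values are made up, and a t-test is used only because the slide mentions it; it is not necessarily the right test for real RNA-Seq counts.

```r
# Made-up normalized expression of one gene in two groups of four samples
expr  <- c(10, 12, 11, 13, 30, 28, 33, 31)
group <- factor(rep(c("G1", "G2"), each = 4))

# Group means, without an explicit loop
av1 <- mean(expr[group == "G1"])
av2 <- mean(expr[group == "G2"])

# Two-sample t-test (R uses the Welch variant by default)
res <- t.test(expr ~ group)
res$p.value

# Visual comparison of the two groups
boxplot(expr ~ group, ylab = "Normalized expression")
```

The formula expr ~ group tells both t.test() and boxplot() to split the values by group.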
2 - Differential expression
(Figure: expression level in samples S1 and S2)
How can we quantify these differences?
Differential expression
• Log fold change:
– High expression genes can show big differences in counts (10’000 to 20’000), compared to low expression genes (10 to 20), even if they experience the same relative change.
– Better to use logarithms.
– 10 -> 20 has a log2 fold change of 1, the same as 10’000 -> 20’000.
– But warning: log transforms amplify changes in low expression!!!
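A short sketch of these points in R. The pseudocount of 1 in the last lines is an assumption added here as one common way to dampen the effect at low counts; it is not taken from the slides.

```r
# Same relative change at low and high expression
before <- c(low = 10, high = 10000)
after  <- c(low = 20, high = 20000)
lfc <- log2(after / before)   # log2 fold change of each gene

# At very low expression, one extra read already doubles the value
lfc.tiny <- log2(2 / 1)

# Assumption: adding a pseudocount of 1 before taking the ratio
# shrinks fold changes at low counts and avoids dividing by zero
lfc.pseudo <- log2((after + 1) / (before + 1))
lfc.tiny.pseudo <- log2((2 + 1) / (1 + 1))
```

With the pseudocount, the change from 1 to 2 reads gets a smaller log fold change than the change from 10’000 to 20’000, which is barely affected.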
Differential expression
• P-value:
– Gives a statistical significance, but is not trivial to estimate.
– Differences in the mean values are not enough! (Figure: three panels of expression levels.)
– Depending on your random model, a change from 1 to 2 may be more likely to appear “by chance” than a change from 1’000 to 2’000.
– Advanced statistical methods have been developed to estimate P-values in RNA-Seq data!
– Many genes (20’000) => many tests => higher chances that some differences are just due to chance. (Figure: expression levels of Gene 1 to Gene 11.)
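The multiple testing problem can be illustrated with simulated P-values. In this sketch no gene is truly different, so every "significant" gene is a false positive; the Benjamini-Hochberg correction in p.adjust() is used as one common choice and is an assumption, not a method named in the slides.

```r
set.seed(1)
n.genes <- 20000

# Simulated P-values when no gene is differentially expressed:
# under the null hypothesis, P-values are uniform between 0 and 1
p <- runif(n.genes)

# Without correction, about 5% of the genes pass P < 0.05 by chance
n.raw <- sum(p < 0.05)

# Adjust for multiple testing (Benjamini-Hochberg false discovery rate)
p.adj <- p.adjust(p, method = "BH")
n.adj <- sum(p.adj < 0.05)
```

Comparing n.raw and n.adj shows how many of the chance findings survive the correction.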
Tools for differential expression
• Tools for accurate estimation of P-values aim to take these different issues into account when testing the hypothesis that the expression values come from the same distribution, or have the same mean, in two conditions.
• They also address the multiple testing problem.
Tools in R:
- edgeR
- DESeq2
(Table P=: one row per gene (20’000 genes), with columns for gene, mean, log-fold change, P-value and adjusted P-value)
How to show your results?
How to plot this on your computer?
(Figure: plot of the genes in table P=, with genes at P_adj < 0.05 and at P_adj >= 0.05 distinguished)
1) Select genes with P_adj >= 0.05:
ind1 <- which( P[,5] >= 0.05 )
2) Plot these points, with axis limits that fit all genes:
plot( P[ind1, 2], P[ind1, 3], xlim=range(P[,2], na.rm=TRUE), ylim=range(P[,3], na.rm=TRUE) )
3) Select genes with P_adj < 0.05:
ind2 <- which( P[,5] < 0.05 )
4) Add these points to the same graph:
points( P[ind2, 2], P[ind2, 3], col="red" ) # points() draws on the existing axes
3 - Visualizing high-dimensional data
• Each sample can be considered as a point in a very high dimensional space (N dimensions).
• In this high-dimensional space, are some samples more similar to each other?
– Replicates
– Similar cell types
– Cancer subtypes
Example in 3D (i.e. 3 genes)
(Figure: samples S1 to S5 along the axes Gene 1, Gene 2 and Gene 3)
Visually, you can see that:
- S1, S3, S4 are similar to each other.
- S2, S5 are similar to each other.
Can you quantify it?
- Distance
- Angle (correlation)
Distances - How would you do it on a computer?
(Figure: samples S1 to S5 along the axes Gene 1, Gene 2 and Gene 3)
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
d12 <- 0
for (i in 1:3) {
  d12 <- d12 + (S1[i]-S2[i])**2
}
d12 <- sqrt(d12)
Here we used “**” to take the square of a number and the sqrt() function for the square root.
What if you have 20’000 genes?
• Very hard to visualize
• You can still compute distances
d12 <- 0
for (i in 1:length(S1)) {
  d12 <- d12 + (S1[i]-S2[i])**2
}
d12 <- sqrt(d12)
This is a big advantage of using programming languages, compared to Excel (or manual calculations…)
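The same distance can be obtained without a loop, and for all pairs of samples at once. This sketch uses the coordinates of the five example samples from these slides; D and r12 are names introduced here.

```r
# Coordinates of the five example samples (3 genes each)
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)
mat <- rbind(S1, S2, S3, S4, S5)   # one sample per row

# Euclidean distance between S1 and S2 in one line
d12 <- sqrt(sum((S1 - S2)^2))

# All pairwise Euclidean distances at once (S(S-1)/2 = 10 values)
D <- as.matrix(dist(mat))

# Correlation between two samples, the "angle" type of similarity
r12 <- cor(S1, S2)
```

Nothing in this code depends on the number of genes, so it works unchanged for vectors of 20’000 values.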
Visualization
• Distances are still not very intuitive…
• If you have many points (S), the number of pairwise distances is S(S-1)/2…
• Idea: Project the data in 2D, so that the projection optimally represents the raw data (gene expression profiles) in the N-dimensional space.
2D projection – the good choice
(Figure: samples S1 to S5 in 3D, along the axes Gene 1, Gene 2 and Gene 3, and their projection in 2D onto the plane with axes PC1 and PC2)
2D projection – the bad choice
(Figure: samples S1 to S5 in 3D, along the axes Gene 1, Gene 2 and Gene 3, and their projection in 2D onto the plane with axes PC1 and PC2)
Principal Component Analysis (PCA)
• How to select the 2D plane on which to project the data?
(Figure: samples S1 to S5 along the axes Gene 1, Gene 2 and Gene 3, with the directions PC1 and PC2)
- Intuitive idea: Take the axes with the largest variance or dispersion (Principal Components).
- The math behind it is not simple (eigenvalue decomposition of the covariance matrix), but the procedure is the same whatever the number of genes (dimension).
- You do not need to understand the math to use it.
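For readers curious about the math, the sketch below computes a PCA "by hand" through the eigenvalue decomposition of the covariance matrix and compares it with prcomp(). It uses the five example samples of these slides; centered, eig and scores are names introduced here.

```r
# Five example samples (rows) x 3 genes (columns)
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

# 1) Center each gene at 0
centered <- scale(mat, center = TRUE, scale = FALSE)

# 2) Eigenvalue decomposition of the gene-gene covariance matrix
eig <- eigen(cov(mat), symmetric = TRUE)

# 3) Project the centered samples onto the eigenvectors
scores <- centered %*% eig$vectors

# Same analysis with the built-in function
pca <- prcomp(mat)
```

The eigenvalues are the variances along each principal component (pca$sdev^2), and the projected coordinates agree with pca$x up to the sign of each axis, which is arbitrary.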
How to do it on your computer
• In R, use the function prcomp (stats package).
# Each point in space
S1 <- c(5, 6, -1)
S2 <- c(-2, 5, 3)
S3 <- c(5.5, 6.5, -1.3)
S4 <- c(4, 6.5, -0.3)
S5 <- c(-2.2, 5.3, 3.1)
# Coordinates along x, y, z axes
x <- c(S1[1], S2[1], S3[1], S4[1], S5[1])
y <- c(S1[2], S2[2], S3[2], S4[2], S5[2])
z <- c(S1[3], S2[3], S3[3], S4[3], S5[3])
• Plot the data in 3D
library(rgl) # you may also need to run: install.packages("rgl")
plot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
Or
library(scatterplot3d) # you may also need to run: install.packages("scatterplot3d")
scatterplot3d(x,y,z, xlim=c(-10,10), ylim=c(-10,10), zlim=c(-10,10))
• Make a PCA analysis
mat <- t(matrix(c(S1, S2, S3, S4, S5), nrow=3)) # Make a matrix with each point in one row
pca <- prcomp(mat)
plot(pca$x[,1], pca$x[,2])
See practical this afternoon
Putting labels on the plot
• For a better visualization, you can add margins and put labels:
plot(pca$x[,1], pca$x[,2], xlim=c(min(pca$x[,1]-1), max(pca$x[,1]+1)), ylim=c(min(pca$x[,2]-0.1), max(pca$x[,2]+0.1)))
text(x=pca$x[,1]+0.2, y=pca$x[,2]+0.05, labels=c("S1", "S2", "S3", "S4", "S5"))
Now let’s look at the tumor expression data
• Run:
pca <- prcomp(M.norm)
# Plot the samples along the two first components
plot(pca$x[,1], pca$x[,2])
What do you see?
Does it make sense in light of the expression of CD19?
Principal component analysis – some discussions
- The axes with the largest variance do not necessarily reflect the structures in the data.
- In PCA, the principal components are always orthogonal (linear method).
- It is often useful to make sure the mean of the samples is at 0.
(Figure: samples along the axes Gene 1 and Gene 2, with the direction PC1)
Many refinements/alternatives
• In PCA, only select a subset of genes (high expression, high variability,…).
• Multi-dimensional scaling (MDS). Plot the points in 2D so that distances in the original space are best preserved (R function cmdscale() in the stats package).
• t-distributed Stochastic Neighbor Embedding (t-SNE). Very popular these days (R package “tsne”).
• UMAP: Recent algorithm.
Non-linear techniques (not a simple projection).
All these techniques are fully unsupervised: they do not need to know what your data are, which clusters you should expect,…
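As a sketch of the MDS alternative, classical multi-dimensional scaling can be run with cmdscale() on the five example samples of these slides. The comparison with PCA at the end holds for classical MDS on Euclidean distances; other distances or other MDS variants would generally give different pictures.

```r
# Five example samples (rows) x 3 genes (columns)
mat <- rbind(S1 = c(5, 6, -1),
             S2 = c(-2, 5, 3),
             S3 = c(5.5, 6.5, -1.3),
             S4 = c(4, 6.5, -0.3),
             S5 = c(-2.2, 5.3, 3.1))

# Classical MDS: 2D coordinates that best preserve the pairwise distances
mds <- cmdscale(dist(mat), k = 2)

# PCA of the same data, for comparison
pca <- prcomp(mat)

plot(mds[, 1], mds[, 2], xlab = "MDS 1", ylab = "MDS 2")
text(x = mds[, 1] + 0.2, y = mds[, 2] + 0.05, labels = rownames(mat))
```

With Euclidean distances, the MDS coordinates match the first two principal components up to the sign of each axis, so the interest of MDS lies mostly in using other distance measures.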
How to choose?
• Start with PCA.
• If you know what your samples are (e.g., different cell types), you can try to play a bit with parameters (e.g., choice of genes, choice of algorithm) to have meaningful clusters.
Two possible outcomes:
- Find optimal parameters that best capture the signal in your data. => Allows you to discover new things.
- Overfit your data. => See only what you want to see (even if it is not there). => Prevents you from seeing anything new.
Where to access gene expression data
• GEO:
– Largest collection of gene expression data (microarray, RNA-Seq).
– Often has counts (not only raw data).
• ENA (European Nucleotide Archive):
– Large collection of raw RNA-Seq data (bam files).
• ArrayExpress: functional genomics data
See exercises this afternoon
Where can we access cancer gene expression data
• TCGA: large collection of tumor RNA-Seq, Exome-Seq, methylation, clinical information,…
> 10’000 patients with sequenced tumors
See exercises tomorrow
General remarks about programming
• Computers like numbers and simple operations
– Need to decompose complex tasks into simple steps.
• Learning a programming language takes time, but you do not need to know everything before starting.
– First understand the logic, then use books or online resources for the syntax.
• Data analysis takes time
– Analyzing large datasets is often more challenging than producing them…
• Many ways of making many mistakes!!!
– We all make mistakes.
– You need to check your outputs when you write code.
• If you do a normalization on matrix rows, check that the row sums are truly equal.
• If there is something incoherent in your output, always go back to find the mistakes (do not attribute it to noise…), even if the data come from a bioinformatics “expert”.
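The row-sum check mentioned above could be sketched as follows, on a made-up matrix. M.wrong is a deliberately incorrect normalization (by columns instead of rows) added here to show that the check catches the mistake.

```r
# Toy count matrix: 2 samples (rows) x 3 genes (columns)
M <- matrix(c(10, 30, 60,
               1,  1,  2), nrow = 2, byrow = TRUE)

# Correct normalization: each row is divided by its own total
M.norm <- M / rowSums(M) * 1e6

# Check: every row must now sum to one million
ok <- isTRUE(all.equal(rowSums(M.norm), rep(1e6, nrow(M))))

# A typical mistake: normalizing the columns instead of the rows
M.wrong <- sweep(M, 2, colSums(M), "/") * 1e6
ok.wrong <- isTRUE(all.equal(rowSums(M.wrong), rep(1e6, nrow(M))))
```

Here ok is TRUE and ok.wrong is FALSE. Placing stopifnot(ok) right after a normalization makes a script stop as soon as such a mistake is introduced.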
• In the beginning, it is a big investment to write a script, rather than using Excel.
• But in the long run, it allows you to go much faster and quickly analyze many datasets without having to redo everything each time.
• Many analyses cannot be done in Excel, while R provides many packages that you can use.
How to get support for bioinformatics analyses of gene expression data
• Sequencing facility:
– GTF (Keith Harshman)
– Standard pipelines for normalizing and PCA
• Bioinformatics core facility (Delorenzi) or Vital-IT.
• Very specific analyses: groups working in computational biology.
Questions?