Microarray data analysis with Chipster
16.-17.4.2008
Jarno Tuimala, Eija Korpelainen
Program – an analysis workflow
Day 1.
• Basic functionality of Chipster (Eija)
• Data import (Eija)
• Quality control (Jarno)
• Normalization (Jarno)
  • Describing the experiment
• Filtering and missing value considerations (Jarno)
Day 2.
• Statistical testing (Jarno)
• Clustering and visualization (Jarno)
• Annotation (Eija)
• Promoter analysis (Eija)
• Experimental design (Jarno) – if time allows
Demo data
Affymetrix
• Kidney cancer
• 8 controls, 9 cancer patients
Agilent
• Acute leukemia
• 7 controls, 7 FLT mutated
Illumina
• Teratozoospermia
• 5 controls, 8 affected
Introduction to microarrays
Research using microarrays
Plan!
• Experimental design
Laboratory work
• Extract, label, hybridize
Computer work
• Scanning, image analysis
• Bioinformatics
Laboratory work
• Confirmation
Publish
• Submit data to public databases
Introduction to Chipster
Chipster
Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project
Features
• Easy to use graphical user interface
• Comprehensive selection of tools
• Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
• Compatible with Windows, Linux and Mac OS X
• Easy to install and update
• Wizards and workflows
• Interactive graphics
• Transparency (as opposed to “black box”)
• Alternative annotations for Affymetrix arrays
• Automatic tracking of performed analyses
http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf
http://chipster.csc.fi
How does it work?
[Figure: architecture diagram. Labels: desktop client (Java Web Start installs and updates the client automatically), internet, SSL, front end, security, analyser, CSC, Corona/Murska, SOAP, international Web Services, analysis, visualisation]
Acknowledgements
• Aleksi Kallio
• Jarno Tuimala
• Taavi Hupponen
• Mika Rissanen, Janne Käki, Mikko Koski, Petri Klemelä
• All the pilot users
• Department of computer science (HY)
• Dario Greco (HY)
• Prof. Olli Yli-Harja’s group (TUT)
• GeneCruiser team (MIT Broad Institute)
• Tekes/SA SYSBIO-program
[Figure: screenshot with panel labels Data, Tools, Visualization]
Phenodata – describing your experiment
Phenodata file is created during normalization
Fill in the group column with numbers describing your experimental setup
• e.g. 1 = healthy control, 2 = cancer sample
• necessary for the statistical tests to work
If you bring in previously created normalized data and phenodata:
• Choose "import directly" in the import tool
• Right click on normalized data, choose "Link to" phenodata and link type "Annotation"
If you brought in normalized data and need to create phenodata for it:
• Utilities/ Generate phenodata (fill in the chiptype parameter!)
• Right click on normalized data, choose "Link to" phenodata and link type "Annotation"
• Fill in the group column
Visualizing the data
Data visualization panel
• Maximize and redraw for better viewing
Two types of visualizations
1. Interactive visualizations produced by the client program
• Select the visualization method from the pulldown menu of the data visualization panel
• Save by right clicking on the image
2. Static images produced by R/Bioconductor, Weeder, etc
• Select from Analysis tools/ Visualisation
• View by double clicking on the image file
• Save by right clicking on the file name and choosing "Export"
Interactive visualizations by the client
• Spreadsheet
• Histogram
• Scatterplot
• 3D scatterplot
• Expression profiles
• Clustered profiles
• Hierarchical clustering
• SOM clustering
• Array pseudo-image
Available actions:
• Change titles, colors etc
• Zoom in/out
• Select and annotate genes using the MIT GeneCruiser
Static images produced by R/Bioconductor
• Volcano plot
• Box plot
• Histogram
• Heatmap
• Venn diagram
• Idiogram
• Chromosomal position
• Correlogram
• Dendrogram
• QC stats plot
• RNA degradation plot
• K-means clustering
• SOM-clustering
Automatic tracking of analysis history
Running many analyses simultaneously
You can have max 5 analysis jobs running at the same time
Use Task manager to
• view parameters, status,…
• cancel jobs
Workspace – continue later/elsewhere
Saving your workspace allows you to continue later
• File/ Save workflow
• File/ Load workflow
Currently it is possible to have only one workspace saved at a time
If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location
• C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot
Workflow – reusing your analysis pipeline
Creates a "macro" that can be applied to another normalized dataset and phenodata
Choose a dataset, and the workflow records the analysis steps that lead to that dataset
You can give the workflow a meaningful name (ending in .bsh), but it has to be located in the chipster-scripts folder under nami-work-files
You can run a workflow on another computer by making it visible to Chipster with "Reload workflows from disk"
You can change parameters directly in the workflow file
Wizard – autopilot for analysis
Wizard for Affymetrix data
Ready-made workflow to find differentially expressed genes
• Normalization
• Phenodata creation
• Statistical test
• Hierarchical clustering
Importing files
Affymetrix CEL-files are imported to Chipster automatically
Other files are imported using the Import tool
Import tool, step 1
Define
• Header
• Footer
• Title row
• Delimiter
Import tool, step 2
• Define columns
• Modify flags
Importing Agilent files
• Sample (rMeanSignal)
• Sample background (rBGMedianSignal)
• Control (gMeanSignal)
• Control background (gBGMedianSignal)
• Identifier (ProbeUID)
• Annotation (ControlType)
https://extras.csc.fi/biosciences/chipster-manual/data-formats.html
Exercise
Exercise I
1. Import the demo data of your favorite type in Chipster
• Affymetrix
• Agilent
2. Save the workspace
3. Have lunch (back at 13.00)
Quality control
Quality control tools
Quality control tools
• Affymetrix basic
  • RNA degradation + Affy QC
• Affymetrix RLE & NUSE (might take a long time to run)
  • Fits a model to expression values
• Agilent
  • MA-plot + density plot + boxplot
Visualization – dendrogram
Statistics – NMDS
Affymetrix I
Quality control tools are run on raw data (CEL files).
• Dendrogram and NMDS on normalized data
Affymetrix II
Agilent
General QC – dendrogram and NMDS
Scatterplots
Heatmaps (this took an hour to calculate)
QC-tools in Chipster
Quality control
• Affymetrix basic
• Affymetrix RLE and NUSE
• Agilent 2-color
Visualization
• Dendrogram
• Heatmap
• Correlogram
Statistics
• NMDS
Normalization
What is normalization?
Normalization is the process of removing systematic variation from the data.
Typically you would normalize your data so that all the chips become comparable.
Methods
Affymetrix
• Background correction + expression estimation + summarization
• RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing
Agilent
• Background correction + averaging duplicate spots + normalization
After normalization the expression values are always expressed on a log2 scale
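To make "comparable chips on a log2 scale" concrete, here is a minimal sketch of median normalization, one of the simpler chip-level schemes. It is not RMA and not Chipster's code; the function name and the dict-of-lists input are assumptions made for illustration.

```python
import math
from statistics import median

def log2_median_normalize(chips):
    """chips: dict mapping a chip name to a list of positive raw intensities.

    Returns log2 values shifted so that every chip has the same median.
    """
    logged = {name: [math.log2(v) for v in values] for name, values in chips.items()}
    chip_medians = {name: median(values) for name, values in logged.items()}
    # Use the median of the chip medians as the common target level.
    target = median(chip_medians.values())
    return {
        name: [v - chip_medians[name] + target for v in values]
        for name, values in logged.items()
    }

raw = {"chip_a": [100, 200, 400], "chip_b": [200, 400, 800]}
normalized = log2_median_normalize(raw)
```

After the call, both chips share the same median, so a systematic twofold brightness difference between the chips no longer looks like a biological effect.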
Affymetrix
Methods: MAS5, Plier, RMA, GCRMA, Li-Wong
• MAS5 is the older Affymetrix method, Plier is a newer one
• RMA is the default, and works rather nicely if you have more than a few chips
• GCRMA is similar to RMA, but also takes GC content into account
• Li-Wong is the method implemented in dChip
Variance stabilization makes the variance similar over all the chips
• Works only with MAS5 and Plier, since all the others output log2-transformed data by default (and are thus already corrected for the same phenomenon)
Custom chiptype
• If you want to use reannotated probes (probes reassigned to the genes where they really belong), select one from this menu.
Agilent I
Background correction
• Background treatment
  • None, Subtract, Edwards, Normexp
• Background offset
  • 0 or 50
Normalize chips
• None, median, loess
Normalize genes (not typically used)
• None, scale (to median), quantile
Chiptype
• Must be set!
Agilent II
Background treatment typically generates many negative values that are coded as missing values after log2-transformation.
• The usual subtract option does this
• Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)
Loess removes curvature from the data (suggested)
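The missing-value problem with plain subtraction can be illustrated with a small sketch. It only shows the subtract option; normexp is a model-based correction and is not implemented here. The function name is hypothetical.

```python
import math

def log2_after_subtraction(signal, background):
    """Subtract the background and log2-transform the result.

    Returns None (a missing value) when the corrected intensity is not
    positive, because the logarithm is undefined there.
    """
    corrected = signal - background
    if corrected <= 0:
        return None
    return math.log2(corrected)

bright_spot = log2_after_subtraction(200, 72)   # 128 -> 7.0
dim_spot = log2_after_subtraction(120, 150)     # negative -> missing
```

Dim spots whose background estimate exceeds the foreground are exactly the ones that drop out, which is why a method that never produces negative corrected values keeps more of the data.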
Checking normalization
Exercise
Exercise II
Normalize your dataset
• Use two different normalization schemes
Describe the experiment (fill in phenodata)
Check the quality of your dataset
• Is there a difference between the normalization schemes?
• If there is, select the better one, and continue with it
Filtering
Gene filtering
Removing probes for genes that are
• Not expressed
• Expressed at constant level (not changing)
Often a good idea, and necessary before multiple testing correction can be adequately applied
• Some controversy on this…
Non-specific filtering
• Expression, flags, SD, …
Specific filtering
• Statistical testing
Non-specific filtering
Often used for removing bad quality data:
• Intensity value too low
• Intensity value saturated
• Appearance of the spot is abnormal
Typically, non-changing genes are also removed
These can be removed using
• Filter by standard deviation
• Filter by interquartile range
• Filter by expression
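A filter by standard deviation can be sketched as follows: rank the genes by their SD across chips and drop a chosen percentage of the least variable ones. This is an illustration under assumed names and data layout, not the Chipster tool itself.

```python
from statistics import stdev

def filter_by_sd(expression, percent_to_remove):
    """expression: dict mapping a gene name to its log2 values across chips.

    Removes the given percentage of genes with the smallest standard
    deviation and returns the remaining genes.
    """
    ranked = sorted(expression, key=lambda gene: stdev(expression[gene]))
    n_remove = int(len(ranked) * percent_to_remove / 100)
    return {gene: expression[gene] for gene in ranked[n_remove:]}

expression = {
    "g1": [5.0, 5.0, 5.1],   # almost constant
    "g2": [2.0, 8.0, 5.0],   # changing
    "g3": [7.0, 7.5, 7.0],   # almost constant
    "g4": [1.0, 9.0, 3.0],   # changing
}
kept = filter_by_sd(expression, 50)
```

With 50 % removed, only the two genes that actually change across chips remain. Swapping `stdev` for an interquartile range function would give the IQR filter.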
Specific filtering
Selecting genes that are associated with some phenotype
Typically involves statistical testing
Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.
• Both tell a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
• Take both into account by combining the filters.
• Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)
Unspecific filtering in Chipster
Pre-processing
• Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips this rule has to be fulfilled on
  • Select whether to return genes inside or outside the range
• Filter by SD
  • Select the percentage of genes to filter out
• Filter by interquartile range (IQR)
  • Select the IQR
• Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)
Utilities
1. Calculate descriptive statistics
2. Filter using a column
Venn diagram
Select three datasets in Chipster
Run the Venn diagram tool from Visualization tool category
[Figure: Venn diagram with sets labelled SD, CV and IQR]
Exercise
Exercise III
Filter your dataset using unspecific filtering
• Use two different schemes
• Compare the schemes using Venn diagram
• Are there any common genes?
Statistics
Some terminology
Usually tests for comparing means of two or more groups are used
• Variance might be of interest too, but in practice this is never done.
Parametric tests (assume data normally distributed)
• Typically used for microarray data
Non-parametric tests (assume no normality)
P-value
• Risk of saying that there is a difference when there really isn’t
• Traditionally 0.05 is used as a cut-off for significance
• False discovery rate is a p-value corrected for multiple tests (more on this later)
Mean and variance, an example for 1 gene
[Figure: two density plots. Left: density.default(x = x1), N = 100000, bandwidth = 0.08956, x axis from -6 to 6. Right: density.default(x = y1), N = 100000, bandwidth = 0.09006, x axis from -10 to 10. y axis in both: Density, 0.0 to 0.4]
Statistical testing
Needs replication (>2 chips per group)
• Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.
Comparing means (parametric tests)
• One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
• Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
• Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
• Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)
t-test
Compares means of two groups
• If the p-value is small, the data give evidence of a difference between the groups.
• If the p-value is large (>0.05), the data give no evidence of a difference between the groups (which is not proof that there is none).
• p-value is a risk of saying that there is a difference when there actually isn’t.
A test for every gene is run separately -> thousands of tests and p-values
$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
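As a worked example of the t statistic for one gene, the sketch below divides the difference of the two group means by the standard error of that difference. The slide does not say how SE is computed; the unpooled (Welch) form used here is an assumption, and the function name is hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(group1, group2):
    """Difference of group means divided by its standard error.

    Assumption: SE = sqrt(s1^2/n1 + s2^2/n2), the unpooled (Welch) form.
    """
    se = math.sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / se

controls = [1.0, 2.0, 3.0]
patients = [4.0, 5.0, 6.0]
t = t_statistic(controls, patients)   # about -3.67
```

In a microarray analysis the same calculation is repeated for every gene, which is what produces thousands of test statistics and p-values.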
ANOVA
A generalization of the t-test.
Compares means of several groups.
Tells whether the means are different, but not which means differ from each other.
• For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)
A test for every gene is run separately -> thousands of tests and p-values
Multiple testing correction I
After getting the results for all the genes, p-values are adjusted for the number of tests conducted.
When making several comparisons using the same test, some of the results will be chance findings.
• Example: if the p threshold is 0.05, on average every 20th test is expected to give a significant result by chance alone. If 10000 genes were tested, 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (500) would be false positives, and only a minority (50) would be true positives.
This can be corrected for (to some extent) by using a multiple testing correction.
• Benjamini and Hochberg FDR: If FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.
• Thus, the FDR threshold can be much higher than a usual p-value threshold, and the results can still be meaningful and worth investigating.
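The Benjamini and Hochberg adjustment itself is short enough to sketch. Each p-value is multiplied by the number of tests and divided by its rank, and the results are then made monotone from the largest p-value downwards. The function name is an assumption; this is an illustration rather than the code Chipster runs.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the
    # adjusted values non-decreasing in the original p-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
```

Because the adjustment is monotone, a gene with a smaller raw p-value never ends up with a larger adjusted value than a gene with a larger raw p-value.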
Multiple testing correction II
The ranking of the genes does not change after multiple testing correction!
• If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction.
• If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.
Gene set test (”global test”)
A typical result of a microarray experiment is a list of differentially expressed genes.
Biologically, grouping these genes in pathways or functional categories would be more interesting.
Are pathways associated with our endpoints of interest?
• Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?
Works on the expression values data.
Gene enrichment analysis
A typical result of a microarray experiment is a list of differentially expressed genes.
Biologically, grouping these genes in pathways or functional categories would be more interesting.
Takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories.
Works on the gene list.
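One common way to test a gene list for enrichment in a category is the hypergeometric test: how likely is it to see at least this many category members in the list if the list were drawn at random from all genes on the chip? The slides do not state which test Chipster uses, so treat the sketch below, including its function and parameter names, as an assumption.

```python
from math import comb

def enrichment_p_value(n_genes, n_in_category, n_selected, n_overlap):
    """Probability of at least n_overlap category genes among n_selected
    genes drawn at random from n_genes, of which n_in_category belong
    to the category (upper tail of the hypergeometric distribution)."""
    total = comb(n_genes, n_selected)
    upper = min(n_in_category, n_selected)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# 10 genes on the chip, 5 in the pathway, 5 differentially expressed,
# and all 5 of those are pathway members.
p_all_five = enrichment_p_value(10, 5, 5, 5)
```

A small p-value means the category is over-represented in the gene list. With many categories tested, these p-values need multiple testing correction as well.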
Statistical tests in Chipster
Statistics
• One sample tests
  • Are the genes expressed at all (different from 0)?
• Two group tests
• Several group tests
• Linear modeling
Visualization
• Volcano plot
Exercise
Exercise IV
To find differentially expressed genes, run a suitable statistical test for your (filtered) data set.
Are these differentially expressed genes enriched in any KEGG pathways?
• There is a separate test for this.
Clustering
Clustering methods
Hierarchical clustering
Non-hierarchical clustering
• K-means
• QT-clustering
• Self-organizing maps
Classification aka class prediction
• K-nearest neighbor (KNN)
Unsupervised vs. supervised
Hierarchical clustering
Two phases:
• Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
• Pick the dendrogram drawing method
  • Average linkage
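The choice of distance measure matters because the two options answer different questions. The sketch below (function names are assumptions) compares two expression profiles that have the same shape but different overall levels.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for identical shapes, 2 for opposite ones."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return 1 - covariance / norm

low_gene = [1.0, 2.0, 3.0]
high_gene = [11.0, 12.0, 13.0]   # same trend, higher expression level
```

Euclidean distance treats these two genes as far apart because of the level difference, whereas the correlation-based distance treats them as identical because they rise in the same way across the samples.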
Average linkage example
Hierarchical clustering - heatmap
K-means clustering
Finds K clusters from the data.
User has to specify the number of clusters (K).
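A bare-bones K-means can be sketched in a few lines: start from K randomly chosen points as cluster centers, assign every point to its nearest center, move each center to the mean of its members, and repeat. The function name, the fixed iteration count and the seed are assumptions for illustration; real implementations add convergence checks and multiple restarts.

```python
import random

def k_means(points, k, n_iter=20, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (labels, centers), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = k_means(points, k=2)
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful.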
K-means clustering
Clustering in Chipster
Clustering
• Hierarchical
  • Includes reliability checking of the resulting tree with bootstrapping
• K-means
Statistics
• PCA (principal component analysis)
• NMDS (non-metric multidimensional scaling)
Exercise
Exercise V
Cluster your differentially expressed genes using hierarchical clustering
Annotation
Annotation = Descriptive text used for labeling features. For genes, extra information about their location in chromosomes, biological functions, etc.
Retrieved from multiple biological databases and stored as a single database in Chipster (generated by Bioconductor project).
Not available for all chiptypes, but required by certain analyses (annotation, gene enrichment analysis, promoter analysis)
For Affymetrix: either built-in or GeneCruiser
For other chiptypes: built-in
Alternative CDF environments for Affy
A CDF is a file that links individual probes to their location in genes (probesets)
The Affymetrix default annotation uses old CDF files that map a sizable number of probes to wrong genes
Alternative CDFs (custom chiptype in Affymetrix normalization) fix this problem
After using the alt CDFs, you can’t use gene set enrichment or promoter analysis tools
• No annotation files exist for alt CDFs
Promoter analysis
Promoter analysis with Chipster
Promoter sequences = sequences upstream of annotated transcription start site of RefSeq genes (from UCSC Golden Path)
Pattern discovery: Weeder
• looks for common sequence motifs in a set of promoters
Pattern matching: ClusterBuster
• looks for clusters of known transcription factor binding sites using the JASPAR matrices
Promoters from genes with similar expression patterns
Pattern discovery
Program to find common motifs
• Tool comparison: Nature Biotech. (2005) 23:137 => Weeder
Weeder
• Enumerates all oligos of a given length, determines which appear in a significant fraction of the sequences, and ranks them according to statistical significance
• Pavesi et al (2004) Nuc Acids Res. Jul (W199-203)
• Species (human, mouse, rat, yeast) [human]
  • Background frequency files (oligo count of intergenic regions of a given organism)
• Promoter size (short, medium, long) [short]
• Analyze strands (single, both) [single]
• Motif appears more than once per sequence (yes, no) [no]
• Number of motifs to return (1-100) [10]
• Percentage of sequences the motif should appear in (1-100) [50]
• Transcription factor binding site size (small, medium) [small]
  • Small = 6 (1 mismatch allowed) and 8 (2 mismatches allowed)
  • Medium = 10 (3 mismatches allowed)
Pattern matching
• Collection of known binding motifs for TFs (Genomatix, Transfac, JASPAR)
• Program to scan the sequence for binding sites
[Figure: sequence TTTTTATA]
ClusterBuster
Looks for clusters of transcription factor binding sites
Uses the JASPAR open access matrix database
• http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl
Frith et al (2003) Nuc Acids Res, 31(13):3666-8
• Species (human, mouse, rat, yeast) [human]
• Promoter size (short, medium, long) [short]
• Cluster score threshold [5]
• Motif score threshold [6]
• Expected distance between motifs in a cluster [35]
• Range for counting nucleotide frequencies [100]
• Pseudocounts [0.375]
ClusterBuster output
Exercise
Exercise VI
Search your list of differentially expressed genes for binding sites of known transcription factors
Extra material
Linear modeling in Chipster
Linear model
$Y = a + b x_1 + c x_2 + d x_1 x_2$
• Like a normal multiple regression
• Intercept (a) is included by default
• Can contain both main effects (b, c) and interaction effects (d)
Linear modeling in Chipster can take into account at most three main effects, their interactions, one technical replication level, and one level of pairing
• This is enough for all the experiments I’ve encountered in GEO so far.
• Technical replication: one biological sample is hybridized on more than one array
• Pairing: before-after type of setting. Measurements available just prior to treatment and after it from exactly the same cell culture flasks.
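The model above corresponds to a model matrix with one row per sample and one column per coefficient (a, b, c, d). A minimal sketch of building it for two numerically coded main effects follows; the function name is hypothetical, and a factor with more than two groups would need indicator columns instead of a single numeric column.

```python
def model_matrix(x1, x2, interaction=True):
    """Rows of [1, x1, x2, x1*x2] for the model Y = a + b*x1 + c*x2 + d*x1*x2.

    The leading 1 is the intercept column. The product column is dropped
    when interaction is False.
    """
    rows = []
    for a, b in zip(x1, x2):
        row = [1, a, b]
        if interaction:
            row.append(a * b)
        rows.append(row)
    return rows

# Two main effects with two groups each, coded 0/1, four samples.
treatment = [0, 0, 1, 1]
sex = [0, 1, 0, 1]
design = model_matrix(treatment, sex)
```

The interaction column is 1 only for samples that are in the second group of both main effects, which is what lets the model estimate an effect beyond the sum of the two main effects.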
Setting up the model I
All columns (max. three) in the phenodata can be either tested as linear (is there a trend towards higher numbers?) or as a factor (are there differences between the groups?).
• With 2 groups there’s no difference in these settings.
[Figure: two panels with groups 1, 2, 3 on the x axis, labelled linear and factor]
Linear modeling tool
Columns 1…3
• Main effects
Column 4
• Technical repl.
Column 5
• Pairing
One main effect – 3 groups
[Figure panels: linear, factor]
Setting up the model
If you want to include more than one main effect, you need to add new columns to your phenodata.
Two main effects – both have two groups
No interactions
Two-way interactions, with significant genes returned for all effects (main effects and interactions)
Pairing or technical replication
All samples in the same pairing or replication groups are coded with the same number. Different groups are coded with a running number.
Result files
A model matrix and one result file are saved.
Experimental design
Some things to ponder
Bad experimental design is bad science!
• Wasted money
• More animal or human suffering
• Unreliable results
The main aspects of experimental design are
• Randomization and balancing (often neglected)
• Replication (usually rather well handled)
• Blocking (not even known of)
• Factorial experiments (sometimes considered)
You also need to consider
• Sample size
• Controls (direct or indirect measurements)
Before running the experiment
Define the principal hypothesis to test. Everything cannot be tested!
• ”I ran this experiment for comparing two treatments on Arabidopsis. Now, come to think of it, these plants were of different ages. Can you also test for the effect of that?”
Which are the main sources of variability? They need to be taken into account in the experimental design!
• Laboratory personnel (more than one person involved?)
• Chips (from more than one batch?)
• Biological samples (inter- or intraindividual variability?)
• Hybridization conditions (is the method standardized?)
• Day (often the greatest source of variation)
  • Intermingled with variation from chips, biological samples, etc. if not properly taken into account
Replication
Technical replication:
• Take a sample per animal, and hybridize every sample to several chips.
Biological replicate:
• Take a sample per animal, and hybridize every sample to one chip.
Replication does not mean taking repeated measurements from the same experimental units. That typically generates a time series.
A technical replicate analysed as a biological replicate is a pseudoreplicate. Pseudoreplication generates more problems than it solves.
Balancing
Balancing means that there should be an equal number of experimental units in all groups.
Balanced designs are statistically more powerful than unbalanced designs.
Example:
• In the study of breast cancer, 30 individuals were recruited from the cancer cohort, and 30 individuals as their healthy controls (balanced for the disease).
• 60 Affymetrix chips are available for hybridizing these samples. The Affymetrix station only takes 8 chips at a time, so 4 cancer patients and 4 healthy controls are randomly picked to be hybridized in every batch (balanced for day effect).
• Two laboratory technicians are making the hybridizations. Both process 30 samples, half being cancer patients and half healthy controls (balanced for the technician).
Randomization
Randomization is a way to control for effects of factors not explicitly taken care of in the experimental design.
In randomization experimental units are randomly allocated to treatment groups.
• Sixty cell culture vials are randomly divided into control and treatment groups. They retain their places in the incubator regardless of the group (completely randomized trial).
Random does not mean haphazard. Randomization takes some effort. Use e.g. dice, playing cards, a random number generator or random number tables for randomization.
In the best case the randomization is blind. The experimenter must not be able to identify the samples before the whole experiment has been concluded.
Completely randomized design
Let’s divide 16 samples into two groups of equal size. I’ve created a random number table (below).
Reading the table from the top left to the bottom right, the cell culture vials are assigned to two groups.
We might then arrange the vials on the tray in the same order and put the tray in the incubator.

Row #   A   B   C   D
1       1   2   2   2
2       2   1   2   1
3       2   1   1   2
4       1   1   2   1
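A random number generator can produce such an allocation directly. The sketch below builds a list with equally many 1s and 2s and shuffles it, which guarantees groups of equal size; the function name and the seed are assumptions for illustration.

```python
import random

def randomize_groups(n_samples, seed=None):
    """Randomly allocate n_samples to groups 1 and 2 in equal numbers."""
    half = n_samples // 2
    labels = [1] * half + [2] * (n_samples - half)
    # A seeded generator makes the allocation reproducible and documentable.
    random.Random(seed).shuffle(labels)
    return labels

allocation = randomize_groups(16, seed=2008)
# Read the list in tray order: allocation[0] is row 1, column A, and so on.
```

Recording the seed alongside the experiment makes it possible to show afterwards that the allocation was generated rather than chosen.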
Blocking
Blocking is arranging experimental units into similar groups. Blocking is used for controlling for factors that cannot be manipulated, but are known.
Example:
• While studying a response to a drug treatment, both males and females were recruited for the study. Response might depend on sex, so individuals were first divided into two groups according to their sex, and then randomly assigned to treatment groups (randomized block design).
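The randomized block design in the example can be sketched as: group the individuals by the blocking factor, then randomize the treatment within each block. All names below (function, parameters, treatment labels) are hypothetical.

```python
import random

def randomized_block(units, block_of, treatments=("control", "treatment"), seed=None):
    """Assign treatments at random within each block.

    units: list of unit identifiers; block_of: dict unit -> block label.
    Within a block the treatments are dealt out in turn after shuffling,
    so each block is as balanced as its size allows.
    """
    rng = random.Random(seed)
    blocks = {}
    for unit in units:
        blocks.setdefault(block_of[unit], []).append(unit)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment

individuals = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"]
sex = {"p1": "male", "p2": "male", "p3": "male", "p4": "male",
       "p5": "female", "p6": "female", "p7": "female", "p8": "female"}
plan = randomized_block(individuals, sex, seed=1)
```

Each sex ends up with two controls and two treated individuals, so a sex effect cannot be mistaken for a treatment effect.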
Factorial designs
In factorial design several factors are manipulated at the same time.
Better to analyze together than separately, because factorial design allows one to assess the possible interaction.
Example:
• Cells were treated with vitamin-C and hydrogen peroxide. Culturing cells with either chemical alone leads to missing the interaction where vitamin-C prevents peroxide induced cell death to some extent.
Main effects: vitamin-C and peroxide
Interaction: vitamin-C * peroxide
Sample size
We need to use a sufficient number of samples to reach reliable conclusions. Using too small or too big a sample size is a waste of resources.
Finding out the correct sample size for DNA microarray experiments is tricky. Use of previous experiments for the same chip type and biological material is often needed.
In epidemiological studies estimating the sample size is a must. It might be hard to get published otherwise.
To estimate the sample size, we need an estimate of
• Effect size
• Variability
• Desired false positive rate
• Desired false negative rate
Sample size – a comparison of two experiments
Sample size – a rule of thumb
In statistics, variability is intrinsically associated with statistical significance. The lower the variability of replicates, the higher the significance.
Doubling the sample size halves the variance of the group mean, making the detection of differences easier.
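The rule of thumb can be written out as follows, assuming n independent replicates that each have the same variance σ² (the symbols are introduced here for illustration):

$$\operatorname{Var}(\bar{x}_n) = \frac{\sigma^2}{n}, \qquad \operatorname{Var}(\bar{x}_{2n}) = \frac{\sigma^2}{2n} = \frac{1}{2}\operatorname{Var}(\bar{x}_n), \qquad SE(\bar{x}_n) = \frac{\sigma}{\sqrt{n}}$$

The variance of the mean is halved, but the standard error only shrinks by a factor of √2, so halving the standard error takes four times as many samples.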
Direct or indirect measurements?
[Figure: design diagrams with nodes labelled Reference, Sample and Sample 2]
An example of a better…good design
Comparing two groups of samples.
• 20 samples in each group (40 in total).
• You’re interested in comparing the two states (diseased, healthy).
• Interindividual variability (due to sex) can be expected.
• Using Affymetrix chips (all from the same batch).
• You’re doing all the wetlab work.
Hybridize (randomly ordered):
• 12122211
• 22112112
• 21211212
• 22221111
• 12211212
Legend: 1 = healthy, 2 = diseased; 1 = male, 2 = female