Alistair Chalk, 2008
Gene Expression
• Goals
  • To understand current high-throughput strategies for measuring gene expression
  • To understand quality control and normalisation in gene expression data
  • To understand the factors behind the choice of gene expression measurement strategy
  • To identify downstream analysis methods for gene expression data
Microarray analysis workflow
• Standard microarray analysis workflow
Gene Expression
• Data collection
  • How do you measure the ‘transcriptome’?
• Arrays
  • Affymetrix (3’ arrays)
  • Illumina
  • cDNA arrays
  • Boutique arrays
• Deep sequencing
  • 454 (Roche)
  • Solexa (Illumina)
  • SOLiD (ABI)
  • Helicos single-molecule sequencing
  • More to come; the $1000 genome is the goal
Gene Expression
• Data collection
  • All these array platforms generate large amounts of data.
  • Arrays: 50 000 to 70 000 data points (500 000 at probe level)
  • Large numbers of samples or patients multiply this 10-, 50- or 100-fold.
• Data analysis
  • Single comparisons: gene by gene, sample by sample.
  • Clustering of genes or samples to find common patterns or trends.
  • Multivariate approaches: identifying combinations of genes (biomarkers, gene ‘signatures’).
Gene Expression
• Affymetrix arrays
  • Arrays for every model organism, and more.
  • Human, mouse, rat, chicken, zebrafish, Drosophila, C. elegans, maize, Arabidopsis.
  • Are really 3’ UTR arrays, as most probes are placed there for specificity.
  • About 50-80K probesets per array.
  • More than 0.5 million probes.
  • Reproducible results, standard procedures.
Gene Expression
• Illumina ‘bead’ arrays
  • Similar attributes to Affy arrays
  • More even hybridisation, as this occurs in solution.
Gene Expression
• Affymetrix exon arrays
  • Arrays for every predicted exon in the genome
  • Human, mouse, rat so far
  • 1.4 million probesets
  • More probes, but noisier
  • Probe set regions have variable length
  • Short probe regions (~exons) make probes harder to design
Gene Expression
• Affymetrix exon junction arrays
  • Arrays for every predicted exon junction in the genome
  • Experimental, release late 2008
  • Potentially noisy: junctions are quite similar, so cross-hybridisation is an issue
Gene Expression
cDNA arrays
• Older technology.
• Low density.
• Issues with evenness of spots.
• Important to control for technical variation within the array (blocks).
• Messy dye swaps are used to control for labelling variation.
• Plagued by spatial effects and inconsistencies.
Gene Expression
• Boutique arrays
  • New array manufacturers (NimbleGen, etc.) allow custom arrays to be made.
  • Suitable for focused approaches, where knowledge from research is turned into an array that surveys a particular process or signature.
• Examples
  • Alternative splicing (transcription)
  • Human promoters (ChIP-chip)
  • Genes expressed in a particular cancer subtype
  • miRNA
• How to analyse these datasets?
• Which method do I use to measure my sample?
Gene Expression
• ‘Deep’ sequencing
  • Fantastic advances in sequencing technologies.
  • Shotgun approach.
  • Millions of ‘reads’ possible.
  • Profiling communities.
  • Metagenomics
  • Analysis of this kind of data is still under development
• Reference genome
  • What to do with novel genomes? Homologies?
  • Virtual microarrays!
Figure: curves for the various samples show the number of orthologous groups seen in each per megabase of sequencing. Parentheses indicate the lower bound of the total number of orthologous groups for each sample. Source: "Comparative Metagenomics of Microbial Communities," Science 308: 554-557 (2005).
Sequencing technology development
Gene Expression
• Choice of technology
  • Cost vs exploratory value
  • The most exploratory methods are the most expensive and the least accessible
    • Deep sequencing
    • Exon arrays
    • Exon-junction arrays
  • “Standard” microarrays are cheap!
    • Illumina Ref-6/8
    • Affymetrix U133 Plus 2.0
• Study type and platform choice
  • How much is known about the system already?
  • Well-defined systems are less likely to yield new genes
  • New or difficult-to-handle cell types are more likely to contain novel genes not on standard microarrays
  • Can also combine multiple expression techniques
Complex transcriptome sampling strategy
• Sample: hOSC expanded neurospheres
• Affymetrix exon array (HuEx1.0ST): exon expression
  • High coverage of the exons of genes and predicted genes
  • Moderate cost, 12+ samples
  • 1.4 million probe sets
• Illumina beadarray (HumanRef-8_V2): transcript expression
  • Highly supported genes
  • Low cost, 70+ samples
  • ~20k transcripts
• Illumina CAGE: transcript start site (TSS) expression
  • Discovery-based TSS
  • High cost, 4 samples
  • ~27M 27 bp tags
• Illumina beadarray (Human610-Quad): genotyping
  • HapMap
  • Low cost, 2+ samples
  • >620,000 markers, median marker spacing 2.7 kb
• Platforms range from high cost / few samples (discovery based) to low cost / many samples (well defined)
Complex transcriptome sampling strategy
• Affymetrix exon array (exon expression): sample known and predicted genes and exons
• Illumina beadarray (transcript expression): sample known genes and well-defined variants
• Illumina CAGE (transcript start site expression): unbiased TSS discovery and annotation
• Illumina beadarray (genotyping): sample known genotype variants
• Interchangeable gene models: Ensembl, RefSeq, AceView, Vega
• “Complexity is anxiety”
Analysis
• What do you want to do with your data?
• Quality control of array data: is your data good enough?
  • Any outliers? Strange values?
• Differential gene expression (DE or DGE) between two or more samples.
  • Normalisation
  • Gene expression values
  • Multiple comparisons
  • A list of differentially expressed genes
  • Identify common pathways that are differentially expressed
  • Identify common TFs that bind to DE genes
• Identify biomarkers indicative of a particular trait or process.
  • What does the expression of a biomarker look like?
• Identify gene signatures that allow you to classify a particular sample.
  • Is there a particular pattern of expression associated with disease?
Quality Control
• Plot your data!
  • Raw image
  • Raw values
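As a rough illustration of plotting raw values, the sketch below uses base R on simulated intensities, not real array data. The matrix `raw`, the array names and the cutoff of 1 log2 unit are assumptions made for the example; in practice the values would come from your own arrays and the cutoff would need to be chosen for your platform.

```r
set.seed(3)
# Simulated raw intensities: 5000 probes on 6 arrays (made-up data).
raw <- matrix(rlnorm(5000 * 6, meanlog = 6, sdlog = 1), ncol = 6,
              dimnames = list(NULL, paste0("array", 1:6)))
raw[, 6] <- raw[, 6] * 4   # make one array systematically brighter

# One box per array: an array that sits apart from the rest stands out.
log_raw <- log2(raw)
boxplot(log_raw, ylab = "log2 raw intensity", las = 2)

# Crude flag (arbitrary cutoff): arrays whose median lies more than
# 1 log2 unit from the median of all array medians.
array_medians <- apply(log_raw, 2, median)
flagged <- names(which(abs(array_medians - median(array_medians)) > 1))
flagged
```

A plot like this does not replace looking at the raw image, which is where spatial artefacts show up.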
Quality Control – Short Read Data
• Bioconductor ShortRead package
  • Currently Solexa/Illumina
  • Will soon include SOLiD
• Short sequence data contains errors
• Currently few standards for this type of data beyond “matches to the genome with <=2 mismatches”
• Data needs to be filtered!
Short Read Data – genome alignment
• Data is useless until mapped to a known genome to identify transcribed elements
  • Use the correct genome build
  • Include splice junctions, ribosomal RNA and databases of likely contaminants
• Many methods for fast short read alignment
  • Vendor specific
    • ELAND (closed source)
  • Open source strategies
    • Mapq, etc. (many new methods being published)
    • Helicos (open source platform)
Differential Gene Expression
Normalisation
• To make comparisons between chips of the same experiment possible, every chip has to be normalised
• Normalisation is an attempt to remove the non-biological variation in microarray experiments without affecting the biological variation
• There is a danger that some, or even most, of the biological information will also be removed by normalisation
• To avoid this, keep the amount of normalisation to the minimum that is needed
Differential Gene Expression
Median centering
• The median intensity of every chip is brought to the same value
• Achieved by calculating the median of the log ratios for one microarray and subtracting this median from the log ratio of every gene on that array
• Median centering does not change the spread of the data, so the original information content is not altered.
• Only appropriate when the bias is linear, i.e. does not depend on intensity
• Quantile normalisation also equalises the medians, so no separate median-centering step is needed after it.
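The median-centering recipe can be sketched in a few lines of base R. The log-ratio matrix below is made up for illustration, and the object names (`log_ratios`, `centered`) are assumptions, not part of any package.

```r
# Toy log-ratio matrix: rows are genes, columns are arrays (made-up values).
log_ratios <- matrix(
  c(0.5, 1.2, -0.3, 0.8,
    1.0, 1.9,  0.1, 1.4,
    -0.2, 0.4, -1.0, 0.2),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("array", 1:3))
)

# Median of each array (column).
array_medians <- apply(log_ratios, 2, median)

# Subtract each array's median from every gene on that array.
centered <- sweep(log_ratios, 2, array_medians, FUN = "-")

apply(centered, 2, median)   # approximately 0 for every array
apply(log_ratios, 2, sd)     # spread before centering ...
apply(centered, 2, sd)       # ... and after: unchanged
```

Only the location of each array moves; the differences between genes within an array are untouched, which is why the spread stays the same.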
Differential Gene Expression
• Quantile normalisation
  • Based on the assumption that the distribution of the intensity values is similar on every chip.
  • Holds well for genes with low expression, but not necessarily for highly expressed genes.
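To make the idea concrete, here is a minimal base R sketch of quantile normalisation on a toy matrix. The function name `quantile_normalise` and the data are assumptions for illustration; ties are broken by order of appearance, which is a simplification, and it assumes no missing values. For real analyses a tested implementation, such as `normalizeQuantiles` in the limma package, is the safer choice.

```r
quantile_normalise <- function(x) {
  # Rank of each value within its own array (column); ties broken by position.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each array, then average across arrays at each rank position.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace every value by the reference value at its rank.
  normalised <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalised) <- dimnames(x)
  normalised
}

# Toy intensities: 4 genes on 3 arrays (made-up values).
intensities <- matrix(c(5, 2, 3, 4,
                        4, 1, 4, 2,
                        3, 4, 6, 8),
                      nrow = 4,
                      dimnames = list(paste0("gene", 1:4), paste0("array", 1:3)))

normalised <- quantile_normalise(intensities)
apply(normalised, 2, sort)   # every array now has the same set of values
```

After normalisation each array has the same distribution of values by construction, which is exactly the assumption stated above, forced onto the data.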
Differential Gene Expression
Distribution of Intensities
• Most genes are expressed at low levels
• Taking the log of these values brings them closer to a normal distribution
• The distribution is still skewed!
Differential Gene Expression
Distribution of Intensities
• Signal and noise give the characteristic shape of the gene expression intensities.
  • abundances + noise = observed values
• Now we can estimate gene expression (RMA, Li-Wong, etc.)
• … and then differential expression.
Differential Gene Expression
• Probes and transcript assignment
  • Probes are sequences that hybridise to specific transcripts
    • Have varying efficacy and specificity
    • Often change between chip versions
    • Do not always target newly discovered genes
  • Transcript models
    • Change between genome releases
    • Assignment of probe to transcript is done by the manufacturer (traditionally quite badly)
    • The assignment is stored in CDF files, so it can be changed to suit
Differential Gene Expression
• Statistical testing
  • Select an appropriate statistical test (t-test, ANOVA).
  • Select a significance threshold for the p-value
  • Form the pair of hypotheses you want to compare
  • Calculate the test statistic
  • Find the p-value that corresponds to the test statistic
  • Draw conclusions
• These tests assume that the data are normally distributed.
• The hypothesis being tested is the NULL hypothesis: that there is NO difference in the means of the distributions being compared.
• How to apply this to many genes from a microarray experiment?
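One simple way to apply the steps above to many genes is to run the same test once per gene. The base R sketch below does this with a Welch t-test on simulated log2 expression values; the data, group labels and effect size are all made up, and dedicated packages for array data offer more refined per-gene tests.

```r
set.seed(42)
n_genes <- 1000
# Simulated log2 expression: 5 control and 5 treated arrays (made-up data).
expr <- matrix(rnorm(n_genes * 10, mean = 8, sd = 0.5), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
group <- factor(rep(c("control", "treated"), each = 5))

# Spike a true difference into the first 50 genes.
expr[1:50, group == "treated"] <- expr[1:50, group == "treated"] + 2

# One Welch t-test per gene (row); the null hypothesis is equal group means.
p_values <- apply(expr, 1, function(y) t.test(y ~ group)$p.value)

sum(p_values < 0.05)   # genes called significant before any correction
head(sort(p_values))   # smallest p-values, named by gene
```

The count of significant genes here includes both the spiked genes and some unchanged genes that pass the threshold by chance, which is the motivation for multiple testing correction.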
Differential Gene Expression
p-values
• The p-value is the probability, when the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The threshold chosen for it is the accepted risk of rejecting the null hypothesis when it is actually true.
• Once a cutoff for the p-value is decided, values below the threshold are considered statistically significant and values above it are not.
• Often a threshold of 0.05 is used. This means that when there is really no difference between the groups, about one test in 20 will still come out statistically significant by chance alone.
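The "one in 20" arithmetic scales with the number of genes tested. The short simulation below assumes 20 000 genes with no true difference at all (an illustrative number, not taken from a real array) and relies on the fact that p-values are uniformly distributed when the null hypothesis is true.

```r
set.seed(1)
n_genes <- 20000   # hypothetical number of genes, all truly unchanged
alpha <- 0.05

# Under the null hypothesis p-values are uniform on [0, 1].
null_p <- runif(n_genes)

expected_false_positives <- n_genes * alpha        # number expected by chance
observed_false_positives <- sum(null_p < alpha)    # number seen in this simulation

expected_false_positives
observed_false_positives
```

With this many tests, an uncorrected threshold of 0.05 yields on the order of a thousand "significant" genes even when nothing has changed.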
Differential Gene Expression
Multiple testing correction and the False Discovery Rate
• When many tests are performed on a data set, many results will meet the chosen significance level by chance alone.
• The p-values are often corrected to account for this problem.
• Bonferroni correction is the simplest correction.
  • The original p-value is multiplied by the number of comparisons to create a corrected p-value.
• FDR methods control the expected proportion of false positives among the results called significant
  • The FDR-adjusted p-values are often called q-values
  • If all genes with a q-value below a threshold, e.g. 0.05, are selected as differentially expressed, then the expected proportion of false discoveries in the selected group is controlled to be below that threshold, in this case 5%
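Both corrections are available in base R through `p.adjust`. The ten p-values below are made up for illustration; `"BH"` selects the Benjamini-Hochberg procedure, which is one common FDR method and is assumed here as a stand-in for whichever FDR method is actually used.

```r
# Made-up p-values from ten hypothetical tests.
p_values <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
              0.0278, 0.0298, 0.0344, 0.0459, 0.3240)

# Bonferroni: multiply by the number of tests (capped at 1).
bonferroni <- p.adjust(p_values, method = "bonferroni")

# Benjamini-Hochberg FDR-adjusted p-values.
bh <- p.adjust(p_values, method = "BH")

data.frame(p = p_values, bonferroni = bonferroni, bh = bh)

sum(bonferroni < 0.05)   # tests kept under Bonferroni
sum(bh < 0.05)           # tests kept under FDR control
```

Bonferroni keeps fewer tests than the FDR adjustment on the same data, which illustrates the trade-off between controlling false discoveries and losing true ones.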
Differential Gene Expression
• The distribution for each gene is compared between samples
• A p-value is given to that particular t-test comparison.
• This is typically corrected for FDR.
• Genes are ranked by p-value,
• and a cutoff is usually imposed for significance, say 0.05.
This produces …
• A list of “top 100” genes
• Lacks any connections or structure
• Where to start/stop?
• Unconscious bias to “favourite” genes
• How do I control “false discovery” without also controlling “discovery”?
Prediction of disease specific exon skipping events
GenomeGraphs Bioconductor package (with modifications for gene model selection).
Alternative splicing - FIRMA
Figure panels: probe variation, probe intensities, gene expression, residuals, alt splicing?
Legend: red = hOSC PD, green = hOSC control, black = non-hOSC
Alternative splicing – Other methods
• Other (Bioconductor) methods for exon analysis
  • aroma.affymetrix
  • exonmap
• Easy as this: exon analysis in exonmap (xmap) to find a set of differentially expressed genes
    library(exonmap)
    raw.data <- read.exon()                       # read the exon array data
    raw.data@cdfName <- "exon.pmcdf"              # select the exon CDF
    x.rma <- rma(raw.data)                        # RMA expression summaries
    pc.rma <- pc(x.rma, "group", c("a", "b"))     # compare groups "a" and "b"
    keep <- (abs(fc(pc.rma)) > 1) & (tt(pc.rma) < 1e-4)   # fold-change and t-test filters
    sigs <- featureNames(x.rma)[keep]             # names of the features passing both
• This ease of analysis is RARE (appreciate it!)
• Additional exon analysis is available here
  • http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2323405
Differential Gene Expression
What next (according to DAVID)?
• Identify enriched biological themes, particularly GO terms
• Discover enriched functional-related gene groups
• Cluster redundant annotation terms
• Visualize genes on BioCarta & KEGG pathway maps
• Display related many-genes-to-many-terms on 2-D view
• Search for other functionally related genes not in the list
• List interacting proteins
• Explore gene names in batch
• Link gene-disease associations
• Highlight protein functional domains and motifs
Differential Gene Expression
Pathway based approaches
• A list of “top ranking” pathways
• Includes functional connections between genes
• Dependent on underlying database and method
  • Ingenuity
  • DAVID
  • GSEA
• More functionally motivated, but:
  • False pathway detection
  • Overlapping pathways
  • Many pathways poorly understood
  • What about unknown genes?
Differential Gene Expression
Network based / Systems biology approaches
• A list of “top ranking” networks
• Essentially a looser pathway description, where interactions can be identified from multiple data sources of very different types
• Ingenuity can be used in this way (“grow network/pathway” functionality)
• Networks/pathways can be very complex
  • Transcription factors
  • miRNAs
  • Promoter status
  • Protein-protein interaction
  • Multiple levels of data
Differential Gene Expression
• Alternatives: classifier approaches
• Multivariate
  • Looking at combinations of genes that tell us something about the differences in the samples under study.
• Support Vector Machines
  • A supervised machine learning method: a ‘machine’ is trained on data of a known kind (diseased, or some other state) and tested on unseen data.
• Principal Components Analysis (PCA)
  • Standard procedure for finding combinations with greatest variance
  • Exploratory analysis, to see natural groupings and to detect outliers
  • To identify combinations of features that usefully characterise samples or genes.
  • Good for QC of arrays!
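As an exploratory example, PCA on an expression matrix takes only a few lines with base R's `prcomp`. The data below are simulated, with two groups of four samples that differ in 40 genes; the sample names and the size of the shift are assumptions for illustration.

```r
set.seed(7)
n_genes <- 200
# Simulated log2 expression: genes in rows, 8 samples in columns (made-up data).
expr <- matrix(rnorm(n_genes * 8), nrow = n_genes,
               dimnames = list(NULL, paste0("sample", 1:8)))
# Shift 40 genes in samples 5-8 so that the two groups differ.
expr[1:40, 5:8] <- expr[1:40, 5:8] + 4

# prcomp expects samples in rows, so transpose; each gene is centred.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

summary(pca)$importance[2, 1:3]   # proportion of variance for PC1 to PC3

# Samples that group together, or sit far from the rest, show up here.
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x), pos = 3)
```

Used for QC, the same plot on real arrays makes outlying samples or unexpected batch groupings easy to spot before any testing is done.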
SVMs, higher dimensions and kernels
F-Josef Müller et al. Nature 000, 1-5 (2008) doi:10.1038/nature07213
Sample collection and analysis for the stem cell matrix.
F-Josef Müller et al. Nature 000, 1-5 (2008) doi:10.1038/nature07213
Clusters of samples based on machine learning algorithm.
Microarray data analysis - multivariate
• GeneRaVE analysis
  • Use gene expression as a predictor of a response
    • continuous
    • multigroup
    • ordered categorical
    • survival
  • Integrated variable selection and model fitting
  • Able to handle many variables
  • Sparse classifiers: analyse gene expression values to find a small set of genes ‘predictive’ of a response
  • Sparse networks: build a network around the response of interest
  • https://www.bioinformatics.csiro.au
Local gene network – smoking
• KLF4: necessary for preventing entry into mitosis following DNA damage
• UCHL1: ubiquitin thiolesterase. Associated with Parkinson’s and Alzheimer’s due to expression in nerve cells.
• FABP6: binds fatty acids and bile acids.
• SLC7A11: cystine/glutamate transport system xc(-); mediates cystine entry into the cell in exchange for intracellular glutamate, which accumulates in response to oxidative stress.
• MMP10: regulated by reactive oxygen species, which also modulate tumor progression
Local gene network – colon biology
• Gastrointestinal expression
  • OLFM4: expressed in the inflamed colonic epithelium
  • SPINK4: gastrointestinal protease inhibitor
  • REG1A: expression is closely related to the invasiveness of gastric carcinomas
  • CA1: downregulated in GI mucosal neoplasms
• Membrane-bound transporters
  • AQP8: water channel expressed in pancreas and colon
  • SLC26A3: epithelial Cl- absorption and HCO3- secretion
• Hormone-controlled carbohydrate homeostasis, regulators of cellular glycogen release
  • GCG: induces glucose production
  • PYY: increases after meals
  • SST: interacts with pituitary growth hormone, thyroid stimulating hormone, and most hormones of the gastrointestinal tract
  • INSL5: insulin-like protein; has a newly identified receptor in the colon
• Cellular and developmental patterning
  • PRAC: predictor of colon position
  • HOXB13: important in embryonic patterning along the axis of the organism
  • CLDN8: determines cellular polarity; pathology in colon cancer
• Uncharacterised
Microarrays, Diagnostics in Cancer
• A gene-expression signature to predict survival in breast cancer across independent data sets
• A Naderi, A E Teschendorff, N L Barbosa-Morais, S E Pinder, A R Green, D G Powe, J F R Robertson, S Aparicio, I O Ellis, J D Brenton and C Caldas
Analysis - DAVID
• DAVID tools for functional analysis of gene lists
  • http://david.abcc.ncifcrf.gov/
• “The Database for Annotation, Visualization and Integrated Discovery (DAVID) 2008 is the sixth version of our original web-accessible programs. DAVID now provides a comprehensive set of functional annotation tools for investigators to understand biological meaning behind large list of genes.”
• We often use this one in house.
• Common questions
  • What are the major enriched GO terms?
  • What are the highly active pathways?
  • What are the frequently interacting proteins?
  • What are the known disease associations?
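The question "is this GO term enriched in my gene list?" can be illustrated with a plain hypergeometric test in base R. This is a generic sketch, not DAVID's own statistic, and all four counts below are hypothetical.

```r
# Hypothetical counts, for illustration only.
genes_on_array <- 20000   # background: all genes measured
genes_in_term  <- 400     # background genes annotated with the GO term
genes_in_list  <- 100     # size of the differentially expressed gene list
overlap        <- 10      # listed genes annotated with the term

# Overlap expected if the list were a random draw from the background.
expected_overlap <- genes_in_list * genes_in_term / genes_on_array

# P(overlap of at least 10) when drawing 100 genes without replacement.
p_enrichment <- phyper(overlap - 1,
                       genes_in_term,
                       genes_on_array - genes_in_term,
                       genes_in_list,
                       lower.tail = FALSE)

expected_overlap
p_enrichment
```

Because one such test is run per annotation term, the resulting p-values need multiple testing correction just like the per-gene tests. The choice of background (all genes on the array rather than all genes in the genome) also changes the answer.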
GSEA Analysis
• GSEA: Gene Set Enrichment Analysis
• Similar to DAVID
  • Gene sets are curated differently
• See tutorial
  • http://www.broad.mit.edu/gsea/doc/desktop_tutorial.jsp
• A note on gene/probe identifiers
  • When running the gene set enrichment analysis, it is critical that all of your data files use the same gene or probe identifiers. You can either use the probe identifiers native to your expression dataset, or collapse each probe set into a gene vector and use HUGO gene symbols as your identifiers.
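Collapsing probes to gene symbols can be sketched in base R as below. The probe IDs and expression values are made up, and taking the maximum across probes is just one possible rule (the median is another); this is a sketch of the idea, not a description of what any particular tool does internally.

```r
# Hypothetical probe-level table: probe IDs, gene symbols and two samples.
probe_expr <- data.frame(
  probe   = c("p1", "p2", "p3", "p4", "p5"),
  symbol  = c("KLF4", "KLF4", "UCHL1", "MMP10", "MMP10"),
  sample1 = c(7.1, 8.3, 5.0, 9.2, 6.4),
  sample2 = c(7.4, 8.0, 5.5, 9.9, 6.1)
)

# One row per gene symbol, keeping the largest probe value in each sample.
gene_expr <- aggregate(cbind(sample1, sample2) ~ symbol,
                       data = probe_expr, FUN = max)
gene_expr
```

Whichever rule is used, the gene sets and the expression data must end up keyed by the same identifiers before the enrichment analysis is run.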
Analysis – common TFs
• Commonly regulated genes have common transcription factors
  • Use sets of up/down regulated genes in your experiment to find common binding patterns
• TF databases
  • TRANSFAC
    • Large, commercial, noisy database
  • JASPAR
    • A curated, non-redundant set of 123 profiles, derived from published collections of experimentally defined transcription factor binding sites for multicellular eukaryotes.
    • Open data access, non-redundancy and quality
    • When to use JASPAR? When seeking models for specific factors or structural classes, or if experimental evidence is paramount
    • http://jaspar.genereg.net/
What next??
• Validation of results?
  • Good for biomarkers: RT-PCR.
  • Literature.
  • Other results in the lab.
  • Any independent datasets? Check GEO/ArrayExpress.
• Biological significance of results
  • Do your results suggest further experiments?
Gene Expression
• Goals revisited
  • To understand current high-throughput strategies for measuring gene expression
  • To understand quality control and normalisation in gene expression data
  • To understand the factors behind the choice of gene expression measurement strategy
  • To identify downstream analysis methods for gene expression data