
Alistair Chalk, 2008

Gene Expression

• Goals

• To understand current high throughput strategies for measuring gene expression

• To understand quality control and normalisation in gene expression data

• To understand the factors behind choice of gene expression measurement strategies

• To identify downstream analysis methods for gene expression data

Microarray analysis workflow

• Standard microarray analysis workflow

Gene Expression

• Data Collection

• How do you measure the ‘transcriptome’?

• Arrays

• Affymetrix (3’ arrays)

• Illumina

• cDNA arrays

• Boutique arrays

• Deep sequencing

• 454 (Roche)

• Solexa (Illumina)

• SOLiD (ABI)

• Helicos single molecule sequencing

• More to come, $1000 genome is the goal

Gene Expression

• Data Collection

• All these platforms (arrays) generate large amounts of data.

• Arrays: 50 000 to 70 000 data points (500 000 at probe level)

• Large numbers of samples or patients increase this 10-, 50- or 100-fold.

• Data Analysis

• Single comparisons, gene by gene, sample by sample.

• Clustering of genes or samples - common patterns or trends.

• Multivariate approach, identifying combinations of genes – biomarkers, gene ‘signatures’.

Gene Expression

• Affymetrix arrays

• Arrays for every model organism, and more.

• Human, mouse, rat, chicken, zebrafish, Drosophila, C. elegans, maize, Arabidopsis.

• These are really 3’ UTR arrays, as most probes are placed there for specificity.

• About 50-80K probesets per array.

• More than 0.5 million probes.

• Reproducible results, standard procedures.

Gene Expression

• Illumina ‘bead’ arrays

• Similar attributes to Affy arrays

• More even hybridisation, as this occurs in solution.

Gene Expression

• Affymetrix exon arrays

• Arrays for every predicted exon on the genome

• Human, mouse, rat so far

• 1.4 Million probesets

• More probes, but noisier

• Probe set regions have variable length

• Short probe regions (~exons) make probes harder to design

Gene Expression

• Affymetrix exon junction arrays

• Arrays for every predicted exon-junction on the genome

• Experimental – release late 2008

• Potentially noisy: junctions are quite similar, so cross-hybridisation is an issue

Gene Expression

• cDNA arrays

• Older technology.

• Low density

• Issues with evenness of spots.

• Important to control for technical variation within the array (blocks)

• Messy dye swaps used to control for labelling variations

• Plagued by spatial effects and inconsistencies.

Gene Expression

• Boutique arrays

• New array manufacturers (NimbleGen, etc.) allow custom arrays to be made.

• Suitable for focused approaches where the knowledge from research is turned into an array to survey a particular process or signature.

• Examples

• Alternative splicing (transcription)

• Human promoters (ChIP-chip)

• Genes expressed in a particular cancer subtype

• miRNA

• How to analyse these datasets?

• Which method do I use to measure my sample?

Gene Expression

• ‘Deep’ sequencing

• Fantastic advances in sequencing technologies.

• Shotgun approach.

• Millions of ‘reads’ possible.

• Profiling communities.

• Metagenomics

• Analysis of this kind of data is still under development

• Reference genome

• What to do with novel genomes? Homologies?

• Virtual microarrays!

Curves for the various samples show the number of orthologous groups seen in each per megabase of sequencing. Parentheses indicate the lower bound of the total number of orthologous groups for each sample.

"Comparative Metagenomics of Microbial Communities," Science 308: 554-557 (2005).

Sequencing technology development

Gene Expression

• Choice of technology

• Cost vs exploratory value

• The most exploratory methods are the most expensive and the least accessible

• Deep sequencing

• Exon arrays

• Exon-junction arrays

• “Standard” microarrays are cheap!

• Illumina ref6/8

• Affymetrix U133 Plus 2.0

• Study type and platform choice

• How much is known about the system already?

• Studies of well defined systems are less likely to discover new genes

• New or difficult to handle cell types are more likely to contain novel genes not on standard microarrays

• Can also combine multiple expression techniques

Complex transcriptome sampling strategy

Sample: hOSC expanded neurospheres

• Illumina beadarray, transcript expression (HumanRef-8_V2): highly supported genes, ~20k transcripts, low cost, 70+ samples

• Affymetrix exon array, exon expression (HuEx1.0ST): high coverage of the exons of genes and predicted genes, 1.4 million probe sets, moderate cost, 12+ samples

• Illumina CAGE, transcript start site (TSS) expression: discovery based TSS, ~27M 27bp tags, high cost, 4 samples

• Illumina beadarray, genotyping (Human610-Quad): HapMap, >620,000 markers, median marker spacing 2.7kb, low cost, 2+ samples

Axes: high cost / few samples to low cost / many samples; discovery based to well defined

Complex transcriptome sampling strategy

• Illumina beadarray, transcript expression: sample known genes and well defined variants

• Affymetrix exon array, exon expression: sample known and predicted genes and exons

• Illumina CAGE, transcript start site expression: unbiased TSS discovery and annotation

• Illumina beadarray, genotyping: sample known genotype variants

Interchangeable gene models: Ensembl, RefSeq, AceView, Vega

“Complexity is anxiety”

Analysis

• What do you want to do with your data?

• Quality control of array data – is your data good enough?

• Any outliers? Strange values?

• Differential gene expression (DE or DGE) between two or more samples.

• Normalisation

• Gene expression values

• Multiple comparisons

• A list of differentially expressed genes

• Identify common pathways differentially expressed

• Identify common TFs that bind to DE genes

• Identify biomarkers indicative of a particular trait or process.

• What does the expression of a biomarker look like?

• Identify gene signatures to allow you to classify a particular sample.

• Is there a particular pattern of expression associated with disease?

Quality Control

• Plot your data!

• Raw image

• Raw values

Quality Control – Short Read Data

• Bioconductor ShortRead package

• Currently Solexa/Illumina

• Will soon include

• SOLiD

• Short sequence data contains errors

• Currently few standards for this type of data beyond “matches to the genome with <=2 mismatches”

• Data needs to be filtered!
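The filtering step can be sketched in a few lines of base R. This is only an illustration: the table of reads, its column names and the quality cutoff are made up for the example and are not taken from ShortRead or any vendor pipeline; only the "<=2 mismatches" rule comes from the slide.

```r
# Hypothetical alignment summary: one row per read (all values are made up).
reads <- data.frame(
  id         = paste0("read", 1:6),
  mismatches = c(0, 1, 3, 2, 5, 0),       # mismatches against the reference genome
  mean_qual  = c(35, 30, 12, 28, 8, 33),  # mean base quality (Phred scale)
  stringsAsFactors = FALSE
)

max_mismatches <- 2    # the "<=2 mismatches" rule of thumb
min_mean_qual  <- 20   # assumed quality cutoff, not from the slides

# Keep only reads that pass both filters.
keep <- reads$mismatches <= max_mismatches & reads$mean_qual >= min_mean_qual
filtered <- reads[keep, ]
print(filtered$id)
```

In a real analysis the same logical-vector pattern would be applied to the quality and alignment information reported by the aligner.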

Short Read Data – genome alignment

• Data is useless until mapped to a known genome to identify transcribed elements

• Correct genome build

• Including splice junctions, ribosomal RNA, databases of likely contaminants

• Many methods for fast short read alignment

• Vendor specific

• ELAND – closed source

• Open source strategies

• Mapq, etc (many new methods being published)

• Helicos – open source platform

• Short sequence data contains errors

• Currently few standards for this type of data beyond “matches to the genome with <=2 mismatches”

Differential Gene Expression

Normalisation

• In order to make comparisons between chips of the same experiment possible, every chip has to be normalised

• Normalisation is an attempt to eliminate all the non-biological variation in microarray experiments without affecting the biological variations

• There is a danger that some or even most of the biological information will also be removed in normalisation

• It is good to keep in mind that the amount of normalisation should be minimised to avoid this

Median centering

• The median intensity of every chip is brought to the same value

• Achieved by calculating the median of the log ratios for one microarray and subtracting this median from the log ratio of every gene on that microarray

• Median centering does not change the spread of the data, so the original information content is not altered.

• Only appropriate when the differences between chips are linear, i.e. a constant shift on the log scale

• Quantile normalisation makes the whole intensity distribution, and therefore the median, identical across chips, so it includes median centering.

• Quantile normalisation

• Based on the assumption that the distribution of the intensity values is similar on every chip.

• Holds well for genes with low expression, but not necessarily well for highly expressed genes.
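Both normalisation methods are simple enough to sketch in base R. The data below are simulated and the function name quantile_normalise is made up for this example; for real arrays one would normally use an established implementation such as normalizeQuantiles() in limma or normalize.quantiles() in preprocessCore.

```r
set.seed(1)
# Simulated log2 intensities: 1000 genes (rows) x 3 chips (columns).
# Chips 2 and 3 carry an artificial shift and scale change.
x <- cbind(chip1 = rnorm(1000, mean = 8, sd = 2),
           chip2 = rnorm(1000, mean = 9, sd = 2),
           chip3 = rnorm(1000, mean = 8, sd = 3))

# Median centering: subtract each chip's median from every value on that chip.
chip_medians <- apply(x, 2, median)
centered <- sweep(x, 2, chip_medians, "-")

# Quantile normalisation: replace the k-th smallest value on every chip
# with the mean of the k-th smallest values across all chips.
quantile_normalise <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  target <- rowMeans(apply(m, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}
qn <- quantile_normalise(x)

print(apply(centered, 2, median))  # zero on every chip (up to rounding)
print(apply(centered, 2, sd))      # spread unchanged by centering
print(apply(qn, 2, quantile))      # identical quantiles on every chip
```

The output illustrates the trade-off: median centering leaves the different spreads of the chips untouched, while quantile normalisation forces every chip onto the same distribution, which is only safe if that assumption is biologically reasonable.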

Distribution of Intensities

• Most genes expressed at low levels

• Taking the log of these values brings them closer to a normal distribution

• There is still a skewed distribution!

Distribution of Intensities

• Signal and noise give the characteristic shape of the gene expression intensities.

• Now we can estimate gene expression (RMA, Li-Wong, etc.)

• … and then differential expression.

abundances + noise = observed values
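The shape described on these two slides can be reproduced with a small simulation. The sketch assumes, purely for illustration, an exponentially distributed signal (most genes at low abundance) plus normally distributed noise; the parameter values are invented and do not come from any real array.

```r
set.seed(2)
n <- 5000
signal <- rexp(n, rate = 1 / 200)        # abundances: most genes at low levels
noise  <- rnorm(n, mean = 100, sd = 15)  # background and measurement noise
observed <- signal + noise               # abundances + noise = observed values

log_observed <- log2(observed)

# Sample skewness: 0 for a symmetric distribution, > 0 for a long right tail.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

print(skewness(observed))      # strongly right-skewed on the raw scale
print(skewness(log_observed))  # much smaller after log2, but still positive
# hist(log_observed, breaks = 50)  # uncomment to see the characteristic shape
```

Taking logs pulls in the long right tail, yet the noise floor keeps the distribution asymmetric, which is why summarisation methods model the signal and noise components explicitly.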

• Probes and Transcript assignment

• Probes are sequences that hybridise to specific transcripts

• Have varying efficacy and specificity

• Often change between chip versions

• Do not always target the newly discovered genes

• Transcript models

• Change between genome releases

• Assignment of probe to transcript is done by the manufacturer (traditionally quite badly)

• Assignment is in the format of CDF files, so this can be changed to

• Statistical Testing

• Select an appropriate statistical test, e.g. t-test or ANOVA.

• Select a significance threshold for the p-value

• Form the pair of hypotheses you want to compare

• Calculate the test statistic

• Find out p-value which corresponds to the test statistic

• Draw conclusions

• The t-test and ANOVA assume that the values being compared are normally distributed.

• What is tested is the null hypothesis: that there is no difference in the means of the distributions being compared.

• How to apply this to many genes from a microarray experiment?
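One straightforward answer is to run the same test once per gene. The sketch below uses simulated data (group sizes, effect size and the number of spiked genes are all invented) and base R's t.test(); dedicated packages such as limma are generally preferred for real experiments with few replicates.

```r
set.seed(3)
n_genes <- 1000
# Simulated log2 expression: 5 control and 5 treated arrays per gene.
control <- matrix(rnorm(n_genes * 5, mean = 8), nrow = n_genes)
treated <- matrix(rnorm(n_genes * 5, mean = 8), nrow = n_genes)
# Spike in a true difference for the first 50 genes (hypothetical effect size).
treated[1:50, ] <- treated[1:50, ] + 3

# One Welch t-test per gene (row); H0: equal means in both groups.
p_values <- vapply(seq_len(n_genes), function(i) {
  t.test(control[i, ], treated[i, ])$p.value
}, numeric(1))

alpha <- 0.05
print(sum(p_values[1:50] < alpha))     # spiked genes called significant
print(sum(p_values[51:1000] < alpha))  # null genes called significant by chance
```

The second count is the reason the following slides on multiple testing matter: with hundreds of genes that have no real difference, some fall below the threshold by chance alone.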

p-values

• The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The chosen threshold is the risk of rejecting the null hypothesis when it is actually true.

• When a cutoff for the p-value is decided, the values below the threshold are considered statistically significant, and the values above the threshold are considered not statistically significant.

• Often a threshold of 0.05 is used. This means that when there is no real difference between groups, about one test in 20 will still be called statistically significant by chance alone.

Multiple testing correction and the False Discovery Rate

• When many analyses are performed on a data set, many results will meet the arbitrary significance level by chance alone.

• Often the p-value is corrected to account for this problem.

• Bonferroni correction is the simplest correction.

• The original p-value is multiplied by the number of comparisons to create a new corrected p-value (capped at 1).

• FDR methods control the expected proportion of false positives among the genes called significant

• The FDR adjusted p-values are often called q-values

• If all genes with a q-value below a threshold, e.g. 0.05, are selected as differentially expressed, then the expected proportion of false discoveries in the selected group is controlled to be less than the threshold value, in this case 5%
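The two corrections can be worked through by hand on a small example. The ten p-values below are invented for illustration, and the FDR method shown is the Benjamini-Hochberg procedure (one common choice; the slides do not name a specific method). Both hand calculations are checked against the built-in p.adjust().

```r
# Hypothetical raw p-values for 10 genes (made up for illustration).
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
       0.0278, 0.0298, 0.0344, 0.0459, 0.3240)
m <- length(p)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
p_bonf <- pmin(1, p * m)

# Benjamini-Hochberg: p * m / rank, then enforce monotonicity
# working down from the largest p-value.
o <- order(p, decreasing = TRUE)
p_bh <- numeric(m)
p_bh[o] <- pmin(1, cummin(p[o] * m / (m:1)))

# Both agree with the built-in implementation in stats::p.adjust().
stopifnot(isTRUE(all.equal(p_bonf, p.adjust(p, method = "bonferroni"))))
stopifnot(isTRUE(all.equal(p_bh,   p.adjust(p, method = "BH"))))

print(sum(p < 0.05))       # 9 genes pass the raw threshold
print(sum(p_bonf < 0.05))  # 3 genes survive Bonferroni
print(sum(p_bh < 0.05))    # 8 genes survive BH at 5% FDR
```

Bonferroni is the stricter of the two here, which is the usual pattern: it guards against any false positive at all, whereas FDR control accepts a small, known proportion of false discoveries in exchange for more discoveries.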

• The distribution for each gene is compared between samples

• A p-value is given to that particular t-test comparison.

• This is usually corrected for multiple testing (FDR).

• Genes are ranked by p-value,

• and a cut off is usually imposed for significance, say 0.05.

This produces ...

• A list of “top 100” genes

• Lacks any connections or structure

• Where to start/stop?

• Unconscious bias to “favourite” genes

• How do I control “false discovery” without also controlling “discovery”?

Prediction of disease specific exon skipping events

GenomeGraphs Bioconductor package (with modifications for gene model selection).

Alternative splicing - FIRMA

Figure panels: probe variation, probe intensities, gene expression, residuals (alt splicing?)

Legend: red – hOSC PD; green – hOSC control; black – non hOSC

Alternative splicing – Other methods

• Other (bioconductor) methods for exon analysis• aroma.affymetrix

• exonmap

• Easy as this: exon analysis in xmap to find a set of differentially expressed genes

• library(exonmap)

• raw.data <- read.exon()

• raw.data@cdfName <- "exon.pmcdf"

• x.rma <- rma(raw.data)   # RMA expression summaries

• pc.rma <- pc(x.rma, "group", c("a", "b"))   # compare group "a" with group "b"

• keep <- (abs(fc(pc.rma)) > 1) & (tt(pc.rma) < 1e-4)   # fold change and t-test filters

• sigs <- featureNames(x.rma)[keep]

• This ease of analysis is RARE (appreciate it!)

• Additional exon analysis is available here

• http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2323405

What next (according to DAVID)?

• Identify enriched biological themes, particularly GO terms

• Discover enriched functional-related gene groups

• Cluster redundant annotation terms

• Visualize genes on BioCarta & KEGG pathway maps

• Display related many-genes-to-many-terms on 2-D view.

• Search for other functionally related genes not in the list

• List interacting proteins

• Explore gene names in batch

• Link gene-disease associations

• Highlight protein functional domains and motifs

Pathway based approaches

• A list of “top ranking” pathways

• Includes functional connections between genes

• Dependent on underlying database and method

• Ingenuity

• DAVID

• GSEA

• More functionally motivated, but

• False pathway detection

• Overlapping pathways

• Many pathways poorly understood

• What about unknown genes?

Network based / Systems biology approaches

• A list of “top ranking” networks

• Essentially a looser pathway description where interactions can be identified from multiple data sources of very different types

• Ingenuity can be used in this way (“grow network/pathway” functionality)

• Networks/Pathways can be very complex

• transcription factors

• miRNAs

• promoter status

• Protein-protein interaction

• Multiple levels of data

• Alternatives – Classifier Approaches

• Multivariate

• Looking at combinations of genes that tell us something about the differences in the samples under study.

• Support Vector Machines

• A supervised machine learning method: a ‘machine’ is trained on data with known labels (e.g. diseased or healthy) and then tested on unseen data.

• Principal Components Analysis (PCA)

• Standard procedure for finding the combinations of variables with the greatest variance

• Exploratory analysis, to see natural groupings and to detect outliers

• To identify combinations of features that usefully characterize samples or genes.

• Good for QC of arrays!
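As a rough illustration of PCA used for array QC, the sketch below runs base R's prcomp() on a simulated expression matrix. The sample names, group sizes and the size of the group effect are all assumptions made for the example.

```r
set.seed(4)
# Simulated log2 expression: 200 genes x 8 arrays, two groups of four.
expr <- matrix(rnorm(200 * 8, mean = 8), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               paste0(rep(c("ctrl", "case"), each = 4), 1:4)))
expr[1:40, 5:8] <- expr[1:40, 5:8] + 3   # group effect in 40 genes (assumed)

# prcomp() expects samples in rows, so transpose the genes x arrays matrix.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:3], 2))

# Scores on the first two components: one row per array.
scores <- pca$x[, 1:2]
print(round(scores, 1))
# plot(scores, col = rep(c("black", "red"), each = 4), pch = 19)
```

In a plot of the scores, arrays from the same group should sit together; an array far from its own group is a candidate outlier worth checking before any differential expression analysis.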

SVMs, higher dimensions and kernels

F-Josef Müller et al. Nature 000, 1-5 (2008) doi:10.1038/nature07213

Sample collection and analysis for the stem cell matrix.

Clusters of samples based on machine learning algorithm.

Microarray data analysis - multivariate

• GeneRaVE analysis

• Use gene expression as a predictor of a response

• continuous

• multigroup

• ordered categorical

• survival

• Integrated variable selection and model fitting

• Able to handle many variables

• Sparse classifiers – analyse gene expression values to find a small set of genes ‘predictive’ of a response

• Sparse networks – build a network around the response of interest

• https://www.bioinformatics.csiro.au

KLF4 is necessary for preventing the entry into mitosis following DNA damage

- UCHL1, ubiquitin thiolesterase. Associated with Parkinson’s and Alzheimer’s due to expression in nerve cells.

- FABP6, binds fatty acids, bile acids.

- SLC7A11, cystine/glutamate transport system xc(-), mediates cystine entry into the cell in exchange for intracellular glutamate, which accumulates in response to oxidative stress.

- MMP10, regulated by reactive oxygen species, which also modulate tumor progression

Local gene network – smoking

Gastrointestinal expression

- OLFM4, expressed in the inflamed colonic epithelium

- SPINK4, gastrointestinal protease inhibitor

- REG1A, expression is closely related to the carcinoma invasiveness of gastric neoplasms

- CA1, downregulated in GI mucosal neoplasms

Membrane-bound transporters

- AQP8, water channel expressed in pancreas and colon

- SLC26A3, epithelial Cl- absorption and HCO3- secretion

Hormone controlled carbohydrate homeostasis, regulators of cellular glycogen release

- GCG, induces glucose production

- PYY, increases after meals

- SST, interacts with pituitary growth hormone, thyroid stimulating hormone, and most hormones of the gastrointestinal tract

- INSL5, insulin-like protein, has a newly identified receptor in the colon

Cellular and developmental patterning

- PRAC, predictor of colon position

- HOXB13, important in embryonic patterning along the axis of the organism

- CLDN8, determines cellular polarity, pathology in colon cancer

uncharacterised

Local gene network – colon biology

Microarrays, Diagnostics in Cancer

• A gene-expression signature to predict survival in breast cancer across independent data sets

• A Naderi, A E Teschendorff, N L Barbosa-Morais, S E Pinder, A R Green, D G Powe, J F R Robertson, S Aparicio, I O Ellis, J D Brenton and C Caldas

Analysis - DAVID

• DAVID tools for functional analysis of gene lists

• http://david.abcc.ncifcrf.gov/

• “The Database for Annotation, Visualization and Integrated Discovery (DAVID) 2008 is the sixth version of our original web-accessible programs. DAVID now provides a comprehensive set of functional annotation tools for investigators to understand biological meaning behind large list of genes.”

• We use this one in house often.

• Common questions

• What are major enriched GO terms?

• What are the highly active pathways?

• What are the frequently interacting proteins?

• What are the known disease associations?
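The idea behind "enriched GO terms" can be sketched with a plain over-representation test in base R. All counts below are hypothetical, and this is the textbook Fisher exact / hypergeometric test rather than DAVID's own scoring, which uses a modified version of the Fisher exact test.

```r
# Hypothetical counts (not real data).
universe_size <- 20000  # genes measured on the array
term_size     <- 400    # genes annotated to the GO term
list_size     <- 200    # differentially expressed genes
overlap       <- 15     # DE genes annotated to the term

# Overlap expected by chance alone.
expected <- list_size * term_size / universe_size
print(expected)  # 4

# 2x2 table: rows = annotated to term (yes/no), columns = in DE list (yes/no).
tab <- matrix(c(overlap,             list_size - overlap,
                term_size - overlap, universe_size - term_size - list_size + overlap),
              nrow = 2,
              dimnames = list(term = c("yes", "no"), de_list = c("yes", "no")))

# One-sided Fisher exact test for over-representation.
ft <- fisher.test(tab, alternative = "greater")
print(ft$p.value)

# The same p-value from the hypergeometric distribution: P(X >= overlap).
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  list_size, lower.tail = FALSE)
print(p_hyper)
```

Because one such test is run per term, the resulting p-values need the same multiple testing correction as the gene-level tests.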

GSEA Analysis

• GSEA: Gene Set Enrichment Analysis

• Similar to DAVID

• Gene sets are curated differently

• See tutorial

• http://www.broad.mit.edu/gsea/doc/desktop_tutorial.jsp

• A note on gene/probe identifiers

• When running the gene set enrichment analysis, it is critical that all of your data files use the same gene or probe identifiers. You can either use the probe identifiers native to your expression dataset, or collapse each probe set into a gene vector and use HUGO gene symbols as your identifiers
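Collapsing probe sets to gene symbols might look like the following in base R. The probe names, expression values and probe-to-gene mapping are invented, and keeping the maximum probe value per gene is just one possible rule (taking the median is another); it is an assumption here, not a recommendation from the slides.

```r
# Hypothetical probe-level expression matrix (probes x samples).
expr <- matrix(c(7.1,  7.3,
                 9.8, 10.1,
                 5.2,  5.0,
                 8.4,  8.9),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("probe_1", "probe_2", "probe_3", "probe_4"),
                               c("sample_A", "sample_B")))

# Hypothetical probe-to-gene annotation; two probes target the same gene.
symbol <- c(probe_1 = "TP53", probe_2 = "TP53",
            probe_3 = "KLF4", probe_4 = "UCHL1")

# Group row indices by gene symbol, then keep the maximum value
# per sample within each group.
genes <- symbol[rownames(expr)]
collapsed <- do.call(rbind, lapply(split(seq_len(nrow(expr)), genes), function(idx) {
  apply(expr[idx, , drop = FALSE], 2, max)
}))
print(collapsed)
```

The result has one row per gene symbol, so it can be matched directly against gene sets that are defined by symbol.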

Analysis – common TFs

• Commonly regulated genes have common transcription factors

• Use sets of up/down regulated genes in your experiment to find common binding patterns

• TF databases

• TRANSFAC

• Large, commercial, noisy database

• JASPAR

• a curated, non-redundant set of 123 profiles, derived from published collections of experimentally defined transcription factor binding sites for multicellular eukaryotes.

• open data access, non-redundancy and quality

• When to use JASPAR? When seeking models for specific factors or structural classes, or if experimental evidence is paramount

• http://jaspar.genereg.net/
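To show what a binding profile is used for, here is a minimal sketch of scanning a sequence with a position weight matrix in base R. The 4 bp count matrix, the pseudocount, the uniform background and the test sequence are all invented; this is not a real JASPAR or TRANSFAC profile, and only the forward strand is scanned.

```r
# Hypothetical position frequency matrix for a 4 bp motif (counts made up).
pfm <- matrix(c(8,  0, 0, 1,    # A
                1,  0, 9, 0,    # C
                0, 10, 0, 1,    # G
                1,  0, 1, 8),   # T
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "C", "G", "T"), NULL))

# Convert counts to log-odds scores against a uniform background.
pseudocount <- 0.5
probs <- sweep(pfm + pseudocount, 2, colSums(pfm + pseudocount), "/")
pwm <- log2(probs / 0.25)

# Score one candidate site: sum the matrix entries for its bases.
score_site <- function(site) {
  bases <- strsplit(site, "")[[1]]
  sum(pwm[cbind(match(bases, rownames(pwm)), seq_along(bases))])
}

# Slide a window along the sequence and score every position.
scan_sequence <- function(sequence) {
  w <- ncol(pwm)
  starts <- seq_len(nchar(sequence) - w + 1)
  sites <- substring(sequence, starts, starts + w - 1)
  data.frame(start = starts, site = sites,
             score = vapply(sites, score_site, numeric(1)),
             stringsAsFactors = FALSE, row.names = NULL)
}

hits <- scan_sequence("TTAGCTGGAGCAAT")
print(hits[which.max(hits$score), ])  # best match: AGCT at position 3
```

Applied to the promoters of up or down regulated genes, counting high-scoring sites per profile is one way to look for transcription factors shared by a gene set.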

What next??

• Validation of results?

• Good for biomarkers, RT-PCR.

• Literature.

• Other results in the lab.

• Any independent datasets? – check GEO/ArrayExpress.

• Biological significance of results

• Do your results suggest further experiments?

Gene Expression

• Goals revisited

• To understand current high throughput strategies for measuring gene expression

• To understand quality control and normalisation in gene expression data

• To understand the factors behind choice of gene expression measurement strategies

• To identify downstream analysis methods for gene expression data
