Microarray Gene Expression Analysis Differential expression, clustering, networks, and functional enrichment STEMREM 201 Fall 2012 Aaron Newman, Ph.D. 10/17/12


Microarray Gene Expression Analysis

Differential expression, clustering, networks, and functional enrichment

STEMREM 201 Fall 2012

Aaron Newman, Ph.D.

10/17/12


A genomics approach to biology involves…

– Finding significant patterns in high-throughput data
– Interpreting these patterns in the context of prior knowledge
– Generating new hypotheses and predictions

• A plethora of techniques and tools exist…

• My goal is to introduce some practical, powerful, and freely available methods for gene expression analysis


Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning


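Materials and Methods

• Data: GSE22651 (ESCs and differentiated cells)
• DEGs: Excel and GenePattern
• Cluster analysis: hierarchical clustering and AutoSOME
• Gene set enrichment: GSEA
• Functional enrichment: DAVID and Toppfun
• Network analysis: STRING and Cytoscape
• Biological meaning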


Microarray normalization

• Probe/array-level normalization (reduces inter-array technical error and noise)
  – Raw CEL files → merged and normalized text file
  – Robust Multi-Chip Averaging (RMA) for Affymetrix arrays
    • Affymetrix Gene Expression Console software
  – Quantile normalization for Agilent, Illumina, and others
    • Sets all arrays to the same distribution (mean, median, sd, etc.)

• Analysis-level normalization
  – Log2 transformation
    • Improves statistical properties for analysis (expression values are approximately log-normal)
  – Median/mean centering
    • Useful for reducing the impact of transcript abundance on identifying/visualizing co-expressed genes
  – Unit variance
    • Standardize each column (array) to a mean of 0 and a standard deviation of 1
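The analysis-level steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and the toy matrix are assumptions made for the example and are not part of the course materials.

```python
import math
import statistics


def log2_transform(matrix):
    """Log2-transform every value (all values must be positive)."""
    return [[math.log2(v) for v in row] for row in matrix]


def median_center_genes(matrix):
    """Subtract each gene's (row's) median so that its median becomes 0."""
    centered = []
    for row in matrix:
        med = statistics.median(row)
        centered.append([v - med for v in row])
    return centered


def unit_variance_arrays(matrix):
    """Standardize each array (column) to mean 0 and standard deviation 1."""
    scaled_cols = []
    for col in zip(*matrix):
        mean = statistics.mean(col)
        sd = statistics.stdev(col)  # sample standard deviation
        scaled_cols.append([(v - mean) / sd for v in col])
    # Transpose back to genes x arrays.
    return [list(row) for row in zip(*scaled_cols)]


# Toy data: 3 genes (rows) x 4 arrays (columns); the values are made up.
raw = [
    [100.0, 200.0, 400.0, 800.0],
    [50.0, 50.0, 100.0, 100.0],
    [8.0, 16.0, 8.0, 16.0],
]
logged = log2_transform(raw)
centered = median_center_genes(logged)
standardized = unit_variance_arrays(logged)
```

Centering is applied per gene (row) and unit variance per array (column), matching the wording on the slide. Tools may apply these steps in a different order or combination.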


Differentially expressed genes (DEGs)

• Goal: given known phenotypes, determine which genes exhibit significant differential expression.

• Many genes are tested at once, so a multiple hypothesis testing correction should be applied, for example to control the false discovery rate (FDR).
  – Bonferroni (α / n); controls the family-wise error rate and is more conservative than FDR methods
  – Benjamini-Hochberg (find the largest k such that P(k) ≤ (αk) / n, where P(k) is the k-th smallest p-value)
  – Storey q-value (p-value-specific FDR)
    • Q-value software: http://genomics.princeton.edu/storeylab/qvalue/

• Sample permutations can improve p-value accuracy
  – Only apply with ≥ 10 samples per class
  – Implemented in GenePattern (ComparativeMarkerSelection module) and SAM
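To make the two cutoffs concrete, here is a small sketch of the Bonferroni threshold and the Benjamini-Hochberg "largest k" rule. The function names and the toy p-values are assumptions for illustration only.

```python
def bonferroni_cutoff(alpha, n):
    """Per-test p-value cutoff under Bonferroni correction."""
    return alpha / n


def benjamini_hochberg(p_values, alpha):
    """Return a list of booleans (input order); True means significant.

    Finds the largest rank k with P(k) <= alpha * k / n and calls the
    k smallest p-values significant.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            largest_k = rank
    significant = [False] * n
    for idx in order[:largest_k]:
        significant[idx] = True
    return significant


# Toy p-values (made up for the example).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
alpha = 0.05
bonf = [x <= bonferroni_cutoff(alpha, len(p)) for x in p]
bh = benjamini_hochberg(p, alpha)
print(sum(bonf), sum(bh))  # prints "1 2"
```

With these toy values Bonferroni keeps one test and Benjamini-Hochberg keeps two, which illustrates why FDR control is usually preferred when thousands of genes are tested.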


Supervised DEG identification tools (i.e., classes are known)

• Likely familiar
  – Excel (t-test) coupled with FDR assessment
    • Watch for conversion of gene symbols to dates! (e.g., MARCH6 → 3/6)

• Basic to intermediate
  – Significance Analysis of Microarrays (SAM; Windows only)
  – GenePattern (Broad Institute)
  – p.adjust (R)

• Advanced
  – Bioconductor packages in R (e.g., limma)
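The date-conversion warning above can be checked programmatically after a file has passed through a spreadsheet. The sketch below flags identifiers that look like dates; the regular expression and the function name are assumptions for this example and may not cover every format a spreadsheet can produce.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Matches strings such as "6-Mar", "Mar-6", "3/6", or "3/6/2012".
DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}[-/]({MONTHS})|({MONTHS})[-/]\d{{1,2}}|\d{{1,2}}/\d{{1,2}}(/\d{{2,4}})?)$",
    re.IGNORECASE,
)


def find_date_like_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Hypothetical gene-symbol column read back from a spreadsheet export.
column = ["MARCH6", "6-Mar", "3/6", "POU5F1", "2-Sep", "SEPT2"]
suspicious = find_date_like_symbols(column)
```

Intact symbols such as MARCH6 and SEPT2 are not flagged; only the date-like strings are returned for manual repair.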


Tutorial 1: Identifying DEGs in Excel using an FDR cutoff of 1% (BH method)

1) Open “DEG_example_large.txt” in Excel (2 classes, 25151 genes).

2) In column L, use the T-test function to test for significant differential expression between ESCs and non-ESCs.
   • =TTEST(“ESCs”, “non-ESCs”, 2, 3), where “ESCs” and “non-ESCs” stand for the cell ranges of the two classes (e.g., B2:F2 and G2:K2)
   • 2-tailed, unpaired with unequal variance

3) Sort p-values in column L in ascending order.

4) In column M, input the p-value rank, going from 1, 2, 3 … n.

5) Input the following formula into column N to test for an FDR of 1%
   • In general: =(0.01 * rank) / n
   • In our case: =(0.01 * M2) / 25151
   • Autocomplete column N

6) In column O, test for p-values that do not exceed an FDR of 1%
   • =IF(L2 <= N2, 1, 0)

7) That’s it! All genes with a 1 in column O are significant at an FDR of 1%.
   • Should be 3237 significant genes.
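For readers who want to see what the unpaired, unequal-variance test in step 2 computes, the sketch below calculates Welch's t statistic and its approximate degrees of freedom with the standard library. The function name and the toy expression values are assumptions for illustration. Turning the statistic into a p-value requires the t distribution, which the standard library does not provide (scipy.stats.ttest_ind with equal_var=False is one option).

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = statistics.variance(a) / len(a)  # squared standard error, class a
    vb = statistics.variance(b) / len(b)  # squared standard error, class b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Toy log2 expression values for one gene (made up for the example).
escs = [9.1, 9.4, 8.9, 9.3, 9.0]
non_escs = [6.2, 7.1, 5.8, 6.9, 6.5]
t_stat, dof = welch_t(escs, non_escs)
```

A large positive statistic means higher expression in the first class. The degrees of freedom always fall between the smaller class size minus one and the pooled value n1 + n2 − 2.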


Cleaning up the result and creating a heat map

1) Calculate the log2 fold change in column P
   • =AVERAGE(B2:F2) - AVERAGE(G2:K2)
   • Autocomplete

2) Use a filter to isolate significant genes with absolute log2 fold change ≥ 5

3) Copy and paste into a new tab

4) Sort by fold change in descending order

5) Save as a new file (e.g., DEG_example_sorted_fdr01_fold5.txt)

6) Center genes in Cluster 3.0
   • Open file
   • Go to “Adjust Data”
   • Check “Center genes” and select “Median”
   • Press “Apply”, then save the file (e.g., DEG_example_sorted_fdr01_fold5_medcen.txt)

7) Open in Java TreeView.
   • To customize heat map display and text, use “Settings>Pixel Settings” and “Settings>Font Settings”.
   • Export the heat map image using “Export>Save Tree Image”.
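The fold-change calculation, filter, and sort in steps 1 to 4 can be sketched as follows. The example assumes the values are already on a log2 scale (so a difference of means is a log2 fold change) and that the first five columns belong to one class, mirroring B:F and G:K in the spreadsheet. Gene names, values, and function names are hypothetical.

```python
import statistics


def log2_fold_change(row, n_class1):
    """Mean of class 1 minus mean of class 2, for log2-scale values."""
    return statistics.mean(row[:n_class1]) - statistics.mean(row[n_class1:])


def filter_and_sort(genes, n_class1, min_abs_lfc):
    """Keep genes with |log2 fold change| >= cutoff, sorted in descending order."""
    scored = [(name, log2_fold_change(vals, n_class1)) for name, vals in genes.items()]
    kept = [(name, lfc) for name, lfc in scored if abs(lfc) >= min_abs_lfc]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy data: 5 samples per class, values made up for the example.
genes = {
    "GENE_A": [12.0, 12.5, 11.5, 12.0, 12.0, 6.0, 6.5, 5.5, 6.0, 6.0],
    "GENE_B": [7.0, 7.2, 6.8, 7.1, 6.9, 7.0, 7.1, 6.9, 7.0, 7.0],
    "GENE_C": [4.0, 4.5, 3.5, 4.0, 4.0, 10.0, 9.5, 10.5, 10.0, 10.0],
}
result = filter_and_sort(genes, n_class1=5, min_abs_lfc=5.0)
```

Here GENE_A (log2 fold change of 6) and GENE_C (−6) pass the cutoff, while GENE_B does not. In the tutorial the same filter is applied only to genes already flagged as significant.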


Expected result

FDR ≤ 1%, fold change ≥ abs(2^5) = 79 genes


Tutorial 2: Identifying DEGs in GenePattern

1) Create account (if you have not already) and log in to GenePattern

2) Select “Differential Expression Analysis”.

3) Skip to ComparativeMarkerSelection (step 2) and click “Open module”.

4) Enter following fields:• input* file = GSEA_example_expression_large.gct• cls* file = GSEA_example_classes.cls• number of permutations* = 0

5) Select “Run”

6) When processing is finished, select “ComparativeMarkerSelectionViewer” from the pulldown menu next to “GSEA_example_expression_large.comp.marker.odf”

7) Select “Run”

8) Select “Open Visualizer” and “Allow”

9) A table will appear showing all genes and various statistics, including:
   – Benjamini-Hochberg corrected p-value; FDR(BH)
   – Storey q-value; Q Value
   – Bonferroni p-value
   – Fold change


Extracting data and displaying as a heat map

1) Pipe .odf file to “ExtractComparativeMarkerResults” module.

2) Select genes with BH FDR ≤ 1%.

3) Filtered data are available as a new .gct file.

4) To display a heat map, pipe .gct file to “HeatMapViewer” module.

5) Select “Run”


Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning


The “Clustering Problem” for Large Data Sets

Widespread Importance

– Genomics
– Phylogenetics
– Disease
– Galaxy Clusters
– etc., etc., etc.


Common clustering methods

Figure 1. D’haeseleer, Nat Biotechnol. 2005

Panels: Toy data set; Hierarchical (Eisen); K-means; SOM


Feature Comparison

Columns: Method | Handle Large Datasets | Diverse Cluster Shapes | Detect Cluster Number | Identify Outliers | Low Output Variance

Hierarchical: √ √* √

K-Means: √ √*

Self-Organizing Map (SOM)


Tutorial 3: Hierarchical Clustering

1) Open “GSE22651_filtered.txt” in Cluster 3.0

2) Normalize data (Adjust Data tab)
   • Check “Log transform data”
   • Check “Center genes” and select “Median”
   • Press “Apply”

3) Cluster data (Hierarchical tab)
   • Check “Cluster” under Genes and under Arrays
   • Leave “Similarity Metric” at Uncentered Correlation

4) Press “Centroid linkage” under “Cluster method”
   • Centroid = distance between cluster centers
   • Single = closest distance between clusters
   • Complete = farthest distance between clusters
   • Average = mean of all pairwise distances between clusters

5) Open the clustered data table file (*.cdt) in Java TreeView.
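The four linkage definitions in step 4 can be compared on a toy example. The sketch below uses Euclidean distance purely for readability; in the tutorial the distance would be derived from the selected similarity metric. The uncentered correlation shown is the standard cosine-style formula, and all function names and points are assumptions for this illustration.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def linkage_distances(c1, c2, dist=euclidean):
    """Distance between two clusters under each linkage rule."""
    pairwise = [dist(p, q) for p in c1 for q in c2]
    return {
        "centroid": dist(centroid(c1), centroid(c2)),
        "single": min(pairwise),
        "complete": max(pairwise),
        "average": sum(pairwise) / len(pairwise),
    }


def uncentered_correlation(x, y):
    """Correlation computed without subtracting the means (cosine similarity)."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


# Two toy clusters of 2-D points (made up for the example).
c1 = [[0.0, 0.0], [0.0, 2.0]]
c2 = [[3.0, 0.0], [5.0, 0.0]]
d = linkage_distances(c1, c2)
```

Single linkage always gives the smallest of the pairwise-based distances and complete linkage the largest, with average linkage in between.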


Java TreeView

Navigate the cluster tree to highlight genes with distinct expression patterns in particular samples

Use “Export>Save List” to copy or save gene lists of interest for further analysis.


Expected result

Genes

Samples


AutoSOME: Automatic clustering of Self-Organizing Map Ensembles

Serial application of:
– SOM
– Density Equalization
– Minimum Spanning Tree
– Ensemble Averaging

Newman and Cooper (2010) BMC Bioinformatics, 11:117


AutoSOME

Columns: Method | Handle Large Datasets | Diverse Cluster Shapes | Detect Cluster Number | Identify Outliers | Low Output Variance

Hierarchical: √ √* √

K-Means: √ √*

Self-Organizing Map (SOM)

Affinity Propagation: √* √ √ √

Spectral Clustering: √* √ √

nNMF: √ √ √

AutoSOME: √ √ √ √ √


AutoSOME Webstart: http://jimcooperlab.mcdb.ucsb.edu/autosome


Tutorial 4: Identifying discrete clusters without prior knowledge of cluster number

1) Launch the AutoSOME GUI via the large launch button.

2) Open “GSE22651_filtered.txt”

3) Skip filtering

4) Show “Basic Fields”
   • Set p-value = 0.05

5) Show “Input Adjustment”
   • Check Log2 Scaling, Unit Variance, Median Center, and Sum Squares = 1

6) Press the large “Run” button on the left.

7) From the menubar, select View>heat map>green red.

8) Select cluster 1 in the cluster list.

9) The data are rendered as a normalized heat map. To change the display, go to View>settings>image settings.
   • Under the Normalization tab, check “Display Original Data”, “Log2 Scaling”, and “Median Center”.
   • Check “Manually adjust range for contrast” and set minimum to -2 and maximum to 2. Press “Update” (lower left corner).
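Two of the settings above have simple arithmetic interpretations, sketched here under stated assumptions: “Sum Squares = 1” is taken to mean scaling each gene's expression vector to unit length, and the manual contrast range is taken to mean clamping displayed values to [-2, 2]. How AutoSOME implements these options internally is not described in the slides, so treat the functions below as illustrative only.

```python
import math


def scale_sum_squares_to_one(row):
    """Scale a vector so that the sum of its squared values equals 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


def clip_for_display(row, lo=-2.0, hi=2.0):
    """Clamp values to [lo, hi], as a fixed heat map contrast range would."""
    return [max(lo, min(hi, v)) for v in row]


# Toy median-centered log2 values for one gene (made up for the example).
gene = [-3.0, 0.5, 4.0]
unit_length = scale_sum_squares_to_one(gene)
displayed = clip_for_display(gene)
```

Clamping affects only how the heat map is colored; it does not change the data used for clustering.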


AutoSOME continued…

1) Select clusters 1 to 5 (hold down shift). Right-click in the heat map window to resize.

2) Set “Zoom Factor” to 40 and press “Save”.

3) See website for further tutorials and documentation.

Representative heat map (View>heat map>rainbow)


Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning


How to interpret clustering results

• Gene set functional annotation
  – DAVID: http://david.abcc.ncifcrf.gov/
  – Toppfun: http://toppgene.cchmc.org/enrichment.jsp
  – MSigDB: http://www.broadinstitute.org/gsea/msigdb/

• Network analysis
  – STRING: http://string-db.org/


Gene sets: where do they come from?

• Gene ontology
  – An attempt to semantically organize genes and their functional relationships.
  – Data are arranged in a graph structure, from broad to specific
  – Ontologies:
    • Biological process (BP): series of ordered events
    • Molecular function (MF): activities that occur at the molecular level
    • Cellular component (CC): part of a cell

• Biocarta/KEGG pathways (curated wiring diagrams)

• High-throughput studies (e.g., MSigDB)


Tutorial 5: Functional annotation of a large cluster using DAVID

1) In AutoSOME, find the cluster of genes with higher expression in stem cells.

2) Go to View>raw data. Highlight all genes in the cluster and copy the list.

3) Open the DAVID homepage and press “Start Analysis”

4) Select “Upload” tab and paste in gene list.

5) Under “Select Identifier,” select “OFFICIAL_GENE_SYMBOL”.

6) Select “Gene List” for “List Type”.

7) Submit the list.

8) Press OK at multi-species warning message.

9) Select “Homo sapiens” and press “Select Species”.

10) Select “Functional Annotation Tool”

11) Press “Functional Annotation Clustering” button.

http://david.abcc.ncifcrf.gov/


DAVID Output

Pathways

Gene sets

Similar gene sets are clustered together, eliminating redundancy and facilitating interpretation
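The idea behind a gene set enrichment p-value can be illustrated with a plain hypergeometric test: how likely is it to see at least the observed overlap between a gene list and a gene set by chance? This is a simplified sketch and not DAVID's exact scoring, which uses a modified Fisher exact test. The function name and the numbers are assumptions for illustration.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random,
    without replacement, from n_background genes of which n_in_set
    belong to the gene set."""
    max_overlap = min(n_in_set, n_list)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_list - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 background genes, a 200-gene set,
# a 300-gene cluster, and 15 genes shared between them.
p_value = hypergeom_enrichment_p(20000, 200, 300, 15)
```

Because many gene sets are tested at once, the resulting p-values need multiple testing correction, just as in the DEG analysis.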


Tutorial 6: Protein-protein associations with STRING

Top 50 non-ESC genes


Protein-protein association network among top 50 non-ESC genes

Evidence types


Top 100 non-ESC genes


Example workflow

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning


Gene set enrichment in a ranked list

• Gene Set Enrichment Analysis (GSEA)

• “Threshold-less”
  – Arbitrary DEG cutoffs are avoided

• Two modes of operation to rank input genes:
  – Rank by differential expression between phenotypes
    • Default metric is signal to noise, defined as:
      (avg[class 1] – avg[class 2]) / (sd[class 1] + sd[class 2])
  – Pre-ranked according to user-defined criteria

• Evaluates statistical bias in the distribution of each defined gene set over the list of ranked input genes.
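The ranking metric and the notion of "bias in the distribution" can be sketched together. The code below ranks genes by the signal-to-noise formula above and then computes a simplified, unweighted running-sum score for a gene set. The real GSEA software uses a weighted statistic and permutation-based significance, so this is an illustration of the idea only. Gene names, values, and function names are hypothetical.

```python
import statistics


def signal_to_noise(class1, class2):
    """(avg[class 1] - avg[class 2]) / (sd[class 1] + sd[class 2])."""
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2)
    )


def rank_genes(expression, n_class1):
    """Return gene names sorted from highest to lowest signal to noise."""
    scores = {
        gene: signal_to_noise(vals[:n_class1], vals[n_class1:])
        for gene, vals in expression.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running sum: step up at gene set members, step down
    otherwise, and return the largest deviation from zero.
    Assumes the set contains some, but not all, of the ranked genes."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy data: 3 samples per class, values made up for the example.
expression = {
    "G1": [10.0, 11.0, 10.5, 2.0, 3.0, 2.5],
    "G2": [9.0, 9.5, 10.0, 4.0, 3.5, 4.5],
    "G3": [5.0, 5.5, 4.5, 5.0, 4.8, 5.3],
    "G4": [2.0, 3.0, 2.5, 9.0, 10.0, 9.5],
}
ranked = rank_genes(expression, n_class1=3)
es_top = enrichment_score(ranked, {"G1", "G2"})
es_bottom = enrichment_score(ranked, {"G4"})
```

A gene set concentrated at the top of the ranked list yields a score near +1, and one concentrated at the bottom yields a score near −1.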


Input format

• Input:
  – Expression data (gene cluster text file format, *.gct)
  – Classes (*.cls)

• File extensions matter!
  – If using Notepad in Windows, set “Save as type:” to “All Files (*.*)”
  – If using TextEdit on a Mac, go to Preferences > Open and Save and uncheck ‘Add “.txt” extension to plain text files’.

• Formatting instructions: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
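As a rough guide to what the two files contain, the sketch below builds minimal .gct and .cls text from in-memory data. The layout follows my reading of the format documentation linked above (a "#1.2" version line, a dimensions line, and a header row for .gct; a counts line, a class-name line, and an assignment line for .cls), so verify details against that page before relying on it. Function names and sample data are hypothetical.

```python
def to_gct(gene_names, sample_names, matrix, descriptions=None):
    """Build tab-delimited GCT text (version 1.2 layout)."""
    if descriptions is None:
        descriptions = ["na"] * len(gene_names)
    lines = [
        "#1.2",
        f"{len(gene_names)}\t{len(sample_names)}",
        "\t".join(["NAME", "Description"] + list(sample_names)),
    ]
    for name, desc, row in zip(gene_names, descriptions, matrix):
        lines.append("\t".join([name, desc] + [str(v) for v in row]))
    return "\n".join(lines) + "\n"


def to_cls(labels):
    """Build CLS text for categorical class labels, one label per sample."""
    classes = []
    for label in labels:
        if label not in classes:
            classes.append(label)  # keep first-seen order
    header = f"{len(labels)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    assignments = " ".join(str(classes.index(label)) for label in labels)
    return "\n".join([header, names, assignments]) + "\n"


# Hypothetical two-gene, three-sample example.
gct_text = to_gct(["GENE_A", "GENE_B"], ["s1", "s2", "s3"], [[1.5, 2.0, 8.1], [7.2, 7.0, 3.3]])
cls_text = to_cls(["ESC", "ESC", "nonESC"])
```

The sample order in the .cls assignment line must match the column order in the .gct file, and the text should be saved with the exact .gct and .cls extensions.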


Tutorial 7: GSEA

Load data Run GSEA

Input files: GSEA_example_expression_large.gct, GSEA_example_classes.cls


Output


Summary

Gene expression profiling

DEGs

Cluster analysis

GSEA

Functional enrichment

Network analysis

Biological meaning