Microarrays IST 444. Microarrays What if no test tubes were needed to conduct an experiment? Hundreds, thousands or even millions of individual experiments

Microarrays

IST 444

Microarrays

• What if no test tubes were needed to conduct an experiment?

• Hundreds, thousands or even millions of individual experiments are conducted in parallel, with very few reagents

• Microarrays combine genomics (study of all the genes in the genome) with experiments and eventually, diagnostics

Microarrays:Universal Biochemistry Platforms

PeptidesPeptides ProteinsProteins

Carbohydrates

LipidsLipids

Small moleculesSmall molecules

DNADNA

Types of microarrays include:

• DNA microarrays, such as cDNA microarrays and oligonucleotide microarrays, SNPs, CHiP

• MMChips, for surveillance of microRNA populations

• Protein microarrays (protein-protein interactions)• Tissue microarrays • Cellular microarrays (also called transfection

microarrays • Chemical compound microarrays • Antibody microarrays (proteomics)• Carbohydrate arrays (glycoarrays)

Microarrays Require Bioinformatics

• Microarrays combine genomics, silicon chip manufacturing, DNA and Protein chemistry, signal and image processing, statistics, software skills and miniaturized versions of traditional molecular biology experiments.

• Develop new software to analyze the results of the many possible experiments.

Biological Samples in 2D Arrays on Membranes or Glass Slides

DNA Microarray Technologies• Gene expression profiling

- Monitoring expression levels for thousands of genes simultaneously.

• Three other common applications:• Array CGH (Comparative genomic hybridization)

- Assessing genome content in different cells or closely related organisms.

• SNP array (single nucleotide polymorphism) - Identifying single nucleotide polymorphism among alleles within or between populations.

• ChIP-on-chip (Chromatin immunoprecipitation) - Determining protein binding site occupancy throughout the genome.

• Methylation arrays (immonoprecipitate methylated DNA_-Determining which regions of DNA are methylated to determine epigenetics

History of DNA Microarrays

• Microarrays descend from Southern and Northern blotting. Unknown DNA is transferred to a membrane and then probed with a known DNA sequence with a label

• In Microarrays, the known DNA sequence (or probe) is on the membrane while the unknown labeled DNA (or target) is hybridized and then washed off so only specific hybrids remain.

• Dot Blots of different genes in an array were used to assay gene expression as early as 1987.

• Complete genome of all Saccharomyces cerevisiae ORFs on a microarray published in 1997 by Lashkari et al. http://www.pnas.org/content/94/24/13057.

• www.bio.davidson.edu/Courses/genomics/chip/chip.html

RNA Transcription as a Measure of Gene Expression

www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm

Transcription factors bind to the promoter and bind RNA polymerase

DNA strands separate and transcription is initiated

Transcription continues in the 3'-5' direction until the stop codons are reached

The completed RNA strand is released for post-processing

What Can be Measured using Microarrays?

1. Amount of mRNA expressed by a gene.gene expression array, exon array, tiling array

2. Amount of mRNA expressed by an exon. exon array, tiling array

3. Amount of RNA expressed by a region of DNA. tiling array

4. Which strand of DNA is expressed. exon array, tiling array

5. Which of several similar DNA sequences is present in the genome.SNP array

6. How many copies of a gene is present in the genome.gene expression array, exon array, tiling array

7. Where a known protein has bound to the DNA. (ChIP on chip)promoter array, tiling array

Types of Microarrays

Exon 1 Exon 2 Exon 3UTR UTR

A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available.

oligo

exon exon exon

cDNA

chromosome sequence

CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA

tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile

promoter

CCGTTCACA AAGGCCGTT

CCGTGCACA AAGGACGTTSNP

cDNA sequence

Spotted vs. in situ synthesized arrays

• The DNAs can be chemically synthesized or made by PCR and then mechanically spotted on the array. – The amount spotted can vary. – Method is more flexible and less expensive.

• The DNA can be chemically synthesized directly on the array (Affymetrix). – This can be more consistent– Shorter pieces are used.

Format of an Affymetrix Array

http://cnx.rice.edu/content/m12388/latest/figE.JPG

Print Technology

1. The cDNA or oligo can be printed on the slide using an "arraying robot" which deposits a drop of liquid containing the material at each spot. (gene expression only)40,000+ spots

2. Oligos (all the same length) can be synthesized on the slide using:i) inkjet technologyii) photolithography1,000,000+ spots

3. There are other technologies that give similar types of results (e.g. "beads").

Spotted 2-Channel Array

http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg

Spotted arrays are printed on coated microscope slides.

2 RNA samples are converted to cDNA. Each is labelled with a different dye.

SNP Analysis Using the Illumina, Inc.

GoldenGate™ Assay

• Allele Specific Extension and Ligation

• PCR Amplification

• Hybridization to the Universal Sentrix® Array Matrix

AG

illumiCode’ AddressAllele Specific Extension &Ligation

Universal PCR Sequence 1

Universal PCR Sequence 2

Universal PCR Sequence 3’

Allele Specific Extension and Ligation

Genomic DNA [T/C] Ligase [T/A]Polymerase

Custom Oligo Pool All (OPA)

96-1,536 SNPs multiplexed

Total oligos in reaction – 288-4,608

A illumiCode #561Amplification Template

PCR with Common Primers

PCR Amplification

Cy3 Universal Primer 1

Cy5 Universal Primer 2

Universal Primer P3

/\/\/\

/

/\/\/\

/

/\/\/\

/

illumiCode #561

illumiCode #217

illumiCode #1024

Hybridization to Sentrix® Array Matrix

/\/\/\

/

/\/\/\

/

A/A G/G C/T

SNP #561 SNP #217 SNP #1024

Sentrix® Array Matrix

10 m1.5 mm 400 m

The Illumina BeadStation 500G permits high throughput analysis of thousands of SNP DNA markers in hundreds of genotypes in less than one week.

Gene Expression Microarrays

Two-color fluorescent scan of a yeast microarray containing 2,479 elements (ORFs). Red and Green probes interact with a single target. Yellow probes interact with

both targets and empty probes with neither target.

Lashkari D A et al. PNAS 1997;94:13057-13062©1997 by The National Academy of Sciences of the USA

Tiling Array

• Genome array consisting of overlapping probes

• Finer Resolution

• Better at finding RNA in the cell– mRNA

• Alternative splicing• Not Polyadenylated

– microRNA

Tiling Arrays

http://en.wikipedia.org/

Tiling Array

http://en.wikipedia.org/

© Affymetrix Inc.

SNP Array

* Most common genetic variation in human genome. Occur about every one thousand base pairs in genome

* Genome-wide SNP maps now available (millions in database)

Single Nucleotide Polymorphisms (SNPs)

Affymetrix Standard Tiling

C-T-C-C-A-A-A-A-A-A-A-T-T-T-C-A-T-T-C-T

C-T-C-C-A-A-A-A-A-A-C-T-T-T-C-A-T-T-C-T

C-T-C-C-A-A-A-A-A-A-G-T-T-T-C-A-T-T-C-T

C-T-C-C-A-A-A-A-A-A-T-T-T-T-C-A-T-T-C-T

Substitution position

ChIP-on-chip array(Chromatin

ImmunoPrecipitation )

Antibody

ChIP-on-chip array

ChIP-on-chip array

3. ChIP-on-chip array

Bioinformatic approaches for analysis

• Measuring 10000s of data points simultaneously

• High dimensional data– 10 Exp x 50K = 500K

• How to find real differences over the noise

• Statistical approaches

Tumor Normal

Bioinformatic approaches for analysis

• Class Comparison– Which genes are up or

down in tumors v normal, untreated v treated

• Class Discovery– Within the tumor

samples, are there subgroups that have a specific expression profile?

• Class prediction, pathway analysis etc

Tumor Normal

Challenges in microarray analysis

• Different platforms– Ilumina, Affymetrix, Agilent….

• Many file types, many data formats• Need to learn platform dependent methods and

software required

Public databases• Many sources for public

data – labs, consortia, government

• Publications require that data files including raw files be made public

• GEO –http://www.ncbi.nlm.nih.gov/geo/

• Array Express - http://www.ebi.ac.uk/arrayexpress/#ae-main[0]

Streamlined Analysis

Normalize

normal tumor tumor normal normal tumorID_REF VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL

AFFX-BioB-5_at 210.6 P 234.6 P 362.5 P 389 P 305.6 P 330.5 PAFFX-BioB-M_at 393 P 327.8 P 501.4 P 816.5 P 542 P 440.8 PAFFX-BioB-3_at 264.9 P 164.6 P 244.7 P 379.7 P 261.3 P 303.7 PAFFX-BioC-5_at 738.6 P 676.1 P 737.6 P 1191.2 P 917 P 767.9 PAFFX-BioC-3_at 356.3 P 365.9 P 423.4 P 711.6 P 560.3 P 484.9 PAFFX-BioDn-5_at 566.3 P 442.2 P 649.7 P 834.3 P 599.1 P 606.9 PAFFX-BioDn-3_at 3911.8 P 3703.7 P 4680.9 P 6037.7 P 4653.7 P 4232 PAFFX-CreX-5_at 6433.3 P 5980 P 7734.7 P 10591 P 8162.1 P 8428 PAFFX-CreX-3_at 11917.8 P 9376.7 P 11509.3 P 16814.4 P 13861.8 P 13653.4 PAFFX-DapX-5_at 12.2 A 44.3 M 31.2 A 37.7 P 33.3 A 12.8 AAFFX-DapX-M_at 57.8 M 42.5 A 79 M 48.8 P 39.5 A 39.2 AAFFX-DapX-3_at 29.8 A 6.2 A 23.4 A 28.4 A 3.2 A 7.6 AAFFX-LysX-5_at 15.3 A 16.2 A 15.6 A 16.7 A 3.1 A 3.9 AAFFX-LysX-M_at 33.2 A 12 A 17.7 A 37.3 A 49.2 A 9.1 AAFFX-LysX-3_at 40.7 M 10.7 A 36.2 A 22.1 A 22.8 A 28.2 AAFFX-PheX-5_at 7.8 A 3 A 7.6 A 5.6 A 5 A 6.4 AAFFX-PheX-M_at 4.2 A 4.8 A 6.8 A 6.1 A 3.7 A 5.5 AAFFX-PheX-3_at 54.2 A 39.6 A 19.4 A 16.1 A 44.7 A 31.2 AAFFX-ThrX-5_at 8.2 A 11.2 A 13.2 A 9.5 A 8.5 A 7.5 AAFFX-ThrX-M_at 38.1 A 30.6 A 37.6 A 7.2 A 26.9 A 36.3 AAFFX-ThrX-3_at 15.2 A 5 A 15 A 8.3 A 36.8 A 11.5 AAFFX-TrpnX-5_at 11.2 A 11.8 A 22.2 A 22.1 A 8.9 A 35.6 AAFFX-TrpnX-M_at 9 A 8.1 A 9.1 A 8.7 A 8.1 A 12 AAFFX-TrpnX-3_at 19.8 A 12.8 A 11.8 A 43.2 M 17.4 A 10 AAFFX-HUMISGF3A/M97935_5_at 82.7 P 120.7 P 92.7 P 46.4 P 55.9 P 46.5 PAFFX-HUMISGF3A/M97935_MA_at 397.6 P 416.7 P 244.8 A 181.4 A 197.5 A 192.3 AAFFX-HUMISGF3A/M97935_MB_at 206.2 P 303 P 300.8 P 253.5 P 195.3 P 216 PAFFX-HUMISGF3A/M97935_3_at 663.8 P 723.9 P 812.1 P 666.1 P 629.4 P 754.1 PAFFX-HUMRGE/M10098_5_at 547.6 P 405.9 P 6894.7 P 3496.1 P 1958.5 P 5799.4 PAFFX-HUMRGE/M10098_M_at 239.1 P 175.8 P 3675 P 1348.6 P 695.9 P 2428.2 PAFFX-HUMRGE/M10098_3_at 1236.4 P 721.4 P 9076.1 P 7795.9 P 4237.1 P 7890 PAFFX-HUMGAPDH/M33197_5_at 19508 P 19267.1 P 22892 P 26584 P 29666.6 P 25038.1 PAFFX-HUMGAPDH/M33197_M_at 18996.6 P 20610.4 P 21573.7 P 29936 P 30106.6 P 22380.2 PAFFX-HUMGAPDH/M33197_3_at 18016.4 P 17463.8 P 20921.3 P 26908.3 P 28382.2 P 21885 PAFFX-HSAC07/X00351_5_at 23294.6 P 21783.7 P 18423.3 P 21858.9 P 23517.1 P 19450.3 PAFFX-HSAC07/X00351_M_at 25373.1 P 24922.8 P 22384.2 P 25760.2 P 27718.5 P 21401.6 PAFFX-HSAC07/X00351_3_at 20032.8 P 20251.1 P 20961.7 P 23494.6 P 23381.2 P 21173.3 P

Raw data Filter

ClassificationSignificance Clustering

Gene lists

Function(Genome Ontology)

(RMA)

•Present/Absent•Minimum value•Fold change

•t-test•SAM•Rank Product

•PAM•Machine learning

Microarray experiments

Obtain sequence info

select oligos

Print microarray

Print or buy the microarray



select oligos

Print microarray

Print or buy the microarray

sequencing errorassembly errorcontamination

uniquesimilar hybridization rates



select oligos

Print microarray

obtain tissue sample

extract RNA

extract mRNA

label

normalize mRNA

Print or buy the microarray Create the labeled samples



select oligos

Print microarray


extract RNA

extract mRNA

label

normalize mRNA


experimental design-number of biological replicates-technical replicatesblockssample pooling



select oligos

Print microarray


extract RNA

extract mRNA

label

normalize mRNA


hybridize



select oligos

Print microarray


extract RNA

extract mRNA

label

normalize mRNA

Print or buy the microrray Create the labeled samples

hybridize

hybridization design (multichannel)


hybridize scan

detect spots

compute spot summary

detect background

detect bad spots

process image

remove array specific noise


hybridize scan

detect spots

compute spot summary

detect background

detect bad spots

spot detection software

pixel mean, median ...background correction

detection limitbackground > foregroundbadly printed spotsflaws

process image using multiple scans

remove array specific noise normalization

Raw data are not mRNA concentrations

• tissue contamination• RNA degradation• amplification efficiency• reverse transcription

efficiency• Hybridization efficiency and

specificity• clone identification and

mapping• PCR yield, contamination

• spotting efficiency

• DNA support binding

• other array manufacturing

related issues

• image segmentation

• signal quantification

• “background” correction

Quality control: Noise and reliable signal

Arrays 1 ... n

Array level Gene levelProbe level

Probe level: quality of the expression measurement of one spot on one particular array

Array level: quality of the expression measurement on one particular glass slide

Gene level: quality of the expression measurement of one probe across all arrays

Probe-level quality control

• Individual spots printed on the slide• Sources:

– faulty printing, uneven distribution, contamination with debris, magnitude of signal relative to noise, poorly measured spots;

• Visual inspection:– hairs, dust, scratches, air bubbles, dark regions, regions with haze

• Spot quality:– Brightness: foreground/background ratio– Uniformity: variation in pixel intensities and ratios of intensities within

a spot– Morphology: area, perimeter, circularity.– Spot Size: number of foreground pixels

• Action:– set measurements to NA (missing values)– local normalization procedures which account for regional

idiosyncrasies.– use weights for measurements to indicate reliability in later analysis.

Spot IdentificationIndividual spots are recognized, size and shape might be

adjusted per spot (automatically fine adjustments by hand).

Additional manual flagging of bad (X) or non-present (NA) spots

poor spot quality

good spot quality

Different Spot identification methods: Fixed circles, circles with variable size, arbitrary spot shape (morphological opening)

NA

X

Spot Identification

Histogram of pixel intensities of a single spot

• The signal of the spots is quantified.

„Donuts“

Mean / Median / Mode / 75% quantile

Rafael A Irizarry,Department of

Biostatistics [email protected]

http://www.biostat.jhsph.edu/~ririzarrhttp://www.biocon

ductor.org

nci 2002

Spot Detection

Adaptive segmentation Fixed circle segmentation

---- GenePix

---- QuantArray

---- ScanAnalyze

Spot uses morphological opening

Microarray Analysis – Data Preprocessing

• Objective– Convert image of thousands of

signals to a a signal value for each gene or probe set

• Multiple step– Image analysis– Background and noise

subtraction– Normalization– Expression value for a gene or

probe set

• Image analysis and background noise usually done by proprietary software

Gene 1 100Gene 2 150Gene 3 75.Gene10000 500

Array Level Quality Control

• Problems:– array fabrication defect– problem with RNA extraction– failed labeling reaction– poor hybridization conditions– faulty scanner

• Quality measures:– Percentage of spots with no signal (~30% excluded spots) – Range of intensities– (Av. Foreground)/(Av. Background) > 3 in both channels– Distribution of spot signal area– Amount of adjustment needed: signals have to substantially

changed to make slides comparable.

Gene-level Quality Control

Gene g• Poor hybridization in the

reference channel may introduce bias on the fold-change

• Some probes will not hybridize well

to the target RNA

• Printing problems: such that all

spots of a given inventory well

have poor quality.

•A well may be of bad quality – contamination

•Genes with a consistently low signal in the reference channel

are suspicious

Normalization• Corrects for variation in

hybridization etc• Assumption that no

global change in gene expression

• Without normalization– Intensity value for gene

will be lower on Chip B– Many genes will appear

to be downregulated when in reality they are not

Gene 1 100Gene 2 150Gene 3 75Gene10000 500

507532250

Treated Control

Gene

mRNA Samples

gene-expression level or ratio for gene i in mRNA sample j

M =Log2(red intensity / green intensity)

Function (PM, MM) of MAS, dchip or RMA

sample1 sample2 sample3 sample4 sample5 …

1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...

A =average: log2(red intensity), log2(green intensity)

Function (PM, MM) of MAS, dchip or RMA

Gene expression data

Data Data (log scale)

Scatterplot of Data

Message: look at your data on log-scale!

Use a log transformation of the ratio data:

Scatter plot of all genes in a simple comparison of two control (A) and two treatments (B: high vs. low glucose) showing changes in expression greater than 2.2 and 3 fold (lines).

X Axis is Average spot intensity on a log scale and Y Axis is specific spot intensity.

Statistical power• t test

– Test hypothesis that the two means are not statistically different

– Adding “confidence” to the fold change value

• Mean• Standard deviation• Sample size• Calculates statistic• You choose cutoff or

threshold– Give me gene list at a cutoff

of p <0.05» 95% confidence that

the mean for that gene between control are treated are different

Experimental Design – Very important!!!

• Sample size– How many samples in test and

control• Will depend on many factors

such as whether tissue culture or tissue sample

• Power analysis

• Replicates– Technical v biological

• Biological replicates is more important for more heterogeneous samples Need replicates for statistical analysis

• To pool or not to pool– Depends on objective

• Sample acquisition or extraction– Laser captured or gross

dissected

• All experimental steps from sample acquisition to hybridization– Microarray experiments are

very expensive. So, plan experiments carefully

t tests• Results might look like

– At a p<0.05, there are 300 genes up and 200 genes down regulated• 95% confidence that the

means of these genes in the two groups is different

– At a p < 0.05, x genes up and y genes down with a fold change of at least 3.0

Multiple Comparisons

• In a microarray experiment, each gene (each probe or probe set) is really a separate experiment

• Yet if you treat each gene as an independent comparison, you will always find some with significant differences– (the tails of a normal distribution)

Multiple Comparisons

• Microarrays have multiple comparison problem• p <= 0.05 says that 95% confidence means are

different; therefore 5% due to chance• 5% of 10000 is 500

– 500 genes are picked up by chance– Suppose t tests selects 1000 genes at a p of 0.05– 500/1000 ;Approximately 50% of the genes will be

false– Very high false discovery rate; need more confidence– How to correct? – Correction for multiple comparison– p value and a corrected p value

Corrections for multiple comparisons

• Involve corrections to the p value so that the actual p value is higher

• Bonferroni http://en.wikipedia.org/wiki/Bonferroni_correction

• Benjamin-Hochberg

• Significance Analysis of Microarrays– Tusher et al. at Stanford

Gene Expression Microarray experiments

obtain numerical summary for each gene or exon on each array

sampleclassification

clustering genesand samples

differential expression analysis

Gene Expression Microarray experiments

obtain numerical summary for each gene or exon on each array

sampleclassification

clustering genesand samples

t-tests, ANOVABayesian versions of aboveFourier analysis of time seriesFalse discovery and nondiscovery rates

differential expression analysis

robust methods to down weight outliersdata imputation (filling in missing data)

discriminant analysissupport vector machinessupervised learning

unsupervised learninghierarchical clusteringk-means clusteringheatmaps

From Data to Knowledge

Gene

mRNA Samples

sample1 sample2 sample3 sample4 sample5 …

1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...

Once data is of high quality and systematic, non-biological effects are removed, the result is a gene expression matrix

This is still just data, not knowledge.Use this data to answer a scientific question.

Supervised Analysis

Learning from examples, classification– We have already seen groups of healthy and

sick people. Now let’s diagnose the next person walking into the hospital.

– We know that these genes have function X (and these others don’t). Let’s find more genes with function X.

– We know many gene-pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs.

Known structure in the data needs to be generalized to new data.

Unsupervised analysis

= clustering– Are there groups of genes that behave

similarly in all conditions?– Disease X is very heterogeneous. Can we

identify more specific sub-classes for more targeted treatment?

No structure is known. We first need to find it. Exploratory analysis.

Supervised analysis

Calvin, I still don’t know the difference between cats and dogs …Oh, now I get it!!

Don’t worry!I’ll show you once more:

Class 1: cats Class 2: dogs

Unsupervised analysis

Calvin, I still don’t know the difference between cats and dogs …

I don’t know it either.

Let’s try to figure it out together …

Supervised analysis: setup

• Training set– Data: microarrays– Labels: for each one we know if it falls into our

class of interest or not (binary classification)

• New data (test data)– Data for which we don’t have labels. – These are genes without known function

• Goal: Generalization ability– Build a classifier from the training data that is

good at predicting the right class for the new data.

One microarray, one dotExp

ress

ion

of g

en

e 2

Expression of gene 1

Think of a space with #genes dimensions (yes, it’s hard for more than 3).

Each microarray corresponds to a point in this space.

If gene expression is similar under some conditions, the points will be close to each other.

If gene expression overall is very different, the points will be far away.

A heatmap

samples of different regions of the brain in humans and chimpanzees

sample clusters show that different regions of the brain cluster more closely than different species

gene clusters show that some genes differentiateamong brain regions while other differentiate the 2 species.

Heat Map: 2-D Cluster Analysis

Genome Wide Association Studies

• GWAS involves rapidly scanning markers across genome (≈0.5M or 1M) of many people (≈2K) to find genetic variations associated with a particular disease.

• A large number of subjects are needed because (1)associations between SNPs and causal variants are expected to show low odds ratios, typically below 1.5 (2)In order to obtain a reliable signal, given the very large number of tests that are required, associations must show a high level of significance to survive the multiple testing correction

• Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases


• GWAS involves rapidly scanning markers across genome (≈0.5M or 1M) of many people (≈2K) to find genetic variations associated with a particular disease.

• A large number of subjects are needed because– associations between SNPs and causal variants are

expected to show low odds ratios, typically below 1.5

– (2)In order to obtain a reliable signal, given the very large number of tests that are required, associations must show a high level of significance to survive the multiple testing correction

• Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases like Autism.

Look for Association with Diseases and SNPs

Many issues with data- includingPopulation diversities and need for T-test corrections and log transformationsbecause of many variables and fewer Samples (Bonferoni)

A recurrent mutation in the BMP type I receptor ACVR1 causes inherited and sporadic fibrodysplasia ossificans progressivaEileen M. Shore, Meiqi Xu, George J. Feldman, David A. Fenstermacher, The FOP International Research Consortium, Matthew A. Brown, and Frederick S. Kaplan Nature Genetics 2006

Collect 13 individuals from five families with FOP ectopic bone formation

Genome-wide linkage analysis with 400 microsatellite markers

Higher resolution linkage analysis with Affymetrix 10K SNP mapping

chip (in Facility)

Candidate gene sequencing identifies a new SNP in BMP

receptor

Predictive Value of Gene Expression

• Lymphoma dataset– This dataset is the gene expression in the

three most prevalent adult lymphoid malignancies: B-CLL,FL and DLBCL.

– This study produced gene expression data for p=4,682 genes in n=81 mRNA samples.

29 × B-CLL 9 × FL43 × DLBCL

http://genome-www.stanford.edu/lymphoma

Correlation Matrix

Personal Genomics

• Many companies are marketing SNP chips as useful- promising future information as it becomes available.– deCODEme.com – Navigenics– 23andMe – Knome– http://thepersonalgenome.com/

• Who will tell these people what the data means?

• http://www.nytimes.com/2009/01/11/magazine/11Genome-t.html?pagewanted=all

IN-DELS and CNVs

• Insertions, Deletions and Copy Number Variation

• Clone-based comparative genomic hybridization (Array CGH)– Test and reference DNA are differentially fluorescent

labeled and hybridized to the array.– cons: low resolution (Cannot find small CNV region)

• SNP genotyping array– pros: Higher resolution– Cons: poor signal-to-noise ratio of hybridization

Hidden Markov Model designed for high resolution CNV detection in

whole genome SNP genotyping data

CNV Analysis of Array Data

• Log R ratio (LRR): total fluorescent intensity signals from both sets of probe/allele at each SNP

• B Allelle Frequence (BAF) : relative ratio of the intensity signals between two probes/allele at each SNP

• Accurate model for log R ratio and B Allele Frequency• + Population allele frequency + distance between

adjacent SNPs + family information

CNV Data Analysis


• GWAS– http://grants.nih.gov/grants/gwas/– http://www.nature.com/scitable/topicpage/

Genetic-Variation-and-Disease-GWAS-682

• Personal Genomes– http://www.nytimes.com/2009/01/11/

magazine/11Genome-t.html?pagewanted=all

Functional Genomics

• Take a list of "interesting" genes and find their biological relationships

– Gene lists may come from significance/classfication analysis of microarrays, proteomics, or other high-throughput methods

• Requires a reference set of "biological knowledge"

Genome Ontology

• 3 hierarchical sets of terminology– Biological Process– Cellular Component (location within cell)– Molecular Function

• about 1000 categories of functions

Biological Pathways

Microarray Databases

• Large experiments may have hundreds of individual array hybridizations

• Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments

• Data-mining - look for similar patterns of gene expression across different experiments

• Microarray experiments are complex and this shares data

Using Public Databases

• Gene Expression data is an essential aspect of annotating the genome

• Publication and data exchange for microarray experiments

• Data mining/Meta-studies

• Common data format - XML

• MIAME (Minimal Information About a Microarray Experiment)

Transcriptome:Gene Expression Technologies

• cDNA (EST) libraries

• SAGE

• Microarray

• rt-PCR

• RNA-seq

The Cancer Genome Anatomy Project

• CGAP has collected a large amount of cDNA and related data online

• http://cgap.nci.nih.gov/

• cDNA libraries from various tissues– search for genes– compare expression levels

SAGE

• Serial Analysis of Gene Expression is a technology that sequences very short fragments of mRNA (10 or 17 bp) that have been randomly ligated together

• The short ‘tags’ are assigned to genes and then relative counts for each gene are computed for cDNA libraries from various tissues

SAGE Genie

• SAGE Anatomic Viewer

• SAGE Digital Gene Expression Displayer

• Digital Northern

• SAGE Experiment Viewer

GEO Microarray database at NCBI

• Microarray experiments– Defined arrays– Published results– Also lots of inconclusive experiments– Tools to search for specific genes– Unreliable to search for tissue or disease in

experiment description text

Array Express at EMBL

Antibodies on ArraysMiniature Western Blots

Documents

Microarrays IST 444. Microarrays What if no test tubes were needed to conduct an experiment? Hundreds, thousands or even millions of individual experiments