View
217
Download
0
Tags:
Embed Size (px)
Citation preview
Microarrays
IST 444
Microarrays
• What if no test tubes were needed to conduct an experiment?
• Hundreds, thousands or even millions of individual experiments are conducted in parallel, with very few reagents
• Microarrays combine genomics (study of all the genes in the genome) with experiments and eventually, diagnostics
Microarrays:Universal Biochemistry Platforms
PeptidesPeptides ProteinsProteins
Carbohydrates
LipidsLipids
Small moleculesSmall molecules
DNADNA
Types of microarrays include:
• DNA microarrays, such as cDNA microarrays and oligonucleotide microarrays, SNPs, CHiP
• MMChips, for surveillance of microRNA populations
• Protein microarrays (protein-protein interactions)• Tissue microarrays • Cellular microarrays (also called transfection
microarrays • Chemical compound microarrays • Antibody microarrays (proteomics)• Carbohydrate arrays (glycoarrays)
Microarrays Require Bioinformatics
• Microarrays combine genomics, silicon chip manufacturing, DNA and Protein chemistry, signal and image processing, statistics, software skills and miniaturized versions of traditional molecular biology experiments.
• Develop new software to analyze the results of the many possible experiments.
Biological Samples in 2D Arrays on Membranes or Glass Slides
DNA Microarray Technologies• Gene expression profiling
- Monitoring expression levels for thousands of genes simultaneously.
• Three other common applications:• Array CGH (Comparative genomic hybridization)
- Assessing genome content in different cells or closely related organisms.
• SNP array (single nucleotide polymorphism) - Identifying single nucleotide polymorphism among alleles within or between populations.
• ChIP-on-chip (Chromatin immunoprecipitation) - Determining protein binding site occupancy throughout the genome.
• Methylation arrays (immonoprecipitate methylated DNA_-Determining which regions of DNA are methylated to determine epigenetics
History of DNA Microarrays
• Microarrays descend from Southern and Northern blotting. Unknown DNA is transferred to a membrane and then probed with a known DNA sequence with a label
• In Microarrays, the known DNA sequence (or probe) is on the membrane while the unknown labeled DNA (or target) is hybridized and then washed off so only specific hybrids remain.
• Dot Blots of different genes in an array were used to assay gene expression as early as 1987.
• Complete genome of all Saccharomyces cerevisiae ORFs on a microarray published in 1997 by Lashkari et al. http://www.pnas.org/content/94/24/13057.
• www.bio.davidson.edu/Courses/genomics/chip/chip.html
RNA Transcription as a Measure of Gene Expression
www.csu.edu.au/faculty/health/biomed/subjects/molbol/basic.htm
Transcription factors bind to the promoter and bind RNA polymerase
DNA strands separate and transcription is initiated
Transcription continues in the 3'-5' direction until the stop codons are reached
The completed RNA strand is released for post-processing
What Can be Measured using Microarrays?
1. Amount of mRNA expressed by a gene.gene expression array, exon array, tiling array
2. Amount of mRNA expressed by an exon. exon array, tiling array
3. Amount of RNA expressed by a region of DNA. tiling array
4. Which strand of DNA is expressed. exon array, tiling array
5. Which of several similar DNA sequences is present in the genome.SNP array
6. How many copies of a gene is present in the genome.gene expression array, exon array, tiling array
7. Where a known protein has bound to the DNA. (ChIP on chip)promoter array, tiling array
Types of Microarrays
Exon 1 Exon 2 Exon 3UTR UTR
A cDNA microarray can be made from the unsequenced cDNA library. All the other types require that the sequence be available.
oligo
exon exon exon
cDNA
chromosome sequence
CCGTTCACATTAGGATACCAGTTCAAGGCCGTTCACATTAGGATACCAGTTCAAGGAGGCCGTTCAGTTCACATTA
tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile tile
promoter
CCGTTCACA AAGGCCGTT
CCGTGCACA AAGGACGTTSNP
cDNA sequence
Spotted vs. in situ synthesized arrays
• The DNAs can be chemically synthesized or made by PCR and then mechanically spotted on the array. – The amount spotted can vary. – Method is more flexible and less expensive.
• The DNA can be chemically synthesized directly on the array (Affymetrix). – This can be more consistent– Shorter pieces are used.
Format of an Affymetrix Array
http://cnx.rice.edu/content/m12388/latest/figE.JPG
Print Technology
1. The cDNA or oligo can be printed on the slide using an "arraying robot" which deposits a drop of liquid containing the material at each spot. (gene expression only)40,000+ spots
2. Oligos (all the same length) can be synthesized on the slide using:i) inkjet technologyii) photolithography1,000,000+ spots
3. There are other technologies that give similar types of results (e.g. "beads").
Spotted 2-Channel Array
http://www.anst.uu.se/frgra677/bilder/micro_method_large.jpg
Spotted arrays are printed on coated microscope slides.
2 RNA samples are converted to cDNA. Each is labelled with a different dye.
SNP Analysis Using the Illumina, Inc.
GoldenGate™ Assay
• Allele Specific Extension and Ligation
• PCR Amplification
• Hybridization to the Universal Sentrix® Array Matrix
AG
illumiCode’ AddressAllele Specific Extension &Ligation
Universal PCR Sequence 1
Universal PCR Sequence 2
Universal PCR Sequence 3’
Allele Specific Extension and Ligation
Genomic DNA [T/C] Ligase [T/A]Polymerase
Custom Oligo Pool All (OPA)
96-1,536 SNPs multiplexed
Total oligos in reaction – 288-4,608
A illumiCode #561Amplification Template
PCR with Common Primers
PCR Amplification
Cy3 Universal Primer 1
Cy5 Universal Primer 2
Universal Primer P3
/\/\/\
/
/\/\/\
/
/\/\/\
/
illumiCode #561
illumiCode #217
illumiCode #1024
Hybridization to Sentrix® Array Matrix
/\/\/\
/
/\/\/\
/
A/A G/G C/T
SNP #561 SNP #217 SNP #1024
Sentrix® Array Matrix
10 m1.5 mm 400 m
The Illumina BeadStation 500G permits high throughput analysis of thousands of SNP DNA markers in hundreds of genotypes in less than one week.
Gene Expression Microarrays
Two-color fluorescent scan of a yeast microarray containing 2,479 elements (ORFs). Red and Green probes interact with a single target. Yellow probes interact with
both targets and empty probes with neither target.
Lashkari D A et al. PNAS 1997;94:13057-13062©1997 by The National Academy of Sciences of the USA
Tiling Array
• Genome array consisting of overlapping probes
• Finer Resolution
• Better at finding RNA in the cell– mRNA
• Alternative splicing• Not Polyadenylated
– microRNA
Tiling Arrays
http://en.wikipedia.org/
Tiling Array
http://en.wikipedia.org/
© Affymetrix Inc.
SNP Array
* Most common genetic variation in human genome. Occur about every one thousand base pairs in genome
* Genome-wide SNP maps now available (millions in database)
Single Nucleotide Polymorphisms (SNPs)
Affymetrix Standard Tiling
C-T-C-C-A-A-A-A-A-A-A-T-T-T-C-A-T-T-C-T
C-T-C-C-A-A-A-A-A-A-C-T-T-T-C-A-T-T-C-T
C-T-C-C-A-A-A-A-A-A-G-T-T-T-C-A-T-T-C-T
C-T-C-C-A-A-A-A-A-A-T-T-T-T-C-A-T-T-C-T
Substitution position
ChIP-on-chip array(Chromatin
ImmunoPrecipitation )
Antibody
ChIP-on-chip array
ChIP-on-chip array
3. ChIP-on-chip array
Bioinformatic approaches for analysis
• Measuring 10000s of data points simultaneously
• High dimensional data– 10 Exp x 50K = 500K
• How to find real differences over the noise
• Statistical approaches
Tumor Normal
Bioinformatic approaches for analysis
• Class Comparison– Which genes are up or
down in tumors v normal, untreated v treated
• Class Discovery– Within the tumor
samples, are there subgroups that have a specific expression profile?
• Class prediction, pathway analysis etc
Tumor Normal
Challenges in microarray analysis
• Different platforms– Ilumina, Affymetrix, Agilent….
• Many file types, many data formats• Need to learn platform dependent methods and
software required
Public databases• Many sources for public
data – labs, consortia, government
• Publications require that data files including raw files be made public
• GEO –http://www.ncbi.nlm.nih.gov/geo/
• Array Express - http://www.ebi.ac.uk/arrayexpress/#ae-main[0]
Streamlined Analysis
Normalize
normal tumor tumor normal normal tumorID_REF VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL VALUE ABS_CALL
AFFX-BioB-5_at 210.6 P 234.6 P 362.5 P 389 P 305.6 P 330.5 PAFFX-BioB-M_at 393 P 327.8 P 501.4 P 816.5 P 542 P 440.8 PAFFX-BioB-3_at 264.9 P 164.6 P 244.7 P 379.7 P 261.3 P 303.7 PAFFX-BioC-5_at 738.6 P 676.1 P 737.6 P 1191.2 P 917 P 767.9 PAFFX-BioC-3_at 356.3 P 365.9 P 423.4 P 711.6 P 560.3 P 484.9 PAFFX-BioDn-5_at 566.3 P 442.2 P 649.7 P 834.3 P 599.1 P 606.9 PAFFX-BioDn-3_at 3911.8 P 3703.7 P 4680.9 P 6037.7 P 4653.7 P 4232 PAFFX-CreX-5_at 6433.3 P 5980 P 7734.7 P 10591 P 8162.1 P 8428 PAFFX-CreX-3_at 11917.8 P 9376.7 P 11509.3 P 16814.4 P 13861.8 P 13653.4 PAFFX-DapX-5_at 12.2 A 44.3 M 31.2 A 37.7 P 33.3 A 12.8 AAFFX-DapX-M_at 57.8 M 42.5 A 79 M 48.8 P 39.5 A 39.2 AAFFX-DapX-3_at 29.8 A 6.2 A 23.4 A 28.4 A 3.2 A 7.6 AAFFX-LysX-5_at 15.3 A 16.2 A 15.6 A 16.7 A 3.1 A 3.9 AAFFX-LysX-M_at 33.2 A 12 A 17.7 A 37.3 A 49.2 A 9.1 AAFFX-LysX-3_at 40.7 M 10.7 A 36.2 A 22.1 A 22.8 A 28.2 AAFFX-PheX-5_at 7.8 A 3 A 7.6 A 5.6 A 5 A 6.4 AAFFX-PheX-M_at 4.2 A 4.8 A 6.8 A 6.1 A 3.7 A 5.5 AAFFX-PheX-3_at 54.2 A 39.6 A 19.4 A 16.1 A 44.7 A 31.2 AAFFX-ThrX-5_at 8.2 A 11.2 A 13.2 A 9.5 A 8.5 A 7.5 AAFFX-ThrX-M_at 38.1 A 30.6 A 37.6 A 7.2 A 26.9 A 36.3 AAFFX-ThrX-3_at 15.2 A 5 A 15 A 8.3 A 36.8 A 11.5 AAFFX-TrpnX-5_at 11.2 A 11.8 A 22.2 A 22.1 A 8.9 A 35.6 AAFFX-TrpnX-M_at 9 A 8.1 A 9.1 A 8.7 A 8.1 A 12 AAFFX-TrpnX-3_at 19.8 A 12.8 A 11.8 A 43.2 M 17.4 A 10 AAFFX-HUMISGF3A/M97935_5_at 82.7 P 120.7 P 92.7 P 46.4 P 55.9 P 46.5 PAFFX-HUMISGF3A/M97935_MA_at 397.6 P 416.7 P 244.8 A 181.4 A 197.5 A 192.3 AAFFX-HUMISGF3A/M97935_MB_at 206.2 P 303 P 300.8 P 253.5 P 195.3 P 216 PAFFX-HUMISGF3A/M97935_3_at 663.8 P 723.9 P 812.1 P 666.1 P 629.4 P 754.1 PAFFX-HUMRGE/M10098_5_at 547.6 P 405.9 P 6894.7 P 3496.1 P 1958.5 P 5799.4 PAFFX-HUMRGE/M10098_M_at 239.1 P 175.8 P 3675 P 1348.6 P 695.9 P 2428.2 PAFFX-HUMRGE/M10098_3_at 1236.4 P 721.4 P 9076.1 P 7795.9 P 4237.1 P 7890 PAFFX-HUMGAPDH/M33197_5_at 19508 P 19267.1 P 22892 P 26584 P 29666.6 P 25038.1 PAFFX-HUMGAPDH/M33197_M_at 18996.6 P 20610.4 P 21573.7 P 29936 P 30106.6 P 22380.2 PAFFX-HUMGAPDH/M33197_3_at 18016.4 P 17463.8 P 20921.3 P 26908.3 P 28382.2 P 21885 PAFFX-HSAC07/X00351_5_at 23294.6 P 21783.7 P 18423.3 P 21858.9 P 23517.1 P 19450.3 PAFFX-HSAC07/X00351_M_at 25373.1 P 24922.8 P 22384.2 P 25760.2 P 27718.5 P 21401.6 PAFFX-HSAC07/X00351_3_at 20032.8 P 20251.1 P 20961.7 P 23494.6 P 23381.2 P 21173.3 P
Raw data Filter
ClassificationSignificance Clustering
Gene lists
Function(Genome Ontology)
(RMA)
•Present/Absent•Minimum value•Fold change
•t-test•SAM•Rank Product
•PAM•Machine learning
Microarray experiments
Obtain sequence info
select oligos
Print microarray
Print or buy the microarray
Microarray experiments
Obtain sequence info
select oligos
Print microarray
Print or buy the microarray
sequencing errorassembly errorcontamination
uniquesimilar hybridization rates
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
experimental design-number of biological replicates-technical replicatesblockssample pooling
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microarray Create the labeled samples
hybridize
Microarray experiments
Obtain sequence info
select oligos
Print microarray
obtain tissue sample
extract RNA
extract mRNA
label
normalize mRNA
Print or buy the microrray Create the labeled samples
hybridize
hybridization design (multichannel)
Microarray experiments
hybridize scan
detect spots
compute spot summary
detect background
detect bad spots
process image
remove array specific noise
Microarray experiments
hybridize scan
detect spots
compute spot summary
detect background
detect bad spots
spot detection software
pixel mean, median ...background correction
detection limitbackground > foregroundbadly printed spotsflaws
process image using multiple scans
remove array specific noise normalization
Raw data are not mRNA concentrations
• tissue contamination• RNA degradation• amplification efficiency• reverse transcription
efficiency• Hybridization efficiency and
specificity• clone identification and
mapping• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing
related issues
• image segmentation
• signal quantification
• “background” correction
Quality control: Noise and reliable signal
Arrays 1 ... n
Array level Gene levelProbe level
Probe level: quality of the expression measurement of one spot on one particular array
Array level: quality of the expression measurement on one particular glass slide
Gene level: quality of the expression measurement of one probe across all arrays
Probe-level quality control
• Individual spots printed on the slide• Sources:
– faulty printing, uneven distribution, contamination with debris, magnitude of signal relative to noise, poorly measured spots;
• Visual inspection:– hairs, dust, scratches, air bubbles, dark regions, regions with haze
• Spot quality:– Brightness: foreground/background ratio– Uniformity: variation in pixel intensities and ratios of intensities within
a spot– Morphology: area, perimeter, circularity.– Spot Size: number of foreground pixels
• Action:– set measurements to NA (missing values)– local normalization procedures which account for regional
idiosyncrasies.– use weights for measurements to indicate reliability in later analysis.
Spot IdentificationIndividual spots are recognized, size and shape might be
adjusted per spot (automatically fine adjustments by hand).
Additional manual flagging of bad (X) or non-present (NA) spots
poor spot quality
good spot quality
Different Spot identification methods: Fixed circles, circles with variable size, arbitrary spot shape (morphological opening)
NA
X
Spot Identification
Histogram of pixel intensities of a single spot
• The signal of the spots is quantified.
„Donuts“
Mean / Median / Mode / 75% quantile
Rafael A Irizarry,Department of
Biostatistics [email protected]
http://www.biostat.jhsph.edu/~ririzarrhttp://www.biocon
ductor.org
nci 2002
Spot Detection
Adaptive segmentation Fixed circle segmentation
---- GenePix
---- QuantArray
---- ScanAnalyze
Spot uses morphological opening
Microarray Analysis – Data Preprocessing
• Objective– Convert image of thousands of
signals to a a signal value for each gene or probe set
• Multiple step– Image analysis– Background and noise
subtraction– Normalization– Expression value for a gene or
probe set
• Image analysis and background noise usually done by proprietary software
Gene 1 100Gene 2 150Gene 3 75.Gene10000 500
Array Level Quality Control
• Problems:– array fabrication defect– problem with RNA extraction– failed labeling reaction– poor hybridization conditions– faulty scanner
• Quality measures:– Percentage of spots with no signal (~30% excluded spots) – Range of intensities– (Av. Foreground)/(Av. Background) > 3 in both channels– Distribution of spot signal area– Amount of adjustment needed: signals have to substantially
changed to make slides comparable.
Gene-level Quality Control
Gene g• Poor hybridization in the
reference channel may introduce bias on the fold-change
• Some probes will not hybridize well
to the target RNA
• Printing problems: such that all
spots of a given inventory well
have poor quality.
•A well may be of bad quality – contamination
•Genes with a consistently low signal in the reference channel
are suspicious
Normalization• Corrects for variation in
hybridization etc• Assumption that no
global change in gene expression
• Without normalization– Intensity value for gene
will be lower on Chip B– Many genes will appear
to be downregulated when in reality they are not
Gene 1 100Gene 2 150Gene 3 75Gene10000 500
507532250
Treated Control
Gene
mRNA Samples
gene-expression level or ratio for gene i in mRNA sample j
M =Log2(red intensity / green intensity)
Function (PM, MM) of MAS, dchip or RMA
sample1 sample2 sample3 sample4 sample5 …
1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...
A =average: log2(red intensity), log2(green intensity)
Function (PM, MM) of MAS, dchip or RMA
Gene expression data
Data Data (log scale)
Scatterplot of Data
Message: look at your data on log-scale!
Use a log transformation of the ratio data:
Scatter plot of all genes in a simple comparison of two control (A) and two treatments (B: high vs. low glucose) showing changes in expression greater than 2.2 and 3 fold (lines).
X Axis is Average spot intensity on a log scale and Y Axis is specific spot intensity.
Statistical power• t test
– Test hypothesis that the two means are not statistically different
– Adding “confidence” to the fold change value
• Mean• Standard deviation• Sample size• Calculates statistic• You choose cutoff or
threshold– Give me gene list at a cutoff
of p <0.05» 95% confidence that
the mean for that gene between control are treated are different
Experimental Design – Very important!!!
• Sample size– How many samples in test and
control• Will depend on many factors
such as whether tissue culture or tissue sample
• Power analysis
• Replicates– Technical v biological
• Biological replicates is more important for more heterogeneous samples Need replicates for statistical analysis
• To pool or not to pool– Depends on objective
• Sample acquisition or extraction– Laser captured or gross
dissected
• All experimental steps from sample acquisition to hybridization– Microarray experiments are
very expensive. So, plan experiments carefully
t tests• Results might look like
– At a p<0.05, there are 300 genes up and 200 genes down regulated• 95% confidence that the
means of these genes in the two groups is different
– At a p < 0.05, x genes up and y genes down with a fold change of at least 3.0
Multiple Comparisons
• In a microarray experiment, each gene (each probe or probe set) is really a separate experiment
• Yet if you treat each gene as an independent comparison, you will always find some with significant differences– (the tails of a normal distribution)
Multiple Comparisons
• Microarrays have multiple comparison problem• p <= 0.05 says that 95% confidence means are
different; therefore 5% due to chance• 5% of 10000 is 500
– 500 genes are picked up by chance– Suppose t tests selects 1000 genes at a p of 0.05– 500/1000 ;Approximately 50% of the genes will be
false– Very high false discovery rate; need more confidence– How to correct? – Correction for multiple comparison– p value and a corrected p value
Corrections for multiple comparisons
• Involve corrections to the p value so that the actual p value is higher
• Bonferroni http://en.wikipedia.org/wiki/Bonferroni_correction
• Benjamin-Hochberg
• Significance Analysis of Microarrays– Tusher et al. at Stanford
Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array
sampleclassification
clustering genesand samples
differential expression analysis
Gene Expression Microarray experiments
obtain numerical summary for each gene or exon on each array
sampleclassification
clustering genesand samples
t-tests, ANOVABayesian versions of aboveFourier analysis of time seriesFalse discovery and nondiscovery rates
differential expression analysis
robust methods to down weight outliersdata imputation (filling in missing data)
discriminant analysissupport vector machinessupervised learning
unsupervised learninghierarchical clusteringk-means clusteringheatmaps
From Data to Knowledge
Gene
mRNA Samples
sample1 sample2 sample3 sample4 sample5 …
1 0.46 0.30 0.80 1.51 0.90 ...2 -0.10 0.49 0.24 0.06 0.46 ...3 0.15 0.74 0.04 0.10 0.20 ...4 -0.45 -1.03 -0.79 -0.56 -0.32 ...5 -0.06 1.06 1.35 1.09 -1.09 ...
Once data is of high quality and systematic, non-biological effects are removed, the result is a gene expression matrix
This is still just data, not knowledge.Use this data to answer a scientific question.
Supervised Analysis
Learning from examples, classification– We have already seen groups of healthy and
sick people. Now let’s diagnose the next person walking into the hospital.
– We know that these genes have function X (and these others don’t). Let’s find more genes with function X.
– We know many gene-pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs.
Known structure in the data needs to be generalized to new data.
Unsupervised analysis
= clustering– Are there groups of genes that behave
similarly in all conditions?– Disease X is very heterogeneous. Can we
identify more specific sub-classes for more targeted treatment?
No structure is known. We first need to find it. Exploratory analysis.
Supervised analysis
Calvin, I still don’t know the difference between cats and dogs …Oh, now I get it!!
Don’t worry!I’ll show you once more:
Class 1: cats Class 2: dogs
Unsupervised analysis
Calvin, I still don’t know the difference between cats and dogs …
I don’t know it either.
Let’s try to figure it out together …
Supervised analysis: setup
• Training set– Data: microarrays– Labels: for each one we know if it falls into our
class of interest or not (binary classification)
• New data (test data)– Data for which we don’t have labels. – These are genes without known function
• Goal: Generalization ability– Build a classifier from the training data that is
good at predicting the right class for the new data.
One microarray, one dotExp
ress
ion
of g
en
e 2
Expression of gene 1
Think of a space with #genes dimensions (yes, it’s hard for more than 3).
Each microarray corresponds to a point in this space.
If gene expression is similar under some conditions, the points will be close to each other.
If gene expression overall is very different, the points will be far away.
A heatmap
samples of different regions of the brain in humans and chimpanzees
sample clusters show that different regions of the brain cluster more closely than different species
gene clusters show that some genes differentiateamong brain regions while other differentiate the 2 species.
Heat Map: 2-D Cluster Analysis
Genome Wide Association Studies
• GWAS involves rapidly scanning markers across genome (≈0.5M or 1M) of many people (≈2K) to find genetic variations associated with a particular disease.
• A large number of subjects are needed because (1)associations between SNPs and causal variants are expected to show low odds ratios, typically below 1.5 (2)In order to obtain a reliable signal, given the very large number of tests that are required, associations must show a high level of significance to survive the multiple testing correction
• Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases
Genome Wide Association Studies
• GWAS involves rapidly scanning markers across genome (≈0.5M or 1M) of many people (≈2K) to find genetic variations associated with a particular disease.
• A large number of subjects are needed because– associations between SNPs and causal variants are
expected to show low odds ratios, typically below 1.5
– (2)In order to obtain a reliable signal, given the very large number of tests that are required, associations must show a high level of significance to survive the multiple testing correction
• Such studies are particularly useful in finding genetic variations that contribute to common, complex diseases like Autism.
Look for Association with Diseases and SNPs
Many issues with data- includingPopulation diversities and need for T-test corrections and log transformationsbecause of many variables and fewer Samples (Bonferoni)
A recurrent mutation in the BMP type I receptor ACVR1 causes inherited and sporadic fibrodysplasia ossificans progressivaEileen M. Shore, Meiqi Xu, George J. Feldman, David A. Fenstermacher, The FOP International Research Consortium, Matthew A. Brown, and Frederick S. Kaplan Nature Genetics 2006
Collect 13 individuals from five families with FOP ectopic bone formation
Genome-wide linkage analysis with 400 microsatellite markers
Higher resolution linkage analysis with Affymetrix 10K SNP mapping
chip (in Facility)
Candidate gene sequencing identifies a new SNP in BMP
receptor
Predictive Value of Gene Expression
• Lymphoma dataset– This dataset is the gene expression in the
three most prevalent adult lymphoid malignancies: B-CLL,FL and DLBCL.
– This study produced gene expression data for p=4,682 genes in n=81 mRNA samples.
29 × B-CLL 9 × FL43 × DLBCL
http://genome-www.stanford.edu/lymphoma
Correlation Matrix
Personal Genomics
• Many companies are marketing SNP chips as useful- promising future information as it becomes available.– deCODEme.com – Navigenics– 23andMe – Knome– http://thepersonalgenome.com/
• Who will tell these people what the data means?
• http://www.nytimes.com/2009/01/11/magazine/11Genome-t.html?pagewanted=all
IN-DELS and CNVs
• Insertions, Deletions and Copy Number Variation
• Clone-based comparative genomic hybridization (Array CGH)– Test and reference DNA are differentially fluorescent
labeled and hybridized to the array.– cons: low resolution (Cannot find small CNV region)
• SNP genotyping array– pros: Higher resolution– Cons: poor signal-to-noise ratio of hybridization
Hidden Markov Model designed for high resolution CNV detection in
whole genome SNP genotyping data
CNV Analysis of Array Data
• Log R ratio (LRR): total fluorescent intensity signals from both sets of probe/allele at each SNP
• B Allelle Frequence (BAF) : relative ratio of the intensity signals between two probes/allele at each SNP
• Accurate model for log R ratio and B Allele Frequency• + Population allele frequency + distance between
adjacent SNPs + family information
CNV Data Analysis
Genome Wide Association Studies
• GWAS– http://grants.nih.gov/grants/gwas/– http://www.nature.com/scitable/topicpage/
Genetic-Variation-and-Disease-GWAS-682
• Personal Genomes– http://www.nytimes.com/2009/01/11/
magazine/11Genome-t.html?pagewanted=all
Functional Genomics
• Take a list of "interesting" genes and find their biological relationships
– Gene lists may come from significance/classfication analysis of microarrays, proteomics, or other high-throughput methods
• Requires a reference set of "biological knowledge"
Genome Ontology
• 3 hierarchical sets of terminology– Biological Process– Cellular Component (location within cell)– Molecular Function
• about 1000 categories of functions
Biological Pathways
Microarray Databases
• Large experiments may have hundreds of individual array hybridizations
• Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments
• Data-mining - look for similar patterns of gene expression across different experiments
• Microarray experiments are complex and this shares data
Using Public Databases
• Gene Expression data is an essential aspect of annotating the genome
• Publication and data exchange for microarray experiments
• Data mining/Meta-studies
• Common data format - XML
• MIAME (Minimal Information About a Microarray Experiment)
Transcriptome:Gene Expression Technologies
• cDNA (EST) libraries
• SAGE
• Microarray
• rt-PCR
• RNA-seq
The Cancer Genome Anatomy Project
• CGAP has collected a large amount of cDNA and related data online
• http://cgap.nci.nih.gov/
• cDNA libraries from various tissues– search for genes– compare expression levels
SAGE
• Serial Analysis of Gene Expression is a technology that sequences very short fragments of mRNA (10 or 17 bp) that have been randomly ligated together
• The short ‘tags’ are assigned to genes and then relative counts for each gene are computed for cDNA libraries from various tissues
SAGE Genie
• SAGE Anatomic Viewer
• SAGE Digital Gene Expression Displayer
• Digital Northern
• SAGE Experiment Viewer
GEO Microarray database at NCBI
• Microarray experiments– Defined arrays– Published results– Also lots of inconclusive experiments– Tools to search for specific genes– Unreliable to search for tissue or disease in
experiment description text
Array Express at EMBL
Antibodies on ArraysMiniature Western Blots