Introduction to Affymetrix Microarrays

  • Introduction to Affymetrix Microarrays
    Stem Cell Network Microarray Course, Unit 1
    August 2006

  • Goals
    Review technology & terminology of Affymetrix GeneChips.
    Describe some methods for processing raw data from Affymetrix chips and generating expression values.
    Show relative benefits of each methodology.

  • What is a Microarray?

    Microarray has become a general term; there are many types now:
      DNA microarrays
      Protein microarrays
      Transfection microarrays
      Tissue microarrays
    We'll be discussing cDNA microarrays.

  • What is a DNA Microarray (very generally)?
    A grid of DNA spots (probes) on a substrate, used to detect complementary sequences.
    The DNA spots can be deposited by:
      Piezoelectric (ink-jet style)
      Pen
      Photolithography (Affymetrix)
    The substrate can be plastic, glass, or silicon (Affymetrix).
    RNA/DNA of interest is labelled and hybridizes with the array.
    Hybridization with probes is detected optically.

  • Types of DNA microarrays and their uses
    What is measured depends on the chip design and the laboratory protocol:
      Expression: measure mRNA expression levels (usually polyadenylated mRNA)
      Resequencing: detect changes in genomic regions of interest
      Tiling: tile probes over an entire genome for various applications (novel transcripts, ChIP, epigenetic modifications)
      SNP: detect which known SNPs are in the tested DNA
      ...

  • What do Expression Arrays really measure?
    Gene expression
    mRNA levels in a cell
    mRNA levels averaged over a population of cells in a sample
    Relative mRNA levels averaged over populations of cells in multiple samples
    Relative mRNA hybridization readings averaged over populations of cells in multiple samples
    Some relative mRNA hybridization readings averaged over populations of cells in multiple samples

  • Why "some" & "multiple samples"
    "Some":
    In a comparison of Affymetrix vs spotted arrays, 10% of probesets yielded very different results.
    In the small number of cases in which platforms yielded discrepant results, qRT-PCR generally did not confirm either set of data, suggesting that sequence-specific effects may make expression predictions difficult to make using any technique.*
    It appears that some transcripts just can't be detected accurately by these techniques.
    * "Independence and reproducibility across microarray platforms", Quackenbush et al., Nat Methods. 2005 May;2(5):337-44

  • Why "multiple samples"
    "Multiple samples":
    We can only really depend on between-sample fold change for microarrays, not absolute values or within-sample comparisons (>1.3-2.0 fold change, in general).
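
    The between-sample fold change above can be sketched as follows. This is only an illustration: the function names, the example intensities and the choice of a 2.0-fold cutoff are assumptions, not part of the course material. It treats MAS5-style values as linear and RMA-style values as already log2.

```python
import math

def log2_fold_change(sample_a, sample_b, already_log2=False):
    """Log2 fold change of one probeset between two samples.

    already_log2=True is for values that are reported on a log2 scale
    (as RMA/GCRMA results are); otherwise the values are treated as linear.
    """
    if already_log2:
        return sample_a - sample_b
    return math.log2(sample_a / sample_b)

def exceeds_threshold(log2_fc, min_fold=2.0):
    """True when the change is at least min_fold in either direction.

    min_fold is an assumed cutoff picked from the 1.3-2.0 range.
    """
    return abs(log2_fc) >= math.log2(min_fold)

# Hypothetical values for one probeset in two samples.
linear_fc = log2_fold_change(1000.0, 250.0)                 # linear scale
log_fc = log2_fold_change(9.5, 8.5, already_log2=True)      # log2 scale
print(linear_fc, exceeds_threshold(linear_fc))
print(log_fc, exceeds_threshold(log_fc))
```

    Working on the log2 scale makes up- and down-regulation symmetric: a 2-fold increase is +1 and a 2-fold decrease is -1.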

  • Central Assumption of Gene Expression Microarrays
    The level of a given mRNA is positively correlated with the expression of the associated protein.
    Higher mRNA levels mean higher protein expression; lower mRNA means lower protein expression.
    Other factors:
      Protein degradation, mRNA degradation, polyadenylation, codon preference, translation rates, alternative splicing, translation lag
    This is relatively obvious, but worth emphasizing.

  • Affymetrix Expression Arrays
    http://www.affymetrix.com/technology/ge_analysis/index.affx

  • Affymetrix File Types
    DAT file: Raw (TIFF) optical image of the hybridized chip
    CDF file (Chip Description File): Provided by Affy, describes layout of chip
    CEL file: Processed DAT file (intensity/position values)
    CHP file: Experiment results created from CEL and CDF files
    TXT file: Probeset expression values with annotation (CHP file in text format)
    EXP file: Small text file of experiment details (time, name, etc.)
    RPT file: Generated by Affy software, report of QC info

  • Affymetrix Data Flow
    Hybridized GeneChip -> [Scan Chip] -> DAT file
    DAT file -> [Process Image (GCOS)] -> CEL file
    CEL file + CDF file -> [MAS5 (GCOS)] -> CHP file
    Other files shown in the flow: TXT file, RPT file, EXP file

  • Affymetrix Expression GeneChip Terminology

    A chip consists of a number of probesets.
    Probesets are intended to measure expression for a specific mRNA.
    Each probeset is complementary to a target sequence which is derived from one or more mRNA sequences.
    Probesets consist of 25mer probe pairs selected from the target sequence: one Perfect Match (PM) and one Mismatch (MM) for each chosen target position.
    Each chip has a corresponding Chip Description File (CDF) which (among other things) describes probe locations and probeset groupings on the chip.

  • Choosing probes
    How are target sequences and probes chosen?
      Target sequences are selected from the 3' end of the transcript
      Probes should be unique in genome (unless probesets are intended to cross hybridize)
      Probes should not hybridize to other sequences in fragmented cDNA
      Thermodynamic properties of probes
    See Affymetrix docs for more details:

    http://www.affymetrix.com/support/technical/technotes/hgu133_p2_technote.pdf

  • Affymetrix Probeset Names
    Probeset identifiers beginning with AFFX are Affy internal, not generally used for analysis.
    Suffixes are meaningful, for example:
      _at: hybridizes to unique antisense transcript for this chip
      _s_at: all probes cross hybridize to a specified set of sequences
      _a_at: all probes cross hybridize to a specified gene family
      _x_at: at least some probes cross hybridize with other target sequences for this chip
      _r_at: rules dropped (my favorite!)
      and many more
    See the Affymetrix document "Data Analysis Fundamentals" for details.
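
    As a small illustration of reading these suffixes, here is a sketch that maps a probeset identifier to the meanings listed above. The function name and the IDs other than 1415771_at are hypothetical, and only the suffixes from this slide are covered.

```python
# Longer suffixes are listed first so that "_s_at" is not mistaken for "_at".
SUFFIX_MEANINGS = [
    ("_s_at", "all probes cross hybridize to a specified set of sequences"),
    ("_a_at", "all probes cross hybridize to a specified gene family"),
    ("_x_at", "at least some probes cross hybridize with other target sequences"),
    ("_r_at", "rules dropped"),
    ("_at", "hybridizes to unique antisense transcript"),
]

def describe_probeset(probeset_id):
    """Return a short description based on the AFFX prefix or the suffix."""
    if probeset_id.startswith("AFFX"):
        return "Affymetrix internal probeset"
    for suffix, meaning in SUFFIX_MEANINGS:
        if probeset_id.endswith(suffix):
            return meaning
    return "suffix not covered in this sketch"

# 1415771_at appears later in the deck; the other IDs are made up.
for probeset_id in ["1415771_at", "0000001_s_at", "0000002_x_at", "AFFX-example_at"]:
    print(probeset_id, "->", describe_probeset(probeset_id))
```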

  • Target Sequences and Probes

    Example: 1415771_at
    Description: Mus musculus nucleolin mRNA, complete cds
    LocusLink: AF318184.1 (NT sequence is 2412 bp long)
    Target Sequence is 129 bp long
    11 probe pairs tiling the target sequence
    gagaagtcaaccatccaaaactctgtttgtcaaaggtctgtctgaggataccactgaagagaccttaaaagaatcatttgagggctctgttcgtgcaagaatagtcactgatcgggaaactggttctt

  • Perfect Match and Mismatch

    Target:               tttccagacagactcctatggtgacttctctggaat
    Perfect match:              ctgtctgaggataccactgaagaga
    Mismatch:                   ctgtctgaggattccactgaagaga
    Probe pair = Perfect match + Mismatch
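
    The relationship between the two probes in this pair can be sketched in a few lines: the mismatch probe is the perfect match probe with its middle (13th) base swapped for the complementary base. The function name is hypothetical; the sequences are the ones shown on this slide.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def mismatch_probe(perfect_match, position=13):
    """Return the MM probe: the PM probe with one base complemented.

    position is 1-based; 13 is the middle base of a 25mer.
    """
    index = position - 1
    swapped = COMPLEMENT[perfect_match[index]]
    return perfect_match[:index] + swapped + perfect_match[index + 1:]

pm = "ctgtctgaggataccactgaagaga"
mm = mismatch_probe(pm)
print(pm)
print(mm)
```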

  • Affymetrix Chip Pseudo-image
    *image created using dChip software

  • 1415771_at on MOE430A
    *image created using dChip software

  • 1415771_at on MOE430A
    [Figure labels: PM, MM]
    *Note that PM, MM are always adjacent
    *image created using dChip software

  • 1415771_at on MOE430A
    [Figure labels: probe pair, probeset, PM, MM, intensity]
    *images created using dChip software

  • Intensity to Expression
    Now we have thousands of intensity values associated with probes, grouped into probesets.
    How do you transform intensity to expression values?
    Algorithms:
      MAS5: Affymetrix proprietary method
      RMA/GCRMA: Irizarry, Bolstad
      ...many others
    Often called "normalization"

  • Common elements of different techniques
    All techniques do the following:
      Background adjustment
      Scaling
      Aggregation
    The goal is to remove non-biological elements of the signal.

  • MAS5
    Standard Affymetrix analysis, best documented in: http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
    MAS5 results can't be exactly reproduced based on this document, though the affy package in Bioconductor comes close.
    MAS5 C++ source code was released by Affy under the GPL in 2005.

  • MAS5 Model
    Measured Value = N + P + S
      N = Noise
      P = Probe effects (non-specific hybridization)
      S = Signal

  • MAS5: Background & Noise
    Background
      Divide chip into zones
      Select lowest 2% intensity values
      stdev of those values is zone variability
      Background at any location is the sum of all zones' background, weighted by 1/((distance^2) + fudge factor)
    Noise
      Using same zones as above
      Select lowest 2% background
      stdev of those values is zone noise
      Noise at any location is the sum of all zone noise, weighted as above

    From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
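
    A minimal sketch of the zone idea, under stated assumptions: each zone is represented here by a hypothetical (centre_x, centre_y, value) tuple, the zone background is taken as the mean of the lowest 2% of intensities, the weighted sum is normalised by the total weight, and the fudge factor ("smooth") defaults to 100. These choices follow my reading of the whitepaper and are not a verified reimplementation of MAS5.

```python
import math

def zone_stats(intensities, fraction=0.02):
    """Mean and standard deviation of the lowest `fraction` of a zone's intensities."""
    ordered = sorted(intensities)
    count = max(1, round(len(ordered) * fraction))
    lowest = ordered[:count]
    mean = sum(lowest) / count
    variance = sum((v - mean) ** 2 for v in lowest) / count
    return mean, math.sqrt(variance)

def weighted_zone_value(x, y, zones, smooth=100.0):
    """Distance-weighted average of per-zone values at chip location (x, y).

    zones: list of (centre_x, centre_y, value); weight = 1 / (distance^2 + smooth).
    The same function can be used for background or for noise.
    """
    total_weight = 0.0
    total = 0.0
    for centre_x, centre_y, value in zones:
        weight = 1.0 / ((x - centre_x) ** 2 + (y - centre_y) ** 2 + smooth)
        total_weight += weight
        total += weight * value
    return total / total_weight

# Two hypothetical zones: a dim one nearby and a brighter one far away.
zones = [(0.0, 0.0, 10.0), (1000.0, 1000.0, 50.0)]
print(weighted_zone_value(0.0, 0.0, zones))
print(zone_stats(list(range(100))))
```

    The smoothing term keeps the weight finite at a zone centre, so the background changes gradually across zone boundaries instead of jumping.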

  • MAS5: Adjusted Intensity
    A = Intensity minus background; the final value should be > noise.

    A: adjusted intensity
    I: measured intensity
    b: background
    NoiseFrac: default 0.5 (another fudge factor)

    And the value should always be >= 0.5 (log issues) (fudge factor).
    From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
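
    The adjustment can be sketched as below. The exact form is an assumption based on my reading of the whitepaper: the measured intensity is floored at 0.5 first, the background is subtracted, and the result is not allowed to fall below NoiseFrac times the local noise. The function and parameter names are hypothetical.

```python
def adjusted_intensity(intensity, background, noise, noise_frac=0.5, floor=0.5):
    """Background-adjusted intensity A for one cell.

    intensity:  measured intensity I
    background: local background b
    noise:      local noise estimate
    noise_frac: NoiseFrac (default 0.5)
    floor:      smallest allowed measured intensity (avoids log problems)
    """
    floored = max(intensity, floor)
    return max(floored - background, noise_frac * noise)

print(adjusted_intensity(200.0, 50.0, 10.0))  # well above background
print(adjusted_intensity(52.0, 50.0, 10.0))   # near background: clamped to NoiseFrac * noise
```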

  • MAS5: Ideal Mismatch
    Because sometimes MM > PM
    From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
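
    One way to sketch the ideal mismatch rule, as I understand it from the whitepaper: use MM when it is smaller than PM; otherwise derive a value slightly below PM from the probeset's specific background (SB, a robust average of log2(PM/MM) over the probeset). The defaults for contrast tau (0.03) and scale tau (10) are assumptions taken from memory of that document, and SB is passed in rather than computed here.

```python
def ideal_mismatch(pm, mm, specific_background, contrast_tau=0.03, scale_tau=10.0):
    """Ideal mismatch (IM) for one probe pair; always smaller than pm.

    specific_background: robust average of log2(PM/MM) for the whole probeset.
    """
    if mm < pm:
        # The mismatch is informative: use it directly.
        return mm
    if specific_background > contrast_tau:
        # MM is unusable here, but the probeset as a whole shows PM > MM.
        return pm / 2 ** specific_background
    # The whole probeset is weak: back off to a value just below PM.
    exponent = contrast_tau / (1 + (contrast_tau - specific_background) / scale_tau)
    return pm / 2 ** exponent

print(ideal_mismatch(100.0, 40.0, 1.0))
print(ideal_mismatch(100.0, 120.0, 1.0))
print(ideal_mismatch(100.0, 120.0, -0.5))
```

    In all three branches the result stays below PM, so PM minus IM is positive and can be log-transformed.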

  • MAS5: Signal
    Value for each probe:
    Modified mean of probe values:
    Scaling Factor (Sc default 500)
    Tbi = Tukey Biweight (mean estimate, resistant to outliers)
    TrimMean = Mean less top and bottom 2%
    ReportedValue(i) = nf * sf * 2^(SignalLogValue_i)
    Signal (nf = 1)
    From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf
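
    The pieces named on this slide can be sketched together as follows. This is a simplified illustration, not the MAS5 implementation: the one-step Tukey biweight uses assumed tuning constants (c = 5, epsilon = 0.0001), the probe value is taken as log2 of max(PM - IM, 2^-20), and the trimmed mean simply drops whole values from each end without interpolation.

```python
import math
import statistics

def tukey_biweight(values, c=5.0, epsilon=0.0001):
    """One-step Tukey biweight: a mean that down-weights values far from the median."""
    values = list(values)
    centre = statistics.median(values)
    mad = statistics.median([abs(v - centre) for v in values])
    weights = []
    for v in values:
        u = (v - centre) / (c * mad + epsilon)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def probeset_signal_log(pm, ideal_mm, delta=2 ** -20):
    """SignalLogValue: biweight of log2(PM - IM) over the probe pairs of a probeset."""
    probe_values = [math.log2(max(p - im, delta)) for p, im in zip(pm, ideal_mm)]
    return tukey_biweight(probe_values)

def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def scaled_signals(signal_logs, target=500.0, nf=1.0):
    """ReportedValue(i) = nf * sf * 2^(SignalLogValue_i), with sf = target / TrimMean."""
    linear = [2 ** s for s in signal_logs]
    sf = target / trimmed_mean(linear)
    return [nf * sf * v for v in linear]

# Hypothetical probeset with one outlying probe pair.
pm = [120.0, 130.0, 125.0, 5000.0]
im = [20.0, 30.0, 25.0, 30.0]
print(probeset_signal_log(pm, im))
print(scaled_signals([5.0, 6.0, 7.0, 8.0]))
```

    After scaling, the trimmed mean of the reported values on a chip equals the target, which is what makes chips roughly comparable.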

  • MAS5: p-value and calls
    First calculate the discriminant for each probe pair: R = (PM - MM)/(PM + MM)
    A one-sided Wilcoxon signed-rank test is used to compare R vs the tau value and determine the p-value.
    Present/Marginal/Absent calls are thresholded from the p-value above:
      Present: p <= alpha1
      Marginal: alpha1 < p < alpha2
      Absent: p >= alpha2

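
    The call procedure can be sketched with an exact one-sided signed-rank test. The defaults used here (tau = 0.015, alpha1 = 0.04, alpha2 = 0.06) are assumptions, and details such as the handling of zero differences may differ from the Affymetrix software. The function names are hypothetical.

```python
def detection_p_value(pm, mm, tau=0.015):
    """One-sided Wilcoxon signed-rank p-value for discriminant scores R > tau."""
    diffs = [(p - m) / (p + m) - tau for p, m in zip(pm, mm)]
    diffs = [d for d in diffs if d != 0]  # zero differences carry no sign
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank |diffs|; tied values share the average rank (stored doubled, as integers).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_ranks = [0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            doubled_ranks[order[k]] = start + end + 2
        start = end + 1
    observed = sum(r for r, d in zip(doubled_ranks, diffs) if d > 0)
    # Exact null distribution: every rank is positive or negative with probability 1/2.
    total = sum(doubled_ranks)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled_ranks:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[observed:]) / 2 ** n

def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Present / Marginal / Absent call from the detection p-value."""
    if p_value <= alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"

# Hypothetical probeset with 11 probe pairs where PM is clearly above MM.
pm = [100.0 + 10 * i for i in range(11)]
mm = [20.0] * 11
p = detection_p_value(pm, mm)
print(p, detection_call(p))
```
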
  • MAS5: Summary
    Good:
      Usable with single chips (though replicates are preferable)
      Gives a p-value for expression data
    Bad:
      Lots of fudge factors in the algorithm
      Not *exactly* reproducible based upon documentation (source now available)
    Misc:
      Most commonly used processing method for Affy chips
      Highly dependent on Mismatch probes

  • RMA
    Robust Multichip Analysis
    Used with groups of chips (>3); more chips are better
    Assumes all chips have same background, distribution of values: do they?
    Does not use the MM probes, as (PM-MM*) leads to high variance
    This means that half the probes on the chip are excluded, yet it still gives good results!
    Ignoring MM decreases accuracy, increases precision.

  • RMA Model
    From a presentation by Ben Bolstad: http://bioinformatics.ca/workshop_pages/genomics/lectures2004/16

  • RMA Background
    This provides background correction.
    From a presentation by Ben Bolstad: http://bioinformatics.ca/workshop_pages/genomics/lectures2004/16

  • RMA: Quantile Normalization & Scaling
    Fit all the chips to the same distribution.
    Scale the chips so that they have the same mean.

    From a presentation by Ben Bolstad: http://bioinformatics.ca/workshop_pages/genomics/lectures2004/16
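
    Quantile normalization can be sketched in a few lines: sort each chip, average the sorted values across chips to get a reference distribution, then give every probe the reference value for its rank on its own chip. This is a minimal illustration with hypothetical names; ties are broken by position rather than averaged.

```python
def quantile_normalize(chips):
    """Force every chip to share the same distribution of values.

    chips: list of equal-length lists of intensities, one list per chip.
    Returns new lists in the original probe order.
    """
    n_probes = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference distribution: mean of the k-th smallest value across chips.
    reference = [sum(column) / len(chips) for column in zip(*sorted_chips)]
    normalized = []
    for chip in chips:
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

# Two tiny hypothetical chips with three probes each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 9.0]]
print(quantile_normalize(chips))
```

    After this step every chip has identical sorted values, so the chips also share the same mean.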

  • RMA: Estimate Expression
    Assumes that these log-transformed, background-corrected expression values follow a linear model.
    The linear model is estimated using a median polish algorithm.
    Generates a model based on chip, probe and a constant.
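
    A sketch of median polish for one probeset, assuming the input is a small matrix of log2, background-corrected, normalized PM values with probes as rows and chips as columns. The model fitted is value = constant + probe effect + chip effect + residual, and the per-chip expression estimate is the constant plus the chip effect. Names, iteration limit and tolerance are hypothetical choices.

```python
import statistics

def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit value[p][c] = overall + probe_effect[p] + chip_effect[c] + residual[p][c]."""
    n_probes, n_chips = len(matrix), len(matrix[0])
    residual = [list(row) for row in matrix]
    overall = 0.0
    probe_effect = [0.0] * n_probes
    chip_effect = [0.0] * n_chips
    for _ in range(max_iter):
        # Sweep probe (row) medians out of the residuals.
        row_medians = [statistics.median(row) for row in residual]
        for p in range(n_probes):
            probe_effect[p] += row_medians[p]
            residual[p] = [v - row_medians[p] for v in residual[p]]
        shift = statistics.median(chip_effect)
        overall += shift
        chip_effect = [e - shift for e in chip_effect]
        # Sweep chip (column) medians out of the residuals.
        col_medians = [statistics.median(residual[p][c] for p in range(n_probes))
                       for c in range(n_chips)]
        for c in range(n_chips):
            chip_effect[c] += col_medians[c]
            for p in range(n_probes):
                residual[p][c] -= col_medians[c]
        shift = statistics.median(probe_effect)
        overall += shift
        probe_effect = [e - shift for e in probe_effect]
        if max(abs(m) for m in row_medians + col_medians) < tol:
            break
    expression = [overall + e for e in chip_effect]
    return expression, probe_effect, residual

# Hypothetical probeset: 3 probes on 2 chips, purely additive effects.
matrix = [[8.0, 10.0], [9.0, 11.0], [7.0, 9.0]]
expression, probe_effect, residual = median_polish(matrix)
print(expression)
```

    Because medians are used instead of means, a single badly behaved probe ends up in the residuals rather than dragging the chip estimate with it.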

  • GCRMA: Background Adjustment
    Physical Review E 68, 011906 (2003)

    Sequence specificity of brightness in the PM probes.

  • (GC)RMA: Summary
    Good:
      Results are log2
      GCRMA: Adjusts for probe sequence effects
      Rigidly model based: defines model then tries to fit experimental data to the model. Fewer fudge factors than MAS5
    Bad:
      Does not provide calls as MAS5 does
    Misc:
      The input is a group of samples that have the same distribution of intensities.
      Requires multiple samples

  • Comparison (Affy spike-in data set)
    Nature Biotechnology 22, 656-658 (2004) doi:10.1038/nbt0604-656b
    [Figure labels: non-spike-in, spike-in (fold change)]

  • Affycomp

  • How many replicates?
    3 or more Biological Replicates is a minimum!

    Biological Replicates
      Recreate the experiment several times. This gives a sense of biological variability.
    Technical Replicates
      Don't bother unless you're doing a technical study of microarray variability.

  • Unit 1 Exercises
    Downloading microarray data from StemBase
    Generating MAS5, RMA, GCRMA expression values using R
    Comparing expression values with each other
    Determining fold change of probesets for MAS5, RMA, GCRMA results.

  • Conclusion
    Please contact [email protected] if you have any comments, corrections or questions.
    See associated bibliography for references from this presentation and further reading.
    Thanks for your attention!

    (We won't describe lab methodology in detail.)

    Microarray generally refers to types of massively parallel biological assays where many tests are done simultaneously.

    Photolithography uses a combination of deposition and photographic masking, similar to the process used to create computer chips.
    **Discuss ChIP work in Rudnicki lab: it's possible to use SNPs with unique locations in conjunction with ChIP to determine what areas of the genome are methylated or acetylated, for example.

    This is probably due to our lack of understanding of how to properly design probes.
    An assumption, because if you *don't* make this assumption, what does the data mean?
    Note that only the polyadenylated RNA is reverse transcribed: this will pick up mRNA but not microRNA or other non-polyadenylated sequences.
    Other normalization techniques usually start with the CEL file generated by GCOS.
    GCOS is the current term for the standard set of microarray software, GeneChip Operating System.
    http://www.affymetrix.com/support/technical/technotes/hgu133_p2_technote.pdf

    http://www.affymetrix.com/support/help/faqs/rat_230/faq_8.jsp
    Not all probesets have probes with overlapping tiling of the target sequence.

    Note that the probesets in this example are uneven.
    Every probe is part of a pair: one perfect match, one mismatch. The mismatch should not hybridize to the target; it should detect non-specific hybridization and local background, providing something like a local control to compare the perfect match with.
    Pseudo-image of an Affy MOE430A chip, reconstructed from the CEL file, where intensity is proportional to brightness.
    This particular image was generated by dChip.

    Note that the probes in a given probeset can give widely varying results.
    The highlighted probe pair has an almost zero intensity.
    It is possible for a given mismatch probe to give higher values than a perfect match.
    Skip dChip on this one.
    http://www.stat.berkeley.edu/~bolstad/MAS5diff/Mas5difference.html
    MAS5 variance is probably due to rounding errors or something magical happening.

    Original Affymetrix technology was resequencing technology, hence the mismatch probes.
    Inability to reproduce based on docs is a serious issue; it indicates variation in implementation.
    Lack of use of MM probes means that *half* of the probes on the chip are being discarded.
    Statistically much more complicated.
    Computationally complex method.
    Cite Bolstad slide this was taken from, or get it from the paper.
    GET BOLSTAD SLIDE FOR THIS