Download pdf - Introduction to biology and measurement of gene expressionweb.math.ku.dk/~richard/courses/bioconductor2009/handout/... · 2009. 8. 17. · Statistical analysis of gene expression

1

Introduction to biology and measurement of gene expression

Statistical analysis of gene expression data with R and Bioconductor

University of Copenhagen, 17-21 August, 2009

Margaret Taub University of California, Berkeley

Slides courtesy of Terry Speed, with modifications

2

Central dogma of molecular biology

Each gene is transcribed (at the appropriate time) from DNA into mRNA, which then leaves the nucleus and is translated into the required protein.

Any gene which is active in this way at any particular time is said to be expressed.

DNA activity in eukaryotic cell

3

Base Pairing

4

A nucleotide is a phospate, a sugar, and a purine or a pyramidine base.

The monomeric units of nucleic acids are called nucleotides.

5

DNA Replication I New strands of DNA are synthesized by copying of parental

strands.

The DNA strand that is copied to form a new strand is called a template.

The information in the template is preserved: although the first copy has a complementary sequence, not an identical one, a copy of the copy produces the original (template) sequence again.

In the replication of double-stranded or duplex DNA molecule, both original (parental) DNA strands are copied. When copying is finished, the two new duplexes, each consisting of one of the original strands plus its copy, separate from each other.

6

DNA Replication II

All DNA synthesis proceeds in the same chemical direction: 5ʼ->3ʼ.

The enzymes that copy (replicate) DNA to make more DNA are DNA polymerases.

DNA polymerases cannot initiate chain synthesis de novo; instead they require a short preexisting DNA strand, called a primer, to begin chain growth. With a primer base-paired to the template strand, a DNA polymerase adds nucleotides to the free hydroxyl group at the 3ʼ end of the primer.

Replication of duplex DNA requires assembly of many proteins (at least 30) at a growing fork: helicases to unwind, primases to prime, ligases to ligate (join), topisomerases to remove supercoils, RNA polymerase, etc.

7

8

9

10

Transcription: from DNA to mRNA

Transcription is a very complex process involving many steps and many proteins.

RNA polymerase initiates transcription of most genes at a unique position (a single base: the start site) in the template DNA upstream of the coding sequence.

RNA polymerase binds to specific promoter sequences to initiate transcription.

Initiation begins when a subunit of a polymerase binds to a promotor DNA sequence.

There are three different types of RNA polymerases with different functions.

11

Regulation of transcription Expression of eukaryotic protein coding genes is regulated

through multiple transcriptional control regions. Regulatory elements are often many kilobases from start

sites. Repressors bind to DNA sequences called operators which

overlap the promoter region that contacts RNA polymerase. A bound repressor interferes with binding of RNA polymerase and transcription initiation.

Activators generally bind to DNA on the opposite side of the helix from the polymerase at specific positions.

Enhancers, usually 100-200 bp in length contain multiple 8-20 bp control elements, may be from 200 bp to tens of kilobases upstream of downstream from a promoter, within an intron, or downstream from a final exon of gene.

12

From Genes VII

13

14

From Alberts et al, 1998

15

Cartoon of a gene

16

Structure of mRNA The open reading frame (ORF) consists of introns and

exons.

Untranslated regions (UTRs) are present at both the 5ʼ and the 3ʼ end of the mRNA.

A oligo consisting of all adenine nucleotides (a poly-A tail) is added to the 3ʼ end of the mRNA after transcription.

17

Post-transcriptional modification: splicing Before translation, most introns must be spliced out of the

mRNA.

Sometimes, different combinations of introns and exons are spliced out, potentially leading to different mRNA products from the same gene.

These different products are referred to as isoforms.

Scientists may be interested in measuring expression of these different isoforms, which adds a layer of complexity to the concept of gene expression.

18

Translation: from mRNA to protein Triples of nucleic acids (called codons) code for the

different amino acids. Translation takes place in an ORF between a start codon and a stop codon.

Ribosomes handle the translation of mRNA into protein, decoding the codons to produce an amino acid sequence that corresponds to the nucleotide sequence in the mRNA.

Eukaryotic initiation of protein synthesis occurs at the 5ʼ end and internal sites in mRNA.

Multiple ribosomes can operate in parallel on the same mRNA transcript.

19

20

Polymerase Chain Reaction (PCR) I

This reaction is used to amplify specific DNA sequences in a complex mixture when the ends of the sequence are known.

Start with the DNA you want to amplify. Heat-denature it into single strands. Add synthetic oligonucleotides complementary to the 3ʼ ends of the

segments of interest in great excess to the denatured DNA, and lower the temperature of the solution.

The genomic DNA remains denatured, because the complementary strands are at too low a concentration to encounter each other during the period of incubation, but the specific oligonucleotides, which are at a very high concentration, hybridize with their complementary sequences in the genomic DNA.

21

PCR II

The hybridized oligos then serve as primers for DNA chain synthesis.

Add dNTPs and a temperature resistant polymerase such as that from Thermus aquilus (a bacterium that lives in hot springs). Synthesis will then proceed.

When synthesis is complete, heat the mixture further (to 95˚C) to melt the newly formed duplexes.

When the temperature is lowered again, a new round of synthesis takes place because excess primer is still present.

Repeated cycles of synthesis (cooling) and melting (heating) quickly amplify.

22

23

Hybridization

This is a technique which exploits a potent feature of the DNA duplex - the sequence complementarity of the two strands. Remarkably, DNA can reassemble with perfect fidelity from the separated strands. The melting temperature of a DNA molecule is denoted Tm and this depends on G+C content and salt concentration.

24

25

Some bio abbreviations (in this context)

PCR = Polymerase Chain Reaction RT= Reverse Transcriptase (also: Real-Time) cDNA = complementary DNA (usually to RNA) Cy3, Cy5 = Cyan….you donʼt want to know = green, red dye wt, mt = wild-type (also “normal”), mutant organism 5ʼ, 3ʼ = directionalities of DNA, usually depicted from left, right oligo = oligonucleotide = short (say < 100 bp) DNA molecule ss, ds = single-stranded, double-stranded (usually DNA) DNA probe = cDNA or oligo, usually fixed to solid (see later) target = mRNA or cDNA seeking complementary probe bp = base pairs (measuring length of DNA molecule)

26

Quantifying gene expression

To quantify gene expression, we can measure either mRNA or protein, and there are many ways of doing each. Some are little more than detection assays, others give relative expression, and yet others give “absolute” expression.

Measuring mRNA: quantitative Northern blot, qPCR, qrt-PCR, short- or long-oligo or cDNA microarrays, EST sequencing, SAGE, MPSS, bead-arrays, next-generation sequencing…

Measuring protein: many assays using antibodies (quantitative Western blots, ELISA..), many separation techniques (e.g. 2D gels, gas, liquid chromatography, MS) leading to spots, peaks and peak areas, etc.

27

How microarrays are used to measure gene expression

Basic idea: measure the activity level of a gene (its expression level), in a particular cell at a particular time, by measuring the concentration of that geneʼs mRNA transcript in the cellʼs total RNA. How?

•  Immobilize DNA probes (oligos or cDNA) onto glass, •  Hybridize labelled target mRNA (in reality cDNA equivalent)

with probes on glass, •  Measure how much binds to each probe (i.e. forms ds DNA).

(In two-channel experiments, equal amounts of two differently labelled target cDNAs are hybridized to the probes.)

28

Notes I: What we are measuring

The details of microarray assays are quite extensive. Look them up on the www or in books. Just some notes now.

The mRNA we quantify is being synthesized in the cells (in vivo) at some rate, and is decaying in vivo at some other rate, as well as being degraded in vitro at a third rate. We get an imperfect measure of the net mRNA. Most mRNA degradation occurs 3ʼ to 5ʼ, and it can be a real problem. Degradation may affect different transcripts differently.

29

Notes II: Probe effects

The fixed probes are attached to the glass in ss form. Binding is (Watson-Crick) base-pairing of ss fragments to form ds fragments.

Hybridization efficiency (how much hybridization takes place in a given length of time) will be affected by the sequence composition of the probes (e.g., GC-content). This makes comparison of signal between different probes difficult.

Efficiency will also depend on the concentration of the target molecule and the temperature at which the hybridization reaction is taking place.

Not all pairing is perfectly complementary, so we do have cross-hybridization (imperfect pairing).

30

Notes III: Intensity and concentration

In general, our models assume a linear relationship (on a log-scale) between concentration of target present in the sample and signal intensity for a given probe.

However, there are limitations to detection capability both at the low end of concentration (inability to distinguish signal from background) and the high end (saturation effects).

31

Types of microarrays cDNA microarrays: probes are cDNA 300-3000 base pairs

in length, PCRd from custom libraries and spotted onto a microscope slide using a robot.

Long-oligo spotted arrays: shorter but uniform length nucleotide probes, 60-90 bp, usually synthesized and spotted as with cDNA, but there are also

Commercial spotted arrays (e.g., Agilent) spotting long oligo probes using inkjet technology.

High-density short (e.g. 25 bp) oligo arrays (Affymetrix, Nimblegen) synthesized in situ. Single-channel.

High-density long-oligo arrays: 60-90 bp, synthesized in situ (Nimblegen)

High-density bead arrays: Illumina 50bp, tag sequence hybridization

32

Microarray measurements

All raw measurements are fluorescence intensities.

We measure the amount of labelled target cDNA which is bound to the immobilized probe by exciting the labelling molecules (the dye) with a laser, and collecting and counting the photons emitted.

Dyes can be chosen to have different peak emission wavelengths, and so two (or more) measurements per probe may be made.

In practice the entire glass slide or chip is scanned (excitation-emission-counting of emitted photons), and the result is a digital image. This then needs to be processed to locate the probes in the image and assign intensity measurements to each of them. There can be from hundreds to millions of different probes on each slide/chip.

For the next part of this lecture, Iʼll discuss two-channel microarrays.

33

Channel 1Excitation

(Red HeNe Laser)

Channel 1Emission

Channel 2Emission

Channel 2Excitation

(Red GreNe Laser)

Normal Cell Mutant Cell

Isolate RNA andprepared fluorescently

labeled cDNA

Hybridize and Wash

Image AcquisitionUsing 2 laser confocal

scanning system

Two colour arrays compare two RNA samples, e.g., normal vs mutant

Channel 1 emission proportion to gene expression in mutant RNA, Channel 2 to normal RNA

34

Scanner output consists of two TIFF images, one for each of red (Cy5) and green (Cy3) channels.

Shown here is false-coloured image with two channels overlaid:

●  more highly expressed in mutant

●  equally expressed

●  more highly expressed in normals

AGRF NIA 15k mouse array:

32k spots, 12 x 4 pin groups

cDNA microarray Image

35

Arrows indicate print direction

26 x 26 spots in each pin group. also called print-tip group.There can be real pin effects, or spatial effects for which pin group provides a surrogate.

36

Quantification of expression

Red, Green foreground : Rf, Gf

Red, Green background : Rb, Gb

Background corrected: R = Rf - Rb. G = Gf - Gb

log-ratio (“Minus”): M = log2R - log2G

Average intensity (“Add”): A = (log2R + log2G)/2

Lots of issues: which bg measure is best, to subtract or not to subtract bg, quality filtering, etc.

37

Spatial bias in cDNA arrays

Print-tip groups

Log-

ratio

s

Boxplots of M = log2R/G by print-tip group (1-16)

Locations of spots with extreme 5% M: high, low

Same slide

38

Spatial plots: background intensities

39

Affymetrix GeneChips®

Probes = 25 bp sequences

Probe Sets = set of probes corresponding to a particular gene, EST or exon.

In the past there has been 20 probes/probe set on human chips, 16 on mouse, while there are currently 11 on Human GeneChips® HG-U133A. Most genes or ESTs contain one probe set, but quite a few have > 1. Newer arrays with a probeset for each exon are also available.

40

In situ synthesis of probes base by base

41

Hybridization of mRNA to probes

42

Affymetrix GeneChip® Arrays

* **

**

43

Perfect Match and MisMatch probes

Perfect Match probe (PM) = 25 bp probe perfectly complementary to a specific region of a gene

MisMatch probe (MM) = 25 bp probe agreeing with a PM apart from the middle base, which is different, being a transition (A←→G, C←→T) of that base.

MMs were an attempt to measure cross-hybridization, and initially were subtracted from PM. They no longer are. Affymetrix has included a MM probe with every PM probe on every one of its chips, but this is not being done now.

44

5’ 3’

mRNA reference sequence PM: CAGACATAGTGTCTGTGTTTCTTCT MM CAGACATAGTGTGTGTGTTTCTTCT

45

46

Chip dat file – checkered board – close up w/ grid

Courtesy: F. Collin Image Analysis DAT file (TIFF file data)

47

Chip dat file – checkered board – close up pixel selection

48

CEL and CDF files

Image analysis of each scanned Affy array produces a CEL file containing the intensity (75th %ile, and other statistics) for each square probe cell.

To do anything with this file you also need the CDF (Chip Description File) which specifies the probe and the probe set to which each cell belongs.

CDF information is provided by CDF packages for R which can be downloaded from Bioconductor for the standard Affymetrix chips.

49

Pre-processing and summarization

•  Background correction •  Normalization •  Probe set summaries

More on this later in the course.

Computing expression values for each probe set requires 3-steps

50

Annotations

Each probe set is designed to measure a particular gene, but sometimes changes to the annotation of the genome of the organism you are working with may change the genes that certain probes map you. You may want to take this into account in your analysis.