Human Genome Project Recent Advances Ppt

06/03/2001 Mette Langaas 1

Norsk Regnesentral (Norwegian Computing Center)

Bioinformatics – an interesting area of research for statisticians (in Norway)?

Mette Langaas, Norsk Regnesentral

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Outline of talk

• What is bioinformatics?
• What do we need to know in biochemistry?
• The Human Genome Project.
• Research questions in bioinformatics.
• Gene expression and data from DNA microarrays.
• Statistical methods for analysing gene expression data.
• Bioinformatics in Norway.
• How can statisticians contribute?

What is Bioinformatics?

Russ Altman (in Bioinformatics), broad definition: Bioinformatics is the study of how information technologies are used to solve problems in biology.

Russ Altman (in Bioinformatics), narrow definition: Bioinformatics is the creation and management of biological databases in support of genomic sequences.

The BITS-journal: Bioinformatics is a combination of Computer Science, Information Technology and Genetics to determine and analyse genetic information.

What is Bioinformatics?

Bioinformatics programme at the University of Michigan (slightly modified): Bioinformatics merges recent advances in molecular biology and genetics with advanced statistics and computer science technology. The goal is increased understanding of the complex web of interactions linking the individual components of a living cell to the integrated behavior of the entire organism.

Statisticians: Bioinformatics is a collection of statistical methods for dealing with large biological data sets.

What do we need to know in Molecular Genetics and Biochemistry?

• Cell

• Chromosomes

• DNA

• Gene

• Genome

The Cell

Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html

Human Chromosomes

Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Example of DNA

(tertiary structure)

Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

DNA (deoxyribonucleic acid) is a double-stranded helix, consisting of a

• Sugar-phosphate backbone and

• Nitrogenous bases:
  – Adenine
  – Cytosine
  – Guanine
  – Thymine

Base pair (bp): two bases paired by hydrogen bonds between the bases.

Adenine pairs with Thymine

Guanine pairs with Cytosine
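
The pairing rules above can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are hypothetical, and strands are assumed to be written as plain strings over the letters A, C, G and T.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    strand = "ATGC"
    print(complement(strand))          # partner bases, position by position
    print(reverse_complement(strand))  # the opposite strand, reversed
    print(len(strand), "bp")           # length in base pairs
```

Because the two strands determine each other, a sequence of n bases describes n base pairs, which is why genome sizes are quoted in bp.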

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Genome-Chromosome-Gene

Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. [Human genome: 3 × 10^9 bp; more than 99% of the human DNA sequences are the same across the population]

Chromosome: The self-replicating genetic structure of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. [Human chromosome lengths range from 50 million to 263 million bp]

Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). [Human genes: 30 000?, average length 3000 bp]

Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html

What MORE do we need to know in Molecular Genetics and Biochemistry?

• What does the gene do? The gene encodes a specific functional product (i.e., a protein or RNA molecule).

• Protein synthesis

• mRNA

• Amino acid

Proteins are built from amino acids

Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs, and each protein has unique functions. Examples are hormones, enzymes, and antibodies.

Amino acid: Any of a class of 20 molecules that are combined to form proteins in living things. The sequence of amino acids in a protein and hence protein function are determined by the genetic code. [The 20 amino acids are: alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine.]

Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html

Protein synthesis: transcription and translation

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html
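
Transcription and translation can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the input is taken to be the coding strand (so transcription reduces to replacing T by U), the standard genetic code is used, translation starts at the first AUG and stops at the first stop codon, and the names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, written in the conventional T-C-A-G codon order.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna: str) -> str:
    """Coding-strand DNA to mRNA (assumption: the coding strand is given)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    dna = mrna.upper().replace("U", "T")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna, translate(mrna))
```

Each group of three bases (a codon) selects one of the 20 amino acids, which is how the base sequence of a gene fixes the order of amino acids in its protein.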

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

The Human Genome Project

• Begun formally in 1990, planned to be completed in 2003.
• The U.S. Human Genome Project is coordinated by the U.S. Department of Energy and the National Institutes of Health.
• Project goals are
  – to identify all the approximately 50,000(?) genes in human DNA,
  – determine the sequences of the 3 billion chemical bases that make up human DNA,
  – store this information in databases,
  – develop faster, more efficient sequencing technologies,
  – develop tools for data analysis, and
  – address the ethical, legal, and social issues that may arise from the project.

Results by now:
• Draft of entire genome (June 2000)
• 9711 mapped genes (February 4, 2001)
• New estimate: 30 000 genes (February, 2001)

Research questions in Bioinformatics

Data management:
– databases, searchable, compare.

Biological sequence alignment:
– compare two DNA sequences (HMM).

Pharmacogenetics:
– how genetic differences influence the variability in patients’ responses to drugs.

Proteomics:
– which proteins are present in a cell and which proteins interact with each other.

Structural genomics:
– determine the (3D) structure of the proteins encoded by a genome.

Comparative genomics:
– the functions of human genes and other DNA regions are often revealed by studying their parallels in nonhumans (mice and men...)

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Comparing human and mouse chromosomes

Research questions in Bioinformatics (cont’d.)

Transcriptomics:
– use mRNA transcripts to determine which genes are turned on/off in a particular cell or tissue type, and how disease changes this expression.

Functional genomics:
– experimental approaches and resources to assess gene function
– development of software tools to handle and interpret data, e.g. from DNA microarrays.

Functional genomics: gene expression and data from DNA microarrays

• Gene expression.
• cDNA microarray experiment.
• Data from one cDNA microarray experiment.
• Data from many cDNA microarray experiments (reference design).
• Applications.

Gene expression

The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).

http://www.ornl.gov/hgmis/publicat/glossary.html

cDNA microarray experiment

[Figure: workflow of a cDNA microarray experiment. cDNA clones (probes): PCR product amplification, purification, printing (0.1 nl/spot) onto the microarray. mRNA target: hybridise target to microarray. Excitation by laser 1 and laser 2, emission, scanning. Overlay images and normalise, then analysis.]

Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

The cDNA microarray experiment

1. Constructing the microarray (probe):
• From a collection of purified DNAs. A drop of each type of DNA in solution is placed on a specially prepared glass microscope slide by an arrayer machine.

2. Choosing and preparing the targets:
• Select targets: the aim is to compare gene expression in different cell populations: tissue specific, disease specific, environmental, cell cycle etc.
• mRNA extraction: capture mRNA, amplification.
• Reverse transcription to cDNA (more stable).
• Fluorescent labelling of cDNA targets: to identify their presence. Red and green dyes (Cy5 and Cy3) are the most common.

The cDNA microarray experiment (cont’d.)

3. Hybridization and scanning:
• The cDNA target will hybridize to spots on the array.
• When excited by a laser (different wavelengths), the fluorescent target will emit light. The intensity will reflect the abundance of mRNA in the original target tissue. Using a scanner, two images (red and green) are acquired.

4. Image analysis of the microarray:
• Identify the spots (gridding, segmentation) and assign an intensity measurement.
• Relate the intensity in each spot to the background intensity (local or overall) and filter out weak spots (signal-to-noise ratio low, label as missing).

Data from one cDNA experiment

Raw images: for each microarray experiment we have two images with measurements of fluorescent intensities. Spots of bad quality are flagged.

From image to intensities: using image analysis techniques the spot and background pixels are determined. An intensity measurement is assigned as the difference between spot and (local) background. Missing values are defined as spots where the signal is not much larger than the background noise.
$G_g$ = green intensity for gene $g$
$R_g$ = red intensity for gene $g$

Relative log-intensities: there is variation in the amount of DNA from spot to spot, so intensities are only meaningful in a relative sense. For gene $g$ on the array, the relative log-intensity (usually base 2) is
$X_g^* = \log_2(R_g / G_g)$

The data vector: $\{X_g^*\}$ for $g = 1, \ldots, \#\text{genes}$.
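
A minimal Python sketch of the step from spot and background measurements to a relative log-intensity is given below. It assumes four numbers per spot (spot and local background in each channel). The function name `relative_log_intensity` and the threshold `min_ratio` are hypothetical: the slide only says the signal should be "much larger" than the background noise, so the value 1.5 is an arbitrary illustration.

```python
import math
from typing import Optional


def relative_log_intensity(red_spot: float, red_bg: float,
                           green_spot: float, green_bg: float,
                           min_ratio: float = 1.5) -> Optional[float]:
    """Return log2(R/G) for one spot, or None if the spot is treated as missing.

    min_ratio is a hypothetical cut-off: a spot counts as missing when its
    raw signal is less than min_ratio times the local background.
    """
    if red_spot < min_ratio * red_bg or green_spot < min_ratio * green_bg:
        return None
    red = red_spot - red_bg        # background-corrected red intensity R_g
    green = green_spot - green_bg  # background-corrected green intensity G_g
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)


if __name__ == "__main__":
    print(relative_log_intensity(900, 100, 300, 100))  # red four times green
    print(relative_log_intensity(120, 100, 300, 100))  # weak red spot: missing
```

On the base-2 scale a value of 1 means twice as much red as green signal and a value of -1 means half as much, so over- and under-expression are treated symmetrically.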

Data from many cDNA experiments

Reference design: use the same reference sample (green) for each experiment (often cultivated cells). The different tissue samples are dyed red.

From image to intensities for each experiment:
$G_{gi}$ = green intensity for gene $g$ at array $i$
$R_{gi}$ = red intensity for gene $g$ at array $i$

Relative log-intensities from each experiment:
$X_{gi}^* = \log_2(R_{gi} / G_{gi})$

Median polishing: iteratively subtract the column and row medians.

The data matrix: $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.

[Figure: reference design. Sample 1, sample 2, sample 3, ..., sample n are each hybridised together with the same reference.]
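
The median polishing step can be sketched as follows. This is a hedged illustration, not the procedure used in any particular study: rows are assumed to be genes and columns arrays, the matrix is assumed to have no missing values, and the function name `median_polish` and the fixed iteration count `n_iter` are hypothetical choices.

```python
from statistics import median


def median_polish(matrix, n_iter=10):
    """Iteratively subtract column medians, then row medians.

    matrix: list of rows (genes), each a list of values (one per array).
    Returns a new matrix of residuals; the input is left unchanged.
    """
    x = [list(row) for row in matrix]
    n_cols = len(x[0])
    for _ in range(n_iter):
        # Centre every column (array) at median zero.
        for j in range(n_cols):
            col_median = median(row[j] for row in x)
            for row in x:
                row[j] -= col_median
        # Centre every row (gene) at median zero.
        for row in x:
            row_median = median(row)
            for j in range(n_cols):
                row[j] -= row_median
    return x


if __name__ == "__main__":
    log_ratios = [[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]]
    for row in median_polish(log_ratios):
        print(row)
```

In this toy matrix the values are a pure sum of a row effect and a column effect, so the polished residuals are all zero. With real data the residuals are what remains after array-wide and gene-wide levels have been removed.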

DNA microarray applications

• Human disease diagnostics and treatment
  – determination of predisposition and risk factors wrt. certain diseases
  – prediction of risk factors involved in using certain treatment schemes
  – monitor disease stage and treatment progress

• Agricultural diagnostics and development
  – identify plant pathogens to allow suitable plant protection to be improved
  – efficacy and economy in plant biotechnology

• Analysis of food and genetically modified organisms (GMO)
  – determine the integrity of food
  – detect alterations and contaminations
  – quantify GMOs

• Drug discovery and drug development

Statistical methods for analysing gene expression data

METHOD: TASK
• Experimental design: design the experiment
• Analysis of variance: normalize within array and between arrays
• Clustering: find similar groups of samples and/or genes
• Discrimination and classification: discriminate between two or more groups of samples; classify a new sample to one of many groups (or compute group probability)
• Multiple testing, feature extraction: find genes that are differentially expressed (all samples or subsets)

Experimental design and ANOVA
[Kerr & Churchill, The Jackson Lab, August 2000]

Sources of variation for the fluorescent intensities
• Variety:
  – timepoints of a biological process
  – different types of tissue
  – different treatments
  – different types of disease
• Genes (hybridization efficiency)
• Dyes (two dyes – one dye consistently brighter than the other?)
• Arrays (probing conditions)
• Dye*Variety (differences when dyeing the samples)
• Array*Gene (amount of cDNA on probes for same gene on different arrays vary = ”spot” effects)
• Dye*Gene (are there differences in the dyes that are gene specific?)
• Variety*Gene (EFFECT of INTEREST)

Experimental design and ANOVA (cont’d.)

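ANOVA model:
$y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (VG)_{kg} + (AG)_{ig} + \varepsilon_{ijkg}$

where $y_{ijkg}$ is a transformation of $R_{gi}$ or $G_{gi}$ so that effects are additive, and $\varepsilon_{ijkg}$ has distribution $F(0, \sigma^2)$ or $F_g(0, \sigma_g^2)$.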

Replication of genes on every array: (AG)igs

Experimental design and ANOVA (cont’d.)

Reference design:
• same reference variety on each array (variety not of interest)
• the most popular design.
• VG is completely confounded with DG.
• No degrees of freedom left for error estimation.
• Use when not enough tissue to dye twice.

Loop design:
• collects twice as much data on the varieties of interest.
• balanced wrt. dye, but each sample must be dyed twice.
• (#genes-1) degrees of freedom left for error estimation.
• more difficult to understand for biologists.

Clustering

Aim: partition genes or samples into groups so that the groups are homogeneous and well-separated.

Data: $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.

Results:
– Find groups of genes that are co-regulated
– Find subgroups (previously unknown) of diseases

Issues:
– feature extraction (choosing a subset of the genes)
– one-way or two-way clustering
– overlapping vs. non-overlapping clusters
– membership in more than one cluster
– assessing the reliability of clustering results

Clustering: methods for analysing gene expression data

One-way clustering:
– Hierarchical clustering
– Self-organizing maps (SOM) [Kohonen]
– K-means
– SVD-based (principal component) clustering

Two-way clustering:
– Block clustering
– Gene Shaving [Hastie et al. (2000)]
– Plaid Models [Lazzeroni & Owen (2000)]
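
As an illustration of one-way clustering, the sketch below implements a bare-bones K-means on gene expression profiles (one row per gene, one value per array). It is a toy version under stated assumptions: the first k profiles are used as starting centres, squared Euclidean distance is used, and the names `k_means` and `squared_distance` are hypothetical. Real analyses typically use several random starts.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(profiles, k, n_iter=20):
    """Partition profiles into k groups; returns (labels, centres)."""
    # Naive initialisation (an assumption of this sketch): first k profiles.
    centres = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(n_iter):
        # Assignment step: each profile joins its nearest centre.
        labels = [min(range(k), key=lambda c: squared_distance(p, centres[c]))
                  for p in profiles]
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


if __name__ == "__main__":
    # Four genes measured on three arrays; two are up, two are down.
    genes = [[2.0, 2.1, 1.9],
             [-2.0, -1.8, -2.2],
             [1.8, 2.2, 2.0],
             [-2.1, -2.0, -1.9]]
    labels, centres = k_means(genes, k=2)
    print(labels)
```

Transposing the data matrix clusters samples instead of genes, which is the other half of the one-way setting.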

Molecular portraits of human breast tumours, Perou et al., Nature, 406 (6797), 2000.

Discrimination and Classification

Aim:
– Discriminate between two or more classes (e.g. normal vs. different disease classes).
– Predict the class (or probability of belonging to each class) of a new sample.

Data:
– $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.
– Class membership (e.g. normal, different disease classes).

Results:
– See which genes are important in discriminating between classes.
– Predictive tool.

Issues:
– Large p (#genes) small n (#arrays, #samples).
– Feature extraction or other forms of shrinkage.

Discrimination and classification: methods for analysing gene expression data

Used:
– K-nearest neighbour [Fix and Hodges (1951)]
– Support Vector Machines
– CART [Breiman et al. (1984)]
– Different versions of choosing the class with the largest probability $p(c) \cdot p(x|c)$, where $p(x|c)$ is Gaussian with some structure on the covariance matrix (often diagonal).
– Voted Classification (bagging, boosting)
– Bayesian regression [West et al. (2000)]

Alternative methods:
– Methods for ”large p small n” regression (PLS, PCR, ridge regression, continuum regression, etc.)
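
The rule "choose the class with the largest p(c)·p(x|c)" with a diagonal Gaussian covariance can be sketched as follows. This is one possible reading, with assumptions: each sample is a list of log-ratios (one per gene), priors are estimated as class frequencies, variances are estimated per class and per gene, and a small variance floor (`min_var`, a hypothetical parameter) avoids division by zero. The function names are also hypothetical.

```python
import math


def fit(samples, labels, min_var=1e-6):
    """Estimate prior, per-gene means and per-gene variances for each class."""
    model = {}
    for c in sorted(set(labels)):
        rows = [s for s, lab in zip(samples, labels) if lab == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [max(sum((v - m) ** 2 for v in col) / len(rows), min_var)
                     for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(samples), means, variances)
    return model


def log_scores(model, x):
    """log p(c) + log p(x|c) for each class, with a diagonal covariance."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        scores[c] = score
    return scores


def classify(model, x):
    """Return the class with the largest p(c) * p(x|c)."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    samples = [[1.0, 0.0], [1.2, 0.2], [-1.0, 0.1], [-1.1, -0.1]]
    labels = ["tumour", "tumour", "normal", "normal"]
    model = fit(samples, labels)
    print(classify(model, [0.9, 0.1]))
```

The diagonal structure means only one variance per gene and class is estimated, which is one way of coping with many more genes than samples.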

Feature extraction and multiple testing

Aim:
– Identify differentially expressed genes.

Data:
– $\{X_{gi}\}$ for $g = 1, \ldots, \#\text{genes}$ and $i = 1, \ldots, \#\text{arrays}$.
– Possible class membership.

Results:
– Identify genes that can be important in discriminating between different classes.

Methods:
– Within the ANOVA framework, testing the (VG) interaction.
– Compute a t-statistic for each gene (difference e.g. between control and treatment group) and adjust the p-values (Bonferroni, permutation methods)

Other:
– Many ad hoc rules; differentially expressed genes are genes where more than 3 values are outside some interval (no class).
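
The per-gene t-statistic with adjusted p-values can be sketched as follows. The assumptions are labelled here and in the code: a Welch-type t-statistic is used (the slide does not say which version), the p-value comes from enumerating all splits of the pooled values into two groups (feasible only for small groups), and the function names are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_statistic(a, b):
    """Welch-type two-sample t-statistic (an assumption of this sketch)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(a, b):
    """Two-sided p-value from all ways of splitting the pooled values."""
    pooled = list(a) + list(b)
    observed = abs(t_statistic(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(t_statistic(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    treatment = [5.0, 5.2, 4.9]  # log-ratios for one gene, treatment arrays
    control = [1.0, 1.1, 0.8]    # same gene, control arrays
    p = permutation_p_value(treatment, control)
    print(p, bonferroni([p, 0.5]))
```

With three arrays per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 2/20 = 0.1. After a Bonferroni correction over thousands of genes nothing would be significant, which illustrates why replication matters.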

Bioinformatics in Norway

No statisticians involved in Norway!

Bioinformatics groups at Norwegian universities:
• UiO: Bioinformatics group, Department of Informatics, O. C. Lingjærde and K. Liestøl
  – functional genomics using microarrays, clustering.
• UiB: Bioinformatics Research Group, Department of Informatics, led by I. Jonassen.
  – analysis of biological sequences and structure
  – J-Express clustering method for gene expression data
• NTNU: Knowledge Systems Group, Department of Computer and Information Science, led by J. Komorowski.
  – classification from gene expression data and a priori information

Bioinformatics in Norway: some academic actors

• UiO:
  – Department of Biochemistry
• Det norske Radiumhospital (DNR):
  – The Microarray Project at DNR, led by Ola Myklebost
  – Department of Tumor Biology
  – Department of Immunology
  – Department of Genetics
• The Norwegian Veterinary College
• UiB:
  – Department of Biochemistry and Molecular Biology
  – Department of Oncology
• NTNU:
  – Department of Physiology and Biomedical Engineering (Astrid Lægreid)
• Agricultural University of Norway

Bioinformatics in Norway: consortium on microarray technology

• Who: NTNU, UiB and DNR (UiO)
• Aim:
  – Establish front line competence in microarray bioinformatics at all participating institutions.
  – Create national data warehouse for microarray based functional genomic analysis.
• Support: The Norwegian Cancer Society and NFR

More information at http://www.med.uio.no/dnr/microarray/english.html

Bioinformatics in Norway: other actors

• The Norwegian Biotechnology Advisory Board (Bioteknologinemnda), led by Sissel Rogne, publication ”Genialt”. http://www.bion.no
• EMBnet Norway, Norwegian node of network for commercial and academic bioinformatic centers. http://www.no.embnet.org
• Biotechnology Center (Bioteknologisenteret)
• SINTEF UniMed, MR Center, Bioinformatics group
• MATFORSK (fingerprinting bacteria, GMO)
• Genomar (salmon and tilapia) http://www.genomar.com
• GlaxoSmithKline (free offices to gene-researchers?)
• Nycomed Pharma

Bioinformatics in Norway: NFR project

Salmon Genome Project (SGP)
• Aim: Expand our knowledge of the biology of salmon and introduce modern genetic techniques in breeding and management of Atlantic salmon. The project is said to combine research in molecular genetics with bioinformatics, with focus on genome organization and gene function.
• Who: The Norwegian Veterinary College, the University of Oslo, the University of Bergen, SINTEF Unimed and the Institute of Marine Research.
• Cost: 350 MNOK

Bioinformatics in Norway: national research initiative

FUGE (Funksjonell genomforskning, functional genomics research)
• What: National plan by NFR and the Norwegian universities.
• Aim: Bring Norway up-to-date on functional genome research.
• Areas: biological, medical, marine research.
• Cost: 300 MNOK per year for 5-10 years (dependent on approval by Stortinget).

More information at http://www.forskningsradet.no/fag/andre/fuge/

How can statisticians contribute?

• Close cooperation between researchers from genetics, biochemistry, medicine and biology and statisticians is very important!
• Communicate the need for statistical thinking in the analysis of gene expression data
  – Concept of noise, replication, reproducible analyses.
• Statistical challenges identified today:
  – Model the entire experimental phase to arrive at optimal experimental designs dependent on practical limitations.
  – Suggestions for within-array and between-array normalization.
  – Handle missing values.
  – Large p small n.

Bioinformatics – an interesting area of research for statisticians (in Norway)?

✓ Important biological/medical problems
✓ New area with exciting technologies
✓ Large amounts of data
✓ Statistical experience is scarce!
✓ Many statistical challenges!

YES!