06/03/2001 Mette Langaas 1
Norsk Regnesentral / Norwegian Computing Center
Bioinformatics – an interesting area of research for statisticians (in Norway)?
Mette Langaas, Norsk Regnesentral
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Outline of talk
• What is bioinformatics?
• What do we need to know in biochemistry?
• The Human Genome Project.
• Research questions in bioinformatics.
• Gene expression and data from DNA microarrays.
• Statistical methods for analysing gene expression data.
• Bioinformatics in Norway.
• How can statisticians contribute?
What is Bioinformatics?
Russ Altman (in Bioinformatics), broad definition: Bioinformatics is the study of how information technologies are used to solve problems in biology.
Russ Altman (in Bioinformatics), narrow definition: Bioinformatics is the creation and management of biological databases in support of genomic sequences.
The BITS-journal: Bioinformatics is a combination of Computer Science, Information Technology and Genetics to determine and analyse genetic information.
What is Bioinformatics? (cont'd.)
Bioinformatics programme at the University of Michigan (slightly modified): Bioinformatics merges recent advances in molecular biology and genetics with advanced statistics and computer science technology. The goal is increased understanding of the complex web of interactions linking the individual components of a living cell to the integrated behavior of the entire organism.
Statisticians: Bioinformatics is a collection of statistical methods for dealing with large biological data sets.
What do we need to know in Molecular Genetics and Biochemistry?
• Cell
• Chromosomes
• DNA
• Gene
• Genome
The Cell
Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html
Human Chromosomes
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Example of DNA
(tertiary structure)
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
DNA (deoxyribonucleic acid) is a double-stranded helix, consisting of a
• sugar-phosphate backbone and
• nitrogenous bases:
– Adenine
– Cytosine
– Guanine
– Thymine
Base pair (bp): two bases paired by hydrogen bonds between the bases.
Adenine pairs with Thymine
Guanine pairs with Cytosine
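The pairing rule above is enough to reconstruct one strand from the other. A minimal sketch in Python, assuming strands are written as strings over A, C, G, T; the function names are hypothetical, and the reversal reflects the fact that the two strands run in opposite directions:

```python
# Base pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the paired base for every position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own direction (strands are antiparallel)."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))
    print(reverse_complement("ATGC"))
```

A strand of length n therefore corresponds to n base pairs, which is why genome sizes are quoted in bp.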
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Genome-Chromosome-Gene
Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. [Human genome: $3 \times 10^9$ bp; more than 99% of the human DNA sequences are the same across the population]
Chromosome: The self-replicating genetic structure of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. [Human chromosome lengths range from 50 million to 263 million bp]
Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). [Human genes: 30 000?, average length 3000 bp]
Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html
What MORE do we need to know in Molecular Genetics and Biochemistry?
• What does the gene do? The gene encodes a specific functional product (i.e., a protein or RNA molecule).
• Protein synthesis
• mRNA
• Amino acid
Proteins are built from amino acids
Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs, and each protein has unique functions. Examples are hormones, enzymes, and antibodies.
Amino acid: Any of a class of 20 molecules that are combined to form proteins in living things. The sequence of amino acids in a protein and hence protein function are determined by the genetic code. [The 20 amino acids are: alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine.]
Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html
Protein synthesis: transcription and translation
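The two steps can be illustrated with a short Python sketch. It assumes the input is the coding strand of the gene (so the mRNA is the same sequence with U in place of T) and uses only a small excerpt of the standard genetic code; the table excerpt and function names are chosen for illustration and are not taken from the talk:

```python
# Excerpt of the standard genetic code (mRNA codon -> amino acid).
# None marks a stop codon. The full table has 64 entries.
CODON_TABLE = {
    "AUG": "methionine",
    "UUU": "phenylalanine",
    "GGC": "glycine",
    "GCU": "alanine",
    "AAA": "lysine",
    "UGG": "tryptophan",
    "UAA": None,
    "UAG": None,
    "UGA": None,
}


def transcribe(coding_strand: str) -> str:
    """Transcription: mRNA equals the coding strand with thymine replaced by uracil."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list:
    """Translation: read codons of three bases until a stop codon or the end."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this excerpt of the code")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon ends the chain
            break
        protein.append(amino_acid)
    return protein


if __name__ == "__main__":
    mrna = transcribe("ATGTTTGGCTAA")
    print(mrna, translate(mrna))
```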
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
The Human Genome Project
• Begun formally in 1990, planned to be completed in 2003.
• The U.S. Human Genome Project is coordinated by the U.S. Department of Energy and the National Institutes of Health.
• Project goals are
– to identify all the approximately 50,000(?) genes in human DNA,
– determine the sequences of the 3 billion chemical bases that make up human DNA,
– store this information in databases,
– develop faster, more efficient sequencing technologies,
– develop tools for data analysis, and
– address the ethical, legal, and social issues that may arise from the project.
Results by now:
• Draft of entire genome (June 2000)
• 9711 mapped genes (February 4, 2001)
• New estimate: 30 000 genes (February, 2001)
Research questions in Bioinformatics
Data management:
– databases, searchable, compare.
Biological sequence alignment:
– compare two DNA sequences (HMM).
Pharmacogenetics:
– how genetic differences influence the variability in patients' responses to drugs.
Proteomics:
– which proteins are present in a cell and which proteins interact with each other.
Structural genomics:
– determine the (3D) structure of the proteins encoded by a genome.
Comparative genomics:
– the function of human genes and other DNA regions is often revealed by studying their parallels in nonhumans (mice and men...)
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
Comparing human and mouse chromosomes
Research questions in Bioinformatics (cont’d.)
Transcriptomics:
– use mRNA transcripts to determine which genes are turned on/off in a particular cell or tissue type, and how disease changes this expression.
Functional genomics:
– experimental approaches and resources to assess gene function
– development of software tools to handle and interpret data, e.g. from DNA microarrays.
Functional genomics: gene expression and data from DNA microarrays
• Gene expression.
• cDNA microarray experiment.
• Data from one cDNA microarray experiment.
• Data from many cDNA microarray experiments (reference design).
• Applications.
Gene expression
The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).
http://www.ornl.gov/hgmis/publicat/glossary.html
cDNA microarray experiment
[Figure: workflow of a cDNA microarray experiment. Panel labels: cDNA clones (probes); PCR product amplification and purification; printing (0.1 nl/spot); microarray; mRNA target; hybridise target to microarray; excitation (laser 1, laser 2); emission; scanning; overlay images and normalise; analysis.]
Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt
Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/
The cDNA microarray experiment
1. Constructing the microarray (probe):
• From a collection of purified DNAs. A drop of each type of DNA in solution is placed on a specially prepared glass microscope slide by an arrayer machine.
2. Choosing and preparing the targets:
• Select targets: the aim is to compare gene expression in different cell populations: tissue specific, disease specific, environmental, cell cycle etc.
• mRNA extraction: capture mRNA, amplification.
• Reverse transcription to cDNA (more stable).
• Fluorescent labelling of cDNA targets: to identify its presence. Red and green dyes (Cy5 and Cy3, respectively) are the most common.
The cDNA microarray experiment (cont’d.)
3. Hybridization and scanning:
• The cDNA target will hybridize to spots on the array.
• When excited by a laser (different wavelengths), the fluorescent target will emit light. The intensity will reflect the abundance of mRNA in the original target tissue. Using a scanner, two images (red and green) are acquired.
4. Image analysis of the microarray:
• Identify the spots (gridding, segmentation) and assign an intensity measurement.
• Relate the intensity in each spot to the background intensity (local or overall) and filter out weak spots (low signal-to-noise ratio; label as missing).
Data from one cDNA experiment
Raw images: for each microarray experiment we have two images with measurements of fluorescent intensities. Spots of bad quality are flagged.
From image to intensities: using image analysis techniques the spot and background pixels are determined. An intensity measurement is assigned as the difference between spot and (local) background. Missing values are defined as spots where the signal is not much larger than the background noise.
$G_g$ = green intensity for gene $g$
$R_g$ = red intensity for gene $g$
Relative log-intensities: there is variation in the amount of DNA from spot to spot, so intensities are only meaningful in a relative sense. For gene $g$ on the array the relative log-intensity (usually base 2) is
$X_g^* = \log_2(R_g / G_g)$
The data vector: $\{X_g^*\}$ for $g = 1, \dots, \#\text{genes}$.
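The path from spot measurements to a relative log-intensity can be sketched in a few lines of Python. The function name and the `min_signal` threshold are hypothetical; the slide only says that the signal must be "much larger" than the background noise, so the cut-off used here is an assumption:

```python
import math


def relative_log_intensity(red_spot, red_background, green_spot, green_background,
                           min_signal=0.0):
    """Return log2(R/G) for one spot, or None if the spot is treated as missing.

    R and G are the background-corrected intensities (spot minus local background).
    min_signal is a hypothetical threshold; a spot whose corrected red or green
    intensity does not exceed it is labelled missing.
    """
    red = red_spot - red_background
    green = green_spot - green_background
    if red <= min_signal or green <= min_signal:
        return None
    return math.log2(red / green)


if __name__ == "__main__":
    # Red is four times as bright as green after background correction.
    print(relative_log_intensity(900, 100, 300, 100))
    # Red signal barely above background: labelled missing.
    print(relative_log_intensity(105, 100, 300, 100, min_signal=10))
```

With base 2, a value of 1 means the red channel is twice as intense as the green channel, and a value of -1 means the opposite.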
Data from many cDNA experiments
Reference design: use the same reference sample (green) for each experiment (often cultivated cells). The different tissue samples are dyed red.
[Figure: reference design. Sample 1, sample 2, sample 3, ..., sample n are each paired with the same reference.]
From image to intensities for each experiment:
$G_{gi}$ = green intensity for gene $g$ at array $i$
$R_{gi}$ = red intensity for gene $g$ at array $i$
Relative log-intensities from each experiment:
$X_{gi}^* = \log_2(R_{gi} / G_{gi})$
Median polishing: iteratively subtract the column and row medians.
The data matrix: $\{X_{gi}\}$ for $g = 1, \dots, \#\text{genes}$ and $i = 1, \dots, \#\text{arrays}$.
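Median polishing as described above can be sketched in plain Python. This is a minimal version under stated assumptions: rows are genes, columns are arrays, there are no missing values, and a fixed number of sweeps (`n_iter`, a hypothetical parameter) replaces a convergence check:

```python
from statistics import median


def median_polish(matrix, n_iter=10):
    """Alternately subtract row medians and column medians; return the residuals.

    matrix is a list of rows (genes) with one entry per array (column).
    """
    resid = [list(row) for row in matrix]  # work on a copy
    n_cols = len(resid[0])
    for _ in range(n_iter):
        for row in resid:  # sweep out the row (gene) medians
            m = median(row)
            for j in range(n_cols):
                row[j] -= m
        for j in range(n_cols):  # sweep out the column (array) medians
            m = median(row[j] for row in resid)
            for row in resid:
                row[j] -= m
    return resid


if __name__ == "__main__":
    # A purely additive table (gene effect + array effect) leaves zero residuals.
    x_star = [[1, 2, 3], [2, 3, 4], [4, 5, 6]]
    print(median_polish(x_star))
```

Medians are used rather than means so that a few extreme spots do not dominate the row and column adjustments.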
DNA microarray applications
• Human disease diagnostics and treatment
– determination of predisposition and risk factors wrt. certain diseases
– prediction of risk factors involved using certain treatment schemes
– monitor disease stage and treatment progress
• Agricultural diagnostics and development
– identify plant pathogens to allow suitable plant protection to be improved
– efficacy and economy in plant biotechnology
• Analysis of food and genetically modified organisms (GMO)
– determine the integrity of food
– detect alterations and contaminations
– quantify GMOs
• Drug discovery and drug development
Statistical methods for analysing gene expression data
METHOD | TASK
experimental design | design the experiment
analysis of variance | normalize within-array and between-array
clustering | find similar groups of samples and/or genes
discrimination and classification | discriminate between two or more groups of samples; classify a new sample to one of many groups (or compute group probability)
multiple testing, feature extraction | find genes that are differentially expressed (all samples or subsets)
Experimental design and ANOVA [Kerr & Churchill, The Jackson Lab, August 2000]
Sources of variation for the fluorescent intensities
• Variety:
– timepoints of a biological process
– different types of tissue
– different treatments
– different types of disease
• Genes (hybridization efficiency)
• Dyes (two dyes – one dye consistently brighter than the other?)
• Arrays (probing conditions)
• Dye*Variety (differences when dyeing the samples)
• Array*Gene (amount of cDNA on probes for same gene on different arrays vary = "spot" effects)
• Dye*Gene (are there differences in the dyes that are gene specific?)
• Variety*Gene (EFFECT of INTEREST)
Experimental design and ANOVA (cont’d.)
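ANOVA model:
$y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (VG)_{kg} + (AG)_{ig} + \varepsilon_{ijkg}$
where $y_{ijkg}$ is a transformation of $R_{gi}$ or $G_{gi}$ so that effects are additive, and $\varepsilon_{ijkg}$ has distribution $F(0, \sigma^2)$ or $F_g(0, \sigma_g^2)$.
Replication of genes on every array: $(AG)_{ig}$s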
Experimental design and ANOVA (cont’d.)
Reference design:
• same reference variety on each array (variety not of interest)
• the most popular design.
• VG is completely confounded with DG.
• No degrees of freedom left for error estimation.
• Use when not enough tissue to dye twice.
Loop design:
• collects twice as much data on the varieties of interest.
• balanced wrt. dye, but each sample must be dyed twice.
• (#genes-1) degrees of freedom left for error estimation.
• more difficult to understand for biologists.
Clustering
Aim: partition genes or samples into groups so that the groups are homogeneous and well-separated.
Data: {Xgi} for g=1,...,#genes and i=1,...,#arrays.
Results:
– Find groups of genes that are co-regulated
– Find subgroups (previously unknown) of diseases
Issues:
– feature extraction (choosing a subset of the genes)
– one-way or two-way clustering
– overlapping vs. non-overlapping clusters
– membership in more than one cluster
– assessing the reliability of clustering results
Clustering: methods for analysing gene expression data
One-way clustering:
– Hierarchical clustering
– Self-organizing maps (SOM) [Kohonen]
– K-means
– SVD-based (principal component) clustering
Two-way clustering:
– Block clustering
– Gene Shaving [Hastie et al. (2000)]
– Plaid Models [Lazzeroni & Owen (2000)]
Molecular Portraits of Breast Cancer, Perou et al., Nature, 406, 6797, 2000.
Discrimination and Classification
Aim:
– Discriminate between two or more classes (e.g. normal vs. different disease classes).
– Predict the class (or probability of belonging to each class) of a new sample.
Data:
– {Xgi} for g=1,...,#genes and i=1,...,#arrays.
– Class membership (e.g. normal, different disease classes).
Results:
– See which genes are important in discriminating between classes.
– Predictive tool.
Issues:
– Large p (#genes) small n (#arrays, #samples).
– Feature extraction or other forms of shrinkage.
Discrimination and classification: methods for analysing gene expression data
Used:
– K-nearest neighbour [Fix and Hodges (1951)]
– Support Vector Machines
– CART [Breiman et al. (1984)]
– Different versions of classifying to the class with the largest probability p(c)·p(x|c), where p(x|c) is Gaussian with some structure on the covariance matrix (often diagonal).
– Voted classification (bagging, boosting)
– Bayesian regression [West et al. (2000)]
Alternative methods:
– Methods for "large p small n" regression (PLS, PCR, ridge regression, continuum regression, etc.)
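The rule "classify to the class with the largest p(c)·p(x|c)" with a diagonal Gaussian covariance can be sketched as follows. This is one possible version and not a specific published method: class priors are taken as class frequencies, variances are estimated per class and per gene, and `var_floor` is a hypothetical safeguard against zero variances when n is small:

```python
import math
from collections import defaultdict


def fit_diagonal_gaussian(samples, labels, var_floor=1e-6):
    """Estimate the prior, per-gene means and per-gene variances for each class."""
    by_class = defaultdict(list)
    for x, c in zip(samples, labels):
        by_class[c].append(x)
    model = {}
    for c, rows in by_class.items():
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [
            max(sum((v - m) ** 2 for v in col) / len(rows), var_floor)
            for col, m in zip(zip(*rows), means)
        ]
        model[c] = (len(rows) / len(samples), means, variances)
    return model


def log_scores(model, x):
    """Return log p(c) + log p(x|c) for every class c (diagonal covariance)."""
    scores = {}
    for c, (prior, means, variances) in model.items():
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
            for v, m, s2 in zip(x, means, variances)
        )
        scores[c] = math.log(prior) + log_lik
    return scores


def classify(model, x):
    """Pick the class with the largest p(c)·p(x|c), computed on the log scale."""
    scores = log_scores(model, x)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Four samples (arrays) measured on two genes.
    train = [[2.0, 0.1], [2.2, -0.1], [-1.9, 0.0], [-2.1, 0.2]]
    classes = ["tumour", "tumour", "normal", "normal"]
    model = fit_diagonal_gaussian(train, classes)
    print(classify(model, [1.8, 0.0]))
```

The diagonal structure means only 2p parameters per class are estimated, which is what makes the rule usable when p is much larger than n.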
Feature extraction and multiple testing
Aim:
– Identify differentially expressed genes.
Data:
– {Xgi} for g=1,...,#genes and i=1,...,#arrays.
– Possible class membership.
Results:
– Identify genes that can be important in discriminating between different classes.
Methods:
– Within the ANOVA framework, testing the (VG) interaction.
– Compute a t-statistic for each gene (difference e.g. between control and treatment group) and adjust the p-values (Bonferroni, permutation methods).
Other:
– Many ad hoc rules; e.g. differentially expressed genes are genes where more than 3 values are outside some interval (no class).
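The t-statistic approach with permutation p-values and a Bonferroni adjustment can be sketched in plain Python. The slide does not say which variant of the t-statistic is meant, so the unequal-variance (Welch) form used here is an assumption, and exhaustive enumeration of relabellings is only practical for small groups; all function names are hypothetical:

```python
import math
from itertools import combinations
from statistics import mean, variance


def t_statistic(a, b):
    """Two-sample t-statistic (Welch form, an assumption of this sketch)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:  # degenerate relabelling with no within-group spread
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(a, b):
    """Exact two-sided p-value: share of relabellings with |t| at least as large."""
    pooled = list(a) + list(b)
    observed = abs(t_statistic(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(t_statistic(group_a, group_b)) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests (genes), capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    treatment = [2.1, 1.9, 2.4, 2.0]  # log-ratios for one gene, treated samples
    control = [0.1, -0.2, 0.3, 0.0]   # log-ratios for the same gene, controls
    print(permutation_p_value(treatment, control))
```

With thousands of genes the Bonferroni factor is large, so a gene needs a very small unadjusted p-value to remain significant, which a permutation test on a handful of arrays cannot always deliver.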
Bioinformatics in Norway
No statisticians involved in Norway!
Bioinformatics groups at Norwegian universities:
• UiO: Bioinformatics group, Department of Informatics, O. C. Lingjærde and K. Liestøl
– functional genomics using microarrays, clustering.
• UiB: Bioinformatics Research Group, Department of Informatics, led by I. Jonassen.
– analysis of biological sequences and structure
– J-Express clustering method for gene expression data
• NTNU: Knowledge Systems Group, Department of Computer and Information Science, led by J. Komorowski.
– classification from gene expression data and a priori information
Bioinformatics in Norway: some academic actors
• UiO:
– Department of Biochemistry
• Det norske Radiumhospital (DNR):
– The Microarray Project at DNR, led by Ola Myklebost
– Department of Tumor Biology
– Department of Immunology
– Department of Genetics
• The Norwegian Veterinary College
• UiB:
– Department of Biochemistry and Molecular Biology
– Department of Oncology
• NTNU:
– Department of Physiology and Biomedical Engineering (Astrid Lægreid)
• Agricultural University of Norway
Bioinformatics in Norway: consortium on microarray technology
• Who: NTNU, UiB and DNR (UiO)
• Aim:
– Establish front line competence in microarray bioinformatics at all participating institutions.
– Create national data warehouse for microarray based functional genomic analysis.
• Support: The Norwegian Cancer Society and NFR
More information at http://www.med.uio.no/dnr/microarray/english.html
Bioinformatics in Norway: other actors
• The Norwegian Biotechnology Advisory Board (Bioteknologinemnda), led by Sissel Rogne, publication "Genialt". http://www.bion.no
• EMBnet Norway, Norwegian node of network for commercial and academic bioinformatics centers. http://www.no.embnet.org
• Biotechnology Center (Bioteknologisenteret)
• SINTEF UniMed, MR Center, Bioinformatics group
• MATFORSK (fingerprinting bacteria, GMO)
• Genomar (salmon and tilapia) http://www.genomar.com
• Glaxo SmithKline (free offices to gene-researchers?)
• Nycomed Pharma
Bioinformatics in Norway: NFR project
Salmon Genome Project (SGP)
• Aim: Expand our knowledge of the biology of salmon and introduce modern genetic techniques in breeding and management of Atlantic salmon. The project is said to combine research in molecular genetics with bioinformatics, with focus on genome organization and gene function.
• Who: The Norwegian Veterinary College, the University of Oslo, the University of Bergen, SINTEF Unimed and the Institute of Marine Research.
• Cost: 350 MNOK
Bioinformatics in Norway: national research initiative
FUGE (Funksjonell genomforskning, functional genome research)
• What: National plan by NFR and the Norwegian universities.
• Aim: Bring Norway up-to-date on functional genome research.
• Areas: biological, medical, marine research.
• Cost: 300 MNOK each year for 5-10 years (dependent on approval by Stortinget).
More information at http://www.forskningsradet.no/fag/andre/fuge/
How can statisticians contribute?
• Close cooperation between researchers from genetics - biochemistry - medicine - biology and statisticians is very important!
• Communicate the need for statistical thinking in analysis of gene expression data
– Concept of noise, replication, reproducible analyses.
• Statistical challenges identified today:
– Model the entire experimental phase to arrive at optimal experimental designs dependent on practical limitations.
– Suggestions for within-array and between-array normalization.
– Handle missing values.
– Large p small n.
Bioinformatics – an interesting area of research for statisticians (in Norway)?
✓ Important biological/medical problems
✓ New area with exciting technologies
✓ Large amounts of data
✓ Statistical experience is scarce!
✓ Many statistical challenges!
YES!