Human Genome Project Recent Advances Ppt

06/03/2001 Mette Langaas 1

Norsk Regnesentral / Norwegian Computing Center

Bioinformatics – an interesting area of research for statisticians (in Norway)?

Mette Langaas, Norsk Regnesentral

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/

Outline of talk

• What is bioinformatics?
• What do we need to know in biochemistry?
• The Human Genome Project.
• Research questions in bioinformatics.
• Gene expression and data from DNA microarrays.
• Statistical methods for analysing gene expression data.
• Bioinformatics in Norway.
• How can statisticians contribute?

What is Bioinformatics ?

Russ Altman (in Bioinformatics), broad definition: Bioinformatics is the study of how information technologies are used to solve problems in biology.

Russ Altman (in Bioinformatics), narrow definition: Bioinformatics is the creation and management of biological databases in support of genomic sequences.

The BITS journal: Bioinformatics is a combination of Computer Science, Information Technology and Genetics to determine and analyse genetic information.

Bioinformatics programme at the University of Michigan (slightly modified): Bioinformatics merges recent advances in molecular biology and genetics with advanced statistics and computer science technology. The goal is increased understanding of the complex web of interactions linking the individual components of a living cell to the integrated behavior of the entire organism.

Statisticians: Bioinformatics is a collection of statistical methods for dealing with large biological data sets.


What do we need to know in Molecular Genetics and Biochemistry?

• Cell

• Chromosomes

• DNA

• Gene

• Genome

The Cell

Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html

Human Chromosomes

Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Example of DNA

(tertiary structure)

Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

Figure copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

DNA (deoxyribonucleic acid) is a double-stranded helix, consisting of a

• Sugar-phosphate backbone and

• Nitrogenous bases:

• Adenine

• Cytosine

• Guanine

• Thymine

Base pair (bp): two bases paired by hydrogen bonds between the bases.

Adenine pairs with Thymine

Guanine pairs with Cytosine
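
The pairing rule above can be illustrated with a minimal Python sketch. The function names `complement` and `reverse_complement` are chosen here for illustration and are not part of the talk; strands are assumed to be written with the letters A, C, G and T only.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the partner strand read in the opposite direction."""
    return complement(strand)[::-1]
```

For example, `complement("ATGC")` gives `"TACG"`, the bases that would sit opposite on the second strand of the helix.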

Genome-Chromosome-Gene

Genome: All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs. [Human genome: 3 × 10^9 bp; more than 99% of the human DNA sequences are the same across the population]

Chromosome: The self-replicating genetic structure of cells containing the cellular DNA that bears in its nucleotide sequence the linear array of genes. Eukaryotic genomes consist of a number of chromosomes whose DNA is associated with different kinds of proteins. [Human chromosome lengths range from 50 million to 263 million bp]

Gene: The fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule). [Human genes: 30 000?, average length 3000 bp]

Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html

What MORE do we need to know in Molecular Genetics and Biochemistry?

• What does the gene do? The gene encodes a specific functional product (i.e., a protein or RNA molecule).

• Protein synthesis

• mRNA

• Amino acid

Proteins are built from amino acids

Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs, and each protein has unique functions. Examples are hormones, enzymes, and antibodies.

Amino acid: Any of a class of 20 molecules that are combined to form proteins in living things. The sequence of amino acids in a protein and hence protein function are determined by the genetic code. [The 20 amino acids are: alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine.]

Glossary found at http://www.ornl.gov/hgmis/publicat/glossary.html

Figure taken from The Human Genome Project at http://www.ornl.gov/hgmis/project/info.html

Protein synthesis: transcription and translation
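
As a rough illustration of the two steps, here is a small Python sketch. It assumes the input is the coding strand of the DNA, and it uses only a handful of entries from the standard genetic code; the names `CODONS`, `transcribe` and `translate` are illustrative and not taken from the talk.

```python
# A deliberately small excerpt of the standard genetic code (mRNA codons).
CODONS = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "UGG": "Trp", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(coding_strand: str) -> str:
    """Transcription: the mRNA copies the coding strand with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> list:
    """Translation: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return []
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODONS.get(mrna[i:i + 3], "???")  # codon not in the excerpt
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein
```

With this excerpt, `translate(transcribe("ATGTTTGGCTAA"))` yields the chain Met, Phe, Gly; a full table would have 64 codons.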

The Human Genome Project

• Begun formally in 1990, planned to be completed in 2003.
• The U.S. Human Genome Project is coordinated by the U.S. Department of Energy and the National Institutes of Health.
• Project goals are
– to identify all the approximately 50,000(?) genes in human DNA,
– determine the sequences of the 3 billion chemical bases that make up human DNA,
– store this information in databases,
– develop faster, more efficient sequencing technologies,
– develop tools for data analysis, and
– address the ethical, legal, and social issues that may arise from the project.

Results by now:
• Draft of entire genome (June 2000)
• 9711 mapped genes (February 4, 2001)
• New estimate: 30 000 genes (February, 2001)

Research questions in Bioinformatics

Data management:
– databases, searchable, compare.

Biological sequence alignment:
– compare two DNA sequences (HMM).

Pharmacogenetics:
– how genetic differences influence the variability in patients’ responses to drugs.

Proteomics:
– which proteins are present in a cell and which proteins interact with each other.

Structural genomics:
– determine the (3D) structure of the proteins encoded by a genome.

Comparative genomics:
– the functions of human genes and other DNA regions are often revealed by studying their parallels in nonhumans (mice and men...)

Comparing human and mouse chromosomes

Research questions in Bioinformatics (cont’d.)

Transcriptomics:
– use mRNA transcripts to determine which genes are turned on/off in a particular cell or tissue type, and how disease changes this expression.

Functional genomics:
– experimental approaches and resources to assess gene function
– development of software tools to handle and interpret data, e.g. from DNA microarrays.

Functional genomics: gene expression and data from DNA microarrays

• Gene expression.
• cDNA microarray experiment.
• Data from one cDNA microarray experiment.
• Data from many cDNA microarray experiments (reference design).
• Applications.

Gene expression

The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).

http://www.ornl.gov/hgmis/publicat/glossary.html

[Figure: workflow of a cDNA microarray experiment. cDNA clones (probes) → PCR product amplification and purification → printing onto the microarray (0.1 nl/spot) → hybridise mRNA target to microarray → excitation by laser 1 and laser 2, emission → scanning → analysis: overlay images and normalise.]

Copied from talk by Terry Speed at http://www.ipam.ucla.edu/programs/fg2000/fgt_tspeed7.ppt

cDNA microarray experiment

The cDNA microarray experiment

1. Constructing the microarray (probe):
• From a collection of purified DNAs. A drop of each type of DNA in solution is placed on a specially prepared glass microscope slide by an arrayer machine.

2. Choosing and preparing the targets:
• Select targets: the aim is to compare gene expression in different cell populations: tissue specific, disease specific, environmental, cell cycle etc.
• mRNA extraction: capture mRNA, amplification.
• Reverse transcription to cDNA (more stable).
• Fluorescent labelling of cDNA targets: to identify their presence. Red and green dyes (Cy5 and Cy3) are the most common.

The cDNA microarray experiment (cont’d.)

3. Hybridization and scanning:
• The cDNA target will hybridize to spots on the array.
• When excited by a laser (different wavelengths), the fluorescent target will emit light. The intensity reflects the abundance of mRNA in the original target tissue. Using a scanner, two images (red and green) are acquired.

4. Image analysis of the microarray:
• Identify the spots (gridding, segmentation) and assign an intensity measurement.
• Relate the intensity in each spot to the background intensity (local or overall) and filter out weak spots (low signal-to-noise ratio; label as missing).

Data from one cDNA experiment

From image to intensities: using image analysis techniques the spot and background pixels are determined. An intensity measurement is assigned as the difference between spot and (local) background. Missing values are defined as spots where the signal is not much larger than the background noise.
$G_g$ = green intensity for gene g
$R_g$ = red intensity for gene g

Relative log-intensities: there is variation in the amount of DNA from spot to spot, so intensities are only meaningful in a relative sense. For gene g the relative log-intensity (usually base 2) is $X_g^* = \log_2(R_g/G_g)$.

The data vector: $\{X_g^*\}$ for g = 1, ..., #genes.

Raw images: for each microarray experiment we have two images with measurements of fluorescent intensities. Spots of bad quality are flagged.
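
A minimal Python sketch of the step from spot intensities to relative log-intensities follows. The slide does not quantify "not much larger than the background", so the sketch assumes a hypothetical threshold `min_signal` on the background-corrected intensity; the function name `log_ratios` is also illustrative.

```python
import math

def log_ratios(red, green, red_bg, green_bg, min_signal=1.0):
    """Return log2(R_g / G_g) per spot, or None where the spot is missing.

    red, green       : raw spot intensities in the two channels
    red_bg, green_bg : (local) background intensities for the same spots
    min_signal       : hypothetical cut-off below which a spot counts as missing
    """
    result = []
    for r, g, rb, gb in zip(red, green, red_bg, green_bg):
        R = r - rb  # background-corrected red intensity
        G = g - gb  # background-corrected green intensity
        if R < min_signal or G < min_signal:
            result.append(None)  # weak spot: label as missing
        else:
            result.append(math.log2(R / G))
    return result
```

A value of 2 means the red channel is four times as bright as the green one after background correction, and a value of -2 means the reverse.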

Data from many cDNA experiments

Reference design: use the same reference sample (green) for each experiment (often cultivated cells). The different tissue samples are dyed red.

From image to intensities for each experiment:
$G_{gi}$ = green intensity for gene g at array i
$R_{gi}$ = red intensity for gene g at array i

Relative log-intensities from each experiment: $X_{gi}^* = \log_2(R_{gi}/G_{gi})$

Median polishing: iteratively subtract the column and row medians.

The data matrix: $\{X_{gi}\}$ for g = 1, ..., #genes and i = 1, ..., #arrays.
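
Median polishing can be sketched in a few lines of Python. This is an illustrative version only: it assumes the matrix is a list of rows (genes) with one column per array, that missing values are stored as `None`, and that a fixed number of sweeps (`n_iter`, a hypothetical parameter) is enough.

```python
from statistics import median

def median_polish(matrix, n_iter=10):
    """Iteratively subtract row medians and column medians from a copy of matrix."""
    x = [row[:] for row in matrix]
    for _ in range(n_iter):
        # Sweep rows (genes): remove each row's median.
        for row in x:
            values = [v for v in row if v is not None]
            if values:
                m = median(values)
                for j, v in enumerate(row):
                    if v is not None:
                        row[j] = v - m
        # Sweep columns (arrays): remove each column's median.
        for j in range(len(x[0])):
            values = [row[j] for row in x if row[j] is not None]
            if values:
                m = median(values)
                for row in x:
                    if row[j] is not None:
                        row[j] -= m
    return x
```

For a purely additive table such as `[[1, 2, 3], [2, 3, 4], [3, 4, 5]]` the residuals are all zero after the first sweep, which is a quick sanity check on the procedure.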

[Figure: reference design. Sample 1, sample 2, sample 3, ..., sample n are each hybridised against the same reference.]

DNA microarray applications

• Human disease diagnostics and treatment
– determination of predisposition and risk factors wrt. certain diseases
– prediction of risk factors involved using certain treatment schemes
– monitor disease stage and treatment progress

• Agricultural diagnostics and development
– identify plant pathogens to allow suitable plant protection to be improved
– efficacy and economy in plant biotechnology

• Analysis of food and genetically modified organisms (GMO)
– determine the integrity of food
– detect alterations and contaminations
– quantify GMOs

• Drug discovery and drug development

Statistical methods for analysing gene expression data

METHOD: TASK
• Experimental design: design the experiment.
• Analysis of variance: normalize within-array and between-array.
• Clustering: find similar groups of samples and/or genes.
• Discrimination and classification: discriminate between two or more groups of samples; classify a new sample to one of many groups (or compute group probability).
• Multiple testing, feature extraction: find genes that are differentially expressed (all samples or subsets).

Experimental design and ANOVA [Kerr & Churchill, The Jackson Lab, August 2000]

Sources of variation for the fluorescent intensities
• Variety:
– timepoints of a biological process
– different types of tissue
– different treatments
– different types of disease
• Genes (hybridization efficiency)
• Dyes (two dyes – one dye consistently brighter than the other?)
• Arrays (probing conditions)
• Dye*Variety (differences when dyeing the samples)
• Array*Gene (amount of cDNA on probes for the same gene on different arrays varies = ”spot” effects)
• Dye*Gene (are there differences in the dyes that are gene specific?)
• Variety*Gene (EFFECT of INTEREST)

Experimental design and ANOVA (cont’d.)

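ANOVA model:
$y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (VG)_{kg} + (AG)_{ig} + \varepsilon_{ijkg}$

where $y_{ijkg}$ is a transformation of $R_{gi}$ or $G_{gi}$ chosen so that effects are additive, and $\varepsilon_{ijkg}$ has distribution $F(0, \sigma^2)$ or $F_g(0, \sigma_g^2)$.

Replication of genes on every array: $(AG)_{ig}$s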

Experimental design and ANOVA (cont’d.)

Reference design:
• same reference variety on each array (variety not of interest)
• the most popular design.
• VG is completely confounded with DG.
• No degrees of freedom left for error estimation.
• Use when not enough tissue to dye twice.

Loop design:
• collects twice as much data on the varieties of interest.
• balanced wrt. dye, but each sample must be dyed twice.
• (#genes-1) degrees of freedom left for error estimation.
• more difficult to understand for biologists.
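
The difference between the two layouts can be made concrete with a short Python sketch that lists which pair of samples goes on each array. The function names and the (red, green) tuple convention are assumptions made for illustration; in the loop layout, sample i is paired with sample i+1 and the last sample wraps around to the first.

```python
def reference_design(samples, reference="reference"):
    """One array per sample: the sample is dyed red, the common reference green."""
    return [(sample, reference) for sample in samples]

def loop_design(samples):
    """Array i carries sample i (red) and sample i+1 (green), closing the loop."""
    n = len(samples)
    return [(samples[i], samples[(i + 1) % n]) for i in range(n)]
```

With three samples, `loop_design(["s1", "s2", "s3"])` returns `[("s1", "s2"), ("s2", "s3"), ("s3", "s1")]`: every sample appears once in each dye, so each is measured twice on the same number of arrays that the reference layout spends measuring each sample once.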

Clustering

Aim: partition genes or samples into groups so that the groups are homogeneous and well-separated.

Data: $\{X_{gi}\}$ for g = 1, ..., #genes and i = 1, ..., #arrays.

Results:
– Find groups of genes that are co-regulated
– Find subgroups (previously unknown) of diseases

Issues:
– feature extraction (choosing a subset of the genes)
– one-way or two-way clustering
– overlapping vs. non-overlapping clusters
– membership in more than one cluster
– assessing the reliability of clustering results

Clustering: methods for analysing gene expression data

One-way clustering:
– Hierarchical clustering
– Self-organizing maps (SOM) [Kohonen]
– K-means
– SVD-based (principal component) clustering

Two-way clustering:
– Block clustering
– Gene Shaving [Hastie et al. (2000)]
– Plaid Models [Lazzeroni & Owen (2000)]
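
As an example of a one-way method from the list, here is a bare-bones K-means sketch in Python. It is an illustration only: it uses squared Euclidean distance, takes the first k profiles as starting centres (an assumption made to keep the run deterministic), and runs a fixed number of iterations `n_iter`.

```python
def kmeans(points, k, n_iter=20):
    """Plain K-means; points is a list of equal-length numeric profiles."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each profile to its nearest centre.
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres
```

Rows of the data matrix would be passed in to cluster genes, and columns to cluster samples. The result depends on the starting centres, which is one reason the reliability of clustering results appears among the issues above.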

Molecular Portraits of Breast Cancer, Perou et al., Nature, 406, 6797, 2000.

Discrimination and Classification

Aim:
– Discriminate between two or more classes (e.g. normal vs. different disease classes).
– Predict the class (or probability of belonging to each class) of a new sample.

Data:
– $\{X_{gi}\}$ for g = 1, ..., #genes and i = 1, ..., #arrays.
– Class membership (e.g. normal, different disease classes).

Results:
– See which genes are important in discriminating between classes.
– Predictive tool.

Issues:
– Large p (#genes) small n (#arrays, #samples).
– Feature extraction or other forms of shrinkage.

Discrimination and classification: methods for analysing gene expression data

Used:
– K-nearest neighbour [Fix and Hodges (1951)]
– Support Vector Machines
– CART [Breiman et al. (1984)]
– Different versions of classifying to the class with the largest probability p(c)·p(x|c), where p(x|c) is Gaussian with some structure on the covariance matrix (often diagonal).
– Voted classification (bagging, boosting)
– Bayesian regression [West et al. (2000)]

Alternative methods:
– Methods for ”large p small n” regression (PLS, PCR, ridge regression, continuum regression, etc.)
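
The rule "choose the class with the largest p(c)·p(x|c)" with a diagonal Gaussian can be sketched as follows. This is one possible reading, not the method of any cited paper: each class gets its own per-gene variance, the prior p(c) is the class frequency, and `var_floor` is a hypothetical safeguard against zero variances.

```python
import math
from collections import defaultdict

def fit_diag_gaussian(X, y, var_floor=1e-6):
    """Estimate p(c), per-gene means and per-gene variances for each class c."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [max(sum((v - m) ** 2 for v in col) / n, var_floor)
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, x):
    """Return the class maximising log p(c) + log p(x|c)."""
    def log_score(params):
        prior, means, variances = params
        loglik = sum(-0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
                     for v, m, s2 in zip(x, means, variances))
        return math.log(prior) + loglik
    return max(model, key=lambda label: log_score(model[label]))
```

Here each row of `X` is one sample (array) and each column one gene. With many more genes than samples the variance estimates are noisy, which is where the shrinkage and feature-extraction issues on the previous slide come in.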

Feature extraction and multiple testing

Aim:
– Identify differentially expressed genes.

Data:
– $\{X_{gi}\}$ for g = 1, ..., #genes and i = 1, ..., #arrays.
– Possible class membership.

Results:
– Identify genes that can be important in discriminating between different classes.

Methods:
– Within the ANOVA framework, testing the (VG) interaction.
– Compute a t-statistic for each gene (difference e.g. between control and treatment group) and adjust the p-values (Bonferroni, permutation methods)

Other:
– Many ad hoc rules; differentially expressed genes are genes where more than 3 values are outside some interval (no class).
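
A per-gene test with multiplicity adjustment might look like the Python sketch below. To stay within the standard library it replaces the t-statistic with the absolute difference in group means and obtains the p-value by exact enumeration of all relabellings, which is only feasible for small groups; both choices, and the function names, are simplifications made for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_pvalue(control, treatment):
    """Exact two-sided permutation p-value for the difference in group means."""
    pooled = list(control) + list(treatment)
    n_control = len(control)
    observed = abs(mean(treatment) - mean(control))
    hits = total = 0
    for idx in combinations(range(len(pooled)), n_control):
        chosen = set(idx)
        group_c = [pooled[i] for i in chosen]
        group_t = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs(mean(group_t) - mean(group_c)) >= observed - 1e-12:
            hits += 1
    return hits / total

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]
```

One would call `permutation_pvalue` once per gene and pass the resulting list to `bonferroni`. With thousands of genes the Bonferroni factor is large, so a gene needs a very small raw p-value to survive the adjustment.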

Bioinformatics in Norway

No statisticians involved in Norway!

Bioinformatics groups at Norwegian universities:
• UiO: Bioinformatics group, Department of Informatics, O. C. Lingjærde and K. Liestøl
– functional genomics using microarrays, clustering.

• UiB: Bioinformatics Research Group, Department of Informatics, led by I. Jonassen.
– analysis of biological sequences and structure
– J-Express clustering method for gene expression data

• NTNU: Knowledge Systems Group, Department of Computer and Information Science, led by J. Komorowski.
– classification from gene expression data and a priori information

Bioinformatics in Norway: some academic actors

• UiO:
– Department of Biochemistry

• Det norske Radiumhospital (DNR, the Norwegian Radium Hospital):
– The Microarray Project at DNR, led by Ola Myklebost
– Department of Tumor Biology
– Department of Immunology
– Department of Genetics

• The Norwegian Veterinary College

• UiB:
– Department of Biochemistry and Molecular Biology
– Department of Oncology

• NTNU:
– Department of Physiology and Biomedical Engineering (Astrid Lægreid)

• Agricultural University of Norway

Bioinformatics in Norway: consortium on microarray technology

More information at http://www.med.uio.no/dnr/microarray/english.html

• Who: NTNU, UiB and DNR (UiO)

• Aim:
– Establish front line competence in microarray bioinformatics at all participating institutions.
– Create a national data warehouse for microarray based functional genomic analysis.

• Support: The Norwegian Cancer Society and NFR

Bioinformatics in Norway: other actors

• The Norwegian Biotechnology Advisory Board (Bioteknologinemnda), led by Sissel Rogne, publication ”Genialt”. http://www.bion.no

• EMBnet Norway, Norwegian node of a network for commercial and academic bioinformatics centers. http://www.no.embnet.org

• Biotechnology Center (Bioteknologisenteret)

• SINTEF UniMed, MR Center, Bioinformatics group

• MATFORSK (fingerprinting bacteria, GMO)

• Genomar (salmon and tilapia) http://www.genomar.com

• Glaxo SmithKline (free offices to gene-researchers?)

• Nycomed Pharma

Bioinformatics in Norway: NFR project

Salmon Genome Project (SGP)

• Aim: Expand our knowledge of the biology of salmon and introduce modern genetic techniques in breeding and management of Atlantic salmon. The project is said to combine research in molecular genetics with bioinformatics, with focus on genome organization and gene function.

• Who: The Norwegian Veterinary College, the University of Oslo, the University of Bergen, SINTEF Unimed and the Institute of Marine Research.

• Cost: 350 MNOK

Bioinformatics in Norway: national research initiative

FUGE (Funksjonell genomforskning, functional genomics research)

• What: National plan by NFR and the Norwegian universities.

• Aim: Bring Norway up to date on functional genome research.

• Areas: biological, medical, marine research.

• Cost: 300 MNOK each year for 5-10 years (subject to approval by Stortinget).

More information at http://www.forskningsradet.no/fag/andre/fuge/

How can statisticians contribute ?

• Close cooperation between researchers from genetics - biochemistry - medicine - biology and statisticians is very important!

• Communicate the need for statistical thinking in analysis of gene expression data
– Concept of noise, replication, reproducible analyses.

• Statistical challenges identified today:
– Model the entire experimental phase to arrive at optimal experimental designs dependent on practical limitations.
– Suggestions for within-array and between-array normalization.
– Handle missing values.
– Large p small n.

Bioinformatics – an interesting area of research for statisticians (in Norway)?

✓ Important biological/medical problems
✓ New area with exciting technologies
✓ Large amounts of data
✓ Statistical experience is scarce!
✓ Many statistical challenges!
