Understanding biological systems by using DNA microarrays and bioinformatics

Joaquín Dopazo
Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO), Spain.
http://bioinfo.cnio.es

The use of high throughput methodologies allows us to query our systems in a new way but, at the same time, generates new challenges for data analysis and requires from us a change in our data management habits.

National Institute of Bioinformatics, Functional Genomics node


From genotype to phenotype.

(only the genetic component)

>protein kinase

acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....

Genes in the DNA...

…code for the structure of proteins...

…which accounts for the function...

…providing they are expressed in the proper moment and place...

…in cooperation with other proteins…

…conforming complex interaction networks...

…whose final effect can be different because of the variability.

• Now: 23531 (NCBI 34 assembly 02/04). Estimations: 20,000 to 100,000.

• 50% of mRNAs do not code for proteins (mouse).

• 50% display alternative splicing.

• 25%-60% unknown.

• A typical tissue is expressing between 5000 and 10000 genes.

• Proteins undergo post-translational modifications.

• Each protein has an average of 8 interactions.

• More than 3.5 million SNPs have been mapped.


Pre-genomics scenario in the lab

>protein kinase

acctgttgatggcgacagggactgtatgctgatctatgctgatgcatgcatgctgactactgatgtgggggctattgacttgatgtctatc....


Sequence

Molecular databases

Search results

Phylogenetic tree

alignment

Conserved region

Motif

Motif databases

Information

Secondary and tertiary protein structure

The aim:

Extracting as much information as possible from one single piece of data

Bioinformatics tools for pre-genomic sequence data analysis


Genome sequencing

2-hybrid systems

Mass spectrometry for protein complexes

Post-genomic vision

Expression arrays

Literature, databases

Who?

What do we know?

In what way?

Where, when and how much?

SNPs

And who else?

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html


Post-genomic vision

genes

interactions

Gene expression

Information

polymorphisms

Information

Databases

The new tools:

Clustering

Feature selection

Data integration

Information mining


Gene expression profiling.
The rationale, what we would like and related problems

Differences at the phenotype level are the visible consequence of differences at the molecular level which, in many cases, can be detected by measuring the levels of gene expression. The same holds for different experiments, treatments, etc.

• Classification of phenotypes / experiments (Can I distinguish among classes, values of variables, etc. using molecular gene expression data?)

• Selection of differentially expressed genes among the phenotypes / experiments (did I select the relevant genes, all the relevant genes and nothing but the relevant genes?)

• Biological roles the genes are playing in the cell (what general biological roles are really represented in the set of relevant genes?)


A note of caution:

Genome-wide technologies allow us to produce vast amounts of data. But... data is not knowledge.

Misunderstanding of this has led to “new” (not necessarily good) ways of asking (scientific) questions.

Question → Experiment → test

Is gene A involved in process B?

Experiment → (sometimes) test → Question

Is there any gene (or set of genes) involved in any process?


Gene expression analysis using DNA microarrays

Cy5

There are two dominant technologies: spotted arrays and oligo arrays, although new players are arriving in the arena

Cy3

cDNA arrays Oligonucleotide arrays


Transforming images into data

Test sample labeled red (Cy5)
Reference sample labeled green (Cy3)

Red: gene overexpressed in test sample
Green: gene underexpressed in test sample
Yellow: equally expressed

red/green - ratio of expression
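The red/green ratio above is usually handled on a log2 scale, so that over- and underexpression are symmetric around zero. The following is a minimal sketch of that idea; the function names and the `tol` threshold are illustrative assumptions, not part of the original slides.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

def spot_colour(red, green, tol=0.5):
    """Classify a spot as 'red', 'green' or 'yellow'.

    tol is a hypothetical cut-off on the log2 ratio, chosen only for illustration.
    """
    m = log_ratio(red, green)
    if m > tol:
        return "red"      # overexpressed in the test sample
    if m < -tol:
        return "green"    # underexpressed in the test sample
    return "yellow"       # roughly equally expressed
```

For example, a spot with Cy5 = 200 and Cy3 = 100 has a log2 ratio of 1, that is, a two-fold overexpression in the test sample.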


Normalisation

There are many sources of error that can affect and seriously bias the interpretation of the results: differences in the efficiency of labeling, the hybridisation, local effects, etc.

Normalisation is a necessary step before proceeding with the analysis

Before (left) and after (right) normalization. A) BoxPlots, B) BoxPlots of subarrays and C) MA plots (ratio versus intensity)

(a) After normalization by average, (b) after print-tip lowess normalization, (c) after normalization taking into account spatial effects
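As an illustration of the simplest case mentioned above (normalization by average), the sketch below computes the M (log ratio) and A (average log intensity) values used in MA plots and centres M on its mean. It is only a sketch: the function names are assumptions, and print-tip lowess or spatial normalization would need an intensity-dependent or position-dependent fit instead of a single constant.

```python
import math
from statistics import fmean

def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = mean of log2(R) and log2(G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def normalise_by_average(m):
    """Subtract the mean log ratio so that the normalised M values average zero."""
    centre = fmean(m)
    return [value - centre for value in m]

# Toy array in which the red channel is uniformly twice as bright (a labeling bias).
green = [100.0, 200.0, 400.0, 800.0]
red = [200.0, 400.0, 800.0, 1600.0]
m, a = ma_values(red, green)
m_norm = normalise_by_average(m)
```

In this toy example every spot has M = 1 before normalisation and M = 0 afterwards, which removes the global dye bias.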


The data

Characteristics of the data:

• Many more variables (genes) than measurements (experiments / arrays)

• Low signal to noise ratio

• High redundancy and intra-gene correlations

• Most of the genes are not informative with respect to the trait we are studying (account for unrelated physiological conditions, etc.)

• Many genes have no annotation!!

Genes (thousands)

A B C

Expression profile of a gene across the experimental conditions

Expression profile of all the genes for an experimental condition (array)

Different classes of experimental conditions, e.g. cancer types, tissues, drug treatments, survival time, etc.

...

Experimental conditions (from tens up to no more than a few hundred)


Co-expressing genes... What do they have in common?

Different phenotypes... What genes are responsible for them?

Genes interacting in a network (A, B, C...)... What is the network like?

Molecular classification of samples

Multiple array experiments. Can we find groups of experiments with similar gene expression profiles?

Unsupervised

Supervised

Reverse engineering


Unsupervised clustering methods:
Useful for class discovery (we do not have any a priori knowledge on classes)

Non-hierarchical: K-means, PCA, SOM

Hierarchical: UPGMA, SOTA

Different levels of information

quick and robust


An unsupervised problem: clustering of genes.

• Gene clusters are unknown beforehand

• Distance function

• Cluster gene expression patterns based solely on their similarities.

• Results are subject to further interpretation (if possible)
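The points above can be sketched with a correlation-based distance and a naive average-linkage (UPGMA-style) agglomeration. This is a minimal illustration under stated assumptions: the function names and the toy profiles are hypothetical, profiles are assumed not to be constant (a constant profile would give a zero denominator), and real packages use far more efficient implementations.

```python
import math
from statistics import fmean

def pearson_distance(x, y):
    """Distance = 1 - Pearson correlation: 0 for identical trends, 2 for opposite ones."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)

def average_linkage(profiles, k):
    """Merge the two closest clusters (average pairwise distance) until k remain."""
    names = list(profiles)
    dist = {(a, b): pearson_distance(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def between(c1, c2):
        # Average distance over all gene pairs drawn from the two clusters.
        return fmean(dist[(a, b)] for a in c1 for b in c2)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: between(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
gene_clusters = average_linkage(profiles, 2)
```

With these toy profiles the two rising genes (g1, g2) end up in one cluster and the two falling genes (g3, g4) in the other; interpreting what each cluster means biologically is a separate step.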


Perou et al., PNAS 96 (1999)

Distinctive gene expression patterns in human mammary epithelial cells and breast cancers

Overview of the combined in vitro and breast tissue specimen cluster diagram. A scaled-down representation of the 1,247-gene cluster diagram. The black bars show the positions of the clusters discussed in the text: (A) proliferation-associated, (B) IFN-regulated, (C) B lymphocytes, and (D) stromal cells.

Clustering of experiments: The rationale

If enough genes have their expression levels altered in the different experiments, we might be able to find these classes by comparing gene expression profiles.


Clustering of experiments: The problems

Any gene (regardless of its relevance for the classification) has the same weight in the comparison. If the relevant genes are not in the overwhelming majority, this produces:

Noise

and/or

irrelevant trends


Supervised analysis.
If we already have information on the classes, our question to the data should use it.

Class prediction based on gene expression profiles:

A B C

Problems:

How can classes A, B, C... be distinguished based on the corresponding profiles of gene expression?

How can a continuous phenotypic trait (resistance to drugs, survival, etc.) be predicted?

And

Which genes among the thousands analysed are relevant for the classification?

Genes (thousands)

Predictor

Gene selection

Experimental conditions (from tens up to no more than a few hundred)


Gene selection.
We are interested in selecting those genes showing differential expression among the classes studied.

• Contingency table (Fisher's test)

For discrete data (presence/absence, etc.).

• T-test

We could compare gene expression data between two types of patients.

• ANOVA

Analysis of variance. We compare the value of an interval-scale variable between two or more groups.

The Pomelo tool
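For the two-class case, the t-test option above can be sketched as a per-gene statistic used to rank genes. The sketch below uses Welch's unequal-variance form of the t statistic, which is an assumption (the slides do not say which variant is used); the function names, gene names and values are hypothetical, and converting the statistic into a p-value would additionally need the t distribution.

```python
import math
from statistics import fmean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of expression values (each needs >= 2 values)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(a) - fmean(b)) / standard_error

def rank_genes(expr, labels):
    """Rank genes by |t| between classes 'A' and 'B'.

    expr maps a gene name to its values across arrays; labels gives the class
    of each array, in the same order.
    """
    scores = {}
    for gene, values in expr.items():
        group_a = [v for v, lab in zip(values, labels) if lab == "A"]
        group_b = [v for v, lab in zip(values, labels) if lab == "B"]
        scores[gene] = welch_t(group_a, group_b)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)

# Hypothetical data: three arrays per class.
labels = ["A", "A", "A", "B", "B", "B"]
expr = {
    "geneX": [5.1, 4.9, 5.0, 1.0, 1.2, 0.9],   # clearly different between classes
    "geneY": [2.0, 2.2, 1.9, 2.1, 2.0, 2.1],   # essentially unchanged
}
ranking = rank_genes(expr, labels)
```

Here geneX is ranked first because its class means differ by far more than its within-class spread.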


Gene selection and class discrimination

Genes differentially expressed among classes (t-test or ANOVA), with p-value < 0.05

10 cases, 10 controls


Sorry... the data was a collection of random numbers labelled for two classes

This is a multiple-testing statistical contrast.

Adjusted p-values must be used!
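The effect can be reproduced with a small simulation. Under the null hypothesis p-values are uniformly distributed, so about 5% of the genes pass an unadjusted 0.05 threshold by chance alone. The sketch below draws 7000 null p-values (an illustrative number) and compares the raw count with two standard adjustments, Bonferroni and the Benjamini-Hochberg FDR; the function names are assumptions and this is not the code of any tool named in the slides.

```python
import random

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

random.seed(0)
null_p = [random.random() for _ in range(7000)]   # random data: no real differences

raw_hits = sum(p < 0.05 for p in null_p)
bonferroni_hits = sum(p < 0.05 for p in bonferroni(null_p))
fdr_hits = sum(p < 0.05 for p in benjamini_hochberg(null_p))
print(raw_hits, bonferroni_hits, fdr_hits)
```

The raw count is in the hundreds even though nothing is truly different, while the adjusted counts are never larger than the raw one and are typically close to zero for pure noise.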


Gene selection between normal endometrium (NE) and endometrioid endometrial carcinomas (EEC)

Hierarchical clustering of 86 genes with different expression patterns between Normal Endometrium and Endometrioid Endometrial Carcinoma (p<0.05), selected among the ~7000 genes in the CNIO oncochip

Moreno et al., BREAST AND GYNAECOLOGICAL CANCER LABORATORY, Molecular Pathology Programme, CNIO


And genes are not only related to discrete classes...

Pomelo: a tool for finding differentially expressed genes

• Among classes

• Survival

• Related to a continuous parameter


Of predictors and molecular signatures

1. Training (with internal and/or external CV): samples of known classes A and B are used to build a model, or classifier.

2. Classification / prediction: the classifier assigns an unknown sample to class A or B (A/B?).
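The two steps, training with cross-validation and then predicting an unknown sample, can be sketched with a deliberately simple classifier. The nearest-centroid rule and leave-one-out scheme below are assumptions chosen for brevity; the slides do not say which classifier or CV scheme is meant, and all names and values are hypothetical.

```python
from statistics import fmean

def centroid(samples):
    """Mean expression profile of a list of samples (one value per gene)."""
    return [fmean(column) for column in zip(*samples)]

def nearest_centroid_predict(train, labels, sample):
    """Assign the sample to the class whose centroid is closest (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(labels)):
        c = centroid([s for s, lab in zip(train, labels) if lab == label])
        d = sum((x - y) ** 2 for x, y in zip(sample, c))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def loocv_accuracy(samples, labels):
    """Leave-one-out CV: each sample is predicted by a model trained on the others."""
    correct = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train, train_labels, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

# Hypothetical two-gene profiles for classes A and B.
samples = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.2], [5.1, 4.9]]
labels = ["A", "A", "A", "B", "B", "B"]
accuracy = loocv_accuracy(samples, labels)
prediction = nearest_centroid_predict(samples, labels, [4.9, 5.0])
```

If genes are selected before building the classifier, that selection should be repeated inside each CV fold; selecting genes on all samples first tends to give over-optimistic accuracy estimates.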


Predictor of clinical outcome in breast cancer

Genes are arranged according to their correlation with the prognostic groups

Prognostic classifier with optimal accuracy

van’t Veer et al., Nature, 2002


Information mining

How are they structured?

What is this gene?

Clustering

Links

What are these groups?

Information mining

Cell cycle...

DBs

Information

My data...


Information mining applications.

1) Use of biological information as a validation criterion

Information mining of DNA array data.
Allows quick assignment of function, biological role and subcellular location to groups of genes.

Used to understand why genes differ in their expression between two different conditions

Sources of information:
• Free text
• Curated terms (ontologies, etc.)


Gene Ontology Consortium

http://www.geneontology.org

• The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.

• These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.

• The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.


FatiGO: GO-driven data analysis
The aim: to develop a statistical framework able to deal with multiple-testing questions

GO: source of information. A reduced number of curated terms
The Gene Ontology Consortium. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29


How does FatiGO work?

• Compares two sets of genes (query and reference).
• Has Ontology information [Process, Function and Component] on different organisms.
• Select level [2-5]. Important: annotations are upgraded to the level chosen. This increases the power of the test: there are fewer terms to be tested and more genes per term.

1. Input: query cluster genes and reference cluster genes.
2. Remove genes repeated in the query cluster, genes repeated in the reference cluster, and genes repeated between clusters, giving a clean query cluster and a clean reference cluster.
3. GO DB: search the GO term at the level and ontology selected.
4. Obtain the distribution of GO terms in the query cluster and in the reference cluster.
5. p-value, multiple test.

Important: since we are performing as many tests as GO terms, multiple-testing adjustment must be used
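For a single GO term, comparing its distribution in the query and reference clusters amounts to testing a 2×2 contingency table. The sketch below uses a two-sided Fisher's exact test for that table; this is an assumption, since the slides here do not name the test (Fisher's test is only mentioned earlier for contingency tables), and the function name and counts are hypothetical. The resulting p-values, one per GO term, would then need the multiple-testing adjustment stated above.

```python
from math import comb

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table

        query:     a genes with the GO term, b without
        reference: c genes with the GO term, d without

    The p-value sums the probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x annotated genes in the query cluster.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts: 3 of 4 query genes carry the term, 1 of 4 reference genes do.
p_term = fisher_two_sided(3, 1, 1, 3)
```

With such small clusters the difference is not significant (p is roughly 0.49), which illustrates why upgrading annotations to a higher level, giving more genes per term, increases the power of the test.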


FatiGO Results
The application extracts biologically relevant terms (showing a significant differential distribution) for a set of genes

• Number of genes with GO term at the level and ontology selected, for each cluster
• Unadjusted p-value
• Step-down min p adjusted p-value
• FDR (indep.) adjusted p-value
• FDR (arbitrary depend.) adjusted p-value
• Tables: GO term – genes
• Genes of old versions (Unigene)
• Genes without result
• Repeated genes
• GO tree with different levels of information


Understanding why genes differ in their expression between two different phenotypes

Lymphomas from mature lymphocytes (LB) and precursor T-lymphocytes (PTL).

Genes differentially expressed, selected among the ~7000 genes in the CNIO oncochip

Genes differentially expressed between the two groups were mainly related to immune response (activated in mature lymphocytes)

Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO


Biological processes shown by the genes differentially expressed between PTL and LB

Martinez et al., Human Genetics Laboratory. Molecular Pathology Programme, CNIO


Algorithms are used if they are available in programs.

GEPAS, a package for DNA array data analysis

• Array scanning, image processing
• Normalization: DNMAD
• Preprocessor + hub
• Unsupervised clustering: Hierarchical, SOM, SOTA, SomTree
• Supervised clustering: SVM
• Two-conditions comparison
• Gene selection: two classes, multiple classes, continuous variable, categorical variable, survival
• Predictor: tnasas
• Datamining: FatiGO, FatiWise
• Viewers: SOTATree, TreeView, SOMplot
• External tools: EP, HAPI
• In silico CGH


Bioinformatics Group, CNIO

From left to right: Lucía Conde, Joaquín Dopazo, Alvaro Mateos, Fátima Al-Shahrour, Víctor Calzado, Hernán Dopazo, Javier Herrero, Javier Santoyo, Ramón Díaz, Michal Karzinstky & Juanma Vaquerizas

http://gepas.bioinfo.cnio.es

http://fatigo.bioinfo.cnio.es

http://bioinfo.cnio.es