Statistical analysis of gene expression data Alex Sánchez Unitat d'Estadística i Bioinformàtica (VHIR) Statistics Department (UB)

Course VHIR-UCTS-UEB - Session 3 - Statistical Analysis


High throughput technologies in Genomics - Tecnologías de alto rendimiento en genómica. Session 3: Statistical Analysis Course held at Vall d'Hebron Research Institute (VHIR), in Barcelona, Catalonia, Spain, on October 5th, 2011.


Page 1: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Statistical analysis of gene expression data

Alex Sánchez

Unitat d'Estadística i Bioinformàtica (VHIR)

Statistics Department (UB)

Page 2: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Who, where, what?

Page 3: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Outline

• Basic principles of experimental design
• Analysis of RT-qPCR data
• The microarray data analysis process

Page 4: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Basic principles of Experimental Design

Page 5: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

And Fisher said…

"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of."

Sir Ronald A. Fisher, father of modern mathematical statistics and developer of experimental design and ANOVA

Page 6: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

The three basic principles of Experimental Design

• Apply the following principles to best attain the objectives of experimental design
– Replication
– Local control or blocking
– Randomization

Page 7: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

1. Replication

• Each treatment must be applied independently to several experimental units.

• Provides the means to estimate the experimental error (EE) variance in the absence of systematic differences among experimental units (EUs) treated alike. This matters because treatment differences are judged against the EE variance.

• Provides the capacity to increase the precision for estimates of treatment means.

• By itself, does not guarantee valid estimates of EE or treatment differences.
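
The gain in precision from replication can be illustrated numerically. The sketch below is only an illustration: it assumes independent replicates with a common standard deviation, the expression values are simulated, and the helper name `standard_error` is hypothetical. Under these assumptions the standard error of a treatment mean shrinks as $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


random.seed(1)
# Simulated log2 expression values for one gene (assumed mean 8.0, SD 0.5)
population_sd = 0.5
for n in (3, 6, 12, 24):
    replicates = [random.gauss(8.0, population_sd) for _ in range(n)]
    theoretical = population_sd / math.sqrt(n)
    print(f"n={n:2d}  observed SE={standard_error(replicates):.3f}  "
          f"theoretical SE={theoretical:.3f}")
```

Doubling the number of replicates reduces the theoretical standard error by a factor of $\sqrt{2}$, not 2, which is one reason sample size has to be planned before the experiment.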

Page 8: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Biological vs Technical Replicates

Figure credit: Nature Reviews & G. Churchill (2002). Variance components labelled in the figure: $\sigma^2_B$, $\sigma^2_A$, $\sigma^2_e$.

Page 9: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Replication vs Pooling

• mRNA from different samples is often combined to form a "pooled sample" or pool. Why?
– If each sample doesn't yield enough mRNA
– To compensate for excess variability?

• Statisticians tend not to like it, but pooling may be OK if properly done
– Combine several samples in each pool
– Use several pools from different samples
– Do not use pools when individual information is important (e.g. paired designs)

Page 10: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Examples of “pooling”

• Study with 12 patients → 12 chips → expensive
– Option 1:
  • Group A: 6 individuals → 1 pool of 6 → 1 chip
  • Group B: 6 individuals → 1 pool of 6 → 1 chip
– Option 2:
  • Group A: 12 individuals → 4 pools of 3 → 4 chips
  • Group B: 12 individuals → 4 pools of 3 → 4 chips
– Option 2 may be cheaper and, at the same time, have similar precision. However, without information about variability within pools and between individuals this cannot be assured.
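
Why the number of pools matters can be sketched with a simple variance model. This is an assumption, not something stated in the slides: pooling is taken to average the individuals' expression exactly, each chip adds independent technical error, and the variance components below are hypothetical. Under that model the variance of a group mean from $p$ pools of $r$ individuals is $\sigma^2_{bio}/(pr) + \sigma^2_{tech}/p$.

```python
def variance_of_group_mean(n_pools, pool_size, var_biological, var_technical):
    """Variance of a group mean estimated from n_pools chips,
    each hybridised with a pool of pool_size individuals.

    Assumes pooling averages the individuals' expression and adds no extra error.
    """
    return var_biological / (n_pools * pool_size) + var_technical / n_pools


# Hypothetical variance components (not taken from the slides)
VAR_BIOLOGICAL = 1.0   # between-individual variance
VAR_TECHNICAL = 0.25   # chip-to-chip measurement variance

designs = {
    "12 individuals, 12 chips (no pooling)": (12, 1),
    "Option 1: 1 pool of 6, 1 chip": (1, 6),
    "Option 2: 4 pools of 3, 4 chips": (4, 3),
}
for name, (n_pools, pool_size) in designs.items():
    v = variance_of_group_mean(n_pools, pool_size, VAR_BIOLOGICAL, VAR_TECHNICAL)
    print(f"{name}: variance of the group mean = {v:.4f}")
```

Note that a single pool per group leaves no replication at all, so the error variance cannot be estimated from the data regardless of how small it is.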

Page 11: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

2. Local Control

• Group EUs so that the variability of units within the groups is less than that among all units prior to grouping
– Differences among treatments are not confused with differences among experimental units.
– EE is reduced by the variability associated with environmental differences among groups of units.
– Effects of nuisance factors which contribute systematic variation to the differences among EUs can be eliminated.
– Analysis is more sensitive.

Page 12: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Confounding block with treatment effects

Awful design                         Balanced design
Sample  Treatment  Sex     Batch     Sample  Treatment  Sex     Batch
1       A          Male    1         1       A          Male    1
2       A          Male    1         2       A          Female  2
3       A          Male    1         3       A          Male    2
4       A          Male    1         4       A          Female  1
5       B          Female  2         5       B          Male    1
6       B          Female  2         6       B          Female  2
7       B          Female  2         7       B          Male    2
8       B          Female  2         8       B          Female  1

• Two alternative designs to investigate treatment effects
– Left: treatment effects are confounded with Sex and Batch effects
– Right: treatments are balanced between blocks
  • Influence of blocks is automatically compensated
  • Statistical analysis may separate block from treatment effect
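
Whether a design is confounded can be checked before any sample is processed by cross-tabulating treatment against each nuisance factor. The following is a minimal sketch using the eight samples of each design tabulated on this slide; the function names `cross_tab` and `is_confounded` are hypothetical helpers introduced for illustration.

```python
from collections import Counter

# (treatment, sex, batch) for the eight samples of each design
awful = [("A", "Male", 1)] * 4 + [("B", "Female", 2)] * 4
balanced = [
    ("A", "Male", 1), ("A", "Female", 2), ("A", "Male", 2), ("A", "Female", 1),
    ("B", "Male", 1), ("B", "Female", 2), ("B", "Male", 2), ("B", "Female", 1),
]


def cross_tab(design, column):
    """Count samples per (treatment, level) pair; column 1 = sex, 2 = batch."""
    return Counter((row[0], row[column]) for row in design)


def is_confounded(design, column):
    """True when every level of the nuisance factor occurs under one treatment only."""
    treatments_per_level = {}
    for row in design:
        treatments_per_level.setdefault(row[column], set()).add(row[0])
    return all(len(t) == 1 for t in treatments_per_level.values())


for name, design in (("awful", awful), ("balanced", balanced)):
    print(name, "batch table:", dict(cross_tab(design, 2)),
          "confounded with batch:", is_confounded(design, 2))
```

In the awful design each batch contains a single treatment, so no analysis can tell a batch difference from a treatment difference.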

Page 13: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

3. Randomisation

• Randomly assign samples to groups to eliminate unspecific disturbances
– Randomly assign individuals to treatments.
– Randomise the order in which experiments are performed.

• Randomisation is required to
– Ensure the validity of statistical procedures.
– Lead to unbiased estimates of variances and unbiased estimates of treatment differences.
– Simulate the effects of independence among EUs that are otherwise controlled, selected, and monitored.
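
Both randomisation steps (allocation to treatments and processing order) are easy to script, which also leaves a reproducible record of what was done. The sketch below is one possible way to do it, assuming equal group sizes; the function name `randomize` and the fixed seed are illustrative choices.

```python
import random


def randomize(sample_ids, treatments, seed=None):
    """Randomly allocate samples to treatments in (near) equal-sized groups
    and return (allocation, processing_order)."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled samples out to the treatments in turn
    allocation = {s: treatments[i % len(treatments)] for i, s in enumerate(shuffled)}
    # Independent shuffle for the order in which samples are processed
    run_order = list(sample_ids)
    rng.shuffle(run_order)
    return allocation, run_order


allocation, run_order = randomize(range(1, 9), ["A", "B"], seed=2011)
for sample in sorted(allocation):
    print(f"sample {sample}: treatment {allocation[sample]}")
print("processing order:", run_order)
```

Recording the seed makes the allocation auditable; it should be fixed before the samples are known to the person running the script.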

Page 14: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Allocating samples to treatments

• A key point in any experiment is the way that experimental units are allocated to treatments
– It must be chosen so that random variability is as small as possible
– It must be chosen so that the best local control is achieved.
– It implicitly defines the analysis model, so it must be chosen so that the analysis can be performed and validity conditions hold.

Page 15: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Scary stories: batch effects

Page 16: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Batch effect in microarrays

Non-biological differences/variation observed in microarray experiments

Sources:

• Technician who processes the samples

• Amplification

• Staining kit lot

• Distribution of samples across amplification runs

• Amplification kit...

It does not usually invalidate the experiment, but it does add an amount of noise that cannot be quantified

We usually know the source, but it cannot always be quantified and/or removed!

Page 17: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Technician who processes the samples

Technician 1: processes control samples

Technician 2: processes test samples

Technician 1: processes control and test samples

Technician 2: processes test and control samples

SOLUTION

Technicians 1 and 2 do not share the project

Page 18: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Distribution of samples across amplification runs

Maximum of 12 samples per amplification run

Projects with n > 12 samples have to be split across several amplification runs

Run 1: controls

Run 2: test samples

Run 1: control and test samples are processed

Run 2: test and control samples are processed

SOLUTION

Page 19: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Staining kit lot

Probes are labelled with phycoerythrin

It loses intensity over time

Hybridize each run of 12 samples

Wait until all samples are prepared and hybridize them all at once

SOLUTION

Page 20: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Removing the batch effect

• With an appropriate experimental design the batch effect can be removed or attenuated
  • implicitly, by balancing the samples across the different batches
  • explicitly, by estimating the batch effects and subtracting them from the original values.

• If the design is not adequate (e.g. batch and treatments are CONFOUNDED), nothing can be done.

• Even with a good design, batch effects cannot be removed indefinitely, because each removal costs statistical power.

• In the end we may well have to accept some batch effect.
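
Explicit removal of a batch effect, and its failure under confounding, can be sketched with noise-free toy data. Everything in this example is hypothetical (a treatment effect of 1.0, a batch shift of 2.0, the helper names), and centring each batch on the grand mean is just one simple estimator, not necessarily the one used for the corrected figures in this course.

```python
from statistics import mean


def remove_batch_effect(values, batches):
    """Centre each batch on the grand mean (sensible only if treatments
    are balanced across batches)."""
    grand = mean(values)
    batch_mean = {b: mean(v for v, bb in zip(values, batches) if bb == b)
                  for b in set(batches)}
    return [v - (batch_mean[b] - grand) for v, b in zip(values, batches)]


def group_difference(values, groups):
    """Mean of treatment 'B' minus mean of treatment 'A'."""
    return (mean(v for v, g in zip(values, groups) if g == "B")
            - mean(v for v, g in zip(values, groups) if g == "A"))


def simulate(treatments, batches, treatment_effect=1.0, batch_shift=2.0, baseline=8.0):
    """Noise-free toy expression values: baseline + treatment effect + batch shift."""
    return [baseline + (treatment_effect if t == "B" else 0.0)
            + (batch_shift if b == 2 else 0.0)
            for t, b in zip(treatments, batches)]


treatments = ["A"] * 4 + ["B"] * 4
balanced_batches = [1, 2, 2, 1, 1, 2, 2, 1]
confounded_batches = [1, 1, 1, 1, 2, 2, 2, 2]

for name, batches in (("balanced", balanced_batches), ("confounded", confounded_batches)):
    raw = simulate(treatments, batches)
    corrected = remove_batch_effect(raw, batches)
    print(f"{name}: raw B-A = {group_difference(raw, treatments):.2f}, "
          f"corrected B-A = {group_difference(corrected, treatments):.2f}")
```

With the balanced layout the treatment difference survives the correction; with the confounded layout the raw difference mixes treatment and batch, and the correction removes both together.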

Page 21: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

EXAMPLES-1: Effect of the labelling kit

Page 22: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

EXAMPLES-2

fileName     Litter  Group  ShortName  Colour
E39+_-.CEL   1       1      E39pm11    yellow
E39+_+.CEL   1       2      E39pp21    green
E40+_-.CEL   2       1      E40pm12    yellow
E40+_+.CEL   2       2      E40pp22    green
E41+_-.CEL   3       1      E41pm23    yellow
E41+_+.CEL   3       2      E41pp13    green
E42+_-.CEL   4       1      E42pm24    yellow
E42+_+.CEL   4       2      E42pp14    green

Birth (litter) batch effect

Page 23: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

UNCORRECTED

Page 24: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

CORRECTED

Page 25: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis
Page 26: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

In summary

• Good experimental design is essential to perform good experiments.

• Experimental design means planning ahead
– Should be done before the experiment starts
– Should consider all the steps: from sampling to data analysis.

• Not a question of "statistical snobbery" but of saving time and money and of doing good science

Page 27: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Basic aspects of qPCR data analysis

Page 28: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Outline

• Common types of qPCR data analyses
• Biostatistical aspects of relative quantification
• Confirmatory and exploratory statistical analysis

Page 29: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Real time qPCR data

• RT-qPCR data are CT or threshold cycle values.

– CT= Cycle number at which detectable signal is achieved.

– The lower the CT, the larger the amount of starting material; the higher the CT, the smaller the amount.
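
The inverse relation between CT and starting amount follows from exponential amplification. The sketch below assumes the idealised model $N_c = N_0 (1+E)^c$ with a hypothetical detection threshold; the function name and all numbers are illustrative only.

```python
import math


def threshold_cycle(n0, threshold=1e10, efficiency=1.0):
    """Cycle at which n0 * (1 + efficiency) ** cycle reaches the threshold.

    efficiency = 1.0 means perfect doubling in every cycle.
    """
    return math.log(threshold / n0) / math.log(1.0 + efficiency)


for copies in (1e3, 1e4, 1e5, 1e6):
    print(f"starting copies = {copies:.0e}  ->  CT = {threshold_cycle(copies):.2f}")
```

Under perfect doubling, a ten-fold increase in starting material lowers the CT by $\log_2 10 \approx 3.32$ cycles, whatever threshold is chosen.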

Page 30: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Basic types of RT-qPCR analysis

• Two basic types of analysis
– Absolute quantification
– Relative quantification

• Choice based on
– Experimental goals
– Available resources

Page 31: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Absolute quantification

• Use absolute quantification…
– To understand properties that are intrinsic to a given sample.
– To answer the question "how many?"

• Examples of applications
– Chromosome or gene copy number determination
– Viral load measurements

Page 32: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Standard curve

• Absolute quantification is achieved by comparing the CT value of each sample to a standard curve, which is obtained by
– Preparing several known amounts of a standard template
– Measuring the CT of each
– Plotting CT against the log of the known quantity

Page 33: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Standard Calibration Curve

Page 34: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Example: determining absolute copy number from absolute quantification

• The standard curve is used only for interpolation but not for extrapolation (relation may not be linear outside the limits tested).
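
A standard curve is a straight-line fit of CT against the log of the known quantity. The sketch below uses a hypothetical dilution series (the copy numbers and CT values are invented) and plain least squares; `fit_line` and `copies_from_ct` are illustrative names. It also refuses to extrapolate outside the calibrated range, in line with the point above.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


# Hypothetical ten-fold dilution series of a standard template
copies = [1e3, 1e4, 1e5, 1e6, 1e7]
ct = [30.1, 26.8, 23.4, 20.1, 16.7]

slope, intercept = fit_line([math.log10(c) for c in copies], ct)
# Amplification efficiency implied by the slope (1.0 would be perfect doubling)
efficiency = 10 ** (-1 / slope) - 1


def copies_from_ct(ct_value):
    """Interpolate a copy number from a CT value; refuse to extrapolate."""
    log_q = (ct_value - intercept) / slope
    if not math.log10(min(copies)) <= log_q <= math.log10(max(copies)):
        raise ValueError("CT outside the calibrated range; extrapolation is not supported")
    return 10 ** log_q


print(f"slope = {slope:.3f}, intercept = {intercept:.2f}, efficiency = {efficiency:.1%}")
print(f"CT 25.0 -> about {copies_from_ct(25.0):.0f} copies")
```

A slope near -3.32 corresponds to an efficiency near 100%; slopes far from that value suggest the assay needs attention before the curve is used.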

Page 35: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Absolute vs Relative quantifications

• Absolute quantification answers the question "how many" but gives no information about change.

• Relative quantification can be used to
– Compare levels or changes in gene expression.
– Answer the question: what is the fold difference?

Page 36: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Relative quantification methods

• For absolute quantification one requires a standard template with several known concentrations to build the curve.

• For relative quantification one needs to apply some form of normalization, that is, one has to transform the data in order to
– Remove possible experimental biases
– Make data from different samples/groups comparable, so that the term "relative" keeps its meaning.

Page 37: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Normalization against a unit mass

Page 38: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Normalization against a reference gene

• Benefit:
– Circumvents the need for accurate quantification of starting material

• Drawback:
– Requires known reference genes with stable expression levels

Page 39: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Required CT values

Page 40: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Most common approaches

• Livak or ∆∆CT method

• The ∆CT method against a reference gene

• The Pfaffl method

Page 41: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Livak method (1)

Page 42: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Livak method (2)

Page 43: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Other methods

• Although the Livak method is the most widely used, there are alternatives:

• The ∆CT method yields equivalent results but is simpler to calculate.

• The Pfaffl method is preferable when the reaction efficiencies of the target and the reference are not similar.
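
The Livak and Pfaffl calculations can be written in a few lines. This is a sketch of the commonly published formulas, $2^{-\Delta\Delta C_T}$ and the efficiency-corrected ratio, not a reproduction of the slides' figures; the CT values are hypothetical, and `e_target`/`e_ref` are amplification factors where 2.0 corresponds to 100% efficiency.

```python
def livak_ratio(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Fold change by the 2^(-ddCT) method; assumes ~100% efficiency for both genes."""
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    return 2.0 ** -(delta_ct_treated - delta_ct_control)


def pfaffl_ratio(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control,
                 e_target=2.0, e_ref=2.0):
    """Efficiency-corrected fold change (2.0 = perfect doubling per cycle)."""
    return (e_target ** (ct_target_control - ct_target_treated)
            / e_ref ** (ct_ref_control - ct_ref_treated))


# Hypothetical mean CT values: target/reference gene in treated and control samples
cts = dict(ct_target_treated=22.0, ct_ref_treated=18.5,
           ct_target_control=25.0, ct_ref_control=18.0)

print(f"Livak fold change:                  {livak_ratio(**cts):.2f}")
print(f"Pfaffl, both efficiencies 100%:     {pfaffl_ratio(**cts):.2f}")
print(f"Pfaffl, target amplification 1.90:  {pfaffl_ratio(**cts, e_target=1.9):.2f}")
```

When both amplification factors equal 2 the two methods coincide; they diverge as soon as one efficiency departs from 100%, which is the situation where the Pfaffl correction is useful.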

Page 44: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Biostatistical aspects of relative quantification

Page 45: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Biostatistical analysis

• Two main types of analyses
– Comparative analyses
  • Relatively rigorous
  • Check predefined hypotheses
  • Rely on statistical testing
– Expression profiling: search for trends and patterns in the data
  • Exploratory, hypothesis-generating approach
  • Less rigorous
  • Cluster analysis or PCA

Page 46: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Relative quantification

Page 47: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Expression profiling

Page 48: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Three basic premises

• Statistical analysis of RT-qPCR data relies on three assumptions
– One gene at a time
– We are sampling from two different (unknown) independent populations
– There exist unknown mechanisms that contribute to variability.

Page 49: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

From assumptions to strategies (1)

• Use random sampling and randomization to obtain independent and representative samples.

Page 50: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

From assumptions to strategies (2)

• Apply experimental design principles to minimize confounding variability

Page 51: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

From assumptions to strategies (3)

• Perform statistical testing
• DO NOT FORGET about multiple testing adjustments
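
One widely used multiple testing adjustment is the Benjamini-Hochberg procedure, which controls the false discovery rate. The slides do not say which adjustment is intended here, so the sketch below is only one option; the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per gene tested
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

Genes are then declared significant by comparing the adjusted values, not the raw ones, with the chosen FDR level.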

Page 52: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Statistical analysis

• Standard statistical approach: confirmatory study (reject or accept a predefined hypothesis)

Page 53: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Comparing two groups…

Page 54: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Comparing more than two groups

Page 55: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Exploratory statistical analysis

• Used when, instead of confirming hypotheses, we want to generate them (finding patterns in data)

Page 56: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Multivariate methods for exploratory data analysis

Page 57: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Software for the analysis

• ABI
– DataAssist

• Biogazelle
– REST

• Bio-Rad
– GENEX (Gene expression macro)

• Multid
– GenEx

• Bioconductor
– HTqPCR

• Integromics
– StatMiner

Page 58: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Introduction to microarray data analysis

Page 59: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Presentation outline

Introduction and objectives

Microarray data analysis
  Types of data and types of studies. Tools. The analysis process. Examples

Criticisms, consensus, advice and "state of the art"
  Criticisms of microarrays
  Consensus and advice ("dos and don'ts")
  MAQC-I, MAQC-II

From microarrays to diagnostics
  Why is it always just around the corner?

Page 60: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

To learn more…

http://www.ub.es/stat/docencia/bioinformatica/microarrays/ADM/

Page 61: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Types of studies

Page 62: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(1): Class comparison

Page 63: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(2): Class discovery

Page 64: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(3): Class prediction

Page 65: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

And many more…

Time course: expression profiles over time

Pathway analysis (systems biology): reconstruction of metabolic networks from expression data

Whole genome, CGH, alternative splicing

Studies with data of different types: data fusion or integration

Page 66: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Tools for the analysis

Page 67: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Data analysis programs

A multitude of tools
• Free / commercial: [R, BRB, MeV, dChip…] / [Partek, GeneSpring, Ingenuity]
• Downloadable / online: [R, BRB, MeV…] / [Gepas,…]
• Stand-alone / part of "suites" or sites: [BRB, dChip] / [MeV (TM4), OntoTools]

A survey of free microarray data analysis tools: http://chagall.med.cornell.edu/I2MT/MA-tools.pdf

Page 68: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Free analysis programs

Program          Strengths                                          Weaknesses
R/Bioconductor   Powerful, flexible, up to date; Unix/Windows/Mac   Console-based; hard to master
BRB tools        Excel-based; user-friendly                         If it fails, it fails; hard to extend
dChip            Expression & SNPs; user-friendly                   Windows only; few options
Babelomics       Web-based; many options; good material             Web-based; somewhat rigid to operate

Page 69: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Babelomics: a journey to knowledge

Page 70: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Commercial analysis programs

Program      Strengths                                                                   Weaknesses
GeneSpring   Widely used; powerful graphics; extensible (R)                              Limited ANOVA; EXPENSIVE
Partek       Very powerful ANOVA; multiple data types; 3D visualization                  "Classical" statistics only; not extensible; expensive
Ingenuity    Annotation database; network and biological significance analysis           Focused mostly on cancer data; expensive

Page 71: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

The analysis process

Page 72: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Analysis of a microarray experiment

(1) Images (raw data)

(2) Quality control (low level)

(3) Preprocessing

(4) Exploration of the expression matrix

(5) Analysis

(6) Biological significance

Page 73: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(0) Experimental design

• Variability
– Systematic
  • Calibrate/normalize
– Random
  • Experimental design
  • Inference

• Decide about
– Replicates
– Batches ("batch effect")
– Pools…

Awful design :-(                     Balanced design :-)
Sample  Treatment  Sex     Batch     Sample  Treatment  Sex
1       A          Male    1         1       A          Male
2       A          Male    1         2       A          Female
3       A          Male    1         3       A          Male
4       A          Male    1         4       A          Female
5       B          Female  2         5       B          Male
6       B          Female  2         6       B          Female
7       B          Female  2         7       B          Male
8       B          Female  2         8       B          Female

Page 74: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(1) Image acquisition

• Input: microarrays
• Output:
– Images (1 per chip)
– Image files
  • Information for each individual probe

• Data for the low-level analysis
– Quality control
– Preprocessing
– Summarization

1.cel, 1.chp 2.cel, 2.chp

Page 75: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(2) Low-level quality control

• Input:
– Images (.CEL, ...)

• Process
– Diagnostics and quality control
– Model-based analysis (PLM)

• Output:
– Plots
– Quality control statistics

1.cel, 1.chp 2.cel, 2.chp

Page 76: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(3) Preprocessing

• Input:
– Image files (scanner data)

• Process
– Noise removal
– Normalization
– Summarization
– Filtering

• Output:
– Expression matrix

1.cel, 1.chp 2.cel, 2.chp

              C01-001.CEL  C02-001.CEL  C03-001.CEL
1415670_at    8.954387     9.088924     8.833863
1415671_at    10.700876    10.639307    10.610953
1415672_at    10.377266    10.510106    10.461701
1415673_at    7.320335     7.252635     7.112313
1415674_a_at  8.381129     8.332256     8.393718
1415675_at    8.120937     8.082713     8.051514
1415676_a_at  10.322229    10.287371    10.282812
1415677_at    9.038344     8.979641     8.905711

Page 77: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(4) Exploration

• Input
– Expression matrix

• Process
– PCA, clustering, MDS
– 2D/3D representations
– Groupings

• Output
– Detection of batch effects
– Quality verification


Page 78: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(5) Statistical analysis (i): Selection of differentially expressed genes

• Input:
– Expression matrix
– Analysis model

• Process
– t-tests, ANOVA
  • p-value adjustments

• Output
– Gene lists
  • Fold change, p-values
– Plots
– Expression profiles


ProbeSet      gene    ID            logFC   t        P.Value   adj.P.Val  B
1450826_a_at  Saa3    1450826_a_at  4.911   63.544   6.21E-14  2.80E-10   22.244
1457644_s_at  Cxcl1   1457644_s_at  4.286   53.015   3.52E-13  7.69E-10   20.791
1415904_at    Lpl     1415904_at    -4.132  -50.455  5.66E-13  7.69E-10   20.373
1449450_at    Ptges   1449450_at    5.164   49.483   6.82E-13  7.69E-10   20.207
1419209_at    Cxcl1   1419209_at    5.037   47.175   1.08E-12  9.71E-10   19.794
1416576_at    Socs3   1416576_at    3.372   42.107   3.19E-12  2.08E-09   18.784
1450330_at    Il10    1450330_at    4.519   42.056   3.23E-12  2.08E-09   18.773
1455899_x_at  Socs3   1455899_x_at  3.648   40.821   4.29E-12  2.12E-09   18.502
1419681_a_at  Prok2   1419681_a_at  3.709   40.645   4.48E-12  2.12E-09   18.463
1436555_at    Slc7a2  1436555_at    3.724   40.081   5.12E-12  2.12E-09   18.335

Page 79: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(5) Statistical analysis (ii): Building & validating a predictor

• Input:
– Expression matrix

• Process
– Variable selection
– Model fitting
– Validation

• Output
– Predictive models
– Measures of reliability/reproducibility

Page 80: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

(6) Biological significance

• Input
– Gene lists

• Process
– GEA, GSEA, …

• Output:
– GO classes / gene groups / pathways that are especially represented

ProbeSet      gene    ID            logFC
1450826_a_at  Saa3    1450826_a_at  4.911
1457644_s_at  Cxcl1   1457644_s_at  4.286
1415904_at    Lpl     1415904_at    -4.132
1449450_at    Ptges   1449450_at    5.164
1419209_at    Cxcl1   1419209_at    5.037
1416576_at    Socs3   1416576_at    3.372
1450330_at    Il10    1450330_at    4.519
1455899_x_at  Socs3   1455899_x_at  3.648
1419681_a_at  Prok2   1419681_a_at  3.709
1436555_at    Slc7a2  1436555_at    3.724

Page 81: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Data analysis example

Comparison of expression profiles between BRCA1 and BRCA2 tumours, and construction of a predictor that distinguishes between them.

Page 82: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Source of the example

Gene Expression Profiles in Hereditary Breast Cancer

• Hedenfalk, I., et al., NEJM, Vol. 344, No. 8, pp. 539-548.

Objective: to find a predictor based on expression profiles that differentiates BRCA1-associated from BRCA2-associated tumours

Page 83: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Analysis outline

• Experimental design and data for the analysis
• Preprocessing
• Exploration
• Gene selection
• Construction of several predictors and selection of the most appropriate one

Page 84: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Experimental design

• RNA extracted from
– 7 BRCA1 patients
– 8 BRCA2 patients
– 7 patients with "sporadic" cancer

• 6512 probes
– 5361 genes

• 3226 retained for the analysis

• Reference design
– Each sample compared against a non-tumorigenic cell line (MCF-10A)

Patient  ArrayPID  BRCA1 v BRCA2 v Sporadic
s1321    20        Sporadic
s1996    1         BRCA1
s1822    5         BRCA1
s1714    3         BRCA1
s1224    7         BRCA1
s1252    2         BRCA1
s1510    4         BRCA1
s1900    10        BRCA2
s1787    9         BRCA2
s1721    8         BRCA2
s1486    22        BRCA2
s1572    16        Sporadic
s1324    17        Sporadic
s1649    15        Sporadic
s1320    18        Sporadic
s1542    19        Sporadic
s1281    21        Sporadic
s1905    6         BRCA1
s1816    13        BRCA2
s1616    14        BRCA2
s1063    11        BRCA2
s1936    12        BRCA2

Page 85: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Data: log ratios

Page 86: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Preprocessing: filtering and normalization

Page 87: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Exploration (1)

Page 88: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Exploration (2)

Page 89: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Analysis (1). Gene selection (class comparison)

• BRCA1 vs non-BRCA1
• We use a t-test and a cutoff of 0.0001
– that is, we declare as differentially expressed the genes whose p-value is below 0.0001

• We make no adjustments for
– Minimum FC
– Multiple testing

Page 90: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Results (1): gene list

Order  Parametric p-value  FDR     Fold-change  Unique id  Description                                                Clone
1      1.66e-05            0.0198  2.24         HV34H7     ESTs                                                       247818
2      2.17e-05            0.0198  2.03         UG5G3      minichromosome maintenance deficient (S. cerevisiae) 7     46019
3      2.3e-05             0.0198  0.31         HV17G6     keratin 8                                                  897781
4      3.37e-05            0.0198  1.89         HV18E8     SELENOPHOSPHATE SYNTHETASE ; Human selenium donor protein  840702
5      3.63e-05            0.0198  2.21         HV32C7     ESTs                                                       307843
6      4.32e-05            0.0198  1.57         UG1F1      very low density lipoprotein receptor                      26082
7      4.5e-05             0.0198  1.67         HV24F5     chromobox homolog 3 (Drosophila HP1 gamma)                 566887
8      4.92e-05            0.0198  2.02         LO3F1      butyrate response factor 1 (EGF-response factor 1)         366647
9      9.43e-05            0.0338  1.85         HV9E3      "tumor protein p53-binding protein, 2"                     212198

Page 91: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Analysis (2): Building a predictor

• We build predictors using 6 different methods.

• Candidate genes come from the class comparison.

• We choose the one with the lowest prediction error rate (estimated by leave-one-out)
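
Leave-one-out estimation of the prediction error can be sketched as follows. The classifier used here, a nearest-centroid rule, is an assumption chosen for brevity (the slides do not list the six methods), and the two-gene data set is invented. The sketch also omits gene selection; in a real analysis the selection step would have to be repeated inside every leave-one-out round so that the held-out sample never influences it.

```python
from statistics import mean


def nearest_centroid_predict(train_x, train_y, sample):
    """Assign `sample` to the class whose mean profile is closest
    (squared Euclidean distance)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_error(x, y, predict=nearest_centroid_predict):
    """Fraction of samples misclassified when each is held out in turn."""
    errors = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) != y[i]:
            errors += 1
    return errors / len(x)


# Invented log-ratios for two genes in six tumours
x = [[2.1, 0.3], [1.9, 0.1], [2.2, -0.2], [-1.8, 0.2], [-2.0, -0.1], [-2.3, 0.0]]
y = ["BRCA1"] * 3 + ["BRCA2"] * 3
print("leave-one-out error rate:", leave_one_out_error(x, y))
```

Passing a different `predict` function lets several methods be compared on the same footing, which mirrors the selection step described on this slide.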

Page 92: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

In summary…

Microarray analysis can be viewed as a process.

It is important to know the appropriate methods for each problem, the parameters, the meaning and the limitations of each step.

A proper application of the process provides relevant information such as...
• a list of differentially expressed genes (biomarkers)
• a model with predictive capacity (signature)

Page 93: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Limitations of the method

Criticisms, advice, consensus and "state of the art"

Page 94: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Limitations of microarrays

Page 95: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

An array of problems?

• Poor reproducibility between studies
– Little overlap between gene lists
– Predictions not reproduced on new test sets

• Lack of standards
• Lack of consensus on methods
• The move to the clinic is always just around the corner

• Mid-decade: promise or reality?

Page 96: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

We are not doing so badly...

Page 97: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Some points of consensus (Allison 2006)

• Design
– Biological replication is essential
– There is strength in numbers: power & sample size
– Pooling biological samples can be useful

• Selection of differentially expressed genes
– Using FC alone as a differential expression test is not valid
– 'Shrinkage' is a good thing
– FDR is a good alternative to conventional multiple-testing approaches

• Classification and prediction
– Unsupervised classification is overused
– Unsupervised classification should be validated using resampling
– Supervised classification requires independent cross-validation

Page 98: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Not all studies are done well...

• Dupuy & Simon reviewed 90 publications.
– Detailed analysis of the methods used in 42.

• They found some common errors
– Poorly defined objectives.
– No control for multiplicity: 10^4 genes → 10^4 tests → P(false positive) very high
– The reliability of a predictor is not properly reported.
– No independent test set is used.
– Cluster analysis is overused everywhere.
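
The multiplicity problem is simple arithmetic. The sketch below assumes, for illustration only, that all tests are independent and that no gene is truly differentially expressed; the function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error(n_tests, alpha):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_genes = 10 ** 4
for alpha in (0.05, 0.0001):
    print(f"alpha = {alpha}: expected false positives = "
          f"{expected_false_positives(n_genes, alpha):.0f}, "
          f"P(at least one) = {familywise_error(n_genes, alpha):.3f}")
```

Testing ten thousand genes at the 0.05 level would be expected to flag about five hundred genes by chance alone under these assumptions.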

Page 99: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Although it can be done well if...

• One tries to... (do's)
– Define the objectives well.
– Combine the p-value and the FC when selecting genes.
– Use the FDR to control multiplicity.
– Validate a predictor with an independent test set.
– Involve a statistician.

• One avoids... (don'ts)
– Basing the selection only on "fold change"
– Using p-values of 0.05
– Using cluster methods when the goal is to classify samples.
– Violating the basic principle of validation (the test set must not be used before validation).

... Up to 40 "do's" and "don'ts" in Table 3 of Dupuy and Simon (JNCI 99 (2): 147-157).

Page 100: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

In summary

• Microarrays have some limitations, reasonable and intrinsic.

• Proper use of the analysis methods can generate useful, reliable and reproducible information.

• Even so, the move from the clinic to diagnostics is slower than expected.

Why?

Page 101: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

From basic research to microarray-based diagnostics

When?

Page 102: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

The idea is clear...

Page 103: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

But there are very few diagnostic kits...

Page 104: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Some of the difficulties

• Very large studies are needed to establish the power of a diagnostic (kit) and to validate it in an independent and sufficiently large cohort.

• Standardization and quality-control systems validated to clinical laboratory criteria are needed.

• Expression-profile tests must comply with the rules of the European Medicines Agency and/or the FDA.

• To justify their development, cost-effectiveness studies are needed that suggest a clear improvement in patient treatment, plus a return on investment and profits in the medium/long term.

Page 105: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

Status of microarray-based diagnostics

Filled: , Empty:

Page 106: Course VHIR-UCTS-UEB - Session 3 -  Statistical Analysis

In summary

• The growing quality and size of studies is expected to generate new expression profiles that can be carried over to diagnostics.

• Aspects such as standardization and automation (robotics) to minimize human intervention keep improving.

• Others, such as regulation by the agencies and reimbursement policies for investors and laboratories, still need to be resolved.

• A future in which the "lab-on-a-chip" is part of the clinician's toolkit is not unlikely.