Data mining in genetics
Junior Barrera
BIOINFO-USP
Layout
• Introduction
• Data mining
• Mapping of rare genes
• Expression analysis: measurement; differentially expressed
genes; clustering of expression signals; identification of
gene regulation networks
Introduction
Knowledge evolution in genetics
• Heredity - Mendel (1866)
• The phenotype of an individual depends on
the genes of its parents.
• Chromosome Theory - Morgan (1910)
• Genes are located on chromosomes
• The molecular structure of chromosomes
(Watson and Crick - 1953)
• DNA structure: the double helix
• Four bases: adenine (A), guanine (G),
thymine (T), cytosine (C)
• Genes are sequences of nucleotides
• DNA manipulation
• cutting, replication and decoding
• Genetic engineering
• species modification, drug production
• Genes control the metabolism
• Metabolism occurs by sequences of
enzyme-catalyzed reactions.
• Enzymes are specified by one or more
genes
• Gene expression
Data acquisition
Quantization - {-1,0,1}
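The quantization step can be illustrated with a minimal sketch. It assumes the measured values are log2 expression ratios and that a symmetric cut-off separates under-expressed (-1), unchanged (0) and over-expressed (1) genes. The function name and the threshold value are hypothetical; the slides do not say how the levels are chosen.

```python
import numpy as np

def quantize_log_ratio(log_ratio, threshold=1.0):
    """Map log2 expression ratios to the levels {-1, 0, 1}.

    threshold is an assumed cut-off, not a value taken from the slides.
    """
    log_ratio = np.asarray(log_ratio, dtype=float)
    levels = np.zeros(log_ratio.shape, dtype=int)
    levels[log_ratio >= threshold] = 1    # over-expressed
    levels[log_ratio <= -threshold] = -1  # under-expressed
    return levels

example = quantize_log_ratio([-2.3, -0.4, 0.0, 0.9, 1.7])
print(example)
```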
Data mining
[Diagram: procedures P1, P2, ..., Pn operating on an object-oriented database]
Pi: analytical and mining procedures (parallel kernel)
Integrated Environment
[Diagram: an access control module in front of a sequence module and a microarray module; each module has operational and analytical parts, procedures P1, P2, P3, ..., Pn and a query interface]
• Sequence module: GenBank database
• Microarray module: images, expression table, signal clusters, dynamical system
• External sources reached by query: other databases, clinical data
Data Warehouse
[Diagram: data warehouse architecture]
• Sources: M_database, IMS, RDBMS, others, each behind a wrapper
• Integrator, metadata, data mart
• MOLAP/ROLAP server
• Analysis tools: writer, spreadsheet, data mining
• Users
System Architecture
[Diagram: layered system architecture]
• Databases: DB/2, Sybase, Oracle, Informix
• Slices: Slice 1, N-2 slices, Slice N
• Applications: Appl. 1, Appl. 2, Appl. 3, ..., Appl. n
• Languages: Java, PB, Delphi, MATLAB
• Consult rules and libraries
[Diagram: proteome, transcriptome, genome and pathway data are combined (Σ) with knowledge from Medline and the history of discoveries]
Wet lab question: Which genes regulate the pathway A->B->C->D?
[Diagram: procedures P1, P2, ..., Pn and N1, N2, ..., Nn operating on an object-oriented database]
Pi: analytical and mining procedures (parallel kernel)
Ni: knowledge discovery procedures (parallel kernel)
[Diagram: scales of biological organization and the data studied]
• Scales: molecule, cell, tissue, organism, population
• Data: protein structure and dynamics; DNA, protein, gene expression, gene networks
GRID Computer - DCC-IME-USP
[Diagram: machines E4000, E3500 and V880 connected through Internet and intranet]
• Services: E4000: databases and web application services; processing services
• (by SEFAZ) 10 processors, 16 GB main memory, 1 TB disk
• (by CAGE-FAPESP) 4 processors, 8 GB
• (by Malaria-FAPESP) 16 processors, 32 GB
• V880? (by eucalipto-CNPq ???) 16 processors, 32 GB
Mapping of rare genes
ACGAATCTAGAGAATTAATTAACCGAGTTAAGA
Exon
Training from known genes
Expression Analysis
Analysis Phases
• Measure
• Differentially expressed genes (G. Dif. Exp.)
• Clustering
• Gene regulation networks (G. R. Net.)
Image Analysis
Selection of clones
1. Clusters of the same gene
[Figure: clones aligned 5’ to 3’; gene chosen]
2. Choice of a representation for the gene
Beware of polyadenylation sites!
Image capture: Cy5, Cy3
Hybridization
cDNAs; mixtures of cDNAs labeled with Cy3 and Cy5
Hirata R, Barrera J, Hashimoto R, Dantas D, Esteves G. In press, 2002.
Expression Calculus
Ratio
signal + noise
Foreground
Background
The dispersion of Cy3 and Cy5 gives an idea of the spot quality
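As an illustration of the expression calculus, the sketch below computes a background-subtracted Cy5/Cy3 ratio for one spot and a simple dispersion score for its foreground pixels. Both functions are assumptions made for the example: the slides name the ingredients (ratio, signal + noise, foreground, background, dispersion) but do not give formulas, and the coefficient of variation is only one possible dispersion measure.

```python
import numpy as np

def spot_ratio(fg_cy5, bg_cy5, fg_cy3, bg_cy3):
    """Background-subtracted Cy5/Cy3 ratio from the pixel samples of one spot."""
    signal_cy5 = np.mean(fg_cy5) - np.mean(bg_cy5)
    signal_cy3 = np.mean(fg_cy3) - np.mean(bg_cy3)
    if signal_cy5 <= 0 or signal_cy3 <= 0:
        return float("nan")  # no usable signal above background
    return float(signal_cy5 / signal_cy3)

def spot_dispersion(fg_pixels):
    """Coefficient of variation of the foreground pixels (assumed quality score)."""
    fg = np.asarray(fg_pixels, dtype=float)
    return float(np.std(fg) / np.mean(fg))

# Made-up pixel intensities, for illustration only.
ratio = spot_ratio([900, 1000, 1100], [100, 100, 100],
                   [500, 550, 600], [100, 100, 100])
dispersion = spot_dispersion([900, 1000, 1100])
print(ratio, dispersion)
```

A spot whose foreground pixels are widely dispersed in either channel would get a high score and could be flagged for inspection.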
Good spot
Bad spot
o Background pixels
o Foreground pixels
ScanAlyze
Entirely based on ROIs.
Simple pixel sampling
QuantArray®
Spot
Fully automated software
Jain, et al. Genome Research, 12: 325-332, 2002
But it does not always work…
So, manual correction!
Pixels are chosen based on the histogram.
Example - Segmentation
cDNAs used
• Q gene - 637 bp, 52.3% GC
• trpC - 338 bp, 45.3% GC
• lysA - 303 bp, 47.2% GC
• ST0280 (ORESTES) - 659 bp, 34.6% GC
• IL6 - 948 bp, 37.7% GC
• IRF1 - 2069 bp, 52.5% GC
Dilution of fixed material
Each cDNA is fixed at the dilutions 1/1, 1/2, 1/4, 1/8, 1/16
[Figure: array layout with blocks B1, B2, B3 and B4; in each block the legend marks dilutions 1 to 5]
For a good signal:
• Linear regression is a good estimator of the
ratio (background estimation is avoided)
• A dye swap allows Cy3 and Cy5 to be normalized
• Confidence intervals widen as the signal
intensity decreases
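The first two points can be sketched as follows, under stated assumptions. The ratio is taken as the slope of a least-squares line through the (Cy3, Cy5) pixel pairs of a spot, so a constant background offset ends up in the intercept rather than in the ratio. For the dye swap, the sketch assumes a multiplicative dye bias that cancels when a ratio is combined with its swapped replicate. Function names and the synthetic pixel values are hypothetical.

```python
import numpy as np

def regression_ratio(cy3_pixels, cy5_pixels):
    """Fit cy5 = slope * cy3 + intercept; the slope estimates the ratio."""
    slope, intercept = np.polyfit(cy3_pixels, cy5_pixels, deg=1)
    return float(slope), float(intercept)

def dye_swap_ratio(ratio_forward, ratio_swapped):
    """Combine a ratio with its dye-swapped replicate.

    Assumes forward = true * bias and swapped = bias / true,
    so the bias cancels in sqrt(forward / swapped).
    """
    return float(np.sqrt(ratio_forward / ratio_swapped))

# Synthetic spot: true ratio 2, constant offset 150, small pixel noise.
rng = np.random.default_rng(0)
cy3 = rng.uniform(200.0, 2000.0, size=200)
cy5 = 2.0 * cy3 + 150.0 + rng.normal(0.0, 20.0, size=200)
slope, intercept = regression_ratio(cy3, cy5)

# Dye bias of 1.3 applied to a true ratio of 2.
combined = dye_swap_ratio(2.0 * 1.3, 1.3 / 2.0)
print(slope, intercept, combined)
```

The offset only shifts the fitted line, which is why no separate background estimate is needed for the slope.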
Genes differentially expressed
Experiment
• Choose a population of men and measure
their physical characteristics
• Ask the men to move a 150 kg object
• Separate the men who succeed from those who
do not
• Find the characteristics shared by the men
who succeed and by the men who do not
Genes differentially expressed
• Choose a set of genes
• Measure the expression of these genes in
two different biological states
• Choose subsets of genes that are sufficient to
characterize each biological state
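The selection of small gene subsets can be sketched as an exhaustive search over gene pairs, scoring each pair by how well a linear classifier separates the two states. This is a simplified stand-in, not the method of the slides: it uses a plain Fisher linear discriminant and its training error, and the data are synthetic. All names are hypothetical.

```python
import itertools
import numpy as np

def fisher_error(x, y):
    """Training error of a Fisher linear discriminant; labels y are 0 or 1."""
    x0, x1 = x[y == 0], x[y == 1]
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # Pooled within-class scatter, lightly regularized to stay invertible.
    sw = np.cov(x0, rowvar=False) * (len(x0) - 1)
    sw = sw + np.cov(x1, rowvar=False) * (len(x1) - 1)
    sw = np.atleast_2d(sw) + 1e-6 * np.eye(x.shape[1])
    w = np.linalg.solve(sw, m1 - m0)
    threshold = w @ (m0 + m1) / 2.0
    predicted = (x @ w > threshold).astype(int)
    return float(np.mean(predicted != y))

def best_gene_pair(expr, y):
    """Return the gene pair with the lowest error (first one found on ties)."""
    best_pair, best_err = None, np.inf
    for pair in itertools.combinations(range(expr.shape[1]), 2):
        err = fisher_error(expr[:, list(pair)], y)
        if err < best_err:
            best_pair, best_err = pair, err
    return best_pair, best_err

# Synthetic data: 20 samples, 6 genes. Genes 1 and 4 share a strong common
# factor and differ by the state, so they separate the states only jointly.
rng = np.random.default_rng(0)
labels = np.array([0] * 10 + [1] * 10)
expr = rng.normal(0.0, 1.0, size=(20, 6))
shared = rng.normal(0.0, 3.0, size=20)
expr[:, 1] = shared + rng.normal(0.0, 0.1, size=20)
expr[:, 4] = shared + 2.0 * labels + rng.normal(0.0, 0.1, size=20)

best_pair, best_err = best_gene_pair(expr, labels)
print(best_pair, best_err)
```

With few samples the training error is optimistic, so a real analysis would need a more careful error estimate before trusting a selected subset.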
[Figure: linear classifier (dispersed Gaussian), σ = 0.600, separating BRCA1 from BRCA2/sporadic tumor samples. x axis: phosphofructokinase, platelet; y axis: v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2. Annotations: 1-th (0.389593), 7-th (0.416607), the tumor sample from Patient 20]
Clustering
[Figure: time course data; x axis: time (1 to 7), y axis: log2(ratio) (-6 to 6); clusters C1, C2, C3]
[Figure: KL plot; clusters C1, C2, C3]
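A KL plot of this kind can be sketched as a projection of each expression profile onto its two leading principal axes. This assumes that "KL" stands for the Karhunen-Loève transform (principal component analysis), which the slides do not spell out; the function name and the random profiles are hypothetical.

```python
import numpy as np

def kl_projection(data, n_components=2):
    """Project the rows of data onto the leading principal axes."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# 30 made-up profiles with 7 time points each.
rng = np.random.default_rng(1)
profiles = rng.normal(0.0, 1.0, size=(30, 7)) * np.array([5, 3, 1, 1, 1, 1, 1])
points = kl_projection(profiles)
print(points.shape)
```

Each row of `points` gives the two coordinates of one profile in the plane of the plot.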
Dendrogram
[Figure: different views of microarray data: time course data and KL plot of the multidimensional space]
[Figure: original clusters compared with the clusters found by the dendrogram; mis-classified profiles are marked]
Example
No error!
Tighter clusters due to small variance
Results from fuzzy c-means
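For reference, a minimal sketch of the standard fuzzy c-means iteration is given below. It is a generic textbook version, not the implementation behind the results on the slides; the fuzzifier m = 2, the random initialization, the fixed iteration count and the three synthetic profiles are all assumptions.

```python
import numpy as np

def fuzzy_c_means(data, n_clusters, m=2.0, n_iter=100, seed=0):
    """Alternate center and membership updates for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    u = rng.random((data.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        um = u ** m
        # Centers are membership-weighted means of the profiles.
        centers = (um.T @ data) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # avoid division by zero
        # Membership is proportional to dist^(-2 / (m - 1)), normalized per profile.
        inv = dist ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u

# Three made-up time course shapes with small noise, 10 profiles each.
rng = np.random.default_rng(2)
time = np.arange(1, 8)
shapes = np.array([0.5 * time, 4.0 - time, np.sin(time)])
data = np.repeat(shapes, 10, axis=0) + rng.normal(0.0, 0.2, size=(30, 7))

centers, memberships = fuzzy_c_means(data, n_clusters=3)
hard_labels = memberships.argmax(axis=1)
print(hard_labels)
```

Unlike a dendrogram cut, each profile receives a degree of membership in every cluster; taking the largest membership gives a hard assignment.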
Gene Regulation Networks
[Diagram: cell with inputs (peptides, other signals) and a gene network: genes, transcription, mRNA, translation, proteins, protein pool, metabolic pathways]
Division Steps
Cell Cycle Modeling
[Diagram: layered network with input I and nodes u1, u2, ..., u5; v; w1, w2; x1, ..., x6; y1, y2; z; p. Signals include vfp, w1f, w2f, w1fp, w2fp, x1f, x3f, x4f, x6f, x1fp, x6fp, y1f, y2f, y1fp, y2fp. Legend: forward signal, feedback to p, feedback to previous layer]
Modeling Dynamical Systems
Simulator Architecture
Knockout
[Diagram: the same layered network under knockout, with input I and nodes u1, u2, ..., u5; v; w1, w2; x1, ..., x6; y1, y2; z; p. Legend: forward signal, feedback to p, feedback to previous layer]
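A knockout experiment on a simulated network can be sketched with a toy model. The model below is hypothetical and much simpler than the layered network of the diagram: genes take values in {-1, 0, 1}, each update is the sign of a weighted sum of the current state, and a knocked-out gene is held at -1. The weights and the three-gene chain are invented for the example.

```python
import numpy as np

def step(state, weights, knocked_out=()):
    """One synchronous update of a ternary network with states in {-1, 0, 1}."""
    new_state = np.sign(weights @ state).astype(int)
    for gene in knocked_out:
        new_state[gene] = -1  # assumed convention: a knocked-out gene stays at -1
    return new_state

def simulate(initial_state, weights, n_steps, knocked_out=()):
    """Return the trajectory as an array with one row per time step."""
    state = np.array(initial_state, dtype=int)
    for gene in knocked_out:
        state[gene] = -1
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state, weights, knocked_out)
        trajectory.append(state)
    return np.array(trajectory)

# Toy chain: gene 0 sustains itself, gene 0 activates gene 1, gene 1 activates gene 2.
weights = np.array([[1, 0, 0],
                    [1, 0, 0],
                    [0, 1, 0]])
wild_type = simulate([1, 0, 0], weights, n_steps=3)
knockout = simulate([1, 0, 0], weights, n_steps=3, knocked_out=(1,))
print(wild_type[-1], knockout[-1])
```

Comparing the two trajectories shows which downstream genes depend on the knocked-out gene, here gene 2.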
Division Steps
Challenges
• Architecture identification
• Dynamics transition function identification
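The second challenge can be illustrated with a very small sketch. It assumes that the architecture is already known (the predictors of a target gene are given) and that expression has been quantized to {-1, 0, 1}; the transition function is then estimated as a lookup table holding, for each observed predictor state, the most frequent next value of the target. The function name and the trajectory are hypothetical.

```python
from collections import Counter, defaultdict

def identify_transition_table(trajectory, target, predictors):
    """Estimate the next-state rule of one gene from observed transitions."""
    counts = defaultdict(Counter)
    for current, following in zip(trajectory[:-1], trajectory[1:]):
        key = tuple(current[i] for i in predictors)
        counts[key][following[target]] += 1
    # Keep the most frequent next value for each observed predictor state.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Made-up quantized trajectory of three genes over four time steps.
trajectory = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [1, 1, 1]]
table = identify_transition_table(trajectory, target=2, predictors=[1])
print(table)
```

Predictor states that never occur in the data stay undefined, which is one reason identification from short time courses is hard.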