KDD-2001 Cup
The Genomics Challenge
Christos Hatzis, Silico Insights
David Page, University of Wisconsin
Co-chairs
August 26, 2001
Special thanks: DuPont Pharmaceuticals Research Laboratories for providing data set 1, Chris Kostas from Silico Insights for cleaning and organizing data sets 2 and 3
http://www.cs.wisc.edu/~dpage/kddcup2001/

The Genomics Challenge
• High-throughput technologies in genomics, proteomics and drug screening are creating large, complex datasets
• Bioinformatics datasets are typically under-determined
– very large number of features (complex domain)
– small number of instances (high cost per data point)
• Multi-relational nature of data
– reflects complex interactions between molecules, pathways and systems
– hierarchical organization of interacting layers
• Current tools and approaches do not adequately address the Genomics Challenge

Overview
• Cup organization
• Dataset description
– Thrombin binding
– Gene function/localization prediction
• Statistics
• Tasks and highlights
• Winners' talks (3 x 10 min)

Cup Organization
• KDD-2001 Cup web site
– Posting of datasets, Q&A, answer keys
• Schedule
– Training dataset available: May 31
– Question period 1: June 1-10
– Test set available: July 13
– Question period 2: July 13-24
– Entries due: July 26
– Winners notified: August 1
– Results to participants: August 7
• Evaluation criteria
– Task 1: weighted accuracy (average of true positive rate and true negative rate)
– Tasks 2, 3: non-weighted accuracy
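
The two evaluation criteria can be sketched in a few lines of Python. This is an illustrative sketch, not the organizers' scoring script: the function names and the example counts are assumptions chosen to show why the two measures can disagree when one class is rare.

```python
def accuracy(tp, fn, fp, tn):
    """Non-weighted accuracy: fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def weighted_accuracy(tp, fn, fp, tn):
    """Weighted accuracy: mean of the true positive and true negative rates."""
    true_positive_rate = tp / (tp + fn)
    true_negative_rate = tn / (tn + fp)
    return (true_positive_rate + true_negative_rate) / 2


# Hypothetical counts (not from the Cup data): 10 positives, 90 negatives,
# scored for a classifier that always answers "negative".
always_negative = dict(tp=0, fn=10, fp=0, tn=90)
plain = accuracy(**always_negative)
weighted = weighted_accuracy(**always_negative)
print(f"accuracy={plain:.2f} weighted={weighted:.2f}")
```

With these made-up counts the always-negative classifier looks strong on plain accuracy but scores no better than chance on weighted accuracy, which is one reason to prefer the weighted measure when few instances are positive.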

Dataset 1: Molecular Bioactivity
Dataset provided by DuPont Pharmaceuticals for the KDD-2001 Cup competition
• Activity of compounds binding to thrombin
• Library of compounds included:
– 1909 known molecules (42 actively binding thrombin)
• 139,351 binary features describe the 3-D structure of each compound
• 636 new compounds with unknown capacity to bind thrombin
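
A back-of-the-envelope estimate gives a feel for the size of the training matrix. The compound and feature counts come from the slide; the three storage layouts are assumptions made for illustration, since the slides do not say how any particular tool stores the data.

```python
N_COMPOUNDS = 1909      # training molecules (from the slide)
N_FEATURES = 139_351    # binary features per molecule (from the slide)

cells = N_COMPOUNDS * N_FEATURES

# Assumed storage layouts; which one a given tool uses is not stated in the slides.
bytes_per_cell = {
    "float64 (8 bytes per cell)": 8.0,
    "one byte per cell": 1.0,
    "bit-packed (1 bit per cell)": 1 / 8,
}

footprint_mb = {name: cells * size / 1e6 for name, size in bytes_per_cell.items()}
for name, mb in footprint_mb.items():
    print(f"{name}: {mb:,.0f} MB")
```

Under these assumptions a dense double-precision copy needs roughly 2.1 GB, while a bit-packed copy of the same binary matrix fits in a few tens of megabytes.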

Dataset 2: Protein Functional Annotation
• Yeast Genome dataset
– Data on the protein-protein interactions from MIPS database (Munich Information Centre for Protein Sequences)
– Expression profiles: DeRisi et al. (1997) Science 278: 680
• Relational dataset
– Gene information
– Interaction information
• Predict function, localization of unknown proteins
[Pie chart: annotation status of the 6449 total proteins]
– Known proteins: 52%
– Strong similarity to known protein: 4%
– Weak similarity to known protein: 13%
– Similarity to unknown protein: 16%
– Questionable ORFs: 7%
– No similarity: 8%

Statistics: I. Participation
• 136 unique groups, 200 total entries by about 300-400 participants
• Almost 5-fold increase over previous years
• More than half of the entries from commercial sector
[Bar chart: KDD Cup Participation, number of participant groups]
– Cup 97: 16
– Cup 98: 21
– Cup 99: 24
– Cup 2000: 30
– Cup 2001: 136
[Pie chart: Total by Affiliation (200 submissions)]
– Com: 107
– Gov: 7
– Univ: 66
– Other: 20
[Pie chart: Total by Task (200 submissions)]
– Thrombin: 114
– Function: 41
– Localization: 45

Statistics: II. Data Mining Software
Note: Statistics from 157 responders who provided details on their approach
• Mostly custom software was used
– Especially for Task 1, where the number of features was too large for most commercial systems
• Gap points to need for commercial tools that can cope with bioinformatics datasets
[Charts: software type by task, number of entries]
– Task 1: Custom 53, Public Domain 5, Commercial 21
– Task 2: Custom 16, Public Domain 6, Commercial 9
– Task 3: Custom 19, Public Domain 6, Commercial 12
– Total: Custom 88, Public Domain 17, Commercial 42

Statistics: III. Algorithms
• Feature selection used in almost 70% of the entries for Task 1
• Ensemble classifiers based on more than one algorithm used extensively
• Decision trees among the most commonly used, with Naïve Bayes and k-NN
• Cross-validation to deal with small dataset size
[Bar chart: fraction of entries by task (Task 1, Task 2, Task 3; y-axis 0 to 0.7) using each technique: Feature Selection, Feature Construction, Decision Tree, Ensemble Classifier, Naïve Bayes, k-Nearest Neighbor, Boosting, Neural Net, Association Rules, SVM, Bagging, Clustering, Statistical, Logistic Regression, Bayesian Net, Genetic Programming, Decision Table, Linear Regression, OLAP, ILP, Cross Validation]
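
As an illustration of what feature selection on binary features can look like, the sketch below ranks features by their mutual information with the class label and keeps the top k. The slides do not say which selection criteria the entrants used, so mutual information is an assumption here, and the data and function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    mi = 0.0
    for (f, y), count in joint.items():
        # p(f, y) * log2( p(f, y) / (p(f) * p(y)) )
        mi += (count / n) * math.log2(count * n / (feature_counts[f] * label_counts[y]))
    return mi


def select_top_k(columns, labels, k):
    """Return the indices of the k columns with the highest mutual information."""
    scores = [mutual_information(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy data (hypothetical): 3 binary features for 8 compounds.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = [
    [1, 1, 1, 1, 0, 0, 0, 0],   # identical to the label
    [1, 0, 1, 0, 1, 0, 1, 0],   # independent of the label
    [1, 1, 1, 0, 0, 0, 0, 1],   # partially informative
]
selected = select_top_k(columns, labels, k=2)
print(selected)
```

When selection like this is combined with cross-validation, the ranking is normally recomputed inside each training fold, so that the held-out fold does not influence which features are kept.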

Task 1 Highlights
• Test set was challenging: a second round of compounds made by chemists, so the distribution changed.
• Far more features than data points; can't run most commercial systems even with 1 GB of RAM.
• Varying degrees of correlation among features.
• Better than 60% weighted accuracy is impressive.
• Pure binary prediction task, yet the winner is a Bayes net learning system (after feature selection).

Tasks 2 & 3: Relational Prediction
[Diagram: the data sources available for prediction (Gene Sequence, Structural Motifs, Chromosomal Location, Gene/Protein Level Interactions, Gene Expression, Expression Clusters, Proteomic Clusters, Protein Interactions) and the two prediction targets, FUNCTION and LOCATION. The embedded heat map of values for Clusters 1-35 against Clusters A-E is not reproduced here.]

Task 2 Highlights
• Average of about 3 functions per protein.
• Multi-relational, as are many real-world databases.
• Yet top-scoring approaches were not pure relational learners.
• But top-scoring approaches did account for multi-relational structure of the data.
– Krogel: novel form of feature construction to capture relational information in a feature vector.
– Sese, Hayashi, and Morishita: instance-based learning, but using the interactions relation as part of the distance function.
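
One generic way to bring relational information into a flat representation is to aggregate attributes of a gene's interaction partners. The sketch below is only an illustration of that idea; it is not Krogel's feature construction or the Tokyo group's distance function, whose details are not given in the slides. The gene IDs, function names and helper functions are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy relations; gene IDs and function names are made up.
known_functions = {
    "G1": {"transport"},
    "G2": {"transport", "metabolism"},
    "G3": {"metabolism"},
}
interactions = [("G1", "G4"), ("G2", "G4"), ("G3", "G5")]


def neighbor_function_counts(gene, interactions, known_functions):
    """Flatten the interaction relation into per-gene features:
    how many interaction partners carry each known function."""
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    counts = Counter()
    for partner in partners[gene]:
        counts.update(known_functions.get(partner, ()))
    return counts


def predict_functions(gene, interactions, known_functions, min_votes=1):
    """Predict every function seen on at least min_votes partners."""
    counts = neighbor_function_counts(gene, interactions, known_functions)
    return {f for f, c in counts.items() if c >= min_votes}


features_g4 = neighbor_function_counts("G4", interactions, known_functions)
predicted_g4 = predict_functions("G4", interactions, known_functions, min_votes=2)
print(dict(features_g4), predicted_g4)
```

The counts returned by `neighbor_function_counts` can be used as ordinary columns in a feature vector, and because a gene may receive several functions, the prediction is a set rather than a single label.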

Task 3 Highlights
• Similar to Task 2, but only one localization per protein.
• Similar lessons.
• High overlap in top scorers for both tasks.
• Question: did anyone "bootstrap" by using their predictions for function to help predict localization, or vice versa?

KDD-2001 Cup Winners
• Task 1: Jie Cheng, CIBC
• Task 2: Mark-A. Krogel, Magdeburg Univ.
• Task 3: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo

Task 1 Winner
KDD Cup 2001 Results
Task 1: Thrombin
Name: Jie Cheng
Rank: 1
Weighted Accuracy: 68.4435
Accuracy: 71.1356
                   Predicted Positive   Predicted Negative   Total
Actual Positive    95                   55                   150
Actual Negative    128                  356                  484
Total              223                  411                  634
True Positive Rate: 63.3%
True Negative Rate: 73.6%
[Chart: Distribution of Prediction Accuracy Scores for Task 1: Thrombin Activity. Cumulative frequency (0 to 1) against score (30 to 100); the winning score of 68.444 is labeled at cumulative frequency 1.000.]
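
The reported Task 1 figures can be recomputed from the confusion matrix, which also confirms that its rows are the actual classes. This is a small check written for this transcript, not the organizers' scoring code; the variable names are assumptions.

```python
# Confusion matrix from the slide (rows: actual class, columns: predicted class).
tp, fn = 95, 55     # actual positives: 150
fp, tn = 128, 356   # actual negatives: 484

true_positive_rate = tp / (tp + fn)
true_negative_rate = tn / (fp + tn)
accuracy = (tp + tn) / (tp + fn + fp + tn)
weighted_accuracy = (true_positive_rate + true_negative_rate) / 2

print(f"TPR={true_positive_rate:.1%} TNR={true_negative_rate:.1%}")
print(f"accuracy={100 * accuracy:.4f} weighted={100 * weighted_accuracy:.4f}")
```

Both rates, the accuracy of 71.1356 and the weighted accuracy of 68.4435 agree with the values on the slide when the weighted accuracy is taken as the plain average of the two rates.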

Task 2 Winner
KDD Cup 2001 Results
Task 2: Function
Name: Mark-A. Krogel
Rank: 1
Accuracy: 93.6258
Weighted Accuracy: 84.8290
                   Predicted Positive   Predicted Negative   Total
Actual Positive    690                  282                  972
Actual Negative    58                   4304                 4362
Total              748                  4586                 5334
True Positive Rate: 71.0%
True Negative Rate: 98.7%
[Chart: Distribution of Prediction Accuracy Scores for Task 2: Function Prediction. Cumulative frequency (0 to 1) against score (60 to 100); the winning score of 93.626 is labeled at cumulative frequency 1.000.]

Task 3 Winner
KDD Cup 2001 Results
Task 3: Localization
Name: Hisashi Hayashi, Jun Sese, and Shinichi Morishita
Rank: 1
Accuracy: 72.1785
[Chart: Distribution of Prediction Accuracy Scores for Task 3: Localization Prediction. Cumulative frequency (0 to 1) against score (0 to 100); the winning score of 72.179 is labeled at cumulative frequency 1.000.]

KDD-2001 Honorable Mentions
Task 1: Silander, Univ. of Helsinki
Task 2: Lambert, Golden Helix; Sese & Hayashi & Morishita; Vogel & Srinivasan, A.I. Insight
Task 3: Schonlau & DuMouchel & Volinsky & Cortes, RAND and AT&T Labs; Frasca & Zheng & Parekh & Kohavi, Blue Martini