KDD-2001 Cup The Genomics Challenge Christos Hatzis, Silico Insights

KDD-2001 Cup
The Genomics Challenge

Christos Hatzis, Silico Insights
David Page, University of Wisconsin
Co-chairs

August 26, 2001

Special thanks: DuPont Pharmaceuticals Research Laboratories for providing data set 1, and Chris Kostas from Silico Insights for cleaning and organizing data sets 2 and 3

http://www.cs.wisc.edu/~dpage/kddcup2001/


The Genomics Challenge

• High-throughput technologies in genomics, proteomics and drug screening are creating large, complex datasets
• Bioinformatics datasets are typically under-determined
– very large number of features (complex domain)
– small number of instances (high cost per data point)
• Multi-relational nature of data
– reflect complex interactions between molecules, pathways and systems
– hierarchical organization of interacting layers
• Current tools and approaches do not adequately address the Genomics Challenge


Overview

• Cup organization
• Dataset description
– Thrombin binding
– Gene function/localization prediction
• Statistics
• Tasks and highlights
• Winners talk (3x10 min)


Cup Organization

• KDD-2001 Cup web site
– Posting of datasets, Q&A, answer keys
• Schedule
– Training dataset available: May 31
– Question period 1: June 1-10
– Test set available: July 13
– Question period 2: July 13-24
– Entries due: July 26
– Winners notified: August 1
– Results to participants: August 7
• Evaluation criteria
– Task 1: weighted accuracy (average of true positive rate and true negative rate)
– Tasks 2, 3: non-weighted accuracy
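The two evaluation criteria can be made concrete with a short sketch. The function names are illustrative and are not taken from the Cup's scoring scripts; the counts in the usage lines are the Task 1 winner's confusion matrix as reported on the "Task 1 Winner" slide.

```python
def accuracy(tp, fn, fp, tn):
    """Non-weighted accuracy: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)


def weighted_accuracy(tp, fn, fp, tn):
    """Weighted accuracy: mean of true positive rate and true negative rate."""
    tpr = tp / (tp + fn)  # fraction of actual positives predicted positive
    tnr = tn / (tn + fp)  # fraction of actual negatives predicted negative
    return (tpr + tnr) / 2


# Task 1 winner: 95 true positives, 55 false negatives,
# 128 false positives, 356 true negatives.
print(round(100 * accuracy(95, 55, 128, 356), 4))           # 71.1356
print(round(100 * weighted_accuracy(95, 55, 128, 356), 4))  # 68.4435
```

Because only 150 of the 634 scored test compounds are active, weighted accuracy keeps a classifier from scoring well simply by predicting "inactive" for everything: that strategy gets about 76% plain accuracy but only 50% weighted accuracy.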


Dataset 1: Molecular Bioactivity

Dataset provided by DuPont Pharmaceuticals for the KDD-2001 Cup competition

• Activity of compounds binding to thrombin
• Library of compounds included:
– 1909 known molecules (42 actively binding thrombin)
• 139,351 binary features describe the 3-D structure of each compound
• 636 new compounds with unknown capacity to bind thrombin
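A back-of-the-envelope calculation shows why a table of this shape is awkward for tools that load everything densely. The per-cell sizes below (one byte, one 64-bit float, one bit) are assumptions chosen for illustration; the slides do not say how any particular tool stores its data.

```python
n_compounds = 1909     # training molecules
n_features = 139_351   # binary features per molecule

cells = n_compounds * n_features


def dense_megabytes(bytes_per_cell):
    """Size of a dense compounds-by-features table in MB (10**6 bytes)."""
    return cells * bytes_per_cell / 1e6


print(cells)               # 266021059 cells in the training table
print(dense_megabytes(1))  # one byte per cell
print(dense_megabytes(8))  # one 64-bit float per cell
print(cells / 8 / 1e6)     # bit-packed, one bit per cell
```

Under these assumptions a dense table of 64-bit floats needs roughly 2 GB for the training set alone, while a bit-packed or sparse layout fits in tens of megabytes.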


Dataset 2: Protein Functional Annotation

• Yeast Genome dataset
– Data on the protein-protein interactions from MIPS database (Munich Information Centre for Protein Sequences)
– Expression profiles: DeRisi et al. (1997) Science 278: 680
• Relational dataset
– Gene information
– Interaction information
• Predict function, localization of unknown proteins

[Pie chart: breakdown of the 6449 total proteins]
– Known Proteins: 52%
– Strong Similarity to Known Protein: 4%
– Weak Similarity to Known Protein: 13%
– Similarity to Unknown Protein: 16%
– Questionable ORFs: 7%
– No Similarity: 8%


Statistics: I. Participation

• 136 unique groups, 200 total entries by about 300-400 participants
• Almost 5-fold increase over previous years
• More than half of the entries from commercial sector

KDD Cup Participation (number of participant groups)

Cup 97     16
Cup 98     21
Cup 99     24
Cup 2000   30
Cup 2001   136

Total by Affiliation (200 submissions)

Com     107
Gov     7
Univ    66
Other   20

Total by Task (200 submissions)

Thrombin       114
Function       41
Localization   45


Statistics: II. Data Mining Software

Note: Statistics from 157 responders who provided details on their approach

• Mostly custom software was used
• Especially for task 1, where the number of features was too large for most commercial systems
• Gap points to need for commercial tools that can cope with bioinformatics datasets

Software type by task

          Custom   Public Domain   Commercial
Task 1    53       5               21
Task 2    16       6               9
Task 3    19       6               12
Total     88       17              42


Statistics: III. Algorithms

• Feature selection used in almost 70% of the entries for Task 1
• Ensemble classifiers based on more than one algorithm used extensively
• Decision trees among the most commonly used, with Naïve Bayes and k-NN
• Cross-validation to deal with small dataset size
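With only 42 actives among the 1909 Task 1 training compounds, purely random folds can leave a fold with very few positives. A minimal sketch of stratified k-fold assignment in plain Python follows; the function name, fold count and seed are illustrative assumptions, not a description of any entrant's procedure.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign each example index to one of k folds, balancing classes.

    Indices of each class are shuffled and dealt round-robin, so every
    fold receives nearly the same number of examples from each class.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


# Class balance of the Task 1 training set: 42 active, 1867 inactive.
labels = [1] * 42 + [0] * (1909 - 42)
folds = stratified_folds(labels, k=5)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # fold size, number of actives in the fold
```

Each fold serves once as the held-out set while the other four are used for training, so every active compound is used for validation exactly once.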

[Bar chart: fraction of entries (0 to 0.7) using each technique, by task (Task 1, Task 2, Task 3). Techniques shown: Feature Selection, Feature Construction, Decision Tree, Ensemble Classifier, Naïve Bayes, k-Nearest Neighbor, Boosting, Neural Net, Association Rules, SVM, Bagging, Clustering, Statistical, Logistic Regression, Bayesian Net, Genetic Programming, Decision Table, Linear Regression, OLAP, ILP, Cross Validation.]


Task 1 Highlights

• Test set was challenging: a second round of compounds made by chemists, so the distribution changed.
• Far more features than data points; can’t run most commercial systems even with 1 GB of RAM.
• Varying degrees of correlation among features.
• Better than 60% weighted accuracy is impressive.
• Pure binary prediction task, yet the winner is a Bayes net learning system (after feature selection).
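Feature selection was the usual first step for Task 1. One simple filter ranks each binary feature by its mutual information with the class label and keeps the top few. The sketch below illustrates that generic filter only; the slides do not describe the winner's actual selection procedure, and the sparse row layout (a set of active feature indices per compound) and all names are assumptions.

```python
import math
from collections import Counter


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between a binary feature and a binary label.

    n11: feature=1, label=1    n10: feature=1, label=0
    n01: feature=0, label=1    n00: feature=0, label=0
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for joint, f_total, y_total in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if joint:  # a zero cell contributes nothing (0 * log 0 taken as 0)
            mi += (joint / n) * math.log2(joint * n / (f_total * y_total))
    return mi


def top_features(rows, labels, k):
    """rows: list of sets of active feature indices; labels: list of 0/1.

    Only features that are active in at least one row are scored;
    a feature that is never active carries no information anyway.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    on_pos, on_neg = Counter(), Counter()
    for active, y in zip(rows, labels):
        (on_pos if y else on_neg).update(active)
    scores = {}
    for f in set(on_pos) | set(on_neg):
        n11, n10 = on_pos[f], on_neg[f]
        scores[f] = mutual_information(n11, n10, n_pos - n11, n_neg - n10)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: feature 7 matches the label exactly, feature 3 is uninformative.
rows = [{7, 3}, {7}, {3}, set()]
labels = [1, 1, 0, 0]
print(top_features(rows, labels, 1))  # [7]
```

Storing each compound as a set of active indices means the cost scales with the number of set bits rather than with all 139,351 columns. A univariate filter like this ignores the correlation among features noted above, so it is a starting point rather than a complete method.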


Tasks 2 & 3: Relational Prediction

[Figure: data sources combined to predict FUNCTION and LOCATION: gene sequence, structural motifs, chromosomal location, gene/protein-level interactions, gene expression (a table of values for 35 expression clusters across five columns labeled Cluster D, B, E, C and A), proteomic clusters, and protein interactions.]



• Average of about 3 functions per protein.
• Multi-relational, as are many real-world databases.
• Yet top-scoring approaches were not pure relational learners.
• But top-scoring approaches did account for multi-relational structure of the data.
– Krogel: novel form of feature construction to capture relational information in a feature vector.
– Sese, Hayashi, and Morishita: instance-based learning, but using the interactions relation as part of the distance function.
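Both top approaches fold the interaction relation into an otherwise flat learner. A toy sketch of the general idea of relational feature construction: for each gene, count how many of its interaction partners carry each known function label, giving a fixed-length vector that any propositional learner can use. This illustrates the idea only and is not Krogel's algorithm; the gene identifiers, function names and helper name are made up.

```python
from collections import Counter, defaultdict


def neighbor_label_features(interactions, known_labels, label_order):
    """Build one count vector per gene from its interaction partners.

    interactions: iterable of (gene_a, gene_b) pairs, treated as undirected.
    known_labels: dict mapping a training gene to its set of labels.
    label_order:  list fixing the position of each label in the vector.
    """
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    features = {}
    for gene, neighbors in partners.items():
        counts = Counter()
        for other in neighbors:
            # Partners without known labels contribute nothing.
            counts.update(known_labels.get(other, ()))
        features[gene] = [counts[label] for label in label_order]
    return features


# Hypothetical toy data (not taken from the Cup dataset).
interactions = [("G1", "G2"), ("G1", "G3"), ("G2", "G3"), ("G4", "G1")]
known_labels = {"G2": {"transport"}, "G3": {"transport", "metabolism"}}
vectors = neighbor_label_features(
    interactions, known_labels, ["transport", "metabolism"]
)
print(vectors["G1"])  # [2, 1]
```

The same partner sets could also feed an instance-based learner, for example by making two genes count as closer when they interact directly or share partners; that variant is likewise an assumption here, not a description of the Tokyo entry.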



• Similar to Task 2, but only one localization per protein.
• Similar lessons.
• High overlap in top scorers for both tasks.
• Question: did anyone “bootstrap” by using their predictions for function to help predict localization, or vice-versa?


KDD-2001 Cup Winners

• Task 1: Jie Cheng, CIBC
• Task 2: Mark-A. Krogel, Magdeburg Univ.
• Task 3: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo


Task 1 Winner

KDD Cup 2001 Results
Task 1: Thrombin

Name: Jie Cheng
Rank: 1
Weighted Accuracy: 68.4435
Accuracy: 71.1356

                   Predicted
                   Positive   Negative   Total
Actual Positive    95         55         150
Actual Negative    128        356        484
Total              223        411        634

True Positive Rate: 63.3%
True Negative Rate: 73.6%

[Figure: Distribution of Prediction Accuracy Scores for Task 1: Thrombin Activity. Cumulative frequency (0 to 1) versus score (30 to 100); the winning score, 68.444, is marked at cumulative frequency 1.000.]



KDD Cup 2001 Results
Task 2: Function

Name: Mark-A. Krogel
Rank: 1
Accuracy: 93.6258
Weighted Accuracy: 84.8290

                   Predicted
                   Positive   Negative   Total
Actual Positive    690        282        972
Actual Negative    58         4304       4362
Total              748        4586       5334

True Positive Rate: 71.0%
True Negative Rate: 98.7%

[Figure: Distribution of Prediction Accuracy Scores for Task 2: Function Prediction. Cumulative frequency (0 to 1) versus score (60 to 100); the winning score, 93.626, is marked at cumulative frequency 1.000.]



KDD Cup 2001 Results
Task 3: Localization

Name: Hisashi Hayashi, Jun Sese, and Shinichi Morishita
Rank: 1
Accuracy: 72.1785

[Figure: Distribution of Prediction Accuracy Scores for Task 3: Localization Prediction. Cumulative frequency (0 to 1) versus score (0 to 100); the winning score, 72.179, is marked at cumulative frequency 1.000.]


KDD-2001 Honorable Mentions

Task 1: Silander, Univ. of Helsinki
Task 2: Lambert, Golden Helix; Sese & Hayashi & Morishita; Vogel & Srinivasan, A.I. Insight
Task 3: Schonlau & DuMouchel & Volinsky & Cortes, RAND and AT&T Labs; Frasca & Zheng & Parekh & Kohavi, Blue Martini

