ICDM 2003 Review Data Analysis - with comparison between 02 and 03 - Xindong Wu and Alex Tuzhilin...

ICDM 2003 Review Data Analysis

- with comparison between 02 and 03 -

Xindong Wu and Alex Tuzhilin

Analyzed by Shusaku Tsumoto

Basic Statistics (Country)

• 37 countries, 486 Submissions

• Regular Papers: 58 (12%)

• Short Papers: 67 (14%)

• High Acceptance Ratio (Regular)
– Israel: 4/11 (36%)
– Hong Kong: 4/12 (33%)

Country     Total  Regular  Short  Acceptance Ratio
USA           189       35     28  33%
China          45        2      0   4%
Australia      29        3      5  28%
Canada         28        0      6  21%
Germany        19        2      4  32%
Japan          19        4      3  37%
France         18        1      2  17%
Taiwan         16        0      3  19%
Brazil         15        0      0   0%
Hong Kong      12        4      2  50%
UK             12        1      2  25%
Israel         11        4      2  55%
Italy           8        1      1  25%
Finland         7        1      1  29%
India           7        0      1  14%
Korea           6        0      1  17%
Top 15        441       58     61  27%
Total         486       58     67  26%
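The acceptance ratios in the country table appear to be (regular + short) / total, rounded to whole percent. The sketch below recomputes them from the counts copied from the table; the names `SUBMISSIONS`, `acceptance_ratio` and `regular_ratio` are illustrative and not part of the original analysis.

```python
# (total, regular, short) per country, copied from the country table.
SUBMISSIONS = {
    "USA": (189, 35, 28), "China": (45, 2, 0), "Australia": (29, 3, 5),
    "Canada": (28, 0, 6), "Germany": (19, 2, 4), "Japan": (19, 4, 3),
    "France": (18, 1, 2), "Taiwan": (16, 0, 3), "Brazil": (15, 0, 0),
    "Hong Kong": (12, 4, 2), "UK": (12, 1, 2), "Israel": (11, 4, 2),
    "Italy": (8, 1, 1), "Finland": (7, 1, 1), "India": (7, 0, 1),
    "Korea": (6, 0, 1),
}


def acceptance_ratio(total, regular, short):
    """Percentage of submissions accepted as regular or short papers."""
    return 100.0 * (regular + short) / total


def regular_ratio(total, regular):
    """Percentage of submissions accepted as regular papers only."""
    return 100.0 * regular / total


if __name__ == "__main__":
    for country, (total, regular, short) in SUBMISSIONS.items():
        overall = acceptance_ratio(total, regular, short)
        reg_only = regular_ratio(total, regular)
        print(f"{country:<10} overall {overall:5.1f}%  regular {reg_only:5.1f}%")
```

The listed rows sum to 441 submissions, 58 regular and 61 short papers, which matches the "Top 15" line.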

Comparison with 2002 (Top 5)

Country (2002)  Acceptance Ratio (2002)  Country (2003)  Acceptance Ratio (2003)
Hong Kong       64.7%                    Israel          55.0%
USA             47.9%                    Hong Kong       50.0%
Canada          45.5%                    Japan           37.0%
Finland         33.3%                    USA             33.0%
France          33.3%                    Germany         32.0%

Basic Statistics (Topics)

• Top 5 of Submissions:
– Mining text and semi-structured data, and mining temporal, spatial and multimedia data
– Data mining and machine learning algorithms and methods in traditional areas and in new areas
– Data mining applications in electronic commerce, bioinformatics, computer security, Web intelligence
– Soft computing and uncertainty management
– Data pre-processing, data reduction, feature selection and feature transformation

• High Acceptance Ratio (Regular)
– Statistics and probability in large-scale data mining
– Security, privacy and social impact of data mining

Total  Regular  Short  Acceptance Ratio  Topic
   81       10     12  27%               Mining text and semi-structured data, and mining temporal, spatial and multimedia data
   77       11      8  25%               Data mining and machine learning algorithms and methods in traditional areas (such as classification, regression, clustering, probabilistic modeling, and association analysis), and in new areas
   61        5      6  18%               Data mining applications in electronic commerce, bioinformatics, computer security, Web intelligence, intelligent learning database system
   46        2      9  24%               Soft computing (including neural networks, fuzzy logic, evolutionary computation, and rough sets) and uncertainty management for data mining
   41        3      5  20%               Data pre-processing, data reduction, feature selection and feature transformation
   30        4      4  27%               Complexity, efficiency, and scalability issues in data mining
   21        1      4  24%               Others
   18        2      1  17%               Foundations of data mining
   16        3      1  25%               Data and knowledge representation for data mining
   16        3      3  38%               Human-machine interaction and visualization in data mining, and visual data mining
   16        2      3  31%               Quality assessment and interestingness metrics of data mining results
   15        6      1  47%               Statistics and probability in large-scale data mining
   12        1      2  25%               High performance and distributed data mining
   11        1      3  36%               Post-processing of data mining results
    8        1      0  13%               Pattern recognition and scientific discovery
    7        2      2  57%               Security, privacy and social impact of data mining
    5        0      0   0%               Integration of data warehousing, OLAP and data mining
    5        1      3  80%               Process-centric data mining and models of data mining process
  486       58     67  26%               Total

Comparison with 2002 (Top 5)

Top 5 in 2002  Acceptance Ratio  Top 5 in 2003               Acceptance Ratio
Graph Mining   75.0%             Process-centric DM          80.0%
Temporal Data  52.6%             Security, privacy           57.0%
Theory         42.9%             Statistics and Probability  47.0%
Text Mining    42.1%             Visual Data Mining          38.0%
Rule           41.7%             Post-processing             36.0%

Review Scores

[Figure: histogram of 2003 review scores (SCORE). Score axis from 0.00 to 5.00 in steps of 0.50; frequency axis from 0 to 120. Std. dev. = 0.92, mean = 2.32, N = 486.]

[Figure: histogram of 2002 review scores (SCORE2). Score axis from 0.00 to 4.50 in steps of 0.50; frequency axis from 0 to 100. Std. dev. = 0.90, mean = 2.35, N = 347.]

Year  N    Average  SD
2002  347  2.39     0.90
2003  486  2.32     0.92

Box Plot

[Figure: box plots of total review score (TOTAL_SC) by year (TOTAL_YE) for 2003 (N = 486) and 2002 (N = 347); score axis from -1 to 6.]

Comparison with 2002

• Country vs Final Decision (2002 => 2003)
– Regular: Hong Kong => Hong Kong, Israel
– Short: USA => ?
– Reject: Japan, Taiwan => Most of the countries

• Topics vs Final Decision (2002 => 2003)
– Regular: Temporal, Text => Statistics and Probability, Visualization
– Short: Similarity => Postprocessing
– Reject: Bayesian => Feature Selection

Correspondence Analysis (Country vs Final Decision)

[Figure: correspondence analysis map, r1=0.325, r2=0.235. Axis ticks run from -2 to 5 and from -1.5 to 3. Decision points: Reject, Regular, Short. Labelled countries: Belgium, Israel, Hong Kong, USA, China, Brazil, France, Poland, Japan.]

Correspondence Analysis (Topics vs Final Decision)

[Figure: correspondence analysis map, r1=0.218, r2=0.200. Axis ticks run from -2.5 to 2 and from -3 to 2. Decision points: Reject, Short, Regular. Labelled topics: Statistics and probability; Security, privacy; Process-centric DM; Integration of DTW, OLAP and DM; Post-processing; Human-machine interaction and visualization; Feature Selection.]

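Correspondence Analysis (# of Authors vs Final Decision)

[Figure: correspondence analysis map. Axis ticks run from -1 to 3.5 and from 0 to 0.8. Decision points: Reject, Short, Regular. Points labelled by number of authors: 1, 2, 3, 4, 5, 6. The slide also carries the labels Process-centric DM, Human-machine interaction and visualization, r1=0.218 and r2=0.200.]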

Correspondence Analysis Summaries

• Country vs Final Decision
– Regular: Hong Kong, Israel
– Short: ?
– Reject: Most of the countries are located near this region.

• Topics vs Final Decision
– Regular: Statistics and Probability, Visualization
– Short: Postprocessing
– Reject: Feature Selection

• # of Authors vs Final Decision
– 1 or 4: Regular
– 2 or 3: between Short and Regular
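As a rough illustration of what the topics-vs-decision map summarizes, the sketch below runs a bare-bones correspondence analysis on the topic table (Reject is taken as Total minus Regular minus Short). It assumes that r1 and r2 on the slides are the singular values of the standardized residual matrix; the slides do not say so explicitly. All function names are illustrative. Because the table has only three decision columns, the two non-trivial singular values can be obtained from a quadratic instead of a full SVD.

```python
import math

# (total, regular, short) per topic, copied from the topic table in table order.
TOPIC_COUNTS = [
    (81, 10, 12), (77, 11, 8), (61, 5, 6), (46, 2, 9), (41, 3, 5),
    (30, 4, 4), (21, 1, 4), (18, 2, 1), (16, 3, 1), (16, 3, 3),
    (16, 2, 3), (15, 6, 1), (12, 1, 2), (11, 1, 3), (8, 1, 0),
    (7, 2, 2), (5, 0, 0), (5, 1, 3),
]


def contingency(rows):
    """Turn (total, regular, short) into [reject, regular, short] counts."""
    return [[t - r - s, r, s] for t, r, s in rows]


def standardized_residuals(table):
    """S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j) for a count table."""
    n = sum(map(sum, table))
    row_mass = [sum(row) / n for row in table]
    col_mass = [sum(col) / n for col in zip(*table)]
    return [
        [(x / n - rm * cm) / math.sqrt(rm * cm) for x, cm in zip(row, col_mass)]
        for row, rm in zip(table, row_mass)
    ]


def singular_values_3col(table):
    """Return (s1, s2, total_inertia) for a table with exactly 3 columns."""
    s = standardized_residuals(table)
    # A = S^T S is 3x3 with one zero eigenvalue, so the other two solve
    # lambda^2 - trace * lambda + (sum of principal 2x2 minors) = 0.
    a = [[sum(row[i] * row[j] for row in s) for j in range(3)] for i in range(3)]
    trace = a[0][0] + a[1][1] + a[2][2]
    minors = (a[0][0] * a[1][1] - a[0][1] ** 2
              + a[0][0] * a[2][2] - a[0][2] ** 2
              + a[1][1] * a[2][2] - a[1][2] ** 2)
    disc = math.sqrt(max(trace * trace - 4.0 * minors, 0.0))
    lam1, lam2 = (trace + disc) / 2.0, (trace - disc) / 2.0
    return math.sqrt(lam1), math.sqrt(max(lam2, 0.0)), trace


if __name__ == "__main__":
    s1, s2, inertia = singular_values_3col(contingency(TOPIC_COUNTS))
    print(f"s1={s1:.3f} s2={s2:.3f} total inertia={inertia:.4f}")
```

The total inertia equals the chi-square statistic of the table divided by N, and the two squared singular values sum to it. For the topic table this total comes out at roughly 0.088, close to 0.218² + 0.200², which is consistent with the assumption about r1 and r2.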

Correspondence Analysis in 2002 (Country vs Final Decision)

• Rule: [R1=0] [R_2=0]: | [R_1=0] | | [R_2=0] |
• Rule Relations between Sets
• Relations between supporting sets are very important.
– Rough Set / Granular Computing
• Index for Rule Induction:
– P(R2|R1), P(R1|R2), or f(P(R2|R1))
– Relation between Information Granules

[Figure: correspondence analysis map for 2002. Axis ticks run from -4 to 3 and from -2.5 to 1.5. Decision points: Reject, Short, Regular. Labelled countries: Hong Kong, Austria, Japan, Taiwan, Australia, Finland, USA, Canada, China, Thailand.]

Correspondence Analysis in 2002 (Category vs Final Decision)

[Figure: correspondence analysis map for 2002. Axis ticks run from -5 to 2 and from -3 to 1.5. Decision points: Reject, Short, Regular. Labelled categories: Bayesian, Statistics, Similarity, Interestingness, Active Learning, Theory, Temporal, Web Mining, Structured, Text Mining, SVM, Rule, Tree, Applications, Association R.]


Rule Mining

• Datasets
– Sample Size: 486
– Attributes: 5
  • Paper No.: ordered by submission date
  • # of Authors
  • # of Characters in Title
  • Country
  • Category
– Analyzed by Clementine 7.1

Rule Mining (2)

• C5.0
– [FINAL=long] <= [Country=Israel] & [# of Authors > 2] & [# of Chars in Title <= 75.0] (Confidence: 0.667, Support: 3)
– [FINAL=Reject] <= [# of Authors > 4] & [Paper No. > 117] & [# of Chars in Title > 71.0] (Confidence: 0.857, Support: 10)

• # of Authors, Paper No., # of Chars in Title: Important Features
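To make the two C5.0 rules easier to read, the sketch below encodes them as a plain Python function. The field names (`paper_no`, `n_authors`, `title_chars`, `country`) and the three example submissions are hypothetical, invented only to exercise the rules; "long" is assumed to denote a regular (long) paper.

```python
def c50_rules_2003(paper):
    """Apply the two quoted C5.0 rules; return None if neither fires."""
    # [FINAL=long] <= [Country=Israel] & [# of Authors > 2] & [title chars <= 75]
    if (paper["country"] == "Israel"
            and paper["n_authors"] > 2
            and paper["title_chars"] <= 75.0):
        return "long"
    # [FINAL=Reject] <= [# of Authors > 4] & [Paper No. > 117] & [title chars > 71]
    if (paper["n_authors"] > 4
            and paper["paper_no"] > 117
            and paper["title_chars"] > 71.0):
        return "Reject"
    return None


# Hypothetical submissions (not taken from the ICDM data).
example_a = {"paper_no": 300, "n_authors": 3, "title_chars": 60, "country": "Israel"}
example_b = {"paper_no": 200, "n_authors": 5, "title_chars": 80, "country": "USA"}
example_c = {"paper_no": 50, "n_authors": 2, "title_chars": 40, "country": "Japan"}

if __name__ == "__main__":
    for name, paper in [("a", example_a), ("b", example_b), ("c", example_c)]:
        print(name, c50_rules_2003(paper))
```

Most submissions match neither rule, which is why the function returns `None` by default: the rules describe small subsets (support 3 and 10), not a complete classifier.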

Rule Mining (3)

• Generalized Rule Induction
– [FINAL=Reject] <= [Paper No. < 67.5] (Confidence: 90%, Support: 10.7%)
– [FINAL=Reject] <= [Paper No. < 54.5] & [# of Chars in Title > 49.5] (Confidence: 100%, Support: 4.73%)
– [FINAL=long] <= [Country=Israel] & [# of Chars in Title > 61.5] (Confidence: 60%, Support: 1.03%)

• Paper No., # of Chars in Title: Important Features
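The percentage supports of the Generalized Rule Induction rules can be turned back into paper counts. The sketch below assumes that support is the share of all 486 submissions matching a rule's conditions and that confidence is the share of those matches with the stated decision; the slides do not define either term, so treat the counts as an interpretation. The names `GRI_RULES` and `covered_and_correct` are illustrative.

```python
N_PAPERS = 486  # sample size stated on the "Rule Mining" slide

# (rule text, confidence, support as a fraction), copied from the slide.
GRI_RULES = [
    ("Reject <= Paper No. < 67.5", 0.90, 0.107),
    ("Reject <= Paper No. < 54.5 & title chars > 49.5", 1.00, 0.0473),
    ("long <= Country = Israel & title chars > 61.5", 0.60, 0.0103),
]


def covered_and_correct(confidence, support, n=N_PAPERS):
    """Approximate (papers matching the conditions, papers with the decision)."""
    covered = round(support * n)
    correct = round(confidence * covered)
    return covered, correct


if __name__ == "__main__":
    for rule, confidence, support in GRI_RULES:
        covered, correct = covered_and_correct(confidence, support)
        print(f"{rule}: about {correct} of {covered} matching papers")
```

Under these assumptions the three rules cover about 52, 23 and 5 papers respectively, so the Israel rule rests on very few submissions.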

Rule Mining in 2002

• C5.0
– [# of Chars in Title > 43] => Rejected (Conf.: 0.669, Support: 303)
– [Paper No. <= 722] & [Country = USA] & [Category = Temporal Data Mining] => Regular (Conf.: 0.833, Support: 4)

Rule Mining in 2002

• (Association) Rules
– Rejected <= [Paper No. < 542.5] (Conf.: 0.88, Support: 41)
– Rejected <= [Paper No. < 542.5] & [# of Chars > 53.5] (Conf.: 0.833, Support: 29)
– Regular <= [Country = Canada] & [Category = Text Mining] (Conf.: 0.6, Support: 5)

• Paper No., Country, Category: Important Features

Comparison with 2002

• Important Features in 2003
– # of Authors, Paper No., # of Chars
– Early 57 papers, Long Titles, 2 authors

• Important Features in 2002
– Paper No., # of Chars, Country, Category
– Early 52 papers, Long Titles

Conclusions

• Do not submit a paper too hastily!
– Reflection is needed not only on the contents, but also on the title.

• Mining Text/Web/Semi-structured Data is very popular now.

• Statistics and Probability is a very strong topic.

• Security and Privacy Issues are becoming stronger.

• Visualization/Interaction topics are emerging in ICDM 2003:
– Visualization/Human-Machine Interaction
– Postprocessing of DM Results
– Process-centric DM
