ICDM 2003 Review Data Analysis
- with comparison between 02 and 03 -
Xindong Wu and Alex Tuzhilin
Analyzed by Shusaku Tsumoto
Basic Statistics (Country)
• 37 countries, 486 Submissions
• Regular Papers: 58 (12%)
• Short Papers: 67 (14%)
• High Acceptance Ratio (Regular)
– Israel: 4/11 (36%)
– Hong Kong: 4/12 (33%)
Country      Total  Regular  Short  Acceptance Ratio
USA            189       35     28  33%
China           45        2      0   4%
Australia       29        3      5  28%
Canada          28        0      6  21%
Germany         19        2      4  32%
Japan           19        4      3  37%
France          18        1      2  17%
Taiwan          16        0      3  19%
Brazil          15        0      0   0%
Hong Kong       12        4      2  50%
UK              12        1      2  25%
Israel          11        4      2  55%
Italy            8        1      1  25%
Finland          7        1      1  29%
India            7        0      1  14%
Korea            6        0      1  17%
Top 15         441       58     61  27%
Total          486       58     67  26%
Comparison with 2002 (Top 5)
Country (2002)  Acceptance Ratio (2002)  Country (2003)  Acceptance Ratio (2003)
Hong Kong       64.7%                    Israel          55.0%
USA             47.9%                    Hong Kong       50.0%
Canada          45.5%                    Japan           37.0%
Finland         33.3%                    USA             33.0%
France          33.3%                    Germany         32.0%
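The acceptance ratios in the 2003 country table appear to be (Regular + Short) / Total. The following is a minimal sketch under that assumption, using the counts from the table; the names `SUBMISSIONS_2003`, `acceptance_ratio` and `top_countries` are illustrative and do not come from the slides.

```python
# (total, regular, short) per country, copied from the 2003 country table.
SUBMISSIONS_2003 = {
    "USA": (189, 35, 28),
    "China": (45, 2, 0),
    "Australia": (29, 3, 5),
    "Canada": (28, 0, 6),
    "Germany": (19, 2, 4),
    "Japan": (19, 4, 3),
    "France": (18, 1, 2),
    "Taiwan": (16, 0, 3),
    "Brazil": (15, 0, 0),
    "Hong Kong": (12, 4, 2),
    "UK": (12, 1, 2),
    "Israel": (11, 4, 2),
    "Italy": (8, 1, 1),
    "Finland": (7, 1, 1),
    "India": (7, 0, 1),
    "Korea": (6, 0, 1),
}


def acceptance_ratio(total, regular, short):
    """Fraction of submissions accepted as regular or short papers (assumed definition)."""
    return (regular + short) / total


def top_countries(table, k=5):
    """Return the k countries with the highest acceptance ratio."""
    ranked = sorted(table, key=lambda c: acceptance_ratio(*table[c]), reverse=True)
    return ranked[:k]


for country in top_countries(SUBMISSIONS_2003):
    ratio = acceptance_ratio(*SUBMISSIONS_2003[country])
    print(f"{country}: {ratio:.1%}")
```

Under this definition the ranking reproduces the 2003 column of the Top 5 comparison (Israel, Hong Kong, Japan, USA, Germany).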
Basic Statistics (Topics)
• Top 5 of Submissions:
– Mining text and semi-structured data, and mining temporal, spatial and multimedia data
– Data mining and machine learning algorithms and methods in traditional areas and in new areas
– Data mining applications in electronic commerce, bioinformatics, computer security, Web intelligence
– Soft computing and uncertainty management
– Data pre-processing, data reduction, feature selection and feature transformation
• High Acceptance Ratio (Regular)
– Statistics and probability in large-scale data mining
– Security, privacy and social impact of data mining
Total  Regular  Short  Acceptance Ratio  Topic
   81       10     12  27%               Mining text and semi-structured data, and mining temporal, spatial and multimedia data
   77       11      8  25%               Data mining and machine learning algorithms and methods in traditional areas (such as classification, regression, clustering, probabilistic modeling, and association analysis), and in new areas
   61        5      6  18%               Data mining applications in electronic commerce, bioinformatics, computer security, Web intelligence, intelligent learning database system
   46        2      9  24%               Soft computing (including neural networks, fuzzy logic, evolutionary computation, and rough sets) and uncertainty management for data mining
   41        3      5  20%               Data pre-processing, data reduction, feature selection and feature transformation
   30        4      4  27%               Complexity, efficiency, and scalability issues in data mining
   21        1      4  24%               Others
   18        2      1  17%               Foundations of data mining
   16        3      1  25%               Data and knowledge representation for data mining
   16        3      3  38%               Human-machine interaction and visualization in data mining, and visual data mining
   16        2      3  31%               Quality assessment and interestingness metrics of data mining results
   15        6      1  47%               Statistics and probability in large-scale data mining
   12        1      2  25%               High performance and distributed data mining
   11        1      3  36%               Post-processing of data mining results
    8        1      0  13%               Pattern recognition and scientific discovery
    7        2      2  57%               Security, privacy and social impact of data mining
    5        0      0   0%               Integration of data warehousing, OLAP and data mining
    5        1      3  80%               Process-centric data mining and models of data mining process
  486       58     67  26%               Total
Comparison with 2002 (Top 5)
Top 5 in 2002  Acceptance Ratio  Top 5 in 2003               Acceptance Ratio
Graph Mining   75.0%             Process-centric DM          80.0%
Temporal Data  52.6%             Security, privacy           57.0%
Theory         42.9%             Statistics and Probability  47.0%
Text Mining    42.1%             Visual Data Mining          38.0%
Rule           41.7%             Post-processing             36.0%
Review Scores
[Figure: histogram of review scores (SCORE), 2003. Horizontal axis: score, 0.00 to 5.00; vertical axis: frequency. Mean = 2.32, standard deviation = 0.92, valid N = 486.]
[Figure: histogram of review scores (SCORE2), 2002. Horizontal axis: score, 0.00 to 4.50; vertical axis: frequency. Mean = 2.35, standard deviation = 0.90, valid N = 347.]

          2002  2003
N          347   486
Average   2.39  2.32
SD        0.90  0.92
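One way to judge whether the gap between the two yearly averages is large is a Welch t statistic computed from the summary table alone (N, Average, SD). This is a rough sketch, not part of the original analysis, and it assumes the scores of the two years are independent samples; `welch_t` is an illustrative helper.

```python
import math


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch t statistic for two independent samples, from summary statistics only."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Values from the 2002 / 2003 summary table.
t_stat = welch_t(2.39, 0.90, 347, 2.32, 0.92, 486)
print(f"t = {t_stat:.2f}")
```

The statistic comes out at roughly 1.1, so under these assumptions the average scores of the two years differ by only about one standard error.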
Box Plot
[Figure: box plots of total review score (TOTAL_SC) by year (TOTAL_YE). Valid N: 347 for 2002, 486 for 2003. Score axis runs from -1 to 6.]
Comparison with 2002
• Country vs Final Decision
– Regular: Hong Kong => Hong Kong, Israel
– Short: USA => ?
– Reject: Japan, Taiwan => Most of the countries
• Topics vs Final Decision
– Regular: Temporal, Text => Statistics and Probability, Visualization
– Short: Similarity => Postprocessing
– Reject: Bayesian => Feature Selection
Correspondence Analysis (Country vs Final Decision)
[Figure: correspondence analysis map of countries against final decision (Reject, Regular, Short). Labeled points: Belgium, Israel, Hong Kong, USA, China, Brazil, France, Poland, Japan. r1 = 0.325, r2 = 0.235.]
Correspondence Analysis (Topics vs Final Decision)
[Figure: correspondence analysis map of topics against final decision (Reject, Short, Regular). Labeled points: Statistics and probability; Security, privacy; Process-centric DM; Integration of DTW, OLAP and DM; Post-processing; Human-machine interaction and visualization; Feature Selection. r1 = 0.218, r2 = 0.200.]
Correspondence Analysis (# of Authors vs Final Decision)
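[Figure: correspondence analysis map of number of authors against final decision (Reject, Short, Regular). Labels on the slide: 1, 2, 3, 4, 5, 6, Process-centric DM, Human-machine interaction and visualization. r1 = 0.218, r2 = 0.200.]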
Correspondence Analysis Summaries
• Country vs Final Decision
– Regular: Hong Kong, Israel
– Short: ?
– Reject: Most of the countries are located near this region.
• Topics vs Final Decision
– Regular: Statistics and Probability, Visualization
– Short: Postprocessing
– Reject: Feature Selection
• # of Authors vs Final Decision
– 1 or 4: Regular
– 2 or 3: between Short and Regular
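Correspondence analysis works on a contingency table (here, country by final decision) and decomposes its standardized residuals, whose squared sum divided by the sample size is the total inertia. The slides do not say how the maps were computed, so the sketch below only illustrates that first step for five countries from the 2003 table; reject counts are derived as Total - Regular - Short, and the function names are illustrative.

```python
import math

DECISIONS = ("Regular", "Short", "Reject")

# (regular, short, reject) for five countries from the 2003 country table;
# reject is derived as total - regular - short.
COUNTS = {
    "USA": (35, 28, 126),
    "China": (2, 0, 43),
    "Japan": (4, 3, 12),
    "Hong Kong": (4, 2, 6),
    "Israel": (4, 2, 5),
}


def pearson_residuals(counts):
    """Return {country: residuals}, each residual being (observed - expected) / sqrt(expected)."""
    n = sum(sum(row) for row in counts.values())
    col_totals = [sum(row[j] for row in counts.values()) for j in range(len(DECISIONS))]
    residuals = {}
    for country, row in counts.items():
        row_total = sum(row)
        expected = [row_total * col_total / n for col_total in col_totals]
        residuals[country] = tuple(
            (obs - exp) / math.sqrt(exp) for obs, exp in zip(row, expected)
        )
    return residuals


def total_inertia(counts):
    """Chi-square statistic divided by the sample size."""
    n = sum(sum(row) for row in counts.values())
    chi2 = sum(r * r for row in pearson_residuals(counts).values() for r in row)
    return chi2 / n


RESIDUALS = pearson_residuals(COUNTS)
for country, row in RESIDUALS.items():
    print(country, [round(r, 2) for r in row])
print("total inertia:", round(total_inertia(COUNTS), 3))
```

In this subset Israel and Hong Kong have positive residuals in the Regular column and China a negative one, which is consistent with the summary above. A full correspondence analysis would go on to take a singular value decomposition of these residuals to obtain the map coordinates.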
Correspondence Analysis (2002) (Country vs Final Decision)
• Rule: [R1=0] [R2=0]: |[R1=0]| |[R2=0]|
• Rule Relations between Sets
• Relations between supporting sets are very important.
– Rough Set / Granular Computing
• Index for Rule Induction:
– P(R2|R1), P(R1|R2), or f(P(R2|R1))
– Relation between Information Granules
[Figure: correspondence analysis map of countries against final decision (Reject, Short, Regular) for 2002. Labeled points: Hong Kong, Austria, Japan, Taiwan, Australia, Finland, USA, Canada, China, Thailand.]
Correspondence Analysis in 2002 (Category vs Final Decision)
[Figure: correspondence analysis map of categories against final decision (Reject, Short, Regular) for 2002. Labeled points: Bayesian, Statistics, Similarity, Interestingness, Active Learning, Theory, Temporal, Web Mining, Structured, Text Mining, SVM, Rule, Tree, Applications, Association R.]
Rule Mining
• Dataset
– Sample Size: 486
– Attributes: 5
  • Paper No. (ordered by submission date)
  • # of Authors
  • # of Characters in Title
  • Country
  • Category
– Analyzed with Clementine 7.1
Rule Mining (2)
• C5.0
– [FINAL=long] <= [Country=Israel] & [# of Authors > 2] & [# of Chars in Title <= 75.0] (Confidence: 0.667, Support: 3)
– [FINAL=Reject] <= [# of Authors > 4] & [Paper No. > 117] & [# of Chars in Title > 71.0] (Confidence: 0.857, Support: 10)
• # of Authors, Paper No., # of Chars in Title: Important Features
Rule Mining (3)
• Generalized Rule Induction
– [FINAL=Reject] <= [Paper No. < 67.5] (Confidence: 90%, Support: 10.7%)
– [FINAL=Reject] <= [Paper No. < 54.5] & [# of Chars in Title > 49.5] (Confidence: 100%, Support: 4.73%)
– [FINAL=long] <= [Country=Israel] & [# of Chars in Title > 61.5] (Confidence: 60%, Support: 1.03%)
• Paper No., # of Chars in Title: Important Features
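The confidence and support figures attached to each rule can be read as follows, assuming support is the share of papers matching the rule's conditions and confidence is the share of those papers that received the stated decision. The slides do not spell out these definitions, so treat them as an assumption. The sketch below evaluates one rule on a handful of hypothetical records (not the ICDM data); `Paper` and `rule_stats` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    paper_no: int
    n_authors: int
    title_chars: int
    country: str
    final: str  # "long", "short" or "Reject"


def rule_stats(papers, antecedent, consequent):
    """Return (support, confidence) of the rule: consequent <= antecedent.

    support    = fraction of all papers that satisfy the antecedent (assumed definition)
    confidence = fraction of those papers that also satisfy the consequent
    """
    covered = [p for p in papers if antecedent(p)]
    if not covered:
        return 0.0, 0.0
    hits = sum(1 for p in covered if consequent(p))
    return len(covered) / len(papers), hits / len(covered)


# Hypothetical toy records for illustration only, NOT the ICDM 2003 submissions.
PAPERS = [
    Paper(12, 2, 80, "USA", "Reject"),
    Paper(40, 3, 55, "China", "Reject"),
    Paper(60, 1, 62, "Israel", "long"),
    Paper(150, 2, 48, "Japan", "short"),
    Paper(300, 4, 70, "Israel", "long"),
]

# Rule of the form [FINAL=Reject] <= [Paper No. < 67.5]
support, confidence = rule_stats(
    PAPERS,
    antecedent=lambda p: p.paper_no < 67.5,
    consequent=lambda p: p.final == "Reject",
)
print(f"support = {support:.1%}, confidence = {confidence:.1%}")
```

Under the same reading, a support of 10.7% on 486 submissions corresponds to roughly 52 papers.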
Rule Mining in 2002
• C5.0
– [# of Chars in Title > 43] => Rejected (Confidence: 0.669, Support: 303)
– [Paper No. <= 722] & [Country=USA] & [Category=Temporal Data Mining] => Regular (Confidence: 0.833, Support: 4)
Rule Mining in 2002
• (Association) Rules
– Rejected <= [Paper No. < 542.5] (Confidence: 0.88, Support: 41)
– Rejected <= [Paper No. < 542.5] & [# of Chars in Title > 53.5] (Confidence: 0.833, Support: 29)
– Regular <= [Country=Canada] & [Category=Text Mining] (Confidence: 0.6, Support: 5)
• Paper No., Country, Category: Important Features
Comparison with 2002
• Important Features in 2003
– # of Authors, Paper No., # of Chars in Title
– Early 57 papers, Long Titles, 2 authors
• Important Features in 2002
– Paper No., # of Chars in Title, Country, Category
– Early 52 papers, Long Titles
Conclusions
• Do not submit a paper too quickly!
– Reflection is needed not only on the contents but also on the title.
• Mining Text/Web/Semi-structured Data is very popular now.
• Statistics and Probability is a very strong topic.
• Security and Privacy issues are becoming stronger.
• Visualization/Interaction are emerging in ICDM 2003:
– Visualization/Human-Machine Interaction
– Postprocessing of DM Results
– Process-centric DM