Data Mining: Extracting Knowledge from Past Data Ming-Syan Chen Network Database Laboratory Electrical Engineering Department National Taiwan University

Page 1: Data Mining: Extracting Knowledge from Past Data

Data Mining: Extracting Knowledge from Past Data

Ming-Syan Chen

Network Database Laboratory

Electrical Engineering Department

National Taiwan University

Page 2: Data Mining: Extracting Knowledge from Past Data

M.-S. Chen NTU 2

Outline

• An introduction to data mining

• Challenging issues on data mining

Page 3: Data Mining: Extracting Knowledge from Past Data

Data Mining

• Data mining: Knowledge discovery in databases

– Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

– Relevant fields: AI, database, statistics

• We are buried in data, but looking for knowledge

Page 4: Data Mining: Extracting Knowledge from Past Data

Knowledge Discovering Process

[Figure: Data → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns → (Interpretation/Evaluation) → Knowledge]

Page 5: Data Mining: Extracting Knowledge from Past Data

Mining Capabilities

• Association

• Classification

• Clustering

• Traversal patterns

• Sequential patterns

• and many others

Page 6: Data Mining: Extracting Knowledge from Past Data

E.g., Mining Association Rules

• Transaction data analysis: mining association rules

– Given: (1) a database of transactions; (2) each transaction (tx) has a list of items purchased

• Find all association rules: the presence of one set of items implies the presence of another set of items in the same tx

• Two primary approaches: (1) Apriori-based; (2) FP-tree-based
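
The Apriori-based approach named above can be illustrated with a minimal level-wise sketch: count single items, then repeatedly join frequent itemsets into larger candidates and keep those that meet the minimum count. The function name `apriori`, the `min_count` parameter, and the sample baskets are assumptions made for illustration and do not come from the slides; the sketch favors clarity over efficiency.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(tx) for tx in transactions]

    # Level 1: count single items.
    counts = {}
    for tx in transactions:
        for item in tx:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(frequent), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for tx in transactions if c <= tx) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


# Hypothetical baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]
frequent_itemsets = apriori(baskets, min_count=3)
```

The prune step is what gives the approach its name: an itemset can only be frequent if all of its subsets are frequent, so most large candidates are discarded before the database is scanned.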

Page 7: Data Mining: Extracting Knowledge from Past Data

Two Parameters

• Confidence (how true the rule is)

– The rule X&Y => Z having 90% confidence means that 90% of customers who bought X and Y also bought Z

• Support (how useful the rule is)

– Useful rules should have some minimum transaction support
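
As a minimal sketch of the two parameters, the following computes support and confidence for a rule X&Y => Z over a small transaction list. The helper names `support` and `confidence` and the five sample transactions are illustrative assumptions, not part of the original slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for tx in transactions if itemset <= tx) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction also containing `consequent`."""
    antecedent = frozenset(antecedent)
    consequent = frozenset(consequent)
    matched = [tx for tx in transactions if antecedent <= tx]
    if not matched:
        return 0.0
    return sum(1 for tx in matched if consequent <= tx) / len(matched)


# Hypothetical transactions, for illustration only.
transactions = [
    frozenset({"X", "Y", "Z"}),
    frozenset({"X", "Y", "Z"}),
    frozenset({"X", "Y"}),
    frozenset({"X", "Z"}),
    frozenset({"Y"}),
]

# Rule X&Y => Z: support is measured on the whole itemset {X, Y, Z}.
rule_support = support(transactions, {"X", "Y", "Z"})          # 2 of 5 transactions
rule_confidence = confidence(transactions, {"X", "Y"}, {"Z"})  # 2 of the 3 with X and Y
```

A rule is typically reported only when both values meet user-chosen minimum thresholds.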

Page 8: Data Mining: Extracting Knowledge from Past Data

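Applications

• Industry-specific applications proposed according to the needs of each industry

– Finance and insurance: credit rating, customized financial services, credit granting, customer asset management, bad-debt analysis, moral hazard analysis, adverse selection risk analysis, prospective customer list analysis

– Retail: a basis for real-time purchase decision support, plus a decision support system for integrating and allocating goods, shelf space, and logistics

– Manufacturing: an expert decision support system for choosing optimal production factors during the production process, plus optimized inventory control and supply chain and customer profitability analysis

– Chain stores: choosing locations for new stores and product assortments for branch stores, plus decision support for logistics warehouse locations and logistics capacity allocation

– Healthcare: cost driver analysis for activity-based cost management, customer profitability analysis, or a source for customized patient services

– Telecommunications: optimized network traffic allocation and customized services, plus real-time online customized information systems, customized portals, and promotion support

– Biotechnology: an R&D platform and the analysis tools needed to accelerate the accumulation of R&D capability

– Education: analysis of prospective student lists, use of data mining to rate admission and scholarship applications, and a basis for student course planning and career planning

– Advertising: ad click source analysis, response rate analysis, marketing strategy recommendations

– Nonprofit organizations: contact lists for fundraising letters and correspondence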

Page 9: Data Mining: Extracting Knowledge from Past Data

Remarks

• Data mining is very application dependent

– Small team with good skill and domain knowledge

• Lots of work has been done in other areas

• Emerging issues:

– Journals: ACM TODS, ACM TKDD (from 2007), IEEE TKDE, DMKD, KAIS, IS, Pattern Recognition

– Conferences: SIGKDD, ICDM (from 2001), SIAM-SDM (from 2001), SIGMOD, ICDE, VLDB, CIKM, ICML, SIGIR, WI, PAKDD, etc.

Page 10: Data Mining: Extracting Knowledge from Past Data

What Is Next for Data Mining

• Privacy-preserving mining

• Data stream mining

• Mining for bioinformatics

• Mining to assist content-based data management

Page 11: Data Mining: Extracting Knowledge from Past Data

Data Streams: Computation Model

• Stream processing requirements

– Single pass: Each record is examined at most once

– Bounded storage: Limited memory for storing synopsis

– Real-time: Per-record processing time must be low

[Figure: Data Streams → Stream Processing Engine (with Synopsis in Memory) → (Approximate) Answer]
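
To make the computation model concrete, here is a minimal sketch of one well-known synopsis technique, the Misra-Gries frequent-items summary. It is offered as an example of the single-pass, bounded-storage style described above and is not taken from the slides; the function name `misra_gries`, the parameter `k`, and the sample stream are assumptions. The summary keeps at most k-1 counters, reads each record once, and returns an approximate answer: every item occurring more than n/k times in a stream of n records is guaranteed to be among the counters, and each stored count undercounts the true frequency by at most n/k.

```python
def misra_gries(stream, k):
    """Single-pass summary holding at most k-1 counters (the in-memory synopsis)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream, for illustration only: "a" occurs 5 times out of 10.
stream = ["a", "b", "a", "c", "a", "d", "a", "e", "a", "f"]
synopsis = misra_gries(stream, k=3)
```

With k=3 the synopsis never holds more than two counters, yet "a" (frequency 5 > 10/3) survives with an estimated count of 3.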

Page 12: Data Mining: Extracting Knowledge from Past Data

Outline

• An introduction to data mining

• Challenging Issues on data mining

Page 13: Data Mining: Extracting Knowledge from Past Data

Challenging Issues for Data Mining

• Identifying data source for desired knowledge

– Mining purposes: knowledge or auxiliary meta data

• Data collection methods (in Web, wireless, tx)

– Different types of data from different environments

• Usefulness and certainty of mining results

– Support and confidence

• Interactive mining with different data granularities

– e.g., generalized association rules

Page 14: Data Mining: Extracting Knowledge from Past Data

Issues (cont’d)

• Mining in data streaming environments

– Look at data only once; the amount of data is huge

– Incremental mining (temporal and spatial)

• Efficiency and scalability of mining algorithms

– Sampling methods (frequency tuned wrt data or wrt result accuracy)

• Hardware-enhanced mining

– E.g., PDA, STB, devices for LBS

Page 15: Data Mining: Extracting Knowledge from Past Data

Issues (cont’d)

• Interestingness of mining results

– Have to know the original likelihood

• Evaluation of mining results

– How to measure the advantage gained

• Expression of various kinds of mining results

• Protection of privacy and data security

– Data hiding

Page 16: Data Mining: Extracting Knowledge from Past Data

Ongoing Works in NetDB Lab

• Web usage mining

• Web content mining

• Mining in mobile environments

• Scalable clustering techniques tuned with domain knowledge

• Incremental mining (temporal and spatial)

• Hardware-enhanced mining

Page 17: Data Mining: Extracting Knowledge from Past Data

Summary

• Data mining is an area of growing importance

– Increasing demand for intelligence

– Fast advance in IT techniques

• Mining will be of increasing impact to Web and wireless applications

– Huge amount of digital data

– Nature of applications and their users

Page 18: Data Mining: Extracting Knowledge from Past Data

[Figure: system architecture with the components Database, Data warehouse, Data cleaning, Data integration, Filtering, Database or data warehouse server, Data mining engine, Pattern evaluation, Graphical user interface, and Knowledge base]

Page 19: Data Mining: Extracting Knowledge from Past Data

Incremental Mining

• Due to the increasing use of record-based databases, recent important applications have called for incremental mining

– Such applications include Web log records, stock market data, grocery sales data, transactions in electronic commerce, and daily weather/traffic records, to name a few

Page 20: Data Mining: Extracting Knowledge from Past Data

Incremental Mining

• To mine the transaction database for a fixed amount of the most recent data (say, data in the last 12 months)

• One has to not only include new data (i.e., data in the new month) in the mining process, but also remove the old data (i.e., data in the most obsolete month) from it.

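[Figure: monthly partitions Pi (data for 1/2000), Pi+1 (data for 2/2000), ..., Pj (data for 12/2000), Pj+1 (data for 1/2001); the window dbi,j covers Pi through Pj and slides forward to dbi+1,j+1, which covers Pi+1 through Pj+1]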
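
The sliding-window idea can be sketched as follows: keep per-partition item counts, add the counts of the newest partition, and subtract those of the partition that falls out of the window, so the older data never has to be rescanned. This is a simplified illustration that tracks single-item counts only; it is not the incremental mining algorithm from the author's work, and the class name `SlidingWindowCounts`, its methods, and the sample partitions are assumptions.

```python
from collections import Counter, deque


class SlidingWindowCounts:
    """Maintain item counts over the most recent `window_size` partitions."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.partitions = deque()  # per-partition item counts, oldest first
        self.totals = Counter()    # item counts over the current window

    def add_partition(self, transactions):
        # Count each item at most once per transaction.
        part = Counter(item for tx in transactions for item in set(tx))
        self.partitions.append(part)
        self.totals.update(part)
        if len(self.partitions) > self.window_size:
            # Remove the contribution of the most obsolete partition.
            oldest = self.partitions.popleft()
            self.totals.subtract(oldest)
            self.totals = +self.totals  # drop items whose count fell to zero

    def count(self, item):
        return self.totals[item]


# Hypothetical monthly partitions with a window of two months.
window = SlidingWindowCounts(window_size=2)
window.add_partition([["a", "b"], ["a"]])  # month 1
window.add_partition([["b"]])              # month 2
window.add_partition([["c"], ["a", "c"]])  # month 3: month 1 leaves the window
```

After the third call the window covers months 2 and 3 only, so the two occurrences of "a" from month 1 no longer contribute.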

Page 21: Data Mining: Extracting Knowledge from Past Data

E.g., Redundant Rules

• For the same support and confidence, if we have a rule {a,d}=>{c,e,f,g}, what do we have?

– {a,d}=>{c,e,f}

– {a}=>{c,e,f,g}

– {a,d,c}=>{e,f,g}

– {a}=>{d,c,e,f,g}
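
One way to read this question is to ask which of the four rules are guaranteed to meet the same thresholds once {a,d}=>{c,e,f,g} does. Dropping items from the consequent, or moving items from the consequent into the antecedent, can never lower support or confidence, so {a,d}=>{c,e,f} and {a,d,c}=>{e,f,g} follow automatically. Shrinking the antecedent to {a} can lower confidence, because more transactions may contain {a} alone. The sketch below checks this on a small dataset; the `metrics` helper and the five transactions are illustrative assumptions, not data from the slides.

```python
def metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for tx in transactions if antecedent <= tx)
    n_both = sum(1 for tx in transactions if both <= tx)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return n_both / len(transactions), confidence


# Hypothetical data: two full baskets plus three baskets holding only "a".
transactions = [frozenset("adcefg")] * 2 + [frozenset("a")] * 3

rules = {
    "original": ("ad", "cefg"),
    "smaller consequent": ("ad", "cef"),
    "smaller antecedent": ("a", "cefg"),
    "item moved to antecedent": ("adc", "efg"),
    "item moved to consequent": ("a", "dcefg"),
}
results = {name: metrics(transactions, lhs, rhs) for name, (lhs, rhs) in rules.items()}

for name, (sup, conf) in results.items():
    print(f"{name}: support={sup:.2f} confidence={conf:.2f}")
```

On this data every rule keeps the original support, but the two rules with antecedent {a} have lower confidence than the original, so they are not implied by it.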

Page 22: Data Mining: Extracting Knowledge from Past Data

E.g., Generalized Asso. Rules

• Which data granularities should be used for data mining?

• To mine meaningful rules (proper data units) and be as specific as possible

– Similar dilemma for other mining capabilities

Page 23: Data Mining: Extracting Knowledge from Past Data

Taxonomy

Clothes
    Outerwear
        Jackets
        Ski Pants
    Shirts
Footwear
    Shoes
    Hiking Boots

Database

Tx     Items bought
100    Shirt
200    Jacket, Hiking Boots
300    Ski Pants, Hiking Boots
400    Shoes
500    Shoes
600    Jacket

Frequent itemsets

Itemset                     Support
Jacket                      2
Outerwear                   3
Clothes                     4
Shoes                       2
Hiking Boots                2
Footwear                    4
Outerwear, Hiking Boots     2
Clothes, Hiking Boots       2
Outerwear, Footwear         2
Clothes, Footwear           2

Rules (minimum support 30%, minimum confidence 60%)

Rule                          Support    Confidence
Outerwear → Hiking Boots      33%        66%
Outerwear → Footwear          33%        66%
Hiking Boots → Outerwear      33%        100%
Hiking Boots → Clothes        33%        100%

However, the more specific rules do not reach the minimum support:

Rule                          Support    Confidence
Jacket → Hiking Boots         16%        50%
Ski Pants → Hiking Boots      16%        100%
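
The figures on this slide can be reproduced with a short sketch: extend every transaction with the ancestors of its items in the taxonomy, then compute support and confidence as usual. The names `PARENT`, `extend`, `count`, and `rule` are illustrative assumptions, and the taxonomy leaves are written in the singular form used in the database ("Jacket", "Shirt").

```python
# Child -> parent links of the taxonomy on this slide.
PARENT = {
    "Jacket": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirt": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

DATABASE = {
    100: ["Shirt"],
    200: ["Jacket", "Hiking Boots"],
    300: ["Ski Pants", "Hiking Boots"],
    400: ["Shoes"],
    500: ["Shoes"],
    600: ["Jacket"],
}


def extend(items):
    """Add every ancestor of every item, so higher-level itemsets can be counted."""
    extended = set()
    for item in items:
        while item is not None:
            extended.add(item)
            item = PARENT.get(item)
    return frozenset(extended)


EXTENDED = [extend(items) for items in DATABASE.values()]


def count(itemset):
    """Number of extended transactions containing the whole itemset."""
    itemset = frozenset(itemset)
    return sum(1 for tx in EXTENDED if itemset <= tx)


def rule(antecedent, consequent):
    """Return (support, confidence) of antecedent => consequent."""
    n_both = count(set(antecedent) | set(consequent))
    return n_both / len(EXTENDED), n_both / count(antecedent)


general = rule({"Outerwear"}, {"Hiking Boots"})  # support 2/6, confidence 2/3
specific = rule({"Jacket"}, {"Hiking Boots"})    # support 1/6, confidence 1/2
```

The general rule clears a 30% minimum support while the specific one does not, which is the granularity trade-off the slide illustrates.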

Page 24: Data Mining: Extracting Knowledge from Past Data

E.g., Interestingness of Rules

• In a school of 5000 students

– 60% (3000) play basketball and 75% (3750) eat cereal; and 40% (2000) do both

• Say the minimum support is 2000 and the minimum confidence is 60%; one gets the rule

– “play basketball => eat cereal”, so ... does that mean promoting basketball activities will help the sales of cereal?

Page 25: Data Mining: Extracting Knowledge from Past Data

Interestingness (Cont’d)

• In fact, P(A and B)/P(A) should be greater than P(B) to make the rule “A=>B” interesting

– How about for the rule {A,K}=>{B,L,V} to be interesting?
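
Applying this test to the basketball and cereal numbers from the previous slide gives a quick check. The ratio of P(A and B)/P(A) to P(B) is commonly called lift; that name, the function `rule_stats`, and its parameter names are assumptions that do not appear in the slides.

```python
def rule_stats(n_total, n_a, n_b, n_both):
    """Return (support, confidence, lift) for the rule A => B from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_a            # P(A and B) / P(A)
    lift = confidence / (n_b / n_total)  # compare against the baseline P(B)
    return support, confidence, lift


# 5000 students: 3000 play basketball (A), 3750 eat cereal (B), 2000 do both.
support, confidence, lift = rule_stats(n_total=5000, n_a=3000, n_b=3750, n_both=2000)

print(f"support={support:.0%} confidence={confidence:.1%} lift={lift:.3f}")
```

The confidence is about 66.7%, which passes the 60% threshold, but it is below the 75% of all students who eat cereal, so the lift is below 1: basketball players are slightly less likely than average to eat cereal, and the rule is misleading rather than interesting.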

Page 26: Data Mining: Extracting Knowledge from Past Data

Related Training

• Database

• AI: machine learning

• Statistics