View
4
Download
0
Category
Preview:
Citation preview
Stats 415 - Overview Ji Zhu, Michigan Statistics 1
Overview
Ji Zhu445C West Hall
734-936-2577jizhu@umich.edu
Stats 415 - Overview Ji Zhu, Michigan Statistics 2
What is Data Mining?• Data mining is a multi-disciplinary field of study
concerned with the design of algorithms that allowcomputers to learn from large data repositories.
• Non-trivial extraction of implicit, previouslyunknown and potentially useful information fromdata
• There are many other definitions.
Stats 415 - Overview Ji Zhu, Michigan Statistics 3
Data Mining Examples and Non-Examples• Data mining
– Certain names are more prevalent in certain USlocations (O’Brien, O’Rurke, O’Reilly... in Bostonarea)
– Group together similar documents returned bysearch engine according to their context (e.g.Amazon rainforest, Amazon.com, etc.)
• Not data mining– Look up phone number in phone directory– Query a web search engine for information
about “Amazon”
Stats 415 - Overview Ji Zhu, Michigan Statistics 4
Stats 415 - Overview Ji Zhu, Michigan Statistics 5
Why Mine Data? Scientific Viewpoint• Data collected and stored at enormous speeds
(GB/hour)– remote sensors on a satellite (NASA)– telescopes scanning the skies (SDSS)– microarrays generating gene expression data
(MEDLINE)– scientific simulations generating terabytes of
data (GIS)
Stats 415 - Overview Ji Zhu, Michigan Statistics 6
• Data mining may help scientists– in classifying and segmenting data– in hypothesis formation– etc
Stats 415 - Overview Ji Zhu, Michigan Statistics 7
Why Mine Data? Commercial Viewpoint• Lots of data is being collected and warehoused
– Web data, e-commerce (Google, Yahoo, Amazon,Ebay)
– Purchases at department and grocery stores(Walmart)
– Bank/credit card transactions (Bank of America,Visa, Mastercard)
Stats 415 - Overview Ji Zhu, Michigan Statistics 8
• Competitive pressure is strong– Provide better, customized services for an edge
Stats 415 - Overview Ji Zhu, Michigan Statistics 9
Mining Large Data Sets - Motivation• There is often information “hidden” in the data that
is not readily evident.
• Human analysts may take months to discoveruseful information.
• Much of the data is never analyzed at all.
Stats 415 - Overview Ji Zhu, Michigan Statistics 10
Technological Driving Factors• Larger, cheaper memory
• Faster, cheaper processors– the CRAY of 20 years ago is now on your desk
• Success of relational databases and the Web– everybody is a “data owner”
• New ideas in machine learning and statistics
Stats 415 - Overview Ji Zhu, Michigan Statistics 11
Origins of Data Mining• Draws ideas from machine learning, pattern
recognition, statistics and database systems.
Statistics
DatabaseSystems
MachineLearning/PatternRecognition
Data Mining
Stats 415 - Overview Ji Zhu, Michigan Statistics 12
• Traditional techniques may be unsuitable due to– enormity of data– high dimensionality of data– heterogeneous, distributed nature of data
Stats 415 - Overview Ji Zhu, Michigan Statistics 13
Data Mining vs Statistics• Traditional statistics
– first hypothesize, then collect data, then analyze– often model-oriented
• Data mining– few if any a priori hypothesis– often algorithm-oriented rather than
model-oriented
Stats 415 - Overview Ji Zhu, Michigan Statistics 14
• Different?– Yes, in terms of culture, motivation; however...– Statistical ideas are very useful in data mining, e.g.,
in validating whether discovered knowledge isuseful
• Increasing overlap at the boundary of statistics anddata mining: use the tools of probability andstatistics to provide a mathematical framework for– posing data mining problems– formulating solutions to those problems
Stats 415 - Overview Ji Zhu, Michigan Statistics 15
Data Mining vs Machine Learning• To first-order, very little difference...
– Data mining relies heavily on ideas frommachine learning (and from statistics)
• Some differences between data mining andmachine learning– More emphasis in data mining on scalability– Data mining is somewhat more
applications-oriented
Stats 415 - Overview Ji Zhu, Michigan Statistics 16
Two Types of Data Mining Tasks• Prediction methods: Use some variables to predict
unknown or future values of other variables
• Description methods: Find human-interpretablepatterns that describe the data
Stats 415 - Overview Ji Zhu, Michigan Statistics 17
Examples of Data Mining Tasks• Visualization (Descriptive)
• Classification (Predictive)
• Regression (Predictive)
• Association analysis (Descriptive)
• Clustering (Descriptive)
Stats 415 - Overview Ji Zhu, Michigan Statistics 18
Classification: Definition• Given a collection of data points (training set)
– Each data point contains a set of variables, oneof the variables is the class
• Find a model for the class variable as a function ofthe values of other variables
• Goal: previously unseen data points should beassigned a class as accurately as possible– A test set is used to determine the accuracy of
the model
Stats 415 - Overview Ji Zhu, Michigan Statistics 19
Classification Example: Customer Scoring• Example: a bank has a database of 1 million past
customers, 10% of whom took out mortgages
• Use data mining to predict whether a newcustomer will take out a “mortgage or not” basedon the customer data
• Customer data– Other credit data– Demographic data on the customer
Stats 415 - Overview Ji Zhu, Michigan Statistics 20
Classification Example: Spam DetectionCustomize an email spam detection system forindividual user. Relative frequencies in a message ofmost commonly occurring words and punctuationmarks.
george you your hp free re removespam 0.00 2.26 1.38 0.02 0.52 0.13 0.28email 1.27 1.27 0.44 0.90 0.07 0.42 0.01
Stats 415 - Overview Ji Zhu, Michigan Statistics 21
Classification Example: Microarray
SID42354SID31984SID301902SIDW128368SID375990SID360097SIDW325120ESTsChr.10SIDW365099SID377133SID381508SIDW308182SID380265SIDW321925ESTsChr.15SIDW362471SIDW417270SIDW298052SID381079SIDW428642TUPLE1TUP1ERLUMENSIDW416621SID43609ESTsSID52979SIDW357197SIDW366311ESTsSMALLNUCSIDW486740ESTsSID297905SID485148SID284853ESTsChr.15SID200394SIDW322806ESTsChr.2SIDW257915SID46536SIDW488221ESTsChr.5SID280066SIDW376394ESTsChr.15SIDW321854WASWiskottHYPOTHETICALSIDW376776SIDW205716SID239012SIDW203464HLACLASSISIDW510534SIDW279664SIDW201620SID297117SID377419SID114241ESTsCh31SIDW376928SIDW310141SIDW298203PTPRCSID289414SID127504ESTsChr.3SID305167SID488017SIDW296310ESTsChr.6SID47116MITOCHONDRIAL60ChrSIDW376586HomosapiensSIDW487261SIDW470459SID167117SIDW31489SID375812DNAPOLYMERSID377451ESTsChr.1MYBPROTOSID471915ESTsSIDW469884HumanmRNASIDW377402ESTsSID207172RASGTPASESID325394H.sapiensmRNAGNALSID73161SIDW380102SIDW299104
BREAS
TRE
NAL
MELAN
OMA
MELAN
OMA
MCF7D
-repro
COLON
COLON
K562B-
repro
COLON NSCLC
LEUKEM
IARE
NAL
MELAN
OMA
BREAS
TCN
SCN
SRE
NAL
MCF7A
-repro
NSCLC
K562A-
repro
COLON CN
SNS
CLCNS
CLCLEU
KEMIA CNS
OVAR
IANBR
EAST
LEUKEM
IAME
LANOM
AME
LANOM
AOV
ARIAN
OVAR
IANNS
CLC RENA
LBR
EAST
MELAN
OMA
OVAR
IANOV
ARIAN
NSCLC RENA
LBR
EAST
MELAN
OMA
LEUKEM
IACO
LONBR
EAST
LEUKEM
IACO
LON CNS
MELAN
OMA
NSCLC
PROS
TATE
NSCLC RENA
LRE
NAL
NSCLC RENA
LLEU
KEMIA
OVAR
IANPR
OSTAT
ECO
LONBR
EAST
RENA
LUN
KNOW
N
Stats 415 - Overview Ji Zhu, Michigan Statistics 22
Deviation/Anomaly Detection• Detect significant deviations from normal behavior
• Applications:– Credit card fraud detection– Network intrusion detection
Stats 415 - Overview Ji Zhu, Michigan Statistics 23
Fraud Detection• Credit card fraud detection
– Credit card losses in the US are over 1 billion $per year
– Roughly 1 in 50k transactions are fraudulent
• Fair-Issac’s fraud detection software based onneural networks, led to reported fraud decreases of30 to 50%
• Issues: false alarm rate vs missed detection – whatis the tradeoff?
Stats 415 - Overview Ji Zhu, Michigan Statistics 24
Regression• Predict a value of a given continuous valued
variable based on the values of other variables,assuming a linear or nonlinear model ofdependency
• Examples– Predicting sales amounts of new product based
on advertising expenditure– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc– Time series prediction of stock market indices
Stats 415 - Overview Ji Zhu, Michigan Statistics 25
Clustering: Definition• Given a set of data points, each having a set of
variables, find clusters such that– data points in one cluster are more “similar” to
one another, and– data points in separate clusters are “less similar”
to one another.
Stats 415 - Overview Ji Zhu, Michigan Statistics 26
• Similarity measures– Euclidean distance if variables are continuous– Other problem-specific measures
Stats 415 - Overview Ji Zhu, Michigan Statistics 27
Clustering Example: Market Segmentation• Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably beselected as a market target to be reached with adistinct marketing mix.
• Collect different variables of customers based ontheir geographical and lifestyle related information
• Find clusters of similar customers
Stats 415 - Overview Ji Zhu, Michigan Statistics 28
Clustering Examples• Document clustering: find groups of documents
that are similar to each other based on theimportant terms appearing in them
• Clustering stocks based on their movements everyday
Stats 415 - Overview Ji Zhu, Michigan Statistics 29
Association Rule Discovery: Definition• Given a set of records each of which contains some
number of items from a given collection– Produce dependency rules which will predict
occurrence of an item based on occurrences ofother items
• Goal is to discover interesting local patterns in thedata rather than to characterize the data globally
Stats 415 - Overview Ji Zhu, Michigan Statistics 30
ID Items1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk
Rules discovered
• {Coke} =⇒ {Milk}
• {Diaper, Milk} =⇒ {Beer}
Stats 415 - Overview Ji Zhu, Michigan Statistics 31
Association Rule Discovery Example• Supermarket shelf management
– Goal: To identify items that are bought togetherby sufficiently many customers
– A classic rule: If a customer buys diaper andmilk, then he is very likely to buy beer
• Amazon, Netflix
Stats 415 - Overview Ji Zhu, Michigan Statistics 32
Data Mining: the Downside• Hype?
• Data dredging, snooping and fishing– Finding spurious structure in data that is not
real
• Historically, “data mining” was derogatory term inthe statistics community– Rhine paradox– The Super Bowl fallacy– Bangladesh butter prices and the US stock
market
Recommended