Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better,...

Stats 415 - Overview Ji Zhu, Michigan Statistics 1

Overview

Ji Zhu445C West Hall

734-936-2577jizhu@umich.edu

What is Data Mining?• Data mining is a multi-disciplinary field of study

concerned with the design of algorithms that allowcomputers to learn from large data repositories.

• Non-trivial extraction of implicit, previouslyunknown and potentially useful information fromdata

• There are many other definitions.

Data Mining Examples and Non-Examples• Data mining

– Certain names are more prevalent in certain USlocations (O’Brien, O’Rurke, O’Reilly... in Bostonarea)

– Group together similar documents returned bysearch engine according to their context (e.g.Amazon rainforest, Amazon.com, etc.)

• Not data mining– Look up phone number in phone directory– Query a web search engine for information

about “Amazon”

Why Mine Data? Scientific Viewpoint• Data collected and stored at enormous speeds

(GB/hour)– remote sensors on a satellite (NASA)– telescopes scanning the skies (SDSS)– microarrays generating gene expression data

(MEDLINE)– scientific simulations generating terabytes of

data (GIS)

• Data mining may help scientists– in classifying and segmenting data– in hypothesis formation– etc

Why Mine Data? Commercial Viewpoint• Lots of data is being collected and warehoused

– Web data, e-commerce (Google, Yahoo, Amazon,Ebay)

– Purchases at department and grocery stores(Walmart)

– Bank/credit card transactions (Bank of America,Visa, Mastercard)

• Competitive pressure is strong– Provide better, customized services for an edge

Mining Large Data Sets - Motivation• There is often information “hidden” in the data that

is not readily evident.

• Human analysts may take months to discoveruseful information.

• Much of the data is never analyzed at all.

Technological Driving Factors• Larger, cheaper memory

• Faster, cheaper processors– the CRAY of 20 years ago is now on your desk

• Success of relational databases and the Web– everybody is a “data owner”

• New ideas in machine learning and statistics

Origins of Data Mining• Draws ideas from machine learning, pattern

recognition, statistics and database systems.

Statistics

DatabaseSystems

MachineLearning/PatternRecognition

Data Mining

• Traditional techniques may be unsuitable due to– enormity of data– high dimensionality of data– heterogeneous, distributed nature of data

Data Mining vs Statistics• Traditional statistics

– first hypothesize, then collect data, then analyze– often model-oriented

• Data mining– few if any a priori hypothesis– often algorithm-oriented rather than

model-oriented

• Different?– Yes, in terms of culture, motivation; however...– Statistical ideas are very useful in data mining, e.g.,

in validating whether discovered knowledge isuseful

• Increasing overlap at the boundary of statistics anddata mining: use the tools of probability andstatistics to provide a mathematical framework for– posing data mining problems– formulating solutions to those problems

Data Mining vs Machine Learning• To first-order, very little difference...

– Data mining relies heavily on ideas frommachine learning (and from statistics)

• Some differences between data mining andmachine learning– More emphasis in data mining on scalability– Data mining is somewhat more

applications-oriented

Two Types of Data Mining Tasks• Prediction methods: Use some variables to predict

unknown or future values of other variables

• Description methods: Find human-interpretablepatterns that describe the data

Examples of Data Mining Tasks• Visualization (Descriptive)

• Classification (Predictive)

• Regression (Predictive)

• Association analysis (Descriptive)

• Clustering (Descriptive)

Classification: Definition• Given a collection of data points (training set)

– Each data point contains a set of variables, oneof the variables is the class

• Find a model for the class variable as a function ofthe values of other variables

• Goal: previously unseen data points should beassigned a class as accurately as possible– A test set is used to determine the accuracy of

the model

Classification Example: Customer Scoring• Example: a bank has a database of 1 million past

customers, 10% of whom took out mortgages

• Use data mining to predict whether a newcustomer will take out a “mortgage or not” basedon the customer data

• Customer data– Other credit data– Demographic data on the customer

Classification Example: Spam DetectionCustomize an email spam detection system forindividual user. Relative frequencies in a message ofmost commonly occurring words and punctuationmarks.

george you your hp free re removespam 0.00 2.26 1.38 0.02 0.52 0.13 0.28email 1.27 1.27 0.44 0.90 0.07 0.42 0.01

Classification Example: Microarray

SID42354SID31984SID301902SIDW128368SID375990SID360097SIDW325120ESTsChr.10SIDW365099SID377133SID381508SIDW308182SID380265SIDW321925ESTsChr.15SIDW362471SIDW417270SIDW298052SID381079SIDW428642TUPLE1TUP1ERLUMENSIDW416621SID43609ESTsSID52979SIDW357197SIDW366311ESTsSMALLNUCSIDW486740ESTsSID297905SID485148SID284853ESTsChr.15SID200394SIDW322806ESTsChr.2SIDW257915SID46536SIDW488221ESTsChr.5SID280066SIDW376394ESTsChr.15SIDW321854WASWiskottHYPOTHETICALSIDW376776SIDW205716SID239012SIDW203464HLACLASSISIDW510534SIDW279664SIDW201620SID297117SID377419SID114241ESTsCh31SIDW376928SIDW310141SIDW298203PTPRCSID289414SID127504ESTsChr.3SID305167SID488017SIDW296310ESTsChr.6SID47116MITOCHONDRIAL60ChrSIDW376586HomosapiensSIDW487261SIDW470459SID167117SIDW31489SID375812DNAPOLYMERSID377451ESTsChr.1MYBPROTOSID471915ESTsSIDW469884HumanmRNASIDW377402ESTsSID207172RASGTPASESID325394H.sapiensmRNAGNALSID73161SIDW380102SIDW299104

-repro

K562B-

COLON NSCLC

LEUKEM

-repro

K562A-

COLON CN

CLCLEU

KEMIA CNS

LEUKEM

CLC RENA

NSCLC RENA

LEUKEM

LON CNS

NSCLC RENA

Deviation/Anomaly Detection• Detect significant deviations from normal behavior

• Applications:– Credit card fraud detection– Network intrusion detection

Fraud Detection• Credit card fraud detection

– Credit card losses in the US are over 1 billion $per year

– Roughly 1 in 50k transactions are fraudulent

• Fair-Issac’s fraud detection software based onneural networks, led to reported fraud decreases of30 to 50%

• Issues: false alarm rate vs missed detection – whatis the tradeoff?

Regression• Predict a value of a given continuous valued

variable based on the values of other variables,assuming a linear or nonlinear model ofdependency

• Examples– Predicting sales amounts of new product based

on advertising expenditure– Predicting wind velocities as a function of

temperature, humidity, air pressure, etc– Time series prediction of stock market indices

Clustering: Definition• Given a set of data points, each having a set of

variables, find clusters such that– data points in one cluster are more “similar” to

one another, and– data points in separate clusters are “less similar”

to one another.

• Similarity measures– Euclidean distance if variables are continuous– Other problem-specific measures

Clustering Example: Market Segmentation• Goal: subdivide a market into distinct subsets of

customers where any subset may conceivably beselected as a market target to be reached with adistinct marketing mix.

• Collect different variables of customers based ontheir geographical and lifestyle related information

• Find clusters of similar customers

Clustering Examples• Document clustering: find groups of documents

that are similar to each other based on theimportant terms appearing in them

• Clustering stocks based on their movements everyday

Association Rule Discovery: Definition• Given a set of records each of which contains some

number of items from a given collection– Produce dependency rules which will predict

occurrence of an item based on occurrences ofother items

• Goal is to discover interesting local patterns in thedata rather than to characterize the data globally

ID Items1 Bread, Coke, Milk2 Beer, Bread3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Diaper, Milk

Rules discovered

• {Coke} =⇒ {Milk}

• {Diaper, Milk} =⇒ {Beer}

Association Rule Discovery Example• Supermarket shelf management

– Goal: To identify items that are bought togetherby sufficiently many customers

– A classic rule: If a customer buys diaper andmilk, then he is very likely to buy beer

• Amazon, Netflix

Data Mining: the Downside• Hype?

• Data dredging, snooping and fishing– Finding spurious structure in data that is not

• Historically, “data mining” was derogatory term inthe statistics community– Rhine paradox– The Super Bowl fallacy– Bangladesh butter prices and the US stock

market

Overview - University of Michiganjizhu/jizhu/fudan/s415overview.pdf · Œ Provide better,...

Documents

The Fudan Fulcrum

Advanced Microeconomics (II) - Fudan University

Fudan-UC Dispatch Fudan-UC Dispatch · 2019-03-04 · Fudan-UC Dispatch 1 FUDAN AND UC SCHOLARS’ RESEARCH ON CHINA Maximillian Auffhammer, UC Berkeley Field: Agricultural Economics

EndNote on Windows: Class Notes - Fudan University...EndNote on Windows: Class Notes - Fudan University ... 1 . 1

Fudan Newsletter Fall 2013

Entrepreneurship Summer School - Fudan University

Analysis of Cross Section and Panel Data Yan Zhang School of Economics, Fudan University CCER, Fudan University

Welcome to Fudan...Buddy Program •From year 2010 on, Fudan launches Buddy Program for incoming international exchange students. Fudan will recruit volunteers to match with exchange

As featured in - Fudan University

Work @ Fudan University

2012 - Fudan University · of Universitas 21, an international consortium of research-driven universities. FUDAN UNIVERSITY ... 2012 FUDAN UNIVERSITY INTERNATIONAL SUMMER SESSION

FUDAN UNIVERSITY INTERNATIONAL SUMMER SESSIONiso.fudan.edu.cn/downloads/sqgjxm20121115.pdf · Fudan University International Summer Session 2013 offers 14 ... Application Materials:

Correction - Fudan University

Fudan 12 09 European Journalism Culture

China july 2013 Fudan University

China Russia India Brazil - SKOLKOVO · School of Management Fudan University Fudan School of Management, Fudan University (FDSM) is a leading Business school in China. Structured

Orientation - Fudan

Some Aspects of the Spline Smoothing Approach to Non …jizhu/jizhu/wuke/Silverman-JRSSB... · 2009-03-13 · Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression

Fudan MBA iLab - Business FinlandOverview of Fudan University and School of Management Content . Overview of Fudan University and FDSM 01 . The century -old university, officially

Cambridge Summer Institute - Fudan University