DAMI:
Introduction to Data Mining
Panagiotis Papapetrou, PhD
Associate Professor, Stockholm University
Adjunct Professor, Aalto University
Short Bio
§ BSc: University of Ioannina, Greece, 2003
§ PhD: Boston University, USA, 2009
§ 2009–2012: Postdoc, Data Mining Group, Aalto University, Finland
§ 2012–2013: Lecturer and director of the ITApps Programme, Birkbeck, University of London, UK
§ September 2013: Senior Lecturer at DSV
Course logistics
• Course webpage:
– https://ilearn2.dsv.su.se/course/view.php?id=225
• Schedule:
– Lectures: Nov 4 – Dec 17
– Exercise sessions: Nov 17, Dec 1, Dec 15
– Written Exam: Jan 14
– Re-exam: Feb 23
• Instructors:
– Panagiotis Papapetrou: [email protected]
– Lars Asker: [email protected]
– Henrik Boström: [email protected]
• Course Assistant:
– Jing Zhao: [email protected]
• Office hours: by appointment only
ILearn2
Topics to be covered
• Association Rules
• Clustering
• Data Representation
• Classification
• Similarity Matching
• Model Evaluation
• Time Series Analysis
• Ranking
Syllabus
Nov 4 Introduction to data mining
Nov 5 Association Rules
Nov 10, 14 Clustering and Data Representation
Nov 17 Exercise session 1 (Homework 1 due)
Nov 19 Classification
Nov 24, 26 Similarity Matching and Model Evaluation
Dec 1 Exercise session 2 (Homework 2 due)
Dec 3 Combining Models
Dec 8, 10 Time Series Analysis
Dec 15 Exercise session 3 (Homework 3 due)
Dec 17 Ranking
Jan 13 Review
Jan 14 EXAM
Feb 23 Re-EXAM
Course workload
• Homeworks: 3 hp
• Written Exam: 4.5 hp
• Online quizzes
Homework Assignments
• Three assignments (30 pts each, total 90 pts)
• 3–5 online quizzes (total of 10 + 20 pts)
• To be done individually
• Will involve some programming in R
• Three in-class exercise sessions
• Submissions:
– Before each exercise session
– No submissions allowed after that!
• Grade scheme: A–F
Quizzes
• 3–5 short online quizzes
• Material to be examined:
– The latest lecture
• Available at the end of each lecture and to be completed before the next lecture
• Only one attempt per quiz
• Will offer:
– 10 pts towards the Homework Assignments
– 20 pts BONUS towards the Homework Assignments
• No make-up quizzes are possible
To Pass the Course
• Pass the Homework Assignments
– at least 50/100 pts (including the BONUS pts from the quizzes)
• Pass the Written Exam
– at least 50/100 pts
• Ask questions
• Enjoy it :)
Learning Objectives
• Become familiar with fundamental data mining algorithms
• Be able to identify a correct algorithmic solution to a given data mining problem
• Be able to apply these algorithmic solutions to solve practical problems
• Be able to perform basic data mining tasks on real data using the R tool
Textbooks
Main:
Data Mining: Practical Machine Learning Tools and Techniques, Third Edition
Publisher: Morgan Kaufmann
Year: 2011
ISBN: 978-0123748560

Additional:
An Introduction to Statistical Learning with Applications in R
Publisher: Springer
Year: 2013
ISBN: 978-1-4614-7138-7
URL: http://www-bcf.usc.edu/~gareth/ISL/

Research papers (pointers will be provided)
Recommended prerequisites
• Basic algorithms: sorting, set manipulation, hashing
• Analysis of algorithms: O-notation and its variants, NP-hardness
• Programming: some programming knowledge, ability to do small experiments reasonably quickly
• Probability: concepts of probability and conditional probability, expectations, random walks
• Some linear algebra: e.g., eigenvector and eigenvalue computations
Above all
• The goal of the course is to learn and enjoy
• The basic principle is to ask questions when you don't understand
• Say when things are unclear; not everything can be clear from the beginning
• Participate in the class as much as possible
Introduction to data mining
• Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining tasks
Why do we need data analysis?
• Really, really large amounts of raw data!
– Moore's law: more efficient processors, larger memories
– Communications have improved too
– Measurement technologies have improved dramatically
– It is possible to store and collect lots of raw data
– The data analysis methods are lagging behind
• Need to analyze the raw data to extract knowledge
The data is also very complex
• Multiple types of data: tables, time series, images, graphs, etc.
• Spatial and temporal aspects
• Large number of different variables
• Lots of observations → large datasets
Example: transaction data
• Billions of real-life customers:
– COOP, ICA
– Tele2
• Billions of online customers:
– amazon
– expedia
Example: document data
• Web as a document repository: 50 billion web pages
• Wikipedia: 4 million articles (and counting)
• Online collections of scientific articles
Example: network data
• Web: 50 billion pages linked via hyperlinks
• Facebook: 200 million users
• MySpace: 300 million users
• Instant messenger: 1 billion users
• Blogs: 250 million blogs worldwide
Example: genomic sequences
• http://www.1000genomes.org/page.php
• Full sequence of 1000 individuals
• 3 billion nucleotides per person
• Lots more data in fact: medical history of the persons, gene expression data
Example: environmental data
• Climate data (just an example): http://www.ncdc.gov/oa/climate/ghcn-monthly/index.php
• "a database of temperature, precipitation and pressure records managed by the National Climatic Data Center, Arizona State University and the Carbon Dioxide Information Analysis Center"
• "6000 temperature stations, 7500 precipitation stations, 2000 pressure stations"
We have large datasets… so what?
• Goal: obtain useful knowledge from large masses of data
• "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data analyst"
• Tell me something interesting about the data; describe the data
What can data-mining methods do?
• Extract frequent patterns
– There are lots of documents that contain the phrases "Stockholm", "Housing" and "^#@$&^#$@"
• Extract association rules
– 80% of the ICA customers who buy beer and sausage also buy mustard
• Extract rules
– If occupation = PhD student, then Salary < 30,000 SEK
What can data-mining methods do?
• Rank web-query results
– What are the most relevant web pages for the query: "Student housing Stockholm University"?
• Find good recommendations for users
– Recommend amazon customers new books
– Recommend facebook users new friends/groups
• Find groups of entities that are similar (clustering)
– Find groups of facebook users that have similar friends/interests
– Find groups of amazon users that buy similar products
– Find groups of ICA customers that buy similar products
Goal of this course
• Describe some problems that can be solved using data-mining methods
• Discuss the intuition behind data mining methods that solve these problems
• Illustrate the theoretical underpinnings of these methods (this is very important!!)
• Show how these methods can be used in real application scenarios (this is also very important!!)
Data mining and related areas
• How does data mining relate to machine learning?
• How does data mining relate to statistics?
• Other related areas?
Data mining vs. machine learning
• Machine learning methods are used for data mining
– Classification, clustering
• Amount of data makes the difference
– Data mining deals with much larger datasets, and scalability becomes an issue
• Data mining has more modest goals
– Automating tedious discovery tasks
– Helping users, not replacing them
Data mining vs. statistics
• "Tell me something interesting about this data" – what else is this than statistics?
– The goal is similar
– Different types of methods
– In data mining one investigates lots of possible hypotheses
– Data mining is more exploratory data analysis
– In data mining there are much larger datasets → algorithmics/scalability is an issue
Data mining and databases
• Ordinary database usage: deductive
• Knowledge discovery: inductive
• New requirements for database management systems
• Novel data structures, algorithms and architectures are needed
Machine learning
The machine learning area deals with artificial systems that are able to improve their performance with experience
Supervised learning Experience: objects that have been assigned class labels Performance: typically concerns the ability to classify new (previously unseen) objects Unsupervised learning Experience: objects for which no class labels have been given Performance: typically concerns the ability to output useful characterizations (or groupings) of objects
Predictive vs. descriptive data mining

Examples of supervised learning (predictive):
• Email classification (spam or not)
• Customer classification (will leave or not)
• Credit card transactions (fraud or not)
• Molecular properties (toxic or not)

Examples of unsupervised learning (descriptive):
• find useful email categories
• find interesting purchase patterns
• describe normal credit card transactions
• find groups of molecules with similar properties
Data mining: input
• Standard requirement: each case is represented by one row in one table
• Possible additional requirements:
– only numerical variables
– all variables have to be normalized
– only categorical variables
– no missing values
• Possible generalizations:
– multiple tables
– recursive data types (sequences, trees, etc.)
An example: email classification
Rows are examples (observations); columns are features (attributes).

Ex.  All caps  No. excl. marks  Missing date  No. digits in From:  Image fraction  Spam
e1   yes       0                no            3                    0               yes
e2   yes       3                no            0                    0.2             yes
e3   no        0                no            0                    1               no
e4   no        4                yes           4                    0.5             yes
e5   yes       0                yes           2                    0               no
e6   no        0                no            0                    0               no
Data mining: output
• Interpretable representation of findings
– equations, rules, decision trees, clusters
• Examples:
– y = 25.0 + 5.4x1 − 2.2x2 + 1.3x3
– if x1 > 0.3 & x2 ≤ 8.1 then y = 0.1
– BuysMilk & BuysCereal → BuysJuices [Support: 0.05, Confidence: 0.85]
The Knowledge Discovery Process
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, "From Data Mining to Knowledge Discovery in Databases", AI Magazine 17(3): 37-54 (1996)
CRISP-DM: CRoss Industry Standard Process for Data Mining
Shearer C., "The CRISP-DM model: the new blueprint for data mining", Journal of Data Warehousing 5 (2000) 13-22 (see also www.crisp-dm.org)
CRISP-DM
• Business Understanding
– understand the project objectives and requirements from a business perspective
– convert this knowledge into a data mining problem definition
– create a preliminary plan to achieve the objectives
• Data Understanding
– initial data collection
– get familiar with the data
– identify data quality problems
– discover first insights
– detect interesting subsets
– form hypotheses for hidden information
CRISP-DM
• Data Preparation
– construct the final dataset to be fed into the machine learning algorithm
– tasks here include: table, record, and attribute selection, data transformation and cleaning
CRISP-DM
• Modeling
– various data mining techniques are selected and applied
– parameters are learned
– some methods may have specific requirements on the form of input data
– going back to the data preparation phase may be needed
CRISP-DM
• Evaluation
– the current model should have high quality from a data mining perspective
– before final deployment, it is important to test whether the model achieves all business objectives
• Deployment
– just creating the model is not enough
– the new knowledge should be organized and presented in a usable way
– generate a report
– implement a repeatable data mining process for the user or the analyst
Tools
• Many data mining tools are freely available
• Some options are:

Tool                   URL
WEKA                   www.cs.waikato.ac.nz/ml/weka/
Rule Discovery System  www.compumine.com
R                      www.r-project.org/
RapidMiner             rapid-i.com/

More options can be found at www.kdnuggets.com
Some simple data-analysis tasks
• Given a stream or set of numbers (identifiers, etc.)
• How many numbers are there?
• How many distinct numbers are there?
• What are the most frequent numbers?
• How many numbers appear at least K times?
• How many numbers appear only once?
• etc.
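All of these questions can be answered in a single pass with a frequency table. A minimal sketch in Python (the course homework uses R, but the idea is identical; the function name is illustrative):

```python
from collections import Counter

def stream_stats(numbers, k=2):
    """Answer the basic counting questions for a stream of numbers."""
    counts = Counter(numbers)                    # one pass over the stream
    total = sum(counts.values())                 # how many numbers are there?
    distinct = len(counts)                       # how many distinct numbers?
    most_frequent = counts.most_common(1)[0][0]  # the most frequent number
    at_least_k = sum(1 for c in counts.values() if c >= k)
    exactly_once = sum(1 for c in counts.values() if c == 1)
    return total, distinct, most_frequent, at_least_k, exactly_once

print(stream_stats([3, 1, 3, 2, 3, 1, 4], k=2))  # (7, 4, 3, 2, 2)
```

Note that the frequency table needs memory proportional to the number of distinct values; for truly massive streams one would switch to sketching/sampling techniques.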
Finding the majority element
• Given a stream of labeled elements, e.g.,
{C, B, C, C, A, C, C, A, B, C}
• Identify the majority element: the element that occurs more than 50% of the time
• How can you find it?
• … using no more than a few memory locations?
Counting sort
• Given a stream of labeled elements, e.g.,
{C, B, C, C, A, C, C, A, B, C}
• Count the number of objects that have each distinct key value
• Complexity: O(N + k)
– N: number of items
– k: range of items (largest − smallest)
• May be impractical for small N << k: the count array of size k dominates time and memory
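A minimal counting-sort sketch in Python (illustrative helper, not course code), showing where the O(N + k) cost comes from: one counter per possible key in the range:

```python
def counting_sort(items):
    """Sort integers by counting occurrences of each key value: O(N + k)."""
    lo, hi = min(items), max(items)
    k = hi - lo + 1                    # k: range of items (largest - smallest + 1)
    counts = [0] * k                   # one counter per possible key
    for x in items:                    # O(N): tally each item
        counts[x - lo] += 1
    out = []
    for key, c in enumerate(counts):   # O(k): emit each key in sorted order
        out.extend([key + lo] * c)
    return out

print(counting_sort([5, 3, 8, 3, 1]))  # [1, 3, 3, 5, 8]
```

When N << k (say, ten items drawn from a range of a billion), the O(k) pass and the size-k count array dominate, which is exactly the caveat above.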
Finding the majority element (Moore's Voting Algorithm)
• Complexity: O(N)
– N: number of items
• Can we do better?
– No! Unless we skip reading some items
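A sketch of the voting algorithm in Python: one pass, using only two memory locations (a candidate and a counter). A second verification pass is included because the first pass alone can return a non-majority element when no true majority exists:

```python
def majority_element(stream):
    """Return the element occurring more than 50% of the time, or None."""
    candidate, count = None, 0
    for x in stream:
        if count == 0:
            candidate, count = x, 1   # adopt a new candidate
        elif x == candidate:
            count += 1                # another vote for the candidate
        else:
            count -= 1                # pair this item off against the candidate
    # verification pass: needed when no majority element exists
    if candidate is not None and stream.count(candidate) * 2 > len(stream):
        return candidate
    return None

print(majority_element(list("CBCCACCABC")))  # C
```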
The Set Cover Problem
• A trickier data mining task…
• A common algorithmic problem…
• One of the MOST USEFUL problems in CS!
The Set Cover Problem
• The mayor of a city wants to place fire stations so as to cover each neighborhood
• Each fire station covers:
– its own neighborhood
– all adjacent ones
• Each fire station costs X SEK per month
Challenge:
• Where shall we place the fire stations so as to minimize the city's expenses?
The Set Cover Problem
• A set of objects
• Some sets T that cover the objects
• Find a collection of Ts that covers all objects!
• Find the smallest such collection!
Formal Definition
• Setting:
– Universe of N elements U = {U1, …, UN}
– A set of n sets T = {T1, …, Tn}
– Find a collection C of sets in T (C subset of T) such that C contains all elements from U
• Set-cover problem: Find the smallest collection C of sets from T such that all elements in the universe U are covered
• Solution?
Trivial algorithm
• Try all sub-collections of T
• Select the smallest one that covers all the elements in U
• The running time of the trivial algorithm is O(2^|T| |U|)
• This is way too slow
Formal Definition
• Set-cover problem: Find the smallest collection C of sets from T such that all elements in the universe U are covered
• The set cover problem is NP-hard
• Simple approximation algorithms with provable properties are available and very useful in practice
Greedy algorithm for set cover
• Select first the largest-cardinality set t from T
• Remove the elements of t from U
• Recompute the sizes of the remaining sets in T
• Go back to the first step (until U is empty)
The Greedy algorithm
• X = U
• C = {}
• while X is not empty do
– for all t ∈ T, let a_t = |t ∩ X|
– let t be such that a_t is maximal
– C = C ∪ {t}
– X = X \ t
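The greedy pseudocode above can be sketched in Python as follows. The sets T1, …, T6 in the slides come from a figure that is not reproduced here, so the usage example below is a small hypothetical instance:

```python
def greedy_set_cover(universe, sets):
    """Greedy set cover: `sets` maps a name to a set of elements;
    returns the names of the chosen sets, in selection order."""
    X = set(universe)   # X = U: the still-uncovered elements
    C = []              # C = {}: the chosen collection
    while X:            # while X is not empty
        # pick the set t with maximal a_t = |t ∩ X|
        best = max(sets, key=lambda t: len(sets[t] & X))
        if not sets[best] & X:
            raise ValueError("the universe cannot be covered")
        C.append(best)            # C = C ∪ {t}
        X -= sets[best]           # X = X \ t
    return C

# hypothetical instance: greedy picks A (3 new elements), then C (2 new)
cover = greedy_set_cover({1, 2, 3, 4, 5},
                         {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}})
print(cover)  # ['A', 'C']
```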
Recall… • We want to find a set of Ts such that we cover all the objects
• What would the greedy algorithm find?
Example
• Select the biggest set: T1; remove all elements covered by T1
Current solution: X = {T1}
• Select the next biggest set: T4; remove all elements covered by T4
Current solution: X = {T1, T4}
• Select the next biggest set: T5; remove all elements covered by T5
Current solution: X = {T1, T4, T5}
• Select the next biggest set: T6; done!
Current solution: X = {T1, T4, T5, T6}
Example
• What is the optimal solution?
• Recall: we want the smallest possible collection!
Greedy solution: X = {T1, T4, T5, T6}
An optimal solution: X* = {T3, T4, T5}
How can this go wrong?
• No global consideration of how good or bad a selected set is going to be…
• How good is the proposed greedy algorithm?
NP-hardness
• For an NP-hard problem, we cannot compute an optimal solution in polynomial time
• Do your best then…

Approximation Algorithms
• Find an algorithm that will return solutions that are guaranteed to be close to an optimal solution
• Constant factor approximation algorithms: SOL ≤ f · OPT for some constant f
– OPT: value of an optimal solution
– SOL: value of the solution that our algorithm returns
• The key to designing a polytime approximation algorithm is to obtain a good (lower or upper) bound on the optimal solution
• The general strategy (for an optimization problem) is:
– minimization: OPT ≤ SOL ≤ f · OPT, with f > 1
– maximization: f · OPT ≤ SOL ≤ OPT, with f < 1
How good is the greedy algorithm for the Set Cover Problem?
• Consider a solution I:
– Let a(I) be the cost of the approximate solution
– Let a*(I) be the cost of the optimal solution
– e.g., a*(I) is the minimum number of sets in T that cover all elements in U
• An algorithm for a minimization problem has approximation factor f if for all instances I we have that a(I) ≤ f · a*(I)
How about the set cover greedy algorithm?
• The greedy algorithm for set cover has approximation factor f = H(|smax|) = O(log |smax|), where smax is the largest set
• Proof: see CLR, "Introduction to Algorithms"
• Set cover cannot be approximated with a factor f better than O(log |smax|)
• What does that mean?
Today… • Why do we need data analysis?
• What is data mining?
• Examples where data mining has been useful
• Data mining and other areas of computer science and mathematics
• Some (basic) data mining prototype problems
Next time…
Nov 5: Association Rules
• Association rules

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Diaper, Coke}
{Beer, Bread} → {Milk}
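The support and confidence of such a rule can be computed by hand from the table above; a minimal Python sketch of the same computation (using exact fractions to avoid floating-point noise; the helper names are illustrative):

```python
from fractions import Fraction

# the market-basket transactions from the table above
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Diaper", "Beer"}))       # 3/5: the itemset appears in 3 of 5 baskets
print(confidence({"Diaper"}, {"Beer"}))  # 3/4: 4 baskets have Diaper, 3 of those also have Beer
```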
TODOs
• Online R tutorial:
– Install R
– Learn how to load files
– Learn how to use the help command
– Learn how to install packages
– Learn how to print basic data statistics
http://dist.stat.tamu.edu/pub/rvideos/