Data Mining in Practice:
Techniques and Practical Applications
Junling Hu
May 14, 2013
What is data mining?
Mining patterns from data
Is it statistics?
  Functional form?
  Computation speed concern?
  Data size
  Variable size
Is it machine learning?
  Big data issue
  New methods: network mining
Examples of data mining
Frequently bought together
Movie recommendation
More examples of data mining
Keyword suggestions
Genome & disease mining
Heart monitoring
Overview of data mining
Frequent pattern mining
Machine learning
  Supervised
  Unsupervised
Stream mining
Recommender system
Graph mining
Unstructured data
  Text, audio
  Image and video
Big data technology
Frequent Pattern Mining
Diaper and Beer
Product assortment
Click behavior
Machine breakdown
The case of Amazon
User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}

Count frequency of co-occurrence
Efficient algorithm
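Counting co-occurrence can be sketched in a few lines of Python. This is a brute-force illustration over the five baskets from the slide, not the efficient algorithm the slide alludes to; the names `pair_counts`, `frequent_pairs`, and the `min_support` threshold are assumptions introduced here.

```python
from collections import Counter
from itertools import combinations

# The five user baskets from the slide.
baskets = [
    {"princess dress", "crown", "gloves", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(baskets, min_support):
    """Keep only pairs that appear together in at least min_support baskets."""
    return {p: c for p, c in pair_counts(baskets).items() if c >= min_support}

counts = pair_counts(baskets)
print(counts[("crown", "gloves")])
print(frequent_pairs(baskets, min_support=4))
```

Enumerating every pair grows quadratically with basket size, which is why practical frequent-pattern miners prune candidates instead.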
Machine Learning Process
Machine Learning
Supervised
Unsupervised (clustering)
Binary classification
Checking  Duration (years)  Savings ($k)  Current Loans  Loan Purpose  Risky?
Yes       1                 10            Yes            TV            0
Yes       2                 4             No             TV            1
No        5                 75            No             Car           0
Yes       10                66            No             Repair        1
Yes       5                 83            Yes            Car           0
Yes       1                 11            No             TV            0
Yes       4                 99            Yes            Car           0
Input features: Checking, Duration, Savings, Current Loans, Loan Purpose
Output class: Risky?
Data point: one row of the table
Classification (1)
Decision tree
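A decision tree is grown by repeatedly choosing the test that best separates the classes. The sketch below scores candidate first splits on the loan data from the binary classification slide. Gini impurity is used as the criterion, which is an assumption (the slides do not name one), and only the categorical columns are tried; `gini`, `split_impurity`, and `best_categorical_split` are illustrative names.

```python
# Rows: (checking, duration_years, savings_k, current_loans, purpose, risky)
rows = [
    ("Yes", 1, 10, "Yes", "TV", 0),
    ("Yes", 2, 4, "No", "TV", 1),
    ("No", 5, 75, "No", "Car", 0),
    ("Yes", 10, 66, "No", "Repair", 1),
    ("Yes", 5, 83, "Yes", "Car", 0),
    ("Yes", 1, 11, "No", "TV", 0),
    ("Yes", 4, 99, "Yes", "Car", 0),
]

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure node)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(rows, column, value):
    """Size-weighted Gini impurity after testing rows[column] == value."""
    left = [r[-1] for r in rows if r[column] == value]
    right = [r[-1] for r in rows if r[column] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_categorical_split(rows, columns):
    """Try every (column, value) equality test and return the purest one."""
    candidates = {(c, r[c]) for c in columns for r in rows}
    return min(candidates, key=lambda cv: (split_impurity(rows, cv[0], cv[1]), cv))

# Columns 0, 3, 4 are Checking, Current Loans, Loan Purpose.
best = best_categorical_split(rows, columns=[0, 3, 4])
print(best, split_impurity(rows, *best))
```

A full tree learner would recurse on each side of the chosen split and would also consider thresholds on the numeric columns.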
Classification (2): Neural network
Perceptron
Multi-layer neural network
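The perceptron is the single threshold unit that multi-layer networks stack. A minimal sketch of the classic perceptron learning rule follows; the AND dataset, the learning rate, and the epoch count are assumptions chosen for illustration.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 when the weighted sum is positive."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron rule: move weights by lr * (target - prediction) * input."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so a single perceptron can learn it.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
print(predictions)
```

A single perceptron cannot represent functions that are not linearly separable (XOR is the usual example), which is one motivation for adding hidden layers.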
Head pose detection
Support Vector Machine (SVM)
Search for a separating hyperplane
Maximize margin
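For linearly separable data with labels $y_i \in \{-1, +1\}$, these two goals are commonly written as the hard-margin problem below. The symbols $w$, $b$, $x_i$, $y_i$ are notation introduced here for illustration and do not appear on the slide.

```latex
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w^\top x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The separating hyperplane is $w^\top x + b = 0$ and the margin between the classes is $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.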
Perceived advantage of SVM
Transform data into higher dimension
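A small sketch of why a higher-dimensional representation helps. The data and the feature map `lift` are made up for illustration, and the map is applied explicitly here, whereas SVMs typically achieve the same effect implicitly through a kernel function.

```python
# 1-D points: the positive class sits on both sides of the negative class,
# so no single threshold on x separates them.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

def lift(x):
    """Hypothetical feature map: send x to the 2-D point (x, x^2)."""
    return (x, x * x)

def separable_by_threshold(values, labels):
    """True if one cut point along this axis splits the two classes.

    Assumes no two points with the same value carry different labels.
    """
    sorted_labels = [label for _, label in sorted(zip(values, labels))]
    # Separable iff the label changes at most once along the sorted axis.
    changes = sum(a != b for a, b in zip(sorted_labels, sorted_labels[1:]))
    return changes <= 1

lifted = [lift(x) for x in points]
flat_ok = separable_by_threshold(points, labels)
lifted_ok = separable_by_threshold([z[1] for z in lifted], labels)
print(flat_ok, lifted_ok)
```

In the lifted space the horizontal line $x^2 = c$, for any $c$ between 0.25 and 4, is a separating hyperplane.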
Applications of SVM: Spam Filter
Input features:
Transmission
  IP address: 167.12.24.555
  Sender URL: one-spam.com
Email header
  From: “[email protected]”
  To: “undisclosed”
  cc
Email body
  # of paragraphs
  # of words
Email structure
  # of attachments
  # of links
Logistic regression
Advantages:
  Simple functional form
  Can be parallelized
  Large scale
Applications of logistic regression
Click prediction
Search ranking (web pages, products)
Online advertising
Recommendation

The model
  Output: click / no click
  Input features: page content, search keyword, user information
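A minimal sketch of a click model of this shape, fitted by plain batch gradient descent. The single feature (a keyword-to-page relevance score), the toy data, and the hyperparameters are all assumptions for illustration; a production model would use many features and a large-scale solver.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit p(click) = sigmoid(w * x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical feature: relevance score between search keyword and page.
scores = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
clicked = [0, 0, 0, 1, 1, 1]

w, b = train_logistic(scores, clicked)
p_relevant = sigmoid(w * 2.5 + b)     # estimated click probability, relevant page
p_irrelevant = sigmoid(w * -2.5 + b)  # estimated click probability, irrelevant page
print(p_relevant, p_irrelevant)
```

The gradient is a sum over examples, so it can be computed in parallel over data shards, which is one reason the method scales well.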
Regression
Linear regression
Non-linear regression
Application:
• Stock price prediction
• Credit scoring
• Employment forecast
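For the linear case with one input variable, the least-squares fit has a closed form. A minimal sketch with made-up data follows; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```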
History of Supervised learning
Semi-supervised learning
Application: Speech dialog system
Unsupervised learning: Clustering
No labeled data
Methods: K-means
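K-means alternates between assigning points to the nearest centroid and moving each centroid to the mean of its members. The sketch below uses made-up 2-D points and takes the starting centroids as an argument; real implementations usually pick them at random or with a seeding heuristic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: nearest-centroid assignment, then mean update."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Two made-up, well-separated groups of unlabeled points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```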
Categories of machine learning
Applications of Clustering
Malware detection
Document clustering: topic detection
Graphs in our life
Social network: friend recommendation
Molecular compound: drug discovery
Graph and its matrix representation
[Figure: undirected graph on nodes 1 to 6]

Adjacency matrix

     1  2  3  4  5  6
1    0  1  0  0  0  1
2    1  0  1  1  0  0
3    0  1  0  1  1  0
4    0  1  1  0  1  0
5    0  0  1  1  0  1
6    1  0  0  0  1  0
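The matrix can be built mechanically from an edge list. In this sketch the edge list is read off the slide's matrix, and `adjacency_matrix` is an illustrative helper name.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 matrix for an undirected graph on nodes 1..n."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u - 1][v - 1] = 1
        matrix[v - 1][u - 1] = 1  # undirected: mirror every edge
    return matrix

# Edges of the six-node graph, one entry per undirected edge.
edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]
A = adjacency_matrix(6, edges)

# Row sums give each node's degree.
degrees = [sum(row) for row in A]
for row in A:
    print(row)
print(degrees)
```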
The web graph
[Figure: Page 1, Page 2, and Page 3 connected by hyperlinks, each hyperlink labeled with its anchor text]
PageRank as a steady state
Transition matrix P =

     1     2     3     4     5     6
1    0     0.33  0.33  0     0     0.33
2    0.5   0     0.5   0     0     0
3    0.25  0.25  0     0.25  0.25  0
4    0     1     0     0     0     0
5    0     0     0.33  0.33  0     0.33
6    0.5   0     0     0     0.5   0
PageRank is a probability vector $\pi$ such that $\pi P = \pi$
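A steady state of this kind can be approximated by power iteration. The sketch below assumes the condition is $\pi P = \pi$ for a row vector $\pi$ (the usual convention when each row of P sums to 1), uses exact 1/3 where the slide shows the rounded 0.33, and omits the damping factor that PageRank is normally run with.

```python
T = 1.0 / 3.0  # the slide rounds this to 0.33

# Row-stochastic transition matrix: P[i][j] is the probability of moving
# from page i+1 to page j+1.
P = [
    [0.0, T, T, 0.0, 0.0, T],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.25, 0.25, 0.0, 0.25, 0.25, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, T, T, 0.0, T],
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
]

def step(rank, P):
    """One random-walk step: new_rank[j] = sum_i rank[i] * P[i][j]."""
    n = len(P)
    return [sum(rank[i] * P[i][j] for i in range(n)) for j in range(n)]

def steady_state(P, tol=1e-12, max_iter=10_000):
    """Power iteration from the uniform vector until it stops changing."""
    rank = [1.0 / len(P)] * len(P)
    for _ in range(max_iter):
        new_rank = step(rank, P)
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank

rank = steady_state(P)
residual = sum(abs(a - b) for a, b in zip(step(rank, P), rank))
print([round(r, 3) for r in rank])
```

Plain power iteration converges here because every page can reach every other page and the walk is not periodic; damping is what guarantees this on real web graphs.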
Discover influencers on Twitter
The Twitter graph
  Node
  Link
A PageRank approach: TwitterRank
[Figure: graph on nodes 1 to 5 with links labeled “Following”]
Facebook graph search
Entity graph
Natural language search: “Restaurants liked by my friends”
Recommending a game
Recommendation on a travel site
Prediction Problems
Rating prediction: given how a user rated other items, predict the user’s rating for a given item
Top-N recommendation: given the list of items liked by a user, recommend new items that the user might like
Explicit vs. Implicit Feedback Data
Explicit feedback
  Ratings and reviews
Implicit feedback (user behavior)
  Purchase behavior: recency, frequency, …
  Browsing behavior: # of visits, time of visit, time of staying, clicks
Collaborative Filtering Hypotheses
User/item similarities
  Similar users purchase similar items
  Similar items are purchased by similar users
Matching characteristics
  A match exists between the user’s and the item’s characteristics
User-User similarity
Users’ movie ratings

       Out of Africa  Star Wars  Air Force One  Liar, Liar
John   4              4          5              1
Adam   1              1          2              5
Laura  ?              4          5              2
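One way to fill in Laura's missing rating is to weight other users' ratings by how similar their tastes are to hers. The sketch below uses cosine similarity over co-rated movies, which is an assumption; the slide does not say which similarity measure is meant, and Pearson correlation is a common alternative.

```python
import math

# Ratings from the table; Laura has not rated "Out of Africa".
ratings = {
    "John":  {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam":  {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}

def user_similarity(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = sorted(set(ratings[a]) & set(ratings[b]))
    dot = sum(ratings[a][m] * ratings[b][m] for m in common)
    norm_a = math.sqrt(sum(ratings[a][m] ** 2 for m in common))
    norm_b = math.sqrt(sum(ratings[b][m] ** 2 for m in common))
    return dot / (norm_a * norm_b)

def predict_rating(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    neighbors = [u for u in ratings if u != user and movie in ratings[u]]
    weights = {u: user_similarity(user, u) for u in neighbors}
    total = sum(weights[u] * ratings[u][movie] for u in neighbors)
    return total / sum(weights.values())

sim_john = user_similarity("Laura", "John")
sim_adam = user_similarity("Laura", "Adam")
predicted = predict_rating("Laura", "Out of Africa")
print(sim_john, sim_adam, predicted)
```

Laura's ratings track John's more closely than Adam's, so the prediction is pulled toward John's rating of 4.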
Item-item similarity

       Out of Africa  Star Wars  Air Force One  Liar, Liar
John   4              4          5              1
Adam   1              1          2              5
Laura  ?              4          5              2
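The item-based view compares columns instead of rows: find the movies rated most like the target movie, then combine the user's own ratings of them. This is a sketch under the same cosine-similarity assumption, with illustrative function names.

```python
import math

movies = ["Out of Africa", "Star Wars", "Air Force One", "Liar, Liar"]
# Rows are users, columns follow `movies`; None marks Laura's missing rating.
table = {
    "John":  [4, 4, 5, 1],
    "Adam":  [1, 1, 2, 5],
    "Laura": [None, 4, 5, 2],
}

def item_similarity(i, j):
    """Cosine similarity between two movie columns, over users who rated both."""
    pairs = [(r[i], r[j]) for r in table.values()
             if r[i] is not None and r[j] is not None]
    dot = sum(a * b for a, b in pairs)
    norm_i = math.sqrt(sum(a * a for a, _ in pairs))
    norm_j = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_i * norm_j)

def predict_from_items(user, target):
    """Weight the user's own ratings by each movie's similarity to the target."""
    row = table[user]
    rated = [j for j, r in enumerate(row) if r is not None and j != target]
    sims = {j: item_similarity(target, j) for j in rated}
    return sum(sims[j] * row[j] for j in rated) / sum(sims.values())

target = movies.index("Out of Africa")
sims = {movies[j]: item_similarity(target, j)
        for j in range(len(movies)) if j != target}
closest = max(sims, key=sims.get)
prediction = predict_from_items("Laura", target)
print(closest, prediction)
```

Item similarities can be computed offline and reused across users, which is one reason this variant suits large catalogs.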
Application of item-item similarity
Amazon
SVD (Singular Value Decomposition)
Latent factors
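As a sketch of how latent factors come out of SVD: writing the user-by-item rating matrix as $R$ (a symbol introduced here for illustration), a rank-$k$ truncated SVD approximates it as

```latex
R \approx U_k \Sigma_k V_k^\top,
\qquad
\hat{r}_{ui} = \sum_{f=1}^{k} \sigma_f \, U_{uf} \, V_{if}
```

Each of the $k$ dimensions is one latent factor. Row $u$ of $U_k$ says how strongly user $u$ relates to each factor, row $i$ of $V_k$ does the same for item $i$, and $\sigma_f$ weights factor $f$. Real rating matrices have missing entries, so the factors are often fitted by other optimization methods rather than by an exact SVD.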
Application of Latent Factor Model
GetJar
Ranking-based recommendation
Application in LinkedIn
Ranking-based model