diane-wu
Classic Chocolate chip cookies Patty’s best chocolate cookies
Peanut butter cookies Sugar cookies with frosting Gooey butter cookies Banana pumpkin cookies
Black and white cookies Halloween cookies
Disambiguating searches
Bigrams + Trigrams
Candidate labels
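One way the bigram and trigram step could work is sketched below. It is a minimal illustration, not the deck's actual code: the function names `ngrams` and `candidate_labels` are hypothetical, tokenization is a plain whitespace split, and the sample titles are taken from the cookie examples in the slides.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles):
    """Count bigrams and trigrams across recipe titles (hypothetical helper)."""
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()  # assumption: simple whitespace tokenizer
        counts.update(ngrams(tokens, 2))
        counts.update(ngrams(tokens, 3))
    return counts


titles = [
    "Classic chocolate chip cookies",
    "Peanut butter cookies",
    "Gooey butter cookies",
]
labels = candidate_labels(titles)
# Phrases shared by several titles rise to the top of the count.
print(labels.most_common(3))
```

Frequent n-grams are then natural candidates for labeling a group of similar recipes.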
Recipe 1: Ingr1 Ingr2 Ingr3 Ingr4
Recipe 2: Ingr4 Ingr9 Ingr12
Jaccard = (Ingredients in both recipes) / (Ingredients in either recipe)
Defining distance measure
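The measure can be sketched in a few lines of Python using the two example recipes from the slide. The function names are illustrative, the distance is taken as one minus the ratio (the usual convention), and treating two empty recipes as identical is an assumption made here.

```python
def jaccard_similarity(a, b):
    """Shared ingredients divided by all distinct ingredients in either recipe."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty recipes count as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Distance form: 0 for identical ingredient sets, 1 for no overlap."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

# One shared ingredient (Ingr4) out of six distinct ingredients.
similarity = jaccard_similarity(recipe_1, recipe_2)
distance = jaccard_distance(recipe_1, recipe_2)
```

For these two recipes the ratio is 1/6, so the distance is 5/6.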
Challenges of big data
• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds
• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)
[Figure: histogram of # ingredients in recipe (x-axis, 0 to 40) against # recipes (y-axis, 0 to 4000)]
[Figure: histogram of # recipes containing ingredient (x-axis, 1 to 10000) against # ingredients (y-axis, 0 to 900)]
• MapReduce on Amazon EMR
• Preload into networkx graph
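The pre-calculation step can be illustrated on a toy scale. The sketch below is an assumption-laden stand-in, not the deck's pipeline: it runs in a single process rather than as a MapReduce job, and it stores distances in a plain dict-of-dicts where the slides use a networkx graph. The names `pair_count` and `precompute_distances` and the three sample recipes are hypothetical. The pair count grows as n(n-1)/2, so a figure near 1.6 billion would correspond to roughly 57,000 recipes, although the slides do not state the collection size.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus (shared ingredients / ingredients in either recipe)."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty recipes are at distance 0
    return 1.0 - len(a & b) / len(union)


def pair_count(n):
    """Number of unordered pairs among n recipes: n(n-1)/2."""
    return n * (n - 1) // 2


def precompute_distances(recipes):
    """Build a symmetric adjacency map of pairwise distances.

    recipes maps a recipe name to its set of ingredients. The nested dict
    stands in for a weighted graph that is loaded once and then queried.
    """
    graph = {name: {} for name in recipes}
    for (name_a, ingr_a), (name_b, ingr_b) in combinations(recipes.items(), 2):
        d = jaccard_distance(ingr_a, ingr_b)
        graph[name_a][name_b] = d
        graph[name_b][name_a] = d
    return graph


recipes = {
    "recipe_1": {"Ingr1", "Ingr2", "Ingr3", "Ingr4"},
    "recipe_2": {"Ingr4", "Ingr9", "Ingr12"},
    "recipe_3": {"Ingr9", "Ingr12"},
}
graph = precompute_distances(recipes)
```

Once the distances are stored, a lookup such as `graph["recipe_2"]["recipe_3"]` costs a dictionary access instead of a fresh set comparison, which is the point of paying the quadratic cost up front.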
Tools
• Yummly API
• Python
– PycURL
– NLTK WordNet
• MySQL
Back end
• NumPy, SciPy
• NLTK, networkx
• Python, R
• Amazon EMR
Analysis
• HTML/CSS/JavaScript
• Twitter Bootstrap
• Flask
• Amazon AWS
Front end