
Diane Wu Insight final demo


Recipe search

Make sense of recipes and bake like a pro

BakeSearch

• Classic chocolate chip cookies
• Patty’s best chocolate cookies
• Peanut butter cookies
• Sugar cookies with frosting
• Gooey butter cookies
• Banana pumpkin cookies
• Black and white cookies
• Halloween cookies

Disambiguating searches

Bigrams + Trigrams

Candidate labels
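The slide does not say how bigrams and trigrams become candidate labels. A minimal sketch, assuming labels are simply word n-grams that recur across recipe titles (the `candidate_labels` helper and its `min_count` cutoff are hypothetical, not taken from the talk):

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_labels(titles, min_count=2):
    """Count bigrams and trigrams over all titles and keep the recurring ones.

    Assumption: a phrase is a useful label if it appears in at least
    min_count titles' worth of n-grams. The real pipeline may differ.
    """
    counts = Counter()
    for title in titles:
        tokens = title.lower().split()
        counts.update(ngrams(tokens, 2))
        counts.update(ngrams(tokens, 3))
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}


# Titles taken from the BakeSearch example results.
titles = [
    "Classic chocolate chip cookies",
    "Patty's best chocolate cookies",
    "Peanut butter cookies",
    "Gooey butter cookies",
]
labels = candidate_labels(titles)
print(labels)
```

With these four titles, only "butter cookies" occurs twice, so it is the only phrase that survives the cutoff.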

Defining distance measure

Recipe 1: Ingr1, Ingr2, Ingr3, Ingr4
Recipe 2: Ingr4, Ingr9, Ingr12

Jaccard = (ingredients in both recipes) / (ingredients in either recipe)
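The ratio on this slide is the Jaccard similarity. A distance is commonly obtained as one minus that ratio, which is assumed here because the slide does not spell it out. A small sketch using the two example recipes (function names are illustrative):

```python
def jaccard_similarity(a, b):
    """Shared ingredients divided by ingredients in either recipe."""
    a, b = set(a), set(b)
    either = a | b
    if not either:
        # Convention chosen for this sketch: two empty recipes share nothing.
        return 0.0
    return len(a & b) / len(either)


def jaccard_distance(a, b):
    """Assumed distance: 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


recipe_1 = {"Ingr1", "Ingr2", "Ingr3", "Ingr4"}
recipe_2 = {"Ingr4", "Ingr9", "Ingr12"}

similarity = jaccard_similarity(recipe_1, recipe_2)
distance = jaccard_distance(recipe_1, recipe_2)
print(similarity, distance)
```

The two recipes share one ingredient (Ingr4) out of six distinct ingredients, so the similarity is 1/6 and the distance is 5/6.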

Cluster recipes based on ingredients

Challenges of big data

• Most clustering algorithms (k-means, hierarchical, graph-based) take >30 seconds

• Pre-calculate Jaccard distances between every pair of recipes (1.6 billion pairs!)


[Figure: histogram of # recipes (y-axis, 0 to 4000) against # ingredients in recipe (x-axis, 0 to 40)]


[Figure: histogram of # ingredients (y-axis, 0 to 900) against # recipes containing ingredient (x-axis, log scale, 1 to 10000)]


• MapReduce on Amazon EMR
• Preload into networkx graph
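The slides do not show the MapReduce job itself. One common way to structure it, sketched here in plain Python as an assumption rather than the author's code, is to map each recipe to (ingredient, recipe) pairs, group by ingredient, and count shared ingredients per recipe pair. Pairs that share no ingredient are never emitted, and their distance is implicitly 1.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(recipes):
    """Emit one (ingredient, recipe_id) pair per distinct ingredient."""
    for recipe_id, ingredients in recipes.items():
        for ingredient in set(ingredients):
            yield ingredient, recipe_id


def reduce_phase(pairs):
    """Group recipe ids by ingredient, then count shared ingredients per pair."""
    by_ingredient = defaultdict(list)
    for ingredient, recipe_id in pairs:
        by_ingredient[ingredient].append(recipe_id)
    shared = defaultdict(int)
    for recipe_ids in by_ingredient.values():
        for a, b in combinations(sorted(recipe_ids), 2):
            shared[(a, b)] += 1
    return shared


def pairwise_jaccard_distances(recipes):
    """Jaccard distance for every recipe pair sharing at least one ingredient."""
    sizes = {rid: len(set(ings)) for rid, ings in recipes.items()}
    distances = {}
    for (a, b), n_shared in reduce_phase(map_phase(recipes)).items():
        n_either = sizes[a] + sizes[b] - n_shared
        distances[(a, b)] = 1.0 - n_shared / n_either
    return distances


# Made-up sample data for illustration only.
recipes = {
    "r1": ["flour", "sugar", "butter"],
    "r2": ["flour", "sugar", "egg"],
    "r3": ["banana", "pumpkin"],
}
distances = pairwise_jaccard_distances(recipes)
print(distances)
```

The resulting `(a, b) -> distance` entries could then be preloaded as weighted edges, for example with networkx's `Graph.add_weighted_edges_from`, so that lookups at query time avoid recomputation.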

Find enriched/depleted ingredients

abs(log2 ratio) > 2
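The slide gives only the threshold. A sketch of one way to compute it, assuming the ratio compares how often an ingredient appears in a cluster against how often it appears in all recipes (the pseudocount is an added assumption to avoid dividing by zero or taking the log of zero):

```python
import math


def log2_ratios(cluster_recipes, all_recipes, pseudocount=1.0):
    """log2 of (fraction of cluster recipes with ingredient) over
    (fraction of all recipes with ingredient), for every ingredient."""

    def fraction(recipes, ingredient):
        hits = sum(1 for ingredients in recipes if ingredient in ingredients)
        return (hits + pseudocount) / (len(recipes) + pseudocount)

    ingredients = set().union(*all_recipes)
    return {
        ing: math.log2(fraction(cluster_recipes, ing) / fraction(all_recipes, ing))
        for ing in ingredients
    }


def enriched_or_depleted(ratios, threshold=2.0):
    """Keep ingredients whose absolute log2 ratio exceeds the threshold."""
    return {ing: r for ing, r in ratios.items() if abs(r) > threshold}


# Made-up data: 2 pumpkin recipes among 32 recipes that all use flour.
pumpkin_cluster = [{"flour", "pumpkin"}, {"flour", "pumpkin"}]
all_recipes = pumpkin_cluster + [{"flour"}] * 30

ratios = log2_ratios(pumpkin_cluster, all_recipes)
flagged = enriched_or_depleted(ratios)
print(flagged)
```

A positive ratio above the threshold marks an ingredient as enriched in the cluster, and a negative one below it marks the ingredient as depleted. A cutoff of 2 on the log2 scale corresponds to a fourfold difference in frequency.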

Domain-specific data munging

• Ingredients: NLTK dictionary
• Domain knowledge
• Unit parsing
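The talk does not describe its unit parser. As a rough illustration of what unit parsing involves, the sketch below splits an ingredient line into quantity, unit, and name. The `UNITS` table and the line format it accepts are assumptions made for this example.

```python
import re
from fractions import Fraction

# Hypothetical unit table; a real one would be much larger.
UNITS = {
    "cup": "cup", "cups": "cup",
    "tbsp": "tablespoon", "tablespoon": "tablespoon", "tablespoons": "tablespoon",
    "tsp": "teaspoon", "teaspoon": "teaspoon", "teaspoons": "teaspoon",
}

# Quantity is "2", "1/2", or a mixed number such as "2 1/2".
LINE = re.compile(r"^\s*(?P<qty>\d+(?:\s+\d+/\d+)?|\d+/\d+)\s+(?P<rest>.+)$")


def parse_ingredient(line):
    """Return (quantity, unit, name); quantity and unit are None if absent."""
    match = LINE.match(line)
    if not match:
        return None, None, line.strip()
    quantity = sum(Fraction(part) for part in match.group("qty").split())
    words = match.group("rest").split()
    unit = UNITS.get(words[0].lower().rstrip("."))
    name = " ".join(words[1:] if unit else words)
    return quantity, unit, name


print(parse_ingredient("2 1/2 cups flour"))
print(parse_ingredient("1/2 tsp. salt"))
print(parse_ingredient("3 eggs"))
```

Lines without a leading quantity, such as "salt to taste", are returned unchanged as the name.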

Tools

Back end
• Yummly API
• Python
  – PycURL
  – NLTK WordNet
• MySQL

Analysis
• NumPy, SciPy
• NLTK, networkx
• Python, R
• Amazon EMR

Front end
• HTML/CSS/JavaScript
• Twitter Bootstrap
• Flask
• Amazon AWS

Diane Wu

• PhD Genetics, Stanford University, CA
• BSc Computing Science, Simon Fraser, Canada
