Google Confidential and Proprietary

Statistical Research in the Tech Industry: Google Suggest & Instant

Donal McMahon (donalmc@google.com)

Alternatively: Generating search query suggestions - natural language processing using Markov chains

Big questions
Data-driven Design

● Google has access to lots of data
  ○ search queries, emails, maps, social network data...

● How would you use it to improve its products?

● How would you know people liked these new products?
  ○ Best way to set up experiments?
  ○ What methods to evaluate performance?
  ○ What is the best way to balance privacy and usefulness?

Forty-one shades of blue!
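
One common way to answer "what methods to evaluate performance?" for a test between two variants is a two-proportion z-test on click-through rates. The sketch below is only an illustration of that idea: the function name and the click counts are hypothetical, and nothing here is taken from Google's actual evaluation pipeline.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for H0: both arms share one click rate."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that the arms do not differ.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal, written with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for two shades of blue (made-up numbers).
z, p = two_proportion_z_test(clicks_a=1000, views_a=50000,
                             clicks_b=1100, views_b=50000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With forty-one variants rather than two, the same test would be repeated many times, so some correction for multiple comparisons would be needed before trusting any single small p-value.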

Original "Suggest"

● Just on search boxes - when you type in "h" it autocompletes to "hotmail", "hulu", "home depot", etc...

● How would YOU generate reasonable suggestions?
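
One simple answer to the question above is to rank previously seen queries that start with the typed prefix by how often they were issued. The following is a minimal sketch under that assumption; the `suggest` function and the counts in `query_counts` are hypothetical and only reuse the example queries from the slide.

```python
from collections import Counter

def suggest(prefix, query_counts, k=3):
    """Return the k most frequent logged queries that start with prefix."""
    matches = [(q, c) for q, c in query_counts.items() if q.startswith(prefix)]
    # Sort by descending count, then alphabetically so ties are stable.
    matches.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in matches[:k]]

# Hypothetical query log counts; the numbers are made up.
query_counts = Counter({
    "hotmail": 900,
    "hulu": 400,
    "home depot": 350,
    "harry potter": 200,
    "gmail": 1000,
})
print(suggest("h", query_counts))
```

A linear scan like this is fine for a toy log; at scale a prefix tree or a sorted index would avoid touching every query on each keystroke.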

Why is it useful?

3 players:
- Google
- Advertisers
- Users

How would you generate suggestions?

● If you saw only one letter
  ○ "t"
  ○ What's the most likely next letter?

● When a second letter is entered
  ○ "th"
  ○ What would you guess then?

● How about after a couple of words?

● How would you build up a dictionary?
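
The next-letter questions above amount to a character-level Markov chain: count which letter follows each short context, then guess the most frequent one. The sketch below assumes a tiny made-up list of queries; the function names and the `queries` data are hypothetical.

```python
from collections import Counter, defaultdict

def train_next_letter(queries, order):
    """Count which character follows each context of `order` characters."""
    model = defaultdict(Counter)
    for q in queries:
        for i in range(order, len(q)):
            model[q[i - order:i]][q[i]] += 1
    return model

def most_likely_next(model, typed, order):
    """Most frequent character after the last `order` typed characters, or None."""
    context = typed[-order:]
    if len(context) < order or context not in model:
        return None
    return model[context].most_common(1)[0][0]

# Hypothetical training queries (made up for illustration).
queries = ["the weather", "the news", "thai food",
           "target", "twitter", "translate"]
first_order = train_next_letter(queries, order=1)
second_order = train_next_letter(queries, order=2)
print(most_likely_next(first_order, "t", 1))    # guess after "t"
print(most_likely_next(second_order, "th", 2))  # guess after "th"
```

Raising `order` makes each guess more specific but leaves more contexts unseen, which is the same trade-off that appears with word-level n-grams.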

How would you generate suggestions?

● What data would you use?

● Is there supplemental information you might have on the user?

● What about the location of the query?

● What about spam? How might it arrive and how to remove it?
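
One way to think about the location and spam questions together is to boost queries that are popular near the user and to ignore queries that come from too few distinct users. The sketch below is purely illustrative: the log format, the `min_users` threshold and the `local_weight` factor are assumptions, not a description of any production system.

```python
from collections import defaultdict

def build_counts(log, min_users=2):
    """Aggregate (user, location, query) events into global and local counts.

    Queries issued by fewer than min_users distinct users are dropped, a crude
    guard against one source repeating a query to promote it.
    """
    users = defaultdict(set)
    global_counts = defaultdict(int)
    local_counts = defaultdict(lambda: defaultdict(int))
    for user, location, query in log:
        users[query].add(user)
        global_counts[query] += 1
        local_counts[location][query] += 1
    keep = {q for q, u in users.items() if len(u) >= min_users}
    kept_global = {q: c for q, c in global_counts.items() if q in keep}
    kept_local = {loc: {q: c for q, c in qs.items() if q in keep}
                  for loc, qs in local_counts.items()}
    return kept_global, kept_local

def suggest_local(prefix, location, global_counts, local_counts,
                  local_weight=3.0, k=3):
    """Rank completions by global count plus a weighted local count."""
    local = local_counts.get(location, {})
    scores = {q: c + local_weight * local.get(q, 0)
              for q, c in global_counts.items() if q.startswith(prefix)}
    return sorted(scores, key=lambda q: (-scores[q], q))[:k]

# Hypothetical log: one "bot" repeats the same query five times.
log = [
    ("u1", "dublin", "dublin bus"),
    ("u2", "dublin", "dublin bus"),
    ("u3", "nyc", "delta"),
    ("u4", "nyc", "delta"),
    ("u5", "dublin", "delta"),
] + [("bot", "nyc", "dodgy pills")] * 5

global_counts, local_counts = build_counts(log)
print(suggest_local("d", "dublin", global_counts, local_counts))
print(suggest_local("d", "nyc", global_counts, local_counts))
```

The repeated query never surfaces because it comes from a single user, and the ordering of the remaining suggestions changes with the location passed in.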

Changes by location

Natural Language Processing

● Intersection between computer science, linguistics and statistics.

● "The goal of NLP is to design and build software that will analyze, understand, and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person."

● Some examples:
  ○ Automatic summarisation
  ○ Sentiment analysis
  ○ Speech recognition
  ○ Machine translation

N-grams

● An n-gram is a contiguous sequence of n items from a given sequence of text or speech.
  ○ bigram
  ○ trigram
  ○ ...
  ○ n-gram

● Build a predictive model for X_i based on X_{i-(n-1)}, X_{i-(n-2)}, ..., X_{i-1}

● P(X_i | X_{i-(n-1)}, X_{i-(n-2)}, ..., X_{i-1})
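
The conditional probability above can be estimated by maximum likelihood: count how often each word follows a given context of n-1 words and divide by the total count for that context. The sketch below assumes a toy sentence as the corpus; `train_ngram`, `cond_prob` and the `tokens` data are hypothetical names chosen for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count each word that follows every (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - (n - 1):i])
        counts[context][tokens[i]] += 1
    return counts

def cond_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 if context unseen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

# Toy corpus (made up); a real model would use far more text.
tokens = "the cat sat on the mat and the cat ran".split()
bigram = train_ngram(tokens, n=2)
trigram = train_ngram(tokens, n=3)
print(cond_prob(bigram, ["the"], "cat"))          # P(cat | the)
print(cond_prob(trigram, ["the", "cat"], "sat"))  # P(sat | the cat)
```

Unseen contexts get probability zero here, which is why practical n-gram models add smoothing or back off to shorter contexts.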

Google N-grams

● Not allowed to give most recent data - but you can guess!!

● As of 2006:
  ○ Processed 1,024,908,267,229 words of running text
  ○ Published the counts for all 1,176,470,663 five-word sequences that appear at least 40 times.
  ○ Data available in: http://www.ldc.upenn.edu/

File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

Google N-gram

Example of 3-gram data in corpus:

ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46

Example of 4-gram data in corpus:

serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
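
Counts like these can be turned directly into ranked suggestions for the word after "serve as the". The sketch below uses only the thirteen 4-gram rows shown above, so the probabilities are normalised over this excerpt rather than over the full corpus; `rank_completions` is a hypothetical helper name.

```python
# Counts for the word following "serve as the", copied from the excerpt above.
next_word_counts = {
    "incoming": 92, "incubator": 99, "independent": 794, "index": 223,
    "indication": 72, "indicator": 120, "indicators": 45,
    "indispensable": 111, "indispensible": 40, "individual": 234,
    "industrial": 52, "industry": 607, "info": 42,
}

def rank_completions(counts, partial="", k=3):
    """Return the top-k words starting with `partial`, with relative frequency.

    Frequencies are relative to the matching words only (excerpt-level, not
    corpus-level, probabilities).
    """
    matches = {w: c for w, c in counts.items() if w.startswith(partial)}
    total = sum(matches.values())
    top = sorted(matches.items(), key=lambda wc: (-wc[1], wc[0]))[:k]
    return [(w, c / total) for w, c in top]

# The user has typed "serve as the ind".
for word, prob in rank_completions(next_word_counts, partial="ind"):
    print(f"serve as the {word}: {prob:.3f}")
```

Note that the misspelling "indispensible" sits in the data next to the correct form, which hints at why suggestions only "somewhat" avoid spelling mistakes.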

N-gram viewer

Extending to Instant Search

Google Instant

● What's the actual effect for users?
● For the general economy?
● What about for the backend?

Areas here are suspicious!

Other Statistical Projects

● Run experiments to see if users, advertisers and we like potential feature launches
  ○ How best to assign to experiments?
  ○ How would you evaluate performance?

● The basic Backrub/Pagerank model
● The advertising auction
● How often to index the web?
● Predicting flu trends
● Self-driving cars
● ....
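
For the question of how to assign users to experiments, one widely used approach is to hash a user identifier together with the experiment name, so that assignment is deterministic, roughly balanced, and independent across experiments. The sketch below illustrates that idea only; the `assign_arm` function, the user IDs and the experiment name are hypothetical.

```python
import hashlib

def assign_arm(user_id, experiment, arms=("control", "treatment")):
    """Deterministically map a user to an arm by hashing user and experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# Hypothetical user IDs and experiment name.
counts = {"control": 0, "treatment": 0}
for i in range(10000):
    counts[assign_arm(f"user-{i}", "blue-shade-test")] += 1
print(counts)
```

Because the arm depends only on the hash, a returning user always sees the same variant, and including the experiment name in the key keeps one experiment's split from lining up with another's.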

Flu Trends

Thank you

Questions?

Jobs: http://www.google.com/about/jobs/

Contact: donalmc@google.com

Appendix

Why is this useful?

Users
● Speed for users
● Spelling mistakes avoided - somewhat
● Better search experience - how would we measure this?

Advertisers
● Don't have to think about keyword targeting
● Could use this same methodology in finding similar searches (kind of) - how would you extend it?

Google
● Fewer spurious searches
● Can store good results and serve them quicker

Self-driving Cars