Google Books Where we're going and how we got here Jon Orwant Engineering Manager Google Books


Google Books: Where we're going and how we got here

Jon Orwant, Engineering Manager, Google Books


Google Confidential and Proprietary

Overview

• Why and how Google scans books
• The Google Books settlement
• From pages to ideas


Why and How Google Scans Books

 


Google’s mission

Online content: billions of web pages

Offline content: billions of items becoming indexed

To organize the world’s information and make it universally accessible and useful.


Limited previews from publishers & authors


http://books.google.com


Google Books in a nutshell


Vital stats

Scans
• Number of books scanned: 15M+
• Number of pages: 4B
• Number of words: 2T
• Libraries: 40+
• Publishers: 30K+

Metadata
• Number of books: 130M
• Number of records: 4B
• Number of metadata fields: 1T


Identifying the book

             Library of Congress          Books in Print
title        Lord of the Rings, v.1       The Fellowship of the Ring
author       John Roland Reuel Tolkien    J.R.R. Tolkien
publisher    Houghton Mifflin             Ballantine Books
year         1954                         1994


How Google Handles Metadata

1. Collect data from 100+ sources (libraries, commercial aggregators, union catalogs, publishers, retailers)

2. Parse the records into our internal format: MARC, ONIX, others... "UVA stores item data and call numbers in 955$a..."

3. Cluster the records into expressions and manifestations

4. Create a "best of" record for each cluster

5. Index and display elements of that record on books.google.com
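The cluster and "best of" steps above can be sketched roughly as follows. This is a minimal illustration, not Google's pipeline: the function names (`cluster_key`, `best_of`, `build_clusters`), the clustering key (normalized title plus author surname), and the majority-vote merge rule are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict


def cluster_key(record):
    """Hypothetical clustering key: normalized title plus author surname."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower()).strip()
    surname = record["author"].split()[-1].lower()
    return (title, surname)


def best_of(records):
    """Assumed merge rule: keep the most common non-empty value per field."""
    merged = {}
    fields = {field for record in records for field in record}
    for field in sorted(fields):
        values = [r[field] for r in records if r.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged


def build_clusters(records):
    """Group records by key, then reduce each group to one 'best of' record."""
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_key(record)].append(record)
    return {key: best_of(group) for key, group in clusters.items()}


# Toy records standing in for data from different sources.
records = [
    {"title": "The Fellowship of the Ring", "author": "J.R.R. Tolkien",
     "publisher": "Ballantine Books", "year": "1994"},
    {"title": "The fellowship of the ring.", "author": "John Ronald Reuel Tolkien",
     "publisher": "Ballantine Books", "year": ""},
    {"title": "The Fellowship of the Ring", "author": "J.R.R. Tolkien",
     "publisher": "", "year": "1994"},
]
best = build_clusters(records)
```

A real system would need a far more forgiving key (different editions, translated titles, name variants); the exact-match key here only shows the shape of the computation.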


478 languages

Kashubian: 14; Kara-kalpak: 102; Kabyle: 50; Kachin: 18; Kalaallisut: 82; Kamba: 29; Kannada: 2600; Karen: 50; Kashmiri: 289; Kanuri: 25; Kawi: 106; Kazakh: 1871

Kabardian: 16; Khasi: 78; Khoisan: 53; Khotanese: 21; Kikuyu, Gikuyu: 48; Kinyarwanda: 77; Kirghiz, Kyrgyz: 702; Kimbundu: 14; Konkani: 83; Komi: 48; Kongo: 134; Korean: 35905

Kosraean: 10; Kpelle: 6; Karachay-balkar: 17; Karelian: 28; Kru: 26; Kurukh: 30; Kuanyama: 9; Kumyk: 16; Kurdish: 220; Kutenai: 0; Klingon: 3; Kalmyk: 26


Translit-aware similarity metrics for names and titles
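One way to picture a transliteration-aware comparison: fold both strings to a common Latin form, then apply an ordinary string similarity. The sketch below is an assumption-heavy illustration, not the metric used by Google Books. The `CYRILLIC_TO_LATIN` table is a tiny hand-picked subset, and `fold` and `similarity` are hypothetical helper names.

```python
import unicodedata
from difflib import SequenceMatcher

# Tiny illustrative Cyrillic-to-Latin table (an assumption; a real system
# would need complete, per-language transliteration rules).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "и": "i", "й": "y", "к": "k", "л": "l", "н": "n", "о": "o",
    "р": "r", "с": "s", "т": "t", "у": "u",
}


def fold(text):
    """Lowercase, transliterate known Cyrillic letters, strip diacritics."""
    # NFC first so that precomposed letters such as "й" match the table.
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)
    # NFKD splits accented Latin letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a, b):
    """Similarity ratio in [0, 1] computed on the folded forms."""
    return SequenceMatcher(None, fold(a), fold(b)).ratio()
```

With this folding, `similarity("Толстой", "Tolstoy")` and `similarity("Dvořák", "Dvorak")` both compare identical folded strings, while unrelated names score low.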


Material content & form

<datafield tag="245" ind1=" " ind2=" ">
  <subfield code="a">[Turkey probe]</subfield>
</datafield>

<datafield tag="260" ind1=" " ind2=" ">
  <subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield>
</datafield>

<datafield tag="300" ind1=" " ind2=" ">
  <subfield code="a">1 pointy thing , 46 cm. </subfield>
</datafield>

<datafield tag="650" ind1=" " ind2=" ">
  <subfield code="a">Microwave cookery</subfield>
</datafield>
<datafield tag="650" ind1=" " ind2=" ">
  <subfield code="a">April Fool's Day</subfield>
</datafield>
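A record like the one above can be read with Python's standard XML parser. The sketch below assumes the data fields are wrapped in a `<record>` root element with no XML namespace (real MARCXML normally declares one), and `fields_by_tag` is a hypothetical helper name. In MARC 21, 245 is the title statement, 260 the publication information, 300 the physical description, and 650 a topical subject heading.

```python
import xml.etree.ElementTree as ET

# The slide's fields, wrapped in an assumed namespace-free <record> root.
MARCXML = """<record>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">[Turkey probe]</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">1 pointy thing , 46 cm. </subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2=" ">
    <subfield code="a">Microwave cookery</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2=" ">
    <subfield code="a">April Fool's Day</subfield>
  </datafield>
</record>"""


def fields_by_tag(xml_text):
    """Map each MARC tag to a list of (subfield code, text) pairs."""
    root = ET.fromstring(xml_text)
    result = {}
    for datafield in root.iter("datafield"):
        tag = datafield.get("tag")
        for subfield in datafield.findall("subfield"):
            text = (subfield.text or "").strip()
            result.setdefault(tag, []).append((subfield.get("code"), text))
    return result


fields = fields_by_tag(MARCXML)
```

Repeatable fields such as 650 come back as multiple entries under the same tag, which is why the values are lists rather than single strings.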


Cover generation


Parsing Uncertain Dates

• 18??
• [196-?]
• 1957/8
• late 14th century
• finita quarto nonas Januarias [1490]
• mense Septembri: Anno Millesimo q[ui]ngentesimo decimonono
• mense iulio, anno M.D.XXXX
• התשנ״א (Hebrew year 5751 = Gregorian 1990/1 CE)
• ١٣٧٣ (either Islamic year 1373 AH = Gregorian 1953/4 CE or Persian year 1373 AP = Gregorian 1994/5 CE)
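A few of the patterns above can be turned into (earliest, latest) year ranges with simple rules. This is only a sketch under stated assumptions: `parse_year_range` and `roman_to_int` are hypothetical names, the rules cover just the numeric, century, and Roman-numeral forms, and the spelled-out Latin, Hebrew, and Arabic-script examples are deliberately left unhandled (the function returns `None`).

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral, tolerating additive forms like XXXX."""
    values = [ROMAN[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtractive (e.g. IV).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def parse_year_range(text):
    """Return (earliest, latest) Gregorian years, or None if unrecognized."""
    s = text.strip()
    m = re.search(r"\[(\d{4})\]", s)          # supplied year: "... [1490]"
    if m:
        return (int(m.group(1)),) * 2
    s = s.strip("[]")                         # "[196-?]" becomes "196-?"
    m = re.fullmatch(r"(\d{2})\?\?", s)       # unknown decade and year: "18??"
    if m:
        return (int(m.group(1)) * 100, int(m.group(1)) * 100 + 99)
    m = re.fullmatch(r"(\d{3})-\??", s)       # unknown year: "196-?"
    if m:
        return (int(m.group(1)) * 10, int(m.group(1)) * 10 + 9)
    m = re.fullmatch(r"(\d{4})/(\d{1,2})", s)  # split year: "1957/8"
    if m:
        start, suffix = m.group(1), m.group(2)
        return (int(start), int(start[: -len(suffix)] + suffix))
    m = re.fullmatch(r"\d{4}", s)             # plain year
    if m:
        return (int(s),) * 2
    # Century: the "late"/"early" qualifier is ignored in this sketch.
    m = re.search(r"(\d{1,2})(?:st|nd|rd|th) century", s)
    if m:
        n = int(m.group(1))
        return ((n - 1) * 100 + 1, n * 100)
    # Uppercase Roman numeral, optionally dotted: "M.D.XXXX"
    m = re.search(r"\b([MDCLXVI]+(?:\.[MDCLXVI]+)*)\b", s)
    if m:
        return (roman_to_int(m.group(1).replace(".", "")),) * 2
    return None
```

Returning a range instead of a single year keeps the uncertainty explicit, so "18??" and "1957/8" can still be sorted and compared without pretending to more precision than the record has.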


Annotations


The Google Books Settlement 


Google Books Settlement

If approved, resolves lawsuit brought against Google by AAP & AG

Benefits:
o Rightsholder control
o Snippets => 20%
o Library subscriptions
o Free terminal in every US public library building
o Downloadable books for purchase
o Access for the print-disabled
o Book Rights Registry: a non-profit organization to find and pay rightsholders
o Research corpus


Linguistic Analysis

"Research that performs linguistic analysis over the Research Corpus to understand language, linguistic use, semantics and syntax as they evolve over time and across different genres or other classifications of Books."


From Pages to Ideas 


Books as a corpus of human knowledge

• Understand one book
• Understand all books
• Understand relations between books


Insights into human progress

Source: Matthew Gray & Yuan K. Shen

Old-fashioned trigrams
• oxide of lead
• may be thus
• a heavy fire
• a striking proof
• miles distant from
• terms of peace
• presents the appearance
• more than mortal
• vexation of spirit
• zeal and devotion

New-fangled trigrams
• lesbian and gay
• health care professionals
• abuse and neglect
• the overall process
• shift away from
• the power elite
• a research project
• the poor countries
• probability of failure
• increased awareness of
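A comparison of this kind can be approximated by counting word trigrams in texts from two periods and keeping those that appear in one period but never in the other. The slides do not say how the lists were actually computed, so the sketch below is an assumption throughout: `trigrams` and `distinctive` are hypothetical names, the tokenization is a plain whitespace split, and the sample sentences are invented around phrases from the slide.

```python
from collections import Counter


def trigrams(text):
    """Count word trigrams using naive lowercase whitespace tokenization."""
    words = text.lower().split()
    return Counter(zip(words, words[1:], words[2:]))


def distinctive(period_a, period_b, top=3):
    """Trigrams seen in period_a but never in period_b, most frequent first."""
    counts_a, counts_b = Counter(), Counter()
    for text in period_a:
        counts_a.update(trigrams(text))
    for text in period_b:
        counts_b.update(trigrams(text))
    only_a = Counter({g: n for g, n in counts_a.items() if g not in counts_b})
    return [" ".join(gram) for gram, _ in only_a.most_common(top)]


# Invented toy corpora; real input would be book text grouped by date.
old_texts = [
    "the terms of peace were a striking proof of zeal and devotion",
    "terms of peace presents the appearance",
]
new_texts = [
    "health care professionals report increased awareness of abuse and neglect",
    "a research project on health care professionals",
]
```

Here `distinctive(old_texts, new_texts, top=1)` picks out "terms of peace" and the reverse call picks out "health care professionals". At corpus scale, a frequency-ratio threshold would likely work better than the strict "never appears" rule used here.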


Semantic Stack


Semantic Stack (video remix)

 


Reframing the Victorians (Cohen & Gibbs, GMU)

 


Victorian terms   

 


Discipline-specific progress occurs by...

...moving up one level

...or improving the results at one level by creating a reusable data set

...or reasonably using one level as a proxy for a higher level


Reframing the Victorians

...reasonably using one level as a proxy for a higher level


Interdisciplinary progress occurs by...

...moving up one level

...or improving the results at one level

...by creating infrastructure that can be used by others


Intralanguage translations (Efron, U. Illinois)

Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques


Intralanguage translations

improving the results at one level

...by creating infrastructure that can be used by others


Grammar inference (Abney & Szymanski, Univ. Michigan)

Automatic Identification and Extraction of Structured Linguistic Passages in Texts


Grammar inference

moving up one level

...by creating infrastructure that can be used by others


The "Great Man" theory

 


Thank You!

Q&A