Google Books: The Metadata Mess

  • 8/8/2019 Google Books: The Metadata Mess

    Three ways of using GBS

    "Batch processing": data mining and "electronic philology"

    "It's only reporters and computational linguists who care if [hit-count estimation] is really precise." Peter Norvig, Google

    Text databases and the "new philologies"

    The importance of language to social, intellectual, and political history & literary study

    Coincides with the emergence of large-scale historical text databases

    When did happiness replace felicity in the 17th c.?

    Plotting the rise & fall of propaganda

    How did liberalism spread in the early nineteenth-century European context?

    Good enough for scholarship?

    Will GBS be an adequate resource for scholarly needs now and in the future?

    Depends on:

    Quality of imaging

    Reliability and robustness of search tools

    Quality and reliability of metadata, e.g., date, edition history, author, subject classification, etc.

    But GBS metadata are awful.

    Quality Issues:

    Botched Scans, OCR, &c.

    Metadata Issues:

    1899, annus mirabilis

    Random Dates

    (Dates called out on the slide: 1905, 1900, 1848, 1888)

    The pervasiveness of misdatings

    527 hits returned for "Internet" before 1950

    (Dates called out on the slide: 1899, 1905, 1878, 1905, 1946, 1905, 1905, 1905, 1939)

    Famous before their lifetime

    182 hits reported for "Charles Dickens" before birthdate (1812)

    Cf. Jimi Hendrix, 81; Led Zeppelin, 59, etc.

    (Dates called out on the slide: 1878, 1905, 1946, 1905, 1905)
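
    The "before their lifetime" test can be turned into a rough sanity check on catalog records: any volume that mentions a name but carries a date earlier than that person's birth is a candidate misdating. The sketch below is only an illustration. The Record type, the flag_anachronisms helper, and the sample records are all hypothetical and do not come from Google Books or any real API.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """A hypothetical catalog record: a title plus the year given in its metadata."""
    title: str
    year: int


def flag_anachronisms(records, earliest_plausible_year):
    """Return the records dated before the earliest year the search term could plausibly appear."""
    return [r for r in records if r.year < earliest_plausible_year]


# Invented sample data, for illustration only.
records = [
    Record("Volume A mentioning Charles Dickens", 1905),
    Record("Volume B mentioning Charles Dickens", 1799),
]

# Dickens was born in 1812, so a mention dated earlier is suspect.
suspects = flag_anachronisms(records, earliest_plausible_year=1812)
print([r.title for r in suspects])
```

    A check like this only surfaces candidates. Each flagged record still has to be inspected, since a flagged hit may also stem from a bad text match rather than a bad date.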

    Ego-surfing, Edgar Cayce Style

    "Our reputation precedes us"

    The frequency of misdatings

    Search on "candy bar" < 1920 yields 66 hits, 46 of them misdated (70%)
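
    The 70% figure is a rounding of 46 out of 66. A minimal worked check of that arithmetic follows; the misdating_rate function name is an illustrative choice and not part of the original slides.

```python
def misdating_rate(misdated, total):
    """Fraction of search hits whose catalog date is wrong."""
    if total <= 0:
        raise ValueError("total must be positive")
    return misdated / total


# Figures from the slide: 66 hits for "candy bar" before 1920, 46 of them misdated.
rate = misdating_rate(46, 66)
print(f"{rate:.1%}")  # prints 69.7%, which rounds to the 70% quoted on the slide
```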

    Classification Errors

    The Pervasiveness of Misclassification

    Classifications of first 10 hits for Tristram Shandy:

    family and relationships (4)

    fiction (4)

    biography and autobiography (1)

    Unlabeled (1)

    (others classified as "music," "history," "literary collections")

    The Pervasiveness of Misclassification

    First 10 hits for Leaves of Grass classify it as:

    Juvenile Nonfiction

    Poetry

    Fiction

    Literary Criticism

    Biography & Autobiography

    Counterfeits and Counterfeiting

    More bad metadata

    Reader, I marketed him.

    Other metadata issues

    Books ascribed to authors of introductions, or given no author at all.

    Other metadata issues

    Titles linked to unrelated works.

    Other metadata issues

    Strange bedfellows

    Who is to blame and what is to be done?

    "We got the metadata from the libraries":

    Yes, sometimes, but libraries didn't classify Hamlet as "Antiques and Collectibles" or Speculum as "Health & Fitness"

    Libraries don't use BISAC headings like "Antiques and Collectibles" and "Health & Fitness" in the first place

    And publishers didn't assign BISAC codes to books published before the 1980s

    The world according to BISAC

    Making space for Bambi & Bullwinkle

    and Schiller, Petrarch & Verlaine

    The world according to BISAC

    Making shelf space for Bambi & Bullwinkle

    and scrunching together Schiller, Petrarch & Verlaine

    Squeezing the universal library into a suburban bookstore

    Correcting the Problem

    Google: "We're on it (but it isn't a first priority)"

    Correcting errors as noticed (like bad scans)?

    Crowdsourcing?

    But errors/bad metadata affect 000,000's of records

    "Error correction" doesn't address poor & missing metadata, inconsistent/confusing/inappropriate classification schemes

    Why should the metadata decisions be left to Google engineers?

    Correcting the Problem

    HathiTrust to the rescue?

    But HathiTrust makes available only out-of-copyright works, has (relatively) limited computational resources

    Why should Google have no obligations to do GBS right?

    Google Book Search is "a tremendous public good for students, for teachers, for scholars, for everyone." Derek Slater, Google

    But a public good implies a public trust