8/8/2019 Google Books: The Metadata Mess

Three ways of using GBS
"Batch processing": data mining and "electronic philology"
"It's only reporters and computational linguists who care if [hit-count estimation] is really precise." Peter Norvig, Google
Text databases and the "new philologies":
The importance of language to social, intellectual, and political history & literary study
Coincides with the emergence of large-scale historical text databases
When did happiness replace felicity in the 17th c.?
Plotting the rise & fall of propaganda
How did liberalism spread in the early nineteenth-century European context?

Good enough for scholarship?
Will GBS be an adequate resource for scholarly needs now and in the future?
Depends on:
Quality of imaging
Reliability and robustness of search tools
Quality and reliability of metadata, e.g., date, edition history, author, subject classification, etc.
But GBS metadata are awful.

Quality Issues:
Botched Scans, OCR, &c.

Metadata Issues:
1899, annus mirabilis

Random Dates
Dates shown on the slide: 1905, 1900, 1848, 1888

The pervasiveness of misdatings
527 hits returned for "Internet" before 1950
Dates shown on the slide: 1899, 1905, 1878, 1905, 1946, 1905, 1905, 1905, 1939

Famous before their lifetime
182 hits reported for "Charles Dickens" before birthdate (1812)
Cf. Jimi Hendrix, 81; Led Zeppelin, 59; etc.
Dates shown on the slide: 1878, 1905, 1946, 1905, 1905
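The check behind counts like these can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` layout, the sample titles and years, and the function name `flag_anachronisms` are hypothetical, and only the 1812 birth year comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str        # hypothetical sample title
    listed_year: int  # publication date as given in the catalog metadata


def flag_anachronisms(records, earliest_possible_year):
    """Return records whose listed date precedes the earliest year in
    which the search term could plausibly have appeared in print."""
    return [r for r in records if r.listed_year < earliest_possible_year]


DICKENS_BIRTH_YEAR = 1812  # from the slide

# Hypothetical records mentioning "Charles Dickens"; not real GBS data.
sample = [
    Record("Sample volume A", 1799),
    Record("Sample volume B", 1850),
    Record("Sample volume C", 1805),
]

suspects = flag_anachronisms(sample, DICKENS_BIRTH_YEAR)
for r in suspects:
    print(f"{r.title}: listed as {r.listed_year}, before {DICKENS_BIRTH_YEAR}")
```

A flagged record is only a candidate for review: the listed date may be wrong, or the hit may refer to a different person of the same name.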

Ego-surfing, Edgar Cayce Style
"Our reputation precedes us"

The frequency of misdatings
Search on "candy bar" < 1920 yields 66 hits, 46 of them misdated (70%)
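The percentage on this slide is simple arithmetic on the two counts it reports. A minimal Python sketch, in which the function name `misdating_rate` is an illustrative choice and the counts 46 and 66 are taken from the slide:

```python
def misdating_rate(misdated, total):
    """Share of hits whose listed date is wrong, as a fraction of all hits."""
    if total <= 0:
        raise ValueError("total must be positive")
    return misdated / total


# Counts from the slide: 66 hits for "candy bar" before 1920, 46 misdated.
rate = misdating_rate(46, 66)
print(f"{rate:.1%}")  # 69.7%, which the slide rounds to 70%
```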

Classification Errors

The Pervasiveness of Misclassification
Classifications of first 10 hits for Tristram Shandy:
family and relationships (4)
fiction (4)
biography and autobiography (1)
Unlabeled (1)
(others classified as "music," "history," "literary collections")

The Pervasiveness of Misclassification
First 10 hits for Leaves of Grass classify it as:
Juvenile Nonfiction
Poetry
Fiction
Literary Criticism
Biography & Autobiography
Counterfeits and Counterfeiting

More bad metadata
Reader, I marketed him.

Other metadata issues
Books ascribed to authors of introductions, or given no author at all.
Titles linked to unrelated works.
Strange bedfellows

Who is to blame and what is to be done?
"We got the metadata from the libraries":
Yes, sometimes, but libraries didn't classify Hamlet as "antiques and collectibles" or Speculum as "Health & Fitness"
Libraries don't use BISAC headings like "Antiques and Collectibles" and "Health & Fitness" in the first place
And publishers didn't assign BISAC codes to books published before the 1980s

The world according to BISAC
Making shelf space for Bambi & Bullwinkle
and scrunching together Schiller, Petrarch & Verlaine
Squeezing the universal library into a suburban bookstore

Correcting the Problem
Google: "We're on it (but it isn't a first priority)"
Correcting errors as noticed (like bad scans)?
Crowd Sourcing?
But errors/bad metadata affect 000,000's of records
"Error correction" doesn't address poor & missing metadata, inconsistent/confusing/inappropriate classification schemes
Why should the metadata decisions be left to Google engineers?

Correcting the Problem
HathiTrust to the rescue?
But HathiTrust makes available only out-of-copyright works, has (relatively) limited computational resources
Why should Google have no obligations to do GBS right?
Google Book Search is "a tremendous public good for students, for teachers, for scholars, for everyone." Derek Slater, Google
But a public good implies a public trust