Peter Brantley Internet Archive The Presidio 11.09

Books and Webs: Pulling the Down Rows


Elementary explanation of the difficulties of combining indexes for web pages and books, and means by which book index data can optimize general web searches at scale.



Page 2: Books and Webs: Pulling the Down Rows

Essential premise:

combining web search with book search is an engineering challenge

Page 3: Books and Webs: Pulling the Down Rows

I. Presenting combined search

Page 4: Books and Webs: Pulling the Down Rows

For several years, I served the University of California as the Director of Technology for the California Digital Library.

(the digital library group for the UC system)

Page 5: Books and Webs: Pulling the Down Rows

We held various conversations over time with Google engineers in similar spaces ...

grappling with the indexing, search, and user interface issues with combined but disparate content pools (books, journals, web, image, video).

(an important issue for digital libraries)

Page 6: Books and Webs: Pulling the Down Rows

In academic info markets, “metasearch” (distributed queries with central resolution) contested for primacy with search over aggregated content.

To an extent, only LANL and commercial search pursued aggregation at scale.

Aggregation wins.

Page 7: Books and Webs: Pulling the Down Rows

“Google is undertaking the most radical change to its search results ever, introducing a "Universal Search" system that will blend listings from its news, video, images, local and book search engines among those it gathers from crawling web pages.”

“With Universal Search, Google will hit a range of its vertical search engines, then decide if the relevancy of a result from book search is higher than a match from web page search.”

Danny Sullivan, “Google 2.0”, May 16 2007, Search Engine Land

Page 8: Books and Webs: Pulling the Down Rows

Simple search box ... but

User search intent for books vs. web can differ

“mark twain hawai’i”

Page 9: Books and Webs: Pulling the Down Rows

Google Scholar is a vertical search engine.

Explicit opt-in discovery service for STM (scientific, technical, medical) journal content, used in higher-education academia.

Many concerns about combining the Scholar product with Big Daddy: user search goals differ, the content is distinct, and the indexing is different.

Page 10: Books and Webs: Pulling the Down Rows

From 2007 to early 2009, I was the Director of the Digital Library Federation (DLF). I asked Google to update members on Google Book Search (GBS) status at DLF’s Fall Forum, Nov. 2008.

They issued an explicit request for higher-education CS/EE attention to the problem of integrating book and web search. Paraphrasing: “Not a well-solved problem”.

Page 11: Books and Webs: Pulling the Down Rows

Some comparisons between web pages and books.

Page 12: Books and Webs: Pulling the Down Rows

web: short doc (web page) length

books: long doc (book) length

Page 13: Books and Webs: Pulling the Down Rows

web: high data density (per doc size)

books: highly variant data density (e.g. fiction vs. non-fiction)

Page 14: Books and Webs: Pulling the Down Rows

web: trillions of unique web pages

books: (low) millions of unique books

Page 15: Books and Webs: Pulling the Down Rows

web: many complex media types

books: text and image media

Page 16: Books and Webs: Pulling the Down Rows

web: dynamic over time (avg. TTL of web pages is short)

books: static over time (print books permanently fixed)

Page 17: Books and Webs: Pulling the Down Rows

web: single instances (web pages)

books: duplicate instances (copies), similar instances (editions), in multiple languages

Page 18: Books and Webs: Pulling the Down Rows

web: hyperlinked in/out (useful in relevance)

books: normally quiescent (sometimes citations)

Page 19: Books and Webs: Pulling the Down Rows

web: designed component structure {page hierarchy > web site}

books: artificial component structure {page images > book}

Page 20: Books and Webs: Pulling the Down Rows

Bibliographic data cf. full text (book) data:

The Melvyl Recommender Project Full Text Extension

(Supplementary Report)
California Digital Library

October 2006

Funded by the Andrew W. Mellon Foundation

Page 21: Books and Webs: Pulling the Down Rows

Project Lead
  Peter Brantley, Director of Technology

Implementation Team
  Kirk Hastings, Text Systems Designer
  Martin Haye, Programmer (Contractor)
  Steve Toub, Web Design Manager
  Colleen Whitney, Programmer and Coordinator

Assessment Team
  Jane Lee, Assessment Analyst
  Felicia Poe, Assessment Coordinator
  Lisa Schiff, Digital Ingest Programmer

Page 22: Books and Webs: Pulling the Down Rows

Popular books often exist in many different editions, which can easily and artificially boost their search ranking (n_copies).

e.g. “Moby Dick” published 100s of times (and in many languages)

Depending on publication date: either public domain (depending on country) or in-copyright (out-of-print or in-print)
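One way to keep n_copies from inflating a result list is to collapse editions onto a single work before ranking. The sketch below is only an illustration: the `work_key` normalization and the `(title, author, score)` hit shape are assumptions made for this example, not the method used in any system described here, and real edition matching (translations, subtitles, variant author names) is much harder.

```python
import re

def _norm(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return " ".join(text.split())

def work_key(title, author):
    """Hypothetical work identifier: normalized title (minus a leading article) plus author."""
    title = re.sub(r"^(the|a|an) ", "", _norm(title))
    return (title, _norm(author))

def collapse_editions(hits):
    """hits: list of (title, author, score). Keep only the best-scoring edition per work."""
    best = {}
    for title, author, score in hits:
        key = work_key(title, author)
        if key not in best or score > best[key][2]:
            best[key] = (title, author, score)
    return sorted(best.values(), key=lambda hit: hit[2], reverse=True)

hits = [
    ("Moby Dick", "Herman Melville", 3.0),
    ("Moby-Dick", "Herman Melville", 4.0),
    ("MOBY DICK.", "Herman Melville", 2.5),
    ("Typee", "Herman Melville", 1.0),
]
collapsed = collapse_editions(hits)  # three "Moby Dick" editions become one entry
```

Keeping the best-scoring edition is one choice among several; summing or averaging scores per work would reintroduce some of the n_copies bias.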

Page 23: Books and Webs: Pulling the Down Rows

In CDL tests, for texts vs. bib records:

Search scores for full-text documents were typically 10 to 100 times larger than for metadata-only records.

(Probably about the same magnitude of difference cf. representative web pages.)

Page 24: Books and Webs: Pulling the Down Rows

Easy for a single work to overwhelm web pages in relevance for a well-fitting query.

E.g. “English working class labor industrial”

The making of the English working class.
Author: E P Thompson
Publisher: New York, Pantheon Books [1964, ©1963]

Page 25: Books and Webs: Pulling the Down Rows

Books are long strings of many words, split into n-sized chunks for parsing.

Term indexing based on overlapping and variable-length “word vectors”

“battle” “of” “britain”
“battle of” “britain”
“battle” “of britain”
“battle of britain”
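A minimal sketch of what the overlapping, variable-length terms above look like in code. The function names are invented for this illustration and are not taken from any indexer mentioned in the talk. `word_ngrams` lists every contiguous phrase up to a maximum length (the candidate index terms), and `segmentations` enumerates the ways a phrase can be cut into such terms, which is what the four lines above show.

```python
def word_ngrams(words, max_len):
    """All contiguous phrases of 1..max_len words, shortest first."""
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]

def segmentations(words):
    """All ways to cut a word list into contiguous phrases (2**(n-1) ways for n words)."""
    if not words:
        return [[]]
    result = []
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:]):
            result.append([head] + rest)
    return result

phrase = ["battle", "of", "britain"]
terms = word_ngrams(phrase, max_len=3)   # six distinct index terms
splits = segmentations(phrase)           # the four segmentations listed above
```

The number of segmentations doubles with each added word, so a real indexer would cap phrase length rather than enumerate everything.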

Page 26: Books and Webs: Pulling the Down Rows

{Search Term} and {Document} weights

1. How often is a search term found within a given sized chunk of text?

2. How many chunks of text is the term found within?

3. How many chunks of text does the document contain?
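The three questions above map onto three counts. The sketch below computes them for one document and combines them with a textbook TF-IDF-style weight. That combination is an assumption chosen for illustration, since the slides do not say which weighting formula was actually used, and `chunk_stats` and `tf_idf_per_chunk` are hypothetical names.

```python
import math
from collections import Counter

def chunk_stats(chunks, term):
    """chunks: list of word lists for one document.

    Returns (occurrences of term in each chunk,
             number of chunks containing the term,
             total number of chunks).
    """
    per_chunk = [Counter(chunk)[term] for chunk in chunks]
    containing = sum(1 for n in per_chunk if n > 0)
    return per_chunk, containing, len(chunks)

def tf_idf_per_chunk(chunks, term):
    """Illustrative weight: length-normalized term frequency times a smoothed inverse chunk frequency."""
    per_chunk, containing, total = chunk_stats(chunks, term)
    if containing == 0:
        return [0.0] * total
    idf = math.log(total / containing) + 1.0
    return [(n / len(chunk)) * idf if chunk else 0.0
            for n, chunk in zip(per_chunk, chunks)]

doc_chunks = [
    ["battle", "of", "britain"],
    ["the", "battle", "battle"],
    ["air", "war"],
]
stats = chunk_stats(doc_chunks, "battle")
weights = tf_idf_per_chunk(doc_chunks, "battle")
```

A book yields thousands of chunks where a web page yields a handful, so the third count alone already pushes the two kinds of document onto different scales.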

Page 27: Books and Webs: Pulling the Down Rows

Which is better?

1. Adequate matches over many fields,
2. Better matches in fewer fields.

Metrics vary between books and web.
One learns from one’s mistakes.
More books, more mistakes.

Page 28: Books and Webs: Pulling the Down Rows

1. Books are sooo much longer than web pages.
2. Books produce 1000s more chunks than web.
3. Term weighting is very complex for long docs.
4. Indexes must be integrated for web and books.
5. But source term indexes are biased differently.
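To make points 4 and 5 concrete: if raw scores from a book index run 10 to 100 times larger than scores from another index, sorting the merged list by raw score simply puts the books first. One simple remedy, shown here purely as an assumed illustration and not as what Google or CDL did, is to rescale each source's scores (z-scores below) before merging. `zscores` and `merge_by_zscore` are hypothetical names.

```python
from statistics import mean, pstdev

def zscores(scored):
    """scored: list of (doc_id, raw_score) from ONE index. Returns (doc_id, z-score) pairs."""
    if not scored:
        return []
    values = [score for _, score in scored]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [(doc, 0.0) for doc, _ in scored]
    return [(doc, (score - mu) / sigma) for doc, score in scored]

def merge_by_zscore(web_hits, book_hits):
    """Merge two result lists after putting each on its own standardized scale."""
    merged = [("web", doc, z) for doc, z in zscores(web_hits)]
    merged += [("book", doc, z) for doc, z in zscores(book_hits)]
    return sorted(merged, key=lambda hit: hit[2], reverse=True)

web = [("w1", 1.0), ("w2", 2.0), ("w3", 3.0)]
books = [("b1", 100.0), ("b2", 200.0), ("b3", 300.0)]  # same shape, 100x the scale
ranked = merge_by_zscore(web, books)
```

Standardizing removes the scale difference but not the bias in how each index weights terms, so it is at best a first step toward the integration problem described above.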

Page 29: Books and Webs: Pulling the Down Rows

II. What you get from books

Page 30: Books and Webs: Pulling the Down Rows

The dialectic between books and web provides benefits from their integration (no matter the pain).

Books enrich general web search, not just via the data within books, but also by books-as-data.

Page 31: Books and Webs: Pulling the Down Rows

All search is made smarter by analysis.

1. structure
2. contextualization
3. relatedness
4. normalization
5. association

Page 32: Books and Webs: Pulling the Down Rows

Because they are digitized, books have complications compared with web pages, a result of OCR:

1. Language detection
2. Determining which words get indexed (stop words like “of”, “a”, “the”, etc.)
3. OCR mistakes hamper word recognition

Page 33: Books and Webs: Pulling the Down Rows

Common OCR traps:

embedded languages
Latin or archaic spelling
complex scripts (e.g. captions)
hyphenated words

Page 34: Books and Webs: Pulling the Down Rows

ricain ricaine ricaines ricana ricanai ricains rical rically ricals

ricanant ricanante ricane ricamente ricanement ricanements rican ricanes ricans

Page 35: Books and Webs: Pulling the Down Rows

More words from more books, more spelling mistakes.

This is a good thing!

Leads to improved spelling correction (in multiple languages) and more sensitive translation.
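A small sketch of why more text helps spelling correction: with word counts from a large corpus, an unknown token can be replaced by the most frequent known word one edit away. This is a common frequency-based approach, shown as an assumption for illustration rather than the method any particular search engine uses. The toy counts stand in for counts gathered from millions of books.

```python
import string
from collections import Counter

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return word if known, else the most frequent known word one edit away, else word unchanged."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Toy corpus; real counts would come from the full text of many books.
counts = Counter("the whale the whale the sea whole".split())
fixed = correct("whzle", counts)  # both "whale" and "whole" are one edit away
```

The larger the corpus, the more reliable the counts that break ties between candidates, and the same table can be built per language.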

Page 36: Books and Webs: Pulling the Down Rows

“Our understanding of language is, in large part, built inductively from statistical analysis of large samples of language as used ‘in the wild,’ and the larger the sample, the better our understanding.”

- Hank Bromley, IA

Page 37: Books and Webs: Pulling the Down Rows

“Before the 1930’s, and even 40’s or 50’s in some parts, at harvest time, a horse or mule drawn wagon would go through the field, straddling two rows of corn. Adults working on each side of the wagon would pull the corn from the standing corn stalks and toss it into the wagon. The unfortunate younger ones would have to pull corn from the down rows – stoop labor in its worst form.” - JDB

Page 38: Books and Webs: Pulling the Down Rows

Statistical analysis (of which terms tend to appear in the vicinity of which others) is useful not only for context-sensitive OCR, but more significantly, for building semantic maps and other kinds of knowledge representation.

“dead as a door nail” – the term “door nail” is not commonly found elsewhere.

Page 39: Books and Webs: Pulling the Down Rows

Analysis via co-occurrence enables one to construct a better general search engine by enhancing the ability to distinguish among multiple meanings of a given word based on the context in which the word occurs.
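A minimal sketch of co-occurrence analysis, assuming a fixed-size word window and a very naive sense picker. The names `cooccurrence` and `pick_sense`, and the two toy sense profiles, are invented for this illustration. In practice the profiles would themselves be built from co-occurrence counts over a large corpus.

```python
from collections import Counter, defaultdict

def cooccurrence(tokens, window=2):
    """Count, for each word, the words appearing within `window` positions of it."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

def pick_sense(context, sense_profiles):
    """Choose the sense whose neighbour profile best matches the context words."""
    return max(sense_profiles,
               key=lambda sense: sum(sense_profiles[sense][w] for w in context))

neighbours = cooccurrence("dead as a door nail".split(), window=1)

# Hypothetical neighbour profiles for two senses of "bank".
bank_senses = {
    "river": Counter({"water": 3, "shore": 2}),
    "finance": Counter({"money": 4, "loan": 2}),
}
sense = pick_sense(["the", "money", "loan"], bank_senses)
```

The more text that feeds the counts, the sharper the profiles become, which is the sense in which book data improves general web search.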

Page 40: Books and Webs: Pulling the Down Rows

LSA (latent semantic analysis) is a CS term referring to a technique in “natural language processing ... of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.”

- Wikipedia.org

Page 41: Books and Webs: Pulling the Down Rows

(LSI, latent semantic indexing = LSA in the context of information retrieval (IR).)

“Clustering is a way to group documents based on their conceptual similarity to each other ... This is very useful when dealing with an unknown collection of unstructured text.”

Page 42: Books and Webs: Pulling the Down Rows

“Because it uses a strictly mathematical approach, LSI is inherently independent of language. This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.”

Page 43: Books and Webs: Pulling the Down Rows

“[Q]ueries can be made in one language, such as English, and conceptually similar results will be returned even if they are composed of an entirely different language or of multiple languages.”

Page 44: Books and Webs: Pulling the Down Rows

“LSI automatically adapts to new and changing terminology, and it has been shown to be very tolerant of noise (i.e., misspelled words, typographical errors, unreadable characters, etc.).

“This is especially important for applications using text derived from Optical Character Recognition (OCR) ...” - Wikipedia.org
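For readers who want the mathematics behind the quoted description, the standard formulation of LSA/LSI is a truncated singular value decomposition of the term-document matrix. The notation below is the usual textbook one and is not taken from the slides: X is the m-by-n matrix of term weights (m terms, n documents), k is the number of retained concepts, q is a query expressed as a term vector, and \hat{d}_j is the j-th row of V_k (document j in concept space).

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q,
\qquad
\mathrm{sim}(\hat{q}, \hat{d}_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Because a query and a document are compared in the k-dimensional concept space rather than term by term, a misspelled or OCR-damaged token only perturbs the vector slightly, which is consistent with the noise tolerance the quotation describes.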

Page 45: Books and Webs: Pulling the Down Rows

The More Data, The Better ...

The More Books, The Better Web Search.

Page 46: Books and Webs: Pulling the Down Rows

Contact information:

peter brantley
internet archive
@naypinya (twitter)
peter @ archive.org