HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS
The Million Book Project
Michael I. Shamos, Ph.D., J.D.
Director, Universal Library
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania, USA
Where is Pittsburgh?
The Universal Library
• Project of Carnegie Mellon University
• All published works of mankind digitized and online
• Instantly available
• Free to read
• In any language
• Anywhere in the world
• Searchable and browsable by humans and machines
• DEMO
Why Digitize?
• Books are inefficient carriers of information
• Heavy, expensive
• Environmentally harmful
• Linear, not hyperlinked
• Poorly indexed
• Not searchable
• Not easily transported
• MOST IMPORTANT: not everyone has every book
• IN FACT, no one has every book
How Do We Convey Information?
• Books
• Orally
• Observation
• Teaching (a combination of the above)
• The book is
– Information
– AND a physical carrier
• The information can be conveyed digitally
• We don’t CARE about the carrier
Objections to Digital Books
• People can’t read books from a screen
• Books are convenient
– You can carry them
– You can write in them
– You can put a place marker in them
– You can lend them to people
• Books are beautiful
• Books smell nice
How Many Books Are There?
• 1996 world published output: 800,000 books
• Total book titles ever published ~ 100M
• 1 book = 500 pp., 2000 char/page = 1 megabyte uncompressed (about 1 floppy disk)
– 10^8 books = 10^14 bytes = 100 terabytes
– Disk costs HK$10 per gigabyte
– 100 terabytes costs about HK$1 million
• Total books in WorldCat = 41,000,000
– Requires only 41 terabytes, HK$410,000
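The storage arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's own estimate: the function name `storage_estimate` is made up for illustration, and decimal units (1 GB = 10^9 bytes) are assumed.

```python
# Figures from the slide: 500 pages x 2000 characters, one byte per character.
BYTES_PER_BOOK = 500 * 2000  # 1,000,000 bytes, about 1 megabyte


def storage_estimate(n_books, hkd_per_gigabyte=10):
    """Return (terabytes, cost in HK$) for n_books uncompressed books.

    Assumes decimal units: 1 GB = 10**9 bytes, 1 TB = 10**12 bytes.
    """
    total_bytes = n_books * BYTES_PER_BOOK
    terabytes = total_bytes / 10**12
    cost_hkd = total_bytes / 10**9 * hkd_per_gigabyte
    return terabytes, cost_hkd


all_books = storage_estimate(10**8)        # every title ever published
worldcat = storage_estimate(41_000_000)    # titles in WorldCat
print(all_books)
print(worldcat)
```

Both results agree with the slide: 100 terabytes for about HK$1 million, and 41 terabytes for HK$410,000.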
We Can Store Everything
100 terabytes can store:
3,000,000,000 photographs (compressed)
100,000,000 books
10,000 movies
300 years of music
100 terabytes occupies 240 cubic feet on DVD
= 1 van 6 x 4 x 10 feet
We Can Send Everything
Human speech: 30 bits/sec
Gigabit Internet: 1,000,000,000 bits/sec
(This talk: < 1 millisecond including slides)
Feb. 2002 Fujitsu achieved 5 terabits per second on one optical fiber
100 terabytes = 800 terabits
It would take less than 3 minutes to transmit every book ever published
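The "less than 3 minutes" figure follows from dividing the size of the collection by the link speed. A minimal sketch, assuming a sustained rate with no protocol overhead (the helper name `transmit_seconds` is hypothetical):

```python
def transmit_seconds(n_bytes, bits_per_second):
    """Idealized transfer time: payload bits divided by link rate."""
    return n_bytes * 8 / bits_per_second


LIBRARY_BYTES = 100 * 10**12   # 100 terabytes = 800 terabits
FIBER_BPS = 5 * 10**12         # 5 terabits per second on one fiber
GIGABIT_BPS = 10**9            # gigabit Internet

fiber_seconds = transmit_seconds(LIBRARY_BYTES, FIBER_BPS)
gigabit_days = transmit_seconds(LIBRARY_BYTES, GIGABIT_BPS) / 86_400
print(fiber_seconds)   # seconds over the 5 Tbit/s fiber
print(gigabit_days)    # days over a 1 Gbit/s link
```

At 5 terabits per second the transfer takes 160 seconds; over a gigabit link the same collection would take a little over nine days.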
Why a Universal Library?
• The largest library in the world (U.S. Library of Congress) has less than 20% of all books
– Two hours to retrieve one book
– Must travel to Washington, DC
– No copying allowed
• Largest university library: 14 million (Harvard)
• Hong Kong University: 3 million
• Typical large U.S. university: 1 million
• Largest high school: 130,000 (Phillips Andover)
• Largest public high schools: 30,000 (U.S.)
• Average high school: 5,000 (U.S.)
Universal Library Goals
• Democratization of information
– Knowledge is power
• Education, distance learning
– “Library” for distance education
• Research, technology transfer
• Promotion of understanding
• Preservation of human culture
The Million Book Project
• A million books is a lot. CMU just reached 1 million.
• Idea: scan 1 million books in each of several countries. Make them available to everyone
• NSF provided $3 million to buy scanners for China and India
• China and India are each providing 500 full-time people for scanning
• Each country is scanning 1 million books over the next 3 years
• CMU is hosting, indexing, building infrastructure
Million Book Project Operation
Effect of the Million Book Project
• All books scanned (in many languages) will be available free to read to everyone over the Internet
• Many cultural artifacts and treasures are being scanned
• All works are fully keyword-indexed and searchable
• All participating countries will have complete copies (mirrors) of all content
• Knowledge will be available to all
Partners
• China
– Beijing University
– Chinese Academy of Science
– Fudan University
– Ministry of Education of China
– Nanjing University
– Shanghai Jiaotung University
– State Planning Commission of China
– Tsinghua University
– Zhejiang University
Partners
• India
– Arulmigu Kalasalingam College Of Engineering
– Goa University
– Indian Institute of Information Technology - Allahabad
– Indian Institute of Science
– International Institute of Information Technology - Hyderabad
– Shanmugha Arts, Science, Technology & Research Academy
– Tirumala Tirupati Devasthanams
– Maharashtra Industrial Development Corporation
– University of Pune
The Copyright Problem
• Compulsory License
– Owner CAN’T refuse; user MUST pay
– Limited in US (Music: 1.55¢/min, 8.0¢/song)
– Extensive compulsory licensing in Japan
• Flat-fee subscription (e.g. HBO)
• Free (subsidized by government)
• Public Lending Right (UK)
• “Buy” button
• Metered use (electric company)
• Micropayments
Roadblocks
• Biggest obstacle: librarians
• Belief that the project is too large
• No funding
– In the U.S., everyone assumes it is being done
– Outside the U.S., everyone assumes the U.S. is doing it
• Copyright
• Myriad of small independent digital libraries
Policy Challenges
• Convenience displaces quality (Gresham)
• What to digitize first?
• Suitable copyright law
• Economics (Who pays? Who gets?)
• Privacy
• Reliability of information
• Change in the nature of teaching, learning
LAYERED UL MODEL
[Figure: layered diagram of the Universal Library. Layer labels: BASELINE UL SERVICES; VALUE-ADDED SERVICES. Box labels: UNIVERSAL LIBRARY: DIGITIZED ITEMS; NAVIGATION TOOLS; RETRIEVER SERVICE; CUSTOM CATALOGS; HYPERTEXT GENERATORS; SEARCHERS; TRANSLATORS; NEWS AGENTS; ENCYCLOPEDIA; HUMAN USERS; DIRECT MACHINE USERS.]
The Universal Dictionary
• A glossary containing every word in every language, with a translation
• Use: indexing the Universal Library
• Now has 1 million words (26 languages)
• 2 million by February (50 languages)
• 3 million by May (80 languages)
Q&A
Multilingual Searching
• Find all documents containing “elephant”
• Find all documents about elephants
– Even if the word “elephant” does not occur in the document
• Translation, transliteration
– Book titles, works of art, proper names
– Idioms, colloquial phrases
Use of © Content
• Philosophy: must pay for use
– Authors, publishers must not lose
• Implied license
• Bulk licensing
• Compulsory licensing
The Universal Dictionary
• Lexicon of all words in all languages, with English translations, e.g.
• Obtained from
– Web dictionaries
– Scanning + OCR
– Publishers’ machine-readable form
• Uses:
– Indexing the Universal Library
– Machine translation
– Spelling correction
– Linguistic studies
Technological Challenges
• Input (scanning, digitizing, OCR)
• Data representation
– text, kset, notations, images, web pages
• Navigation and Search
• Multilingual Issues
• Output (voice, pictures, virtual reality)
• Synthetic documents
Navigation
• Keyword searching does not scale
– Imagine 10^6 hits
• Browsing, finding, searching, flying
• Fractal view
– Keys are granularity and connectivity
• View whole collections or one glyph
– Hyperbolic trees, virtual reality, discovered similarities
Hyperbolic Tree Navigation
Multilingual Issues
• Character sets
• Representations
Íîäà ôèçè÷åñêè íàõîäèòñÿ â çäàíèè Èçâåñòèé
Нода физически находится в здании Известий
• Multilingual navigation
• Translation assistance
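The two sample lines on this slide appear to be the same bytes read under two different character sets. The sketch below reproduces the effect; that the original bytes were Windows-1251 misread as Latin-1 is an assumption inferred from the garbled text, not something the slide states.

```python
# Cyrillic sentence from the slide.
text = "Нода физически находится в здании Известий"

# Assumption: the bytes were Windows-1251 (cp1251) but displayed as Latin-1.
raw = text.encode("cp1251")          # one byte per character
garbled = raw.decode("latin-1")      # wrong character set: accented Latin letters
restored = garbled.encode("latin-1").decode("cp1251")  # reinterpret correctly

print(garbled)
print(restored)
```

The bytes never change; only the assumed character set does. This is why a library that stores text without recording its encoding cannot reliably display or index it.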
UNIVERSAL LIBRARY STATUS
• >10,000 digital volumes
• Public-domain issues of the New York Times
• Portal to hundreds of other collections
• Art, music, video, Internet radio
• Magazines, newspapers, journals
• Installing 1.25 terabytes
Visit www.ulib.org
Language Identification
• Given a string x, which language(s) is it from?
– What language is “peogwir” from?
• Given x, which language(s) does it seem to be from?
– “contrefaçon” “dazs” “chalupa” “mbwewe”
• Character set may be unknown
• Brief input (e.g. single word)
• Intermixed languages
– “Zeitgeist Fever”
• Neologisms, slang, abbreviations
Generative Approach
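• Assume that the lexicon of a language L is generated by a probabilistic finite-state machine M_L
[Figure: tree-shaped state machine. From the START OF WORD state "<", edges to a, b, …, z carry the probability that a word starts with that letter (e.g. PROB THAT WORD STARTS WITH A, PROB THAT WORD STARTS WITH Z). Deeper edges carry conditional probabilities such as PROB (a|<a), PROB (>|<a), PROB (a|<z), PROB (z|<z), PROB (z|<za). The product of the probabilities along a path ending in ">" is the probability of the word, e.g. PROB (<aza>), PROB (<zaz>).]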
Problems
• Where do all the required probabilities come from?
• How can they all be stored?
• If string x does not actually occur in a language, its probability will be zero. Won’t work for neologisms or misspellings.
• “Moving trigrams” work
Generative Approach
• Let $p_L(y \mid x)$ be the probability that string x is followed by string y in language L (i.e., the probability that, given a prefix x, the suffix is y)
• Then $p_L(x)$, the probability that $x = {<}x_1 x_2 x_3 \dots x_n{>}$ was generated by L, is
$p_L(x_1 \mid {<})\, p_L(x_2 \mid {<}x_1)\, p_L(x_3 \mid {<}x_1 x_2)\, p_L(x_4 \mid {<}x_1 x_2 x_3) \cdots p_L(x_n \mid {<}x_1 x_2 x_3 \dots x_{n-1})\, p_L({>} \mid {<}x_1 x_2 x_3 \dots x_{n-1} x_n)$
• This computation requires huge memory, so approximate: assume $p_L(x_n \mid {<}x_1 x_2 x_3 \dots x_{n-1}) \approx p_L(x_n \mid x_{n-2} x_{n-1})$
• So $p_L(x) \approx p_L(x_3 \mid {<}x_1 x_2)\, p_L(x_4 \mid x_2 x_3) \cdots p_L({>} \mid x_{n-1} x_n)$
• Try it
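One way to try it is the toy sketch below, which scores a word under character-trigram models of two languages and picks the higher score. Everything beyond the trigram idea is an assumption made for illustration: the tiny word lists, the add-one smoothing (used so that unseen strings do not get probability zero), the `VOCAB_SIZE` constant, and the function names. A usable model would be trained on a full lexicon.

```python
import math
from collections import Counter

VOCAB_SIZE = 30  # assumed alphabet size used by the add-one smoothing


def train(words):
    """Count (context, next character) pairs; context is at most 2 characters."""
    pair_counts = Counter()
    context_counts = Counter()
    for word in words:
        padded = "<" + word + ">"          # "<" and ">" mark word boundaries
        for i in range(1, len(padded)):
            context = padded[max(0, i - 2):i]
            pair_counts[(context, padded[i])] += 1
            context_counts[context] += 1
    return pair_counts, context_counts


def log_prob(word, model):
    """Log of the product of smoothed trigram probabilities for one word."""
    pair_counts, context_counts = model
    padded = "<" + word + ">"
    total = 0.0
    for i in range(1, len(padded)):
        context = padded[max(0, i - 2):i]
        numerator = pair_counts[(context, padded[i])] + 1
        denominator = context_counts[context] + VOCAB_SIZE
        total += math.log(numerator / denominator)
    return total


def identify(word, models):
    """Return the name of the language whose model scores the word highest."""
    return max(models, key=lambda name: log_prob(word, models[name]))


# Hypothetical training data, far too small for real use.
MODELS = {
    "english": train(["the", "there", "then", "this", "that",
                      "with", "which", "when", "where", "weather"]),
    "german": train(["der", "die", "das", "und", "nicht",
                     "schon", "zeit", "geist", "zeitgeist", "schule"]),
}

print(identify("these", MODELS))
print(identify("geiz", MODELS))
```

Neither test word occurs in the training lists, yet "these" scores higher under the English counts and "geiz" under the German ones, because their trigrams resemble those seen in training. Working in log space avoids underflow when many small probabilities are multiplied.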
Searching Mathematics
$\int_0^\infty e^{-x^2} \sin(x^2)\,dx$
Has this integral ever been evaluated?
Searching Mathematics
$\int_0^\infty e^{-x^2} \sin(x^2)\,dx = \frac{\sqrt{\pi}\,\sqrt{2-\sqrt{2}}}{2^{9/4}}$
MATHEMATICA C.F.:
Integrate[
Times[Power[E,Times[
-1,Power[V1,2]]],
Sin[Power[V1,2]]],
{V1,0,Infinity}]
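The canonical form above names the integration variable V1, so the same integral written with x or t would map to one search key. The sketch below illustrates only that renaming step on nested tuples. The tuple encoding, the `CONSTANTS` set, and the function name are assumptions for illustration; this is not how Mathematica itself canonicalizes expressions, which involves more than renaming.

```python
CONSTANTS = {"E", "Pi", "Infinity"}  # assumed set of reserved symbol names


def canonicalize(expr, names=None):
    """Rename free symbols to V1, V2, ... in order of first appearance.

    Expressions are nested tuples of the form (head, arg1, arg2, ...).
    """
    if names is None:
        names = {}
    if isinstance(expr, tuple):
        head, *args = expr
        return (head, *[canonicalize(arg, names) for arg in args])
    if isinstance(expr, str) and expr not in CONSTANTS:
        return names.setdefault(expr, f"V{len(names) + 1}")
    return expr  # numbers and reserved constants are left unchanged


def integral(var):
    """The slide's integral written with an arbitrary variable name."""
    return ("Integrate",
            ("Times",
             ("Power", "E", ("Times", -1, ("Power", var, 2))),
             ("Sin", ("Power", var, 2))),
            ("List", var, 0, "Infinity"))


key_x = canonicalize(integral("x"))
key_t = canonicalize(integral("t"))
print(key_x == key_t)
```

Because both spellings reduce to the same tuple, that tuple could serve as an index key when searching a collection for a previously evaluated integral.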
Hierarchical Nature of Aboutness
• What does it mean to say that a book is “about” chemistry? Can a word be about chemistry?
• If one paragraph is about chemistry, is the book about chemistry?
• If the book is about chemistry, is every sentence in it about chemistry?
• Aboutness is central to cataloging and retrieval
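Aboutness Hierarchy
[Figure: hierarchy diagram. Node labels: Universe, Collection, Book, Newspaper, Article, Chapter, Section, Paragraph, Sentence, Word, Glyph, Photograph, Object, 3D Artifact. Two callouts mark levels of the hierarchy: KEYWORD SEARCHING OCCURS HERE and SUBJECT SEARCHING OCCURS HERE.]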
Thesauri and Aboutness
• A set of numbered thesaurus entries defines a topic
• Thesaurus is topic-hierarchical
• 1011 Hindrance
– 1011.5 barrier, bar, gate, fence, wall, rampart, dam, moat …
• A word is “about” any topic to which it belongs. Dam:
– 241.1 lake
– 293.7 close (v.)
– 560.11 mother
– 757.2 horse
– 856.11 put a stop to (v.)
– 1011.5 barrier
Thesaurus + aboutness hierarchy can be used to disambiguate meanings without “understanding”
Note: topic numbers are language independent
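A rough sketch of how topic numbers might disambiguate a word follows: pick the topic of the ambiguous word that is shared with the most neighboring words. The topic numbers for "dam", "fence", and "wall" come from the slide; the entries for "lake" and "mare", the voting rule, and the function name are assumptions added only to make the example run.

```python
from collections import Counter

TOPICS = {
    # From the slide: every topic to which "dam" belongs.
    "dam": {"241.1", "293.7", "560.11", "757.2", "856.11", "1011.5"},
    # From the slide: members of topic 1011.5 (barrier).
    "fence": {"1011.5"},
    "wall": {"1011.5"},
    # Hypothetical entries, not from the slide.
    "lake": {"241.1"},
    "mare": {"757.2"},
}


def disambiguate(word, context):
    """Return the topic of `word` shared with the most context words, or None."""
    candidates = TOPICS.get(word, set())
    votes = Counter()
    for other in context:
        for topic in candidates & TOPICS.get(other, set()):
            votes[topic] += 1
    return votes.most_common(1)[0][0] if votes else None


print(disambiguate("dam", ["fence", "wall"]))
print(disambiguate("dam", ["lake"]))
```

No meaning is modeled anywhere: the choice comes purely from set membership. Since the topic numbers are language independent, the same lookup could in principle run over word lists in any language.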
Set Theory of Aboutness
• Given a finite universe W of objects (e.g. all words)
• Define a topic $T \subseteq W$ to be a subset of W (a wordlist)
• Topic inclusion (defines the hierarchy):
– Topic T includes topic S iff $S \subseteq T$
• Definition of aboutness:
– A subset $P \subseteq W$ of the universe (e.g., a book) is about topic T iff $P \cap T \neq \emptyset$ (intersection is nonempty)
• Hierarchical nature of aboutness:
– If P is about S and T includes S, then P is also about T
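These definitions map directly onto Python sets. The sketch below uses made-up word lists (the topics and the "book" are hypothetical) to show the two definitions and the hierarchical property.

```python
def is_about(passage, topic):
    """P is about T iff the intersection of P and T is nonempty."""
    return bool(passage & topic)


def includes(topic, subtopic):
    """T includes S iff S is a subset of T."""
    return subtopic <= topic


# Hypothetical word lists for illustration only.
acids = {"acid", "base", "titration"}                  # topic S
chemistry = acids | {"atom", "molecule", "reaction"}   # topic T, includes S
book = {"the", "acid", "rain", "fell"}                 # passage P

print(is_about(book, acids))        # P shares "acid" with S
print(includes(chemistry, acids))   # S is a subset of T
print(is_about(book, chemistry))    # so P is also about T
```

The last result follows from the first two: any word that P shares with S also lies in T, so the intersection of P and T cannot be empty.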
We Can Search a Few Things
• Text
• In the Roman alphabet
• “Hidden” databases effectively unsearchable
• No images or two-dimensional structures
– math
– music
– dance notation . . .
• No subject index of photographs or art
– Corbis is one of the “best”