HONG KONG UNIVERSITY JANUARY 2003 © 2003 MICHAEL I. SHAMOS

The Million Book Project

Michael I. Shamos, Ph.D., J.D.

Director, Universal Library

School of Computer Science

Carnegie Mellon University

Pittsburgh, Pennsylvania, USA


Where is Pittsburgh?

The Universal Library

• Project of Carnegie Mellon University

• All published works of mankind digitized and online

• Instantly available

• Free to read

• In any language

• Anywhere in the world

• Searchable and browsable by humans and machines

• DEMO

Why Digitize?

• Books are inefficient carriers of information

• Heavy, expensive

• Environmentally harmful

• Linear, not hyperlinked

• Poorly indexed

• Not searchable

• Not easily transported

• MOST IMPORTANT: not everyone has every book

• IN FACT, no one has every book

How Do We Convey Information?

• Books

• Orally

• Observation

• Teaching (a combination of the above)

• The book is

– Information

– AND a physical carrier

• The information can be conveyed digitally

• We don’t CARE about the carrier

Objections to Digital Books

• People can’t read books from a screen

• Books are convenient

– You can carry them

– You can write in them

– You can put a place marker in them

– You can lend them to people

• Books are beautiful

• Books smell nice

How Many Books Are There?

• 1996 world published output: 800,000 books

• Total book titles ever published ~ 100M

• 1 book = 500 pp., 2000 char/page = 1 megabyte uncompressed (about 1 floppy disk)

– 10⁸ books = 10¹⁴ bytes = 100 terabytes

– Disk costs HK$10 per gigabyte

– 100 terabytes costs about HK$1 million

• Total books in WorldCat = 41,000,000

– Requires only 41 terabytes, HK$410,000
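The storage arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's own estimate: it assumes one byte per character, decimal units (1 terabyte = 10¹² bytes), and the quoted 2003 disk price. The function and constant names are illustrative.

```python
# Back-of-the-envelope storage estimate using the figures quoted on the slide.
PAGES_PER_BOOK = 500
CHARS_PER_PAGE = 2000        # assumption: one byte per character, uncompressed
HKD_PER_GIGABYTE = 10        # disk price quoted on the slide (2003)


def storage_estimate(num_books):
    """Return (total_bytes, terabytes, cost_hkd) for num_books average books."""
    bytes_per_book = PAGES_PER_BOOK * CHARS_PER_PAGE   # 1,000,000 bytes = 1 MB
    total_bytes = num_books * bytes_per_book
    terabytes = total_bytes / 10**12
    cost_hkd = total_bytes / 10**9 * HKD_PER_GIGABYTE
    return total_bytes, terabytes, cost_hkd


all_books = storage_estimate(100_000_000)   # ~100M titles ever published
worldcat = storage_estimate(41_000_000)     # titles in WorldCat

print(all_books)   # (100000000000000, 100.0, 1000000.0)
print(worldcat)    # (41000000000000, 41.0, 410000.0)
```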

We Can Store Everything

100 terabytes can store:

3,000,000,000 photographs (compressed)

100,000,000 books

10,000 movies

300 years of music

100 terabytes occupies 240 cubic feet on DVD

= 1 van 6 x 4 x 10 feet

We Can Send Everything

Human speech: 30 bits/sec

Gigabit Internet: 1,000,000,000 bits/sec

(This talk: < 1 millisecond including slides)

Feb. 2002 Fujitsu achieved 5 terabits per second on one optical fiber

100 terabytes = 800 terabits

It would take less than 3 minutes to transmit every book ever published
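The "less than 3 minutes" figure follows directly from the numbers on this slide. A minimal sketch, assuming the full 5 terabits per second is sustained and ignoring protocol overhead; the gigabit comparison is added for scale and is not a claim from the slide.

```python
# Transmission time for 100 terabytes at the two link speeds quoted on the slide.
TOTAL_BYTES = 100 * 10**12            # 100 terabytes (every book ever published)
FIBER_BITS_PER_SEC = 5 * 10**12       # 5 terabits/s (Fujitsu, Feb. 2002)
GIGABIT_BITS_PER_SEC = 10**9          # gigabit Internet


def transfer_seconds(num_bytes, bits_per_sec):
    """Idealized transfer time: payload bits divided by link rate."""
    return num_bytes * 8 / bits_per_sec


fiber_seconds = transfer_seconds(TOTAL_BYTES, FIBER_BITS_PER_SEC)
gigabit_days = transfer_seconds(TOTAL_BYTES, GIGABIT_BITS_PER_SEC) / 86400

print(fiber_seconds)             # 160.0 seconds, i.e. under 3 minutes
print(round(gigabit_days, 2))    # about 9.26 days on a single gigabit link
```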

Why a Universal Library?

• The largest library in the world (U.S. Library of Congress) has less than 20% of all books

– Two hours to retrieve one book

– Must travel to Washington, DC

– No copying allowed

• Largest university library: 14 million (Harvard)

• Hong Kong University: 3 million

• Typical large U.S. university: 1 million

• Largest high school: 130,000 (Phillips Andover)

• Largest public high schools: 30,000 (U.S.)

• Average high school: 5,000 (U.S.)

Universal Library Goals

• Democratization of information

– Knowledge is power

• Education, distance learning

– “Library” for distance education

• Research, technology transfer

• Promotion of understanding

• Preservation of human culture

The Million Book Project

• A million books is a lot. CMU just reached 1 million.

• Idea: scan 1 million books in each of several countries. Make them available to everyone

• NSF provided $3 million to buy scanners for China and India

• China and India are each providing 500 full-time people for scanning

• Each country is scanning 1 million books over the next 3 years

• CMU is hosting, indexing, building infrastructure

Million Book Project Operation

Effect of the Million Book Project

• All books scanned (in many languages) will be available free to read to everyone over the Internet

• Many cultural artifacts and treasures are being scanned

• All works are fully keyword-indexed and searchable

• All participating countries will have complete copies (mirrors) of all content

• Knowledge will be available to all

Partners

• China

– Beijing University

– Chinese Academy of Science

– Fudan University

– Ministry of Education of China

– Nanjing University

– Shanghai Jiaotung University

– State Planning Commission of China

– Tsinghua University

– Zhejiang University

Partners

• India

– Arulmigu Kalasalingam College Of Engineering

– Goa University

– Indian Institute of Information Technology - Allahabad

– Indian Institute of Science

– International Institute of Information Technology - Hyderabad

– Shanmugha Arts, Science, Technology & Research Academy

– Tirumala Tirupati Devasthanams

– Maharashtra Industrial Development Corporation

– University of Pune

The Copyright Problem

• Compulsory License

– Owner CAN’T refuse; user MUST pay

– Limited in US (Music: 1.55¢/min, 8.0¢/song)

– Extensive compulsory licensing in Japan

• Flat-fee subscription (e.g. HBO)

• Free (subsidized by government)

• Public Lending Right (UK)

• “Buy” button

• Metered use (electric company)

• Micropayments

Roadblocks

• Biggest obstacle: librarians

• Belief that the project is too large

• No funding

– In the U.S., everyone assumes it is being done

– Outside the U.S., everyone assumes the U.S. is doing it

• Copyright

• Myriad of small independent digital libraries

Policy Challenges

• Convenience displaces quality (Gresham)

• What to digitize first?

• Suitable copyright law

• Economics (Who pays? Who gets?)

• Privacy

• Reliability of information

• Change in the nature of teaching, learning

LAYERED UL MODEL

[Diagram; only its labels survive in this transcript]

• UNIVERSAL LIBRARY: DIGITIZED ITEMS

• Layers: BASELINE UL SERVICES, VALUE-ADDED SERVICES

• Services: NAVIGATION TOOLS, RETRIEVER SERVICE, CUSTOM CATALOGS, HYPERTEXT GENERATORS, SEARCHERS, TRANSLATORS, NEWS AGENTS, ENCYCLOPEDIA

• Users: HUMAN USERS, DIRECT MACHINE USERS

The Universal Dictionary

• A glossary containing every word in every language, with a translation

• Use: indexing the Universal Library

• Now has 1 million words (26 languages)

• 2 million by February (50 languages)

• 3 million by May (80 languages)

Q&A

Multilingual Searching

• Find all documents containing “elephant”

• Find all documents about elephants

– Even if the word “elephant” does not occur in the document

• Translation, transliteration

– Book titles, works of art, proper names

– Idioms, colloquial phrases

Use of © Content

• Philosophy: must pay for use

– Authors, publishers must not lose

• Implied license

• Bulk licensing

• Compulsory licensing

The Universal Dictionary

• Lexicon of all words in all languages, with English translations, e.g.

• Obtained from

– Web dictionaries

– Scanning + OCR

– Publishers’ machine-readable form

• Uses:

– Indexing the Universal Library

– Machine translation

– Spelling correction

– Linguistic studies

Technological Challenges

• Input (scanning, digitizing, OCR)

• Data representation

– text, kset, notations, images, web pages

• Navigation and Search

• Multilingual Issues

• Output (voice, pictures, virtual reality)

• Synthetic documents

Navigation

• Keyword searching does not scale

– Imagine 10⁶ hits

• Browsing, finding, searching, flying

• Fractal view

– Keys are granularity and connectivity

• View whole collections or one glyph

– Hyperbolic trees, virtual reality, discovered similarities

Hyperbolic Tree Navigation

Multilingual Issues

• Character sets

• Representations

Íîäà ôèçè÷åñêè íàõîäèòñÿ â çäàíèè Èçâåñòèé

Нода физически находится в здании Известий

• Multilingual navigation

• Translation assistance
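The two renderings of the Russian sentence on this slide look like the same bytes decoded with two different character sets. The sketch below assumes the garbled line is Windows-1251 text that was read as Latin-1; that pairing is inferred from the characters and is not stated on the slide.

```python
# Re-decode the garbled line under the assumed encodings.
garbled = "Íîäà ôèçè÷åñêè íàõîäèòñÿ â çäàíèè Èçâåñòèé"

# Turn the wrongly decoded characters back into their original bytes (Latin-1),
# then interpret those bytes as Cyrillic (Windows-1251).
readable = garbled.encode("latin-1").decode("cp1251")

print(readable)
```

When the character set is unknown, as the language identification slide notes, a system has to try candidate decodings like this one and judge which result looks like real text.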

UNIVERSAL LIBRARY STATUS

• >10,000 digital volumes

• Public-domain issues of the New York Times

• Portal to hundreds of other collections

• Art, music, video, Internet radio

• Magazines, newspapers, journals

• Installing 1.25 terabytes

Visit www.ulib.org

Language Identification

• Given a string x, which language(s) is it from?

– What language is “peogwir” from?

• Given x, which language(s) does it seem to be from?

– “contrefaçon” “dazs” “chalupa” “mbwewe”

• Character set may be unknown

• Brief input (e.g. single word)

• Intermixed languages

– “Zeitgeist Fever”

• Neologisms, slang, abbreviations

Generative Approach

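• Assume that the lexicon of a language L is generated by a probabilistic finite-state machine $M_L$

[Diagram: a tree of states rooted at START OF WORD (<). Each state branches on the next character (a, b, …, z) or on the end-of-word marker (>), and each edge carries a probability. Edge labels on the slide: PROB THAT WORD STARTS WITH A, PROB THAT WORD STARTS WITH Z, PROB (a|<a), PROB (>|<a), PROB (a|<z), PROB (z|<z), PROB (z|<za). The product of the probabilities along a path gives the probability of the word, e.g. PROB (<aza>) and PROB (<zaz>).]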

Problems

• Where do all the required probabilities come from?

• How can they all be stored?

• If string x does not actually occur in a language, its probability will be zero. Won’t work for neologisms or misspellings.

• “Moving trigrams” work

Generative Approach

• Let $p_L(y \mid x)$ be the probability that string x is followed by string y in language L (i.e. the probability, given a prefix x, that the suffix is y)

• Then $p_L(x)$, the probability that $x = {<} x_1 x_2 x_3 \dots x_n {>}$ was generated by L, is

$p_L(x_1 \mid {<}) \, p_L(x_2 \mid {<} x_1) \, p_L(x_3 \mid {<} x_1 x_2) \, p_L(x_4 \mid {<} x_1 x_2 x_3) \cdots p_L(x_n \mid {<} x_1 x_2 x_3 \dots x_{n-1}) \, p_L({>} \mid {<} x_1 x_2 x_3 \dots x_{n-1} x_n)$

• This computation requires huge memory, so approximate: assume $p_L(x_n \mid {<} x_1 x_2 x_3 \dots x_{n-1}) \approx p_L(x_n \mid x_{n-2} x_{n-1})$

• So $p_L(x) \approx p_L(x_3 \mid {<} x_1 x_2) \, p_L(x_4 \mid x_2 x_3) \cdots p_L({>} \mid x_{n-1} x_n)$

• Try it
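The trigram approximation can be tried on a toy scale. The sketch below is not the project's implementation: the word lists are tiny made-up samples, the add-one smoothing (with an arbitrary `alphabet_size`) is an assumption added so that unseen strings do not get probability zero, and every character is conditioned on the two preceding characters of the padded word `<word>`.

```python
import math
from collections import Counter


def trigrams(word):
    """Yield (two-character context, next character) pairs for <word>."""
    padded = "<" + word + ">"
    for i in range(2, len(padded)):
        yield padded[i - 2:i], padded[i]


class TrigramModel:
    """Character trigram model with add-one smoothing (an assumption)."""

    def __init__(self, words, alphabet_size=64):
        self.context_counts = Counter()
        self.trigram_counts = Counter()
        self.alphabet_size = alphabet_size   # arbitrary smoothing constant
        for word in words:
            for context, nxt in trigrams(word):
                self.context_counts[context] += 1
                self.trigram_counts[context + nxt] += 1

    def log_prob(self, word):
        """Sum of log p(next | previous two characters) over the word."""
        total = 0.0
        for context, nxt in trigrams(word):
            num = self.trigram_counts[context + nxt] + 1
            den = self.context_counts[context] + self.alphabet_size
            total += math.log(num / den)
        return total


def identify(word, models):
    """Return the language whose model gives the word the highest score."""
    return max(models, key=lambda lang: models[lang].log_prob(word))


# Hypothetical training samples; a real system would use large lexicons.
models = {
    "english": TrigramModel(["the", "there", "weather", "thing", "night",
                             "light", "water", "other", "think", "right"]),
    "german": TrigramModel(["nicht", "schnell", "zeit", "geist", "schule",
                            "zwischen", "licht", "wasser", "machen",
                            "sprechen"]),
}

print(identify("zeitgeist", models))   # german
print(identify("weather", models))     # english
```

Because of the smoothing, a string that occurs in neither sample (such as “peogwir”) still receives a finite score from every model, which is the property the previous slide asks for.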

Searching Mathematics

$\int_0^\infty e^{-x^2} \sin x^2 \, dx$

Has this integral ever been evaluated?

Searching Mathematics

$\int_0^\infty e^{-x^2} \sin x^2 \, dx = \frac{\sqrt{\pi} \, \sqrt{2 - \sqrt{2}}}{2^{9/4}}$

MATHEMATICA C.F.:

Integrate[Times[Power[E,Times[-1,Power[V1,2]]],Sin[Power[V1,2]]],{V1,0,Infinity}]
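A search engine for mathematics needs the same integral to produce the same key no matter how it was written. The sketch below illustrates that idea with a toy canonicalizer; it is not Mathematica's algorithm. It assumes expressions arrive as nested tuples, renames integration variables to V1, V2, …, and sorts the arguments of commutative operators alphabetically, which is an ordering chosen here for simplicity.

```python
COMMUTATIVE = {"Plus", "Times"}   # heads whose argument order is irrelevant


def canonical(expr, rename=None):
    """Return a canonical FullForm-like string for a nested-tuple expression."""
    rename = rename or {}
    if isinstance(expr, str):                    # a symbol such as "E" or "x"
        return rename.get(expr, expr)
    if isinstance(expr, (int, float)):           # a numeric literal
        return str(expr)
    head, *args = expr
    if head == "Integrate":
        body, (var, lower, upper) = args
        inner = dict(rename)
        inner[var] = "V%d" % (len(rename) + 1)   # rename the bound variable
        return "Integrate[%s,{%s,%s,%s}]" % (
            canonical(body, inner), inner[var],
            canonical(lower, rename), canonical(upper, rename))
    parts = [canonical(arg, rename) for arg in args]
    if head in COMMUTATIVE:
        parts.sort()                             # assumed ordering rule
    return "%s[%s]" % (head, ",".join(parts))


# The slide's integral, written two different ways.
q1 = ("Integrate",
      ("Times",
       ("Power", "E", ("Times", -1, ("Power", "x", 2))),
       ("Sin", ("Power", "x", 2))),
      ("x", 0, "Infinity"))
q2 = ("Integrate",
      ("Times",
       ("Sin", ("Power", "t", 2)),
       ("Power", "E", ("Times", ("Power", "t", 2), -1))),
      ("t", 0, "Infinity"))

print(canonical(q1))
print(canonical(q1) == canonical(q2))   # True
```

Both queries reduce to the string shown on the slide, so an index keyed on canonical forms would match either spelling of the integral.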

Hierarchical Nature of Aboutness

• What does it mean to say that a book is “about” chemistry? Can a word be about chemistry?

• If one paragraph is about chemistry, is the book about chemistry?

• If the book is about chemistry, is every sentence in it about chemistry?

• Aboutness is central to cataloging and retrieval

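Aboutness Hierarchy

[Diagram; only its labels survive in this transcript]

• Levels shown: Universe, Collection, Book, Newspaper, Article, Chapter, Section, Paragraph, Sentence, Word, Glyph

• Non-text items shown: Photograph, Object, 3D Artifact

• Annotations: KEYWORD SEARCHING OCCURS HERE; SUBJECT SEARCHING OCCURS HERE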

Thesauri and Aboutness

• A set of numbered thesaurus entries defines a topic

• Thesaurus is topic-hierarchical

• 1011 Hindrance

– 1011.5 barrier, bar, gate, fence, wall, rampart, dam, moat …

• A word is “about” any topic to which it belongs

Dam:

– 241.1 lake

– 293.7 close (v.)

– 560.11 mother

– 757.2 horse

– 856.11 put a stop to (v.)

– 1011.5 barrier

Thesaurus + aboutness hierarchy can be used to disambiguate meanings without “understanding”

Note: topic numbers are language independent
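One way to read the disambiguation claim is as a vote among nearby words. The sketch below is a guess at how such a scheme might work, not the project's method. It uses the topic numbers from this slide; the assumption that each heading word (“lake”, “mother”, “horse”) belongs to its own topic, and that the other barrier words belong only to 1011.5, is a simplification made for the example.

```python
from collections import Counter

# Word -> set of thesaurus topic numbers (entries taken from the slide,
# memberships simplified for illustration).
THESAURUS = {
    "dam": {"241.1", "293.7", "560.11", "757.2", "856.11", "1011.5"},
    "barrier": {"1011.5"}, "bar": {"1011.5"}, "gate": {"1011.5"},
    "fence": {"1011.5"}, "wall": {"1011.5"}, "rampart": {"1011.5"},
    "moat": {"1011.5"},
    "lake": {"241.1"}, "mother": {"560.11"}, "horse": {"757.2"},
}


def disambiguate(word, context):
    """Return the topic(s) of `word` shared with the most context words."""
    candidates = THESAURUS.get(word, set())
    votes = Counter()
    for other in context:
        for topic in THESAURUS.get(other, set()) & candidates:
            votes[topic] += 1
    if not votes:
        return sorted(candidates)     # no evidence: every sense remains
    best = max(votes.values())
    return sorted(topic for topic, n in votes.items() if n == best)


print(disambiguate("dam", ["the", "wall", "and", "moat"]))   # ['1011.5']
print(disambiguate("dam", ["sire", "horse"]))                # ['757.2']
```

No grammar or meaning is modeled; the choice falls out of overlapping topic numbers, and because those numbers are language independent the same lookup would work for context words in any language the thesaurus covers.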

Set Theory of Aboutness

• Given a finite universe W of objects (e.g. all words)

• Define a topic $T \subseteq W$ to be a subset of W (a wordlist)

• Topic inclusion (defines the hierarchy):

– Topic T includes topic S iff $S \subseteq T$

• Definition of aboutness:

– A subset $P \subseteq W$ of the universe (e.g., a book) is about topic T iff $P \cap T \neq \emptyset$ (intersection is nonempty)

• Hierarchical nature of aboutness:

– If P is about S and T includes S, then P is also about T
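These definitions map directly onto Python sets. A minimal sketch; the word lists are invented for illustration.

```python
def includes(topic_t, topic_s):
    """Topic T includes topic S iff S is a subset of T."""
    return topic_s <= topic_t


def is_about(passage, topic):
    """A passage P is about topic T iff P and T share at least one word."""
    return bool(passage & topic)


# Hypothetical word lists.
chemistry = {"acid", "molecule", "reaction", "bond"}
science = chemistry | {"gravity", "cell", "orbit"}
passage = {"the", "reaction", "was", "slow"}

print(is_about(passage, chemistry))    # True: "reaction" is shared
print(includes(science, chemistry))    # True: chemistry is a subtopic
print(is_about(passage, science))      # True: follows from the two lines above
```

The last line is the hierarchical property: any word shared with S is also in T when S is a subset of T, so aboutness propagates upward without further checks.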

We Can Search a Few Things

• Text

• In the Roman alphabet

• “Hidden” databases effectively unsearchable

• No images or two-dimensional structures

– math

– music

– dance notation . . .

• No subject index of photographs or art

– Corbis is one of the “best”