8/31/2000 Information Organization and Retrieval What is Information? The Nature, Growth and Characteristics of Information University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval


8/31/2000 Information Organization and Retrieval

What is Information? The Nature, Growth and Characteristics of Information

University of California, Berkeley

School of Information Management and Systems

SIMS 202: Information Organization and Retrieval


What is Information?

• There is no “correct” definition
• Can involve philosophy, psychology, signal processing, physics
• Cookie Monster’s definition:
  – “news or facts about something”
• Oxford English Dictionary
  – information: informing, telling; thing told, knowledge, items of knowledge, news
  – knowledge: knowing familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known


Assignment 1

• What is information, according to your background or area of expertise?


Types of Information

• Differentiation by form.

• Differentiation by content.

• Differentiation by quality.

• Differentiation by associated information.


Information Properties

• Information can be communicated electronically
  – Broadcasting
  – Networking
• Information can be easily duplicated and shared
  – Problems of Ownership
  – Problems of Control

Adapted from ‘Silicon Dreams’ by Robert W. Lucky


Intuitive Notion (Losee 97)

• Information must
  – Be something, although the exact nature (substance, energy, or abstract concept) is not clear
  – Be “new”: repetition of previously received messages is not informative
  – Be “true”: false or counterfactual information is “mis-information”
  – Be “about” something

• This human-centered approach emphasizes meaning and use of message


Information from the Human Perspective

• Levels in cognitive processing
  – perception
  – observation/attention
  – reasoning, assimilating, forming inferences
• Knowledge: justified true belief
• Belief: an idea held based on some support; an internally accepted statement, result of inductive processes combining observed facts with a reasoning process
• Does information require a human mind?
  – Communication and information transfer among ants
  – A tree falls in the forest … is there information there?
  – Existence of quarks


Meaning vs. Form

• Form of information as the information itself
• Meaning of a signal vs. the signal itself
  – What aspects of a document are information?
• Representation (Norman 93)
  – Why do we write things down?
    • Socrates thought writing would obliterate serious thought
    • Sounds and gestures fade away
  – Artifacts help us to reason
  – Anything not present in the representation can be ignored
  – Things left out of the representation are often what we don’t know how to represent


Information Hierarchy

Wisdom

Knowledge

Information

Data


Information Hierarchy

• Data
  – The raw material of information
• Information
  – Data organized and presented by someone
• Knowledge
  – Information read, heard or seen and understood
• Wisdom
  – Distilled and integrated knowledge and understanding


Information

Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

-- T.S. Eliot, “The Rock”

Where is the information we have lost in data?


Origins

• Very early history of content representation
  – Sumerian tokens and “envelopes”
  – Alexandria - pinakes
  – Indices


Origins

• Biblical Indexes and Concordances
  – Hugo de St. Caro – 1247 A.D.: 500 Monks -- KWOC
  – Book indexes (Nuremberg Chronicle)
• Library Catalogs
• Journal Indexes
• “Information Explosion” following WWII
  – Cranfield Studies of indexing languages and information retrieval
  – Development of bibliographic databases
    • Index Medicus -- production and Medlars searching


Information Theory

• Claude Shannon, 1940s, studying communication
• Ways to measure information
  – Communication: producing the same message at its destination as that seen at its source
  – Problem: a “noisy channel” can distort the message
• Between transmitter and receiver, the message must be encoded
• Semantic aspects are irrelevant

[Diagram: Message source → Transmitter → Channel → Receiver → Destination, with Noise entering the Channel]
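The slide does not give a formula for measuring information. As a minimal sketch, assuming the standard Shannon entropy H = Σ p·log2(1/p) over observed symbol frequencies, the measure can be computed as below. The function name `shannon_entropy` and the sample messages are illustrative, not from the lecture.

```python
import math
from collections import Counter


def shannon_entropy(message: str) -> float:
    """Average information per symbol, in bits, from observed symbol frequencies."""
    if not message:
        return 0.0
    counts = Counter(message)
    total = len(message)
    # H = sum over distinct symbols of p * log2(1 / p), with p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("aaaa"))           # one repeated symbol: 0.0 bits per symbol
print(shannon_entropy("abab"))           # two equally likely symbols: 1.0 bit per symbol
print(shannon_entropy("dog bites man"))  # same symbol counts as the next message,
print(shannon_entropy("man bites dog"))  # so the measure is the same up to rounding
```

The last two messages mean different things but contain the same symbols with the same frequencies, which is one way to see why semantic aspects play no role in this measure.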


Information Theory

• Better called “Communication Theory”
• Communication may be over time and space

[Diagram: Source → Encoding → Message → Channel → Message → Decoding → Destination, with Noise entering the Channel]

[Diagram: Source → Encoding (writing/indexing) → Message → Storage → Message → Decoding (Retrieval/Reading) → Destination]


What kinds of information are there?

• Text
  – books, periodicals, WWW, memos, ads
  – published/refereed

• Film

• Photos, other Images

• Broadcast TV, Radio

• Telephone Conversations

• Databases


How much information is there?
(Estimates courtesy Hal Varian and Peter Lyman: http://www.sims.berkeley.edu/emc)

Gigabyte   10^9 bytes    1000 megabytes
Terabyte   10^12 bytes   1000 gigabytes
Petabyte   10^15 bytes   1000 terabytes
Exabyte    10^18 bytes   1000 petabytes


How Much Information?

• Stored Information
  – Print
  – Film
  – Optical
  – Magnetic
• Communicated
  – Internet
  – Broadcast
  – Phone
  – Mail


Print

• Annual Production
  – Books 968,735 = 8 Terabytes (compressed image)
  – Newspapers 22,643 = 25 Terabytes
  – Journals 40,000 = 2 Terabytes
  – Magazines 80,000 = 10 Terabytes
  – Office Documents 12x10^9 pages = 312 Terabytes
  – TOTAL 357 Terabytes (1824 scanned, 35 text)
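The total on this slide can be checked by adding the per-category estimates. The sketch below copies the slide's terabyte figures into a dictionary (the variable names are illustrative) and also derives the bytes per office page implied by 12x10^9 pages in 312 Terabytes, assuming the decimal units used in this lecture.

```python
TB = 10**12  # bytes per terabyte (decimal units)

# Annual print production in terabytes, copied from the slide.
annual_print_tb = {
    "books": 8,
    "newspapers": 25,
    "journals": 2,
    "magazines": 10,
    "office documents": 312,
}

total_tb = sum(annual_print_tb.values())
office_share = annual_print_tb["office documents"] / total_tb

# 12x10^9 office pages stored in 312 TB implies this many bytes per page.
bytes_per_office_page = 312 * TB / (12 * 10**9)

print(total_tb)                # 357, matching the slide's TOTAL
print(f"{office_share:.0%}")   # 87%
print(bytes_per_office_page)   # 26000.0
```

On these figures, office documents account for most of the annual print volume.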


Print

• Library of Congress Printed book collection
  – About 18 Million books
  – About 130 Terabytes (compressed image)
  – For all of LC we should also assume
    • 13M photographs, 5MB each = 65 TB
    • 4M maps, say 200 TB
    • 500K files, 1GB each = 500 TB
    • 3.5M sound recordings, ~2000 TB
    • Grand total: 3 petabytes (~3000 terabytes)
• Books in Print
  – 3.2 Million titles
  – About 26 Terabytes
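The Library of Congress grand total can be reproduced from the slide's own figures. This sketch assumes decimal units and assumes that the grand total includes the 130 Terabytes of printed books; the dictionary keys are illustrative labels.

```python
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

# Component estimates in bytes, built from the counts and sizes on the slide.
lc_bytes = {
    "printed books": 130 * TB,
    "photographs": 13_000_000 * 5 * MB,
    "maps": 200 * TB,
    "files": 500_000 * 1 * GB,
    "sound recordings": 2000 * TB,
}

total = sum(lc_bytes.values())
print(lc_bytes["photographs"] / TB)  # 65.0
print(lc_bytes["files"] / TB)        # 500.0
print(total / TB)                    # 2895.0
print(total / PB)                    # 2.895, i.e. about 3 petabytes

# Implied average size per book, in megabytes.
mb_per_lc_book = 130 * TB / 18_000_000 / MB
mb_per_title_in_print = 26 * TB / 3_200_000 / MB
print(round(mb_per_lc_book, 1))      # 7.2
print(mb_per_title_in_print)         # 8.125
```

Both book estimates work out to roughly 7 to 8 megabytes per volume.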


Film and Image

• Film
  – Photographs = 410 Petabytes per year
  – Movies = 16 Terabytes (Commercial Production of about 4000 films)
  – X-Rays = 12 Petabytes


Optical Media

• CD-Music 90,000 items = 58 TB

• CD-ROM 3,000 items = 3 TB

• DVD-Video 5,000 items = 22 TB

• Total 83 TB


Magnetic Media

• Audio Tape 184,200,000 = 184.2 Petabytes

• Video Tape 355,000,000 = 1420

• Floppy disks = 0.07

• Removable disks = 1.69

• Hard Disks = 500


Totals Stored Per Year

Medium     Type of content     Terabytes/Year    Terabytes/Year
                                  Upper Bound       Lower Bound
Paper      Books                            8                 7
           Newspapers                      25                20
           Periodicals                     12                12
           Office documents               312               312
           SUBTOTAL                       357               351
Film       Photographs                410,000           100,000
           Cinema                          16                16
           X-Rays                      12,000            12,000
           SUBTOTAL                   422,000           112,016
Optical    Music CDs                       58                40
           Data CDs                         3                 3
           DVDs                            22                22
           SUBTOTAL                        83                65
Magnetic   Camcorder                  300,000           300,000
           Disk drives              2,555,000         1,000,200
           SUBTOTAL                 2,855,000         1,300,200
TOTAL                               3,277,440         1,412,632
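The grand totals in this table follow from the four medium subtotals. A small check, using the subtotals as given (all in terabytes per year) and assuming the disk-drive lower bound is the magnetic subtotal minus the camcorder figure:

```python
# Subtotals per medium in terabytes/year, copied from the table.
upper = {"paper": 357, "film": 422_000, "optical": 83, "magnetic": 2_855_000}
lower = {"paper": 351, "film": 112_016, "optical": 65, "magnetic": 1_300_200}

upper_total = sum(upper.values())
lower_total = sum(lower.values())
print(upper_total)  # 3277440
print(lower_total)  # 1412632

# Assumption: disk drives = magnetic subtotal minus camcorder tape (300,000 TB).
disk_drives_lower = lower["magnetic"] - 300_000
print(disk_drives_lower)  # 1000200

# Share of the upper-bound total that is stored on magnetic media.
magnetic_share = upper["magnetic"] / upper_total
print(f"{magnetic_share:.0%}")  # 87%
```

By either bound, magnetic media dominate the total and paper is a very small fraction of it.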


Current Size of Web

• There are an estimated 2.1 Billion pages on the Web
  – About 21 Terabytes
  – About 7500 further Terabytes in web-accessed DBs.
• 610 Billion email messages per year = 11285 TB
• Internet Traffic is doubling every 100 days - An estimated 62 Million Americans now use the internet (US Commerce Dept 1998)

• Radio took 38 years to get 50 M listeners, TV took 13 years, the Net took 4 years...
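Two things are implicit in the figures on this slide: the average size per page and per email message, and the yearly growth factor that corresponds to doubling every 100 days. The sketch below derives both, assuming decimal units and a 365-day year; the variable names are illustrative.

```python
KB, TB = 10**3, 10**12

web_pages = 2_100_000_000
web_bytes = 21 * TB
emails_per_year = 610 * 10**9
email_bytes = 11_285 * TB

print(web_bytes / web_pages / KB)          # 10.0 KB per page
print(email_bytes / emails_per_year / KB)  # 18.5 KB per message

# Doubling every 100 days compounds to 2^(365/100) over one year.
doubling_days = 100
annual_growth = 2 ** (365 / doubling_days)
print(round(annual_growth, 1))             # 12.6
```

A 100-day doubling time therefore means traffic grows by a factor of roughly 12.6 in a year.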


Internet - Recent Statistics

5 M Level 2 Domains (NW June 1999)

43.2 Million Hosts (NW January 1999)

206/246 IP countries (NW July 1998)

300 Million Users (Newsbytes, Mar 2000)

(830 Million Telephone Terminations)

Source: Vint Cerf


Internet Hosts (000s) 1989-2006

[Chart: hosts in thousands (axis 0 to 1,000,000) by year, 1989 to 2005]

Source: Vint Cerf


Projected Voice and Data Traffic

[Chart: voice and data traffic in Gb/s (axis 0 to 30,000) by year, 1996 to 2002]

Source: America's Network, May 15, 1998


Users on the Internet - May 1999

• CAN/US - 90.65M
• Europe - 40.09M
• Asia/Pac - 26.97M
• Latin Am - 5.29M
• Africa - 1.14M
• Mid-east - 0.88M
---------------------------
• Total - 165M

[Chart: users by region (CAN/US, Europe, Asia/Pac, Latin Am, Africa, Mid East)]

Source: Vint Cerf


Language Distribution of Web Content

[Chart: share of Web content by language. Legend: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]

Source: Jack Xu: Excite


Language Distribution on a 634 Million Web Pages Corpus

Language          Number of Docs  Percentage
English              453,685,690    71.5288%
Japanese              43,271,080     6.8222%
German                32,253,563     5.0851%
French                11,107,994     1.7513%
Chinese                9,642,450     1.5202%
Spanish                6,965,560     1.0982%
Italian                5,638,827     0.8890%
Swedish                4,392,709     0.6926%
Malay                  3,619,227     0.5706%
Korean                 3,200,762     0.5046%
Portuguese             3,014,294     0.4752%
Dutch                  2,745,610     0.4329%
Danish                 1,911,677     0.3014%
Czech                  1,428,385     0.2252%
Finnish                1,312,932     0.2070%
Russian                1,150,127     0.1813%
Polish                   952,716     0.1502%
Hungarian                760,162     0.1198%
Norwegian                607,211     0.0957%
Estonian                 456,613     0.0720%
Greek                    393,360     0.0620%
Bulgarian                392,777     0.0619%
Croatian                 310,237     0.0489%
Basque                   258,074     0.0407%
Thai                      99,691     0.0157%
Turkish                   81,218     0.0128%
Arabic                    38,167     0.0060%
Albanian                  17,779     0.0028%
Others & Unknown      44,561,062     7.0256%
Total                634,269,953        100%


Sources on Information, Computer, and Network Use

• http://www.sims.berkeley.edu/emc/
• http://www.cs.cmu.edu/afs/cs.cmu.edu/user/bam/www/numbers.html
  – Statistical snippets extracted from the news
• http://www.wcom.com/about_the_company/cerfs_up/
  – Vint Cerf’s pages
• http://www.firstmonday.dk/issues/issue3_10/coffman/index.html
  – The size and growth rate of the Internet by K.G. Coffman and Andrew Odlyzko


Human Memory

– Landauer 86: Human brain holds 200MB
  • looked at rate of information intake and rate of forgetting, and amount of information adults need for normal tasks
– 6B people on earth implies total memory of all people alive about 1,200 petabytes
– Another way:
  • estimate that people take in a byte/sec
  • lifetime 25,000 days or 2B sec
  • result is 2 GB (doesn’t count synthesizing new info)
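Both memory estimates are short multiplications. The sketch below reproduces them from the slide's inputs, assuming decimal units; the conversion of 2 billion seconds into days and years is added only as a sanity check, and the variable names are illustrative.

```python
MB, GB, PB = 10**6, 10**9, 10**15

# Estimate 1: Landauer's 200 MB per person times 6 billion people.
per_person = 200 * MB
population = 6 * 10**9
print(per_person * population / PB)  # 1200.0 petabytes

# Estimate 2: one byte per second over a lifetime of 2 billion seconds.
intake_bytes_per_sec = 1
lifetime_sec = 2 * 10**9
lifetime_intake = intake_bytes_per_sec * lifetime_sec
print(lifetime_intake / GB)          # 2.0 gigabytes

# Sanity check on the lifetime: 2 billion seconds expressed in days and years.
lifetime_days = lifetime_sec / 86_400
print(round(lifetime_days))             # 23148
print(round(lifetime_days / 365.25, 1)) # 63.4

# The second estimate is ten times the first.
print(lifetime_intake / per_person)  # 10.0
```

The two approaches differ by a factor of ten, which is consistent with the first one accounting for forgetting and the second counting raw intake only.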


Information Overload

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)