Capturing Untapped Descriptive Data: Creating Value for Librarians and Users Lynn Silipigni Connaway OCLC Research ASIST 2006 Conference November 9, 2006


Page 1: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Lynn Silipigni Connaway
OCLC Research
ASIST 2006 Conference
November 9, 2006

Page 2: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

WorldCat: July 2006

Total holdings: 1,071,507,045
Manifestations (records): 67,282,165
Works: 53,472,668
Digital Items: 1,571,803
Institutions: 26,236
Physical Items*: ~1.6 billion

*Estimated

Page 3: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Origin of materials represented in WorldCat

US: 34%
UK: 9%
Canada: 3%
Rest of World: 40%
Unknown: 14%

Page 4: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Some aspects of “Global WorldCat” …

Content Languages: 476
• 43% of WorldCat is non-English
• Top 5 non-English:
  • German: 4.5 million
  • French: 4.2 million
  • Spanish: 2.9 million
  • Dutch: 2.1 million
  • Chinese: 1.6 million

Non-English Metadata Language: 9.3 million (20 languages)
• Top 5:
  • Dutch: 4.1 million
  • French: 1.4 million
  • German: 1.0 million
  • Japanese: 0.7 million
  • Finnish: 0.7 million

Materials w/non-US origins: 35.3 million (52%)
• Top 5:
  • UK: 6.1 million
  • Germany: 4.0 million
  • France: 2.9 million
  • Netherlands: 2.2 million
  • Canada: 2.1 million
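
The percentages on these slides can be cross-checked against the record counts on the WorldCat: July 2006 slide. The sketch below is only an arithmetic check; it assumes the 52% figure is a share of manifestations (records), which the slides do not state explicitly, and the helper name `share` is made up for illustration.

```python
# Figures copied from the WorldCat slides (July 2006).
TOTAL_RECORDS = 67_282_165       # manifestations (records)
TOTAL_HOLDINGS = 1_071_507_045   # total holdings
NON_US_RECORDS = 35_300_000      # "35.3 million", rounded on the slide


def share(part: float, whole: float) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Assumption: the non-US share is measured against total records.
non_us_pct = share(NON_US_RECORDS, TOTAL_RECORDS)
holdings_per_record = TOTAL_HOLDINGS / TOTAL_RECORDS

print(f"Non-US origin share: {non_us_pct:.1f}%")
print(f"Average holdings per record: {holdings_per_record:.1f}")
```

Under that assumption the non-US share comes out at about 52.5%, in line with the 52% on the slide, and each record carries roughly 16 holdings on average.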

Page 5: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

OCLC WorldCat™: Decision-making Resource

Collection management
• Cooperative collection development
• Comparative collection analysis
• Collection assessment
• Mass digitization
• Off-site storage
• Preservation

Services
• Virtual reference
• Recommender services

Systems
• Precision

Page 6: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

OCLC WorldCat™: Data Mining Research Projects

• Audience Level
• Publisher Name Server
• WorldMap

Page 7: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Audience Level: Rationale and Objectives

Holdings represent selection decisions by librarians
• Implies there are about 1 billion individual selection decisions in the WorldCat holdings file

Selections are made to serve the interests of a library’s target community
• Associate target community (audience level) to particular library profiles, e.g., ARL, non-ARL academic, public, K-12 school …?

Implies: we can infer materials’ audience level from holdings patterns, which in turn can support:
• Collection management
• Readers’ advisory services
• Reference services
• Information retrieval
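
The idea of inferring audience level from holdings patterns could be sketched as a holdings-weighted average over library types. This is a minimal illustration, not the method used in the OCLC project: the numeric weights per library type, the 0 to 1 scale, the function name `audience_level`, and the sample holdings counts are all invented assumptions.

```python
from typing import Dict

# Hypothetical audience weights per library type on a 0-1 scale
# (0 = juvenile, 1 = research). These values are illustrative assumptions.
TYPE_WEIGHTS: Dict[str, float] = {
    "k12_school": 0.1,
    "public": 0.3,
    "non_arl_academic": 0.6,
    "arl": 0.9,
}


def audience_level(holdings_by_type: Dict[str, int]) -> float:
    """Return the holdings-weighted mean of the library-type weights."""
    total = sum(holdings_by_type.get(t, 0) for t in TYPE_WEIGHTS)
    if total == 0:
        raise ValueError("no holdings from known library types")
    weighted = sum(
        TYPE_WEIGHTS[t] * holdings_by_type.get(t, 0) for t in TYPE_WEIGHTS
    )
    return weighted / total


# Invented holdings counts for two hypothetical titles.
picture_book = {"k12_school": 800, "public": 1500, "arl": 20}
monograph = {"arl": 90, "non_arl_academic": 40, "public": 5}

print(f"picture book: {audience_level(picture_book):.2f}")
print(f"monograph:    {audience_level(monograph):.2f}")
```

A title held mostly by school and public libraries scores low, while one held mostly by ARL libraries scores high, which is the pattern the slide relies on.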

Page 14: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Example: Mother Goose

Page 15: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Publisher Name Server: Research Objectives

Resolve for data mining and quality of WorldCat
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form

Complement Collection Analysis Service
• Librarians
• Publishers

Capture and make available various attributes of individual publishers
• Location of publisher
• Language(s) of materials published
• Genre(s)/format(s) of materials published
• Dominant subject domain(s) of the publisher's output
• Parent company and subsidiaries

Page 16: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Publisher Name Server: Methodology

Programmatically cluster publishers using ISBN prefixes
• Data clustering (The Free Dictionary)
  • "The science of extracting useful information from large data sets or databases"
  • Classification of similar objects into different groups
  • Partitioning of a data set into subsets (clusters)
  • Data in each subset (ideally) share some common trait

Hand parse the entities and resolve ISBN prefixes
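
The programmatic step above can be sketched as grouping publisher-name strings by ISBN prefix and proposing the most frequent variant for human review. This is an assumption-laden illustration: the function names and the sample variant strings are invented, and the slide indicates that final entity resolution was done by hand.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_prefix(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group publisher-name strings by their ISBN prefix."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for prefix, name in records:
        clusters[prefix].append(name.strip())
    return dict(clusters)


def candidate_preferred_form(names: List[str]) -> str:
    """Suggest the most frequent variant as a candidate for hand review."""
    return Counter(names).most_common(1)[0][0]


# Invented sample of (ISBN prefix, publisher string as transcribed) pairs.
SAMPLE = [
    ("0-13", "Prentice-Hall, Inc."),
    ("0-13", "Prentice Hall"),
    ("0-13", "Prentice-Hall, Inc."),
    ("0-471", "John Wiley & Sons"),
    ("0-471", "Wiley"),
    ("0-471", "John Wiley & Sons"),
]

clusters = cluster_by_prefix(SAMPLE)
for prefix, names in clusters.items():
    print(prefix, "->", candidate_preferred_form(names), f"({len(set(names))} variants)")
```

Each cluster shares one trait, the ISBN prefix, so a reviewer sees all variant forms for a publisher side by side before choosing the preferred form.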

Page 17: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Publisher Name Server: Database

More than 800 records to date

Relational database, preserving hierarchical relationships

Begins with high-occurrence entities to identify:
• “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy)
• Top university presses
• Mergers and acquisitions

Page 18: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Top U.S. Publishing Entities in WorldCat (22,680,201 total U.S. records)

ISBN Prefix   WorldCat Records   Publishing Entity
0-13          50,298             Prentice-Hall, Inc.
0-07          44,545             McGraw Hill, Inc.
0-06          44,362             HarperCollins (Firm)
0-16          40,451             United States G.P.O.
0-471         37,710             John Wiley & Sons
0-312         33,318             St. Martin's Press
0-671         31,765             Simon & Schuster, Inc.
0-02          27,602             MacMillan Publishers
0-15          18,420             Harcourt Brace & Company
0-394         18,043             Random House (Firm)
0-590         17,290             Scholastic Inc.
0-385         16,768             Doubleday and Company, Inc.
0-395         16,699             Houghton Mifflin Company
0-19          15,724             Oxford University Press
0-03          15,417             Holt, Rinehart, and Winston
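
A table like the one above supports the first research objective, resolving an ISBN prefix to a publisher name. The sketch below is one possible lookup, assuming the ISBN is already hyphenated so that the first two segments form the prefix; the function name `publisher_for_isbn` is hypothetical and only a subset of the table is loaded.

```python
from typing import Optional

# ISBN prefix -> publishing entity, copied from the table above (subset).
PREFIX_TO_PUBLISHER = {
    "0-13": "Prentice-Hall, Inc.",
    "0-07": "McGraw Hill, Inc.",
    "0-471": "John Wiley & Sons",
    "0-312": "St. Martin's Press",
    "0-19": "Oxford University Press",
}


def publisher_for_isbn(isbn: str) -> Optional[str]:
    """Resolve a hyphenated ISBN-10 to a publisher via its first two segments."""
    parts = isbn.strip().split("-")
    if len(parts) < 3:
        # Unhyphenated ISBNs would need registrant range tables to split.
        return None
    prefix = "-".join(parts[:2])
    return PREFIX_TO_PUBLISHER.get(prefix)


# 0-13-110362-8 is a Prentice Hall title; the second ISBN is a made-up
# string with an unknown prefix (its check digit is not validated here).
print(publisher_for_isbn("0-13-110362-8"))
print(publisher_for_isbn("9-99-999999-9"))
```

Unknown prefixes return `None` rather than a guess, so unresolved records can be routed to hand parsing.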

Page 19: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Publisher Name Server: Database

Database Fields:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
-----
• Languages
• Formats
• DDC Subjects
• LCC Subjects

Data Sources:
• U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers’ Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites

DATA MINING

Page 20: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Entity-Parsing in a World of Mergers and Acquisitions

Prentice-Hall, Inc.
Pearson Education, Inc.
Addison-Wesley Publishing Company
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
Pearson PLC
Pearson Canada
Pearson Technology Group
Copp Clark
Adobe Press
Cisco Press
Penguin Books
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Avery
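
A relational table can preserve parent and subsidiary relationships like these with a self-referencing foreign key. The sketch below uses SQLite purely for illustration; the table and column names are hypothetical, and the parent links shown are assumptions, since the slide's original diagram did not survive as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE publisher (
           id INTEGER PRIMARY KEY,
           preferred_name TEXT NOT NULL,
           parent_id INTEGER REFERENCES publisher(id)
       )"""
)

# Assumed parent links for a few entities named on the slide.
rows = [
    (1, "Pearson PLC", None),
    (2, "Pearson Education, Inc.", 1),
    (3, "Prentice-Hall, Inc.", 2),
    (4, "Penguin Books", 1),
    (5, "Puffin Books", 4),
]
conn.executemany("INSERT INTO publisher VALUES (?, ?, ?)", rows)


def ancestry(conn: sqlite3.Connection, name: str) -> list:
    """Return the chain of names from `name` up to its top-level parent."""
    query = """
        WITH RECURSIVE chain(id, preferred_name, parent_id, depth) AS (
            SELECT id, preferred_name, parent_id, 0
            FROM publisher WHERE preferred_name = ?
            UNION ALL
            SELECT p.id, p.preferred_name, p.parent_id, c.depth + 1
            FROM publisher p JOIN chain c ON p.id = c.parent_id
        )
        SELECT preferred_name FROM chain ORDER BY depth
    """
    return [row[0] for row in conn.execute(query, (name,))]


print(" -> ".join(ancestry(conn, "Prentice-Hall, Inc.")))
```

Storing only the immediate parent keeps a merger or acquisition to a single-row update, while the recursive query recovers the full chain on demand.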

Page 21: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

OCLC WorldMap™: Objectives

Geographically represent library data from UNESCO, ARL, and NCES
• Number of libraries
• Amount of library expenditures
• Number of volumes and titles
• Number of librarians
• Number of users

Page 22: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

OCLC WorldMap™: Objectives

Research prototype
• Test geographical representation of WorldCat
  • Titles and holdings by country of publication
• Support data mining research area
  • Visually display mined data to ease review and analysis
• Internal use
  • Sales and marketing
• External use
  • Library collection assessment and comparison
  • Complement the AAU/ARL Global Resources Network project
  • Project of the Council on Library and Information Resources (CLIR)

Page 23: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

OCLC WorldMap™: Technology

First implemented SVG
• Open standard maintained by W3C
• Simple XML file
• Young technology
• Browser support limited
• Requires plug-in

Converted to Flash
• Browser compatibility
• Plug-in compatibility (if a plug-in was installed!)

For a detailed comparison of SVG and Flash, see: http://www.carto.net/papers/svg/comparison_flash_svg/
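
To illustrate the "simple XML file" point, the sketch below builds a small SVG document with Python's standard XML library. It is not the WorldMap implementation: the record counts come from the non-US origins slide, while the x/y positions, the radius scaling, and the function name `build_svg` are placeholder assumptions rather than a real map projection.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# (label, x, y, records in millions); x/y are hypothetical placeholders.
POINTS = [
    ("UK", 180, 60, 6.1),
    ("Germany", 210, 65, 4.0),
    ("France", 195, 80, 2.9),
]


def build_svg(points) -> str:
    """Build an SVG document with one circle per (label, x, y, value) tuple."""
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"width": "400", "height": "200"})
    for label, x, y, value in points:
        circle = ET.SubElement(
            svg,
            f"{{{SVG_NS}}}circle",
            {
                "cx": str(x),
                "cy": str(y),
                "r": str(round(value * 2, 1)),  # arbitrary scaling for display
                "fill": "steelblue",
            },
        )
        # A <title> child is shown as a tooltip by most SVG viewers.
        title = ET.SubElement(circle, f"{{{SVG_NS}}}title")
        title.text = f"{label}: {value} million records"
    return ET.tostring(svg, encoding="unicode")


svg_text = build_svg(POINTS)
print(svg_text)
```

Because the output is plain XML text, it can be generated from mined data with ordinary tooling and inspected or edited by hand.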

Page 24: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

OCLC WorldMap™

Page 38: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Potential Future Projects

Audience Level
• Integrate into WorldCat.org and OPACs to limit searches and retrieved sources

Publisher Name Server
• Integrate into OCLC Collection Analysis Service for publisher business intelligence

WorldMap
• Subject information (“aboutness”)
• Language of item
  • Content language
  • Metadata language
• Holdings by country of library

Page 39: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Presentation will be available at http://www.oclc.org/research/presentations/default.htm

Prototypes available at http://www.oclc.org/research/researchworks/default.htm

Project Web Site: http://www.oclc.org/research/projects/default.htm

Page 40: Capturing Untapped Descriptive Data: Creating Value for Librarians and Users

Questions and Discussion

Contact Information: [email protected]