Capturing Untapped Descriptive Data: Creating Value for Librarians and Users
Lynn Silipigni Connaway
OCLC Research
ASIST 2006 Conference
November 9, 2006
WorldCat: July 2006
Total holdings: 1,071,507,045
Manifestations (records): 67,282,165
Works: 53,472,668
Digital Items: 1,571,803
Institutions: 26,236
Physical Items*: ~1.6 billion
*Estimated
Origin of materials represented in WorldCat
US: 34%
UK: 9%
Canada: 3%
Rest of World: 40%
Unknown: 14%
Some aspects of “Global WorldCat” …
Content Languages: 476
43% of WC non-English
Top 5 non-English:
German: 4.5 million
French: 4.2 million
Spanish: 2.9 million
Dutch: 2.1 million
Chinese: 1.6 million
Non-English Metadata Language:
9.3 million (20 languages)
Top 5:
Dutch: 4.1 million
French: 1.4 million
German: 1.0 million
Japanese: 0.7 million
Finnish: 0.7 million
Materials w/non-US origins:
35.3 million (52%)
Top 5:
UK: 6.1 million
Germany: 4.0 million
France: 2.9 million
Netherlands: 2.2 million
Canada: 2.1 million
OCLC WorldCat™: Decision-making Resource
Collection management
• Cooperative collection development
• Comparative collection analysis
• Collection assessment
• Mass digitization
• Off-site storage
• Preservation
Services
• Virtual reference
• Recommender services
Systems
• Precision
OCLC WorldCat™: Data Mining Research Projects
• Audience Level
• Publisher Name Server
• WorldMap
Audience Level: Rationale and Objectives
Holdings represent selection decisions by librarians
• Implies there are about 1 billion individual selection decisions in the WorldCat holdings file
Selections are made to serve the interests of a library's target community
• Associate target community (audience level) with particular library profiles, e.g., ARL, non-ARL academic, public, K-12 school …?
Implies: we can infer materials' audience level from holdings patterns, which in turn can support:
• Collection management
• Readers' advisory services
• Reference services
• Information retrieval
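The inference described on this slide can be illustrated with a small sketch. It assumes a hypothetical score for each library profile and takes the holdings-weighted mean as the audience level of a title. The score values, the profile keys, and the two sample titles are all illustrative assumptions; the presentation does not state how the actual audience level is computed.

```python
# Hypothetical scores per library profile (0 = juvenile, 1 = specialist research).
# These values are illustrative assumptions, not figures from the presentation.
LIBRARY_TYPE_SCORE = {
    "k12_school": 0.1,
    "public": 0.3,
    "academic_non_arl": 0.6,
    "arl": 0.9,
}


def audience_level(holdings: dict[str, int]) -> float:
    """Return the holdings-weighted mean of the library-profile scores."""
    total = sum(holdings.values())
    if total == 0:
        raise ValueError("no holdings to infer from")
    weighted = sum(LIBRARY_TYPE_SCORE[kind] * count for kind, count in holdings.items())
    return weighted / total


# Made-up holdings counts for two imaginary titles.
picture_book = {"k12_school": 400, "public": 550, "academic_non_arl": 40, "arl": 10}
monograph = {"k12_school": 0, "public": 20, "academic_non_arl": 180, "arl": 300}

print(round(audience_level(picture_book), 3))
print(round(audience_level(monograph), 3))
```

A title held mostly by school and public libraries lands near the low end of the scale, while one held mostly by research libraries lands near the high end.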
Example: Mother Goose
Publisher Name Server: Research Objectives
Resolve for data mining and quality of WorldCat
• ISBN prefixes to publisher name
• Variant publisher names to a preferred form
Complement Collection Analysis Service
• Librarians
• Publishers
Capture and make available various attributes of individual publishers
• Location of publisher
• Language(s) of materials published
• Genre(s)/format(s) of materials published
• Dominant subject domain(s) of the publisher's output
• Parent company and subsidiaries
Publisher Name Server: Methodology
Programmatically cluster publishers using ISBN prefixes
• Data clustering (The Free Dictionary)
  • "The science of extracting useful information from large data sets or databases"
  • Classification of similar objects into different groups
  • Partitioning of a data set into subsets (clusters)
  • Data in each subset (ideally) share some common trait
Hand parse the entities and resolve ISBN prefixes
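The clustering step can be sketched roughly as follows, assuming the ISBNs are already hyphenated so that the first two segments give the group and publisher prefix (for example `0-13`). The function names and sample records are hypothetical, the ISBN strings are illustrative and their check digits are not validated, and picking the most frequent name is only a stand-in for the hand parsing described above.

```python
from collections import Counter, defaultdict


def isbn_prefix(isbn: str) -> str:
    """Return the 'group-publisher' prefix of a hyphenated ISBN-10."""
    parts = isbn.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return f"{parts[0]}-{parts[1]}"


def cluster_by_prefix(records):
    """Group publisher-name variants by ISBN prefix, counting each variant."""
    clusters = defaultdict(Counter)
    for isbn, publisher in records:
        clusters[isbn_prefix(isbn)][publisher.strip()] += 1
    return clusters


# Made-up (ISBN, publisher name as transcribed) pairs.
records = [
    ("0-13-000001-0", "Prentice-Hall"),
    ("0-13-000002-0", "Prentice Hall"),
    ("0-13-000003-0", "Prentice-Hall"),
    ("0-471-00001-0", "Wiley"),
    ("0-471-00002-0", "John Wiley & Sons"),
    ("0-471-00003-0", "John Wiley & Sons"),
]

clusters = cluster_by_prefix(records)
for prefix, names in clusters.items():
    # The most frequent variant is only a candidate for the preferred form.
    candidate, _ = names.most_common(1)[0]
    print(prefix, candidate, dict(names))
```

Publisher prefixes vary in length, which is why the sketch relies on the hyphens rather than slicing a fixed number of digits.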
Publisher Name Server: Database
To date >800 records
Relational database, preserving hierarchical relationships
Begins with high-occurrence entities to identify:
• "Top 10" lists (USA, UK, Canada, Australia, Germany, France, Japan, Italy)
• Top university presses
• Mergers and acquisitions
Top U.S. Publishing Entities in WorldCat (22,680,201 total U.S. records)
ISBN Prefix   WorldCat Records   Publishing Entity
0-13          50,298             Prentice-Hall, Inc.
0-07          44,545             McGraw Hill, Inc.
0-06          44,362             HarperCollins (Firm)
0-16          40,451             United States G.P.O.
0-471         37,710             John Wiley & Sons
0-312         33,318             St. Martin's Press
0-671         31,765             Simon & Schuster, Inc.
0-02          27,602             MacMillan Publishers
0-15          18,420             Harcourt Brace & Company
0-394         18,043             Random House (Firm)
0-590         17,290             Scholastic Inc.
0-385         16,768             Doubleday and Company, Inc.
0-395         16,699             Houghton Mifflin Company
0-19          15,724             Oxford University Press
0-03          15,417             Holt, Rinehart, and Winston
Publisher Name Server: Database
Database Fields:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
-----
• Languages
• Formats
• DDC Subjects
• LCC Subjects
Data Sources:
• U.S. Library of Congress, National Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers' Weekly Online
• Hoover's Handbook Online
• Standard and Poor's Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites
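One way to picture a single record with the fields listed above is as a small data class. This is only a sketch: the class name, attribute names, types, and the `matches` helper are assumptions made for illustration, and the sample variant form is invented. Only the field list and the `0-13` prefix for Prentice-Hall come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class PublisherRecord:
    """Hypothetical in-memory shape for one Publisher Name Server record."""
    preferred_name: str
    source_of_preferred_form: str = ""
    former_names: list[str] = field(default_factory=list)
    variant_forms: list[str] = field(default_factory=list)
    isbn_prefixes: list[str] = field(default_factory=list)
    hq_city: str = ""
    hq_country: str = ""
    other_cities: list[str] = field(default_factory=list)
    url: str = ""
    languages: list[str] = field(default_factory=list)
    formats: list[str] = field(default_factory=list)
    ddc_subjects: list[str] = field(default_factory=list)
    lcc_subjects: list[str] = field(default_factory=list)

    def matches(self, name: str) -> bool:
        """True if name equals the preferred, a former, or a variant form (case-insensitive)."""
        known = [self.preferred_name, *self.former_names, *self.variant_forms]
        return name.strip().lower() in {form.lower() for form in known}


# The variant form below is invented for illustration.
record = PublisherRecord(
    preferred_name="Prentice-Hall, Inc.",
    variant_forms=["Prentice Hall"],
    isbn_prefixes=["0-13"],
)
print(record.matches("prentice hall"))
```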
DATA MINING
Entity-Parsing in a World of Mergers and Acquisitions
Prentice-Hall, Inc.
Pearson Education, Inc.
Addison-Wesley Publishing Company
Allyn and Bacon
Dominie Press
Benjamin/Cummings Publishing Company
Scott, Foresman and Company
HarperCollins Educational Publishers
Longmans, Green, and Co.
Pearson PLC
Pearson Canada
Pearson Technology Group
Copp Clark
Adobe Press
Cisco Press
Penguin Books
Allen Lane
Ladybird Books
Riverhead Books
Puffin Books
Putnam Books
Berkley Publishing Group
Avery
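A relational store that preserves hierarchical relationships, as described for the Publisher Name Server database, can be sketched with a self-referencing table and a recursive query. The table name, column names, and `ancestry` helper are hypothetical, and the parent links below are assumptions inferred from the entity names on this slide, whose original diagram layout did not survive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE publisher (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        parent_id INTEGER REFERENCES publisher(id)
    )
    """
)

# (entity, parent) pairs; the parent links are assumptions, not confirmed by the slide.
links = [
    ("Pearson PLC", None),
    ("Pearson Education, Inc.", "Pearson PLC"),
    ("Penguin Books", "Pearson PLC"),
    ("Prentice-Hall, Inc.", "Pearson Education, Inc."),
    ("Puffin Books", "Penguin Books"),
]
for name, parent in links:
    parent_id = None
    if parent is not None:
        row = conn.execute("SELECT id FROM publisher WHERE name = ?", (parent,)).fetchone()
        parent_id = row[0]
    conn.execute("INSERT INTO publisher (name, parent_id) VALUES (?, ?)", (name, parent_id))


def ancestry(name: str) -> list[str]:
    """Return the entity followed by each successive parent, up to the root."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, parent_id, depth) AS (
            SELECT id, name, parent_id, 0 FROM publisher WHERE name = ?
            UNION ALL
            SELECT p.id, p.name, p.parent_id, c.depth + 1
            FROM publisher p JOIN chain c ON p.id = c.parent_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(ancestry("Prentice-Hall, Inc."))
```

Storing only a parent pointer per row keeps a merger or acquisition to a single update, while the recursive query recovers the full chain on demand.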
OCLC WorldMap™: Objectives
Geographically represent library data from UNESCO, ARL, and NCES
• Number of libraries
• Amount of library expenditures
• Number of volumes and titles
• Number of librarians
• Number of users
OCLC WorldMap™: Objectives
Research prototype
• Test geographical representation of WorldCat
  • Titles and holdings by country of publication
• Support data mining research area
  • Visually display mined data to ease review and analysis
• Internal use
  • Sales and marketing
• External use
  • Library collection assessment and comparison
  • Complement the AAU/ARL Global Resources Network project
  • Project of the Council on Library and Information Resources (CLIR)
OCLC WorldMap™: Technology
First implemented in SVG
• Open standard maintained by W3C
• Simple XML file
• Young technology
• Browser support limited
• Requires plug-in
Converted to Flash
• Browser compatibility
• Plug-in compatibility (if a plug-in was installed!)
For a detailed comparison of SVG and Flash, see: http://www.carto.net/papers/svg/comparison_flash_svg/
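Because SVG is plain XML, a simple graphic can be generated with standard library tools alone. The sketch below is not the WorldMap implementation: it draws horizontal bars rather than a map, to avoid inventing geographic coordinates, and the function name and layout constants are assumptions. The sample values are the top five non-US origins from the earlier slide, in millions of records.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def bar_chart_svg(counts: dict[str, float], bar_height: int = 20, max_width: int = 300) -> str:
    """Return an SVG document with one labelled bar per entry, scaled to the largest value."""
    ET.register_namespace("", SVG_NS)
    largest = max(counts.values())
    svg = ET.Element(
        f"{{{SVG_NS}}}svg",
        {"width": str(max_width + 150), "height": str(bar_height * len(counts))},
    )
    for row, (label, value) in enumerate(counts.items()):
        y = row * bar_height
        # Label on the left, bar starting at x = 150.
        text = ET.SubElement(svg, f"{{{SVG_NS}}}text", {"x": "0", "y": str(y + bar_height - 5)})
        text.text = label
        ET.SubElement(
            svg,
            f"{{{SVG_NS}}}rect",
            {
                "x": "150",
                "y": str(y),
                "width": str(round(value / largest * max_width)),
                "height": str(bar_height - 2),
            },
        )
    return ET.tostring(svg, encoding="unicode")


non_us_origins = {"UK": 6.1, "Germany": 4.0, "France": 2.9, "Netherlands": 2.2, "Canada": 2.1}
svg_text = bar_chart_svg(non_us_origins)
print(svg_text[:80])
```

Writing `svg_text` to a file with an `.svg` extension gives a document that current browsers can display without a plug-in.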
OCLC WorldMap™
Potential Future Projects
Audience Level
• Integrate into WorldCat.org and OPACs to limit searches and retrieved sources
Publisher Name Server
• Integrate into OCLC Collection Analysis Service for publisher business intelligence
WorldMap
• Subject information ("aboutness")
• Language of item
  • Content language
  • Metadata language
• Holdings by country of library
Presentation will be available at http://www.oclc.org/research/presentations/default.htm
Prototypes available at http://www.oclc.org/research/researchworks/default.htm
Project Web Site: http://www.oclc.org/research/projects/default.htm
Questions and Discussion
Contact Information: [email protected]