Generating Metadata by Machine BEA 2015 Friday, May 29, 11:30-12:20 Room 1E10

BEA 2015 Generating Metadata by Machine Final


Page 1: BEA 2015 Generating Metadata by Machine Final

Generating Metadata by Machine

BEA 2015
Friday, May 29, 11:30-12:20
Room 1E10

Page 2: BEA 2015 Generating Metadata by Machine Final

Presenters

Moderator
• Pat Payton, Senior Manager Publisher Relations, Bowker

Speakers
• Randi Park, Publishing Officer, The World Bank
• Hassan Zaidi, Digital Publishing Officer, International Monetary Fund
• Jim Bryant, CEO, Trajectory Inc.

Page 3: BEA 2015 Generating Metadata by Machine Final


Terminology

• Automated or Machine Indexing
– Process of assigning index terms against a set vocabulary or taxonomy without human intervention
– Full text or bibliographic records
– Multiple vocabularies/rule sets allow for complex text analysis

• Optical Character Recognition (OCR)
– Machine conversion of an image to text
– PDF of book content

• Extensible Markup Language (XML)
– Set of rules for encoding documents
– Both machine readable and human readable
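
As a rough illustration of automated indexing against a set vocabulary, the sketch below matches text against a tiny taxonomy. The vocabulary, the function name `assign_index_terms`, and the sample sentence are all hypothetical and are not taken from any presenter's system. Production indexers typically add rule sets, weighting, and disambiguation on top of this kind of matching.

```python
import re

# Hypothetical controlled vocabulary: preferred term -> variants that trigger it.
VOCABULARY = {
    "Poverty Reduction": ["poverty reduction", "reduce poverty", "extreme poverty"],
    "Economic Growth": ["economic growth", "gdp growth"],
    "Anticorruption": ["anticorruption", "anti-corruption"],
}


def assign_index_terms(text, vocabulary=VOCABULARY):
    """Return the sorted preferred terms whose variants occur in the text."""
    lowered = text.lower()
    matched = set()
    for preferred, variants in vocabulary.items():
        for variant in variants:
            # Word boundaries avoid matching inside longer words.
            if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
                matched.add(preferred)
                break
    return sorted(matched)


sample = "The program aims to reduce poverty and support anti-corruption reforms."
print(assign_index_terms(sample))
```

Mapping several variants to one preferred term is what keeps the assigned index terms consistent even when the source text is not.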

Page 4: BEA 2015 Generating Metadata by Machine Final

Experience with semantic metadata creation

Randi Park
[email protected]

WORLD BANK PUBLICATIONS

Page 5: BEA 2015 Generating Metadata by Machine Final

ABOUT THE WORLD BANK


• The World Bank Group is the world’s largest source of funding and technical assistance for developing countries.

• Through its five institutions, the Bank Group partners with developing countries to reduce poverty, increase economic growth, and improve the quality of life.

• Comprises 188 member countries, with offices in 120 countries around the world.

Our Twin Goals
End Extreme Poverty within a Generation & Boost Shared Prosperity

Page 6: BEA 2015 Generating Metadata by Machine Final

Like other publishers in some respects but . . .

• Publishing arm of a larger institution, with institutional imperatives

• Open access
– Dissemination trumps revenue

• Research is performed by in-house economists and experts in other fields, by development practitioners working on the ground, and by external contributors.

• Our publishing outputs are meant to enrich the development debate, inform policies, and support the development goals of our client countries.

We are a “Knowledge Bank”
The World Bank is the largest source of development knowledge

Page 7: BEA 2015 Generating Metadata by Machine Final
Page 8: BEA 2015 Generating Metadata by Machine Final

Popular Annuals and Flagships


Page 9: BEA 2015 Generating Metadata by Machine Final

Two platforms: The World Bank eLibrary and the Open Knowledge Repository (OKR)

Page 10: BEA 2015 Generating Metadata by Machine Final

Mobile applications

Page 11: BEA 2015 Generating Metadata by Machine Final

Topics we cover = 29

• Plus 5 Regions, Countries and Keywords

Page 12: BEA 2015 Generating Metadata by Machine Final

Metadata strategy

Primary Purpose
• Supports user-centered discovery in WB electronic products
• Semantic fields often exposed and browseable
• Complemented by full text search and filtering
• Book, chapter and article level abstracts, topics, regions, countries, keywords
• Books do not inherit chapter semantics

Secondary Re-purpose
• Search and discovery services
• Aggregators
• Retail sales channels, both print and electronic

Page 13: BEA 2015 Generating Metadata by Machine Final

Our experience with machine generated metadata

Set up
• Customized our enterprise system as much as was practical

Pros
• Reasonable solution when there is a huge corpus
• Fast throughput
• Inexpensive to run after labor-intensive set up
• PDF source for extraction of topics, subtopics, countries, regions, keywords
• XML output easily transformed

Cons
• Set up effort/cost
• Inconsistent use of keyword terms, depending on how they were used in the text
– anti-corruption/anticorruption
– decision-making/decision making
– policy-making/policy making
• Abstracts must be written by humans
• False hits due to footnotes, references, names, etc.
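
One way to tame the keyword inconsistency listed under Cons is to group spelling variants under a shared key and pick one form as canonical. The sketch below is a hypothetical post-processing step, not the World Bank's workflow; the function names and the "most frequent spelling wins" rule are assumptions made for illustration.

```python
import re
from collections import Counter


def variant_key(keyword):
    """Collapse case, hyphens, and spaces so spelling variants share one key."""
    return re.sub(r"[\s\-]+", "", keyword.lower())


def group_variants(keywords):
    """Group keywords by variant key and pick the most frequent spelling as canonical."""
    groups = {}
    for kw in keywords:
        groups.setdefault(variant_key(kw), Counter())[kw] += 1
    # most_common(1) returns the most frequent spelling; ties go to the first one seen.
    return {key: counts.most_common(1)[0][0] for key, counts in groups.items()}


extracted = [
    "anti-corruption", "anticorruption", "anticorruption",
    "decision-making", "decision making", "policy making",
]
canonical = group_variants(extracted)
print(canonical)
```

Collapsing spaces and hyphens can occasionally merge unrelated terms, so a human check of the resulting groups is still advisable before the canonical forms are applied.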

Page 14: BEA 2015 Generating Metadata by Machine Final
Page 15: BEA 2015 Generating Metadata by Machine Final

Present workflow – human generated

Pros
• Book and chapter level including abstracts
• Able to manage keyword vocabulary using pick-lists with additions as needed
• More accurate: author provides book level draft, EP team does sense check
• New rules and terms can be added any time with little set-up

Cons
• Cost per book/chapter
• Capacity
• Inconsistencies between legacy (edited machine-generated) and newer content to be addressed
• Single version of keywords may not be ideal for all channels (i.e., more keywords for discovery services)

Page 16: BEA 2015 Generating Metadata by Machine Final

Future

• Interested in using technology to improve discovery for direct users and in discovery services

• Full text XML and ePub available for indexing

• Institutional need to implement new taxonomy and full text search for over 200k documents

Page 17: BEA 2015 Generating Metadata by Machine Final

Randi Park
[email protected]

WORLD BANK PUBLICATIONS

Page 18: BEA 2015 Generating Metadata by Machine Final

Introduction: IMF Publications

Objectives: Establish digital publishing program 2010-2011

• New IMF eLibrary
• Digital distribution
• Digital production
• New metadata management system
• Create metadata to a granular level (chapters and articles)

***

Page 19: BEA 2015 Generating Metadata by Machine Final

Digitization and Metadata Challenges: 2010-2011

Page 20: BEA 2015 Generating Metadata by Machine Final

Digitization and Metadata Challenges: 2010-2011

Page 21: BEA 2015 Generating Metadata by Machine Final

New Challenges – New Solutions

Manual vs. Machine

• Metadata quality
• Time factor
• Cost of labor comparison

Challenge: Cataloging to a granular level (keywords, countries, topics and sub-topics)

Page 22: BEA 2015 Generating Metadata by Machine Final

New challenges – New solutions

Do the Math

IMF example:
• 12,000 titles containing 60,000 chapters/articles (assumes an average of 5 per title)
• 15 minutes to catalog each chapter/article with keywords etc.
• 60,000 × 15 minutes = 15,000 hours
• 15,000 hours / 40 hours per week = 375 weeks
• 375 weeks / 52 = about 7 years of work for one cataloger

If you pay just $30 per hour to a cataloger, the overall cost would be $450,000. Not to mention new content is being created daily.

Automation allows us to slash the time it takes to catalog our content, saving us time and money.
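
The manual cataloging estimate above can be reproduced with a few lines of arithmetic. The function name and parameter names below are illustrative; the input figures (60,000 items, 15 minutes each, $30 per hour, 40-hour weeks) come from the slide.

```python
def cataloging_estimate(items, minutes_per_item, hourly_rate,
                        hours_per_week=40, weeks_per_year=52):
    """Estimate manual cataloging effort and labor cost for a backlist."""
    hours = items * minutes_per_item / 60
    weeks = hours / hours_per_week
    years = weeks / weeks_per_year
    cost = hours * hourly_rate
    return {"hours": hours, "weeks": weeks, "years": years, "cost": cost}


estimate = cataloging_estimate(items=60_000, minutes_per_item=15, hourly_rate=30)
print(estimate)
```

Changing `minutes_per_item` or `hourly_rate` shows how sensitive the total is to those two assumptions.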

Page 23: BEA 2015 Generating Metadata by Machine Final

Machine in Action

Page 24: BEA 2015 Generating Metadata by Machine Final

Machine in Action

Page 25: BEA 2015 Generating Metadata by Machine Final

Machine in Action

Page 26: BEA 2015 Generating Metadata by Machine Final

Results on eLibrary

Super keywords or specific subjects

Page 27: BEA 2015 Generating Metadata by Machine Final

Browsing the IMF eLibrary
IMF eLibrary (http://elibrary.imf.org)

Page 28: BEA 2015 Generating Metadata by Machine Final

Browsing by Countries

Page 29: BEA 2015 Generating Metadata by Machine Final

Browsing the IMF eLibrary
Browse by Topics

Page 30: BEA 2015 Generating Metadata by Machine Final

Simple Search - Type a word or phrase into the search bar at the top of every page…

Searching on eLibrary

…or Advanced Search allows multiple concepts and filters

Page 31: BEA 2015 Generating Metadata by Machine Final

Search Results Page

Search within results to search within publications using a single word or phrase.

Select Content Type (Books and Journals/Chapters and Articles), Countries/Region, Topics, Languages, or Date.

Type a word in the Starts with box to go to the first title that begins with the word.

Sort by Title, Date, Source or Author.

Change the number of Items per page.

Keywords

Page 32: BEA 2015 Generating Metadata by Machine Final

Publication Landing Page

Read on screen in HTML

Read on a variety of devices

Citation tools

Click on a title from the results page to go to the publication landing page.

Page 33: BEA 2015 Generating Metadata by Machine Final

Publication Landing Page

Related documents

Page 34: BEA 2015 Generating Metadata by Machine Final

Publication Landing Page

Related documents

Page 35: BEA 2015 Generating Metadata by Machine Final

Digitization and Reuse of Contents

Page 36: BEA 2015 Generating Metadata by Machine Final

Digitization and Reuse of Contents

Page 37: BEA 2015 Generating Metadata by Machine Final

Outcomes

• New IMF eLibrary was delivered in March 2011

• Digital distribution: Distribute IMF contents to 35 channels in various digital formats

• Digital production: Have an established workflow to generate XML based contents, ePubs, Mobi and PDF ebooks

• New metadata management system: MetaLogic is a fully functioning metadata management system

• Create metadata to a granular level (all chapters and articles have individual ) ***

Thank you

Page 38: BEA 2015 Generating Metadata by Machine Final

THIS INFORMATION IS PROVIDED IN CONFIDENCE AND MAY NOT BE DISCLOSED TO ANY THIRD PARTY OR USED FOR ANY OTHER PURPOSE WITHOUT THE EXPRESS WRITTEN PERMISSION OF TRAJECTORY, INC.

Generating Metadata By Machine
BEA May 29, 2015 11:30 – 12:20

Page 39: BEA 2015 Generating Metadata by Machine Final


Natural Language Processing: Processing & Analysis


Natural language analysis tools process English language text input, transforming each sentence into data that can be used for search and analysis.

• Identify the base forms of words.
• Identify parts of speech.
• Identify names of companies, people, places, etc.
• Describe the structure of sentences in terms of phrases and word dependencies.
• Indicate which noun phrases refer to the same entities.

Page 40: BEA 2015 Generating Metadata by Machine Final


Attributes/Entities that Characterize A Book

Page 41: BEA 2015 Generating Metadata by Machine Final

Sentiment: Analyzing the Words Within the Book

“Outstanding” words (5): breathtaking, thrilled, superb
“Wow” words (4): winning, stunning
“Happy” words (3): praise, marvelous, impressive
“Welcome” words (2): strengthen, rich, funky
“Yes” words (1): validate, safe, adequate
“No” words (-1): numb, provoke, pushy
“Upset” words (-2): worthless, travesty, threaten
“Terrible” words (-3): woeful, worsen, kill
“Damned” words (-4): torture, fraud, (unmentionables)
“Catastrophic” words (-5): hell, rape, (more unmentionables)

Each word is given a numeric value based on its subjective meaning.

“Positive” words range on a positive scale; “Negative” words range on a negative scale.

Trajectory’s Analytics Engine uses these values to compute the book’s sentiment curve across sentence, paragraph, chapter and entire book.

This sentiment “fingerprint” at an aggregate level yields a unique picture of the book.
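
A minimal sketch of lexicon-based sentiment scoring, using the example words and values from this slide. It is not Trajectory's Analytics Engine: the sentence splitting, the summing of word values per sentence, and the averaging in `aggregate` are assumptions chosen for illustration, and the function names are hypothetical.

```python
import re

# Example words and scores taken from the slide; a real lexicon would be far larger.
LEXICON = {
    "breathtaking": 5, "thrilled": 5, "superb": 5,
    "winning": 4, "stunning": 4,
    "praise": 3, "marvelous": 3, "impressive": 3,
    "strengthen": 2, "rich": 2, "funky": 2,
    "validate": 1, "safe": 1, "adequate": 1,
    "numb": -1, "provoke": -1, "pushy": -1,
    "worthless": -2, "travesty": -2, "threaten": -2,
    "woeful": -3, "worsen": -3, "kill": -3,
    "torture": -4, "fraud": -4,
}


def sentence_scores(text):
    """Split text into sentences and sum the lexicon values of the words in each."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scores = []
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        scores.append(sum(LEXICON.get(word, 0) for word in words))
    return scores


def aggregate(scores):
    """Average sentence scores into one value for a paragraph, chapter, or book."""
    return sum(scores) / len(scores) if scores else 0.0


chapter = ("The view was breathtaking and the meal was superb. "
           "Then the fraud came to light. "
           "Things could only worsen.")
curve = sentence_scores(chapter)
print(curve, aggregate(curve))
```

The list of per-sentence scores is a simple stand-in for a sentiment curve; applying `aggregate` over larger spans gives the paragraph, chapter, and book level values.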

Page 42: BEA 2015 Generating Metadata by Machine Final


Sentiment: Analyzing the Words Within the Book

Page 43: BEA 2015 Generating Metadata by Machine Final

Sentiment: Analyzing the Words Within the Book

Page 44: BEA 2015 Generating Metadata by Machine Final

Trajectory Index

Page 45: BEA 2015 Generating Metadata by Machine Final

Keyword Analysis and Comparison

Page 46: BEA 2015 Generating Metadata by Machine Final

Keyword Translation into Local Languages

Page 47: BEA 2015 Generating Metadata by Machine Final

Recommendations

Page 48: BEA 2015 Generating Metadata by Machine Final

Thank You


2015 BEA – BOOTH 1347

United States: 50 Doaks Lane, Marblehead, Massachusetts 01945, United States
[email protected]

China: No. 3, 8 Chuang Ye Road, Haidian District, Beijing, China 100085

Page 49: BEA 2015 Generating Metadata by Machine Final

Q & A

Generating Metadata by Machine
BEA 2015
Friday, May 29, 11:30-12:20
Room 1E10