Text mining in digital collections
CHASE: Going digital
Deirdre Lungley, [email protected]
February 6, 2013

Introduction | Text Mining | Classification

Bridgeman Digital Art Library | Bridgeman Categories | Sample Classification Data

Bridgeman Categories

2 : Oriental Miniatures
7 : Maps
9 : Posters
12 : Arms, Armour & Militaria
15 : Botanical
18 : Clocks, Watches, Barometers & Sundials
20 : Costume & Fashion
21 : Enamels
22 : Ephemera
24 : Furniture
25 : Glass
27 : Icons
29 : Inventions
30 : Jewellery (see also Semi-precious stones)
31 : Juvenilia / Children's Toys & Games
33 : Lighting
35 : Medicine
38 : Mythology Mythological Myth
40 : Animals
41 : Mosaics
44 : Semi-precious Stones (see also Jewellery)
46 : Science
47 : Sculpture
51 : Sports and Leisure
56 : Trade Emblems, City Crests, Coats of Arms
1126 : CHOIR BOOKS
5000 : The Arts and Entertainment
5001 : Ancient and World Cultures
5002 : Architecture
5003 : Business and Industry
5004 : Places
5005 : Science and Medicine
5006 : History
5007 : Religion and Belief
5010 : Travel and Transport
5011 : Plants and Animals
5013 : Emotions and Ideas

Sample Classification Data

Query/Clicked URL Gold Standard Annotations Classifier Predictions

monster woman 5007 : Religion and Belief 5007 : Religion and Belief

Dulle Griet raiding Hell 5 : Allegory / Allegorical

38 : Mythology Mythological Myth

nuno 5007 : Religion and Belief 5007 : Religion and Belief

The Fishermen from the Polyptych of St. Vincent 42 : Personalities 5012 : Land and Sea

42 : Personalities

girl poor 5009 : People and Society 5009 : People and Society

A Peasant Girl Gathering Faggots in a Wood 5012 : Land and Sea

Python & NLTK | Web Services | Sample Code (1) – Wikify text

Tools of the trade

Python:
  • High-level language
  • Many standard libraries, e.g., XML parser

Natural Language Toolkit (NLTK):
  • A platform for building Python programs to work with human language data (nltk.org)

Why?
  • Glue between applications
  • Data preparation for tools such as Weka
  • Allows programmatic access to web services

Example Web Service – WikipediaMiner

Sample Python XML parsing – Wikify RSS title
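
The code shown on this slide is not preserved in the transcript. As a rough sketch of the idea, the following uses Python's standard XML parser to pull item titles out of an RSS document and builds a request URL for a wikify web service. The endpoint address and the `source` parameter name are assumptions for illustration only, as are the sample feed and the function names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint: substitute the address of a real wikify service.
WIKIFY_URL = "http://example.org/services/wikify"

# Made-up RSS 2.0 document standing in for a real feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>Mona Lisa goes on tour</title></item>
    <item><title>New maps of Mars</title></item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the <title> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


def wikify_request_url(text, base_url=WIKIFY_URL):
    """Build a GET URL asking the (assumed) service to wikify `text`."""
    # "source" is an assumed parameter name, not a documented one.
    return base_url + "?" + urlencode({"source": text})


titles = item_titles(SAMPLE_RSS)
urls = [wikify_request_url(t) for t in titles]
```

Each URL in `urls` could then be fetched with `urllib.request.urlopen` and the response parsed in the same way as the feed itself.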

Sample Python XML parsing – Wikify RSS title (Output)

Supervised Learning - Basics | Sample Code (2) – Classify BBC RSS Feeds

Supervised Learning - Basics

Classifier (Model) built from:
  • Positive/Negative examples (labelled data)
  • Features - present/absent for a given label

Test data built using:
  • Present/absent classifier features

Case Study - Support Vector Machine (SVM) Classifier:
  • Finds the separating hyperplane with the widest margin; the training points lying on the margin are the support vectors
  • Used extensively in research
  • Here: treat as a black box with default settings

SVMLight data format:
  <target> <feature>:<value> ... <feature>:<value>
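
To make the data format concrete, here is a small sketch that writes one example as an SVMLight line. The function name and the dictionary representation of features are illustrative choices, not taken from the talk. Feature numbers are emitted in increasing order, which SVMLight requires, and absent features are simply left out.

```python
def svmlight_line(target, features):
    """Format one example as an SVMLight line.

    target   -- +1 or -1 for a positive or negative example
    features -- dict mapping feature number (int >= 1) to its value
    """
    # SVMLight expects feature numbers in increasing order.
    pairs = " ".join(f"{num}:{value:g}" for num, value in sorted(features.items()))
    return f"{target:+d} {pairs}"


# Features 3 and 7 are present (value 1); all others are absent and omitted.
line = svmlight_line(+1, {7: 1, 3: 1})
print(line)  # +1 3:1 7:1
```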

Supervised learning pipeline (diagram):

  Training Examples (Project Gutenberg Catalogue) → Feature Extractor → Pos/Neg labelled feature sets (Training Data) → Learning tool (SVM_Learn) → Classifier model

  Test Examples (BBC RSS Feed) → Feature Extractor → Test feature sets (Test Data) → Classifier model (SVM_Classify) → Predictions

Training Data – Project Gutenberg

Case Study Task: Classify BBC RSS feeds

Retrieve & parse BBC RSS feed

Create Classification Features
  • Casefolding
  • Tokenisation
  • Stemming
  • Stopwords

Classify (test data → predictions)
  • Output to file on disk
  • Call command
  • Read file
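
The three classification steps above (write a file, call the command, read the result) might be glued together in Python roughly as follows. This is a sketch under assumptions: it supposes the SVMLight `svm_classify` program is on the PATH and that a model file was produced earlier by `svm_learn`. The function name, file names and the `runner` parameter are illustrative.

```python
import subprocess
import tempfile
from pathlib import Path


def classify(lines, model_file, runner=subprocess.run):
    """Write test examples to disk, call svm_classify, read predictions back.

    lines      -- test examples already in SVMLight format
    model_file -- path to a model produced earlier by svm_learn
    runner     -- callable that executes the command (swappable for testing)
    """
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "test.dat"
        pred_file = Path(tmp) / "predictions.txt"

        # 1. Output to file on disk
        test_file.write_text("\n".join(lines) + "\n")

        # 2. Call command: svm_classify example_file model_file output_file
        runner(["svm_classify", str(test_file), str(model_file), str(pred_file)],
               check=True)

        # 3. Read file: one decision value per line; its sign is the class
        scores = [float(s) for s in pred_file.read_text().split()]
    return [+1 if s > 0 else -1 for s in scores]
```

Passing the command runner as a parameter keeps the function testable on a machine where SVMLight is not installed; in normal use the default `subprocess.run` is left in place.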

Retrieve & parse RSS feed
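
The slide's code is not preserved in the transcript. A minimal sketch of this step using only the standard library could look like the following. The function names and the sample feed are illustrative assumptions, and the feed is assumed to be plain RSS 2.0 with `title` and `description` elements in each item.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download a feed and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_items(rss_bytes):
    """Return (title, description) pairs for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_bytes)
    return [
        (item.findtext("title", default="").strip(),
         item.findtext("description", default="").strip())
        for item in root.iter("item")
    ]


# Offline sample standing in for the downloaded feed.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
  <item><title>Storm hits coast</title><description>Flood warnings issued.</description></item>
  <item><title>Cup final preview</title><description>Teams named.</description></item>
</channel></rss>"""

items = parse_items(SAMPLE)
```

Against a live feed the call would be `parse_items(fetch(feed_url))`, where `feed_url` is the address of the chosen BBC feed (not given in this transcript).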

Retrieve & parse RSS feed (Output)

Text to Features
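
As an illustration of the four feature steps named earlier (casefolding, tokenisation, stemming, stopwords), here is a self-contained sketch. The stopword list, the suffix-stripping stemmer and the vocabulary are deliberately tiny stand-ins invented for this example. In practice NLTK's `PorterStemmer` and a full stopword list would be the natural replacements.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "the", "to"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_to_features(text):
    """Casefold, tokenise, drop stopwords, stem; return the set of features."""
    tokens = re.findall(r"[a-z]+", text.casefold())
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}


def to_feature_numbers(features, vocabulary):
    """Keep only features the classifier knows; map each to its number."""
    return sorted(vocabulary[f] for f in features if f in vocabulary)


features = text_to_features("Storms flooding the coast and the roads")

# Hypothetical vocabulary built from the training data.
vocabulary = {"coast": 1, "flood": 2, "goal": 3, "storm": 4}
numbers = to_feature_numbers(features, vocabulary)
```

Here `features` holds the stems `storm`, `flood`, `coast` and `road`, and `numbers` lists the feature numbers that are present for this text. Words missing from the training vocabulary, such as `road`, are dropped because the classifier has no feature for them.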

Text to Features (Output)

Classify: Test data → predictions (Output)

Training Data – Project Gutenberg

Create training data (Output)

References:

The Regex Coach

Thank You!
