ISGC 2007 - Taipei, Taiwan 2007.03.29 SLIDE 1 Grid-based Search and Data Mining Using Cheshire3 In collaboration with Robert Sanderson University of Liverpool Department of Computer Science Presented by Ray R. Larson University of California, Berkeley School of Information


Page 1: Grid-based Search and Data Mining Using Cheshire3

ISGC 2007 - Taipei, Taiwan 2007.03.29 SLIDE 1

Grid-based Search and Data Mining Using Cheshire3

In collaboration with

Robert Sanderson

University of Liverpool

Department of Computer Science

Presented by

Ray R. Larson

University of California, Berkeley

School of Information

Page 2: Grid-based Search and Data Mining Using Cheshire3

Overview

• Introduction
• Context
• Architecture
• Grid
• Text Mining
• Data Mining
• Applications
• Future Plans and Applications
• Questions?

Page 3: Grid-based Search and Data Mining Using Cheshire3

Introduction

• Cheshire History:
  – Developed at UC Berkeley originally
  – Solution for library data (C1), then SGML (C2), then XML
  – Monolithic applications for indexing and retrieval server in C + TCL scripting
• Cheshire3:
  – Developed at Liverpool, plus Berkeley
  – XML, Unicode, Grid scalable: Standards based
  – Object Oriented Framework
  – Easy to develop and extend in Python

Page 4: Grid-based Search and Data Mining Using Cheshire3

Introduction

• Today:
  – Version 0.9.4
  – Mostly stable, but needs thorough QA and docs
  – Grid, NLP and Classification algorithms integrated
• Near Future:
  – June: Version 1.0
    • Further DM/TM integration, docs, unit tests, stability
  – December: Version 1.1
    • Grid out-of-the-box, configuration GUI

Page 5: Grid-based Search and Data Mining Using Cheshire3

Context

• Environmental Requirements:
  – Very Large scale information systems
    • Terabyte scale (Data Grid)
    • Computationally expensive processes (Comp. Grid)
    • Digital Preservation
    • Analysis of data, not just retrieval (Data/Text Mining)
    • Ease of Extensibility, Customizability (Python)
    • Open Source
    • Integrate not Re-implement
    • "Web 2.0" – interactivity and dynamic interfaces

Page 6: Grid-based Search and Data Mining Using Cheshire3

Context

[Figure: layered architecture diagram. Layers: Data Grid Layer, Digital Library Layer, Application Layer. Components shown: Data Grid (SRB, iRODS) and Store; User Interface (MySRB, PAWN); Process Management (Kepler, iRODS rules); Information System: Cheshire3 (Search/Retrieve, Index/Store); Process Management (Kepler, Cheshire3); Document Parsers (Multivalent, ...); Text Mining Tools (Tsujii Labs, ...) for Natural Language Processing and Information Extraction; Data Mining Tools (Orange, Weka, ...) for Classification and Clustering; Term Management (Termine, WordNet, ...); Protocol Handler (Apache + Mod_Python + Cheshire3); User Interface (Web Browser, Multivalent, Dedicated Client). Arrows are labelled Query, Results, Export and Parse.]

Page 7: Grid-based Search and Data Mining Using Cheshire3

Cheshire3 Object Model

[Figure: Cheshire3 object model diagram. Objects shown: Server, Database, ConfigStore, Object, UserStore, User, ProtocolHandler, DocumentGroup, a chain of PreParsers, Parser, Extracter, Normaliser, Transformer, Index, IndexStore, RecordStore, DocumentStore. Data passed between them: Document, Record, Records, Terms, Query, ResultSet. An "Ingest Process" arrow feeds Documents into the model.]

Page 8: Grid-based Search and Data Mining Using Cheshire3

Object Configuration

• One XML 'record' per non-data object
• Very simple base schema, with extensions as needed
• Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases)
• Allows workflows to reference by identifier but act appropriately within different contexts
• Allows multiple administrators to define objects without reference to each other
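
The context-scoped identifier rule above can be sketched in Python. This is only an illustration: the `Context` class, its methods, the identifiers, and the fallback from a database context to its enclosing server context are all assumptions made for the example, not Cheshire3's actual API.

```python
class Context:
    """A hypothetical configuration context (e.g. a server or a database)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # enclosing context, if any (assumed)
        self.objects = {}         # identifier -> configured object

    def register(self, identifier, obj):
        # Identifiers only need to be unique within this context.
        self.objects[identifier] = obj

    def get_object(self, identifier):
        # Look locally first, then fall back to the enclosing context.
        if identifier in self.objects:
            return self.objects[identifier]
        if self.parent is not None:
            return self.parent.get_object(identifier)
        raise KeyError(identifier)


server = Context("server")
server.register("defaultParser", "server-wide parser")

# Two databases, configured independently, reuse the same identifier.
db_a = Context("db_a", parent=server)
db_b = Context("db_b", parent=server)
db_a.register("recordStore", "record store for A")
db_b.register("recordStore", "record store for B")
```

A workflow that refers to `recordStore` would then resolve to a different object depending on which database context it runs in.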

Page 9: Grid-based Search and Data Mining Using Cheshire3

Grid

• Focus on ingest, not discovery (yet)
• Instantiate architecture on every node
• Assign one node as master, rest as slaves. Master then divides the processing as appropriate.
• Calls between slaves possible
• Calls as small, simple as possible:
  (objectIdentifier, functionName, *arguments)
• Typically:
  ('workflow-id', 'process', 'document-id')
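
The call format above can be sketched in Python. This is a minimal single-process illustration, not the Cheshire3 grid code: the `Workflow` stand-in, the `REGISTRY` dictionary, and the round-robin split performed by `master` are assumptions chosen for the example, and no real message passing is shown.

```python
class Workflow:
    """Stand-in for a configured workflow object (hypothetical)."""

    def process(self, document_id):
        # A real workflow would fetch, parse and index the document.
        return "processed " + document_id


# Every node instantiates the same architecture, so each node can
# resolve the same identifiers locally.
REGISTRY = {"workflow-id": Workflow()}


def handle_call(message):
    """Run one (objectIdentifier, functionName, *arguments) call on a node."""
    object_id, function_name, *arguments = message
    target = REGISTRY[object_id]
    return getattr(target, function_name)(*arguments)


def master(document_ids, n_slaves):
    """Divide documents round-robin into one list of calls per slave."""
    queues = [[] for _ in range(n_slaves)]
    for i, doc_id in enumerate(document_ids):
        queues[i % n_slaves].append(("workflow-id", "process", doc_id))
    return queues
```

Because each message carries only identifiers, the amount of data sent between master and slaves stays small; the document itself is fetched by whichever node runs the call.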

Page 10: Grid-based Search and Data Mining Using Cheshire3

Grid Architecture

[Figure: the Master Task sends (workflow, process, document) calls to Slave Task 1 through Slave Task N. Other labels: Data Grid ("fetch document", "document") and GPFS Temporary Storage ("extracted data").]

Page 11: Grid-based Search and Data Mining Using Cheshire3

Grid Architecture - Phase 2

[Figure: the Master Task sends (index, load) calls to Slave Task 1 through Slave Task N. Other labels: GPFS Temporary Storage ("fetch extracted data") and Data Grid ("store index").]

Page 12: Grid-based Search and Data Mining Using Cheshire3

Workflow Objects

• Written as XML within the configuration record.
• Rewrites and compiles to Python code on object instantiation

Current instructions:
  – object
  – assign
  – fork
  – for-each
  – break/continue
  – try/except/raise
  – return
  – log (= send text to default logger object)

Yes, no if!

Page 13: Grid-based Search and Data Mining Using Cheshire3

Workflow example

<subConfig id="buildSingleWorkflow">
  <objectType>workflow.SimpleWorkflow</objectType>
  <workflow>
    <object type="workflow" ref="PreParserWorkflow"/>
    <try>
      <object type="parser" ref="NsSaxParser"/>
    </try>
    <except>
      <log>Unparsable Record</log>
      <raise/>
    </except>
    <object type="recordStore" function="create_record"/>
    <object type="database" function="add_record"/>
    <object type="database" function="index_record"/>
    <log>"Loaded Record:" + input.id</log>
  </workflow>
</subConfig>
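
To make the control flow of such a workflow concrete, here is a minimal Python sketch that interprets a cut-down workflow. It is an assumption-laden illustration: Cheshire3 is described as compiling workflows to Python, whereas this sketch simply walks the XML; the `OBJECTS` registry and the `upper` and `strict` functions are hypothetical stand-ins, and only the `object`, `log`, `try`, `except` and `raise` instructions are handled.

```python
import xml.etree.ElementTree as ET

WORKFLOW = """
<workflow>
  <object type="preParser" ref="upper"/>
  <try>
    <object type="parser" ref="strict"/>
  </try>
  <except>
    <log>Unparsable Record</log>
    <raise/>
  </except>
  <log>Loaded Record</log>
</workflow>
"""


def upper(data):
    return data.upper()


def strict(data):
    if not data.startswith("<"):
        raise ValueError("not XML")
    return data


OBJECTS = {"upper": upper, "strict": strict}   # hypothetical object registry


def run_steps(steps, data, log):
    """Run steps in order, feeding each object's output to the next step."""
    steps = list(steps)
    i = 0
    while i < len(steps):
        step = steps[i]
        if step.tag == "object":
            data = OBJECTS[step.get("ref")](data)
        elif step.tag == "log":
            log.append(step.text)
        elif step.tag == "raise":
            raise   # re-raise the exception currently being handled
        elif step.tag == "try":
            # An <except> sibling, if present, handles failures of this <try>.
            handler = None
            if i + 1 < len(steps) and steps[i + 1].tag == "except":
                handler = steps[i + 1]
            try:
                data = run_steps(step, data, log)
            except Exception:
                if handler is None:
                    raise
                data = run_steps(handler, data, log)
            if handler is not None:
                i += 1   # skip the <except> element just consumed
        i += 1
    return data
```

Calling `run_steps(ET.fromstring(WORKFLOW), "<a/>", log)` runs every step in order, while an input that fails in `strict` is logged as unparsable and the error is re-raised, mirroring the try/except/raise pattern in the slide.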

Page 14: Grid-based Search and Data Mining Using Cheshire3

Text Mining

• Integration of Natural Language Processing tools
• Including:
  – Part of Speech taggers (noun, verb, adjective,...)
  – Phrase Extraction
  – Deep Parsing (subject, verb, object, preposition,...)
  – Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi)
• Planned: Information Extraction tools

Page 15: Grid-based Search and Data Mining Using Cheshire3

Data Mining

• Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes

• Focus on automatic classification for predefined categories rather than clustering

• Algorithms integrated/implemented:
  – Perceptron, Neural Network (pure python)
  – Naïve Bayes (pure python)
  – SVM (libsvm integrated with python wrapper)
  – Classification Association Rule Mining (Java)
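
The point about sparse vectors can be illustrated with a small pure-Python perceptron. This is a sketch under stated assumptions, not the Cheshire3 implementation: documents are represented as dictionaries holding only non-zero term counts, labels are -1 or +1, there is no bias term, and the toy data is invented for the example.

```python
def dot(weights, vector):
    """Dot product that touches only the non-zero entries of `vector`."""
    return sum(weights.get(term, 0.0) * value for term, value in vector.items())


def train_perceptron(examples, epochs=10):
    """Train a binary perceptron on (sparse_vector, label) pairs."""
    weights = {}
    for _ in range(epochs):
        for vector, label in examples:
            if label * dot(weights, vector) <= 0:        # misclassified
                for term, value in vector.items():
                    weights[term] = weights.get(term, 0.0) + label * value
    return weights


def predict(weights, vector):
    return 1 if dot(weights, vector) > 0 else -1


# Toy term-frequency vectors: only terms that occur are stored.
examples = [
    ({"protein": 2, "gene": 1}, +1),
    ({"gene": 3}, +1),
    ({"library": 2, "catalog": 1}, -1),
    ({"catalog": 2}, -1),
]
weights = train_perceptron(examples)
```

The cost of each update depends on the number of terms in one document rather than on the size of the whole vocabulary, which is why sparse input matters for text.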

Page 16: Grid-based Search and Data Mining Using Cheshire3

Data Mining

• Modelled as multi-stage PreParser object (training phase, prediction phase)

• Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM)

• Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore

• Document vectors are generated per index per document, so integrated NLP document normalization comes for free
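
The two-phase model can be sketched as follows. The class name, method names and dictionary-based document are hypothetical and do not reproduce Cheshire3's PreParser interface; the classifier shown is a plain multinomial Naïve Bayes with add-one smoothing, chosen here only as a compact example of a train-then-predict object.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesPreParser:
    """Hypothetical two-phase classifier: train() first, then process()."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)   # class -> term -> count
        self.doc_counts = Counter()               # class -> number of documents
        self.vocabulary = set()

    def train(self, accumulated):
        # `accumulated` plays the role of the merged training set:
        # an iterable of (sparse_vector, class_label) pairs.
        for vector, label in accumulated:
            self.doc_counts[label] += 1
            for term, count in vector.items():
                self.term_counts[label][term] += count
                self.vocabulary.add(term)

    def process(self, document):
        # Prediction phase: attach the predicted class as document metadata.
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            total_terms = sum(self.term_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for term, count in document["vector"].items():
                # Add-one smoothing so unseen terms do not zero the score.
                p = (self.term_counts[label][term] + 1) / (total_terms + len(self.vocabulary))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        document["metadata"] = {"predictedClass": best_label}
        return document
```

After training, `process` returns the same document with a `predictedClass` entry added, which is the kind of metadata the slide describes storing alongside the document.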

Page 17: Grid-based Search and Data Mining Using Cheshire3

Data Mining + Text Mining

• Testing integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies.
• Computational grid for distributing expensive NLP analysis
• Results show better accuracy with fewer attributes:

Vector Source                      Avg Attributes   TCV Accuracy
Every word in document             99               85.7%
Stemmed words in document          95               86.2%
Part of Speech filtered words      69               85.2%
Stemmed Part of Speech filtered    65               86.3%
Genia filtered                     68               85.5%
Genia Stem filtered                64               87.2%

Page 18: Grid-based Search and Data Mining Using Cheshire3

Applications (1)

Automated Collection Strength Analysis

Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries.

The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records.

This involved very large scale processing of records to:
  – Deduplicate millions of records
  – Enrich deduplicated records against database of 45 million
  – Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

Page 19: Grid-based Search and Data Mining Using Cheshire3

Applications (1)

• Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems.

• The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining

[Figure: bar chart "Records per Library for All of Psychology", comparing Original and Enhanced record counts (y-axis 0 to 6000) for Goldsmiths, Kings, Queen Mary, Senate, UCL and Westminster.]

Page 20: Grid-based Search and Data Mining Using Cheshire3

Applications (2)

Assessing the Grade Level of NSDL Education Material

• The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid.
• Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL.
• We determined the vocabulary-based grade-level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier).
• This processing was done on the TeraGrid cluster at SDSC.

Page 21: Grid-based Search and Data Mining Using Cheshire3

Applications (2)

• The formula for the Flesch Reading Ease Score:
  FRES = 206.835 - 1.015 * ((total words)/(total sentences)) - 84.6 * ((total syllables)/(total words))
• The Flesch-Kincaid Grade Level Formula:
  FKGLF = 0.39 * ((total words)/(total sentences)) + 11.8 * ((total syllables)/(total words)) - 15.59
• The Domain was determined as follows:
  – The domains used were based upon the AAAS Benchmarks.
  – Samples from each of the domain areas being examined are used to produce scored and ranked lists of vocabularies for each domain.
  – Each token in a document is passed through a lookup function against this table, and tallies are calculated for the entire document.
  – These tallies are then used to rank the likelihood of the document being about each topic, and a statistical pass over the results returns only those topics that are above a certain threshold.
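
Both readability formulas and the tally-based domain ranking can be written out as a short Python sketch. The two Flesch functions follow the formulas on this slide directly and take pre-computed counts as input. The `rank_domains` function, the `vocab_scores` table and the fixed `threshold` are simplifying assumptions for illustration; in particular, a plain cut-off stands in for the "statistical pass" mentioned above.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease Score from total word, sentence and syllable counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from the same three counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def rank_domains(tokens, vocab_scores, threshold):
    """Tally per-domain scores for a document; keep domains above `threshold`."""
    tallies = {}
    for token in tokens:
        # Tokens missing from the lookup table contribute nothing.
        for domain, score in vocab_scores.get(token, {}).items():
            tallies[domain] = tallies.get(domain, 0.0) + score
    ranked = sorted(tallies.items(), key=lambda item: item[1], reverse=True)
    return [(domain, tally) for domain, tally in ranked if tally > threshold]


# Hypothetical lookup table: token -> {domain: score}
vocab_scores = {
    "cell": {"biology": 2.0},
    "gene": {"biology": 3.0},
    "orbit": {"astronomy": 2.5},
}
```

For a text of 100 words, 5 sentences and 150 syllables, the two functions give a reading ease of about 59.6 and a grade level of about 9.9.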

Page 22: Grid-based Search and Data Mining Using Cheshire3

Future Plans

• IR Testing and Optimization
  – Work with the OCA Book collection as part of INEX 2007
  – TREC, CLEF, and INEX Benchmarking
• Integration of Geographic Information Retrieval methods from Cheshire II
  – GIR Ranking and Gazetteer-based text retrieval using NLP methods
• Pattern-driven text mining methods for extracting biographical information from texts
  – IMLS-funded “Bringing Lives to Light” project

Page 23: Grid-based Search and Data Mining Using Cheshire3

Overview

• Bringing Lives to Light
  – Focusing on the Who in Who, What, Where and When
  – Examining and extending various types of Biographical Markup
  – Mining biographical data from available information resources to fill our extended markup databases

Page 24: Grid-based Search and Data Mining Using Cheshire3

WHEN, WHERE and WHO

• Catalog records found from a time period search commonly include names of persons important at that time. Their names can be forwarded to, e.g., biographies in the Wikipedia encyclopedia.

Page 25: Grid-based Search and Data Mining Using Cheshire3

Place and time are broadly important across numerous tools and genres including, e.g., language atlases, library catalogs, biographical dictionaries, bibliographies, archival finding aids, museum records, etc.

Biographical dictionaries are also heavy on place and time: Emanuel Goldberg, Born Moscow 1881. PhD under Wilhelm Ostwald, Univ. of Leipzig, 1906. Director, Zeiss Ikon, Dresden, 1926-33. Moved to Palestine 1937. Died Tel Aviv, 1970.

Life as a series of episodes involving Activity (WHAT), WHERE, WHEN, and WHO else.

Page 26: Grid-based Search and Data Mining Using Cheshire3

A new form of biographical dictionary would link to all

[Figure: a Biographical Dictionary linked to Texts, Numeric datasets, Thesaurus/Ontology, Gazetteers, captions, Maps/Geo Data, EVI, Time Period Directory, and Time lines/Chronologies.]

Page 27: Grid-based Search and Data Mining Using Cheshire3

“Lives” Projected Work

• Develop XML markup for Biographical Events

• Most likely to be an adaptation and extension of existing biographical event markup
  – Example: EAC/EAD

• Harvest biographical resources – Wikipedia, etc.

• Integrate as next generation of current interface

Page 28: Grid-based Search and Data Mining Using Cheshire3

EAC/EAD

<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1915</date>
      <event>A.B., <corpname>Yale University, </corpname>New Haven, Conn.</event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname></event>
    </chronitem>
    <chronitem>
      <date>1917-1919</date>
      <event>Served in <corpname>United States Army</corpname></event>
    </chronitem>
  </chronlist>
</bioghist>
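
Markup of this kind maps naturally onto WHAT, WHERE, WHEN and WHO fields. The following Python sketch, using the standard library's `xml.etree.ElementTree`, shows one way such a mapping might look; the `life_events` function and its output keys are assumptions for illustration and are not part of the project's tooling. The sample reuses two of the entries from the slide.

```python
import xml.etree.ElementTree as ET

BIOGHIST = """
<bioghist>
  <head>Biographical Note</head>
  <chronlist>
    <chronitem>
      <date>1892, May 7</date>
      <event>Born, <geogname>Glencoe, Ill.</geogname></event>
    </chronitem>
    <chronitem>
      <date>1916</date>
      <event>Married <persname>Ada Hitchcock</persname></event>
    </chronitem>
  </chronlist>
</bioghist>
"""


def life_events(xml_text):
    """Turn each <chronitem> into a WHAT / WHERE / WHEN / WHO record."""
    events = []
    for item in ET.fromstring(xml_text).iter("chronitem"):
        event = item.find("event")
        events.append({
            # Full event text with whitespace normalised.
            "what": " ".join("".join(event.itertext()).split()),
            "where": [g.text.strip() for g in event.iter("geogname")],
            "when": item.findtext("date"),
            "who": [p.text.strip() for p in event.iter("persname")],
        })
    return events
```

Each returned record keeps the date as written, so interpreting values such as "1892, May 7" or a date range would be a separate step.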

Page 29: Grid-based Search and Data Mining Using Cheshire3

Wikipedia data

Life events metadata

• WHAT: Actions: prisoner
• WHERE: Places: Holstein
• WHEN: Times: 1261-1262
• WHO: People: Margaret Sambiria

Need external links

Page 30: Grid-based Search and Data Mining Using Cheshire3

Page 31: Grid-based Search and Data Mining Using Cheshire3

A Metadata Infrastructure

[Figure: metadata infrastructure diagram. CATALOGS: Archives, Historical Societies, Libraries, Museums, Public Television, Publishers, Booksellers. RESOURCES: Audio, Images, Numeric Data, Objects, Texts, Virtual Reality, Webpages. INTERMEDIA INFRASTRUCTURE, listed by Facet, Authority Control and Special Display Tools: WHO (Biographical Dictionary), WHEN (Time Period Directory, Timelines), WHERE (Gazetteer, Maps), WHAT (Thesaurus, Syndetic Structure). Also shown: Learners, Dossiers.]

Page 32: Grid-based Search and Data Mining Using Cheshire3

“Lives” Acknowledgements

• Electronic Cultural Atlas Initiative project
• This work is being supported by the Institute of Museum and Library Services through a National Leadership Grant for Libraries

• Contact: [email protected]

Page 33: Grid-based Search and Data Mining Using Cheshire3

Thank you!

Available via http://www.cheshire3.org