Training-less Ontology-based Text Categorization
Maciej Janik
July 1st, 2008
Dissertation Defense
Major professor: Dr. Krzysztof J. Kochut
Committee: Dr. John A. Miller, Dr. Khaled Rasheed, Dr. Amit P. Sheth
LSDIS lab, Computer Science, University of Georgia



Document categorization

Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. [Wikipedia]


Objectives

• Document categorization method
  – Classification is based on knowledge from an ontology
  – Does not require a training set
  – Uses semantic information for categorization
  – Explores the role of semantic associations in text categorization
  – Incorporates user interest (context) into categorization


Automatic document categorization

• Methods are based on word/phrase statistics, information gain, and other probability or similarity measures (1).
• Examples
  – Naïve Bayes, SVM, Decision Tree, k-NN
• Categorization is based on information (frequencies, probabilities) learned from the training documents.
• Vocabulary extension/unification is possible by use of synonyms, homonyms, and word groups (e.g. from WordNet)
• Document representation for categorization
  – Set or vector of features; the most popular and simplest is the bag of words
  – Does not include information about document structure, relative position of phrases, etc.

(1) Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34 (1). 1 - 47.
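The bag-of-words representation mentioned on this slide can be illustrated with a minimal sketch. The tokenizer and the two sample sentences are assumptions made for illustration; the point is only that two texts with different structure can collapse to the same feature set.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a term -> count mapping; word order and structure are discarded."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical sample texts: the relation between the entities is reversed,
# yet both produce an identical bag of words.
doc_a = "Ford sells Jaguar"
doc_b = "Jaguar sells Ford"

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)
```

Here `bag_a == bag_b` holds, which is the loss of structural information that the slide points out.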


Document categorization by people

• People categorize a document by understanding its content, using their knowledge, and understanding what the category is.
• Categorization is based on:
  – Document content: entities and relationships
  – Knowledge: ontology
  – Category: category definition
  – Perceived interest: categorization context


OmniCat approach

• Categorization knowledge
  – Ontology
• Features
  – Entities, relationships and semantic associations
• Category definitions
  – Relevant fragments of ontology
  – Importance of classes, entities, and relationships
• Categorization process
  – Matching of a document text to find the best fit into defined ontology fragments


Semantic associations

• Semantic Association
  – A simple, undirected path that connects two entities in the knowledge base and describes how they are related.
  – Relationships on the path define the meaning of this connection.
  – Directionality of relationships sets a specific interpretation of a path.
  – Entities on the path specify the content.

(1) Sheth, A. P., I. B. Arpinar, et al. (2003). Relationships at the Heart of Semantic Web: Modeling, Discovering, and Exploiting Complex Semantic Relationships. Enhancing the Power of the Internet: Studies in Fuzziness and Soft Computing. M. Nikravesh, B. Azvin, R. Yager and L. Zadeh, Springer Verlag.
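A semantic association as defined above (a simple, undirected path whose edges keep their original direction) can be sketched with a depth-first search over a set of triples. This is only an illustration under stated assumptions: the function name, the `"->"` / `"<-"` direction markers, and the sample triples are hypothetical and are not taken from BRAHMS or SPARQLeR.

```python
def simple_paths(triples, source, target, max_length):
    """Enumerate simple undirected paths of at most max_length relations."""
    # Each triple is traversable both ways; the marker records its real direction.
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((o, p, "->"))
        adjacency.setdefault(o, []).append((s, p, "<-"))

    results = []

    def extend(node, visited, path):
        if node == target:
            results.append(list(path))
            return
        if len(path) == max_length:
            return
        for neighbor, prop, direction in adjacency.get(node, []):
            if neighbor not in visited:  # "simple": no entity is repeated
                visited.add(neighbor)
                path.append((prop, direction, neighbor))
                extend(neighbor, visited, path)
                path.pop()
                visited.remove(neighbor)

    extend(source, {source}, [])
    return results


# Hypothetical mini knowledge base.
triples = [
    ("ford", "has_CEO", "mulally"),
    ("ford", "owns", "jaguar_cars"),
    ("mulally", "works_in", "detroit"),
    ("jaguar_cars", "exhibits_in", "detroit"),
]
paths = simple_paths(triples, "mulally", "jaguar_cars", max_length=3)
```

Each result lists the properties on the path (the meaning of the connection), their directions (the interpretation), and the entities passed through (the content).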


Semantic Associations - Paths in RDF

Directed path

Undirected path

Undirected path, but with specific properties and directionality


BRAHMS

Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery", Fourth International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005


BRAHMS

• Features
  – Main-memory RDF/S storage
  – Handles RDF and RDFS data
  – High performance for accessing RDF/S data
  – Efficient handling of large ontologies
  – Rich API provides a framework for creating ontology-based algorithms (e.g. semantic association discovery)
• Separation of schema and instances
  – Read-only access to ontology
• Developed for the needs of the SemDis (1) project

(1) http://lsdis.cs.uga.edu/projects/semdis/


Design decisions

• Performance requirements
  – use main memory for storage (fastest access)
  – create indexes for operations used in graph traversal algorithms
  – use C/C++ in the implementation instead of Java
  – instead of string URIs, use a simple type [int] as resource identifiers
• Ontology size
  – compact representation for handling large ontologies, leaving some memory for algorithms
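The decision to replace string URIs with integer identifiers can be illustrated with a small dictionary-encoding sketch. BRAHMS itself is written in C/C++; the Python below, including the `ResourceDictionary` class, `build_index`, and the `ex:` URIs, is a hypothetical illustration of the idea rather than the BRAHMS API.

```python
class ResourceDictionary:
    """Maps each distinct URI to a small integer and back."""

    def __init__(self):
        self._id_of = {}   # URI string -> integer id
        self._uri_of = []  # integer id -> URI string

    def encode(self, uri):
        if uri not in self._id_of:
            self._id_of[uri] = len(self._uri_of)
            self._uri_of.append(uri)
        return self._id_of[uri]

    def decode(self, resource_id):
        return self._uri_of[resource_id]


def build_index(triples, dictionary):
    """Index outgoing statements by subject id, as a graph traversal would need."""
    out_edges = {}
    for s, p, o in triples:
        sid, pid, oid = (dictionary.encode(x) for x in (s, p, o))
        out_edges.setdefault(sid, []).append((pid, oid))
    return out_edges


# Hypothetical URIs used only for this example.
dictionary = ResourceDictionary()
index = build_index(
    [("ex:Ford", "ex:owns", "ex:Jaguar"), ("ex:Ford", "ex:owns", "ex:LandRover")],
    dictionary,
)
```

Traversal code then compares and stores integers only, and strings are looked up when a result has to be shown.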


Design decisions

• Handle RDF/S
  – simplify the design: do not include or check logic or constraints imposed by OWL
• Separate instance base from schema
  – represent instances, schema classes, and properties as different object types
  – have specific methods to access schema or instances
  – different types of objects require different types of statements


Design decisions

• Framework for algorithms
  – create a rich API of basic operations to access RDF/S data
• Consequences of design decisions
  – compact knowledge base to minimize memory usage; no memory fragmentation: use contiguous memory blocks and make it read-only
  – create a snapshot of memory structures for fast start-up (parse (*) once, use many times)
  – handle taxonomy in a special way

(*) Redland's Raptor is used as the RDF/S parser: http://librdf.org/raptor


Results - timing

bi-BFS on synthetic Business-Sports-Entertainment
(45,000 instance statements, 29,889 instances, RDF: 13 MB)

Time [sec] by association length [relations]:

Length               9        10         11          12
Jena              12.8      39.9       59.3         847
Sesame             1.8      11.9       25.7         386
Redland           0.43       2.6        5.2        64.8
BRAHMS             0.1       0.5        1.9          38
Found paths       8559    131009    1680943    24392420

Chart annotations at length 12 (time relative to BRAHMS): Redland x 1.70, Sesame x 10.16, Jena x 22.29


Results - timing

bi-BFS search on Univ(700,0), 6.5 GB file

Association length [relations]      4       5        6           7             8
BRAHMS time [sec]                0.02    0.15     0.33       46.42        308.87
Found paths                        32     205   94,152   1,271,857   314,116,239

[Chart: time in seconds on a linear axis, found paths on a log-scale axis]


SPARQLeR

Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, 3-7 June 2007


SPARQLeR

• Extension of SPARQL for semantic association discovery.
• Seamlessly integrated into the SPARQL syntax.
• Graph patterns incorporating simple paths with constraints.
• Support for flexible length paths.
• Property constraints (path patterns) are based on regular expressions over properties.
• Additional constraints on entities included in the path (instances and properties).


Path patterns in SPARQLeR

• Path in SPARQLeR is a meta-property
  – Resource –[property] Resource
  – Resource –[path] Resource
• Path is also a Sequence
  – Test if a resource is in the path:
    • rdfs:member
  – Test if a resource is at a specific position in the path:
    • rdf:_2, rdf:_4, ...
• SPARQLeR-specific path properties
  – Test all resources or all properties in the path:
    • rdfms:entityResource and rdfms:propertyResource

Example: all resources on a path must be of type foo:Person


SPARQLeR extensions

• Path expressions
  – use of regular expressions over properties
• Flexible path specification
  – Undirected
  – Defined directionality paths
    • Directed
  – Length restricted
• Complex path patterns
  – Test of resources and properties on the path
  – Intersecting paths


RegExp in path constraints

• Path constraints on properties are based on regular expressions
  – Uses syntax similar to lex
  – Easy for grep users
• Examples:
    a c* d
    a+ (b|c) a
    [abc] c? d
    ( b a-1 )+ c
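The idea of constraining a path with a regular expression over its properties can be sketched in Python with the standard `re` module. Several assumptions are made here for illustration only: properties are single lowercase letters as in the examples above, `a-1` denotes the inverse of property `a`, and the inverse is encoded internally as the uppercase letter. SPARQLeR's actual matching works over property URIs and is not reproduced here.

```python
import re


def encode_step(step):
    """Encode property 'a' as 'a' and its inverse 'a-1' as 'A' (our own convention)."""
    return step[:-2].upper() if step.endswith("-1") else step


def compile_path_pattern(pattern):
    """Translate a slide-style pattern such as '( b a-1 )+ c' into a Python regex."""
    compact = pattern.replace(" ", "")
    compact = re.sub(r"([a-z])-1", lambda m: m.group(1).upper(), compact)
    return re.compile(compact)


def path_matches(pattern, steps):
    """Check whether the whole sequence of properties on a path fits the pattern."""
    encoded = "".join(encode_step(step) for step in steps)
    return compile_path_pattern(pattern).fullmatch(encoded) is not None


ok_star = path_matches("a c* d", ["a", "c", "c", "d"])
ok_inverse = path_matches("( b a-1 )+ c", ["b", "a-1", "b", "a-1", "c"])
too_short = path_matches("a+ (b|c) a", ["a", "b"])
```

With these inputs the first two paths satisfy their patterns, while the third is rejected because the trailing `a` is missing.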


SPARQLeR - example

SELECT list(%path)
WHERE {
  <r> %path <s> .
  %path rdf:_2 <e> .
  %path rdfms:entityResource ?x .
  ?x rdf:type <foo:A>
  FILTER(length(%path)<=6 && regex(%path, "(foo:prop -foo:rel)+", "dih"))
}

[Figure: graph illustrating the query. Surviving labels: foo:prop, foo:proper, foo:rel (x2), rdf:type, rdfs:subPropertyOf; nodes s, ?x, A.]


Experiments

• Scalability
  – Modified DBLP datasets in RDF (added random citations)
  – Test on increasing dataset (adding older years of publications)
  – Search for cited publications (transitive)

PREFIX opus: <http://lsdis.cs.uga.edu/projects/semdis/opus#>

SELECT ?end_publication
WHERE {
  <http://dblp.uni-trier.de/rec/bibtex/journals/ai/Huber06> %path ?end_publication
  FILTER ( length(%path)<=26 && regex(%path, "(opus:cites_publication)*" ) )
}

B. Aleman-Meza et al. Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. (WWW2006)


Experiments – dataset characteristics


Experiments – results: single source paths

Search paths up to length 26


OmniCat

Maciej Janik, Krys Kochut. "OmniCat: Automatic Text Classification with Dynamically Defined Categories", 7th International Semantic Web Conference (ISWC 2008), Karlsruhe, Germany [submitted to]

Maciej Janik, Krys Kochut. "Wikipedia in Action: Ontological Knowledge in Text Categorization", Second IEEE International Conference on Semantic Computing, ICSC 2008, Santa Clara, CA, USA, August 2008 [to appear]

Maciej Janik, Krys Kochut. "Training-less Ontology-based Text Categorization", Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2008) at the 30th European Conference on Information Retrieval (ECIR'08), Glasgow, Scotland, 30 March 2008


Ontology

• “An explicit specification of a conceptualization.” 1

• Ontology is a data model that represents a set of concepts within a domain and the relationships between those concepts. It is used to reason about the objects within that domain. [Wikipedia]

(1) Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5 (2). 199-220, 1993.


Ontology-based classification

• Ontology IS the knowledge base and THE CLASSIFIER: no need for a training set.
  – Rich instance base defines the known universe.
  – Schema with taxonomy describes the categorization structure.
• Classification is based on entities recognized in text and semantic relationships between them.
• Categories assigned are based on entity types, the taxonomy embedded in the schema, and provided categorization contexts.


OntoCategorization – bases

• Probability
  – Document is classified based on probabilities that a given feature (word, phrase) belongs to a certain category.
• Similarity
  – Category is defined as an ontology fragment (entities, classes, structures, etc.)
  – Similarity of the document graph to a given ontology fragment describes closeness to the selected category
• Connectivity (components)
  – Knowledge is based on associations.
  – Entities in one category should form a connected component, as they belong to the same subject.
• Context
  – Specific entities, entity types, or semantic structures may be of different importance for the user


Graph representation of text

• Graph representation preserves (selected) structural information from the document
  – Relative word positions, to find closely co-occurring phrases.
  – Paragraph, formatting (e.g. emphasis), part of document.
• Sample representations
  – Words form a directed graph, chained in the order in which they appear in each sentence.
  – Words form a weighted graph, where an edge connects words within a certain distance and the weight determines closeness.
  – Connected terms based on NLP processing or co-occurrence.
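The weighted word graph described in the sample representations can be sketched as follows. The window size and the choice of `1 / distance` as the closeness weight are assumptions made for this illustration; the slides do not specify a particular weighting.

```python
import re
from collections import defaultdict


def word_graph(text, window=2):
    """Connect words that occur within `window` positions of each other.

    Edges are undirected; closer co-occurrences add more weight (1 / distance).
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    weights = defaultdict(float)
    for i, left in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            right = tokens[j]
            if left == right:
                continue  # skip self-loops for repeated words
            edge = tuple(sorted((left, right)))
            weights[edge] += 1.0 / distance
    return dict(weights)


graph = word_graph("Ford sells Jaguar", window=2)
```

For this sentence, adjacent words get weight 1.0 and the pair two positions apart gets 0.5, so some positional information survives that a bag of words would drop.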


Graph-based categorization

• Categorization based on similarity metrics (1)
  – Isomorphism
  – Maximum common subgraph / minimum common supergraph
  – Graph edit distance
  – Statistical methods
    • Diameter, degree distribution, betweenness
  – Comparison of node neighbors
  – Distance preservation measure
• Methods
  – k-NN: most straightforward
  – similarity to centroids: graph mean and graph median
  – term distance to category

(1) Schenker, A., Bunke, H., Last, M. and Kandel, A. Graph-Theoretic Techniques for Web Content Mining. World Scientific, London, 2005.


Classes and categories

• Classes do not have to be categories
• Classes
  – Form taxonomy / partonomy
  – Strict, formal requirements
  – Membership based on features
• Categories
  – Can include other categories, intersect with them, etc.: a more set-like approach
  – Category can be a complex structure of classes, relationships and instances
  – Topic of interest that can span multiple, normally unrelated classes in the schema


OmniCat system


Algorithm sketch

• Semantic graph construction
  – Conversion of an unstructured text into a semantic graph
• Thematic graph selection
  – Setting a topic by selection of graph(s) for categorization
• Categorization using ontology
  – Bottom-up approach of category discovery
  – Top-down approach with categorization context projection


Semantic graph construction (1)

• Named entity identification
  – Match known phrases (literals) from the ontology and assign an initial confidence weight
  – Each phrase is assigned a confidence level based on the uniqueness of entity identification
  – The number of times each phrase is matched suggests its importance in the text
  – Text-phrase similarity is used when applying stop word removal or stemming

[Equation: initial confidence weight w, computed over the matched phrases i = 1..n]
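The named entity identification step can be sketched as a dictionary lookup of ontology literals in the text. This is a simplified illustration, not OmniCat's weighting: the confidence `1 / number of candidate entities` is an assumption standing in for "uniqueness of entity identification", and the literal-to-entity table is hypothetical.

```python
import re


def match_entities(text, literals):
    """Match known phrases and return entity -> (match count, confidence).

    literals: lowercase phrase -> list of candidate entity names.
    Confidence is 1 / number of candidates (an assumed uniqueness measure).
    """
    lowered = text.lower()
    matches = {}
    for phrase, candidates in literals.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        count = len(re.findall(pattern, lowered))
        if count == 0:
            continue
        confidence = 1.0 / len(candidates)
        for entity in candidates:
            previous = matches.get(entity, (0, 0.0))
            matches[entity] = (previous[0] + count, max(previous[1], confidence))
    return matches


# Hypothetical literals; an ambiguous phrase maps to several candidate entities.
literals = {
    "ford": ["Ford Motor Company"],
    "jaguar": ["Jaguar Cars Ltd.", "Jaguar (animal)"],
    "land rover": ["Land_Rover"],
    "alan mulally": ["Alan_Mulally"],
}
sentence = ("Ford Motor Co. is in the process of selling Jaguar and Land Rover, "
            "according to Ford CEO Alan Mulally.")
matched = match_entities(sentence, literals)
```

In this sketch the ambiguous phrase "Jaguar" yields two candidates with lower confidence, while "Ford" is matched twice and unambiguously; later steps would then decide which candidate fits the document.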


Example of entity matching

Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.

[Figure: candidate entities matched in the sentence: Ford Motor Company (shown twice), Business process, Process (science), Process (computing), Sales, Jaguar (animal), Jaguar Cars Ltd., Land_Rover, Chief Executive Officer, Alan_Mulally]


Semantic graph construction (2)

• Entity relationship extraction
  – NLP parse of each sentence to get a dependency tree
  – Use previously matched phrases as clues for entity positions
  – If matched phrases are close in the parse tree, add a relationship between them in the final graph
• OmniCat does not extract named relationships


Example – parse tree and triples

Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.


Semantic graph construction (3)

• Connectivity inducement
  – For each pair of matched entities, find all relationships in the ontology
  – Each relationship has an importance factor, based on the semantics of the information it defines


Example – NLP + ontology knowledge

Ford Motor Co. is in the process of selling Jaguar and Land Rover, according to Ford CEO Alan Mulally.

[Figure: semantic graph of the sentence. Entities: Ford Motor Company, Jaguar Cars, Land Rover, Alan Mulally, Chief Executive Officer, Jaguar (animal). Edge labels: sells (x2), parent_company (x2), has_CEO, CEO_of, is_a, named_after.]


Thematic graph selection (1)

• Removal of specific types of entities (optional)
  – Specific for news documents
  – What? Who?
    • Content of the news
  – Where? When?
    • Date, time and place
    • Entities that may become hotspots in the created document graph


Thematic graph selection (2)

• Entity weight propagation
  – Each entity is assigned an initial match weight
  – Entities are connected by relationships with a given importance factor
  – Propagate weight using the HITS (1) algorithm to find the best hub and authority entities
  – The best authoritative entities are the most important for document categorization: they form the core of the graph
  – Calculate centrality to find entities that are “topic landmarks”

Centrality(v_i) = \frac{1}{\sum_{j} d(v_i, v_j)}

(1) Kleinberg, J.M., Authoritative Sources in a Hyperlinked Environment. in ACM-SIAM Symposium on Discrete Algorithms, (1998).
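Weight propagation and centrality can be sketched as below. This is a generic weighted HITS iteration plus a breadth-first sum of shortest-path distances, not OmniCat's exact procedure: seeding the scores with the match weights, multiplying by the importance factor, the iteration count, and the sample graph are all assumptions made for illustration.

```python
import math
from collections import deque


def _normalize(scores):
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def hits(edges, initial_weight, iterations=20):
    """Weighted HITS. edges: (src, dst) -> importance; initial_weight: node -> weight."""
    hub = _normalize(initial_weight)
    auth = _normalize(initial_weight)
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in initial_weight}
        for (src, dst), importance in edges.items():
            new_auth[dst] += importance * hub[src]
        auth = _normalize(new_auth)
        new_hub = {n: 0.0 for n in initial_weight}
        for (src, dst), importance in edges.items():
            new_hub[src] += importance * auth[dst]
        hub = _normalize(new_hub)
    return hub, auth


def distance_sum(edges, node):
    """Sum of shortest-path distances from node to every reachable entity (undirected)."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    seen = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[current] + 1
                queue.append(neighbor)
    return sum(seen.values())


# Hypothetical graph: importance factors and match weights are made up.
edges = {("ford", "jaguar_cars"): 1.0, ("ford", "land_rover"): 1.0, ("mulally", "ford"): 0.5}
weights = {"ford": 1.0, "jaguar_cars": 1.0, "land_rover": 1.0, "mulally": 1.0}
hub, auth = hits(edges, weights)
```

A smaller distance sum means a more central entity; its reciprocal gives a closeness-style centrality score.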


Thematic graph selection (3)

• Selection of the dominant thematic graph for categorization
  – Select the connected component that is largest and has maximum weight for further categorization
  – Based on the assumption that entities associated with the same or related topics are interconnected in the ontology
  – Effectively disambiguates many incorrectly matched entities
  – Focuses on one or a few major topics of a document
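Selecting the dominant thematic graph can be sketched as a connected-component search followed by a ranking. The ranking used here (component size first, total entity weight as tie-breaker) is one possible reading of "largest and has maximum weight" and is an assumption, as are the sample weights and edges.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets of nodes."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def dominant_component(weights, edges):
    """Pick the component ranked by (size, total weight); the ranking is an assumption."""
    components = connected_components(list(weights), edges)
    return max(components, key=lambda c: (len(c), sum(weights[n] for n in c)))


# Hypothetical weights and ontology links.
weights = {
    "Ford Motor Company": 1.45, "Jaguar Cars": 1.0, "Land Rover": 1.0,
    "Alan Mulally": 0.9, "Jaguar (animal)": 0.5, "Felines": 0.4,
}
edges = [
    ("Ford Motor Company", "Jaguar Cars"),
    ("Ford Motor Company", "Land Rover"),
    ("Ford Motor Company", "Alan Mulally"),
    ("Jaguar (animal)", "Felines"),
]
core = dominant_component(weights, edges)
```

In this example the incorrectly matched "Jaguar (animal)" ends up in a small separate component and is left out, which is the disambiguation effect noted above.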


Thematic graph examples

[Figure: thematic graphs over the entities Ford Motor Company, Jaguar Cars, Land Rover, Alan Mulally, Chief Executive Officer, Jaguar (animal), Sales, Business, Buyer, Newspaper, Announcement, News]


Thematic graph categorization

• Categorization concentrates on the selected dominant thematic graph
• Proposed methods
  – Bottom-up category discovery
    • Class-category mapping
  – Top-down category projection
    • Categorization based on context projection
    • Combination of categorization contexts for complex categories


Bottom-up categorization (1)

• Category discovery approach
  – No category definitions are needed, only the taxonomy from the ontology
  – Bottom-up approach: discover categories based on the classification of entities
  – The best category should
    • Cover the largest portion of entities in the thematic graph
    • Be the most direct class possible for the entities
    • Include entities from the core of the graph

[Equation: category score s(C_i), combining entity weights w and class heights h(C, e) over the entities of the thematic graph]


Bottom-up class discovery


Bottom-up categorization (2)

• External categories are given as a set of classes
  – In the case of Wikipedia and external corpora, categories are defined as a mapping of appropriate Wikipedia categories
• Previously discovered categories are matched with category definitions
  – Top-k are considered for matching
  – Matching continues until one category becomes dominant


Entities and categories

[Figure: entities linked to their categories. Entities: Ford Motor Company, Jaguar Cars, Land Rover, Alan Mulally, Chief Executive Officer, Jaguar (animal). Categories: Ford, Car Manufacturers, Jaguar, Ford people, Ford executives, Living people, Felines, Panthera, Pantherinae, Off-road vehicles.]


Example

Ford, utility ready to work on plug-in car

Automaker, Southern California Edison to unveil alliance in response to demand for energy-efficient vehicles.

DETROIT (Reuters) -- Ford Motor Co. and power utility Southern California Edison will announce an unusual alliance Monday aimed at clearing the way for a new generation of rechargeable electric cars, the companies said.

Ford (Charts , Fortune 500) Chief Executive Alan Mulally and Edison International (Charts , Fortune 500) Chief Executive John Bryson are scheduled to meet with reporters at Edison's headquarters in Rosemead, Calif., the companies said.

[...]

Led by Toyota Motor Corp's (Charts) Prius, the current generation of hybrid vehicles uses batteries to power the vehicle at low speeds and in to provide assistance during stop-and-go traffic and hard acceleration, delivering higher fuel economy.

General Motors Corp. (Charts , Fortune 500) has already begun work this year to develop its own plug-in hybrid car, designed to use little or no gasoline over short distances. The company showed off a concept version of the Chevrolet Volt in January at the Detroit Auto show and has awarded contracts to two battery makers to research advanced batteries for a possible production version.


Example: graph properties

• Initial number of vertices: 205
• Initial number of edges: 361
• Largest component: 95
• Component for analysis: 35
• Central and most important entities:
  – Hybrid_vehicle: * centrality 208, * weight 1.516873
  – Automobile: * centrality 213, weight 1.249790
  – Internal_combustion_engine: * centrality 233, weight 1.069511
  – Ford_Motor_Company: centrality 237, * weight 1.451533
  – Southern_California_Edison: centrality 351, * weight 1.308824


Example: assigned categories

• Category:Automobiles
  – CAT instances <13>, (avg. height 2.384615) weight [0.874697]
• Category:Alternative_propulsion
  – CAT instances <4>, (avg. height 1.250000) weight [0.873287]
• Category:Car_manufacturers
  – instances <3> (avg. height 1.000000) weight [0.781271]
• Category:Vehicles
  – CAT instances <13>, (avg. height 2.923077) weight [0.647903]
• Category:Transportation
  – CAT instances <11>, (avg. height 3.090909) weight [0.629714]

Top-down approach

• Need externally defined categories
  – Categories are given as classification contexts
  – A category can be defined as a combination of contexts
• Categorization process
  – Each context is projected onto the thematic graph
  – A fitness score is calculated for each context
  – When a category is defined as a linear combination of contexts, cosine similarity over the fitness scores is calculated

Categorization context

• Simplify the definition of categories by using classes and projection.
• Better capture user interest in categories by specifying the preferred type of entities.
• Define union, intersection, and difference of contexts for flexible context definition.
• Enable combinations of contexts for defining more complex categories.
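The set operations on contexts can be illustrated with a small sketch. It assumes that a context is represented by its projection, a dictionary mapping each covered entity to a distance value. The rules for combining distances (smaller value for union, larger value for intersection) are assumptions made for this illustration, not definitions from the presentation, and the entity names and distances are hypothetical:

```python
def union(p, q):
    """Entities covered by either context; keep the smaller distance (assumed rule)."""
    merged = dict(p)
    for entity, distance in q.items():
        merged[entity] = min(distance, merged.get(entity, distance))
    return merged


def intersection(p, q):
    """Entities covered by both contexts; keep the larger distance (assumed rule)."""
    return {e: max(d, q[e]) for e, d in p.items() if e in q}


def difference(p, q):
    """Entities covered by the first context but not by the second."""
    return {e: d for e, d in p.items() if e not in q}


# Hypothetical projections of two contexts.
business = {"Ford_Motor_Company": 1, "Alan_Mulally": 2}
person = {"Alan_Mulally": 1, "John_Bryson": 1}

business_people = intersection(business, person)
```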

Hierarchical distance and projection

• Distance between entity and class – number of rdf:type and rdfs:subClassOf properties
• Distance between entity and set of classes – minimum distance to all classes in the set
• Entity is not covered by a class (or any class in the set) – distance is zero
• Projection of context on instance base – instances with assigned hierarchical distance
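The distance rules above can be sketched as a breadth-first search over rdf:type and rdfs:subClassOf links. This is a minimal illustration, not the OmniCat implementation: the two dictionaries stand in for an RDF store, the function name is hypothetical, and the toy hierarchy deliberately contains a cycle because the category graph used later in the talk is not a taxonomy.

```python
from collections import deque


def hierarchical_distance(entity, target_classes, type_of, subclass_of):
    """Fewest rdf:type / rdfs:subClassOf hops from `entity` to any target class.

    Returns 0 when the entity is not covered by any target class.
    """
    targets = set(target_classes)
    # Classes reached through rdf:type are one hop away from the entity.
    frontier = deque((cls, 1) for cls in type_of.get(entity, ()))
    visited = set()
    while frontier:
        cls, distance = frontier.popleft()
        if cls in visited:
            continue  # guards against cycles and repeated paths
        visited.add(cls)
        if cls in targets:
            return distance
        for parent in subclass_of.get(cls, ()):
            if parent not in visited:
                frontier.append((parent, distance + 1))
    return 0


# Toy data (hypothetical); "Vehicles" and "Transportation" form a cycle.
type_of = {"Toyota_Prius": ["Hybrid_vehicles"]}
subclass_of = {
    "Hybrid_vehicles": ["Automobiles", "Alternative_propulsion"],
    "Automobiles": ["Vehicles"],
    "Vehicles": ["Transportation"],
    "Transportation": ["Vehicles"],
}
```

Because the search expands classes in order of distance, the first target class it reaches gives the minimum over the whole set.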

Categorization into contexts

• Fitness score for context

$$fs(C, T) = \sum_{k} w_k \cdot h(dist_H(e_k, C)) + \sum_{n} w_{cn} \cdot h(dist_H(e_{cn}, C))$$

• Hierarchical distance weighting function to emphasize the weight of the nearest classes

$$h(dist_H(e, C)) = N_{(1,2)}(dist_H(e, C))$$
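A small numeric sketch may help to see how the weighting function favours entities close to the context classes. It is simplified to a single weighted sum over entities and rests on two assumptions that the slide does not confirm: N(1,2) is read as a normal density with mean 1 and standard deviation 2 (2 could instead be the variance), and entities with distance zero (not covered) are skipped. The entity weights are taken from the graph example earlier in the talk; the distances are hypothetical.

```python
import math


def normal_pdf(x, mean=1.0, std=2.0):
    """Normal density; mean=1 and std=2 are an assumed reading of N(1,2)."""
    coefficient = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


def fitness_score(entity_weights, distances):
    """Sum of w_k * h(dist_H(e_k, C)) over entities covered by the context."""
    score = 0.0
    for entity, weight in entity_weights.items():
        distance = distances.get(entity, 0)
        if distance == 0:
            continue  # assumption: distance zero means "not covered"
        score += weight * normal_pdf(distance)
    return score


weights = {"Hybrid_vehicle": 1.516873, "Ford_Motor_Company": 1.451533}
distances = {"Hybrid_vehicle": 1, "Ford_Motor_Company": 3}  # hypothetical
score = fitness_score(weights, distances)
```

With these parameters the density peaks at distance 1, so an entity that is a direct instance of a context class contributes the most.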

Categorization context example

[Figure: contexts Business and Person, and contexts composed from them]

Complex categories - composition of contexts

[Figure: compositions of contexts b and s; "b combined with s" is a linear combination of contexts]

Top-down categorization

• For each defined categorization context, calculate a fitness score using context projection onto the instance base
  – If there are only “simple” contexts, fitness scores can be compared directly to choose the category
  – Otherwise, create a vector space from the calculated fitness scores and calculate the (cosine) similarity between the category definition and the context vector
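The second case can be sketched as follows, assuming each category is defined by a coefficient vector over the same ordered list of contexts as the fitness-score vector. The category names, coefficients, and scores are hypothetical and only illustrate the comparison:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_category(category_definitions, fitness_vector):
    """Pick the category whose definition is most similar to the fitness vector."""
    return max(
        category_definitions,
        key=lambda name: cosine_similarity(category_definitions[name], fitness_vector),
    )


# Context order (hypothetical): [business, sports, politics]
categories = {
    "Business": [1.0, 0.0, 0.0],
    "Sports business": [0.5, 0.5, 0.0],
}
document_fitness = [0.6, 0.5, 0.1]
chosen = best_category(categories, document_fitness)
```

Cosine similarity compares the direction of the vectors only, so a long document with uniformly higher fitness scores is not favoured over a short one.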

Top-down classification

Experiments (1)

• Classic text categorization algorithms
  – BOW statistical classifier (1)
  – SVM implemented in Weka (2)
• Text corpora
  – CNN (2007-07-03 – 2007-09-04)
    • 2,590 news documents in 12 categories
  – Reuters RCV1 (1996-08-20 – 1996-09-02)
    • 2,254 documents in 6 categories
• Mapping for Wikipedia categories
  – Created manually by mapping top Wikipedia categories to corpora categories

(1) McCallum, A.K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
(2) Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann, San Francisco (2005)

Experiments (2)

• Wikipedia ontology
  – Includes around 2,000,000 entries
    • Multiple entity names (variations for matching)
  – Has rich instance base (articles)
  – Internal href, templates and “infobox” relations carry semantic connections among entries
  – Has large schema with categories – over 310,000 categories
    • They DO NOT form a taxonomy, just a graph (it even includes cycles)

Experiments (3)

• Wikipedia 2 RDF
  – Created initially by dbpedia.org (1)
  – Creation of RDF – some modifications
    • Focus on href, infoboxes and templates
      – Special relationships for entities in infoboxes and templates
  – Only English version of Wikipedia
    • Entity name variations for matching
      – Name, short name (no brackets), redirect, disambiguation, alternate names

(1) Auer, S. and Lehmann, J., What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. in European Semantic Web Conference (ESWC'07), (Innsbruck, Austria, 2007), Springer, 503-517.

Wikipedia categories

• Wikipedia categories DO NOT form a taxonomy
  – It is just a directed graph that contains cycles.
  – Not possible to use subsumption for categories.
  – Thesaurus-like structure (1).
• Categories may be very deep and detailed, or very broad
  – Hard to pinpoint the cut-off point good for categorization.
  – There is no simple mapping between news categories and categories in Wikipedia.

(1) Voss, J. Collaborative thesaurus tagging the Wikipedia way. ArXiv Computer Science e-prints, cs/0604036.
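Whether a category graph is a taxonomy or contains cycles can be checked with a depth-first search. The sketch below is a generic cycle test on a directed graph stored as an adjacency dictionary; the function name and the toy category links are hypothetical and do not come from the Wikipedia data used in the experiments.

```python
def has_cycle(graph):
    """Return True if the directed graph (node -> list of successors) has a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {}
    for root in graph:
        if colour.get(root, WHITE) != WHITE:
            continue
        colour[root] = GREY
        # Explicit stack avoids recursion limits on deep category chains.
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                state = colour.get(child, WHITE)
                if state == GREY:
                    return True  # back edge to a node on the current path
                if state == WHITE:
                    colour[child] = GREY
                    stack.append((child, iter(graph.get(child, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                colour[node] = BLACK
                stack.pop()
    return False


# Toy category links (hypothetical).
cyclic = {"Vehicles": ["Transportation"], "Transportation": ["Vehicles"]}
acyclic = {"Automobiles": ["Vehicles", "Transportation"], "Vehicles": ["Transportation"], "Transportation": []}
```

A graph that passes this test can be treated as a hierarchy; one that fails it needs cycle-safe traversal, such as keeping a visited set while walking up the category links.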

Text corpora information

Text corpora – CNN mapping

Text corpora – Reuters mapping

Bottom-up categorization - OmniCat

OmniCat results using Wikipedia-CNN category mapping

Bottom-up categorization – BOW

BOW results on CNN corpora using Wikipedia training

Bottom-up categorization – BOW (2)

BOW results on Wikipedia corpora using Wikipedia training

Bottom-up categorization - Reuters

Comparison of BOW, SVM and OmniCat (bottom-up approach) on selected Reuters corpora

Top-down categorization - OmniCat

OmniCat results on CNN corpora using top-down approach with categorization context projection

OmniCat categorization – CNN

Comparison of CNN corpora categorization results of BOW, SVM, OmniCat bottom-up (Onto), and OmniCat top-down (OmniCat)

OmniCat categorization – Reuters

Comparison of Reuters corpora categorization results of BOW, SVM, OmniCat bottom-up (Onto), and OmniCat top-down (OmniCat)

Misclassifications - text corpora and Wikipedia

• Original text corpora categories
  – Classified by people
  – Describe mostly article interest, not necessarily its content
    • Frequently describe the reader’s interest rather than the true subject.
  – Hard to match to Wikipedia categories
• Wikipedia categories
  – Content-based
  – Very detailed and deep
  – Some regions in ontology are better developed

Summary of work

• Ontology storage and querying
  – Brahms RDF/S storage
  – Sparqler – query language extension with path queries
    • For use in Glycomics project
• OmniCat - Ontology-based categorization
  – Methodology for ontology-based categorization
  – Proposed two schemes of categorization
  – Defined categorization context, combination of contexts for categorization
  – Implemented OmniCat prototype
  – Experiments using general-purpose ontology – RDF/S graph created from the English Wikipedia
  – Published at ESAIR’08 and ICSC’08, submitted to ISWC’08

Proposed work

• Experiment with other ontologies and taxonomies for categorization
  – Use categories extracted from Freebase or Dmoz
  – Categorize medical publications to MeSH using Wikipedia references
• Approach to categorization
  – Include definitions of interesting structures (e.g. specific semantic associations) into categorization context
  – Utilize context information in calculating and selecting the document core entities
  – Use other similarity metrics for calculating thematic graph and ontology similarity
• OmniCat beyond text categorization
  – Study applicability of OmniCat approach for categorizing ontologies with other (gold standard) ontologies
  – Document summarization using semantic graph (towards proposition presented in [1])

(1) Leskovec, J., M. Grobelnik, et al. (2004). Learning Semantic Graph Mapping for Document Summarization. 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Pisa, Italy.

Published papers

• Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery", Fourth International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005

• Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, 3-7 June 2007

• Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibanez, Budak Arpinar, Amit Sheth. "Peer-to-Peer Discovery of Semantic Associations", Second International Workshop on Peer-to-Peer Knowledge Management, San Diego, CA, July 17, 2005

• Maciej Janik, Krys Kochut. "Wikipedia in Action: Ontological Knowledge in Text Categorization", Second IEEE International Conference on Semantic Computing, ICSC 2008, Santa Clara, CA, USA, August 2008 [to appear]

• Maciej Janik, Krys Kochut. "Training-less Ontology-based Text Categorization", Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR 2008) at the 30th European Conference on Information Retrieval (ECIR'08), Glasgow, Scotland, 30 March 2008

• Matthew Eavenson, Maciej Janik, Shravya Nimmagadda, John A. Miller, Krys J. Kochut, William S. York. "GlycoBrowser - A Tool for Contextual Visualization of Biological Data and Pathways Using Ontologies", 4th International Symposium on Bioinformatics Research and Applications (ISBRA2008), Atlanta, Georgia (May 2008)

• S. Nimmagadda, A. Basu, M. Eavenson, J. Han, M. Janik, R. Narra, K. Nimmagadda, A. Sharma, K.J. Kochut, J.A. Miller and W. S. York, "GlycoVault: A Bioinformatics Infrastructure for Glycan Pathway Visualization, Analysis and Modeling," Proceedings of the 5th International Conference on Information Technology: New Generations (ITNG'08), Las Vegas, Nevada (April 2008)