Improving the Machine Interpretation of Internet Posts

Università degli Studi di Pavia

CVMLab

Giacomo Parigi

14.02.14

Extraction of a lightweight, domain independent semantic network from the Wikipedia categorization system

Part 2


Objectives

1. Overview of the topic

2. Identification of the specific set of problems

3. Explanation of the reasons for the chosen approach

4. Illustration of the solutions implemented (or designed)


Contents

1. Knowledge Representations & NLP

2. Wikipedia

3. Knowledge Extraction Example

4. Implementation

5. Conclusions

Part 2


1. Knowledge Representations I

● How is it made?

– A Knowledge Base

The machine-readable codification of the information

– A set of Inference Rules

The reasoning methods, determined by the purpose

“A subarea of Artificial Intelligence concerned with understanding, designing, and implementing ways of representing information in computers so that programs (agents) can use this information

– to derive information that is implied by it,

– to converse with people in natural languages,

– to decide what to do next,

– to plan future activities,

– to solve problems in areas that normally require human expertise.”

(Stuart C. Shapiro)


1. Knowledge Representations II

DiCaprio nominated for an Oscar!

● What kind of inference rules do we need?

– Word Sense Disambiguation

Disambiguate the meaning of words in context in a computational manner

AI-Complete problem

– Named Entity Recognition & Classification

Identify atomic units in text and classify them into predefined categories

...Spoiler:

– Neither of them is enough!

● Purpose of our system

– Identify context and entities in user-generated contents (posts, forums...)


1. Knowledge Representations III: WSD & NERC Methods

● Knowledge Bases

– Structured

Thesauri, machine-readable dictionaries, taxonomies and ontologies

– Unstructured

Raw or sense-annotated corpora, lists, other...

● Methods

– Supervised and semi-supervised

Hand-crafted rules or bootstrapping methods

Suffer from the Knowledge Acquisition Bottleneck

– Unsupervised

Avoid the bottleneck, but provide only word clustering

– Knowledge based

Wide semantic knowledge for context extraction, word statistics


1. KR IV: WSD & NERC, What about our purpose?

DiCaprio nominated for an Oscar!

● Word Sense Disambiguation


1. KR IV: WSD & NERC, What about our purpose?

DiCaprio nominated for an Oscar!

● Named Entity Recognition & Classification


1. Knowledge Representations II

DiCaprio nominated for an Oscar!

● What does it imply?

– Reduced and human-readable information

– Little or no context from nearby words

– Extremely wide possible domain

– Typing errors and carelessness

– Omitted information about shared or collective background

● Purpose of our system

– Identify context and entities in user-generated contents (posts, forums...)


1. Knowledge Representations II

DiCaprio nominated for an Oscar!

● Proposed solution

– Reduced and human-readable information, little or no context from nearby words, extremely wide possible domain → Knowledge-based system

– Typing errors and carelessness → Mixed NERC and WSD methods

– Omitted information about shared or collective background → Integration with a priori information (e.g. word frequency)

● Purpose of our system

– Identify context and entities in user-generated contents (posts, forums...)
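
As an illustration of how typing errors and a priori information might be combined, here is a minimal sketch. It is not the system described in these slides: the entry names, the frequency values, the `min_sim` threshold and the scoring formula (string similarity multiplied by prior frequency) are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical a priori frequencies of knowledge-base entries (made-up numbers).
PRIOR = {
    "Leonardo DiCaprio": 0.9,
    "Academy Awards": 0.8,
    "Oscar Wilde": 0.3,
    "Oscar (fish)": 0.05,
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_entries(token, prior=PRIOR, min_sim=0.5, top=3):
    """Rank entries whose name contains a word close to `token`."""
    scored = []
    for name, freq in prior.items():
        # Compare the token with every word of the entry name to tolerate typos.
        sim = max(similarity(token, word) for word in name.split())
        if sim >= min_sim:
            # Assumed scoring rule: weight the similarity by the prior frequency.
            scored.append((sim * freq, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top]]


print(best_entries("DiCarpio"))  # misspelled surname still finds the entry
```

Lexical matching alone cannot link "Oscar" to the Academy Awards, which is why the slide pairs it with knowledge-based context.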


2. Wikipedia I: Why Wikipedia?

● Main competitors

– WordNet: hand-crafted lexical database, no named entities

– ResearchCyc: hand-crafted ontology, multi-domain breadth, out of date

– Wikipedia: crowd-crafted database, domain-independent, multilingual, captures “common sense”


2. Wikipedia II: The Categorization System

● Guidelines

– Group similar articles

– Balance category breadth

Number of sub-categories/sub-pages with respect to the hierarchical level

– Avoid cycles

– Every article should belong to at least one category

– Article inclusions should be based only on defining characteristics

● Folk taxonomy (or Folksonomy)

– Not a strict and logically grounded ontology

– Inconsistencies

– Loose definition of relationships

– Reflects our intuitions about classification and organization


2. Wikipedia III: The Category Tree Organization

● Many overlapping trees (at each hierarchical level)

[Figure: overlapping category trees. Upper categories: Arts, Geography. Lower categories: Cinema of the United States, Italian literature. Pages: Never Say Goodbye (1956 film), As You Desire Me (film), Dante's Inferno (1924 film)]
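
The overlap can be pictured as a graph in which a node may have several parents. The sketch below assumes a child-to-parents dictionary; the specific links are hypothetical and only loosely based on the labels in the figure, since the slide text does not state which node is attached to which.

```python
from collections import deque

# Hypothetical child -> parents links (assumed for illustration only).
PARENTS = {
    "Cinema of the United States": ["Arts", "Geography"],
    "Italian literature": ["Arts", "Geography"],
    "Dante's Inferno (1924 film)": [
        "Cinema of the United States",
        "Italian literature",
    ],
}


def ancestors(node, parents=PARENTS):
    """Return every category reachable by following parent links upward."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# With these assumed links, a single page sits under two trees at once.
print(sorted(ancestors("Dante's Inferno (1924 film)")))
```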


2. Wikipedia IV: Subcategorization

● Two main kinds

Topic (Opera) and Set (Operas) categories

● Diffuse large categories

Albums → Albums by artist → Artistname albums

● Non-diffusing categories

Film actors → Best Actor Academy Awards winners

● Eponymous categories

France/cat → France/article

Systematic Error

[Figure: categories shown for the systematic error: France, Populated places in France, Cities in France, Strasbourg, Council of Europe, Members of the Council of Europe]
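
One way such a systematic error can surface is as a cycle, which the guidelines ask editors to avoid. The sketch below is a plain depth-first cycle search. The edge list is an assumption: it supposes that each category in the figure contains the next one and that the last contains France again, which the slide text does not state explicitly.

```python
def find_cycle(children, start):
    """Depth-first search; return a list of nodes forming a cycle, or None."""
    path, on_path, done = [], set(), set()

    def visit(node):
        path.append(node)
        on_path.add(node)
        for child in children.get(node, []):
            if child in on_path:
                # Found a link back to a node on the current path.
                return path[path.index(child):] + [child]
            if child not in done:
                found = visit(child)
                if found:
                    return found
        on_path.discard(node)
        done.add(node)
        path.pop()
        return None

    return visit(start)


# Assumed parent -> children links, following the order of the figure labels.
CHILDREN = {
    "France": ["Populated places in France"],
    "Populated places in France": ["Cities in France"],
    "Cities in France": ["Strasbourg"],
    "Strasbourg": ["Council of Europe"],
    "Council of Europe": ["Members of the Council of Europe"],
    "Members of the Council of Europe": ["France"],
}

print(find_cycle(CHILDREN, "France"))
```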


2. Wikipedia V: Knowledge Extraction Methods

● Natural Language Processing (NLP) methods

– Based on category and page names

Part-Of-Speech patterns, word matching

– Build a new graph from scratch (usually)

Classes and instances are made by copying or splitting Wikipedia categories

Links are made anew from the relations found

● Connectivity-based methods

– Based on properties and “habits” of Wikipedia categorization

Instance and redundant categorization

– Propagate relations found to sub-categories and sub-pages


3. Knowledge Extraction Example I: NLP & Connectivity

● Application of hand-crafted rules to category names

1) Explicit relation categories

“Members_of...”, “Presidents_of...”

[VBN IN] patterns: “...directed_by...”, “...located_in...”

2) Partly explicit relation categories

Prepositions: “Villages_in_Brandenburg”, “Conflicts_in_2000”

Need super-categories to identify the relation (“Geography”/“Years”)

3) Implicit relation categories

“Mixed martial arts television programs”

4) “Class attribute” or “Diffusing” categories

“X_by_Y” patterns

Grouping of instances of X by attribute Y

● Relations found propagate to sub-categories and sub-pages

– Problem: “Extinct_cephalopods” is a subcategory of “Fashion”!
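
The four rule families above can be sketched as a small classifier over category names. This is only an illustration under stated assumptions: a real system would use a Part-Of-Speech tagger for the [VBN IN] pattern, whereas here a tiny hand-written participle list stands in for it, and the word lists, the function name and the returned labels are all hypothetical.

```python
# Hypothetical word lists standing in for a real Part-Of-Speech tagger.
PAST_PARTICIPLES = {"directed", "located", "written", "produced"}  # "VBN"
PREPOSITIONS = {"in", "of", "from", "by", "at", "for"}             # "IN"
EXPLICIT_HEADS = {"members", "presidents"}


def classify_category(name: str) -> str:
    """Assign a category name to one of the four rule families."""
    words = name.lower().replace(" ", "_").split("_")
    # 1) Explicit: "Members_of..." or a [VBN IN] pair such as "directed_by".
    if len(words) > 1 and words[0] in EXPLICIT_HEADS and words[1] == "of":
        return "explicit"
    for first, second in zip(words, words[1:]):
        if first in PAST_PARTICIPLES and second in PREPOSITIONS:
            return "explicit"
    inner = words[1:-1]
    # 4) Class attribute: "X_by_Y" groups instances of X by attribute Y.
    if "by" in inner:
        return "class_attribute"
    # 2) Partly explicit: a bare preposition; the relation itself would
    #    still have to be identified from the super-categories.
    if any(word in PREPOSITIONS for word in inner):
        return "partly_explicit"
    # 3) Implicit: no lexical cue at all.
    return "implicit"


for example in (
    "Films_directed_by_Steven_Spielberg",
    "Villages_in_Brandenburg",
    "Albums_by_artist",
    "Mixed martial arts television programs",
):
    print(example, "->", classify_category(example))
```

A label obtained this way is best treated as a trigger for further checks rather than as a final decision, since purely lexical patterns misfire on names that merely look like a rule.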


3. Knowledge Extraction Example II: Limitations

[Figure: three category chains, each followed by the triple extracted from it]

Films by director nationality → Films by Italian directors → Films directed by Lucio Fulci → 002 Operazione Luna

“002_Operazione_Luna”, IS_A, Director

Cinema by country → Cinema of Italy → Italian films → 002 Operazione Luna

“002_Operazione_Luna”, IS_A, Cinema

Cities by country → Cities and Towns in Italy → Categories by city in Italy → People by city or town in Italy → Italian people by occupation by city → People from Rome by occupation → Actors from Rome → Lucio Fulci

“Lucio_Fulci”, IS_PART_OF, Cities


4. Implementation I: Key concepts

1) Atomic entities and meaningful relations

“Films directed by Steven Spielberg”

Films → [directed by] → Steven Spielberg

2) Keep only the most specific links

Remove from “Footwear” the pages and categories shared with “Shoes”

3) Human-made links are good (unless proven otherwise)

Use a pruning strategy instead of a build-from-scratch one

4) Human-made chains of links are (usually) bad

Trust connectivity only for a few levels

5) Don't impose strict rules automatically

such as “X_of_the_Y”: band names using that pattern do exist!

Cases must be distinguished with mixed methods

6) KISS (Keep It Simple, Stupid)
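
Key concept 2 can be sketched as a pruning pass over the existing links, in line with the pruning strategy of concept 3. The data layout (`subcats` mapping a category to its sub-categories, `members` mapping a category to its pages) and the sample page names are assumptions for the example; the sketch handles shared pages only, not shared sub-categories.

```python
def descendants(category, subcats):
    """All categories below `category` (excluding the category itself)."""
    found, stack = set(), [category]
    while stack:
        for child in subcats.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    found.discard(category)  # guards against cycles leading back to the start
    return found


def keep_most_specific(members, subcats):
    """Drop a page from a category when a descendant category also holds it."""
    pruned = {}
    for category, pages in members.items():
        covered = set()
        for lower in descendants(category, subcats):
            covered |= set(members.get(lower, ()))
        pruned[category] = set(pages) - covered
    return pruned


# Hypothetical data: "Sneakers" is listed under both Footwear and Shoes.
SUBCATS = {"Footwear": ["Shoes"]}
MEMBERS = {"Footwear": {"Sneakers", "Sock"}, "Shoes": {"Sneakers", "Boot"}}

print(keep_most_specific(MEMBERS, SUBCATS))
```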


4. Implementation II: Methods

● NLP methods

– Give more importance to word matching

– Part-Of-Speech patterns are used mainly as triggers for further checks

● Connectivity methods

– Short range

– Mainly used as constraints for other types of method

● Statistical methods

– Aim to reconstruct a (natural?) hierarchical structure

– Hypothesis: pyramidal structure composed of several overlapping pyramids

– Problem: statistical values change greatly between different topics

– Better applied separately to sub-trees (such as “Clothing” or “Music”)
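
The slides do not say which statistics are computed. As one plausible example only, the sketch below measures the mean number of direct sub-categories per category at each depth of a sub-tree, which is the kind of per-level value that can be compared between categories at the same level. The function name and the sample sub-tree are hypothetical.

```python
from collections import deque
from statistics import mean


def breadth_by_level(subcats, root):
    """Mean number of direct sub-categories per category, for each depth."""
    depth = {root: 0}
    queue = deque([root])
    counts = {}
    while queue:
        node = queue.popleft()
        children = subcats.get(node, [])
        counts.setdefault(depth[node], []).append(len(children))
        for child in children:
            if child not in depth:  # first visit fixes the depth of a node
                depth[child] = depth[node] + 1
                queue.append(child)
    return {level: mean(values) for level, values in sorted(counts.items())}


# Hypothetical sub-tree rooted at "Clothing".
CLOTHING = {
    "Clothing": ["Footwear", "Headgear"],
    "Footwear": ["Shoes", "Boots", "Sandals"],
    "Headgear": ["Hats"],
}

print(breadth_by_level(CLOTHING, "Clothing"))
```

Running the same function on a different sub-tree (for example one rooted at “Music”) would give values on a different scale, which is the reason given above for applying statistics to one homogeneous sub-tree at a time.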


● Step 1: cleanup

a) Wikipedia pages are organized in Namespaces (files, templates...)

Remove pages with Namespaces different from articles or categories

b) Administration categories are directly linked with content one

Identify and remove Administration categories: by connectivity...

(linked to “Wikipedia Administration”)

...and by Natural Language Processing (names)

(wikipedia, wikiprojects, lists, mediawiki)

c) Stubs are managed with less care than full articles

Remove all: they generate more noise than content

d) Finally, remove categories left empty by the previous steps

This step is also repeated during the rest of the process
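The cleanup step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the namespace labels, the `is_stub` flag and the reading of "linked to" as "direct sub-category of" are hypothetical and are not taken from the prototype.

```python
import re
from dataclasses import dataclass, field

# Hypothetical namespace labels: only articles and categories are kept (step a).
KEPT_NAMESPACES = {"article", "category"}
# Name tokens that mark administration categories (step b, by name).
ADMIN_TOKENS = {"wikipedia", "wikiproject", "wikiprojects", "list", "lists", "mediawiki"}


@dataclass
class Node:
    name: str
    namespace: str
    is_stub: bool = False                        # assumed flag for stubs
    children: set = field(default_factory=set)   # names of member nodes


def is_admin_name(name):
    """True if any word of the name is an administration token."""
    return any(tok in ADMIN_TOKENS for tok in re.findall(r"[a-z]+", name.lower()))


def cleanup(nodes, admin_root="Wikipedia Administration"):
    """Return a cleaned copy of `nodes` (dict: name -> Node)."""
    # Step b, by connectivity: the admin root and its direct members (assumption).
    linked_to_admin = set()
    if admin_root in nodes:
        linked_to_admin = set(nodes[admin_root].children) | {admin_root}

    kept = {}
    for name, node in nodes.items():
        if node.namespace not in KEPT_NAMESPACES:          # a) other namespaces
            continue
        if node.namespace == "category" and (
            name in linked_to_admin or is_admin_name(name)
        ):                                                 # b) administration
            continue
        if node.is_stub:                                   # c) stubs
            continue
        kept[name] = Node(name, node.namespace, node.is_stub, set(node.children))

    # d) remove categories left empty, repeating until nothing changes.
    changed = True
    while changed:
        changed = False
        for name in list(kept):
            node = kept[name]
            node.children = {c for c in node.children if c in kept}
            if node.namespace == "category" and not node.children:
                del kept[name]
                changed = True
    return kept
```

Matching whole words rather than substrings avoids removing a category such as "Novelists" because it happens to contain "list".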

● Step 2: Choose a sufficiently homogeneous sub-tree

– Very different topics necessarily have different statistics

Even close categories like “Baseball” and “Fencing”

● Step 3: Apply combined methods of the three kinds in a breadth-first fashion

– Most significant statistics are between categories at the same level

● Step 4: Modification check after visiting each level

– If the tree has been significantly modified: restart from the tree root

Removal of empty categories doesn't affect statistical values too much

Almost any other change does

– Otherwise: proceed to the next level and resume from Step 3

● Repeat until the last level of the tree is reached

The modification check on the last level has proven unnecessary
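The control flow of Steps 3 and 4 can be sketched as follows. The function names and the `apply_methods` callback (which stands in for the combined NLP, connectivity and statistical methods and reports whether the tree was significantly modified) are hypothetical. For simplicity, the sketch also runs the check on the last level.

```python
def levels(children, root):
    """Yield the categories of each depth level, breadth-first."""
    seen, frontier = {root}, [root]
    while frontier:
        yield frontier
        next_frontier = []
        for node in frontier:
            for child in children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier


def refine(children, root, apply_methods, max_restarts=100):
    """Visit the tree level by level; restart from the root after a significant change.

    apply_methods(children, level_nodes) may edit `children` in place and
    returns True when the tree has been significantly modified.
    """
    for _ in range(max_restarts):
        restarted = False
        for level_nodes in levels(children, root):
            if apply_methods(children, level_nodes):   # Step 4: modification check
                restarted = True
                break                                  # restart from the tree root
        if not restarted:
            return children                            # last level reached
    raise RuntimeError("no stable tree after max_restarts passes")
```

The `max_restarts` bound is an added safety net, so that a method which keeps modifying the tree cannot loop forever.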

4. Implementation III: Process

● Two different networks for two different purposes:

1) Light and fast network for “on-line” context identification

Unlabeled links, implied meaning: “has something to do with”

“Films directed by Steven Spielberg”, “Films produced by Steven Spielberg”

Films → [have something to do with] → Steven Spielberg

2) Complete semantic network for more complex tasks

Labeled links

Correctness is much more critical and hard to achieve

4. Implementation IV: the Prototype(s)

● Measures used

Total number of sub-nodes (both categories and pages)

Eccentricity: distance from the farthest leaf

Tangledness: number of sub-nodes shared with sibling categories

● Word matching helps identify where the unrelated branch stems from

● Example: “Military uniforms”

Total sub-nodes = 11803; Level average ≈ 550

Eccentricity = 11 ; Level average = 4

Tangledness = 1.3% ; Level average > 70%
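The three measures can be computed on a toy graph as in the sketch below. Two choices are assumptions: eccentricity is taken as the shortest-path distance to the farthest descendant, and tangledness as the fraction of a category's sub-nodes that are also sub-nodes of at least one sibling. The function names and the toy data are illustrative only.

```python
from collections import deque


def descendants_with_depth(children, node):
    """Breadth-first search: map each descendant to its shortest distance from `node`."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        cur = queue.popleft()
        for child in children.get(cur, ()):
            if child not in dist:
                dist[child] = dist[cur] + 1
                queue.append(child)
    del dist[node]   # the node itself is not one of its sub-nodes
    return dist


def measures(children, node, siblings):
    """Return total sub-nodes, eccentricity and tangledness of `node`."""
    dist = descendants_with_depth(children, node)
    own = set(dist)
    shared = set()
    for sibling in siblings:
        shared |= own & set(descendants_with_depth(children, sibling))
    return {
        "total_subnodes": len(own),
        "eccentricity": max(dist.values(), default=0),
        "tangledness": len(shared) / len(own) if own else 0.0,
    }
```

A category whose values lie far from the averages of its level, as in the example above, is a candidate root of an unrelated branch.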

4. Implementation V: Unrelated Branches Pruning

[Figure: category graph with the nodes “Military uniforms”, “Military camouflage”, “Camouflage patterns” and “Animals that can change color”]

● Measures used

Total sub-nodes, eccentricity, tangledness

● Search is led by statistics and ended by word matching

● Example: “Strasbourg”

Total sub-nodes = 5820478 ; Level average ≈ 5000

Eccentricity = 29 ; Level average ≈ 7

Tangledness = 100%
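Once the statistics point at a suspicious category, the cycle itself has to be located. The sketch below uses a plain depth-first search, which is a generic stand-in and not the statistics-led, word-matching-terminated search described above. The edge directions in the usage note are assumed.

```python
def find_cycle(children, start):
    """Return a list of nodes forming a cycle reachable from `start`, or None."""
    path, on_path, done = [], set(), set()

    def visit(node):
        path.append(node)
        on_path.add(node)
        for child in children.get(node, ()):
            if child in on_path:
                # Back edge found: cut the path at the first occurrence of child.
                return path[path.index(child):] + [child]
            if child not in done:
                found = visit(child)
                if found:
                    return found
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)
```

With the links France → Strasbourg → Council of Europe → Members of the Council of Europe → France, `find_cycle(children, "France")` returns the whole loop, starting and ending at “France”.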

4. Implementation VI: Cycle Detection

[Figure: category graph with the nodes “France”, “Strasbourg”, “Council of Europe” and “Members of the Council of Europe”]

● Push-down specialized categories

– Low importance (total number of sub-nodes)

– Multiple parents

Leave only the lowest (or lower) level parents

● Push-up general categories

– High importance

– Shorter path to common parent

Remove connection with lower-level parent
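The push-down rule can be sketched as dropping every direct parent that is also an ancestor of another direct parent. The sketch assumes cycles have already been removed and omits the importance threshold and the push-up rule. The function names and the toy links in the usage note are assumptions.

```python
def ancestors(parents, node):
    """All categories reachable from `node` by following parent links."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen


def push_down(parents, node):
    """Return the lowest-level direct parents of `node`."""
    direct = set(parents.get(node, ()))
    redundant = set()
    for parent in direct:
        # A direct parent that sits above another direct parent is redundant.
        redundant |= ancestors(parents, parent) & direct
    return direct - redundant
```

For example, if “Eyewear” is linked to both “Clothing” and “Headgear”, and “Headgear” is itself under “Clothing”, only the link to “Headgear” survives.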

4. Implementation VII: Level Adjustment

[Figure: push-down example with the nodes “Headgear”, “Eyewear” and “Clothing”; push-up example with the nodes “Clothing”, “Fashion” and “Culture”]

● Split compound categories through simple word matching

– Perform a word matching between a parent category and each of its sub-categories

Names can be lemmatized (cars → car)

and/or simplified (“Mini_(marque)” → Mini; Mini_marque)

– If the sub-category name contains the parent's name

Mark the sub-category as Compound Category candidate

Mark the parent category as Compound Root for that lemma

– Confirm the category as Compound when each of its components meets one of these conditions:

i. Has a Compound Root

ii. Is recognized as a preposition that is not part of a named entity

iii. Matches the pattern [VBN IN]

– Split the category by extending all its links to its Compound Roots

Compound Roots may themselves be split later
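The matching, confirmation and splitting steps can be sketched as below. Several parts are simplifications and should be read as assumptions:

- Lemmatization is a naive plural stripper.
- Two small hand-written word lists stand in for a Part-Of-Speech tagger.
- The named-entity check of condition ii is omitted.
- "Contains" is read as "every word of the parent name appears in the sub-category name".

```python
PREPOSITIONS = {"by", "of", "in", "from", "on"}   # stands in for the IN tag
PARTICIPLES = {"directed", "produced", "based"}   # stands in for the VBN tag


def lemma(word):
    """Very rough lemmatization: lower-case and strip a plural 's'."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def lemmas(name):
    return [lemma(w) for w in name.split()]


def split_compound(category, parents):
    """Return the Compound Roots of `category`, or None if it is not confirmed."""
    words = lemmas(category)
    # Word matching: a parent whose words all occur in the name is a Compound Root.
    roots = [p for p in parents if set(lemmas(p)) < set(words)]
    covered = {w for p in roots for w in lemmas(p)}
    i = 0
    while i < len(words):
        if words[i] in covered:                          # i. has a Compound Root
            i += 1
        elif (words[i] in PARTICIPLES and i + 1 < len(words)
              and words[i + 1] in PREPOSITIONS):         # iii. pattern [VBN IN]
            i += 2
        elif words[i] in PREPOSITIONS:                   # ii. preposition
            i += 1
        else:
            return None                                  # unexplained component
    return roots or None


def apply_split(members_of, category, roots):
    """Extend the category's member links to its Compound Roots and drop it."""
    members = members_of.pop(category, set())
    for root in roots:
        members_of.setdefault(root, set()).discard(category)
        members_of[root] |= members
```

With “Films” and “Steven Spielberg” as parents, “Films directed by Steven Spielberg” is confirmed: its pages are linked directly to both roots and the compound category disappears.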

4. Implementation VIII: Compound Categories Splitting

[Figure: splitting “Android emulation software”. Before: nodes “Emulation software”, “Android (Operating System) software”, “Android emulation software” and “QEMU”. After: nodes “Emulation software”, “Android (Operating System) software” and “QEMU”.]

[Figure: splitting “Music directors of the Berlin State Opera”. Before: nodes “Berlin State Opera”, “Music directors (opera)”, “Music directors of the Berlin State Opera” and “Herbert von Karajan”. After: nodes “Berlin State Opera”, “Music directors (opera)” and “Herbert von Karajan”.]

[Figure: splitting “Films directed by Steven Spielberg”. Before: nodes “Steven Spielberg”, “Films”, “Films directed by Steven Spielberg” and “Indiana Jones and the Last Crusade”. After: nodes “Steven Spielberg”, “Films” and “Indiana Jones and the Last Crusade”.]

● Under study: explicit meaning relations

– If the main entity of the Compound Category title (the noun phrase head) is plural

→ Set category

→ is_a relation

4. Implementation VIII: Compound Categories Splitting

[Figure: “Indiana Jones and the Last Crusade” → directed_by → “Steven Spielberg”; “Indiana Jones and the Last Crusade” → is_a → “Films”]

● Split compound categories through simple word matching

– Increase precision at the cost of recall

No loss of generality, but a loss in performance

– Still not error-proof (under construction: check on eponymous articles)

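4. Implementation VIII: Compound Categories Splitting

[Figure: error example. Before splitting: nodes “Bass guitars”, “Bass (sound)”, “Guitars” and “Rickenbacker_4001”. After splitting: nodes “Bass (sound)”, “Guitars” and “Rickenbacker_4001”.]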

● “Made by” or Diffusing Category disambiguation

– Absence of Compound Root for attribute Y

– Container Categories connection

4. Implementation IX: X_by_Y Categories

[Figure: category graph with the nodes “Steven Spielberg”, “Works by Steven Spielberg”, “L.A. 2017”, “Creative works”, “Works by creator” and “Container categories”]

● Implicit relation meanings

– Still not a “real” Semantic Network

● Limited domain

– Handmade selection of the sub-trees

– Automatic identification of “good” sub-trees is under study

● Simple disambiguation algorithms have shown good results

– On ad hoc phrases of varying complexity

“With my new Nvidia graphic card, my Dell computer has become legendary!”

“without michael, the bulls are not the same anymore...”

– Entities correctly identified

– Good coarse-grained context identification (based on common parents)
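A disambiguation based on common parents could look like the sketch below. It is one possible simple algorithm, not necessarily the one that was tested. The candidate lists and the toy category graph in the usage note are invented for illustration.

```python
from itertools import product


def ancestors(parents, node):
    """All categories reachable from `node` by following parent links."""
    seen, stack = set(), list(parents.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return seen


def disambiguate(candidates, parents):
    """Pick one entity per mention so that the chosen entities share the most ancestors.

    candidates: dict mapping each mention to its list of candidate entities.
    Returns (assignment, shared_ancestors); the shared ancestors act as a
    coarse-grained context.
    """
    mentions = list(candidates)
    if not mentions:
        return {}, set()
    best, best_shared = None, None
    for combo in product(*(candidates[m] for m in mentions)):
        shared = set.intersection(*(ancestors(parents, e) for e in combo))
        if best_shared is None or len(shared) > len(best_shared):
            best, best_shared = combo, shared
    return dict(zip(mentions, best)), best_shared
```

For the second test phrase, suppose the candidates are “Michael Jordan” or “Michael Jackson” for "michael" and “Chicago Bulls” or “Bull” for "bulls". Only the basketball pair shares an ancestor, so it is selected, and that shared ancestor gives the context.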

4. Implementation X: Observations and tests

● Statistical methods are effective for network structure “repairs”

– Increased network usability and efficiency

– Low semantic level

● A mixture of the three method types is necessary

– NLP and Connectivity-based to obtain semantic relations

Hard to correctly apply to the Wikipedia Categorization System

– Several cases cannot be distinguished by single-type methods

Next steps:

● Wikipedia features integration

– Redirect and disambiguation links to form a Thesaurus

– Lists to directly infer relations

– Links between pages to infer relatedness

● Suitable disambiguation algorithms

– Shortest path, preferred entities, co-occurrence probability

● External sources integration

– WordNet, hand-crafted information, word occurrence probabilities

5. Conclusions and Future Work

Q&A

● What about more complex titles?

[Figure: the category “Video games based on films directed by Steven Spielberg” with the nodes “Steven Spielberg”, “Films”, “Video Games” and the page “Indiana Jones and the Last Crusade: The Graphic Adventure”]

● Lightweight network

[Figure: after splitting, nodes “Steven Spielberg”, “Video Games”, “Films” and “Indiana Jones and the Last Crusade: The Graphic Adventure”, with unlabeled links]

● Complete semantic network – Current methods

[Figure: after splitting, nodes “Steven Spielberg”, “Video Games”, “Films” and “Indiana Jones and the Last Crusade: The Graphic Adventure”, with the link labels directed_by, based_on and is_a]

● Complete semantic network – Possible “correct” representation

[Figure: nodes “Indiana Jones and the Last Crusade: The Graphic Adventure”, “Indiana Jones and the Last Crusade”, “Steven Spielberg”, “Video Games” and “Films”, with the link labels directed_by, based_on and two is_a links]

– How to realize it?

● Title splitting based on grammatical hierarchy

● Medium-long range connectivity based methods

Extra: Lightweight & Semantic Network Difference

Thank you