NLP/NIF: Knowledge and Media 2012-2013, Lecture 11 (December 3, 2012)




Slides from the 11th lecture of the 2012-2013 Knowledge and Media course, on natural language processing (NLP) and the NLP Interchange Format (NIF).


Page 1: KM Lecture11 nlp/nif

NLP/NIF
Knowledge and Media 2012-2013
Lecture 11


Page 2: KM Lecture11 nlp/nif


Page 3: KM Lecture11 nlp/nif

Overview

Natural Language Processing 101

The NLP pipeline

NLP tasks

NLP Challenges

NIF (NLP Interchange Format)


Page 4: KM Lecture11 nlp/nif

NLP: What is it?

NLP or text analytics adds semantic understanding of:

named entities: people, companies, locations, etc.

pattern-based entities: email-addresses, phone numbers

concepts: abstractions of entities

facts and relationships

concrete and abstract attributes (e.g., 5 years, expensive)

subjectivity in the form of opinions, sentiments and emotions

SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011


Page 5: KM Lecture11 nlp/nif

An estimated 80% of business-relevant information is in ‘unstructured’ textual form:

web pages, news and blog articles, forum postings, other social media

email and messages

surveys, feedback forms, warranty claims

scientific literature, books, legal documents, patents

...

SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011


Page 6: KM Lecture11 nlp/nif

NLP: What is it for?

NLP transforms unstructured text into structured information which may be:

categorised

queried

mined for patterns, topics or themes

presented intelligently

visualised and explored

SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011


Page 7: KM Lecture11 nlp/nif

NLP: Some history

1950 - 1980: Handwritten rules

Russian-English translation system

ELIZA

Since 1980: Machine learning

IBM’s Watson


Page 8: KM Lecture11 nlp/nif

NLP: Tasks

IMAGE SOURCE: HTTP://NLTK.ORG/IMAGES/DIALOGUE.PNG


Page 9: KM Lecture11 nlp/nif

Morphological/Lexical Analysis

Language identification

Tokenisation

Stemming/Lemmatisation
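Only the task names appear on the slide; a minimal sketch of tokenisation, stemming and lemmatisation, assuming NLTK with the 'punkt' and 'wordnet' resources downloaded (the lecture does not prescribe a toolkit), could look like this:

```python
# A minimal sketch of tokenisation, stemming and lemmatisation using NLTK.
# NLTK is an assumption here, not a tool named in the lecture.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The cats were running quickly."

tokens = nltk.word_tokenize(text)      # ['The', 'cats', 'were', 'running', 'quickly', '.']

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems  = [stemmer.stem(t) for t in tokens]                    # crude suffix stripping, e.g. 'quickly' -> 'quickli'
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in tokens]   # dictionary forms, treating each token as a verb

print(list(zip(tokens, stems, lemmas)))
```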


Page 10: KM Lecture11 nlp/nif

Syntactic Analysis

Text segmentation

Part of Speech (POS) tagging

Chunking

Shallow Parsing
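As an illustration of the syntactic level (again assuming NLTK and its tagger models, which are not part of the original deck), POS tagging followed by a toy noun-phrase chunker gives a shallow parse:

```python
# Sketch of POS tagging and rule-based NP chunking (a simple shallow parse) with NLTK.
# The one-rule grammar below is a toy assumption, not a rule set from the lecture.
import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # e.g. [('The', 'DT'), ('quick', 'JJ'), ...]

# A noun phrase here is an optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
print(chunker.parse(tagged))    # bracketed tree with NP chunks
```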


Page 11: KM Lecture11 nlp/nif

Semantic Analysis

Named entity recognition (NER)

Relation finding

Semantic role labelling (SRL)

Word-sense disambiguation (WSD)

Co-reference/anaphora resolution


Page 12: KM Lecture11 nlp/nif

Semantic Analysis (ctd)

Topic detection/segmentation

Machine Translation (MT)

Sentiment analysis/opinion mining

Automatic summarisation


Page 13: KM Lecture11 nlp/nif

NLP: Approaches

Rule-based

Statistical

Hybrid methods


Page 14: KM Lecture11 nlp/nif

Named Entity Recognition Explained


Page 15: KM Lecture11 nlp/nif

NER: State-of-the-Art

Statistical methods: Conditional Random Fields (CRF)

Precision: 92.15%

Recall: 92.39%

F-Measure: 92.27%


Page 16: KM Lecture11 nlp/nif

Precision

How many of the positive predictions were correct?

P=TP/(TP+FP)

                       ACTUAL
                       Spam                   Not Spam
PREDICTED   Spam       True Positive (TP)     False Positive (FP)
            Not Spam   False Negative (FN)    True Negative (TN)


Page 17: KM Lecture11 nlp/nif

Recall

Of the total number of instances in a class, how many were found?

R=TP/(TP+FN)

(Same predicted vs. actual confusion matrix as on the Precision slide.)


Page 18: KM Lecture11 nlp/nif

F-Score

Harmonic mean of Precision and Recall

F=2 • P • R/(P+R)

[Acc=(TP+TN)/(TP+FP+FN+TN)]

(Same predicted vs. actual confusion matrix as on the Precision slide.)
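To make the three measures concrete, here is a small worked example with invented counts (illustrative only, not the NER figures quoted earlier):

```python
# Worked example of precision, recall, F-score and accuracy from a 2x2 confusion matrix.
# The counts are made up for illustration.
TP, FP, FN, TN = 40, 10, 20, 30

precision = TP / (TP + FP)                                  # 40 / 50  = 0.800
recall    = TP / (TP + FN)                                  # 40 / 60  = 0.667
f_score   = 2 * precision * recall / (precision + recall)   # about 0.727
accuracy  = (TP + TN) / (TP + FP + FN + TN)                 # 70 / 100 = 0.700

print(f"P={precision:.3f}  R={recall:.3f}  F={f_score:.3f}  Acc={accuracy:.3f}")
```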


Page 19: KM Lecture11 nlp/nif

Machine Learning 101

Training

1. Collect a set of representative training documents

2. Label each token for its entity class or other (O)

3. Design feature extractors appropriate to the text and classes

4. Train a sequence classifier to predict the labels from the data

Testing

1. Receive a set of testing documents

2. Run sequence model inference to label each token

3. Appropriately output the recognised entities

SLIDE FROM: HTTP://WWW.STANFORD.EDU/CLASS/CS124/LEC/INFORMATION_EXTRACTION_AND_NAMED_ENTITY_RECOGNITION.PDF
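The steps above map onto a fairly standard sequence-labelling pipeline. The skeleton below assumes the sklearn-crfsuite library purely for illustration; the lecture does not name a specific implementation, and the five-token training set is only there to show the shape of the data.

```python
# Skeleton of the train/test procedure above, using a linear-chain CRF.
# sklearn-crfsuite is an assumed choice of library; any sequence classifier would do.
import sklearn_crfsuite

def token_features(sentence, i):
    """Step 3: a deliberately tiny feature extractor (expanded on a later slide)."""
    word = sentence[i]
    return {"word.lower": word.lower(), "is_capitalised": word[0].isupper()}

def sent2features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# Steps 1-2: representative documents, tokenised and labelled with IOB tags.
train_sents  = [["Meg", "Whitman", "CEO", "of", "eBay"]]
train_labels = [["I-PER", "I-PER", "O", "O", "I-ORG"]]

# Step 4: train the sequence classifier on features extracted from the training data.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit([sent2features(s) for s in train_sents], train_labels)

# Testing: run sequence model inference to label each token of unseen text.
test_sents = [["Pierre", "Vinken", "joined", "Elsevier"]]
print(crf.predict([sent2features(s) for s in test_sents]))
```

In practice step 1 would supply thousands of labelled sentences; the point here is only that training and testing share the same feature extraction.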


Page 21: KM Lecture11 nlp/nif

NER Training Data

IOB Scheme

Inside, Outside, Begin

For each type of entity there is an I-XXX and a B-XXX tag

Non-entities are tagged O

B-XXX is only used when two entities of the same type are adjacent to each other

Assumes that named entities are non-recursive and don’t overlap

Example:

Meg    Whitman   CEO   of   eBay
I-PER  I-PER     O     O    I-ORG

SLIDE FROM: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
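To show how IOB labels are turned back into entities at prediction time (not shown on the slide), a small decoder might look like this:

```python
# Sketch: collapse an IOB-tagged token sequence back into (entity_text, type) spans.
# Follows the convention on the slide: B-XXX only when two same-type entities are adjacent.
def iob_to_entities(tokens, tags):
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        # Close the open entity on O, on an explicit B- tag, or on a type change.
        if tag == "O" or tag.startswith("B-") or (current_type and tag[2:] != current_type):
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            current_tokens.append(token)
            current_type = tag[2:]
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Meg", "Whitman", "CEO", "of", "eBay"]
tags   = ["I-PER", "I-PER", "O", "O", "I-ORG"]
print(iob_to_entities(tokens, tags))   # [('Meg Whitman', 'PER'), ('eBay', 'ORG')]
```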


Page 22: KM Lecture11 nlp/nif

Features for a text learning task

Is the word capitalised?

Is the word at the start of a sentence?

What is the part-of-speech tag?

Previous and following words

Info from gazetteers

Useful features help your learner; badly chosen features may harm it

SLIDE BASED ON: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
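Turned into code, the features listed above become a function from a token position to a feature dictionary. The gazetteer and POS tags below are toy stand-ins, so treat this as an assumed sketch rather than the course's implementation:

```python
# Sketch of a feature extractor covering the features listed on the slide.
# GAZETTEER is a toy stand-in for a real list of known names and places.
GAZETTEER = {"eBay", "Amsterdam", "IBM"}

def word2features(tokens, pos_tags, i):
    word = tokens[i]
    return {
        "is_capitalised": word[0].isupper(),
        "is_sentence_start": i == 0,
        "pos_tag": pos_tags[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        "in_gazetteer": word in GAZETTEER,
    }

tokens   = ["Meg", "Whitman", "CEO", "of", "eBay"]
pos_tags = ["NNP", "NNP", "NN", "IN", "NNP"]
print(word2features(tokens, pos_tags, 4))   # features for 'eBay'
```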


Page 23: KM Lecture11 nlp/nif

Relation Finding Explained

[Figure: example of a relation between the taxa Amphibia and Anura]


Page 24: KM Lecture11 nlp/nif

Relation Finding: State-of-the-Art

Induce relation-dictionaries using slot filling (AutoSlog)

Example-based learning (Snowball)

Pattern-recognition over shallow parses (LEILA)


Page 25: KM Lecture11 nlp/nif

Relation Finding: pattern finding over shallow parses

relation candidate                 frequency   rating
is a municipality and a town in    45          +
is a municipality and a city in    19          +
is a municipality in               10          +
is one of the five districts of     5          -
is the name of two provinces in     5          -


Page 26: KM Lecture11 nlp/nif

RL for domain modelling

[Figure: a learned domain model, a graph linking types such as Location, Country, Province, Town, Species, Genus, Family, Order, Class and Type Name through extracted relation patterns with confidence scores, e.g. "is a (1.000)", "is found in (0.635)", "is a town in (0.794)", "is a municipality in (0.891)", "occur in (0.750)", "may refer to (0.560)".]


Page 27: KM Lecture11 nlp/nif

RL for template filling

Date         Ship             Type           Crew   Ransom
2005/04/10   Feisty Gas       LNG carrier    12     $315,000
2005/06/27   Semlow           Freighter      10     $50,000
2005/10/28   Panagia          Bulk Carrier   22     $700,000
2005/11/05   Seabourn Spirit  Cruise ship    210    none


Page 28: KM Lecture11 nlp/nif

Opinion Mining Explained


Page 29: KM Lecture11 nlp/nif

Opinion Mining: State-of-the-Art

Supervised learning using features such as:

opinion words and phrases

negation

part-of-speech tags

dependency parsing
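A minimal sketch of how such features could be computed for one sentence, assuming a tiny hand-made opinion lexicon and a naive three-token negation window (neither comes from the lecture):

```python
# Toy sentiment features: opinion-word counts with a simple negation flip.
# The lexicons and the 3-token negation window are illustrative assumptions.
POSITIVE = {"nice", "cool", "clear", "great"}
NEGATIVE = {"bad", "expensive", "mad", "poor"}
NEGATORS = {"not", "never", "no"}

def sentiment_features(tokens):
    pos_count = neg_count = 0
    for i, token in enumerate(tokens):
        word = token.lower()
        # Naive negation handling: a negator in the previous 3 tokens flips polarity.
        negated = any(t.lower() in NEGATORS for t in tokens[max(0, i - 3):i])
        if word in POSITIVE:
            neg_count += negated
            pos_count += not negated
        elif word in NEGATIVE:
            pos_count += negated
            neg_count += not negated
    return {"positive_words": pos_count, "negative_words": neg_count}

print(sentiment_features("The touch screen was really cool".split()))
# {'positive_words': 1, 'negative_words': 0}
```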


Page 30: KM Lecture11 nlp/nif

Positive or negative?

“I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought it. She also thought the phone was too expensive, and wanted me to return it to the shop. … ”

EXAMPLE FROM: BING LIU (2010) SENTIMENT ANALYSIS AND SUBJECTIVITY, IN: NLP HANDBOOK, 2ND EDITION, N. INDURKHYA AND F. J. DAMERAU (EDS), 2010.


Page 31: KM Lecture11 nlp/nif

IBM’s Watson

HTTP://WWW.YOUTUBE.COM/WATCH?V=DYWO4ZKSFXW

Page 32: KM Lecture11 nlp/nif

NLP: Challenges

Negation

Messy text (Twitter and SMS language)

Domain adaptation

Cross- and multi-document text analysis

Resource-scarce languages


Page 33: KM Lecture11 nlp/nif

NIF: Natural Language Processing Interchange Format


Page 34: KM Lecture11 nlp/nif


Page 35: KM Lecture11 nlp/nif

Look familiar?


Page 36: KM Lecture11 nlp/nif

NIF: Why do we need it?

Integration of NLP tools

Bridge between LOD and NLP communities


Page 37: KM Lecture11 nlp/nif

NIF Claims

1. NIF provides global interoperability. If an NLP tool incorporates a NIF parser and a NIF serializer, it is compatible with all other tools that implement NIF.

2. NIF achieves this interoperability by defining and reusing a common denominator for annotations: a small set of standard annotations must be used, but NIF remains flexible and lets NLP tools add any extra annotations at will.

3. NIF allows tool chains to be created without a large amount of up-front development work. Because the output of each tool is compatible, you can quickly test whether the tools you selected actually produce what you need for a given task.

4. As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:

RDF makes data integration easy: URIs, Linked Data

OWL is based on Description Logics (Types, Type inheritance)

Availability of open data sets (access and licence)

Reusability of Vocabularies and Ontologies

Diverse serializations for annotations: XML, Turtle, RDFa+XHTML

Scalable tool support (Databases, Reasoning)

Data is flexible and can be queried / transformed in many ways


Page 38: KM Lecture11 nlp/nif

Structural interoperability

NIF specifies how to create an identifier that uniquely locates an arbitrary substring in a document

using either offset-based or context-hash-based URIs

A String Ontology to describe strings

Structured Sentence Ontology
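As an illustration of the offset-based variant, the sketch below mints an identifier by appending character offsets to a document URI. The exact fragment syntax of the NIF specification may differ; this only demonstrates the idea, and the document URI is hypothetical.

```python
# Sketch: mint an offset-based URI for an arbitrary substring of a document,
# in the spirit of NIF's string identifiers. Treat the '#offset_start_end'
# fragment as an approximation of the spec, not its normative syntax.
def offset_uri(document_uri, text, start, end):
    return f"{document_uri}#offset_{start}_{end}", text[start:end]

doc_uri = "http://example.org/lecture11.txt"   # hypothetical document URI
text = "NLP transforms unstructured text into structured information."

uri, substring = offset_uri(doc_uri, text, 0, 3)
print(uri)        # http://example.org/lecture11.txt#offset_0_3
print(substring)  # NLP
```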


Page 39: KM Lecture11 nlp/nif

Conceptual Interoperability

Lemma and stem annotations are datatype properties in the Structured Sentence Ontology

POS tags use OLiA (Ontologies of Linguistic Annotation)

NER tags use the vocabulary of the Semantic Content Management System (SCMS) EU project


Page 40: KM Lecture11 nlp/nif

Access Interoperability

Main interface: a wrapper to a NIF Web service

IMG: HTTP://NLP2RDF.ORG/FILES/2011/09/NIF_ARCHITECTURE.PNG


Page 41: KM Lecture11 nlp/nif

NLP/NIF: Wrap up

NLP History and tasks

Machine learning 101

Use cases: NER, relation finding and opinion mining

Interoperability of NLP results with NIF


Page 42: KM Lecture11 nlp/nif

Further reading/Tools

Peter Jackson and Isabelle Moulinier (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. John Benjamins. ISBN 9027249938.

ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics

Machine learning: WEKA

Natural language processing: GATE
