marieke-van-erp
Slides of the 11th lecture in the 2012 Knowledge and Media course concerning natural language processing
NLP/NIF
Knowledge and Media 2012-2013
Lecture 11
Monday, December 3, 12
Overview
Natural Language Processing 101
The NLP pipeline
NLP tasks
NLP Challenges
NIF (NLP Interchange Format)
NLP: What is it?
NLP or text analytics adds semantic understanding of:
named entities: people, companies, locations, etc.
pattern-based entities: email-addresses, phone numbers
concepts: abstractions of entities
facts and relationships
concrete and abstract attributes (e.g., 5 years, expensive)
subjectivity in the form of opinions, sentiments and emotions
SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
80% of relevant information to businesses is in ‘unstructured’ textual form:
web pages, news and blog articles, forum postings, other social media
email and messages
surveys, feedback forms, warranty claims
scientific literature, books, legal documents, patents
...
SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
NLP: What is it for?
NLP transforms unstructured text into structured information which may be:
categorised
queried
mined for patterns, topics or themes
presented intelligently
visualised and explored
SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
NLP: Some history
1950 - 1980: Handwritten rules
Russian - English translation system
ELIZA
Since 1980: Machine learning
IBM’s Watson
NLP: Tasks
IMAGE SOURCE: HTTP://NLTK.ORG/IMAGES/DIALOGUE.PNG
Morphological/Lexical Analysis
Language identification
Tokenisation
Stemming/Lemmatisation
Syntactic Analysis
Text segmentation
Part of Speech (POS) tagging
Chunking
Shallow Parsing
Semantic Analysis
Named entity recognition (NER)
Relation finding
Semantic role labelling (SRL)
Word-sense disambiguation (WSD)
Co-reference/anaphora resolution
Semantic Analysis (ctd)
Topic detection/segmentation
Machine Translation (MT)
Sentiment analysis/opinion mining
Automatic summarisation
NLP: Approaches
Rule-based
Statistical
Hybrid methods
Named Entity Recognition Explained
NER: State-of-the-Art
Statistical methods: Conditional Random Fields (CRF)
Precision: 92.15%
Recall: 92.39%
F-Measure: 92.27%
Precision
Of all instances predicted as positive, how many were correct?
P=TP/(TP+FP)
                      ACTUAL
                      Spam                  Not Spam
PREDICTED  Spam       True Positive (TP)    False Positive (FP)
           Not Spam   False Negative (FN)   True Negative (TN)
Recall
Of the total number of instances in a class, how many were found?
R=TP/(TP+FN)
F-Score
Harmonic mean of Precision and Recall
F=2 • P • R/(P+R)
[Acc=(TP+TN)/(TP+FP+FN+TN)]
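The four measures can be computed directly from the cells of the confusion matrix. The following is a minimal sketch; the counts are made up for illustration and do not come from the lecture.

```python
def precision(tp, fp):
    """P = TP / (TP + FP): share of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN): share of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p, r):
    """F = 2 * P * R / (P + R): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp, fp, fn, tn):
    """Acc = (TP + TN) / (TP + FP + FN + TN): share of all decisions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical spam-filter counts (assumption, for illustration only).
tp, fp, fn, tn = 40, 10, 20, 30
p = precision(tp, fp)
r = recall(tp, fn)
f = f_score(p, r)
acc = accuracy(tp, fp, fn, tn)
print(f"P={p:.3f} R={r:.3f} F={f:.3f} Acc={acc:.3f}")
```

Each function returns 0.0 when its denominator is zero; that is one common convention, not the only one.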
Machine Learning 101
Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognised entities
SLIDE FROM: HTTP://WWW.STANFORD.EDU/CLASS/CS124/LEC/INFORMATION_EXTRACTION_AND_NAMED_ENTITY_RECOGNITION.PDF
k-NN
HTTP://WWW.YOUTUBE.COM/USER/ANTALVANDENBOSCH#P/U/2/PB4QATZITLQ
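A minimal k-NN sketch: a new instance receives the majority label of its k most similar training instances. The overlap distance, the three features and the training examples are assumptions chosen for illustration; they are not taken from the lecture or the video.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions in which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(train, instance, k=3):
    """Return the majority label among the k nearest training instances.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda ex: overlap_distance(ex[0], instance))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical instances: (is_capitalised, previous_word, pos_tag) -> label
train = [
    ((True, "Mr", "NNP"), "PER"),
    ((True, "Mrs", "NNP"), "PER"),
    ((True, "in", "NNP"), "LOC"),
    ((True, "at", "NNP"), "ORG"),
    ((False, "the", "NN"), "O"),
    ((False, "a", "NN"), "O"),
]

print(knn_classify(train, (True, "Mr", "NNP")))
print(knn_classify(train, (False, "an", "NN")))
```

Ties in distance are resolved by training-set order here; a real learner would need a more deliberate tie-breaking rule and feature weighting.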
NER Training Data
IOB Scheme
Inside, Outside, Begin
For each type of entity there is an I-XXX and a B-XXX tag
Non-entities are tagged O
B-XXX is only used if two entities of the same type are next to each other
Assumes that named entities are non-recursive and don’t overlap
Example:
Meg    Whitman  CEO  of  eBay
I-PER  I-PER    O    O   I-ORG
SLIDE FROM: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
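Reading the entities back out of an IOB-tagged sequence can be sketched as follows. The function name is hypothetical; the logic follows the scheme as described on the slide: I-XXX continues an entity of the same type, B-XXX always starts a new one, and O ends the current one.

```python
def iob_to_entities(tokens, tags):
    """Group IOB-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        # Only an I- tag of the same type continues the entity being built.
        continues = prefix == "I" and etype == current_type
        if current and not continues:
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        if prefix in ("I", "B"):
            current.append(token)
            current_type = etype
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["Meg", "Whitman", "CEO", "of", "eBay"]
tags = ["I-PER", "I-PER", "O", "O", "I-ORG"]
print(iob_to_entities(tokens, tags))
```

Because entities are assumed to be non-recursive and non-overlapping, a single left-to-right pass is enough.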
Features for text learning task
Is the word capitalised?
Is the word at the start of a sentence?
What is the Part of speech tag?
Previous and following words
Info from gazetteers
Useful features help your learner; badly chosen features may harm it
SLIDE BASED ON: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
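The features listed above could be extracted per token roughly as follows. This is a sketch under assumptions: the input is a single tokenised sentence, the POS tags come from some external tagger, and the gazetteer is a plain set of lower-cased names. The function name, the tags and the gazetteer contents are all hypothetical.

```python
def token_features(tokens, pos_tags, i, gazetteer):
    """Build a feature dictionary for the token at position i of one sentence."""
    word = tokens[i]
    return {
        "is_capitalised": word[:1].isupper(),
        "sentence_start": i == 0,
        "pos": pos_tags[i],
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</S>",
        "in_gazetteer": word.lower() in gazetteer,
    }


tokens = ["Meg", "Whitman", "CEO", "of", "eBay"]
pos_tags = ["NNP", "NNP", "NN", "IN", "NNP"]  # assumed tagger output
gazetteer = {"ebay"}                          # assumed list of known organisations

first = token_features(tokens, pos_tags, 0, gazetteer)
last = token_features(tokens, pos_tags, 4, gazetteer)
print(first)
print(last)
```

Note that "eBay" fails the capitalisation test but is caught by the gazetteer, which is why several complementary features are usually combined.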
Relation Finding Explained
Amphibia Anura
Relation Finding: State-of-the-Art
Induce relation-dictionaries using slot filling (AutoSlog)
Example-based learning (Snowball)
Pattern-recognition over shallow parses (LEILA)
Relation Finding: pattern finding over shallow parses
direction  relation candidate                frequency  rating
           is a municipality and a town in   45         +
           is a municipality and a city in   19         +
           is a municipality in              10         +
           is one of the five districts of   5          -
           is the name of two provinces in   5          -
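Counting candidate relation patterns by frequency can be sketched as below. This is a deliberate simplification: it takes the raw string between two known entity mentions rather than working over shallow parses, and it does not reproduce the +/- rating. The function name, sentences and entity pairs are made up for illustration.

```python
from collections import Counter


def candidate_patterns(sentences, pairs):
    """Count the text found between known (subject, object) entity pairs."""
    counts = Counter()
    for sentence in sentences:
        for subj, obj in pairs:
            start = sentence.find(subj)
            if start == -1:
                continue
            # Look for the object only after the end of the subject mention.
            end = sentence.find(obj, start + len(subj))
            if end == -1:
                continue
            between = sentence[start + len(subj):end].strip()
            if between:
                counts[between] += 1
    return counts


# Hypothetical example sentences and seed pairs.
sentences = [
    "Aalten is a municipality and a town in Gelderland.",
    "Ede is a municipality and a town in Gelderland.",
    "Bronckhorst is a municipality in Gelderland.",
]
pairs = [("Aalten", "Gelderland"), ("Ede", "Gelderland"), ("Bronckhorst", "Gelderland")]

counts = candidate_patterns(sentences, pairs)
for pattern, frequency in counts.most_common():
    print(frequency, pattern)
```

Frequent patterns are the more plausible relation candidates; rare ones would need manual rating or a further filtering step.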
RL for domain modelling
[Figure: graph of domain types (Type, Type Name, Location, Country, Province, Town, Class, Order, Family, Genus, Species) connected by learned relation patterns with confidence scores, e.g. "is a" (1.000), "is a municipality in" (0.891), "is a town in" (0.794), "occur in" (0.750), "is found in" (0.635), "may refer to" (0.560), "on the island of" (0.500), "is in" (0.500)]
RL for template filling
Date        Ship             Type          Crew  Ransom
2005/04/10  Feisty Gas       LNG carrier   12    $315,000
2005/06/27  Semlow           Freighter     10    $50,000
2005/10/28  Panagia          Bulk Carrier  22    $700,000
2005/11/05  Seabourn Spirit  Cruise ship   210   none
Opinion Mining Explained
Opinion Mining: State-of-the-Art
Supervised learning using features such as:
opinion words and phrases
negation
part-of-speech-tags
dependency parsing
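Two of these feature types, opinion words and negation, can be sketched as a small feature extractor whose output would be fed to a supervised learner. The word lists and the three-token negation window are assumptions for illustration; part-of-speech and dependency features are left out.

```python
# Hypothetical mini-lexicons; a real system would use a much larger resource.
POSITIVE = {"nice", "cool", "clear", "long"}
NEGATIVE = {"mad", "expensive"}
NEGATORS = {"not", "no", "never"}


def opinion_features(tokens, window_size=3):
    """Count positive and negative opinion words, flipping polarity after a negator."""
    features = {"positive": 0, "negative": 0, "negated": 0}
    window = 0  # tokens remaining in the scope of the last negator
    for token in tokens:
        word = token.lower()
        if word in NEGATORS:
            window = window_size
            continue
        if word in POSITIVE or word in NEGATIVE:
            is_positive = word in POSITIVE
            if window > 0:
                is_positive = not is_positive
                features["negated"] += 1
            features["positive" if is_positive else "negative"] += 1
        window = max(0, window - 1)
    return features


print(opinion_features("The touch screen was really cool".split()))
print(opinion_features("Although the battery life was not long".split()))
```

The fixed window is a crude stand-in for negation scope; dependency parsing is one way to determine the scope more precisely.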
Positive or negative?
“I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought it. She also thought the phone was too expensive, and wanted me to return it to the shop. … ”
EXAMPLE FROM: BING LIU (2010) SENTIMENT ANALYSIS AND SUBJECTIVITY, IN: NLP HANDBOOK, 2ND EDITION, N. INDURKHYA AND F. J. DAMERAU (EDS), 2010.
IBM’s Watson
HTTP://WWW.YOUTUBE.COM/WATCH?V=DYWO4ZKSFXW
NLP: Challenges
Negation
Messy text (twitter and SMS language)
Domain adaptation
Cross- and multi-document text analysis
Resource-scarce languages
NIF: Natural Language Processing Interchange Format
Look familiar?
NIF: Why do we need it?
Integration of NLP tools
Bridge between LOD and NLP communities
NIF Claims
1. NIF provides global interoperability. If an NLP tool incorporates a NIF parser and a NIF serializer, it is compatible with all other tools that implement NIF.
2. NIF achieves this interoperability by using and defining a most common denominator for annotations. This means that some standard annotations are required to be used. On the other hand NIF is flexible and allows the NLP tools to add any extra annotations at will.
3. NIF allows tool chains to be created without a large amount of up-front development work. As the output of each tool is compatible, you can quickly test whether the tools you selected actually produce what you need to solve a certain task.
4. As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:
RDF makes data integration easy: URIs, LinkedData
OWL is based on Description Logics (Types, Type inheritance)
Availability of open data sets (access and licence)
Reusability of Vocabularies and Ontologies
Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
Scalable tool support (Databases, Reasoning)
Data is flexible and can be queried / transformed in many ways
Structural interoperability
NIF specifies how to create an identifier for uniquely locating arbitrary substrings in a document
either using offset- or context-hash-based URIs
String ontology to describe Strings
Structured Sentence Ontology
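The idea behind offset-based URIs can be sketched as follows: the document URI is extended with a fragment that encodes the begin and end character offsets of the substring. The `#char=begin,end` fragment syntax (in the style of RFC 5147) is an assumption; the NIF version covered in this lecture may prescribe a different recipe. The document URI and function names are hypothetical, and context-hash-based URIs are not covered.

```python
def offset_uri(document_uri, begin, end):
    """Build a URI that identifies text[begin:end] of the given document."""
    return f"{document_uri}#char={begin},{end}"


def resolve(text, uri):
    """Return the substring of text that an offset-based URI points to."""
    fragment = uri.rsplit("#char=", 1)[1]
    begin, end = (int(n) for n in fragment.split(","))
    return text[begin:end]


text = "Meg Whitman is CEO of eBay"
begin = text.index("eBay")
uri = offset_uri("http://example.org/doc1.txt", begin, begin + len("eBay"))
print(uri)
print(resolve(text, uri))
```

Offset URIs are cheap to compute but break when the document changes, which is the motivation for the more robust context-hash alternative.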
Conceptual Interoperability
Lemma and stem annotations are data type properties in the Structured Sentence Ontology
POS tags use OLiA (Ontologies of Linguistic Annotation)
NER tags use the Semantic Content Management System (SCMS) EU Project
Access Interoperability
Main interface: wrapper to NIF Web service
IMG: HTTP://NLP2RDF.ORG/FILES/2011/09/NIF_ARCHITECTURE.PNG
NLP/NIF: Wrap up
NLP History and tasks
Machine learning 101
Use-cases NER, relation finding and opinion mining
Interoperability of NLP results with NIF
Further reading/Tools
Peter Jackson and Isabelle Moulinier (2007) Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. John Benjamins. ISBN: 9027249938
ACL Anthology: A Digital Archive of Research Papers in Computational Linguistics
Machine learning: WEKA
Natural language processing: GATE