Microblog-Genre Noise and Impact on Semantic Annotation Accuracy Leon Derczynski Diana Maynard Niraj Aswani Kalina Bontcheva


As presented at ACM HyperText 2013. See the full paper: http://bit.ly/gate-microblog-nel


Page 1: Microblog-genre noise and its impact on semantic annotation accuracy

Microblog-Genre Noise and Impact on Semantic Annotation Accuracy

Leon Derczynski, Diana Maynard

Niraj Aswani, Kalina Bontcheva

Page 2: Microblog-genre noise and its impact on semantic annotation accuracy

Evolution of communication

Functional utterances

Vowels

Velar closure: consonants

Speech

New modality: writing

Digital text

E-mail

Social media

Increased machine-readable information?

Page 3: Microblog-genre noise and its impact on semantic annotation accuracy

Social Media = Big Data

Gartner's "3V" definition:

1. Volume

2. Velocity

3. Variety

High volume & velocity of messages:

Twitter has ~20 000 000 users per month

They write ~500 000 000 messages per day

Massive variety: stock markets; earthquakes; social arrangements; … Bieber

Page 4: Microblog-genre noise and its impact on semantic annotation accuracy

What resources do we have now?

Large, content-rich, linked, digital streams of human communication

We transfer knowledge via communication

Sampling communication gives a sample of human knowledge

"You've only done that which you can communicate"

The metadata (time – place – imagery) gives a richer resource:

→ A sampling of human behaviour

Page 5: Microblog-genre noise and its impact on semantic annotation accuracy

Linking these resources

Why is it useful to link this data?

Identifying subjects, themes, … "entities"

What's the text about?

Who'd be interested in it?

How can we summarise it?

Very important, considering the information overload in social media!

Page 6: Microblog-genre noise and its impact on semantic annotation accuracy

Typical annotation pipeline

Text → Language ID → Tokenisation → PoS tagging

Page 7: Microblog-genre noise and its impact on semantic annotation accuracy

Typical annotation pipeline

Named entity recognition

Linking entities: dbpedia.org/resource/… → Michael_Jackson vs. Michael_Jackson_(writer)
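For readers who want the shape of this pipeline in code, here is a minimal sketch in Python. Every stage function is a hypothetical placeholder rather than the GATE/TwitIE components behind these slides; the point is only that each stage consumes the previous stage's output, so errors propagate.

```python
# Minimal sketch of the annotation pipeline on these slides.
# All stage implementations are hypothetical placeholders, not the actual
# components used in the paper.

def identify_language(text):
    """Stage 1: guess the language of the message."""
    return "en"  # placeholder

def tokenise(text):
    """Stage 2: split the message into tokens."""
    return text.split()  # placeholder; a real tweet tokeniser is needed

def pos_tag(tokens):
    """Stage 3: assign a part-of-speech tag to every token."""
    return [(tok, "NN") for tok in tokens]  # placeholder

def recognise_entities(tagged):
    """Stage 4: find candidate entity mentions."""
    return [tok for tok, tag in tagged if tok.istitle()]  # naive placeholder

def link_entities(mentions):
    """Stage 5: map each mention to a DBpedia URI."""
    return {m: "http://dbpedia.org/resource/" + m for m in mentions}

def annotate(text):
    # Errors made early propagate to every later stage, which is why
    # per-stage accuracy matters and not just the final linking score.
    if identify_language(text) != "en":
        return {}
    tokens = tokenise(text)
    tagged = pos_tag(tokens)
    mentions = recognise_entities(tagged)
    return link_entities(mentions)

print(annotate("Michael Jackson wrote about beer"))
```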

Page 8: Microblog-genre noise and its impact on semantic annotation accuracy

Pipeline cumulative effect

Good performance is important at each stage – not just entity linking

Page 9: Microblog-genre noise and its impact on semantic annotation accuracy

Language ID

Microblog: LADY GAGA IS BETTER THE 5th TIME OH BABY(:

Newswire: The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse

Page 10: Microblog-genre noise and its impact on semantic annotation accuracy

Language ID difficulties

General accuracy on microblogs: 89.5%

Problems include switching language mid-text:

je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun gezicht. Get over it
(Dutch: "you're not Jacques Cousteau who has discovered a new species, it's obvious, they cover their faces", followed by English "Get over it")

New info in this format:

Metadata: spatial information, linked URLs

Emoticons: :) vs. ^_^, cu vs. 88

Accuracy when customised to genre: 97.4%
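As an illustration of off-the-shelf language ID on text like the examples above, here is a short sketch assuming the langid.py package (pip install langid); it is a generic identifier, not the genre-customised one behind the 97.4% figure. Restricting the candidate language set is one cheap adaptation.

```python
# Rough illustration of off-the-shelf language ID on microblog text,
# assuming the langid.py package. This is a generic tool, not the
# genre-adapted classifier quoted on this slide.
import langid

tweets = [
    "LADY GAGA IS BETTER THE 5th TIME OH BABY(:",
    "je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, "
    "het is duidelijk, ze bedekken hun gezicht. Get over it",
]

# Restricting the candidate languages (e.g. using profile metadata)
# is one cheap way to lift accuracy on short, noisy messages.
langid.set_languages(["en", "nl", "fr", "de", "es"])

for tweet in tweets:
    lang, score = langid.classify(tweet)
    print(lang, round(score, 2), tweet[:40])
```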

Page 11: Microblog-genre noise and its impact on semantic annotation accuracy

Tokenisation

General accuracy on microblogs: 80%

Goal is to convert byte stream to readily-digestible word chunks

Word boundary discovery is a critical language acquisition task

The LIBYAN AID Team successfully shipped these broadcasting equipment to Misrata last August 2011, to establish an FM Radio station ranging 600km, broadcasting to the west side of Libya to help overthrow Gaddafi's regime.

RT @JosetteSheeran: @WFP #Libya breakthru! We move urgently needed #food (wheat, flour) by truck convoy into western Libya for 1st time :D

Page 12: Microblog-genre noise and its impact on semantic annotation accuracy

Tokenisation difficulties

Not curated, so typos

Improper grammar – e.g. apostrophe usage; live with it!

doesnt → doesnt

doesn't → does n't

Smileys and emoticons

I <3 you → I & lt ; you

This piece ;,,( so emotional → this piece ; , , ( so emotional

Loss of information (sentiment)

Punctuation for emphasis

*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *

Words run together

Page 13: Microblog-genre noise and its impact on semantic annotation accuracy

Tokenisation fixes

Custom tokeniser!

Apostrophe insertion

Slang unpacking

Ima get u → I 'm going to get you

Emoticon rules

- if we can spot them, we know not to break them

Customised accuracy on microblogs: 96%
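A minimal sketch of the genre-aware tokenisation idea, in Python with a single regular expression: URLs, @mentions, #hashtags and emoticons are matched first so they survive as whole tokens, and apostrophes stay inside words. This is illustrative only; the custom tokeniser reported above is a rule-based GATE component.

```python
# Sketch of a genre-aware tokeniser: protect emoticons, URLs, @mentions
# and #hashtags so they survive as single tokens, then fall back to a
# simple word/punctuation split.
import re

EMOTICON = r"(?:[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|@3<>]+|<3|\^_\^)"
TOKEN_RE = re.compile(
    r"(?:https?://\S+)"        # URLs
    r"|(?:[@#]\w+)"            # @mentions and #hashtags
    r"|" + EMOTICON +          # emoticons, kept whole
    r"|(?:\w+(?:'\w+)?)"       # words, optionally with an apostrophe
    r"|(?:\S)"                 # any other single non-space character
)

def tokenise(text):
    return TOKEN_RE.findall(text)

print(tokenise("I <3 you :) RT @JosetteSheeran: #Libya breakthru! doesn't"))
# ['I', '<3', 'you', ':)', 'RT', '@JosetteSheeran', ':', '#Libya',
#  'breakthru', '!', "doesn't"]
```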

Page 14: Microblog-genre noise and its impact on semantic annotation accuracy

Part of speech tagging

Goal is to assign words to classes (verb, noun etc)

General accuracy on newswire: 97.3% token, 56.8% sentence

General accuracy on microblogs: 73.6% token, 4.24% sentence

Sentence-level accuracy important:

without the whole sentence correct, it is difficult to extract syntax

Page 15: Microblog-genre noise and its impact on semantic annotation accuracy

Part of speech tagging difficulties

Many unknowns:

Music bands: Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010

Places: #LB #news: Silverado Park Pool Swim Lessons

Capitalisation way off: @thewantedmusic on my tv :) aka derek / last day of sorting pope visit to birmingham stuff out

Slang: ~HAPPY B-DAY TAYLOR !!! LUVZ YA~

Orthographic errors: dont even have homwork today, suprising ?

Dialect: Shall we go out for dinner this evening? vs. Ey yo wen u gon let me tap dat

Page 16: Microblog-genre noise and its impact on semantic annotation accuracy

Part of speech tagging fixes

Slang dictionary for "repair"

Won't cover previously-unseen slang

In-genre labelled data

Expensive to create!

Leverage ML

Existing taggers can handle unknown words

Maximise use of these features!

General accuracy on microblogs: 73.6% token, 4.24% sentence

Accuracy when customised to genre: 88.4% token, 25.40% sentence
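A rough sketch of the slang-dictionary "repair" step feeding a stock tagger. The dictionary entries are invented examples, and NLTK's default English tagger stands in for the genre-adapted tagger whose figures are quoted above.

```python
# Sketch of slang dictionary "repair": normalise known slang forms before
# handing tokens to an ordinary tagger. Requires nltk and its tagger models;
# illustrative only.
import nltk

SLANG = {"u": "you", "luvz": "loves", "ya": "you", "gon": "going to",
         "wen": "when", "b-day": "birthday"}

def repair(tokens):
    # Dictionary repair only helps for slang we have already seen.
    out = []
    for tok in tokens:
        out.extend(SLANG.get(tok.lower(), tok).split())
    return out

tokens = "Ey yo wen u gon let me tap dat".split()
print(nltk.pos_tag(repair(tokens)))
```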

Page 17: Microblog-genre noise and its impact on semantic annotation accuracy

Named Entity Recognition

Goal is to find entities we might like to link

General accuracy on newswire: 89% F1

General accuracy on microblogs: 41% F1

Newswire: London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.

Microblog: Gotta dress up for london fashion week and party in style!!!

Page 18: Microblog-genre noise and its impact on semantic annotation accuracy

NER difficulties

Rule-based systems get the bulk of entities (newswire 77% F1)

ML-based systems do well at the remainder (newswire 89% F1)

Small proportion of difficult entities

Many complex issues

Using the improved pipeline:

ML struggles, even with in-genre data: 49% F1

Rules cut through microblog noise: 80% F1
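One reason rules hold up under microblog noise is that gazetteer lookup ignores capitalisation and context. A toy sketch of greedy longest-match gazetteer spotting follows; the gazetteer contents are invented for illustration.

```python
# Sketch of gazetteer-style (rule-based) entity spotting: greedy longest
# match against known names, case-insensitively, so lowercased or oddly
# capitalised tweets still match. Gazetteer contents invented.
GAZETTEER = {
    ("london", "fashion", "week"): "Event",
    ("london",): "Location",
    ("lady", "gaga"): "Person",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def spot_entities(tokens):
    lowered = [t.lower() for t in tokens]
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in GAZETTEER:
                spans.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1
    return spans

print(spot_entities("Gotta dress up for london fashion week and party".split()))
# [('london fashion week', 'Event')]
```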

Page 19: Microblog-genre noise and its impact on semantic annotation accuracy

NER on Facebook

Longer texts than tweets

Still has informal tone

Multi-word expressions (MWEs) are a problem, e.g. when everything is capitalised:

Green Europe Imperiled as Debt Crises Triggers Carbon Market Drop

Difficult, though easier than Twitter

Maybe due to the possibility of including more verbal context?

Page 20: Microblog-genre noise and its impact on semantic annotation accuracy

Entity linking

Goal is to find out which entity a mention refers to

"The murderer was Professor Plum, in the Library, with the Candlestick!"

Which Professor Plum?

Disambiguation is through connecting text to the web of data

dbpedia.org/resource/Professor_Plum_(astrophysicist)

Two tasks:

- Whole-text linking

- Entity-level linking

Page 21: Microblog-genre noise and its impact on semantic annotation accuracy

"Aboutness"

Goal: answer "What entities is this text about?"

Good for tweets:

Lack of lexicalised context

Not all related concepts are in the text

Helpful for summarisation

No concern for entity bounds (finding them is tough in microblogs!)

…but added concern for themes in the text

e.g. "marketing", "US elections"

Page 22: Microblog-genre noise and its impact on semantic annotation accuracy

Aboutness performance

Corpus:

from Meij et al., "Adding semantics to microblog posts"

468 tweets

From one to six concepts per tweet

DBpedia Spotlight: highest recall (47.5)

TextRazor: highest precision (64.6)

Zemanta: highest F1 (41.0)

Zemanta tuned for blog entries – so compensates for some noise
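For reference, concept-level "aboutness" annotations like those evaluated here can be obtained from DBpedia Spotlight's public REST interface. The sketch below assumes the api.dbpedia-spotlight.org endpoint and its JSON response layout, which may differ from the service as it stood in 2013; treat it as illustrative.

```python
# Sketch of whole-tweet ("aboutness") annotation via DBpedia Spotlight's
# REST interface, assuming the public endpoint and JSON layout; check the
# Spotlight documentation for the current details. Requires requests.
import requests

def spotlight_concepts(text, confidence=0.4):
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    # Each annotation carries the DBpedia URI of the linked concept.
    return [r["@URI"] for r in resp.json().get("Resources", [])]

print(spotlight_concepts("Gotta dress up for london fashion week and party in style!!!"))
```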

Page 23: Microblog-genre noise and its impact on semantic annotation accuracy

Word-level linking

Goal is to link an entity

Given: the entity mention and its surrounding microblog context

No public corpora exist for this exact task:

Two commercially-produced ones exist

but policy says "no sharing"

How can we approach this key task?

Page 24: Microblog-genre noise and its impact on semantic annotation accuracy

Word-level linking performance

Dataset: RepLab

Task is to determine relatedness-or-not

Six entities given

Few hundred tweets per entity

Detect mentions of entity in tweets

We disambiguate mentions to DBpedia / Wikipedia (easy to map)

General performance: F1 around 70

Page 25: Microblog-genre noise and its impact on semantic annotation accuracy

Word-level linking issues

NER errors

Missed entities damage or destroy linking

Specificity problems

"Lufthansa Cargo": which organisation to choose?

Require good NER

Direct linking's own chunking reduces precision: Apple trees in the home garden bit.ly/yOztKs

Pipeline NER does not mark Apple as an entity here

Lack of disambiguation context is a problem!
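To make the "lack of disambiguation context" point concrete, here is a toy context-overlap disambiguator: each candidate is scored by how much vocabulary its description shares with the tweet. The candidate abstracts are invented stand-ins for DBpedia abstracts; with only a handful of tweet words, scores are frequently zero or tied, which is exactly the problem.

```python
# Toy sketch of context-based disambiguation: pick the candidate whose
# description shares the most vocabulary with the tweet. Candidate
# abstracts below are invented stand-ins for DBpedia abstracts.
from collections import Counter

CANDIDATES = {
    "dbpedia.org/resource/Apple_Inc.": "technology company iphone mac computers",
    "dbpedia.org/resource/Apple": "fruit tree orchard garden cultivar",
}

def disambiguate(tweet, candidates):
    context = Counter(tweet.lower().split())
    def overlap(abstract):
        return sum(context[w] for w in abstract.split())
    return max(candidates, key=lambda uri: overlap(candidates[uri]))

print(disambiguate("Apple trees in the home garden bit.ly/yOztKs", CANDIDATES))
# dbpedia.org/resource/Apple
```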

Page 26: Microblog-genre noise and its impact on semantic annotation accuracy

Word-level linking issues

Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter!

Actual: Branching out from Lincoln park after dark(ORG) ... Hello "Russian Navy", it's like the same thing but with glitter!

The clue is in the unusual collocations


Page 27: Microblog-genre noise and its impact on semantic annotation accuracy

Whole pipeline: how to fix?

Common genre problems centre on mucky, uncurated text

Orth error

Slang

Brevity

Condensed

Non-Chicago punctuation..

Maybe clearing this up will improve performance?

Page 28: Microblog-genre noise and its impact on semantic annotation accuracy

Normalisation

General solution for overcoming linguistic noise

How to repair?

1. Gazetteer (quick & dirty); or..

2. Noisy channel model

Task is to "reverse engineer" the noise on this channel

Brown clustering; double metaphone; automatic orthographic correction

An honest, well-formed sentence → noisy channel → u wot m8 biber #lol
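A toy sketch of the two repair options above: a small gazetteer of known noisy→canonical pairs, with an edit-similarity fallback (difflib) standing in for a proper noisy-channel model. The word lists are invented for illustration; as the next slide shows, the real-world gains from this kind of normalisation are small.

```python
# Sketch of normalisation: first a gazetteer of known noisy -> canonical
# pairs (quick & dirty), then an edit-similarity fallback against an
# in-vocabulary word list as a crude stand-in for a noisy-channel model.
from difflib import get_close_matches

GAZETTEER = {"u": "you", "wot": "what", "m8": "mate", "brb": "be right back"}
VOCABULARY = ["surprising", "homework", "don't", "bieber", "lol"]

def normalise(token):
    if token.lower() in GAZETTEER:
        return GAZETTEER[token.lower()]
    # difflib's similarity ratio is a crude proxy for a channel model score.
    close = get_close_matches(token.lower(), VOCABULARY, n=1, cutoff=0.8)
    return close[0] if close else token

print([normalise(t) for t in "u wot m8 biber suprising homwork".split()])
# ['you', 'what', 'mate', 'bieber', 'surprising', 'homework']
```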

Page 29: Microblog-genre noise and its impact on semantic annotation accuracy

Normalisation performance

NER on tweets:

Rule-based: no normalisation F1 80%; gazetteer normalisation F1 81%; noisy channel F1 81%

ML-based: no normalisation F1 49.1%; gazetteer normalisation F1 47.6%; noisy channel F1 49.3%

Negligible performance impact, and introduces errors!

Sentiment change: undisambiguable → disambiguable

Meaning change: She has Huntington's → She has Huntingdon's

Page 30: Microblog-genre noise and its impact on semantic annotation accuracy

Future directions

MORE DATA!

…and better data: many resources have no inter-annotator agreement (IAA) figures

Maybe from the crowd?

MORE CONTEXT!

Not just linguistic: microblogs carry a host of metadata

Explicit:

Time, Place, URIs, Hashtags

Implicit:

Friend network, previous messages
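A small sketch of pulling the explicit context listed above out of a tweet object. The field names (created_at, coordinates, entities.hashtags, entities.urls) follow the Twitter REST API v1.1 JSON layout; other API versions name them differently.

```python
# Sketch of extracting explicit context (time, place, hashtags, URLs) from
# a tweet object in the Twitter REST API v1.1 JSON layout.
def explicit_context(tweet):
    entities = tweet.get("entities", {})
    return {
        "time": tweet.get("created_at"),
        "place": (tweet.get("coordinates") or {}).get("coordinates"),
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }

example = {
    "created_at": "Mon Sep 24 03:35:21 +0000 2012",
    "coordinates": {"type": "Point", "coordinates": [-0.1, 51.5]},
    "entities": {"hashtags": [{"text": "Libya"}],
                 "urls": [{"expanded_url": "http://wfp.org"}]},
}
print(explicit_context(example))
```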

Page 31: Microblog-genre noise and its impact on semantic annotation accuracy

Thank you!

Thank you for listening!

Do you have any questions?