How computers understand text content

A presentation for the Auckland content strategy meetup by Anna Divoli (@annadivoli)

Ph.D. in Biomedical Text Mining | Text Analytics Researcher | Head of R&D at Pingar

Who am I?

• 14 years in academia + 4 years in industry
• academically exposed to different disciplines: biomedicine, bioinformatics, computational linguistics, information retrieval, information extraction, semantic technologies, human-computer interaction, search user interface usability, knowledge acquisition, visualizations
• lived in different countries: Greece, UK, US, NZ
• learned English as a second language (hint: I empathize with computer systems)

Anna Divoli Auckland content strategy meetup Aug 2015

Who are you?

• Marketing?
• Digital content?
• Information Architecture?
• Journalists?
• UX?
• Business Analysis?
• Software Development?
• CS research (incl. “text” people)?
• Other?

What is “text”? Where is it?

Image credits: www.nailingit.com/images/websites.jpg, www.bu.edu/today/files/2012/10/t_journals1.jpg, web.clarku.edu/offices/its/images/filepile.jpg, www.flickr.com/photos/jlconfor/14191286471

Human – Text Content Interaction

Humans: slow, inconsistent, expensive

Text content: overwhelmingly fast growing, disseminated across multiple sources

NLP ∈ Artificial Intelligence

Diagram labels: Machine Learning, NLP, Computational Linguistics, Applied Text Analytics, Storage, Memory, Security, Friendly UIs, Visualizations

So, what’s in the text?

• Entities
• Facts
• Relations
• Themes/topics
• Opinions & sentiment
• …

+ Time/Location dimensions:
• Trends & paradigm shifts
• Networks
• …

Named Entity Recognition

Find and classify names…

S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.

John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.

Entity types: People, Locations, Organizations

Methods:
lexicon-based (gazetteers)
grammar-based (rule-based)
✓ statistical models (machine learning: algorithms + features)
✓ hybrids
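
To make the simplest of these methods concrete, here is a minimal sketch of lexicon-based (gazetteer) lookup in Python. The gazetteer entries and the function name `find_entities` are hypothetical, chosen only to cover the example sentence; this is an illustration, not how any particular product does it.

```python
import re

# Hypothetical toy gazetteer: surface form -> entity type.
GAZETTEER = {
    "John Smith": "PERSON",
    "S. Arlington": "PERSON",
    "Washington": "LOCATION",
    "Virginia": "LOCATION",
    "Smithsonian": "ORGANIZATION",
    "Eureka": "ORGANIZATION",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return sorted (start, surface, type) tuples for every gazetteer match.

    Longer names are tried first so that a short entry cannot claim
    characters that already belong to a longer match.
    """
    found, taken = [], set()
    for name in sorted(gazetteer, key=len, reverse=True):
        # Require non-word characters (or string edges) around the name.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                taken.update(span)
                found.append((m.start(), name, gazetteer[name]))
    return sorted(found)

sentence = ("John Smith went to Washington to see the Smithsonian "
            "and also met up with Virginia for a coffee.")
for start, name, label in find_entities(sentence):
    print(start, name, label)
```

Note that the lookup labels "Virginia" as a location although the sentence is about a person. A pure lexicon has no way to use context, which is one reason statistical and hybrid methods are usually preferred.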

Entity types: People, Dates, Locations, Organizations

Who? Where? When?

Disambiguation & Normalization: Word Sense Disambiguation & Text Normalization

S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.

John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.

Word Sense Disambiguation: identifying which sense/meaning of a word is used in a sentence, when the word has multiple meanings. Synonyms & homonyms. Use context!!
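
"Use context" can be sketched as a simplified Lesk-style word overlap: each candidate sense gets a short signature of related words, and the sense sharing the most words with the sentence wins. The sense inventory, the stopword list, and the function names below are assumptions made up for this illustration.

```python
# Hypothetical mini sense inventory for the ambiguous name "Washington".
SENSES = {
    "Washington, D.C. (city)": "capital city museum smithsonian visit capitol monument",
    "Washington (state)": "state pacific northwest seattle rain coast",
    "George Washington (person)": "president general born army revolution",
}

STOPWORDS = {"the", "to", "a", "and", "for", "with", "up", "also", "of", "in"}

def tokens(text):
    """Lowercase, strip surrounding punctuation, drop stopwords."""
    words = (w.strip(".,!?'\"").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}

def disambiguate(context, senses=SENSES):
    """Pick the sense whose signature shares the most words with the context."""
    ctx = tokens(context)
    return max(senses, key=lambda s: len(ctx & tokens(senses[s])))

print(disambiguate("John Smith went to Washington to see the Smithsonian"))
# prints: Washington, D.C. (city)
```

Here the word "Smithsonian" in the sentence is the only overlap, and it is enough to select the city sense. Real systems use much richer context and learned models rather than hand-written signatures.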

Text normalization: transforming text into a single canonical form that it might not have had before.

Word Sense Disambiguation & Text Normalization

S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.

Sam Arlington initiated partnership discussions during his visit to Eureka offices in July.

John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.

J. Smith went to Washington DC to see the Smithsonian Institution and also met up with Virginia Peterson for a coffee.
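
The first rewrite above can be approximated with a small sketch: an alias table maps name variants to a canonical form, and the relative date "last month" is resolved against the document date. The alias entries, the function names, and the choice of August 2015 as the document date are assumptions for illustration only.

```python
from datetime import date

# Hypothetical alias table; a real system would derive it from a knowledge base.
ALIASES = {
    "S. Arlington": "Sam Arlington",
    "Eureka’s Ltd": "Eureka",
}

# Spelled out explicitly so the result does not depend on the system locale.
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def last_month_name(doc_date):
    """Name of the calendar month before doc_date (January wraps to December)."""
    return MONTHS[(doc_date.month - 2) % 12]

def normalize(text, doc_date, aliases=ALIASES):
    """Replace known aliases and resolve 'last month' against the document date."""
    for variant, canonical in aliases.items():
        text = text.replace(variant, canonical)
    return text.replace("last month", "in " + last_month_name(doc_date))

sentence = ("S. Arlington initiated partnership discussions during his visit "
            "to Eureka’s Ltd offices last month.")
print(normalize(sentence, date(2015, 8, 1)))
# prints: Sam Arlington initiated partnership discussions during his visit to Eureka offices in July.
```

The same relative phrase normalizes to a different month for a document written at a different time, which is why the document date has to be an explicit input.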

Fact & Relationship extraction

S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.

John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.

What?

Deeper knowledge & Sentiment

S. Arlington initiated partnership discussions during his visit to Eureka’s Ltd offices last month.

John Smith went to Washington to see the Smithsonian and also met up with Virginia for a coffee.

How? Why? How do we feel about it?

S. Arlington visited the Eureka’s Ltd offices last month to initiate partnership discussions.

John Smith was delighted to go to Washington to see the Smithsonian and also met up with Virginia for a coffee.

Sentiment analysis & opinion mining

• Dictionary-based (e.g. LIWC)
• Statistical
• Hybrid

• Polarity & strength: Pos | Neu | Neg & score
• Feelings: angry, sad…
• Mood: happy, depressed…
• Aspects: location, cleanliness…
• Who has this sentiment (source): employees, customers…
• What is the target of the sentiment: product, event, person…
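
As a rough illustration of the dictionary-based approach and of "polarity & strength", the sketch below sums per-word scores from a tiny lexicon and flips the sign after a negator. The lexicon, its scores, and the negation rule are invented for this example; it is not LIWC and not a production method.

```python
# Hypothetical toy lexicon: word -> polarity strength in [-1, 1].
LEXICON = {"delighted": 0.9, "love": 0.8, "happy": 0.7, "good": 0.5,
           "sad": -0.6, "angry": -0.8, "upset": -0.6, "crap": -0.7}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    """Return (label, score); a negator flips the next lexicon word it precedes."""
    score, flip = 0.0, False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATORS:
            flip = True
        elif word in LEXICON:
            score += -LEXICON[word] if flip else LEXICON[word]
            flip = False
    label = "Pos" if score > 0 else "Neg" if score < 0 else "Neu"
    return label, score

print(sentence_result := sentiment("John Smith was delighted to go to Washington."))
print(sentiment("George W Bush. Love him!"))
```

The second call scores "Love him!" as positive even when the writer is being sarcastic, which shows the limit of word lists: they capture polarity and strength but not the intent, source, or target of the sentiment.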

So, what’s in the text?

• Entities
• Facts
• Relations
• Themes/topics: no training or ontologies needed! Can utilize web resources (e.g., Wikipedia)
• Opinions & sentiment
• …

+ Time/Location dimensions:
• Trends & paradigm shifts
• Networks
• …

So, what ELSE is in the text?

• Ambiguity: I want an apple.
• Metaphors: He drowned in a sea of grief.
• Sarcasm: George W Bush. Love him!
• Colloquialism/Slang: I slept like crap last night.
• Negation: I am not sure I want to go to NYC.
• Hedging: The results indicate this.
• Conditional statements: When it rains I feel sad.
• Inconsistencies/Bad grammar: I think your smart.
• Text speak: C u l8r @Jacks
• Anaphora: John met with Nick. He was upset.
• Humor: Did you take a bath today? No. Is one missing?

Consider: distributed information (dialogue), technical/scientific text, legal text, creative/poetry…

Human language!

Eye drops off shelf.

Include your children when baking cookies.

Turn right here.

John saw the man on the mountain with a telescope.

He gave her cat food.

They are hunting dogs.

Examples: Biology…

Looking for: interactions between SAF and viral LTR elements
(SAF is a transcription factor, LTR stands for ‘long terminal repeat’)
(Also: SAF = single and free, LTR = long term relationship)

Gene names: tinman, lilliputian, dreadlocks, lush, cheap date, methuselah, Van Gogh, maggie, brainiac, grim, reaper, cleopatra, swiss cheese, fucK, out cold, ken and barbie, kenny, lava lamp, hamlet, sonic hedgehog, werewolf, half pint, drop dead, chardonnay, agnostic, I’m not dead yet…

Current State of NLP

• Rule-based systems for high precision results

• Hybrid systems for more robust performance (rules + dictionaries/ontologies + statistical models)

• Limitation: specialized systems perform better (much like humans!)

• Workflows offer a workaround for more generic systems, e.g., check language → check category → choose model
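
The workflow idea in the last bullet can be sketched as a small router: detect the language, guess the category, then look up a specialized model, falling back to a generic one. The detectors, the keyword list, and the model names below are hypothetical stand-ins, not real components.

```python
# Hypothetical stand-ins for real language/category detectors and models.
def check_language(text):
    """Very rough guess: 'en' if common English function words appear."""
    words = set(text.lower().split())
    return "en" if words & {"the", "and", "of", "to", "is"} else "unknown"

def check_category(text):
    """Keyword-based category guess."""
    lowered = text.lower()
    if any(k in lowered for k in ("gene", "transcription factor", "protein")):
        return "biomedical"
    return "general"

# (language, category) -> name of the specialized model to run.
MODELS = {
    ("en", "biomedical"): "english-biomedical-model",
    ("en", "general"): "english-general-model",
}

def choose_model(text, default="generic-fallback-model"):
    """check language -> check category -> choose model."""
    key = (check_language(text), check_category(text))
    return MODELS.get(key, default)

print(choose_model("SAF is a transcription factor"))
# prints: english-biomedical-model
```

The point of the design is that each specialized model stays narrow and accurate, while the routing step lets the overall system accept generic input.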

Examples of applications

(some are very specialized!)

Content Enrichment

Content Inventory

Content Intelligence

pingar.com/discoveryone/

www.youtube.com/watch?v=i9FnMylGQxw

Take home messages

• Machines can do a lot of consistent, fast information extraction

• Specialization is needed in several fields but systems can have internal workflows

• Big data + statistics = magic!

• Always room for improvement

• Information management AND decisions AND predictions

Time for questions and discussion!

https://xkcd.com/1263/

@annadivoli