Getting beyond Keywords: Using Conceptual Representations of Text in Search, Analysis, and Discovery Roger Bradford Agilex Technologies 16 April 2012

II-SDV 2012 Getting beyond Keywords: Using Conceptual Representations of Text in Search, Analysis and Discovery


Page 1:

Getting beyond Keywords: Using Conceptual Representations of Text in Search, Analysis, and Discovery

Roger Bradford

Agilex Technologies

16 April 2012

Page 2:

What’s Wrong with Keyword Search?

6.9 Million Documents

Page 3:

Complexities of Text

• Synonymy:
  - Common English Nouns have 6-8 Close Synonyms
  - Common English Verbs have 9-11
• Polysemy: The English Word Strike has >30 Common Meanings
• Implicit Context (e.g., John’s Project)
• Data Irregularities:
  - Missing
  - Erroneous
  - Contradictory
• Cross-document Relationships

Page 4:

The Problem of Synonymy

[Figure: User Terminology (Car? Automobile? Motor Vehicle?) versus Author Terminology (a document containing "Auto")]

Page 5:

Large Collections ⇒ Many Terms to Choose from

Page 6:

The Problem of Polysemy

[Figure: User Terminology "Strike" versus possible Intent: Miss (Baseball)? Hit (Bowling)? Labor Action? Military Operation?]

Page 7:

Cross-document Relationships

John ↔ Bob Relationship:

Order          Document chain                  # of Relations in 5,998 Documents
First Order    JOHN-BOB                        51,474
Second Order   JOHN-TOM, TOM-BOB               11,026,553
Third Order    JOHN-TOM, TOM-DAVE, DAVE-BOB    68,070,600
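The idea of first-, second-, and third-order relationships can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say how its relation counts were produced, so the function name `relations_by_order`, the toy documents, and the choice to count each entity pair once at its shortest chain length are all hypothetical.

```python
from collections import deque
from itertools import combinations


def relations_by_order(documents, max_order=3):
    """Count entity pairs by the length of their shortest co-occurrence chain."""
    # Two entities are directly linked if they appear in the same document.
    neighbors = {}
    for entities in documents:
        for entity in entities:
            neighbors.setdefault(entity, set())
        for a, b in combinations(sorted(set(entities)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)

    counts = {order: 0 for order in range(1, max_order + 1)}
    for start in neighbors:
        # Breadth-first search yields the shortest chain to every reachable entity.
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if distance[node] == max_order:
                continue  # do not explore chains longer than max_order
            for nxt in neighbors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
        for other, order in distance.items():
            if order >= 1 and start < other:  # count each unordered pair once
                counts[order] += 1
    return counts


# Hypothetical toy collection mirroring the John/Tom/Dave/Bob chain on the slide.
toy_documents = [{"John", "Tom"}, {"Tom", "Dave"}, {"Dave", "Bob"}]
print(relations_by_order(toy_documents))  # {1: 3, 2: 2, 3: 1}
```

In the toy collection John and Bob never share a document, yet they are connected at third order through Tom and Dave, which is the kind of link a keyword match on a single document cannot surface.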

Page 8:

How do Semantic Representations Help?

Top 10 Benefits:

1. Improve Efficiency
2. Overcome Errors
3. Deal with Complex Information Requirements
4. Organize Information
5. Interpret Results
6. Discover New Information
7. Work across Languages
8. Search Multiple Databases Simultaneously
9. Search Multimedia Collections
10. Support Direct Analytics

Page 9:

Implementation Approaches

• Capture Semantics Manually (Labor-intensive):
  - Taxonomies
  - Dictionaries
  - Ontologies
  - Grammars
• Approximate Semantics using Local Co-occurrence Statistics (Low Accuracy)
• Employ Matrix Decomposition Techniques (Computationally Intensive):
  - LSI
  - PLSI
  - LRA
  - SDD-LSI

Page 10:

Semantic Processing: Commercial Application Examples

• 8 of the Top 10 Litigation Support Platforms use it for Analytics
• 6 of the Top 8 Essay Scoring Packages use it for Holistic Scoring (SAT, GMAT, etc.)
• Most Spam Filters are Based on it
• It is used for Survey Analysis for Major Corporations (Marriott, Home Depot, etc.)
• It is used in the Most Advanced Literature-based Discovery Systems

Page 11:

Conceptual Search

[Figure: the query "Car" is placed in the Semantic Space, where it lies near a document containing "automobile"]
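A minimal sketch of how a query for "car" can reach a document that only says "automobile", using LSI, one of the matrix decomposition techniques named earlier in the deck. Everything here is an assumption made for illustration: the four-document toy corpus, raw term counts with no weighting, a rank of 2, and a small power-iteration routine standing in for a proper SVD library. It is not the presenter's implementation.

```python
import math


def top_eigenpairs(matrix, rank, iterations=300):
    """Leading eigenpairs of a symmetric matrix via power iteration with deflation.

    Assumes the all-ones start vector is not orthogonal to the wanted
    eigenvectors, which holds for the toy corpus below.
    """
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(rank):
        vec = [1.0] * n
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        value = sum(vec[i] * work[i][j] * vec[j] for i in range(n) for j in range(n))
        pairs.append((value, vec))
        # Deflate so the next pass converges to the next component.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def build_lsi(docs, rank=2):
    """Return term rows of the count matrix A, eigenpairs of A^T A, and document vectors."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    counts = {term: [doc.split().count(term) for doc in docs] for term in vocab}
    n = len(docs)
    gram = [[float(sum(row[i] * row[j] for row in counts.values())) for j in range(n)]
            for i in range(n)]
    pairs = top_eigenpairs(gram, rank)
    doc_vecs = [[vec[d] for _value, vec in pairs] for d in range(n)]
    return counts, pairs, doc_vecs


def fold_in(query, counts, pairs):
    """Project a keyword query into the space: q_hat_i = (q . A v_i) / lambda_i."""
    terms = [term for term in query.split() if term in counts]
    return [sum(counts[term][d] * vec[d] for term in terms for d in range(len(vec))) / value
            for value, vec in pairs]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: document 1 never mentions "car".
toy_docs = [
    "car engine wheel",
    "automobile engine wheel",
    "strike union wage pay",
    "labor union wage pay",
]
counts, pairs, doc_vecs = build_lsi(toy_docs)
query_vec = fold_in("car", counts, pairs)
scores = [cosine(query_vec, doc_vec) for doc_vec in doc_vecs]
for doc, score in zip(toy_docs, scores):
    print(f"{score:.2f}  {doc}")
```

The "automobile" document scores close to the "car" document because both share the context terms "engine" and "wheel", while the two labor documents score near zero. A keyword match on "car" would have returned only the first document.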

Page 12:

Deep Conceptual Generalization is Possible

[Figure: the query "War Crimes" is placed in the Semantic Space, where it lies near a document containing "methods of armed struggle not accepted internationally"]

Page 13:

Automated Terminology Variant Identification

Query = Muammar Qadaffi

Variants Found Automatically in 6.1 M News Articles:

Muammar Kaddafi     Muammar Qadhdhafi     Muammar Qadhafi
Muammar Ghadafi     Muammar Gaddafi       Moammar Qaddafi
Muammar Gaddaffi    Muamar Qaddafi        Muammar Qadhaffi
Muammar Kadafi      Muammer Gadaffi       Muamar Ghadaffi
Muammar Gadhafi     Muammar Qhadhafi      Muammar Qathafi
Muammar Kadhafi     Muammar al Qadaffi    Moammar Gadaffi
Mouammar Qaddafi    Muammar al Qadhaffi   Moammar Qadafi
Muamar Gadaffi      Moammar Qadhafi       Muammar Qaddafi

Page 14:

Comprehensive Variant Clustering

[Figure: a Query Term surrounded by the variant types clustered with it: Transcription Errors, Nicknames, Synonyms, OCR Errors, Trade Names, Speech Conversion Errors, Aliases, Transliteration Differences]

Page 15:

Interpreting Results

Today MOX is widely used in Europe and is planned to be used in Japan. Currently over 30 reactors in Europe (Belgium, Switzerland, Germany and France) are using MOX and a further 20 have been licensed to do so. Japan also plans to use MOX in around a third of its reactors by 2010. Most reactors use it as about one third of their core, but some will accept up to 50% MOX assemblies. France aims to have all its 900 MWe series of reactors running with at least one third MOX. Japan aims to have one third of its reactors using MOX by 2010, and has

Page 16:

Instant Context Display

Today MOX is widely used in Europe and is planned to be used in Japan. Currently over 30 reactors in Europe (Belgium, Switzerland, Germany and France) are using MOX and a further 20 have been licensed to do so. Japan also plans to use MOX in around a third of its reactors by 2010. Most reactors use it as about one third of their core, but some will accept up to 50% MOX assemblies. France aims to have all its 900 MWe series of reactors running with at least one third MOX. Japan aims to have one third of its reactors using MOX by 2010, and has

[Context terms displayed: uranium-plutonium, mixed-oxide, bnfl, reactor, reactor's, plutonium-uranium, bnfl's, nuclear-fuel, vver]

Page 17:

Conceptual Generalization + Context “Understanding” ⇒ Accurate Categorization

[Figure: an Incoming Document Stream sorted into three categories: Immediate Action, Queue for Review, No Further Action]

Page 18:

Semantic Processing ⇒ Noise Tolerance

[Chart: Categorization Accuracy Compared to Baseline (0.0% to 100.0%) versus Percentage of Words Corrupted (0.0% to 100.0%)]

Page 19:

Sample Document with 70% Word Error Rate – Still Categorized Correctly

EORE MA4EUE uIqTED AT CAJFMARHUILLAPeru's stnte minerals markeing Arm, MineroPFou CVvercial SA (Mi1peco)C ifted a force majeure n zinc iWgotshipmenfwfrom 2he country's biggesD zioc sefintry at CajUmarquivla,Vaspekus an said.

The spofesmanBsaid the p oblems affectiKg sulphurxg a iT and roast j plants that Uad hgltedWproduction sinUe May p had bsen resolvFd.D Howvjr,Khe BTid producti6 of znc ingots thisLyearPwas expe ted to fall to aro nd 86,000 tonnes thistyear at CajamruuiwpT, from 9wT000 Tonne7 in 198x becauNe of1the f5oppageO Thm refVuery has an opKimum annuv pjoduotibn capwcity of 100,000 tonn s but its highest qroueio ca 969000 tonnes of refined zRnc zLgots inw1985, the spokebman sa d.
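A noise-tolerance experiment of this kind needs a way to corrupt a controlled percentage of words. The sketch below shows one possible approach. The slides do not describe the corruption procedure that was actually used, so the single-character substitution, the function name `corrupt_words`, and the sample sentence are assumptions made for illustration.

```python
import random
import string


def corrupt_words(text, fraction, seed=0):
    """Replace one character in the given fraction of words with a different character."""
    rng = random.Random(seed)  # seeded so a test run can be repeated exactly
    words = text.split()
    n_corrupt = round(fraction * len(words))
    for index in rng.sample(range(len(words)), n_corrupt):
        word = words[index]
        pos = rng.randrange(len(word))
        # Exclude the original character so the word is guaranteed to change.
        choices = [c for c in string.ascii_letters + string.digits if c != word[pos]]
        words[index] = word[:pos] + rng.choice(choices) + word[pos + 1:]
    return " ".join(words)


# Hypothetical ten-word sample; 70% corruption changes seven of its words.
clean = "the refinery lifted a force majeure on zinc ingot shipments"
noisy = corrupt_words(clean, 0.7)
changed = sum(a != b for a, b in zip(clean.split(), noisy.split()))
print(noisy)
print(f"{changed} of {len(clean.split())} words corrupted")
```

Running a categorizer on `corrupt_words(document, p)` for increasing `p` and comparing accuracy against the uncorrupted baseline would yield a curve of the kind plotted on the Noise Tolerance slide.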

Page 20:

Application: Sentiment Detection

Automatic Detection in News Articles

[Figure: Mideast Sentiment, 0400 hrs, 5 August 2002]

Page 21:

Automatic Taxonomy Generation

5,000 Surveys from Large Hotel Chain

• Good, Quality, Standard
• Nice hotel, Clean
• Lobby, Small
• Enjoyed stay
• Bathrooms, Hair, Plug
• Showers, Water, Head, Floor
• Smoking room, Pool area, Cigarettes
• Light, Poor, Dark, Bedside
• Work properly, Flushing
• Excellent staff
• Property, View, Beautiful hotel
• Noise, Traffic, Construction, Sleep
• Internet access, High-speed internet (172)

[Figure values, in extraction order: 426, 253, 410, 348, 281, 260, 231, 251, 250, 242, 198, 192]

Page 22:

Application: Summarization

45-minute Speech

• So I ask you to join me in expanding their access to child care, creating new hiring preferences for military spouses across the federal government, and allowing our troops to transfer their unused education benefits to their spouses or children.

• And tonight, I ask Congress to support an innovative proposal to provide food assistance by purchasing crops directly from farmers in the developing world, so we can build up local agriculture and help break the cycle of famine.

Page 23:

Represent Complex Information Requirements (I)

[Figure: Job Descriptions, Résumé, E-mail Alert to Hiring Manager]

For One Fortune 500 Company ⇒ Savings of >3 Days in Average Applicant Processing Time

Page 24:

Represent Complex Information Requirements (II)

[Figure: RFP Subsection, Inputs, Sections of Previous Proposals, Draft Proposal]

One Fortune 500 Company Estimates Savings of >$10 Million per Year

Page 25:

Literature-based Discovery (I)

• PubMed Abstracts
• Gene – Function Relationships Derived Semantically
• 98,074,359 Potential Gene-function Associations.

Zukas, A., GO-Driven Literature-Based Discovery using Semantic Analysis, MS Thesis, George Mason University, 2007.

Page 26:

Literature-based Discovery (II)

Latent Gene and Function Relationships from the June 2006 Gene Ontology - Later Documented in the January 2007 Gene Ontology

Page 27:

Social Media – Goldmine of Public Opinion?

Page 28:

Processing Social Media = Big Data Problem

• Volume – 200 Million Tweets per Day Worldwide
• Velocity – Minutes to Hours
• Variety – 17 Languages
• Variability – Very Low Signal to Noise Ratio

Page 29:

Cross-lingual Retrieval

[Figure labels: Documents in Multiple Languages (Farsi, English, Arabic, English); English Doc; Retrieved Documents in Correct Relevance Order]

Page 30:

Interleaved Results Display

Page 31:

Federation of Queries

[Figure: results from Database #1, Database #2, and Database #3 combined into a Merged Result List]
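Merging per-database results into one list can be sketched as below. The sketch assumes that every database returns hits already sorted by descending relevance score and that the scores are directly comparable across databases, for example because all collections are scored in the same semantic space. The slide does not state how the merge is done, and the `(score, document_id)` tuple layout and sample values are hypothetical.

```python
import heapq


def merge_results(*result_lists, limit=None):
    """Merge result lists, each sorted by descending score, into one ranked list."""
    # heapq.merge streams the inputs lazily; reverse=True expects descending inputs.
    merged = list(heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True))
    return merged if limit is None else merged[:limit]


# Hypothetical (score, document_id) hits from three databases.
database_1 = [(0.92, "db1:a"), (0.40, "db1:b")]
database_2 = [(0.88, "db2:c"), (0.15, "db2:d")]
database_3 = [(0.75, "db3:e")]

merged = merge_results(database_1, database_2, database_3)
print([doc_id for _score, doc_id in merged])
# ['db1:a', 'db2:c', 'db3:e', 'db1:b', 'db2:d']
```

If the databases used incompatible scoring schemes, the scores would need to be normalized before a merge like this would produce a meaningful ranking.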

Page 32:

Combined Search of Structured and Unstructured Information

Buyer       Seller        Material      Amount  Date
John Smith  Ace Jewelers  Diamond Ring  $3,000  8/18/11

[Figure: RDBMS rows (Rows = Mini-documents) and Text both feed the Semantic Representation Space]
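The "Rows = Mini-documents" idea can be sketched as a function that flattens a relational row into a short piece of text, which can then be indexed alongside ordinary documents. The function name `row_to_mini_document` and the decision to include column names in the text are assumptions for illustration; the slide does not specify the exact conversion.

```python
def row_to_mini_document(row):
    """Flatten one relational row into text so it can be indexed like a document."""
    # Column names are kept so that field context ("Buyer", "Seller") stays searchable.
    return " ".join(
        f"{column} {value}" for column, value in row.items() if value not in (None, "")
    )


# The example row shown on the slide.
example_row = {
    "Buyer": "John Smith",
    "Seller": "Ace Jewelers",
    "Material": "Diamond Ring",
    "Amount": "$3,000",
    "Date": "8/18/11",
}
print(row_to_mini_document(example_row))
# Buyer John Smith Seller Ace Jewelers Material Diamond Ring Amount $3,000 Date 8/18/11
```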

Page 33:

Where is Semantic Analysis Going?

1. Larger Databases (100s of Millions of Documents +)
2. Multimedia Collections
  - Text
  - Relational Data
  - Audio
  - Video
3. Visualization
4. Analytics

Page 34:

Hardware Advances ⇒ Large Applications

[Chart: Build Time (hrs), 0 to 500, versus Millions of Documents, 0 to 100, with curves labeled 2007, 2009, 2010, and 2011]

Page 35:

Integration of Multimedia Data

[Figure: Integrated Analytics drawing on Structured Data, Images, Multi-lingual Text, Audio, Sensor Data, and Video]

Buyer       Seller        Material      Amount   Date
John Smith  Ace Jewelers  Diamond Ring  3 Carat  8/18/06

Page 36:

Analytics + Visualization

Page 37:

Questions or Comments

Roger Bradford

Agilex Technologies Inc

703-889-3916
