It's not what you said, it's how you said it. Jamie Taylor, Ph.D. Text Analytic Summit Boston 2010

Text Analytic Summit 2010

With over 12 million entities and 350 million relationships, Freebase is an excellent resource for performing text analysis. One way to look at document "understanding" is to think about how the entities in the document are connected on a knowledge graph. This is similar to the "reconciliation" process that is used to grow Freebase itself. The web is currently full of semantic hints, whether they are explicit (like those promoted by the Semantic Web) or implicit (like the use of blog widgets). Using these hints, text analytic methods can get a toehold on the web corpus at large.

Page 1: Text Analytic Summit 2010

It's not what you said, it's how you said it.

Jamie Taylor, Ph.D.

Text Analytic Summit Boston 2010

Page 2: Text Analytic Summit 2010

What do y'all mean "Semantics"

The Web! Now with Better Flavor!

Page 3: Text Analytic Summit 2010
Page 4: Text Analytic Summit 2010
Page 5: Text Analytic Summit 2010
Page 7: Text Analytic Summit 2010

The Cake, taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png

The Semantic Web?

Page 8: Text Analytic Summit 2010

Linked Open Data

Page 9: Text Analytic Summit 2010

The Real Web

http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg

Page 10: Text Analytic Summit 2010
Page 11: Text Analytic Summit 2010

Wish it were real

Page 12: Text Analytic Summit 2010

Might be real

Page 13: Text Analytic Summit 2010

Is real, but don't believe it

Page 14: Text Analytic Summit 2010

Is currently useful

Page 15: Text Analytic Summit 2010

Entities

Page 16: Text Analytic Summit 2010

Identifiers

Bono, A.K.A. Paul David Hewson

http://rdf.freebase.com/ns/en.paul_david_hewson

Side Step Polysemy

Page 17: Text Analytic Summit 2010

Vocabulary

Manufactures

http://rdf.freebase.com/ns/automotive.make.model_s

Page 18: Text Analytic Summit 2010

A socially managed semantic database

Page 19: Text Analytic Summit 2010

Freebase has Many Types of Things

Page 20: Text Analytic Summit 2010
Page 21: Text Analytic Summit 2010
Page 23: Text Analytic Summit 2010

350 Million Relations

12 Million Entities

Page 24: Text Analytic Summit 2010

Users extend the data model

Users contribute data

Page 25: Text Analytic Summit 2010

schema = vocabulary

Page 26: Text Analytic Summit 2010

A range of vocabularies...

1500 types with 500+ instances!!

Page 27: Text Analytic Summit 2010

Growing Freebase

Page 28: Text Analytic Summit 2010

Reconciliation

+=

Page 29: Text Analytic Summit 2010

Reconciliation

Record Matching

Identity Matching

Collective Entity Resolution

Relational Learning

Record Linking

Equivalence Mining

Page 30: Text Analytic Summit 2010

Reconciliation

"Harrison Ford"

"Excuse Me"

"Vanity Fair"

"Harrison Ford"

"Excuse Me"

"Maytime"

Page 31: Text Analytic Summit 2010

Reconciliation

"Harrison Ford"

"Excuse Me"

"Vanity Fair"

"Harrison Ford"

"Fugitive"

"Blade Runner"

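The two "Harrison Ford" slides suggest that a name alone is not enough and that the neighboring entities decide the match. A minimal sketch of that idea, assuming a simple Jaccard overlap between film titles (the overlap measure and the candidate keys `harrison_ford_a` and `harrison_ford_b` are illustrative assumptions, not the scoring Freebase uses):

```python
def overlap_score(record_neighbors, candidate_neighbors):
    """Jaccard similarity between two collections of neighboring names."""
    a, b = set(record_neighbors), set(candidate_neighbors)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Incoming record, using the film titles shown on the slides.
record = {"name": "Harrison Ford", "films": ["Excuse Me", "Vanity Fair"]}

# Two graph entities that share the same name. The keys are hypothetical
# placeholders; the film lists come from the slides.
candidates = {
    "harrison_ford_a": ["Excuse Me", "Maytime"],
    "harrison_ford_b": ["Fugitive", "Blade Runner"],
}

scores = {
    key: overlap_score(record["films"], films)
    for key, films in candidates.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Only the first candidate shares a film with the record, so it gets the higher score even though both candidates match on name.
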
Page 32: Text Analytic Summit 2010

A Graph of Entities

Page 33: Text Analytic Summit 2010

Vocabulary

education

nationality

located

education

plays-in

plays-in

performed-at

created

released-by

located

contains

Page 34: Text Analytic Summit 2010
Page 35: Text Analytic Summit 2010

Reconciliation as "understanding"

education

nationality

located

education

plays-in

plays-in

performed-at

created

released-by

located

contains

Page 36: Text Analytic Summit 2010

http://data.labs.freebase.com/recon/

{
  "/type/object/name":"Blade Runner",
  "/type/object/type":"/film/film",
  "/film/film/starring/actor":["Harrison Ford", "Rutger Hauer"],
  "/film/film/director":"Ridley Scott",
  "/film/film/release_date_s":"1981"
}

[{
  "id":"/guid/9202a8c04000641f8000000000009e89",
  "name":["Blade Runner", "Bladerunner"],
  "score":1.4320519,
  "match":true,
  "type":["/common/topic", "/film/film","/media_common/adapted_work", "/award/award_winning_work", ...... ]
}, {
  "id":"/guid/9202a8c04000641f80000000002643d0",
  "name":["Blade"],
  "score":0.48852453,
  "match":false,
  "type":["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]
}, {
  "id":"/guid/9202a8c04000641f800000000e5daaae",
  "name":["Blade"],
  "score":0.46398318,
  "match":false,
  .....

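The response on this slide is a ranked list of candidates, each with a `score` and a `match` flag. A minimal sketch of how a client might pick the reconciled entity from such a list, assuming a trimmed copy of the response that keeps only the fields shown in full on the slide (the helper `best_match` is hypothetical, and how the query is sent to the service is not shown here):

```python
import json

# Trimmed copy of the response from the slide: the elided "type" lists
# are left out, and only fully visible fields are kept.
response_text = """
[
  {"id": "/guid/9202a8c04000641f8000000000009e89",
   "name": ["Blade Runner", "Bladerunner"],
   "score": 1.4320519, "match": true},
  {"id": "/guid/9202a8c04000641f80000000002643d0",
   "name": ["Blade"],
   "score": 0.48852453, "match": false},
  {"id": "/guid/9202a8c04000641f800000000e5daaae",
   "name": ["Blade"],
   "score": 0.46398318, "match": false}
]
"""


def best_match(candidates):
    """Return the highest-scoring candidate flagged as a match, or None."""
    matched = [c for c in candidates if c.get("match")]
    return max(matched, key=lambda c: c["score"]) if matched else None


candidates = json.loads(response_text)
winner = best_match(candidates)
print(winner["id"] if winner else "no confident match")
```

Returning `None` when nothing is flagged keeps low-scoring lookalikes such as "Blade" from being accepted by default.
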
Page 37: Text Analytic Summit 2010

Data Everywhere

Page 38: Text Analytic Summit 2010
Page 39: Text Analytic Summit 2010

Wikipedia Features

Page 40: Text Analytic Summit 2010

Wikipedia Features

Error Prone -- Usually <99%

X

X

Page 41: Text Analytic Summit 2010
Page 42: Text Analytic Summit 2010
Page 43: Text Analytic Summit 2010
Page 44: Text Analytic Summit 2010
Page 45: Text Analytic Summit 2010

WEX

calculate feature counts per type

intersect the two sources

generate type scores for topics

join feature counts with topics

extract features

get types

(Machine) Learning Semantics

37M features

2.4M features

1400 types

5M type assertions

2.8M Wikipedia topics

5M articles

1.6G scores
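
The pipeline steps on this slide (extract features, get types, intersect the two sources, count features per type, generate type scores) can be sketched on toy data. The scoring formula below, a sum of smoothed log ratios, is an assumption chosen for illustration; the slides do not say which formula was used. The topic names and feature names are likewise made up.

```python
import math
from collections import Counter, defaultdict


def count_features(topic_features, topic_types):
    """Count features per type over topics present in both sources."""
    per_type = defaultdict(Counter)
    overall = Counter()
    # Intersect the two sources: only topics with features AND types count.
    for topic in topic_features.keys() & topic_types.keys():
        overall.update(topic_features[topic])
        for type_name in topic_types[topic]:
            per_type[type_name].update(topic_features[topic])
    return per_type, overall


def type_score(features, type_name, per_type, overall):
    """Assumed score: sum of log(P(feature | type) / P(feature)), add-one smoothed."""
    vocab = len(overall)
    type_total = sum(per_type[type_name].values())
    all_total = sum(overall.values())
    score = 0.0
    for feature in features:
        p_type = (per_type[type_name][feature] + 1) / (type_total + vocab)
        p_all = (overall[feature] + 1) / (all_total + vocab)
        score += math.log(p_type / p_all)
    return score


# Toy inputs (hypothetical): features extracted from articles, known types.
topic_features = {
    "topic_a": ["infobox_person", "category_living_people"],
    "topic_b": ["infobox_person", "category_births"],
    "topic_c": ["infobox_film", "category_films"],
    "topic_d": ["infobox_person"],  # untyped topic to be scored
}
topic_types = {
    "topic_a": ["/people/person"],
    "topic_b": ["/people/person"],
    "topic_c": ["/film/film"],
}

per_type, overall = count_features(topic_features, topic_types)
scores = {
    type_name: type_score(topic_features["topic_d"], type_name, per_type, overall)
    for type_name in sorted(per_type)
}
print(scores)
```

With this toy data the untyped topic scores above zero for /people/person and below zero for /film/film, which is the kind of signal the later thresholding step would act on.
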

Page 46: Text Analytic Summit 2010

/people/person distribution

untyped topics

person topics

other topics

all topics

Data courtesy Viral Shah

Page 47: Text Analytic Summit 2010

RABJ: Humans in the loop

Page 48: Text Analytic Summit 2010

Thresholding Results

99% threshold at 16.75
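
One way to arrive at a cut-off like this is to have people judge a sample of scored assertions and then take the lowest score at which precision still meets the target. A minimal sketch under that assumption (the judged sample is invented for illustration and does not reproduce the 16.75 figure; `precision_threshold` is a hypothetical helper):

```python
def precision_threshold(judged, target=0.99):
    """Return the lowest score cut-off whose precision is >= target.

    `judged` is a list of (score, is_correct) pairs from human review.
    Returns None if no cut-off reaches the target.
    """
    ranked = sorted(judged, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for seen, (score, is_correct) in enumerate(ranked, start=1):
        correct += bool(is_correct)
        # Only evaluate once all items tied at this score are included.
        if seen < len(ranked) and ranked[seen][0] == score:
            continue
        if correct / seen >= target:
            best = score
    return best


# Invented human judgments: (score, was the type assertion correct?)
judged = [(20.0, True), (18.0, True), (17.0, True), (12.0, False), (10.0, True)]

print(precision_threshold(judged))        # strict 99% target
print(precision_threshold(judged, 0.8))   # looser target admits more items
```

Assertions scoring at or above the returned cut-off would be accepted automatically, and the rest left for further review.
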

Page 49: Text Analytic Summit 2010

/people/person assertions

threshold

53K /people/person assertions

Page 50: Text Analytic Summit 2010

Training Wheels?

Semantics are Everywhere

Page 51: Text Analytic Summit 2010
Page 52: Text Analytic Summit 2010
Page 53: Text Analytic Summit 2010

A Strong Tag for Food Inc.

http://movi.es/BVl43

Page 54: Text Analytic Summit 2010

Widgets: Content Tags

Page 55: Text Analytic Summit 2010
Page 56: Text Analytic Summit 2010
Page 57: Text Analytic Summit 2010
Page 58: Text Analytic Summit 2010

Explicit Semantics

Page 59: Text Analytic Summit 2010
Page 60: Text Analytic Summit 2010

Rich Snippets

<div class="post-item restaurant-gen-info hreview-aggregate">
<div class="item vcard">
<h1 class="fn org">Taylor's Refresher</h1>
<div class="address">
<div class="ratings">
<ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings">
<li class="current-rating average" style="width:80%;">4.0 star rating</li>
<li class="star">&nbsp;</li>
<li class="star">&nbsp;</li>
<li class="star">&nbsp;</li>
<li class="star">&nbsp;</li>
<li class="star">&nbsp;</li>
</ul>
<div class="rating-stats">
<span class="rating"> <span class="average">4.0</span> </span> rating over <span class="count">1</span> review
</div>
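
The class names in this markup (`fn`, `average`, `count`) are what make the rating machine-readable. A minimal sketch of pulling those values out with Python's standard `html.parser`, assuming a shortened copy of the slide's markup; the tag and class pairs in `WANTED` are tuned to this one snippet and are not a general microformats parser:

```python
from html.parser import HTMLParser


class RatingParser(HTMLParser):
    """Collect the text of a few (tag, class) pairs from review markup."""

    # Assumed mapping, chosen to fit the snippet on the slide.
    WANTED = {
        ("h1", "fn"): "name",
        ("span", "average"): "average",
        ("span", "count"): "count",
    }

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._pending = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for (want_tag, want_class), field in self.WANTED.items():
            if tag == want_tag and want_class in classes:
                self._pending = field

    def handle_data(self, data):
        text = data.strip()
        if self._pending and text:
            # Keep the first value seen for each field.
            self.fields.setdefault(self._pending, text)
            self._pending = None


# Shortened copy of the slide's markup.
snippet = """
<div class="hreview-aggregate">
  <h1 class="fn org">Taylor's Refresher</h1>
  <li class="current-rating average" style="width:80%;">4.0 star rating</li>
  <span class="rating"> <span class="average">4.0</span> </span>
  rating over <span class="count">1</span> review
</div>
"""

parser = RatingParser()
parser.feed(snippet)
print(parser.fields)
```

Matching on tag and class together keeps the `li` element, which also carries the `average` class, from overwriting the numeric value.
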

Page 61: Text Analytic Summit 2010
Page 62: Text Analytic Summit 2010

microformats

HTML5 MicroData

Open Graph Protocol

RDFa

Page 63: Text Analytic Summit 2010

Explicit Semantics in Surprising Places

Page 64: Text Analytic Summit 2010

Blog Tags::Entities

Page 65: Text Analytic Summit 2010

Metaweb Topic Block

Page 66: Text Analytic Summit 2010

Widget Microdata

<div class="fb-widget" id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420" style="border:0; outline:0; padding:0; margin:0; position:relative;" itemscope="" itemid="http://www.freebase.com/id/en/taylor_swift" itemtype="http://www.freebase.com/id/music/artist"> ..... </div>
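
Because the widget carries `itemscope`, `itemid`, and `itemtype` attributes, a crawler can recover the entity and its type without analyzing any text. A minimal sketch using Python's standard `html.parser`, assuming the widget markup from the slide with its elided inner content left empty:

```python
from html.parser import HTMLParser


class ItemScopeParser(HTMLParser):
    """Collect (itemid, itemtype) pairs from elements marked with itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "itemscope" in attr_map:
            self.items.append((attr_map.get("itemid"), attr_map.get("itemtype")))


# Widget wrapper from the slide; inner content omitted, style attribute dropped.
widget = (
    '<div class="fb-widget" id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420" '
    'itemscope="" itemid="http://www.freebase.com/id/en/taylor_swift" '
    'itemtype="http://www.freebase.com/id/music/artist"></div>'
)

parser = ItemScopeParser()
parser.feed(widget)
print(parser.items)
```

Each pair links a page to a graph identifier and a type, which is the kind of implicit hint the talk describes.
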

Page 67: Text Analytic Summit 2010

Thickening the Graph

Page 68: Text Analytic Summit 2010

"Vocabulary" Pattern

taw shooter marksman

marble marksman

photo: http://sarabbit.openphoto.net

http://wordnet.freebaseapps.com

Page 69: Text Analytic Summit 2010
Page 70: Text Analytic Summit 2010

E. Coli

Review (neighborhood) Pattern

Robert Kenner

Michael Pollan

Eric Schlosser

Page 71: Text Analytic Summit 2010
Page 72: Text Analytic Summit 2010