With over 12 million entities and 350 million relationships, Freebase is an excellent resource for performing text analysis. One way to look at document "understanding" is to think about how the entities in the document are connected on a knowledge graph. This is similar to the "reconciliation" process that is used to grow Freebase itself. The web is currently full of semantic hints, whether they are explicit (like those promoted by the Semantic Web) or implicit (like the use of blog widgets). Using these hints, text-analytic methods can get a toehold on the web corpus at large.
It's not what you said, it's how you said it.
Jamie Taylor, Ph.D.
Text Analytic Summit Boston 2010
What do y'all mean "Semantics"
The Web! Now with Better Flavor!
May 2001
Tim Berners-Lee, James Hendler and Ora Lassila
The Cake
taken from http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/layerCake-4.png
The Semantic Web?
Linked Open Data
The Real Web
http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
Wish it were real
Might be real
Is real, but don't believe it
Is currently useful
Entities
Identifiers
Bono, A.K.A. Paul David Hewson
http://rdf.freebase.com/ns/en.paul_david_hewson
Side Step Polysemy
Vocabulary
Manufactures
http://rdf.freebase.com/ns/automotive.make.model_s
A socially managed semantic database
Freebase has Many Types of Things
Many Strong Identifiers
http://www.ellerdale.com/topics/view/0080-6ba0
http://rdf.freebase.com/ns/en.berlin_wall
http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c
http://rdf.freebase.com/ns/authority.musicbrainz.7f347782-eb14-40c3-98e2-17b6e1bfe56c
http://www.bbc.co.uk/music/artists/7f347782-eb14-40c3-98e2-17b6e1bfe56c
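The three URLs above carry the same MusicBrainz identifier, which is what makes them joinable across sites. A minimal sketch, assuming the URL patterns on this slide generalize to other artists (the helper name `strong_ids` is hypothetical, not part of any Freebase or MusicBrainz API):

```python
import re

# Hypothetical helper: derive sibling identifiers from a MusicBrainz artist URL.
# The URL templates are copied from the slide; that they hold for every
# artist is an assumption.
MBID_RE = re.compile(r"^http://musicbrainz\.org/artist/([0-9a-f-]{36})$")

def strong_ids(musicbrainz_url):
    match = MBID_RE.match(musicbrainz_url)
    if match is None:
        raise ValueError("not a MusicBrainz artist URL: %r" % musicbrainz_url)
    mbid = match.group(1)
    return {
        "musicbrainz": musicbrainz_url,
        "freebase": "http://rdf.freebase.com/ns/authority.musicbrainz." + mbid,
        "bbc": "http://www.bbc.co.uk/music/artists/" + mbid,
    }

ids = strong_ids("http://musicbrainz.org/artist/7f347782-eb14-40c3-98e2-17b6e1bfe56c")
print(ids["freebase"])
```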
350 Million Relations
12 Million Entities
Users extend the data model
Users contribute data
schema = vocabulary
A range of vocabularies....
1500 types with 500+ instances!!
Growing Freebase
Reconciliation
+=
Reconciliation
Record Matching
Identity Matching
Collective Entity Resolution
Relational Learning
Record Linking
Equivalence Mining
Reconciliation
"Harrison Ford"
"Excuse Me"
"Vanity Fair"
"Harrison Ford"
"Excuse Me"
"Maytime"
Reconciliation
"Harrison Ford"
"Excuse Me"
"Vanity Fair"
"Harrison Ford"
"Fugitive"
"Blade Runner"
A Graph of Entities
Vocabulary
education
nationality
located
education
plays-in
plays-in
performed-at
created
released-by
located
contains
Reconciliation as "understanding"
education
nationality
located
education
plays-in
plays-in
performed-at
created
released-by
located
contains
http://data.labs.freebase.com/recon/
Query:
{
  "/type/object/name": "Blade Runner",
  "/type/object/type": "/film/film",
  "/film/film/starring/actor": ["Harrison Ford", "Rutger Hauer"],
  "/film/film/director": "Ridley Scott",
  "/film/film/release_date_s": "1981"
}
Response:
[
  {
    "id": "/guid/9202a8c04000641f8000000000009e89",
    "name": ["Blade Runner", "Bladerunner"],
    "score": 1.4320519,
    "match": true,
    "type": ["/common/topic", "/film/film", "/media_common/adapted_work", "/award/award_winning_work", ...... ]
  },
  {
    "id": "/guid/9202a8c04000641f80000000002643d0",
    "name": ["Blade"],
    "score": 0.48852453,
    "match": false,
    "type": ["/common/topic", "/film/film", "/award/award_winning_work", "/award/award_nominated_work", ....... ]
  },
  {
    "id": "/guid/9202a8c04000641f800000000e5daaae",
    "name": ["Blade"],
    "score": 0.46398318,
    "match": false,
    .....
Data Everywhere
Wikipedia Features
Wikipedia Features
Error Prone -- Usually <99%
WEX
calculate feature counts per type
intersect the two sources
generate type scores for topics
join feature counts with topics
extract features
get types
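The pipeline steps above can be sketched on toy data. Everything here is an assumption for illustration: the feature names, the topics, and especially the scoring rule (a sum of smoothed log-odds), since the slides name the steps but not the formula.

```python
import math
from collections import Counter, defaultdict

# Toy stand-ins for the two sources (all values hypothetical).
# "extract features": topic -> features seen in its Wikipedia article.
wiki_features = {
    "bono": {"infobox:musician", "cat:living_people"},
    "harrison_ford": {"infobox:actor", "cat:living_people"},
    "berlin_wall": {"infobox:structure", "cat:cold_war"},
    "jane_doe": {"cat:living_people"},  # untyped topic to be scored
}
# "get types": topic -> types already asserted.
freebase_types = {
    "bono": {"/people/person"},
    "harrison_ford": {"/people/person"},
    "berlin_wall": {"/architecture/structure"},
}

# "intersect the two sources": topics that have both features and types.
training = wiki_features.keys() & freebase_types.keys()

# "calculate feature counts per type"
counts = defaultdict(Counter)
totals = Counter()
for topic in training:
    for feature in wiki_features[topic]:
        totals[feature] += 1
        for type_id in freebase_types[topic]:
            counts[type_id][feature] += 1

def type_score(topic, type_id):
    # "join feature counts with topics" and "generate type scores":
    # sum the smoothed log-odds of each feature co-occurring with the type.
    score = 0.0
    for feature in wiki_features[topic]:
        with_type = counts[type_id][feature] + 0.5
        without_type = totals[feature] - counts[type_id][feature] + 0.5
        score += math.log(with_type / without_type)
    return score

print(type_score("jane_doe", "/people/person"))
print(type_score("berlin_wall", "/people/person"))
```

Positive scores favor the type and negative scores argue against it; a real system would need far more features per topic than this toy set.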
(Machine) Learning Semantics
37M features
2.4M features
1400 types
5M type assertions
2.8M Wikipedia topics
5M articles
1.6G scores
/people/person distribution
Legend: untyped topics, person topics, other topics, all topics
Data courtesy Viral Shah
RABJ: Humans in the loop
Thresholding Results
99% threshold at 16.75
/people/person assertions
threshold
53K /people/person assertions
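One plausible way to arrive at a cutoff like the 99% threshold above is to have people judge a sample of scored assertions and then take the lowest score at which the accepted set still meets the target precision. A minimal sketch under that assumption; the judged sample is made up, and the slides do not describe the actual procedure:

```python
def lowest_threshold(judged, target_precision=0.99):
    """Return the smallest score cutoff whose accepted set meets the target.

    judged: list of (score, is_correct) pairs from human review,
    assumed to have distinct scores. Returns None if no cutoff qualifies.
    """
    ranked = sorted(judged, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for accepted, (score, is_correct) in enumerate(ranked, start=1):
        correct += is_correct  # True counts as 1, False as 0
        if correct / accepted >= target_precision:
            best = score
    return best

# Hypothetical human judgments of /people/person assertions.
judged = [
    (25.0, True), (21.5, True), (18.0, True), (16.75, True),
    (12.0, False), (9.5, True), (3.0, False),
]
threshold = lowest_threshold(judged)
accepted = [score for score, _ in judged if score >= threshold]
print(threshold, len(accepted))
```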
Training Wheels?
Semantics are Everywhere
Widgets: Content Tags
Explicit Semantics
Rich Snippets
<div class="post-item restaurant-gen-info hreview-aggregate">
  <div class="item vcard">
    <h1 class="fn org">Taylor's Refresher</h1>
    <div class="address">
    <div class="ratings">
      <ul class="star-rating-2 rating" title="4.0 star rating across 3 ratings">
        <li class="current-rating average" style="width:80%;">4.0 star rating</li>
        <li class="star"> </li>
        <li class="star"> </li>
        <li class="star"> </li>
        <li class="star"> </li>
        <li class="star"> </li>
      </ul>
      <div class="rating-stats">
        <span class="rating"> <span class="average">4.0</span> </span>
        rating over <span class="count">1</span> review
      </div>
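Class names such as `fn`, `average`, and `count` in the markup above are what make it machine-readable. A minimal sketch of pulling those values out with Python's standard `html.parser`; the parser class and output field names are hypothetical, and the input is a shortened copy of the slide's markup:

```python
from html.parser import HTMLParser

class ReviewAggregateParser(HTMLParser):
    """Collect the first text node inside elements carrying selected class tokens."""

    # (tag, class token) -> output field; the field names are hypothetical.
    TARGETS = {
        ("h1", "fn"): "name",
        ("span", "average"): "average",
        ("span", "count"): "count",
    }

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._pending = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for token in classes:
            field = self.TARGETS.get((tag, token))
            if field and field not in self.fields:
                self._pending = field

    def handle_data(self, data):
        # Store the first non-blank text after a matching start tag.
        if self._pending and data.strip():
            self.fields[self._pending] = data.strip()
            self._pending = None

snippet = """
<div class="item vcard"><h1 class="fn org">Taylor's Refresher</h1></div>
<div class="rating-stats">
  <span class="rating"> <span class="average">4.0</span> </span>
  rating over <span class="count">1</span> review
</div>
"""
parser = ReviewAggregateParser()
parser.feed(snippet)
parser.close()
print(parser.fields)
```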
microformats
HTML5 MicroData
Open Graph Protocol
RDFa
Explicit Semantics in Surprising Places
Blog Tags::Entities
Metaweb Topic Block
Widget Microdata
<div class="fb-widget"
     id="fbtb-9a1f44348ad145b5b7d7d7d2376b0420"
     style="border:0; outline:0; padding:0; margin:0; position:relative;"
     itemscope=""
     itemid="http://www.freebase.com/id/en/taylor_swift"
     itemtype="http://www.freebase.com/id/music/artist">
  .....
</div>
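The `itemid` and `itemtype` attributes on the widget above give a crawler an entity identifier and a type without any text analysis. A minimal sketch of harvesting them with Python's standard `html.parser` (the class name is hypothetical; the widget string keeps only the attributes relevant here):

```python
from html.parser import HTMLParser

class ItemScopeParser(HTMLParser):
    """Record (itemid, itemtype) for every element that declares an itemscope."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.items.append((attributes.get("itemid"), attributes.get("itemtype")))

widget = (
    '<div class="fb-widget" itemscope="" '
    'itemid="http://www.freebase.com/id/en/taylor_swift" '
    'itemtype="http://www.freebase.com/id/music/artist"></div>'
)
parser = ItemScopeParser()
parser.feed(widget)
parser.close()
print(parser.items)
```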
Thickening the Graph
"Vocabulary" Patterntaw shooter marksman
marble marksman
photo: http://sarabbit.openphoto.net
http://wordnet.freebaseapps.com
E. Coli
Review (neighborhood) Pattern
Robert Kenner
Michael Pollan
Eric Schlosser