Building knowledge graphs in DIG
Pedro Szekely and Craig Knoblock
University of Southern California, Information Sciences Institute
dig.isi.edu
Goal
USC Information Sciences Institute CC-By 2.0 2
from: raw • messy • disconnected (hard to query, analyze & visualize)
to: clean • organized • linked (easy to query, analyze & visualize)
Use Case: Human Trafficking
100 million pages, ~100 Web sites
help victims, prosecute traffickers
Salient Statistics on Human Trafficking
• Profits per year: $32 billion
• Average age of entry to prostitution in the US: 14
• Pimp's profit per victim per year: $150,000
• Advertising budget on the Web: $45 million
Task: Tracking the Victim’s Locations
> 100 million pages advertising adult services
Example: Investigating a Reported Victim
San Diego, where else?
Steps To Build a DIG
• Data Acquisition: crawling
• Feature Extraction: extraction
• Feature Alignment: mapping to ontology (schema.org, geonames)
• Entity Resolution: entity linking & similarity
• Graph Construction: knowledge graph deployment (ElasticSearch, GraphDB)
• User Interface: query & visualization
Data Acquisition
downloading relevant data
batch • real-time
Web pages • Web service • database • CSV • Excel • XML • JSON
Feature Extraction
from raw sources to structured data
• trainable text extractors
• extraction from structured Web pages
• image features
• PDF extractor
Feature Extraction from Text
“YOU don't wanna miss out on ME :) Perfect lil booty Green eyes Long curly black hair Im a Irish, Armenian and Filipino mixed princess :) ❤ Kim ❤7○7~7two7~7four77 ❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses”
name: Kim
eye-color: green
hair-color: black
phone: 707-727-7477
rate: $60/15min, $80/30min, $120/60min
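To give a feel for what a hand-written, regular-expression style baseline has to cope with in the ad above, here is a minimal sketch that normalizes the obfuscated phone number and pulls out the rates. It is not DIG's trainable extractor; the function names, the substitution table and the pattern are assumptions made for this example only.

```python
import re

# Hypothetical substitution table: spelled-out digits and a look-alike glyph.
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "○": "0",
}

# Matches phrases such as "HH 80 roses", "Hour 120 roses", "15 mins 60 roses".
RATE_RE = re.compile(r"\b(hh|hour|\d+\s*mins?)\s+(\d+)\s+roses", re.IGNORECASE)


def normalize_phone(fragment):
    """Return a 10-digit phone as NNN-NNN-NNNN, or None if it cannot be read."""
    text = fragment.lower()
    for word, digit in DIGIT_WORDS.items():
        text = text.replace(word, digit)
    digits = re.sub(r"\D", "", text)  # drop separators such as "~"
    if len(digits) != 10:
        return None
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def extract_rates(ad_text):
    """Return {duration_in_minutes: price}; 'HH' is read as half an hour."""
    rates = {}
    for unit, price in RATE_RE.findall(ad_text):
        unit = unit.lower()
        if unit == "hh":
            minutes = 30
        elif unit == "hour":
            minutes = 60
        else:
            minutes = int(re.match(r"\d+", unit).group())
        rates[minutes] = int(price)
    return rates


phone = normalize_phone("7○7~7two7~7four77")
rates = extract_rates("❤ HH 80 roses ❤ Hour 120 roses ❤ 15 mins 60 roses")
print(phone, rates)
```

Rules like these are brittle: every new obfuscation trick needs another table entry or pattern, which is the motivation for trainable extractors.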
Performance of CRF Extractors
(bar charts, values in percent: regular expressions vs. DIG)

Attribute  Extractor            Precision  Recall  F
Eyes       Regular expressions  80         10      18
Eyes       DIG                  99         91      94
Hair       Regular expressions  80         6       12
Hair       DIG                  99         73      84
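The F values in the charts are consistent with the usual balanced F-measure, the harmonic mean of precision and recall. The sketch below recomputes it, assuming that definition and using the precision and recall values as they appear to read from the charts. Because the chart values are rounded, a recomputed F can differ from the reported one by about a point.

```python
def f_measure(precision, recall):
    """Balanced F-measure (harmonic mean); inputs and result are in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as read from the charts.
reported = {
    ("Eyes", "Regular expressions"): (80, 10),
    ("Eyes", "DIG"): (99, 91),
    ("Hair", "Regular expressions"): (80, 6),
    ("Hair", "DIG"): (99, 73),
}

for (attribute, extractor), (p, r) in reported.items():
    print(attribute, extractor, round(f_measure(p, r), 1))
```

The low F of the regular-expression baseline comes almost entirely from recall: high precision cannot compensate for missing most mentions.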
Automated Extraction
input: a pile of pages
Classify by Templates
pages clustered by template
Infer Extractor (one per template cluster)
extractor
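The "classify by templates" step can be sketched as follows. This is a simplified stand-in rather than DIG's algorithm: it assumes that pages generated from the same template share exactly the same set of HTML tag paths, and it groups pages by that signature. The class and function names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def template_signature(html):
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def cluster_by_template(pages):
    """Group page ids whose tag-path signatures are identical."""
    clusters = defaultdict(list)
    for page_id, html in pages.items():
        clusters[template_signature(html)].append(page_id)
    return list(clusters.values())


pages = {
    "a1": "<html><body><h1>Ad 1</h1><div><span>Kim</span></div></body></html>",
    "a2": "<html><body><h1>Ad 2</h1><div><span>Lila</span></div></body></html>",
    "b1": "<html><body><table><tr><td>Price</td></tr></table></body></html>",
}
clusters = cluster_by_template(pages)
print(clusters)
```

Exact signature equality is the simplest possible choice. Real pages from one template vary (optional blocks, repeated rows), so a practical version would compare signatures with a similarity threshold instead.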
Extraction Evaluation
10 websites, 5 pages each

Field        Perfect       Pretty Good
Title        1.0 (50/50)   1.0 (50/50)
Desc         .76 (37/49)   .98 (48/49)
Seller       .95 (40/42)   .95 (40/42)
Date         .83 (40/48)   .83 (40/48)
Price        .87 (39/45)   .98 (44/45)
Loc          .51 (23/45)   .84 (38/45)
Cat          .68 (34/50)   .88 (44/50)
MemberSince  1.0 (35/35)   1.0 (35/35)
Expires      .52 (15/29)   .55 (16/29)
Views        .76 (19/25)   1.0 (25/25)
ID           .97 (35/36)   1.0 (36/36)
Feature Alignment
from multiple schemas to a common domain schema
Multiple Schemas
- CSV, Excel
- Database tables
- Web services
- Extractors

- Nomenclature
- Spelling
Karma: Mapping Data to Ontologies
Inputs: Services, Relational Sources, Hierarchical Sources
Ontology: Schema.org
Output: {JSON-LD}
karma.isi.edu
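The idea of aligning several source schemas to one domain schema and emitting JSON-LD can be illustrated with a small sketch. This is not Karma's API (Karma models are built in its own tool); the source names, the field names on the left-hand side and the `align` function are invented for the example. The target property names follow the graph examples in this deck and have not been checked against schema.org.

```python
import json

# Hypothetical per-source models: local field name -> common schema property.
SOURCE_MODELS = {
    "extractor_a": {"name": "name", "eye-color": "eyeColor", "hair-color": "hairColor"},
    "csv_b": {"Name": "name", "Eyes": "eyeColor", "Hair": "hairColor"},
}


def align(record, source):
    """Rename the fields of one source record to the common schema."""
    model = SOURCE_MODELS[source]
    # Assumed context and type; the deck names Schema.org as the target ontology.
    doc = {"@context": "http://schema.org", "@type": "Person"}
    for field, value in record.items():
        if field in model:  # unmapped fields are dropped in this sketch
            doc[model[field]] = value
    return doc


a = align({"name": "Kim", "eye-color": "green", "hair-color": "black"}, "extractor_a")
b = align({"Name": "Kim", "Eyes": "green", "Hair": "black", "Row": 17}, "csv_b")
print(json.dumps(a, indent=2))
```

After alignment the two records are identical even though their sources disagreed on nomenclature and spelling, which is what makes the later entity resolution step possible.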
Karma Solves Feature Alignment
Provenance Domain Schema
took ~30 minutes to align the output of the Stanford name extractor
Feature Alignment Statistics
• 5 contractors provided data
• ~15 datasets
• > 30 Karma models
• > 200 million records
• 1 hour processing in 20 node Hadoop cluster
Entity Resolution
merging records that refer to the same entity
missing data
incorrect data
scale (~50 million records)
currently working on techniques to address these issues
Entity Resolution on Strong Attributes
Offer-1
  itemProvided: AdultService-1
  seller: Person-1
  availableAt: Santa Barbara
  price: 250/hour
  startDate: 2014-12-07
AdultService-1
  name: Jessica
  eyeColor: blue
  hairColor: red
Person-1
  phone: 619-319-7315
Offer-2
  itemProvided: AdultService-2
  seller: Person-2
  availableAt: Washington DC
  price: 250/hour
  startDate: 2014-05-28
AdultService-2
  name: Jessica
  eyeColor: blue
Person-2
  phone: 619-319-7315 (shared with Person-1)
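Resolution on a strong attribute such as a phone number amounts to merging every record that shares a value of that attribute. A minimal sketch using union-find follows. It assumes, as the example graph suggests, that the two sellers share the phone number 619-319-7315. The function name and the third record are hypothetical, and this is not presented as DIG's implementation.

```python
from collections import defaultdict


def resolve_by_strong_attributes(records, attributes=("phone",)):
    """Group record ids that share a value of any strong attribute."""
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (attribute, value) -> first record id carrying it
    for rid, rec in records.items():
        for attr in attributes:
            value = rec.get(attr)
            if value is None:  # missing data: nothing to link on
                continue
            key = (attr, value)
            if key in first_seen:
                union(first_seen[key], rid)
            else:
                first_seen[key] = rid

    groups = defaultdict(list)
    for rid in records:
        groups[find(rid)].append(rid)
    return sorted(groups.values())


records = {
    "Person-1": {"phone": "619-319-7315"},
    "Person-2": {"phone": "619-319-7315"},
    "Person-3": {"phone": "000-000-0000"},  # hypothetical unrelated record
}
groups = resolve_by_strong_attributes(records)
print(groups)
```

Union-find is used instead of a plain group-by so that several strong attributes can be combined: two records end up in the same group if they are connected through any chain of shared values.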
Linking Using Text Similarity
E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S
L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S
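The three ads above differ mainly in the name and in decorative spacing, so a similarity measure that ignores case, punctuation and whitespace scores them as near-duplicates. The sketch below uses Jaccard similarity over character 4-grams as a simple stand-in. The deck does not say which text similarity DIG uses, and the function names and the n-gram size are assumptions.

```python
import re


def normalize(text):
    """Lower-case and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def shingles(text, n=4):
    """Set of character n-grams of the normalized text."""
    s = normalize(text)
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b, n=4):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


EMILY = ("E M I L Y SEXY. ** wHiTe/lATin girl ** bUsTy SWEET. "
         "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")
LAYLA = ("L A Y L A SEXY. ** wHiTe girl ** bUsTy SWEET. "
         "LoTs Of fUn. Call Me. O____U____T____C___A___L____L____S")
LILA = ("L I L A SEXY. ** WhiTe girl ** bUsTy SWEET. "
        "LoTs Of fUn. Call Me. O_U_T_C___A___L_L_S")

for left, right in [(LAYLA, LILA), (EMILY, LAYLA), (EMILY, LILA)]:
    print(round(jaccard(left, right), 2))
```

Pairs scoring above a chosen threshold can then be linked as likely written by the same author. The threshold itself would need tuning on labeled data.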
Linking Using Image Similarity
100 million images
Technology: deep learning
Unsupervised Collective Entity Resolution
(same Offer / Person / AdultService graph as in the strong-attribute example, annotated with the resolved links)
same victim
same trafficker
Graph Construction
assembling the data for efficient query & analysis
- ElasticSearch: scalable, efficient query
- graph databases: network analytics
- NoSQL: scalable analytics

- bulk loading: massive data imports
- real-time updates: live, changing data
ElasticSearch Data Model
AdultService • Offer • Person • Phone • Web Page
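One plausible reading of this data model is that each indexed type is stored as a JSON document with its related nodes embedded, so that a query needs no joins. The sketch below illustrates that idea only. The deck does not describe DIG's actual index layout; the `denormalize` function is hypothetical, and the triples are the edges of the earlier example graph as read from the slide.

```python
import json

# (subject, property, object) triples of a small example graph.
TRIPLES = [
    ("Offer-1", "itemProvided", "AdultService-1"),
    ("Offer-1", "seller", "Person-1"),
    ("Offer-1", "availableAt", "Santa Barbara"),
    ("Offer-1", "price", "250/hour"),
    ("Person-1", "phone", "619-319-7315"),
    ("AdultService-1", "name", "Jessica"),
    ("AdultService-1", "eyeColor", "blue"),
]


def denormalize(root, triples, depth=2):
    """Embed the neighbourhood of `root` into one nested document.

    Multi-valued properties are not handled: a later value overwrites an
    earlier one in this sketch.
    """
    subjects = {s for s, _, _ in triples}
    doc = {"uri": root}
    for s, p, o in triples:
        if s != root:
            continue
        if o in subjects and depth > 0:
            doc[p] = denormalize(o, triples, depth - 1)  # embed related node
        else:
            doc[p] = o  # literal value, or depth limit reached
    return doc


offer_doc = denormalize("Offer-1", TRIPLES)
print(json.dumps(offer_doc, indent=2))
```

The trade-off is the usual one for denormalized indexes: lookups touch a single document, at the cost of storing shared nodes several times and re-indexing them when they change.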
Indexing for High Performance Knowledge Graph Queries
(chart: average query times in milliseconds, single-user query load, 1.2 billion triples)
State of the Art Graph Database (RDF) vs. DIG indexing deployed in ElasticSearch
DIG Deployment for Human Trafficking
- 100 million Web pages
- Live updates (~5,000 pages/hour)
- ElasticSearch database (7 nodes)
- Hadoop workflows (20 nodes)

- District Attorney
- Law Enforcement
- NGOs

Deployed to 6 Law Enforcement Agencies and Successfully Used to Prosecute Traffickers
DIG Applications
• Human Trafficking: large, real users
• Material Science Research: 70,000 paper abstracts (built in 1 week)
• Arms Trafficking: identify illegal sales
• Patent Trolls: identify patent trolls
• Cyber Attacks: predict cyber attacks from dark web data
Conclusions
• Complete tool-chain to build domain-specific knowledge graphs
• Integrates heterogeneous data: web pages, databases, CSV, web APIs, images, etc.
• Scales to ~100 million pages, ~3 billion facts
• Deployed to law enforcement