Upload
kaiser
View
34
Download
0
Embed Size (px)
DESCRIPTION
Schema Free Querying of Semantic Data. Lushan Han Advisor: Dr. Tim Finin May 23, 2014. Introduction Related Work SFQ Interface Schema Network and Association Models Query Interpretation Evaluation Conclusion. Road Map. Part 1. Introduction. Semantic Data. - PowerPoint PPT Presentation
Citation preview
Schema Free Querying of Semantic Data
Lushan Han Advisor: Dr. Tim Finin
May 23, 2014 1
Introduction Related Work SFQ Interface Schema Network and Association Models Query Interpretation Evaluation Conclusion
Road Map
2
Part 1. Introduction
3
Semantic Data
A network of entities, which are annotated with types and interlinked with properties.
Increasing amount of Semantic Data
Examples: RDF semantic data LOD DBpedia Freebase
4
Objectives
Develop schema-free query interfaces Works with “semantic data” in many forms, e.g., RDF, Freebase,
RDBMS Allow casual users to freely query semantic data without learning
its schema Queries should be in the user’s conceptual world
Two existing interfaces: Natural Language Interface (NLI) Keyword Interface
Three hard problems
5
P1. No Practical Interface
Natural language interface NLP techniques are still not reliable to parse out the full relational
structure from natural language questions
Keyword interface Ambiguity and limited expressiveness
(e.g. “president children spouse”)
(e.g. Who was the author of the Adventures of Tom Sawyer and where was he born?)
6
SFQ Interface
Still in the user’s conceptual world Make implicit structure of NL questions explicit
Who was the author of the Adventures of Tom Sawyer and where was he born?
7
P2. Semantic Heterogeneity Problem
Many different ways to express (model) the same meaning
Vocabulary and structure mismatches between the user’s query and the machine’s representation
Existing methods: Labor-intensive and ad-hoc methods
Domain-specific syntactic or semantic grammars Mapping Lexicons (Mapping rules) Templates
Thesaurus (e.g. WordNet) is insufficient
8
P2. Examples
9
P2. More Examples
4 5
10
A purely computational approach
Lexical Semantic similarity Measures Capture flexible semantics
Statistical Association Measures Carry out disambiguation
A novel “overall semantic similarity” or fitness metric that combines Lexical semantic similarity measures statistical association measures structure features
Context-sensitive mapping algorithms
11
P3. Heterogeneous or unknown schema
Hard to reach consensus on a schema for the world
Open domain semantic data has heterogeneous or even unknown schema (e.g. Semantic Web data, DBpedia)
Traditional NLI systems are difficult to apply
Some modern systems Not produce formal queries (e.g. SQL or SPARQL). Directly search into the entity network for matchings
Computationally expensive and has ad-hoc natures
12
The schema network
Learn a schema statistically from the entity network by exploiting co-occurrences. The schema itself is also represented as a network
Mapping the user’s query into the schema network, instead of the entity network. Much more scalable Produce formal queries Enable joint disambiguation and context-sensitive mapping
algorithm
13
Thesis Statement
We can develop an effective and efficient algorithm to map a casual user's schema-free query into a formal knowledge base query language that overcomes vocabulary and structure mismatch problems by exploiting lexical semantic similarity measures, association degree measures and structural features.
14
Contributions
An intuitive SFQ interface that avoids the problem of extracting relations structure from NL queries
Novel algorithms mapping SFQ queries to KB queries addressing both vocabulary and structure mismatches
A novel approach to handle heterogeneous or unknown schemas by building a schema from an entity network
Define the probability of observing a path in a schema network and develop two novel statistical association models
An improved PMI metric and new semantic text similarity measures and algorithms
15
Part 2. Related Work
16
Natural Language Interface to Database (NLIDB) Systems
Early Systems in 70s, (e.g. LUNAR and LADDER) Domain-specific syntactic or semantic grammars Heavily customized to a particular application
Later systems in 80s and 90s. (e.g. TEAM, ASK, MASQUE) More general parser Require human-crafted lexicons, mapping rules and domain
knowledge to interpret the parse tree Allow knowledge engineers or end users to enrich lexicons and
add new mapping rules through an interactive interface More portable than early systems
17
Recent NLI SystemsSystem Data NL Parsing Vocabulary
MismatchStructureMismatch
Auto-matic
Limitations
PRECISE DB Tokenizer to get a collection of tokens Lexicon
bipartite matching
Yes• Very restricted domains
SCISSOR KB Semantic parser Machine learning Yes • Very restricted domains• Manually annotated training data
NaLIX XML Dependency parser to get adjacent tokens Lexicon
Adjacency matching
No• Restricted domains
ORAKEL RDF Syntactic parser to get logical lambda-calculus query Lexicon Lexicon No • Restricted domains
• Simple NL questions
FREyA RDF Syntactic parser to get a collection of terms Lexicon No No • Restricted domains
Aqualog RDF Shallow parsing and pattern rules to get relations Lexicon No No • Restricted domains
PANTO RDFSyntactic parser and a head-driven algorithm to get relations
Lexicon No Yes• Very restricted domains• Simple NL questions
True Knowledge(Evi)
KB1,200 templatesA very large repository of query rephrasing
Lexicon, 1,200 templates and a very large repository of query rephrasing
Yes
• Extremely laborious
PowerAqua RDF Shallow parsing and pattern rules to get relations Lexicon
partial matching
Yes• Directly match into the entity network• Not produce formal queries
Treo RDF Syntactic parser to get a ordered list of terms
Semantic similarity
No Yes • Directly match into the entity network• Not produce formal queries• Queries must be a single path • Produce no exact answers but triple paths that may contain the answers
18
Part 3. SFQ Interface
19
SFQ Examples1. Where was the author of the Adventures of Tom Sawyer born?
2. Give me authors in the CIKM conference
3. A more complicated one
20
Default Relations
The relation name can be left out
A stop word list for filtering relation names with words like in, of, has, from, belong, part of, locate and etc.
21
Envisioned Web Interface
22
Output (1)
23
Output (2)
24
Part 4. Schema Network and Association Models
25
Instance Data (ABox)
Two datasets The relation dataset (all relations between instances) The type dataset (all type definitions for instances)
Integrate all RDF data types into five types that are familiar to users ˆNumber, ˆDate, ˆYear, ˆText and ˆLiteral ˆLiteral is the super type of the other four
We use DBpedia for examples in the following slides
26
Automatically enrich the set of types
Automatically deduce types from relations Infer attribute types from data type properties
e.g. <Beijing>, population, “20693000” => ˆPopulation
Infer classes from object properties e.g. < Zelig>, director, <Woody Allen> => ˜Director
27
Counting Co-occurrence
28
The Schema Network
A statistical meta description of the underlying entity network, which is a network itself.
29
The Schema Path
A path on the schema network is called a schema path
A schema path P represents a composite relation
Example 1.
Example 2.
30
The Schema Path Probability
Measure the reasonableness of a path
The probability of “observing” a path on the schema network
(A1) we select the starting node c0 of the path randomly from all the nodes in the schema network
(A2) observe the path in a random walk starting with c0
31
Compute Transition Probability
0 ≤ ≤ 1
32
A Property about Schema Path
A schema path P and its return path P’ represent the same relation.
Given a schema path P and its return path P’ we have P(P ) = P(P’).
P
P’
33
Schema Path Model
Supposed to store and index all the schema paths with a length no larger than a given threshold and their probabilities
The only supported function is to return all the schema paths and their probabilities between two given classes.
Put in memory for fast computation
34
Schema Path Model Optimization
35
Concept Path
Group all the edges with the same direction between two nodes into a single edge
By analogy to schema path, we have concept path probability
Concept path frequency
36
Concept Association Knowledge (CAK) model
Pairwise associations (i) direct association between classes and properties (ii) indirect association between two classes
PMI measure
Our improved PMI measure
37
Concept Association Knowledge (CAK) model
Direct association between a directed class and a property p
Indirect association between two directed classes
38
CAK Examples
39
PMI* vs PMIThe most associated property for “Person” in DBpedia
PMI* PMI
40
Part 5. Query Interpretation
41
SFQ Interpretation
42
Two Phase Mapping Algorithm
43
Generating Candidates via Lexical Semantic Similarity
Disambiguation via Optimization
Concept Mapping Optimization Problem
46
A joint disambiguation example
Time Complexity of Concept Mapping Algorithm
A straightforward concept mapping algorithm
After exploiting locality – the optimal mapping choice of a property can be determined locally when the two classes it links are fixed
48
Relation Mapping Optimization Problem
H* : the set of top k3 concept mapping hypotheses The reduced mapping space for the SFQ
The optimization problem
49
Computing the fitness of a mapping σ on a relation r
Let
Two features and one parameter β Joint lexical semantic similarity between and P The schema path frequency of P The parameter β adjusts the relative importance of the two
features
50
Align terms in P to terms in r
The relation
The path C = <c0, c1, …, cl-1, cl>
P = <p0, p1, …, pl-1>
We already know and are paired with c0 and cl
We ignore all the intermediate classes c1, …, cl-1 Semantics in c1, …, cl-1 is overlapped with that in p0, p1, …, pl-1
the less terms we join, the less likely errors can occur
The only unaligned terms are and p0, p1, …, pl-1
51
Semantic Stretch and Heterogeneous Alignments
52
Cutting Function and Cutting Objective Function
Each cutting defines a function, referred to as ω
The product of similarity of the pairs in the minimum pair set that covers and every
53
Cutting Optimization Problem and a Greedy Algorithm
Cutting Optimization
The cutting space Ω has a size of Total running time
Greedy Algorithm: SmartCutter that run in First find the property in P that is the most similar to and
assume it is in the predicate region. Stretch to the left until we meet a property u that is more similar
to the and stretch to the right until meeting a property v that is more similar to the
54
Joint Lexical Semantic Similarity
tends to be biased towards small l, short paths. α is a parameter in the range [0..1].
High similarities in the subject and object regions but low similarities in the predicate region can still have a fairly high
The joint lexical semantic similarity between and P
55
Deal with Default Relation
Combining and
θ is a parameter in the range [0, ).
56
Formal Query Generating and Entity Matching
57
Part 5. Evaluation
58
Evaluation Settings Two very different datasets
DBLP+ DBLP augmented with data from CiteSeerX and ArnetMiner (narrow domain) DBpedia Structured data in Wikipedia (open domain)
Three similarity measures LSA semantic similarity (purely statistical) Hybrid semantic similarity (LSA + WordNet) String similarity (bigrams + Dice coefficient)
Performance metrics Mean Reciprocal Rank (how high the first correct interpretation are in the top-10 list) Mean Precision and Recall (evaluate the answers produced by the SPARQL queries)
Test environment PC with 2.33GHz Intel Core2 CPU and 8GB memory
59
DBLP+ Dataset Statistics
60
Degree of connectivity 18 x 18 class pairs resulted from pairing every C with every C
61
Degree of connectivity Degree of connectivity
distribution of connectivity degree when distance = 1 distribution of connectivity degree when distance ≤ 3
Num
ber
of c
lass
pai
rs
Num
ber
of c
lass
pai
rs
DBLP+ Query Set 64 test questions
31 Direct Single (DS) questions (e.g. Give me author x of the paper y ) 15 Indirect Single (IS) questions (e.g. Show person x who cites the person y ) 8 Direct Multiple (DM) questions (e.g. List person x who published the book y with
ISBN z ) 10 Indirect Multiple (IM) questions (e.g. List the institutions u of the author y with
whom the person x from the organization z has co-authored )
Rephrased to 220 SFQ queries for example, rephase “Give me author x of the paper y” to seven SFQ queries
62
Resolving Parameters
Use sufficiently large numbers to set k1, k2 and k3
k1 = 10 (the size of the class candidates list )
k2 = 20 (the size of the property candidates list )
k3 = 40 (the number of top hypotheses returned by the concept mapping phase)
Resolving α, β, γ, and θ First tune α and γ while fixing β = 0 and θ = 1 Next, tune β while still fixing θ = 1 Last, tune θ
63
Results of Tuning Parameters
Top-10 coverage of 220 SFQ hybrid 99.5% LSA 98.2% string 56.4%
64
Cross-Validation
Using all the queries:
65
DBpedia Dataset Statistics
66
Degree of connectivity 249 x 249 class pairs resulted from pairing every C with every C
67
connectivity degree among 249 classes when distance ≤ 2connectivity degree among 249 classes when distance = 1
degr
ee o
f co
nnec
tivity
degr
ee o
f co
nnec
tivity
DBpedia Query Set 2011 QALD (QA over Linked Data) workshop
50 training and 50 test questions on DBpedia 3.6 ground truth answers
33 questions from 50 QALD test questions that can be answered using only the DBpedia ontology Modify 7 questions due to unsupported operations and 1 question due to data
issue but we preserve the relational structure and vocabulary of the questions 27 DS questions (e.g. Which river does the Brooklyn Bridge cross?) 6 DM questions that contains two relations (e.g. Give me the official websites of
actors of the television show Charmed. )
Three graduate students who are unfamilar with DBpedia independently translated them into 99 SFQ queries .
68
Results
Top-10 coverage of 99 SFQ hybrid 88.9% LSA 82.8% string 51.5%
The coverage has an upper limit 91.9% 5 test cases due to ambiguty.
Translators’ interpretation changed the questions. 3 test cases due to an incorrect property name.
69
Generating SPARQL queries and answers
Use a non-empty strategy to automatically generate answers for a SFQ query Run SPARQL generated for the best interpretation of a SFQ query. If an empty result is returned, go to next interpretation and so on.
Results on 99 SFQ queries (33 NL questions) using the parameters learned on the DBLP+ dataset.
70
Compare with QALD Systems
Compare with two QALD systems on 30 test questions
Three questions are excluded because we made them easier by dropping the aggregation functions.
Among 30 questions, PowerAqua modified 8 questions and FREyA modified 4 questions.
71
Our system (hybrid) Our system (LSA) FREyA PowerAqua
Compare with QALD Systems
Compare with two QALD systems on 6 two-relation questions
Our Differences Both PowerAqua and FREyA use lexicons FREyA highly depends on the user’s interaction to perform mappings PowerAqua and FREyA tuned their systems on 50 training questions Both PowerAqua and FREyA use TBox data of DBpedia Ontology
72
Our system (hybrid) Our system (LSA) FREyA PowerAqua
Compare with Online Systems
Compare with two online systems on 33 test questions
Both True Knowledge and PowerAqua online systems include DBpedia data as part of their knowledge base.
73
Our system (hybrid) Our system (LSA) True Knowledge PowerAqua
Running time Comparison
QALD reported systems FREyA 36 seconds per question PowerAqua N/A
Online systems True Knowledge a few seconds PowerAqua 143.7 seconds
Our systems Hybrid 0.721 seconds LSA 0.766 seconds
Both True Knowledge and PowerAqua online systems include DBpedia data as part of their knowledge base.
74
Conclusion and Future Work The schema-free structured query approach allows people to query
semantic data without mastering formal queries or acquiring detailed knowledge of the classes.
Our system uses statistical data about lexical semantics and semantic data to generate most appropriate formal queries from a user’s intuitive query.
Our evaluation showed that the approach was both effective and efficient for two very different, large datasets
Our next step is to make the approach easier to apply to new RDF data collection and to a large LOD cloud and develop the envisioned web interface
75
Contributions The SFQ interface that works around the unsolved problem of parsing full
relational structure from natural language queries.
Novel context-sensitive and fully computation-based mapping algorithms that address both vocabulary and structure mismatch problems.
A novel approach to build a schema network from the entity network to deal with heterogenous or unknown schemas
Define the probability of observing a path on the schema network and develop two novel statistical association models
Improve a popular statistical association measure, PMI
Develop state of art and novel semantic simialrity measures
76
End
Thank you!!!Questions?
77