Schema Free Querying of Semantic Data

Lushan Han Advisor: Dr. Tim Finin

May 23, 2014

Road Map

Introduction
Related Work
SFQ Interface
Schema Network and Association Models
Query Interpretation
Evaluation
Conclusion

Part 1. Introduction


Semantic Data

A network of entities, which are annotated with types and interlinked with properties.

Increasing amount of Semantic Data

Examples: RDF semantic data, LOD, DBpedia, Freebase


Objectives

Develop schema-free query interfaces
• Work with “semantic data” in many forms, e.g., RDF, Freebase, RDBMS
• Allow casual users to freely query semantic data without learning its schema
• Queries should be in the user’s conceptual world

Two existing interfaces:
• Natural Language Interface (NLI)
• Keyword Interface

Three hard problems


P1. No Practical Interface

Natural language interface
• NLP techniques are still not reliable enough to parse out the full relational structure from natural language questions
  (e.g. Who was the author of the Adventures of Tom Sawyer and where was he born?)

Keyword interface
• Ambiguity and limited expressiveness
  (e.g. “president children spouse”)


SFQ Interface

Still in the user’s conceptual world
Makes the implicit structure of NL questions explicit

Who was the author of the Adventures of Tom Sawyer and where was he born?


P2. Semantic Heterogeneity Problem

Many different ways to express (model) the same meaning

Vocabulary and structure mismatches between the user’s query and the machine’s representation

Existing methods: labor-intensive and ad hoc
• Domain-specific syntactic or semantic grammars
• Mapping lexicons (mapping rules)
• Templates

A thesaurus (e.g. WordNet) is insufficient

P2. Examples


P2. More Examples


A purely computational approach

Lexical semantic similarity measures
• Capture flexible semantics

Statistical association measures
• Carry out disambiguation

A novel “overall semantic similarity” or fitness metric that combines
• lexical semantic similarity measures
• statistical association measures
• structure features

Context-sensitive mapping algorithms


P3. Heterogeneous or unknown schema

Hard to reach consensus on a schema for the world

Open domain semantic data has heterogeneous or even unknown schema (e.g. Semantic Web data, DBpedia)

Traditional NLI systems are difficult to apply

Some modern systems do not produce formal queries (e.g. SQL or SPARQL); they search the entity network directly for matches
• Computationally expensive and ad hoc in nature

The schema network

Learn a schema statistically from the entity network by exploiting co-occurrences
• The schema itself is also represented as a network

Map the user’s query into the schema network instead of the entity network
• Much more scalable
• Produces formal queries
• Enables joint disambiguation and a context-sensitive mapping algorithm

Thesis Statement

We can develop an effective and efficient algorithm to map a casual user's schema-free query into a formal knowledge base query language that overcomes vocabulary and structure mismatch problems by exploiting lexical semantic similarity measures, association degree measures and structural features.


Contributions

An intuitive SFQ interface that avoids the problem of extracting relational structure from NL queries

Novel algorithms mapping SFQ queries to KB queries addressing both vocabulary and structure mismatches

A novel approach to handle heterogeneous or unknown schemas by building a schema from an entity network

Define the probability of observing a path in a schema network and develop two novel statistical association models

An improved PMI metric and new semantic text similarity measures and algorithms


Part 2. Related Work


Natural Language Interface to Database (NLIDB) Systems

Early systems in the 70s (e.g. LUNAR and LADDER)
• Domain-specific syntactic or semantic grammars
• Heavily customized to a particular application

Later systems in the 80s and 90s (e.g. TEAM, ASK, MASQUE)
• More general parsers
• Require human-crafted lexicons, mapping rules and domain knowledge to interpret the parse tree
• Allow knowledge engineers or end users to enrich lexicons and add new mapping rules through an interactive interface
• More portable than early systems

Recent NLI Systems

System | Data | NL parsing | Vocabulary mismatch / Structure mismatch | Automatic | Limitations
PRECISE | DB | Tokenizer to get a collection of tokens | Lexicon / Bipartite matching | Yes | Very restricted domains
SCISSOR | KB | Semantic parser | Machine learning | Yes | Very restricted domains; manually annotated training data
NaLIX | XML | Dependency parser to get adjacent tokens | Lexicon / Adjacency matching | No | Restricted domains
ORAKEL | RDF | Syntactic parser to get a logical lambda-calculus query | Lexicon / Lexicon | No | Restricted domains; simple NL questions
FREyA | RDF | Syntactic parser to get a collection of terms | Lexicon / No | No | Restricted domains
Aqualog | RDF | Shallow parsing and pattern rules to get relations | Lexicon / No | No | Restricted domains
PANTO | RDF | Syntactic parser and a head-driven algorithm to get relations | Lexicon / No | Yes | Very restricted domains; simple NL questions
True Knowledge (Evi) | KB | 1,200 templates; a very large repository of query rephrasing | Lexicon, 1,200 templates and a very large repository of query rephrasing | Yes | Extremely laborious
PowerAqua | RDF | Shallow parsing and pattern rules to get relations | Lexicon / Partial matching | Yes | Directly matches into the entity network; does not produce formal queries
Treo | RDF | Syntactic parser to get an ordered list of terms | Semantic similarity / No | Yes | Directly matches into the entity network; does not produce formal queries; queries must be a single path; produces no exact answers, only triple paths that may contain the answers

Part 3. SFQ Interface


SFQ Examples

1. Where was the author of the Adventures of Tom Sawyer born?

2. Give me authors in the CIKM conference

3. A more complicated one

Default Relations

The relation name can be left out

A stop-word list filters relation names containing words such as in, of, has, from, belong, part of, and locate.
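The stop-word filter can be sketched as below. This is a minimal illustration, assuming a relation name counts as a default relation when it is empty or consists only of stop words. The list holds only the words named on the slide, and the function name `is_default_relation` is hypothetical.

```python
# Stop words named on the slide; the real list may well be longer (assumption).
STOP_WORDS = {"in", "of", "has", "from", "belong", "part", "locate"}

def is_default_relation(name: str) -> bool:
    """Return True when the relation name carries no content words."""
    words = name.lower().split()
    # An empty name has no words, so all() is True and it counts as default.
    return all(word in STOP_WORDS for word in words)

print(is_default_relation("part of"))   # True
print(is_default_relation("born in"))   # False
```

With this check, a user who writes only "in" or leaves the relation blank gets the default relation, while "born in" is kept as a real relation name.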


Envisioned Web Interface


Output (1)


Output (2)


Part 4. Schema Network and Association Models


Instance Data (ABox)

Two datasets
• The relation dataset (all relations between instances)
• The type dataset (all type definitions for instances)

Integrate all RDF data types into five types that are familiar to users
• ˆNumber, ˆDate, ˆYear, ˆText and ˆLiteral
• ˆLiteral is the super type of the other four

We use DBpedia for examples in the following slides


Automatically enrich the set of types

Automatically deduce types from relations
• Infer attribute types from data type properties
  e.g. <Beijing>, population, “20693000” => ˆPopulation
• Infer classes from object properties
  e.g. <Zelig>, director, <Woody Allen> => ˜Director
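The two inference rules can be illustrated with a small sketch. It assumes the inferred type is simply the capitalized property name with a prefix marking attribute types and inferred classes. ASCII `^` and `~` stand in for the slide's markers, and the function name `infer_type` is hypothetical.

```python
def infer_type(subject, prop, obj, obj_is_literal):
    """Deduce a type for the object of one triple from its property name."""
    label = prop[:1].upper() + prop[1:]
    # "^" marks an attribute type (literal value), "~" an inferred class.
    prefix = "^" if obj_is_literal else "~"
    return obj, prefix + label

print(infer_type("Beijing", "population", "20693000", True))
print(infer_type("Zelig", "director", "Woody Allen", False))
```

The first call types the literal value as `^Population`; the second types the entity Woody Allen as `~Director`.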


Counting Co-occurrence


The Schema Network

A statistical meta description of the underlying entity network, which is a network itself.
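One plausible way to derive such a network from the two datasets is to count, for every relation instance, the (subject class, property, object class) combinations it supports. The sketch below assumes exactly that counting rule. The toy triples and the name `build_schema_network` are made up for illustration.

```python
from collections import Counter

def build_schema_network(relations, types):
    """Count how often (subject class, property, object class) co-occur.

    relations: iterable of (subject, property, object) instance triples.
    types: dict mapping each instance to the set of its classes.
    """
    edges = Counter()
    for s, p, o in relations:
        for cs in types.get(s, ()):
            for co in types.get(o, ()):
                edges[(cs, p, co)] += 1
    return edges

# Toy relation dataset and type dataset (illustrative only).
relations = [
    ("Tom_Sawyer", "author", "Mark_Twain"),
    ("Zelig", "director", "Woody_Allen"),
    ("Annie_Hall", "director", "Woody_Allen"),
]
types = {
    "Tom_Sawyer": {"Book"},
    "Mark_Twain": {"Writer"},
    "Zelig": {"Film"},
    "Annie_Hall": {"Film"},
    "Woody_Allen": {"Person"},
}
schema = build_schema_network(relations, types)
print(schema[("Film", "director", "Person")])  # 2
```

Each counted triple becomes a weighted edge between two class nodes, so the result is itself a network that summarizes the entity network.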


The Schema Path

A path on the schema network is called a schema path

A schema path P represents a composite relation

Example 1.

Example 2.


The Schema Path Probability

Measure the reasonableness of a path

The probability of “observing” a path on the schema network

(A1) we select the starting node c0 of the path randomly from all the nodes in the schema network

(A2) observe the path in a random walk starting with c0


Compute Transition Probability

0 ≤ transition probability ≤ 1


A Property about Schema Path

A schema path P and its return path P’ represent the same relation.

Given a schema path P and its return path P’ we have P(P ) = P(P’).
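A small numeric sketch can make the path probability and the return-path property concrete. It rests on assumptions not stated on the slides: edges may be walked in either direction, the start node is drawn with probability proportional to its total edge weight, and each step picks an incident edge in proportion to its weight. Under these assumptions a path and its return path get the same probability. The toy edges and the name `path_probability` are illustrative.

```python
import math
from collections import defaultdict

# Toy schema network: (class_a, property, class_b, co-occurrence count).
EDGES = [
    ("Book", "author", "Writer", 6),
    ("Writer", "birthPlace", "City", 3),
    ("Film", "director", "Writer", 1),
]

degree = defaultdict(int)          # total edge weight incident to each class
for a, _, b, w in EDGES:
    degree[a] += w
    degree[b] += w
total = sum(degree.values())
weight = {(a, p, b): w for a, p, b, w in EDGES}

def path_probability(start, steps):
    """steps: list of (edge, next_class); an edge may be walked either way."""
    prob = degree[start] / total   # (A1) choose the starting node
    current = start
    for edge, nxt in steps:        # (A2) random walk along the path
        prob *= weight[edge] / degree[current]
        current = nxt
    return prob

e1 = ("Book", "author", "Writer")
e2 = ("Writer", "birthPlace", "City")
forward = path_probability("Book", [(e1, "Writer"), (e2, "City")])
backward = path_probability("City", [(e2, "Writer"), (e1, "Book")])
print(math.isclose(forward, backward))  # True
```

If the start node were chosen uniformly instead, the two directions would generally differ, which is why the degree-proportional start is assumed here.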


Schema Path Model

Stores and indexes all schema paths whose length is no larger than a given threshold, together with their probabilities

The only supported function is to return all the schema paths and their probabilities between two given classes.

Put in memory for fast computation


Schema Path Model Optimization


Concept Path

Group all the edges with the same direction between two nodes into a single edge

By analogy to schema path, we have concept path probability

Concept path frequency


Concept Association Knowledge (CAK) model

Pairwise associations
(i) direct association between classes and properties
(ii) indirect association between two classes

PMI measure

Our improved PMI measure
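For reference, the standard PMI measure can be computed from raw co-occurrence counts as sketched below. This shows only the textbook definition, log of p(x, y) / (p(x) p(y)). The improved variant is not reproduced because its formula is not given in the text. The log base and the function name `pmi` are choices made for this example.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information from raw counts (log base 2)."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# x and y co-occur ten times more often than independence would predict.
print(round(pmi(10, 20, 50, 1000), 3))  # 3.322
```

A value near zero means the two items co-occur about as often as chance predicts; large positive values indicate strong association.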


Concept Association Knowledge (CAK) model

Direct association between a directed class and a property p

Indirect association between two directed classes


CAK Examples


PMI* vs PMI

The most associated property for “Person” in DBpedia under PMI* and under PMI

Part 5. Query Interpretation


SFQ Interpretation


Two Phase Mapping Algorithm


Generating Candidates via Lexical Semantic Similarity
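Candidate generation can be sketched as ranking the schema vocabulary by similarity to each user term and keeping the top k. The example below substitutes a simple string measure (Dice coefficient over character bigram sets) for the semantic similarity measures, so it illustrates the ranking step only. The names `dice_similarity` and `top_candidates` and the toy class list are hypothetical.

```python
import heapq

def bigrams(word):
    """Set of character bigrams of a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over character bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

def top_candidates(term, vocabulary, k):
    """Return the k vocabulary labels most similar to the user's term."""
    return heapq.nlargest(k, vocabulary,
                          key=lambda label: dice_similarity(term, label))

classes = ["Writer", "Book", "City", "WrittenWork"]
print(top_candidates("writers", classes, 2))  # ['Writer', 'WrittenWork']
```

A purely string-based measure cannot relate terms such as "author" and "writer", which is the gap the semantic similarity measures are meant to close.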

Disambiguation via Optimization

Concept Mapping Optimization Problem


A joint disambiguation example

Time Complexity of Concept Mapping Algorithm

A straightforward concept mapping algorithm

After exploiting locality – the optimal mapping choice of a property can be determined locally when the two classes it links are fixed
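The locality idea can be sketched as follows: enumerate only the class assignments, and for each relation pick the best property on its own once its two classes are fixed. The sketch assumes an additive fitness and that each property term belongs to exactly one relation; neither assumption comes from the slides. The names `best_mapping`, `TABLE` and the toy scores are illustrative.

```python
from itertools import product

def best_mapping(class_cands, relations, prop_cands, score):
    """class_cands: list of candidate lists, one per class term.
    relations: list of (subject_index, property_index, object_index).
    prop_cands: list of candidate lists, one per property term.
    score(cs, p, co): fitness of property p between classes cs and co.
    """
    best = (float("-inf"), None, None)
    for classes in product(*class_cands):
        total, props = 0.0, []
        for s, r, o in relations:
            # Locality: with both classes fixed, choose this property alone.
            p = max(prop_cands[r],
                    key=lambda c: score(classes[s], c, classes[o]))
            total += score(classes[s], p, classes[o])
            props.append(p)
        if total > best[0]:
            best = (total, classes, props)
    return best

# Toy fitness table (illustrative values only).
TABLE = {("Book", "author", "Writer"): 0.9,
         ("Film", "director", "Writer"): 0.4}
score = lambda cs, p, co: TABLE.get((cs, p, co), 0.0)

result = best_mapping([["Book", "Film"], ["Writer", "City"]],
                      [(0, 0, 1)],
                      [["author", "director"]],
                      score)
print(result[1], result[2])  # ('Book', 'Writer') ['author']
```

Without locality, every combination of class and property candidates would be enumerated jointly; here the property choices are made one relation at a time, so they add to the cost per class assignment instead of multiplying it.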


Relation Mapping Optimization Problem

H*: the set of top k3 concept mapping hypotheses
• The reduced mapping space for the SFQ

The optimization problem


Computing the fitness of a mapping σ on a relation r

Let

Two features and one parameter β
• Joint lexical semantic similarity between r and P
• The schema path frequency of P
• The parameter β adjusts the relative importance of the two features

Align terms in P to terms in r

The relation

The path C = <c0, c1, …, cl-1, cl>

P = <p0, p1, …, pl-1>

We already know that the subject and object of r are paired with c0 and cl

We ignore all the intermediate classes c1, …, cl-1
• The semantics of c1, …, cl-1 overlap with those of p0, p1, …, pl-1
• The fewer terms we join, the less likely errors are to occur

The only unaligned terms are the predicate of r and p0, p1, …, pl-1

Semantic Stretch and Heterogeneous Alignments


Cutting Function and Cutting Objective Function

Each cutting defines a function, referred to as ω

The product of similarity of the pairs in the minimum pair set that covers and every


Cutting Optimization Problem and a Greedy Algorithm

Cutting Optimization

The cutting space Ω has a size of Total running time

Greedy algorithm: SmartCutter
• First find the property in P that is the most similar to the predicate of r and assume it is in the predicate region.
• Stretch to the left until we meet a property u that is more similar to the subject, and stretch to the right until we meet a property v that is more similar to the object.
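A rough sketch of the greedy cutting idea follows. It assumes the subject, predicate and object regions are contiguous, that the anchor property is the one most similar to the relation's predicate term, and that ties keep a property in the predicate region. The name `smart_cut`, the similarity table and the toy path are all hypothetical.

```python
def smart_cut(path, subject, predicate, obj, sim):
    """Split `path` into subject, predicate and object regions.

    Returns (lo, hi): path[:lo] is the subject region, path[lo:hi] the
    predicate region and path[hi:] the object region.
    """
    # Anchor: the property most similar to the predicate term.
    anchor = max(range(len(path)), key=lambda i: sim(predicate, path[i]))
    lo = anchor
    # Stretch left until a property is more similar to the subject.
    while lo > 0 and sim(subject, path[lo - 1]) <= sim(predicate, path[lo - 1]):
        lo -= 1
    hi = anchor + 1
    # Stretch right until a property is more similar to the object.
    while hi < len(path) and sim(obj, path[hi]) <= sim(predicate, path[hi]):
        hi += 1
    return lo, hi

# Toy similarity table (illustrative values only).
SIM = {("author", "writer"): 0.9, ("publisher", "writer"): 0.2,
       ("publisher", "publishedBy"): 0.8, ("publisher", "headquarter"): 0.1,
       ("city", "headquarter"): 0.5}
sim = lambda a, b: SIM.get((a, b), 0.0)

cut = smart_cut(["writer", "publishedBy", "headquarter"],
                "author", "publisher", "city", sim)
print(cut)  # (1, 2)
```

In this sketch the number of similarity calls grows linearly with the path length, since each property is visited a constant number of times.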


Joint Lexical Semantic Similarity

tends to be biased towards small l, short paths. α is a parameter in the range [0..1].

High similarities in the subject and object regions but low similarities in the predicate region can still have a fairly high

The joint lexical semantic similarity between and P


Deal with Default Relation

Combining and

θ is a parameter in the range [0, ).


Formal Query Generating and Entity Matching


Part 6. Evaluation


Evaluation Settings

Two very different datasets
• DBLP+: DBLP augmented with data from CiteSeerX and ArnetMiner (narrow domain)
• DBpedia: structured data in Wikipedia (open domain)

Three similarity measures
• LSA semantic similarity (purely statistical)
• Hybrid semantic similarity (LSA + WordNet)
• String similarity (bigrams + Dice coefficient)

Performance metrics
• Mean Reciprocal Rank (how high the first correct interpretation is in the top-10 list)
• Mean Precision and Recall (evaluate the answers produced by the SPARQL queries)

Test environment
• PC with 2.33GHz Intel Core2 CPU and 8GB memory
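The Mean Reciprocal Rank metric can be illustrated with a short sketch. It assumes a query scores zero when no correct interpretation appears within the top-10 list. The function name and the sample ranks are made up for the example.

```python
def mean_reciprocal_rank(ranks, cutoff=10):
    """ranks: 1-based rank of the first correct interpretation per query,
    or None when no correct interpretation appears in the list."""
    scores = [1.0 / r if r is not None and r <= cutoff else 0.0
              for r in ranks]
    return sum(scores) / len(scores)

# Four queries: correct at rank 1, rank 2, not found, rank 4.
print(mean_reciprocal_rank([1, 2, None, 4]))  # 0.4375
```

An MRR of 1.0 would mean the correct interpretation is always ranked first.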


DBLP+ Dataset Statistics


Degree of connectivity

18 x 18 class pairs, resulting from pairing every class with every class

Degree of connectivity (figures): distribution of connectivity degree when distance = 1 and when distance ≤ 3; y-axis: number of class pairs

DBLP+ Query Set

64 test questions
• 31 Direct Single (DS) questions (e.g. Give me author x of the paper y)
• 15 Indirect Single (IS) questions (e.g. Show person x who cites the person y)
• 8 Direct Multiple (DM) questions (e.g. List person x who published the book y with ISBN z)
• 10 Indirect Multiple (IM) questions (e.g. List the institutions u of the author y with whom the person x from the organization z has co-authored)

Rephrased to 220 SFQ queries
• For example, rephrase “Give me author x of the paper y” to seven SFQ queries

Resolving Parameters

Use sufficiently large numbers to set k1, k2 and k3

k1 = 10 (the size of the class candidates list )

k2 = 20 (the size of the property candidates list )

k3 = 40 (the number of top hypotheses returned by the concept mapping phase)

Resolving α, β, γ, and θ
• First tune α and γ while fixing β = 0 and θ = 1
• Next, tune β while still fixing θ = 1
• Last, tune θ

Results of Tuning Parameters

Top-10 coverage of 220 SFQ queries
• hybrid: 99.5%
• LSA: 98.2%
• string: 56.4%

Cross-Validation

Using all the queries:


DBpedia Dataset Statistics


Degree of connectivity

249 x 249 class pairs, resulting from pairing every class with every class

Degree of connectivity (figures): connectivity degree among 249 classes when distance = 1 and when distance ≤ 2; axis label: degree of connectivity

DBpedia Query Set

2011 QALD (QA over Linked Data) workshop
• 50 training and 50 test questions on DBpedia 3.6, with ground truth answers

33 of the 50 QALD test questions that can be answered using only the DBpedia ontology
• We modified 7 questions due to unsupported operations and 1 question due to a data issue, but preserved the relational structure and vocabulary of the questions
• 27 DS questions (e.g. Which river does the Brooklyn Bridge cross?)
• 6 DM questions that contain two relations (e.g. Give me the official websites of actors of the television show Charmed.)

Three graduate students who are unfamiliar with DBpedia independently translated them into 99 SFQ queries.

Results

Top-10 coverage of 99 SFQ queries
• hybrid: 88.9%
• LSA: 82.8%
• string: 51.5%

The coverage has an upper limit of 91.9% (91 of 99 queries)
• 5 test cases are out of reach due to ambiguity: the translators’ interpretation changed the questions.
• 3 test cases are out of reach due to an incorrect property name.

Generating SPARQL queries and answers

Use a non-empty strategy to automatically generate answers for an SFQ query
• Run the SPARQL query generated for the best interpretation of the SFQ query.
• If an empty result is returned, go to the next interpretation, and so on.
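The non-empty strategy amounts to a simple fallback loop over the ranked interpretations. In the sketch below, `run_query` is a hypothetical callable standing in for whatever executes a SPARQL query, and a dictionary plays the role of the endpoint.

```python
def first_non_empty(interpretations, run_query):
    """Try ranked interpretations in order; return the first non-empty answer."""
    for rank, query in enumerate(interpretations, start=1):
        answers = run_query(query)
        if answers:
            return rank, answers
    return None, []

# Fake endpoint: maps a query string to its result list (illustrative only).
FAKE_ENDPOINT = {"q1": [], "q2": ["Mark Twain"], "q3": ["Someone else"]}

rank, answers = first_non_empty(["q1", "q2", "q3"], FAKE_ENDPOINT.get)
print(rank, answers)  # 2 ['Mark Twain']
```

The strategy trades some precision for coverage: a lower-ranked interpretation is used only when every better-ranked one returns nothing.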

Results on 99 SFQ queries (33 NL questions) using the parameters learned on the DBLP+ dataset.


Compare with QALD Systems

Compare with two QALD systems on 30 test questions

Three questions are excluded because we made them easier by dropping the aggregation functions.

Among 30 questions, PowerAqua modified 8 questions and FREyA modified 4 questions.

Chart legend: Our system (hybrid), Our system (LSA), FREyA, PowerAqua

Compare with QALD Systems

Compare with two QALD systems on 6 two-relation questions

Our differences
• Both PowerAqua and FREyA use lexicons
• FREyA depends heavily on the user’s interaction to perform mappings
• PowerAqua and FREyA tuned their systems on 50 training questions
• Both PowerAqua and FREyA use TBox data of the DBpedia Ontology

Chart legend: Our system (hybrid), Our system (LSA), FREyA, PowerAqua

Compare with Online Systems

Compare with two online systems on 33 test questions

Both True Knowledge and PowerAqua online systems include DBpedia data as part of their knowledge base.

Chart legend: Our system (hybrid), Our system (LSA), True Knowledge, PowerAqua

Running time Comparison

QALD reported systems
• FREyA: 36 seconds per question
• PowerAqua: N/A

Online systems
• True Knowledge: a few seconds
• PowerAqua: 143.7 seconds

Our systems
• Hybrid: 0.721 seconds
• LSA: 0.766 seconds


Conclusion and Future Work

The schema-free structured query approach allows people to query semantic data without mastering formal queries or acquiring detailed knowledge of the classes.

Our system uses statistical data about lexical semantics and semantic data to generate the most appropriate formal queries from a user’s intuitive query.

Our evaluation showed that the approach was both effective and efficient for two very different, large datasets.

Our next step is to make the approach easier to apply to new RDF data collections and to a large LOD cloud, and to develop the envisioned web interface.

Contributions

The SFQ interface that works around the unsolved problem of parsing full relational structure from natural language queries.

Novel context-sensitive and fully computation-based mapping algorithms that address both vocabulary and structure mismatch problems.

A novel approach to build a schema network from the entity network to deal with heterogeneous or unknown schemas

Define the probability of observing a path on the schema network and develop two novel statistical association models

Improve a popular statistical association measure, PMI

Develop state-of-the-art and novel semantic similarity measures

End

Thank you! Questions?
