77
Schema Free Querying of Semantic Data Lushan Han Advisor: Dr. Tim Finin May 23, 2014 1

Schema Free Querying of Semantic Data

  • Upload
    kaiser

  • View
    34

  • Download
    0

Embed Size (px)

DESCRIPTION

Schema Free Querying of Semantic Data. Lushan Han Advisor: Dr. Tim Finin May 23, 2014. Introduction Related Work SFQ Interface Schema Network and Association Models Query Interpretation Evaluation Conclusion. Road Map. Part 1. Introduction. Semantic Data. - PowerPoint PPT Presentation

Citation preview

Page 1: Schema Free Querying  of Semantic Data

Schema Free Querying of Semantic Data

Lushan Han Advisor: Dr. Tim Finin

May 23, 2014 1

Page 2: Schema Free Querying  of Semantic Data

Introduction Related Work SFQ Interface Schema Network and Association Models Query Interpretation Evaluation Conclusion

Road Map

2

Page 3: Schema Free Querying  of Semantic Data

Part 1. Introduction

3

Page 4: Schema Free Querying  of Semantic Data

Semantic Data

A network of entities, which are annotated with types and interlinked with properties.

Increasing amount of Semantic Data

Examples: RDF semantic data LOD DBpedia Freebase

4

Page 5: Schema Free Querying  of Semantic Data

Objectives

Develop schema-free query interfaces Works with “semantic data” in many forms, e.g., RDF, Freebase,

RDBMS Allow casual users to freely query semantic data without learning

its schema Queries should be in the user’s conceptual world

Two existing interfaces: Natural Language Interface (NLI) Keyword Interface

Three hard problems

5

Page 6: Schema Free Querying  of Semantic Data

P1. No Practical Interface

Natural language interface NLP techniques are still not reliable to parse out the full relational

structure from natural language questions

Keyword interface Ambiguity and limited expressiveness

(e.g. “president children spouse”)

(e.g. Who was the author of the Adventures of Tom Sawyer and where was he born?)

6

Page 7: Schema Free Querying  of Semantic Data

SFQ Interface

Still in the user’s conceptual world Make implicit structure of NL questions explicit

Who was the author of the Adventures of Tom Sawyer and where was he born?

7

Page 8: Schema Free Querying  of Semantic Data

P2. Semantic Heterogeneity Problem

Many different ways to express (model) the same meaning

Vocabulary and structure mismatches between the user’s query and the machine’s representation

Existing methods: Labor-intensive and ad-hoc methods

Domain-specific syntactic or semantic grammars Mapping Lexicons (Mapping rules) Templates

Thesaurus (e.g. WordNet) is insufficient

8

Page 9: Schema Free Querying  of Semantic Data

P2. Examples

9

Page 10: Schema Free Querying  of Semantic Data

P2. More Examples

4 5

10

Page 11: Schema Free Querying  of Semantic Data

A purely computational approach

Lexical Semantic similarity Measures Capture flexible semantics

Statistical Association Measures Carry out disambiguation

A novel “overall semantic similarity” or fitness metric that combines Lexical semantic similarity measures statistical association measures structure features

Context-sensitive mapping algorithms

11

Page 12: Schema Free Querying  of Semantic Data

P3. Heterogeneous or unknown schema

Hard to reach consensus on a schema for the world

Open domain semantic data has heterogeneous or even unknown schema (e.g. Semantic Web data, DBpedia)

Traditional NLI systems are difficult to apply

Some modern systems Not produce formal queries (e.g. SQL or SPARQL). Directly search into the entity network for matchings

Computationally expensive and has ad-hoc natures

12

Page 13: Schema Free Querying  of Semantic Data

The schema network

Learn a schema statistically from the entity network by exploiting co-occurrences. The schema itself is also represented as a network

Mapping the user’s query into the schema network, instead of the entity network. Much more scalable Produce formal queries Enable joint disambiguation and context-sensitive mapping

algorithm

13

Page 14: Schema Free Querying  of Semantic Data

Thesis Statement

We can develop an effective and efficient algorithm to map a casual user's schema-free query into a formal knowledge base query language that overcomes vocabulary and structure mismatch problems by exploiting lexical semantic similarity measures, association degree measures and structural features.

14

Page 15: Schema Free Querying  of Semantic Data

Contributions

An intuitive SFQ interface that avoids the problem of extracting relations structure from NL queries

Novel algorithms mapping SFQ queries to KB queries addressing both vocabulary and structure mismatches

A novel approach to handle heterogeneous or unknown schemas by building a schema from an entity network

Define the probability of observing a path in a schema network and develop two novel statistical association models

An improved PMI metric and new semantic text similarity measures and algorithms

15

Page 16: Schema Free Querying  of Semantic Data

Part 2. Related Work

16

Page 17: Schema Free Querying  of Semantic Data

Natural Language Interface to Database (NLIDB) Systems

Early Systems in 70s, (e.g. LUNAR and LADDER) Domain-specific syntactic or semantic grammars Heavily customized to a particular application

Later systems in 80s and 90s. (e.g. TEAM, ASK, MASQUE) More general parser Require human-crafted lexicons, mapping rules and domain

knowledge to interpret the parse tree Allow knowledge engineers or end users to enrich lexicons and

add new mapping rules through an interactive interface More portable than early systems

17

Page 18: Schema Free Querying  of Semantic Data

Recent NLI SystemsSystem Data NL Parsing Vocabulary

MismatchStructureMismatch

Auto-matic

Limitations

PRECISE DB Tokenizer to get a collection of tokens Lexicon

bipartite matching

Yes• Very restricted domains

SCISSOR KB Semantic parser Machine learning Yes • Very restricted domains• Manually annotated training data

NaLIX XML Dependency parser to get adjacent tokens Lexicon

Adjacency matching

No• Restricted domains

ORAKEL RDF Syntactic parser to get logical lambda-calculus query Lexicon Lexicon No • Restricted domains

• Simple NL questions

FREyA RDF Syntactic parser to get a collection of terms Lexicon No No • Restricted domains

Aqualog RDF Shallow parsing and pattern rules to get relations Lexicon No No • Restricted domains

PANTO RDFSyntactic parser and a head-driven algorithm to get relations

Lexicon No Yes• Very restricted domains• Simple NL questions

True Knowledge(Evi)

KB1,200 templatesA very large repository of query rephrasing

Lexicon, 1,200 templates and a very large repository of query rephrasing

Yes

• Extremely laborious

PowerAqua RDF Shallow parsing and pattern rules to get relations Lexicon

partial matching

Yes• Directly match into the entity network• Not produce formal queries

Treo RDF Syntactic parser to get a ordered list of terms

Semantic similarity

No Yes • Directly match into the entity network• Not produce formal queries• Queries must be a single path • Produce no exact answers but triple paths that may contain the answers

18

Page 19: Schema Free Querying  of Semantic Data

Part 3. SFQ Interface

19

Page 20: Schema Free Querying  of Semantic Data

SFQ Examples1. Where was the author of the Adventures of Tom Sawyer born?

2. Give me authors in the CIKM conference

3. A more complicated one

20

Page 21: Schema Free Querying  of Semantic Data

Default Relations

The relation name can be left out

A stop word list for filtering relation names with words like in, of, has, from, belong, part of, locate and etc.

21

Page 22: Schema Free Querying  of Semantic Data

Envisioned Web Interface

22

Page 23: Schema Free Querying  of Semantic Data

Output (1)

23

Page 24: Schema Free Querying  of Semantic Data

Output (2)

24

Page 25: Schema Free Querying  of Semantic Data

Part 4. Schema Network and Association Models

25

Page 26: Schema Free Querying  of Semantic Data

Instance Data (ABox)

Two datasets The relation dataset (all relations between instances) The type dataset (all type definitions for instances)

Integrate all RDF data types into five types that are familiar to users ˆNumber, ˆDate, ˆYear, ˆText and ˆLiteral ˆLiteral is the super type of the other four

We use DBpedia for examples in the following slides

26

Page 27: Schema Free Querying  of Semantic Data

Automatically enrich the set of types

Automatically deduce types from relations Infer attribute types from data type properties

e.g. <Beijing>, population, “20693000” => ˆPopulation

Infer classes from object properties e.g. < Zelig>, director, <Woody Allen> => ˜Director

27

Page 28: Schema Free Querying  of Semantic Data

Counting Co-occurrence

28

Page 29: Schema Free Querying  of Semantic Data

The Schema Network

A statistical meta description of the underlying entity network, which is a network itself.

29

Page 30: Schema Free Querying  of Semantic Data

The Schema Path

A path on the schema network is called a schema path

A schema path P represents a composite relation

Example 1.

Example 2.

30

Page 31: Schema Free Querying  of Semantic Data

The Schema Path Probability

Measure the reasonableness of a path

The probability of “observing” a path on the schema network

(A1) we select the starting node c0 of the path randomly from all the nodes in the schema network

(A2) observe the path in a random walk starting with c0

31

Page 32: Schema Free Querying  of Semantic Data

Compute Transition Probability

0 ≤ ≤ 1

32

Page 33: Schema Free Querying  of Semantic Data

A Property about Schema Path

A schema path P and its return path P’ represent the same relation.

Given a schema path P and its return path P’ we have P(P ) = P(P’).

P

P’

33

Page 34: Schema Free Querying  of Semantic Data

Schema Path Model

Supposed to store and index all the schema paths with a length no larger than a given threshold and their probabilities

The only supported function is to return all the schema paths and their probabilities between two given classes.

Put in memory for fast computation

34

Page 35: Schema Free Querying  of Semantic Data

Schema Path Model Optimization

35

Page 36: Schema Free Querying  of Semantic Data

Concept Path

Group all the edges with the same direction between two nodes into a single edge

By analogy to schema path, we have concept path probability

Concept path frequency

36

Page 37: Schema Free Querying  of Semantic Data

Concept Association Knowledge (CAK) model

Pairwise associations (i) direct association between classes and properties (ii) indirect association between two classes

PMI measure

Our improved PMI measure

37

Page 38: Schema Free Querying  of Semantic Data

Concept Association Knowledge (CAK) model

Direct association between a directed class and a property p

Indirect association between two directed classes

38

Page 39: Schema Free Querying  of Semantic Data

CAK Examples

39

Page 40: Schema Free Querying  of Semantic Data

PMI* vs PMIThe most associated property for “Person” in DBpedia

PMI* PMI

40

Page 41: Schema Free Querying  of Semantic Data

Part 5. Query Interpretation

41

Page 42: Schema Free Querying  of Semantic Data

SFQ Interpretation

42

Page 43: Schema Free Querying  of Semantic Data

Two Phase Mapping Algorithm

43

Page 44: Schema Free Querying  of Semantic Data

Generating Candidates via Lexical Semantic Similarity

Page 45: Schema Free Querying  of Semantic Data

Disambiguation via Optimization

Page 46: Schema Free Querying  of Semantic Data

Concept Mapping Optimization Problem

46

Page 47: Schema Free Querying  of Semantic Data

A joint disambiguation example

Page 48: Schema Free Querying  of Semantic Data

Time Complexity of Concept Mapping Algorithm

A straightforward concept mapping algorithm

After exploiting locality – the optimal mapping choice of a property can be determined locally when the two classes it links are fixed

48

Page 49: Schema Free Querying  of Semantic Data

Relation Mapping Optimization Problem

H* : the set of top k3 concept mapping hypotheses The reduced mapping space for the SFQ

The optimization problem

49

Page 50: Schema Free Querying  of Semantic Data

Computing the fitness of a mapping σ on a relation r

Let

Two features and one parameter β Joint lexical semantic similarity between and P The schema path frequency of P The parameter β adjusts the relative importance of the two

features

50

Page 51: Schema Free Querying  of Semantic Data

Align terms in P to terms in r

The relation

The path C = <c0, c1, …, cl-1, cl>

P = <p0, p1, …, pl-1>

We already know and are paired with c0 and cl

We ignore all the intermediate classes c1, …, cl-1 Semantics in c1, …, cl-1 is overlapped with that in p0, p1, …, pl-1

the less terms we join, the less likely errors can occur

The only unaligned terms are and p0, p1, …, pl-1

51

Page 52: Schema Free Querying  of Semantic Data

Semantic Stretch and Heterogeneous Alignments

52

Page 53: Schema Free Querying  of Semantic Data

Cutting Function and Cutting Objective Function

Each cutting defines a function, referred to as ω

The product of similarity of the pairs in the minimum pair set that covers and every

53

Page 54: Schema Free Querying  of Semantic Data

Cutting Optimization Problem and a Greedy Algorithm

Cutting Optimization

The cutting space Ω has a size of Total running time

Greedy Algorithm: SmartCutter that run in First find the property in P that is the most similar to and

assume it is in the predicate region. Stretch to the left until we meet a property u that is more similar

to the and stretch to the right until meeting a property v that is more similar to the

54

Page 55: Schema Free Querying  of Semantic Data

Joint Lexical Semantic Similarity

tends to be biased towards small l, short paths. α is a parameter in the range [0..1].

High similarities in the subject and object regions but low similarities in the predicate region can still have a fairly high

The joint lexical semantic similarity between and P

55

Page 56: Schema Free Querying  of Semantic Data

Deal with Default Relation

Combining and

θ is a parameter in the range [0, ).

56

Page 57: Schema Free Querying  of Semantic Data

Formal Query Generating and Entity Matching

57

Page 58: Schema Free Querying  of Semantic Data

Part 5. Evaluation

58

Page 59: Schema Free Querying  of Semantic Data

Evaluation Settings Two very different datasets

DBLP+ DBLP augmented with data from CiteSeerX and ArnetMiner (narrow domain) DBpedia Structured data in Wikipedia (open domain)

Three similarity measures LSA semantic similarity (purely statistical) Hybrid semantic similarity (LSA + WordNet) String similarity (bigrams + Dice coefficient)

Performance metrics Mean Reciprocal Rank (how high the first correct interpretation are in the top-10 list) Mean Precision and Recall (evaluate the answers produced by the SPARQL queries)

Test environment PC with 2.33GHz Intel Core2 CPU and 8GB memory

59

Page 60: Schema Free Querying  of Semantic Data

DBLP+ Dataset Statistics

60

Page 61: Schema Free Querying  of Semantic Data

Degree of connectivity 18 x 18 class pairs resulted from pairing every C with every C

61

Degree of connectivity Degree of connectivity

distribution of connectivity degree when distance = 1 distribution of connectivity degree when distance ≤ 3

Num

ber

of c

lass

pai

rs

Num

ber

of c

lass

pai

rs

Page 62: Schema Free Querying  of Semantic Data

DBLP+ Query Set 64 test questions

31 Direct Single (DS) questions (e.g. Give me author x of the paper y ) 15 Indirect Single (IS) questions (e.g. Show person x who cites the person y ) 8 Direct Multiple (DM) questions (e.g. List person x who published the book y with

ISBN z ) 10 Indirect Multiple (IM) questions (e.g. List the institutions u of the author y with

whom the person x from the organization z has co-authored )

Rephrased to 220 SFQ queries for example, rephase “Give me author x of the paper y” to seven SFQ queries

62

Page 63: Schema Free Querying  of Semantic Data

Resolving Parameters

Use sufficiently large numbers to set k1, k2 and k3

k1 = 10 (the size of the class candidates list )

k2 = 20 (the size of the property candidates list )

k3 = 40 (the number of top hypotheses returned by the concept mapping phase)

Resolving α, β, γ, and θ First tune α and γ while fixing β = 0 and θ = 1 Next, tune β while still fixing θ = 1 Last, tune θ

63

Page 64: Schema Free Querying  of Semantic Data

Results of Tuning Parameters

Top-10 coverage of 220 SFQ hybrid 99.5% LSA 98.2% string 56.4%

64

Page 65: Schema Free Querying  of Semantic Data

Cross-Validation

Using all the queries:

65

Page 66: Schema Free Querying  of Semantic Data

DBpedia Dataset Statistics

66

Page 67: Schema Free Querying  of Semantic Data

Degree of connectivity 249 x 249 class pairs resulted from pairing every C with every C

67

connectivity degree among 249 classes when distance ≤ 2connectivity degree among 249 classes when distance = 1

degr

ee o

f co

nnec

tivity

degr

ee o

f co

nnec

tivity

Page 68: Schema Free Querying  of Semantic Data

DBpedia Query Set 2011 QALD (QA over Linked Data) workshop

50 training and 50 test questions on DBpedia 3.6 ground truth answers

33 questions from 50 QALD test questions that can be answered using only the DBpedia ontology Modify 7 questions due to unsupported operations and 1 question due to data

issue but we preserve the relational structure and vocabulary of the questions 27 DS questions (e.g. Which river does the Brooklyn Bridge cross?) 6 DM questions that contains two relations (e.g. Give me the official websites of

actors of the television show Charmed. )

Three graduate students who are unfamilar with DBpedia independently translated them into 99 SFQ queries .

68

Page 69: Schema Free Querying  of Semantic Data

Results

Top-10 coverage of 99 SFQ hybrid 88.9% LSA 82.8% string 51.5%

The coverage has an upper limit 91.9% 5 test cases due to ambiguty.

Translators’ interpretation changed the questions. 3 test cases due to an incorrect property name.

69

Page 70: Schema Free Querying  of Semantic Data

Generating SPARQL queries and answers

Use a non-empty strategy to automatically generate answers for a SFQ query Run SPARQL generated for the best interpretation of a SFQ query. If an empty result is returned, go to next interpretation and so on.

Results on 99 SFQ queries (33 NL questions) using the parameters learned on the DBLP+ dataset.

70

Page 71: Schema Free Querying  of Semantic Data

Compare with QALD Systems

Compare with two QALD systems on 30 test questions

Three questions are excluded because we made them easier by dropping the aggregation functions.

Among 30 questions, PowerAqua modified 8 questions and FREyA modified 4 questions.

71

Our system (hybrid) Our system (LSA) FREyA PowerAqua

Page 72: Schema Free Querying  of Semantic Data

Compare with QALD Systems

Compare with two QALD systems on 6 two-relation questions

Our Differences Both PowerAqua and FREyA use lexicons FREyA highly depends on the user’s interaction to perform mappings PowerAqua and FREyA tuned their systems on 50 training questions Both PowerAqua and FREyA use TBox data of DBpedia Ontology

72

Our system (hybrid) Our system (LSA) FREyA PowerAqua

Page 73: Schema Free Querying  of Semantic Data

Compare with Online Systems

Compare with two online systems on 33 test questions

Both True Knowledge and PowerAqua online systems include DBpedia data as part of their knowledge base.

73

Our system (hybrid) Our system (LSA) True Knowledge PowerAqua

Page 74: Schema Free Querying  of Semantic Data

Running time Comparison

QALD reported systems FREyA 36 seconds per question PowerAqua N/A

Online systems True Knowledge a few seconds PowerAqua 143.7 seconds

Our systems Hybrid 0.721 seconds LSA 0.766 seconds

Both True Knowledge and PowerAqua online systems include DBpedia data as part of their knowledge base.

74

Page 75: Schema Free Querying  of Semantic Data

Conclusion and Future Work The schema-free structured query approach allows people to query

semantic data without mastering formal queries or acquiring detailed knowledge of the classes.

Our system uses statistical data about lexical semantics and semantic data to generate most appropriate formal queries from a user’s intuitive query.

Our evaluation showed that the approach was both effective and efficient for two very different, large datasets

Our next step is to make the approach easier to apply to new RDF data collection and to a large LOD cloud and develop the envisioned web interface

75

Page 76: Schema Free Querying  of Semantic Data

Contributions The SFQ interface that works around the unsolved problem of parsing full

relational structure from natural language queries.

Novel context-sensitive and fully computation-based mapping algorithms that address both vocabulary and structure mismatch problems.

A novel approach to build a schema network from the entity network to deal with heterogenous or unknown schemas

Define the probability of observing a path on the schema network and develop two novel statistical association models

Improve a popular statistical association measure, PMI

Develop state of art and novel semantic simialrity measures

76

Page 77: Schema Free Querying  of Semantic Data

End

Thank you!!!Questions?

77