Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional...

DESCRIPTION

The demand to access large amounts of heterogeneous structured data is emerging as a trend for many users and applications. However, the effort involved in querying heterogeneous and distributed third-party databases can create major barriers for data consumers. At the core of this problem is the semantic gap between the way users express their information needs and the representation of the data. This work aims to provide a natural language interface and an associated semantic index to support an increased level of vocabulary independence for queries over Linked Data/Semantic Web datasets, using a distributional-compositional semantics approach. Distributional semantics focuses on the automatic construction of a semantic model based on the statistical distribution of co-occurring words in large-scale texts. The proposed query model targets the following features: (i) a principled semantic approximation approach with low adaptation effort (independent from manually created resources such as ontologies, thesauri or dictionaries), (ii) comprehensive semantic matching supported by the inclusion of large volumes of distributional (unstructured) commonsense knowledge into the semantic approximation process, and (iii) expressive natural language queries. The approach was evaluated using natural language queries on an open domain dataset and achieved an average recall of 0.81, a mean average precision of 0.62 and a mean reciprocal rank of 0.49.

Natural Language Queries over Heterogeneous Linked Data Graphs: A Distributional-Compositional Semantics Approach

André Freitas and Edward Curry

Insight Centre for Data Analytics

International Conference on Intelligent User Interfaces

Haifa, 2014

Talking to your (Big) Data

Motivation

Shift in the Database Landscape

Heterogeneous, complex and large-scale databases.

Very-large and dynamic “schemas”.

circa 2000: 10s-100s attributes

circa 2014: 1,000s-1,000,000s attributes

Databases for a Complex World

How do you query data in this scenario?

Vocabulary Problem for Databases

Query: Who is the daughter of Bill Clinton married to?

Semantic approximation

Semantic Gap

Possible representations = Commonsense Knowledge

Semantics for a Complex World

Formal World

Real World

Distributional Semantics

Query Approach

Does it work?

Addressing the Vocabulary Problem for Databases (with Distributional Semantics)

Gaelic: direction

Solution (Video)

More Complex Queries (Video)

Treo Answers Jeopardy Queries (Video)

http://bit.ly/1hWcch9

Evaluation

102 natural language queries (Test Collection: QALD 2011).

Avg. query execution time: 1.52 s (simple queries) – 8.53 s (all queries).

Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
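The recall, mean average precision (MAP) and mean reciprocal rank (MRR) figures reported for this evaluation are standard ranked-retrieval metrics. The following is a minimal sketch of how they are usually computed for a single query; the function names and the toy ranking are illustrative assumptions, not taken from the evaluation code.

```python
def recall(ranked, relevant):
    """Fraction of the relevant items that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is returned."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


# Toy ranking for one query (made up for illustration).
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(recall(ranked, relevant),
      average_precision(ranked, relevant),
      reciprocal_rank(ranked, relevant))
```

MAP and MRR are then the means of `average_precision` and `reciprocal_rank` over all queries in the test collection.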

Comparative Evaluation

Query Approach

Distributional Semantics

“Words occurring in similar (linguistic) contexts are semantically related.”

If we can equate meaning with context, we can simply record the contexts in which a word occurs in a collection of texts (a corpus).

This can then be used as a surrogate of its semantic representation.
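As a rough illustration of "recording the contexts in which a word occurs", the sketch below counts the words that appear within a small window around each word. The window-based notion of context, the function name and the toy corpus are assumptions made for this example; the deck itself uses Explicit Semantic Analysis over Wikipedia, which this sketch does not implement.

```python
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    vectors[word][tokens[j]] += 1
    return vectors


# Toy corpus (made up for illustration).
corpus = [
    "the daughter is the child of her parents",
    "the son is the child of his parents",
]
vectors = context_vectors(corpus)
print(vectors["daughter"])
```

In this toy corpus "daughter" and "son" end up with identical context counts, which is the kind of regularity a distributional model exploits.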

Distributional Semantic Model

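Figure: distributional vector space with context axes c1, c2, ..., cn. The words child, husband and spouse are drawn as vectors; each coordinate (e.g. 0.7, 0.5) is a function of the number of times the word occurs in the corresponding context.

Commonsense is here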

Semantic Relatedness

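Figure: vector space with context axes c1, c2, ..., cn and the words child, husband and spouse drawn as vectors, with the angle θ marked between word vectors.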

Works as a semantic ranking function
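One common way to turn the angle θ between two word vectors into a relatedness score is the cosine of that angle. The sketch below assumes cosine similarity over sparse vectors and uses made-up weights; the deck does not state the exact measure, so treat this as one plausible instance of a semantic ranking function.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(weight * v.get(dim, 0.0) for dim, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_relatedness(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least related."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical context weights, chosen only for illustration.
daughter = {"c1": 0.7, "c2": 0.5}
candidates = {
    "child": {"c1": 0.6, "c2": 0.6},
    "religion": {"c2": 0.1, "c3": 0.9},
}
ranking = rank_by_relatedness(daughter, candidates)
print(ranking)
```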

Approach Overview

Figure: approach overview. Components shown: Query, Query Analysis, Query Features, Query Planner, Query Plan; core semantic approximation & composition operations; Ƭ-Space; Database; large-scale unstructured data; distributional semantics; commonsense knowledge.

Approach Overview

Figure: approach overview with concrete components. Components shown: Query, Query Analysis, Query Features, Query Planner, Query Plan; core semantic approximation & composition operations; Ƭ-Space; RDF Data; Wikipedia; Explicit Semantic Analysis (ESA).

Commonsense knowledge

Figure: Ƭ-Space with elements labeled e, p and r.

Core Operations


Search and Composition Operations

Instance search
- Proper nouns
- String similarity + node cardinality

Class (unary predicate) search
- Nouns, adjectives and adverbs
- String similarity + distributional semantic relatedness

Property (binary predicate) search
- Nouns, adjectives, verbs and adverbs
- Distributional semantic relatedness

Navigation

Extensional expansion
- Expands the instances associated with a class.

Operator application
- Aggregations, conditionals, ordering, position

Disjunction & Conjunction

Disambiguation dialog (instance, predicate)
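The instance search operation above combines string similarity with node cardinality. A minimal sketch of one way to do that follows. It assumes that "node cardinality" means the number of triples attached to a node and that cardinality is used only to break ties between equally similar labels; both are assumptions, as are the labels and counts.

```python
from difflib import SequenceMatcher


def instance_search(term, node_cardinality, top_k=3):
    """Rank instance labels by string similarity, breaking ties by cardinality."""
    def score(label):
        similarity = SequenceMatcher(None, term.lower(), label.lower()).ratio()
        return (similarity, node_cardinality[label])

    return sorted(node_cardinality, key=score, reverse=True)[:top_k]


# Hypothetical instance labels with made-up triple counts.
NODES = {
    "Bill Clinton": 500,
    "Bill Clinton (album)": 3,
    "Hillary Clinton": 400,
}
matches = instance_search("Bill Clinton", NODES)
print(matches)
```

Class and property search would replace or complement the string score with a distributional relatedness score, since their terms rarely match dataset labels literally.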

Core Principles

Minimize the impact of Ambiguity, Vagueness, Synonymy.

Address the simplest matchings first (heuristics).

Semantic Relatedness as a primitive operation.

Distributional semantics as commonsense knowledge.

Question Analysis

Transform natural language queries into triple patterns

“Who is the daughter of Bill Clinton married to?”

Query Features:

Bill Clinton (INSTANCE), daughter (PREDICATE), married to (PREDICATE)

PODS

Query Plan

Map query features into a query plan.

A query plan contains a sequence of core operations.

Query Features: (INSTANCE) (PREDICATE) (PREDICATE)

Query Plan:

(1) INSTANCE SEARCH (Bill Clinton)

(2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)

(3) e1 <- NAVIGATE (Bill Clinton, p1)

(4) p2 <- SEARCH PREDICATE (e1, married to)

(5) e2 <- NAVIGATE (e1, p2)
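The five plan steps can be sketched as a small loop over a toy graph. Everything below is illustrative: the graph holds only the triples shown in the deck's example, the relatedness table is a stub with placeholder scores standing in for a real distributional model, and the function names are hypothetical. There is no error handling for terms that fail to match.

```python
# Toy graph: subject -> predicate -> list of objects.
GRAPH = {
    ":Bill_Clinton": {
        ":child": [":Chelsea_Clinton"],
        ":religion": [":Baptists"],
        ":almaMater": [":Yale_Law_School"],
    },
    ":Chelsea_Clinton": {":spouse": [":Mark_Mezvinsky"]},
}

# Stubbed relatedness scores (placeholders, not computed from any corpus).
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,
}


def instance_search(term):
    """Resolve a proper noun to a node by exact label match (simplification)."""
    key = ":" + term.replace(" ", "_")
    return key if key in GRAPH else None


def search_predicate(entity, term):
    """Pick the predicate of `entity` most related to the query term."""
    predicates = GRAPH.get(entity, {})
    return max(predicates, key=lambda p: SEM_REL.get((term, p), 0.0), default=None)


def navigate(entity, predicate):
    """Follow `predicate` from `entity`; takes the first object for simplicity."""
    return GRAPH[entity][predicate][0]


def run_plan(instance_term, predicate_terms):
    entity = instance_search(instance_term)           # step (1)
    for term in predicate_terms:
        predicate = search_predicate(entity, term)    # steps (2) and (4)
        entity = navigate(entity, predicate)          # steps (3) and (5)
    return entity


answer = run_plan("Bill Clinton", ["daughter", "married to"])
print(answer)
```

Each navigation result becomes the pivot entity for the next predicate search, which is why the plan only ever compares a query term against the predicates of one node at a time.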

Instance Search

Query: Bill Clinton daughter married to

Linked Data: :Bill_Clinton

Predicate Search

Query: Bill Clinton daughter married to

Linked Data: :Bill_Clinton (PIVOT ENTITY)

(ASSOCIATED TRIPLES)

:Bill_Clinton :child :Chelsea_Clinton

:Bill_Clinton :religion :Baptists

:Bill_Clinton :almaMater :Yale_Law_School

...

Predicate Search

Query: Bill Clinton daughter married to

Linked Data: :Bill_Clinton, with candidate properties :child, :religion, :almaMater, ...

sem_rel(daughter, child) = 0.054

sem_rel(daughter, religion) = 0.004

sem_rel(daughter, alma mater) = 0.001

Which properties are semantically related to ‘daughter’?
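One way to answer this question programmatically is to turn each property name into a plain-text label, score it against the query term, and keep the properties above a cut-off. The sketch below assumes a camelCase label convention (":almaMater" becomes "alma mater"), a stubbed scoring function with placeholder values, and a threshold of 0.01; the threshold and all names are assumptions for illustration.

```python
import re


def predicate_label(uri):
    """Turn ':almaMater' into 'alma mater' by splitting camelCase."""
    name = uri.lstrip(":")
    words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    return words.replace("_", " ").lower()


def best_predicates(term, predicate_uris, sem_rel, threshold=0.01):
    """Score every predicate label and keep those at or above the threshold."""
    scored = [(uri, sem_rel(term, predicate_label(uri))) for uri in predicate_uris]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stub standing in for a distributional relatedness measure (placeholder scores).
STUB_SCORES = {
    ("daughter", "child"): 0.054,
    ("daughter", "religion"): 0.004,
    ("daughter", "alma mater"): 0.001,
}


def stub_sem_rel(a, b):
    return STUB_SCORES.get((a, b), 0.0)


result = best_predicates(
    "daughter", [":child", ":religion", ":almaMater"], stub_sem_rel
)
print(result)
```

With these placeholder scores only `:child` survives the cut-off, so navigation proceeds along that property.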

Navigate

Query: Bill Clinton daughter married to

Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)

Predicate Search

Query: Bill Clinton daughter married to

Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)

:Chelsea_Clinton :spouse :Mark_Mezvinsky

Results

Conclusions

The compositional-distributional model supports a schema-agnostic natural language query mechanism over a large schema (open domain) database

Comprehensive and accurate semantic matching
- Avg. recall=0.81, MAP=0.62, MRR=0.49

Medium-high expressivity
- 80% of queries answered

Interactive query execution time
- Avg. 1.52 s (simple queries) – 8.53 s (all queries) / query

Better recall and query coverage compared to baselines with equivalent precision

Low adaptation effort for new datasets
