SEMANTIC SEARCH OVER BIG LINKED DATA Dr. Thanh Tran

Big data search

Big data search - current and future works

Page 1: Big data search

SEMANTIC SEARCH OVER BIG LINKED DATA

Dr. Thanh Tran

Page 2: Big data search

…AND THERE WAS LINKED DATA!

Page 3: Big data search

(Source: http://linkeddata.org/)

Page 4: Big data search

RDF

A W3C Web standard for data representation and exchange

Allows different kinds of data to be captured as graphs

Graphs contain resource descriptions

Each is a set of triples

• Attribute values

• Relations to other resources

[Figure: example RDF graph. Freddie Mercury -member-> Queen; Brian May -member-> Queen; Queen -producer-> Liar; Queen -formed in-> 1971]
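To make "a graph is a set of triples" concrete, here is a minimal Python sketch. The four triples are read off the example graph on this slide; the edge directions and the helper name `describe` are assumptions for illustration, not part of the talk.

```python
from collections import defaultdict

# (subject, predicate, object) triples read off the slide's example graph.
# Edge directions are an assumption; the flattened figure does not show them.
TRIPLES = {
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "producer", "Liar"),
    ("Queen", "formed in", "1971"),
}

def describe(resource, triples):
    """Return the resource description: predicate -> sorted list of objects."""
    description = defaultdict(list)
    for subject, predicate, obj in triples:
        if subject == resource:
            description[predicate].append(obj)
    return {p: sorted(objs) for p, objs in description.items()}

# "formed in" is an attribute value, "producer" a relation to another resource.
queen = describe("Queen", TRIPLES)
print(queen)
```

A resource description is then simply the subset of triples that share a subject.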

Page 5: Big data search

LINKED DATA CLOUD

(Source: http://linkeddata.org/)

Page 6: Big data search

OPPORTUNITIES (1)

Data.gov: effective dissemination and consumption of public sector data

(Source: http://www.data.gov)

Page 7: Big data search

OPPORTUNITIES (2)

Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains

“written by freddie queen single”

WKP:Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…

Page 8: Big data search

OPPORTUNITIES

Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains

“written by freddie queen single”

WKP:Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…

[Figure: data graph shown next to the WKP page. Nodes: Freddie Mercury, Brian May, Queen, Liar, 1971. Types: MusicBrainz:Artist, MusicBrainz:Band, MusicBrainz:Single. Edges: member, member, producer, formed in]

Page 9: Big data search

OPPORTUNITIES

Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains

“written by freddie queen single”

WKP:Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…

[Figure: data graph shown next to the WKP page. Nodes: Freddie Mercury, Brian May, Queen, Liar, 1971. Types: MusicBrainz:Artist, MusicBrainz:Band, MusicBrainz:Single. Edges: member, member, producer, formed in. Links: same-as]

Page 10: Big data search

OPPORTUNITIES

Linked Data Cloud: effective dissemination and consumption of data across datasets, across domains

“written by freddie queen single”

WKP:Page: The Freddie Mercury-written lead single "Seven Seas of Rhye" reached number ten in the UK, giving the band their first hit.[14] The album is the first real…

[Figure: data graph shown next to the WKP page. Nodes: Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971, single. Types: Freebase:Person, MusicBrainz:Artist, MusicBrainz:Band, MusicBrainz:Single. Edges: member, member, producer, formed in, marital status. Links: same-as, same-as?]

Page 11: Big data search

COGNITIVE CHALLENGES

Structured data / database solutions require information needs to be given as structured queries

Writing structured queries requires knowledge about

• Query language syntax and semantics

• Datasets and their schemas

• Links between datasets

“written by freddie queen single”

<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, member, Queen>
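The structured query on this slide is a conjunction of triple patterns with one variable, x. The sketch below shows, under stated assumptions, how such patterns can be evaluated over a set of triples. The `?x` variable syntax, the function names and the small `DATA` list are illustrative choices, not the query language or dataset used in the talk.

```python
def is_var(term):
    """Variables are written with a leading '?' (an assumed convention)."""
    return term.startswith("?")

def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple; return None on failure."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended

def evaluate(patterns, triples):
    """Evaluate a conjunction of triple patterns with nested-loop matching."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings

# Hypothetical data, shaped after the triples shown on the slides.
DATA = [
    ("Liar", "type", "Single"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Queen", "producer", "Liar"),
    ("Queen", "type", "Band"),
]

QUERY = [
    ("?x", "type", "Single"),
    ("Freddie Mercury", "writer", "?x"),
    ("Freddie Mercury", "member", "Queen"),
]

answers = evaluate(QUERY, DATA)
print(answers)
```

Writing `QUERY` by hand requires knowing the predicate names (`writer`, `member`) and the type `Single`, which is exactly the knowledge the keyword query does not demand from the user.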

Page 12: Big data search

SEMANTIC SEARCH OVER BIG LINKED DATA!

Page 13: Big data search

VISION

Enabling end users to retrieve and explore relevant knowledge from Big Linked Data via intuitive interfaces!

Page 14: Big data search

THE INFORMATION WORKBENCH DEMO

Facets

Syntactic Completions

Keywords

Semantic Completions

Page 15: Big data search

(Source: http://www.fluidops.com/information-workbench/)

Page 16: Big data search

FOLLOWING AGENDA

Technical Challenges

Big Picture of Previous & Current Work

Contributions & Innovations

Keyword Search over Big Linked Data

Where are we now?

What is to be done?

Page 17: Big data search

TECHNICAL CHALLENGES

Linked Data is Big Data

Volume: numerous large datasets

• Processing all datasets possible/needed?

Velocity: streams from sensors, live feeds etc.

• How to provide fresh, timely results?

• Preprocessing possible?

Variety: different data formats + schemas are unknown, heterogeneous and rapidly changing

• Making sense of the data?

• Integrate and combine knowledge from different datasets?

Page 18: Big data search

BIG PICTURE

Previous & Current Work

Acquire

• Source selection [ISWC10, TKDE12b]

Organize

• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]

Analyze

• Descriptive resource summary [ISWC11]

• Structural summary of datasets [TKDE12a]

Search

• Entity & relational search and ranking [SIGIR11,CIKM11b]

• Keyword query processing [ICDE09, SIGMOD09]

Volume: Fast access? All data/datasets?

Page 19: Big data search

BIG PICTURE

Previous & Current Work

Acquire

• Source selection [ISWC10, TKDE12b]

• Stream-based processing of external sources [ISWC10b]

• Combining local & external sources [ESWC12]

Organize

• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]

• On-demand search-driven data integration [WebSci12]

Analyze

• Descriptive resource summary [ISWC11]

• Structural summary of datasets [TKDE12a]

Search

• Entity & relational search and ranking [SIGIR11,CIKM11b]

• Keyword query processing [ICDE09, SIGMOD09]

• Explorative Linked Data query processing [ESWC11]

• Multi-datasets search [WWW12]

Volume: Fast access? All data/datasets?

Velocity: Fresh results? Preprocessing?

Variety: Heterogeneous Datasets/Schemas, Structured + Unstructured

Page 20: Big data search

KEYWORD SEARCH OVER BIG LINKED DATA

Page 22: Big data search

KEYWORD SEARCH PROBLEM (1)

“written by freddie queen single”

[Figure: data graph with nodes Freddie Mercury, Brian May, Queen, Queen Elizabeth 1, Liar, 1971 and single; types Person, Artist, Band and Single; edges member, member, producer, formed in, marital status and writer]

Set of Queries: 1) Query 1, 2) Query 2, …

Selection

Set of Results: 1) Result 1, 2) Result 2, …

Page 23: Big data search

KEYWORD SEARCH PROBLEM (2)

Goal

• Finding “substructures”, e.g. a Steiner graph

• Connecting keyword-matching elements

• AND-semantics: results contain one keyword-matching element for every query keyword

Problem

• Keywords produce a large number of matching elements

• Large number of connecting graphs

• Search complexity increases exponentially with the size of the data graphs and the number of query keywords

• Data graphs are large in size

Page 24: Big data search

INDEX-BASED TOP-K KEYWORD QUERY PROCESSING [CIKM11B]

Cast problem as the one of index-based join processing

• Index-based data access (retrieval)• Join (combine)

Page 25: Big data search

D-LENGTH 2-HOP COVER GRAPH INDEX (1)

Use a d-length 2-hop cover for graph indexing, i.e. a set of neighbourhood labels $NB_n$, one for every node $n$

• If there is a path of length $2d$ or less between $u$ and $v$, then $NB_u \cap NB_v \neq \emptyset$

• All paths of length $2d$ or less between $u$ and $v$ are of the form $\langle u, \ldots, w, \ldots, v \rangle$ with $w \in NB_u \cap NB_v$

• $u$ and $v$ are called center nodes and $w$ is the hop node
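The cover property above can be sketched in a few lines of Python. This is a simplified illustration, not the index from the talk: it assumes an undirected graph, takes the neighborhood of a node to be everything within distance d (including the node itself), and stores nodes only, whereas the talk's index stores path entries. The edge list and function names are hypothetical.

```python
from collections import deque

# Undirected edges of a small example graph (types are treated as nodes here).
EDGES = [
    ("Freddie Mercury", "Queen"),
    ("Brian May", "Queen"),
    ("Queen", "Liar"),
    ("Queen", "1971"),
    ("Freddie Mercury", "Liar"),
    ("Liar", "Single"),
]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def neighborhood(adj, center, d):
    """Nodes within distance d of center, mapped to their distance (BFS)."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue  # do not expand beyond d hops
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def hop_nodes(adj, u, v, d):
    """Shared nodes of the two neighborhoods, i.e. candidate hop nodes w."""
    return set(neighborhood(adj, u, d)) & set(neighborhood(adj, v, d))

adj = adjacency(EDGES)
print(hop_nodes(adj, "Brian May", "1971", 1))    # connected within 2d = 2
print(hop_nodes(adj, "Brian May", "Single", 1))  # shortest path has length 3
```

With d = 1 the pair (Brian May, Single) has no common hop node because the shortest connecting path has length 3; raising d to 2 makes the neighborhoods overlap.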

Page 26: Big data search

D-LENGTH 2-HOP COVER GRAPH INDEX (2)

A set of d-length neighborhoods is a d-length 2-hop cover

Pruning paths during construction reduces the index size!

[Figure: example data graph and its 1-length 2-hop cover, stored as a path index over center/hop nodes. Path entries: Freddie Mercury -member- Queen; Brian May -member- Queen; Queen -producer- Liar; Queen - Band; Queen -formed in- 1971; Freddie Mercury -writer- Liar; Liar - Single; Freddie Mercury - Artist]

Page 27: Big data search

TOP-K JOIN: NEIGHBORHOOD JOIN

[Figure: neighborhoods in a 2-length 2-hop cover, each a list of path entries such as Freddie Mercury -member- Queen -member- Brian May; Freddie Mercury -member- Queen -producer- Liar; Freddie Mercury -member- Queen -formed in- 1971; Freddie Mercury -writer- Liar. Entries from two neighborhoods are combined on a shared hop node]

Retrieve neighborhoods NBu and NBv for u and v

Join path entries in NBu and NBv on hop nodes (rank join on sorted inputs)
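A rank join over two score-sorted neighborhoods can be sketched as follows. This is a generic threshold-style rank join written for illustration, not the operator from the talk: the entry layout `(hop_node, path, score)`, the additive scoring and all scores in the example are assumptions.

```python
import heapq
from itertools import count

NEG_INF = float("-inf")

def rank_join(left, right, k):
    """Top-k join of two inputs on their hop node.

    left, right: lists of (hop_node, path, score), sorted by descending score.
    Returns up to k tuples (combined_score, left_path, right_path), best first.
    """
    inputs = (left, right)
    seen = ({}, {})     # per input: hop node -> entries read so far
    pos = [0, 0]        # per input: number of entries read
    candidates = []     # max-heap of joined results (negated scores)
    tie = count()       # tie-breaker so the heap never compares paths
    results = []

    def top(i):
        return inputs[i][0][2] if inputs[i] else NEG_INF

    def unread(i):
        # Upper bound on the score of any entry of input i not read yet.
        return inputs[i][pos[i]][2] if pos[i] < len(inputs[i]) else NEG_INF

    while len(results) < k:
        # No result that has not been formed yet can score above this bound.
        threshold = max(unread(0) + top(1), top(0) + unread(1))
        if candidates and -candidates[0][0] >= threshold:
            neg, _, lp, rp = heapq.heappop(candidates)
            results.append((-neg, lp, rp))
            continue
        live = [i for i in (0, 1) if pos[i] < len(inputs[i])]
        if not live:
            break
        i = min(live, key=lambda j: pos[j])  # read from the input that is behind
        hop, path, score = inputs[i][pos[i]]
        pos[i] += 1
        seen[i].setdefault(hop, []).append((path, score))
        for other_path, other_score in seen[1 - i].get(hop, []):
            lp, rp = (path, other_path) if i == 0 else (other_path, path)
            heapq.heappush(candidates, (-(score + other_score), next(tie), lp, rp))
    return results

# Hypothetical neighborhoods of "Freddie Mercury" and "Single" with made-up scores.
NB_U = [("Liar", "Freddie Mercury -writer- Liar", 3),
        ("Queen", "Freddie Mercury -member- Queen", 2)]
NB_V = [("Liar", "Single - Liar", 3),
        ("Queen", "Single - Liar -producer- Queen", 1)]

best = rank_join(NB_U, NB_V, 1)
print(best)
```

Because both inputs are sorted, the best result can be reported as soon as its score reaches the threshold, without reading the remaining entries.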

Page 28: Big data search

TOP-K JOIN: GRAPH JOIN

Keyword Graphs

Comprise all paths of max length 2d between Freddie Mercury and Queen

Expand to obtain Keyword Graph Neighborhoods containing free hop nodes

[Figure: keyword graphs connecting Freddie Mercury (Artist) and Queen via member, expanded through hop nodes towards Liar (Single)]

Page 29: Big data search

KEYWORD QUERY PROCESSING / PLANNING

Process

• Index access to retrieve keyword neighborhoods

• Rank (neighborhoods/graph) join to connect keyword elements

Planning: which join order?

[Figure: keyword elements Freddie Mercury, writer, Queen and Single to be joined]

Page 30: Big data search

KEYWORD QUERY PROCESSING / PLANNING

Join order also determines results

• No single join order delivers all results (some might even be empty)

• We do not know in advance which orders deliver which results

Consider all possible join orders

“written by freddie queen single 1971”

[Figure: data graph with nodes Freddie Mercury, Queen, Liar, Single and 1971 and edges member, producer, writer and formed in, together with alternative join orders over the keyword elements Freddie Mercury, writer, Queen, Single and 1971]

Produce results for d = 1!

Produce no results for d = 1!

Page 31: Big data search

INTEGRATED QUERY PLAN

Terminate early after computing top-k instead of all results

• Use rank join operators

• Introduce top-k union operator

[Figure: integrated query plan over the keyword elements Freddie Mercury, writer, Queen and Single]
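A top-k union can be sketched as a lazy merge of the ranked outputs of several sub-plans that stops after k results. The sketch uses Python's standard `heapq.merge`; the sub-plan names, scores and the `log` list used to observe how much each sub-plan was asked to produce are hypothetical.

```python
import heapq
from itertools import islice

def top_k_union(k, *ranked_streams):
    """Lazily merge score-sorted (descending) streams and stop after k results."""
    merged = heapq.merge(*ranked_streams, key=lambda r: r[0], reverse=True)
    return list(islice(merged, k))

def sub_plan(name, scored_results, log):
    """Simulate a sub-plan (one join order) producing results in rank order."""
    for score, result in scored_results:
        log.append(name)  # record that this sub-plan had to produce a result
        yield score, result

log = []
plan_a = sub_plan("a", [(9, "r1"), (4, "r4")], log)
plan_b = sub_plan("b", [(7, "r2"), (6, "r3"), (1, "r5")], log)
plan_c = sub_plan("c", [], log)  # a join order that yields no results

top2 = top_k_union(2, plan_a, plan_b, plan_c)
print(top2)
print(len(log), "of 5 results were computed")
```

The empty sub-plan costs nothing, and the non-empty ones are only pulled as far as needed to fill the top-k.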

Page 32: Big data search

TOP-K PLANS

Integrated Query Plan is a composition of sub-plans

• Some might produce no results

• Some sub-plans produce results earlier than others

Rank not only results, but also rank operators (hence plans)

• Global score of rank join operator, based on current results and upper bounds for subsequent join operations

• Only the operator with the highest global score can push results to subsequent operators

• Otherwise, activate lower level data access operators

Page 33: Big data search

INDEX-BASED TOP-K KEYWORD QUERY PROCESSING [CIKM11B]

Benefits

• One order of magnitude faster than online graph exploration

• Compared with graph indexing approaches, our solution reduces storage requirements by up to 86% and improves performance by more than 50% on average

Page 34: Big data search

SEARCH TECHNOLOGY INNOVATIONS

Integrated

Zero Upfront Effort / On-Demand

• Does not require preprocessing, upfront integration (Watson)

Fresh Results / Timely Response

Relational

• Entities (Yahoo!, Google, Facebook Graph Search)

• Plus relations, paths, graphs…

Zero Manual Effort

• Does not require expert to specify search forms (E-commerce search), structure templates, translation rules and domain adaptation (Wolfram Alpha, Watson)

• Interpretation of keywords and structural context, i.e. relevant relations between entities through online graph exploration

Page 35: Big data search

WHAT HAVE WE ACHIEVED?

Volume: fast access? all data/datasets?

• Quick IR-style keyword-based lookup

• Reduce search space / result candidates

• Handle hundreds of datasets with response times within a few seconds (with local sources)

• Ranking performance consistently superior to the state of the art (20% improvement in terms of F-measure) according to the keyword search benchmark 2012

• Structured, semi-structured, unstructured? Hybrid data management?

Page 36: Big data search

WHAT HAVE WE ACHIEVED?

Velocity: fresh results? preprocessing?

• On-demand stream-based processing, i.e. exploration of sources, data integration and result combination at querying time

• No need to process / store all data

• Fresh results from external sources can be guaranteed

Page 37: Big data search

WHAT HAVE WE ACHIEVED?

Variety: different datasets, schemas and formats

• Interpretation of data semantics and matching across datasets performed at querying time

• No assumptions of schema, i.e. can handle unknown, possibly semi-structured data

• Works well when data sources are homogeneous, i.e. overlaps are large and matching signals are numerous and specific

• Heterogeneous data from different domains with small overlaps / no specific matching signals?

Page 38: Big data search

BIG PICTURE

Previous & Current & Future Work

Acquire

• Source selection [ISWC10, TKDE12b]

• Stream-based processing of external sources [ISWC10b]

• Combining local & external sources [ESWC12]

Organize

• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]

• On-demand search-driven data integration [WebSci12]

• Heterogeneous data integration [ICDE13, WSDM13]

• Integration of hybrid big data

Analyze

• Descriptive entity summary [ISWC11]

• Structural summary of datasets [TKDE12a]

• Probabilistic models of text and structure [ICML13, SIGMOD13]

• Hybrid big data management

Search

• Entity & relational search and ranking [SIGIR11,CIKM11b]

• Keyword query processing [ICDE09, SIGMOD09]

• Explorative Linked Data query processing [ESWC11]

• Multi-datasets search [WWW12]

Volume: Fast access? All data/datasets?

Velocity: Fresh results? Preprocessing?

Variety: Heterogeneous Datasets/Schemas, Structured + Unstructured

Page 39: Big data search

CONCLUSIONS

Vision

• Enabling end users to retrieve and explore relevant knowledge from Big Linked Data via intuitive interfaces!

Status quo

• End users can retrieve complex knowledge (complex graphs) from hundreds of Linked Data sources

1-3 years from now

• Improve “integrated view” coverage from 30% to 80%

• Coverage of structured and unstructured results (from sensors, social networks etc.)

3-5 years from now

• Robust probabilistic models of hybrid Big Linked Data

• For search, ranking, as well as analytics and prediction?

Page 40: Big data search

THANKS!

Tran Duc Thanh

http://sites.google.com/site/kimducthanh/

Page 41: Big data search

REFERENCES (1)

• [ICML13] Veli Bicer, Thanh Tran. Topical Relational Model. Submitted to International Conference on Machine Learning (ICML'13).

• [SIGMOD13] TopGuess: Query Selectivity Estimation over Text-rich Data Graphs. Submitted to SIGMOD13.

• [ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April, 2013

• [WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February, 2013

• [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal.

• [TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal.

• [WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April, 2012

• [WebSci12] Thanh Tran, Yongtao Ma, and Gong Cheng. Pay-less Entity Consolidation – Exploiting Entity Search User Feedbacks for Pay-as-you-go Entity Data Integration. In Proceedings of Web Science Conference 2012 (WebSci'12). Evanston, USA, June, 2012

• [CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October, 2011

• [CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October, 2011

Page 42: Big data search

REFERENCES (2)

• [ISWC11] Gong Cheng, Thanh Tran and Yuzhong Qu. RELIN: Relatedness and Informativeness-based Centrality for Entity Summarization. In Proceedings of 10th International Semantic Web Conference (ISWC'11). Koblenz, Germany, October, 2011

• [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11). Beijing, China, July, 2011

• [DEXA11] Andreas Wagner, Günter Ladwig, Thanh Tran. Browsing-oriented Semantic Faceted Search. In Proceedings of 22nd International Conference on Database and Expert Systems Applications (DEXA'11). Toulouse, France, August, 2011

• [ISWC10a] Thanh Tran, Lei Zhang, Rudi Studer. Summary Models for Routing Keywords to Linked Data Sources. In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November, 2010

• [ISWC10b] Günter Ladwig, Thanh Tran. Linked Data Query Processing Strategies. In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November, 2010

• [JWS09] Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu, Yue Pan. Semplore: A Scalable IR Approach to Search the Web of Data. In Journal of Web Semantics 7 (3), September, 2009

• [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009

• [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009

Page 43: Big data search

BACKUP

Page 44: Big data search

QUERY INTERPRETATION [ICDE09, SIGMOD09]

Focus on query interpretations instead of final answers

Leverage the power of underlying DB query engine for processing interpretations

Reduction of search space

• Query interpretation on structure summary generated from data

• Exploration on reduced search space!

Focus on top-k results

• Top-k procedure for exploring and finding the k best results

“written by freddie queen single”

[Figure: structure summary with nodes Freddie Mercury, Queen, Queen Elizabeth 1 and single; types Person, Artist, Band, Single and Literal; edges member, producer, writer and marital status]

<x, type, Single>
<Queen, producer, x>
<Freddie Mercury, writer, x>
<Queen, type, Band>
<Freddie Mercury, type, Artist>

Page 45: Big data search

QUERY INTERPRETATION

Benefits

• Outperforms online bidirectional search by at least one order of magnitude

• Performance comparable with index-based approaches, but requires less space

Drawbacks

• “Meaningful” interpretations may generate empty results

• Relies on DB query engine, native tailored optimization not possible

Page 46: Big data search

BIG PICTURE

Previous & Current Work

Acquire

• Source selection [ISWC10, TKDE12b]

• Stream-based processing of external sources [ISWC10b]

Organize

• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]

• On-demand search-driven data integration [WebSci12]

Analyze

• Descriptive resource summary [ISWC11]

• Structural summary of datasets [TKDE12a]

Search

• Entity & relational search and ranking [SIGIR11,CIKM11b]

• Keyword query processing [ICDE09, SIGMOD09]

• Explorative Linked Data query processing [ESWC11]

Volume: Fast access? All data/datasets?

Velocity: Fresh results? Preprocessing?

Page 47: Big data search

BIG PICTURE

Previous & Current Work

Acquire

• Source selection [ISWC10, TKDE12b]

• Stream-based processing of external sources [ISWC10b]

• Combining local & external sources [ESWC12]

Organize

• Indexes for quick lookup of entities, relations and paths [JWS09, CIKM11a]

• On-demand search-driven data integration [WebSci12]

Analyze

• Descriptive entity summary [ISWC11]

• Structural summary of datasets [TKDE12a]

Search

• Entity & relational search and ranking [SIGIR11,CIKM11b]

• Keyword query processing [ICDE09, SIGMOD09]

• Explorative Linked Data query processing [ESWC11]

• Multi-datasets search [WWW12]

Volume: Fast access? All data/datasets?

Velocity: Fresh results? Preprocessing?

Variety: Heterogeneous Datasets/Schemas, Structured + Unstructured

Page 48: Big data search

SEMANTIC SEARCH TECHNIQUES FOR LINKING

Linking homogeneous data

• Given structured entity description, find matching entities described using same/similar schema

Linking heterogeneous data

• Given structured entity, find matching entities described using different schemas

Linking hybrid data

• Given text mentions, find matching entities (no schema)

Keyword search

• Given keywords, find matching entities (no schema)

[Figure: each task pairs the entity record (name: Tran Thanh, age: 31) with a differently structured counterpart: a record (label: Tran Duc Thanh, age: 31); a record (id: p1, description: "Tran Duc Thanh, age 31, works at.."); a text (content: "Tran Duc Thanh, a researcher at KIT…"); a keyword query ("Tran Duc Thanh")]

Search-based Linking

• Adopt methods for semantic matching and ranking for schema-agnostic linking in hybrid & heterogeneous data scenarios

• Embed linking into the search process to leverage user feedback
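The idea of schema-agnostic linking can be illustrated with a deliberately simple sketch: ignore attribute names, treat every record, text or query as a bag of value tokens, and rank candidates by token overlap. Jaccard similarity is a stand-in chosen for brevity, not the matching or ranking method of the talk; the function names and the non-matching `other` record are hypothetical.

```python
import re

def tokens(value):
    """Lowercased word tokens of a record, text or query; attribute names are ignored."""
    if isinstance(value, dict):
        value = " ".join(str(v) for v in value.values())
    return set(re.findall(r"\w+", str(value).lower()))

def jaccard(a, b):
    """Token overlap between two items, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def link(source, candidates):
    """Rank candidates by token overlap with source, best first."""
    return sorted(candidates, key=lambda c: jaccard(source, c), reverse=True)

source = {"name": "Tran Thanh", "age": 31}

similar_schema = {"label": "Tran Duc Thanh", "age": 31}
different_schema = {"id": "p1", "description": "Tran Duc Thanh, age 31, works at.."}
text_mention = "Tran Duc Thanh, a researcher at KIT…"
other = {"label": "Queen", "type": "Band"}  # hypothetical non-matching record

ranked = link(source, [other, text_mention, different_schema, similar_schema])
for candidate in ranked:
    print(round(jaccard(source, candidate), 3), candidate)
```

The same function handles structured records with different schemas, free text and plain keyword queries, which is the sense in which search-style matching is schema-agnostic.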