13
An Analytical Study of Large SPARQL Query Logs Angela Bonifati * Lyon 1 University Lyon, France [email protected] Wim Martens University of Bayreuth Bayreuth, Germany [email protected] Thomas Timm University of Bayreuth Bayreuth, Germany [email protected] ABSTRACT With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end- users has become more and more common in SPARQL end- points. In this paper, we conduct an in-depth analytical study of the queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, span- ning over several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, that exhibits already interesting results on this generalized corpus, we drill deeper in the structural charac- teristics related to the graph and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as undirected graphs, and charac- terize their (hyper-)tree width. Moreover, we analyze the evolution of queries over time, by introducing the novel con- cept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights on the already rich query features of real SPARQL queries formulated by real users, and brings us to draw a number of conclusions and pinpoint future di- rections for SPARQL query evaluation, query optimization, tuning, and benchmarking. PVLDB Reference Format: Angela Bonifati, Wim Martens, and Thomas Timm. An Ana- lytical Study of Large SPARQL Query Logs. PVLDB, 11(2): 149-161, 2017. DOI: 10.14778/3149193.3149196 1. INTRODUCTION As more and more data is exposed in RDF format, we are witnessing a compelling need from end-users to formu- late more or less sophisticated queries on top of this data. * Partially supported by CNRS Mastodons MedClean. Supported by DFG grant MA 4938/2-1. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil. Proceedings of the VLDB Endowment, Vol. 11, No. 2 Copyright 2017 VLDB Endowment 2150-8097/17/10... $ 10.00. DOI: 10.14778/3149193.3149196 SPARQL endpoints are increasingly used to harvest query results from available RDF data repositories. But how do these end-user queries look like? As opposed to RDF data, which can be easily obtained under the form of dumps (DB- pedia and Wikidata dumps [34, 35, 40]), query logs are often inaccessible, yet hidden treasures to understand the actual usage of these data. In this paper, we investigate a large cor- pus of query logs from different SPARQL endpoints, which spans over several years (2009–2017). In comparison to pre- vious studies on real SPARQL queries [3, 15, 25, 30, 31], which typically 1 investigated query logs of a single source, we consider a multi-source query corpus that is two orders of magnitude larger. Furthermore, our analysis goes sig- nificantly deeper. In particular, we are the first to do a large-scale analysis on the topology of queries, which has seen significant theoretical interest in the last decades (e.g., [9, 12, 14]) and is now being used for state-of-the-art struc- tural decomposition methods for query optimization [1, 2, 19]. As a consequence, ours is the first analytical study on real (and most recent) SPARQL queries from a variety of domains reflecting the recent advances in theoretical and system-oriented studies of query evaluation. Our paper makes the following contributions. Apart from classical measures of syntactic properties of the investigated queries, such as their keywords, their number of triples, and operator distributions, which we apply to our new corpus, we also mine the usage of projection in queries and subqueries in the various datasets. Projection indeed is the cause of increased complexity (from Ptime to NP-Complete) of the following central decision problem in query evaluation [8, 6, 22]: Given a conjunctive query Q, a database D, and a candidate answer a, is a an answer of Q on D? We then proceed by considering queries under their graph and hypergraph structures. Such structural aspects of queries have been investigated in the theory community for over two decades [12] since they can indicate when queries can be evaluated efficiently. Recently, several studies on new join algorithms leverage the hypergraph structure of queries in the contexts of relational and RDF query processing [1, 19]. Theoretical research in this area traditionally focused on conjunctive queries (CQs). For CQs, we know that tree- likeness of their structure leads to polynomial-time query evaluation [12]. For larger classes of queries, the topology of the graph of a query is much less informative. For instance, if we additionally allow SPARQL’s Opt operator, evalua- 1 The exception is [15], where logs from the Linked SPARQL Queries (LSQ) dataset were studied, combining data from four sources (from 2010 and 2014) that we also consider. 149

An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

An Analytical Study of Large SPARQL Query Logs

Angela Bonifati∗

Lyon 1 UniversityLyon, France

[email protected]

Wim MartensUniversity of BayreuthBayreuth, Germany

[email protected]

Thomas Timm†

University of BayreuthBayreuth, Germany

[email protected]

ABSTRACTWith the adoption of RDF as the data model for LinkedData and the Semantic Web, query specification from end-users has become more and more common in SPARQL end-points. In this paper, we conduct an in-depth analyticalstudy of the queries formulated by end-users and harvestedfrom large and up-to-date query logs from a wide varietyof RDF data sources. As opposed to previous studies, oursis the first assessment on a voluminous query corpus, span-ning over several years and covering many representativeSPARQL endpoints. Apart from the syntactical structure ofthe queries, that exhibits already interesting results on thisgeneralized corpus, we drill deeper in the structural charac-teristics related to the graph and hypergraph representationof queries. We outline the most common shapes of querieswhen visually displayed as undirected graphs, and charac-terize their (hyper-)tree width. Moreover, we analyze theevolution of queries over time, by introducing the novel con-cept of a streak, i.e., a sequence of queries that appear assubsequent modifications of a seed query. Our study offersseveral fresh insights on the already rich query features ofreal SPARQL queries formulated by real users, and bringsus to draw a number of conclusions and pinpoint future di-rections for SPARQL query evaluation, query optimization,tuning, and benchmarking.

PVLDB Reference Format:Angela Bonifati, Wim Martens, and Thomas Timm. An Ana-lytical Study of Large SPARQL Query Logs. PVLDB, 11(2): 149 - 161, 2017.DOI: 10.14778/3149193.3149196

1. INTRODUCTIONAs more and more data is exposed in RDF format, we

are witnessing a compelling need from end-users to formu-late more or less sophisticated queries on top of this data.

∗Partially supported by CNRS Mastodons MedClean.†Supported by DFG grant MA 4938/2-1.

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. Articles from this volume were invited to presenttheir results at The 44th International Conference on Very Large Data Bases,August 2018, Rio de Janeiro, Brazil.Proceedings of the VLDB Endowment, Vol. 11, No. 2Copyright 2017 VLDB Endowment 2150-8097/17/10... $ 10.00.DOI: 10.14778/3149193.3149196

SPARQL endpoints are increasingly used to harvest queryresults from available RDF data repositories. But how dothese end-user queries look like? As opposed to RDF data,which can be easily obtained under the form of dumps (DB-pedia and Wikidata dumps [34, 35, 40]), query logs are ofteninaccessible, yet hidden treasures to understand the actualusage of these data. In this paper, we investigate a large cor-pus of query logs from different SPARQL endpoints, whichspans over several years (2009–2017). In comparison to pre-vious studies on real SPARQL queries [3, 15, 25, 30, 31],which typically1 investigated query logs of a single source,we consider a multi-source query corpus that is two ordersof magnitude larger. Furthermore, our analysis goes sig-nificantly deeper. In particular, we are the first to do alarge-scale analysis on the topology of queries, which hasseen significant theoretical interest in the last decades (e.g.,[9, 12, 14]) and is now being used for state-of-the-art struc-tural decomposition methods for query optimization [1, 2,19]. As a consequence, ours is the first analytical study onreal (and most recent) SPARQL queries from a variety ofdomains reflecting the recent advances in theoretical andsystem-oriented studies of query evaluation.

Our paper makes the following contributions. Apart fromclassical measures of syntactic properties of the investigatedqueries, such as their keywords, their number of triples, andoperator distributions, which we apply to our new corpus, wealso mine the usage of projection in queries and subqueriesin the various datasets. Projection indeed is the cause ofincreased complexity (from Ptime to NP-Complete) of thefollowing central decision problem in query evaluation [8,6, 22]: Given a conjunctive query Q, a database D, and acandidate answer a, is a an answer of Q on D?

We then proceed by considering queries under their graphand hypergraph structures. Such structural aspects of querieshave been investigated in the theory community for overtwo decades [12] since they can indicate when queries canbe evaluated efficiently. Recently, several studies on newjoin algorithms leverage the hypergraph structure of queriesin the contexts of relational and RDF query processing [1,19]. Theoretical research in this area traditionally focusedon conjunctive queries (CQs). For CQs, we know that tree-likeness of their structure leads to polynomial-time queryevaluation [12]. For larger classes of queries, the topology ofthe graph of a query is much less informative. For instance,if we additionally allow SPARQL’s Opt operator, evalua-

1The exception is [15], where logs from the Linked SPARQLQueries (LSQ) dataset were studied, combining data fromfour sources (from 2010 and 2014) that we also consider.

149

Page 2: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

tion can be NP-complete even if the structure is a tree [6].For this reason, we focus our structural study on CQ-likequeries.2 We develop a shape classifier for such queries andidentify their most occurring shapes. Interestingly enough,these queries have quite regular shapes. The overwhelmingmajority of the queries is acyclic (i.e., tree- or forest-shaped).We discovered that the cyclic queries mostly consist of acentral node with simple, small attachments (which we callflower). In terms of tree- and hypertreewidth, we discov-ered that the cyclic queries have width two, up to a fewexceptions with width three.

At this point we should make a note about interpreta-tion of our results. Even though almost all CQ-like querieshave (hyper-)treewidth one, we do not want to claim thatqueries of larger treewidth are not important in practice.The overwhelming majority of the queries we see in the logsare very small and simple, which we believe to be typical forSPARQL endpoint logs. For instance, the majority (>55%)of the queries in our logs only use one triple. One of ourdata sets, WikiData17 is not a SPARQL endpoint log andwe see throughout the paper that it has completely differentcharacteristics.

In order to gauge the performances of cyclic and acyclicqueries from a practical viewpoint, we have run a compara-tive analysis of chain and cycle queries synthetically gener-ated with an available graph and query workload generator[4]. This experiment showed different behaviors of SPARQLquery engines, such as Blazegraph and PostgreSQL withquery workloads of CQs of increasing sizes (intended as num-ber of conjuncts). It also lets us grasp a tangible differencebetween chain and cycle queries in either query engine, thisdifference being more pronounced for PostgreSQL. We mayinterpret this result as a lack of maturity of practical queryengines for cyclic queries, thus motivating the need of spe-cific query optimization techniques for such queries as in [1,19].

Finally, we deal with the problem of identifying sequencesof similar queries in the query logs. These queries are thenclassified as gradual modifications of a seed query, possiblyby the same user. We measure the length of such streaksin three log files from DBpedia. We conclude our studywith insights on the impact of our analytical study of largeSPARQL query logs on query evaluation, query optimiza-tion, tuning, and benchmarking.

Related Work. Whereas several previous studies have fo-cused on the analysis of real SPARQL queries, they havemainly investigated statistical features of the queries, suchas occurrences of triple patterns, types of queries, or queryfragments [3, 15, 25, 31]. The only early study that in-vestigated the relationship between structural features ofpractical queries and query evaluation complexity has beenpresented in [30]. However, they focus on a limited corpus(3M queries from DBpedia 2010) and in that sense theirfindings cannot be generalized. Our work moves onwardby precisely characterizing the occurrences of conjunctiveand non-conjunctive patterns under the latest complexityresults, by performing an accurate shape analysis of thequeries under their (hyper-)graph representation and intro-ducing the evolution of queries over time. USEWOD and

2We consider extensions with Filter and Opt, but only thosefor which we know that tree-likeness of their graph ensuresthe existence of efficient evaluation algorithms.

Table 1: Sizes of query logs in our corpus.

Source Total #Q Valid #Q Unique #Q

DBpedia9/12 28,534,301 27,097,467 13,437,966DBpedia13 5,243,853 4,819,837 2,628,005DBpedia14 37,219,788 33,996,480 17,217,448DBpedia15 43,478,986 42,709,778 13,253,845DBpedia16 15,098,176 14,687,869 4,369,781

LGD13 1,841,880 1,513,868 357,842LGD14 1,999,961 1,929,130 628,640

BioP13 4,627,271 4,624,430 687,773BioP14 26,438,933 26,404,710 2,191,152

BioMed13 883,374 882,809 27,030

SWDF13 13,762,797 13,618,017 1,229,759

BritM14 1,523,827 1,513,534 135,112

WikiData17 309 308 308

Total 180,653,910 173,798,237 56,164,661

DBpedia datasets have also been considered in [3]. It takesinto account the log files from DBpedia and SWDF reachinga total size of 3M. The work mainly investigates the numberof triples and joins in the queries. Based on the observationof [28] that SPARQL graph patterns are typically chain- orstar-shaped, they also look at their occurrences. They foundvery scarce chains and high coverage of almost-star-shapedgraph patterns, but they do not characterize the latter. Tothe best of our knowledge, we are the first to carry out acomprehensive shape analysis on such a large and diversecorpus of SPARQL queries.

A query analysis and clustering of DBPedia SPARQLqueries has been performed in [26] in order to build a setof prototypical benchmarking queries. Query logs have beeninspected in a user study in [17] to understand whether factsthat are queried together provide intra-fact relatedness inthe Linked Open Data graph. The objectives of both pa-pers are different from the one pursued in our work.

2. DATA SETSOur data set has a total of 180, 653, 910 queries, which

were obtained as follows. We obtained the 2013–2016 USE-WOD query logs, DBPedia query logs for 2013, 2014, 2015and 2016 directly from Openlink3, the 2014 British Museumquery logs from LSQ4, and we crawled the user-submittedexample queries from WikiData5 in February 2017. Theselog files are associated with 7 different data sources from var-ious domains: DBpedia, Semantic Web Dog Food (SWDF),LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum (BritM), and WikiData.

Table 1 gives an overview of the analyzed query logs,along with their main characteristics. Since we obtainedlogs for DBpedia from different sources, we proceeded asfollows. DBpedia9/12 contains the DBpedia logs from USE-WOD’13, which are query logs from 2009–2012. All otherDBpedia’X sets contain the query logs from the year ’X, be

3http://www.openlinksw.com4http://aksw.github.io/LSQ/5https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples

150

Page 3: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

it from USEWOD or from Openlink.6 We first cleaned thelogs, since some contained entries that were not queries (e.g.,http requests). In the following we only report on the actualSPARQL queries in the logs. For each of the logs, the ta-ble summarizes the total number of queries (Total) and thenumber of queries that we could parse using Apache Jena3.0.1 (Valid). From the latter set, we removed duplicatequeries, resulting in the unique queries that we could parse(Unique) and on which we focus in the remainder of thepaper.7 In summary, our corpus of query logs contains thelatest blend of USEWOD and Openlink DBPedia query logs(the latter providing 51M more queries in the period 2013–2016 than the USEWOD corpus), plus BritM and Wikidataqueries. We are not aware of other existing studies on sucha large and up-to-date corpus. Finally, although the onlineWikiData example queries (Feb 13th, 2017) are a manuallycurated set, there was one query that we could not parse.8

In the total unique data set, 2,496,806 queries (4.47%) donot have a body. All these queries are Describe queries andalmost exclusively occur in DBpedia14– DBpedia16.

3. PRELIMINARIESWe recall some basic definitions on RDF and SPARQL

[29, 30]. We closely follow the exposition of [30].

RDF. RDF data consists of a set of triples 〈s, p, o〉 where werefer to s as subject, p as predicate, and o as object. Accord-ing to the specification, s, p, and o can come from pairwisedisjoint sets I (IRIs), B (blank nodes), and L (literals) asfollows: s ∈ I ∪B, p ∈ I, and o ∈ I ∪B∪L. For this paper,the distinction between IRIs, blank nodes, and literals is notimportant.

SPARQL. For our purposes, a SPARQL query Q can beseen as a tuple of the form

(query-type, pattern P , solution-modifier).

We now explain how such queries work conceptually. Thecentral component is the Pattern P , which contains patternsthat are matched onto the RDF data. The result of thispart of the query is a multiset of mappings that match thepattern to the data.

The solution-modifier allows aggregation, grouping, sort-ing, duplicate removal, and returning only a specific window(e.g., the first ten) of the multiset of mappings returned bythe pattern. The result is a list L of mappings.

The query-type determines the output of the query. Itis one of four types: Select, Ask, Construct, and Describe.Select-queries return projections of mappings from L. Ask-queries return a Boolean and answer true iff the patternP could be matched. Construct queries construct a newset of RDF triples based on the mappings in L. Finally,Describe queries return a set of RDF triples that describes

6We discovered that we received three log files from USE-WOD as well as from Openlink, in the sense that only thehash values used for anonymisation were different. Theseduplicate log files were deleted prior to all analysis and arenot taken into account in Table 1.7We report in [7] the results for the Valid corpus, containingduplicates.8The query was called “Public Art in Paris” and was mal-formed (closing braces were missing and it had a bad aggre-gate). It was still malformed on June 29th, 2017.

the IRIs and the blank nodes in L. The exact output ofDescribe queries is implementation-dependent. Such queriesare meant to help users explore the data. Compared to[30], we allow more solution modifiers and more complexpatterns, as explained next.

Patterns. Let V = {?x, ?y, ?z, ?x1, . . .} be an infinite set ofvariables, disjoint from I, B, and L. As in SPARQL, wealways prefix variables by a question mark. A triple patternis an element of (I ∪ B ∪ V)× (I ∪ V)× (I ∪ B ∪ L ∪ V). Aproperty path is a regular expression over the alphabet I. Aproperty path pattern is an element of (I ∪B∪V)×pp× (I ∪B∪L∪V), where pp is a property path. A SPARQL patternis an expression generated from the following grammar:

P ::= t | pp | Q | P1 And P2 | P Filter R| P1 Union P2 | P1 Opt P2 | Graph iv P

Here, t is a triple pattern, pp is a property path pattern, Qis again a SPARQL query, R is a so-called SPARQL filterconstraint, and iv ∈ I∪V. We note that property paths (pp)and subqueries (Q) in the above grammar are new featuressince SPARQL 1.1. SPARQL filter constraints R are built-in conditions which can have unary predicates, (in)equalitiesbetween variables, and Boolean combinations thereof. Werefer to the SPARQL 1.1 recommendation [16] and the liter-ature [29] for the precise syntax of filter constraints and thesemantics of SPARQL queries. We write vars(P ) to denotethe set of variables occurring in P .

We illustrate by example how our definition correspondsto real SPARQL queries. The following query comes fromWikiData (“Locations of archaeological sites”, from [34]).

SELECT ?label ?coord ?subj

WHERE

{?subj wdt:P31/wdt:P279* wd:Q839954 .

?subj wdt:P625 ?coord .

?subj rdfs:label ?label filter(lang(?label)="en")}

The query uses the property path wdt:P31/wdt:P279*, lit-eral wd:Q839954, and triple pattern ?subj wdt:P625 ?coord.It also uses a filter constraint. In SPARQL, the And operatoris denoted by a dot (and is sometimes implicit in alternative,even more succinct syntax).

Finally, we define conjunctive queries, which are a centralclass of queries in database research and which we will buildon in the remainder of the paper. In the context of SPARQL,we define them as follows.

Definition 3.1. A conjunctive query (CQ) is a SPARQLpattern that only uses the triple patterns and the operatorAnd.

4. SHALLOW ANALYSISIn this section we investigate simple syntactical properties

of queries.

4.1 KeywordsA basic usage analysis of SPARQL features was done by

counting the keywords in queries. The results are in Ta-ble 2.9

9We also investigated the occurrence of other operators(Service, Bind, Assign, Data, Dataset, Values, Sample, GroupConcat), each of which appeared in less than 1% of thequeries. We omit them from the table for succinctness.

151

Page 4: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

Table 2: Keyword count in queries

Element Absolute Relative

Select 49,409,913 87.97%Ask 2,789,420 4.97%

Describe 2,578,311 4.59%Construct 1,386,908 2.47%

Distinct 12,198,198 21.72%Limit 9,545,249 17.00%

Offset 3,455,500 6.15%Order By 1,159,231 2.06%

Filter 22,547,561 40.15%And 15,863,942 28.25%

Union 10,465,706 18.63%Opt 9,106,419 16.21%

Graph 1,519,899 2.71%Not Exists 926,849 1.65%

Minus 766,380 1.36%Exists 5,499 0.01%

Count 320,035 0.57%Max 3,660 0.01%Min 3,632 0.01%Avg 263 < 0.01%Sum 68 < 0.01%

Group By 168,444 0.30%Having 12,276 0.02%

The first block in Table 2 describes the type of queries.In total, 87.97% of the queries are Select queries, 4.97% areAsk queries, 4.59% Describe queries, and 2.47% Constructqueries. There are, however, tremendous differences be-tween the data sets. BioMed13 has less than 13% Selectqueries and almost 85% Describe queries, whereas LGD13

has 28% Select queries and 71% Construct queries. Evenwithin the same kind of data, we see significant differences.DBpedia16 has 62% Select queries (and 34% Describe queries),whereas DBpedia15 has 81.5% Select queries and 11.5% Askqueries. The other DBpedia data sets have over 87.5% Selectqueries.

The second block in Table 2 contains solution modifiers,ordered by their popularity.10 Looking into the specific datasets, we see the following things stand out. Almost all (97%)of BritM14 queries use Distinct. This is similar, but to alesser extent in BioP13 (82%) and BioP14 (69%). In DBPe-dia we again see significant differences. From ’12 to ’16, wehave 18%, 8%, 11%, 38%, and 8% of queries with Distinctrespectively.

Limit is used most widely in SWDF13 (47%) and LGD14

(41%). The most prevalent data sets for queries with Offsetare LGD14 (38%), LGD13 (13%), and DBpedia13 (12%).

Order By is used by far the most in WikiData (42%), whichmay be due to the case that WikiData17 is not a querylog, but a wiki page that contains cherry-picked and user-submitted queries. These queries are intended to showcasesystem’s behavior or highlight features of the WikiData dataset and should therefore produce a nice output. The otherdata sets are true query logs, which may therefore also con-tain the “development process” of queries: users start by

10The remaining solution modifier, Reduced, was only foundin 1,113 queries.

asking a query and gradually refine it until they have theone they want. (We come back to this in Section 8).

The third block has keywords associated to SPARQL al-gebra operators that occur in the body. We see that Fil-ter, And, Union, and Opt are quite common.11 The nextcommonly used operator is Graph but, looking closer at ourdata, we see that 95% of the queries using Graph originatefrom BioP13 and BioP14. In these logs, 80% and 40% ofthe queries use Graph, respectively. The use of Filter rangesfrom 61% (LGD14) to 3% or less (BioMed13, BioP13).

The fourth block has aggregation operators. We were sur-prised that these operators are used so sparsely, even thoughaggregates are only supported since SPARQL 1.1 (March2013) [16]. In all data sets, each of these operators was usedin 3% or less of the queries, except for LGD14 (31% withCount) and WikiData17 (30% with Group By). We see ahigher relative use of aggregation operators in WikiData17

than in the other sets, which we again believe is due to thefact that the WikiData17 set is not a query log.

4.2 Number of Triples in QueriesIn order to measure the size of the queries belonging to

the datasets under study, we have counted the total numberof triples of the kind 〈s, p, o〉 contained in Select and Askqueries. In this experiment, we merely counted the numberof triples contained in each query without further investi-gating the possible relationships among them (such as joinconditions, unions etc.), which are under scrutiny later inthe paper. We focus solely on Select and Ask queries be-cause these are the type of SPARQL statements that trulyquery the data, as opposed to Describe statements (whichare exploratory) and Construct statements (which constructdata).12

The plot in the upper part of Figure 1 illustrates howqueries containing 0 to 11+ triples are distributed over theSelect and Ask queries in each of the data sets. A first ob-servation that we can draw from Figure 1 is that for themajority of the datasets, the queries with a low number oftriples (from 0 to 2) have a noticeable share within the totalamount of queries per dataset. Whereas these queries arealmost the only queries present in the BioP13 and BioP14

datasets, they have the least concentration in BritM14 andWikiData17. The latter datasets have in fact unique char-acteristics, BritM14 being a collection of queries with fixedtemplates and WikiData17 being the most diverse datasetof all, gathering queries of rather disparate nature that arerepresentatives of classes of real queries issued on Wiki-data. Finally, DBpedia9/12–DBpedia16, along with LGD14

and BioMed13 are the datasets exhibiting the most complexqueries with extremely high numbers of triples exceeding 10.

We should note that BioMed13 has almost 85% Describequeries and 2.42% Construct queries. The numbers reportedhere only describe the remaining 12.87%. The table at thebottom of Figure 1 shows the relative amounts of Select- andAsk queries per data set. It also shows the average number oftriples measured across all queries within each dataset. Wecan notice a relative increase of this average for DPpediafrom year 2014 up to year 2016 and BioPortal and LGD in

11Conjunctions in SPARQL are actually denoted by “.” or“;” for brevity, but we group them under “And” in this paperfor readability.

12For instance, 97% of the Describe statements in our corpusdo not have a body and therefore no triples.

152

Page 5: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

DBpedia9/12

DBpedia13

DBpedia14

DBpedia15

DBpedia16

LGD13

LGD14

BioP13

BioP14

BioMed13

SWDF13

BritM14

WikiData17

0 %

20 %

40 %

60 %

80 %

100 % 11+

10

9

8

7

6

5

4

3

2

1

0

Datasets DBpedia9/12 DBpedia13 DBpedia14 DBpedia15 DBpedia16 LGD13 LGD14 BioP13 BioP14 BioMed13 SWDF13 BritM14 WikiData17

S/A 99.15% 91.88% 95.38% 93.05% 63.99% 29.01% 97.47% 100% 99.69% 12.87% 96.14% 98.64% 99.68%Avg#T 2.38 3.98 2.09 2.94 3.78 3.19 2.65 1.16 1.42 2.44 1.51 5.47 3.94

Figure 1: Percentages of queries exhibiting different number of triples (in colors) for each dataset (top), theaverage numbers of triples of the queries for each dataset (Avg#T , bottom), and the percentage of Select/Askqueries (S/A, bottom).

between years 2013 and 2014. As expected, BritishM andWikidata have also relatively high average number of triples,compared to the other datasets.

Overall, we see that 56.45% of the Select and Ask queriesin our corpus use at most one triple, 90.76% uses at mostsix triples, and 99.32% at most twelve triples. The largestqueries we found came from DBpedia15 (209 and 211 triples)and BioMed13 (221 and 229 triples).

4.3 Operator DistributionIn Table 2 we see that Filter, And, Union, Opt, and Graph

are used fairly commonly in the bodies of Select- and Askqueries. We then investigated how these operators occurtogether. In particular, we investigated for which queriesthe body only uses constructs with these operators.13 14

The results are in Table 3, which has two kinds of rows.Each white row has, on its left, a set S of operators fromO = {Filter,And,Opt,Graph,Union} and, on its right, theamount of queries in our logs for which the body uses exactlythe operators in S (and none from O\S). The value for noneis the amount of queries that do not use any of the operatorsin O (including queries that do not have a body).

Conjunctive patterns with filters are considered to be animportant fragment of SPARQL patterns, because they arebelieved to appear often in practice [28, 39]

13There is one exception: For Wikidata, we removed SER-VICE subqueries before the analysis (which appears in 222of its queries and is used to change the language of the out-put).

14This study closely follows a similar one [30] that was doneon logs from DBpedia 2010. Our numbers should be com-pared to the numbers of ULog (the duplicate-free log) in[30].

Definition 4.1. A conjunctive pattern with filters (CPF)is a SPARQL pattern that only uses triples and the operatorsAnd and Filter.

Our logs contain 66.27% CPF patterns. Adding Opt to theCPF fragment would increase its relative size with 8.56%,resulting in 74.83% of our queries. (Similarly for Graph andUnion.) Table 3 classifies 96.37% of the Select- and Askqueries in our corpus. The remaining queries either useother combinations from O (0.30%) or use other featuresthan those in O in their body (3.33%) like Bind, Minus, sub-queries, or property paths.

There is a close relationship between CPF patterns andconjunctive queries that, in some cases, can be extended toalso include queries with Opt and Graph. We discuss this inmore detail in Section 5.

4.4 Subqueries and ProjectionOnly 304,234 (0.54%) queries in our corpus use subqueries.

The feature was most used in WikiData (9.74%), about anorder of magnitude more than in any of the other data sets.

Projection plays a crucial role in the complexity of queryevaluation. Many papers [6, 22, 20, 29, 30] define evalua-tion as the following question: Given an RDF graph G, aSPARQL pattern P , and a mapping µ, is µ an answer toP when evaluated on G? In other words, the question is toverify if a candidate answer µ is indeed an answer to thequery. If P is a CQ, this problem is NP-complete if thequeries use projection [8, 6, 22], but its complexity drops toPtime if projection is absent [29, 6, 22].15 Therefore, the

15This difference can be understood as follows: If the querytests the presence of a k-clique, then without projection weare given a k-tuple of nodes and need to verify if they form ak-clique. With projection, we need to solve the NP-completek-clique problem.

153

Page 6: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

Table 3: Sets of operators used in queries: Filter (F),And (A), Opt (O), Graph (G), and Union (U)

Operator Set Absolute Relative

none 17,482,313 33.49%F 9,936,557 19.04%A 3,911,748 7.49%

A, F 3,261,138 6.25%CPF subtotal 31,330,554 66.27%

O 542,900 1.04%O, F 1,791,512 3.43%A, O 1,728,907 3.31%

A, O, F 406,131 0.78%CPF+O +4,469,450 +8.56%

G 1,380,764 2.65%CPF+G +1,432,090 +2.74%

U 3,895,524 7.46%U, F 198,693 0.38%A, U 817,958 1.57%

A, U, F 812,381 1.56%CPF+U +5,724,556 +10.97%

A, O, U, F 4,084,154 7.82%

use of projection has a huge influence of the complexity ofquery evaluation.

Surprisingly, we discovered that at least 14.98% of thequeries use projection, which is about three times more thanwhat Picalausa and Vansummeren discovered in DBpedialogs from 2010 [30]. The 14.98% consists of 13.12% Selectqueries plus 1.86% Ask queries. Notice that the total numberof Ask queries (4.97%) is significantly higher, even thoughthey just return a Boolean value and one would intuitivelyexpect that almost all of them would use projection. Thereason is that most Ask queries do not use variables: theyask if a concrete RDF triple is present in the data. Follow-ing the test for projection in Section 18.2.1 in the SPARQLrecommendation [16], we classified these queries as not usingprojection.

Due to the use of the Bind operator, there was a numberof queries (1.3%) where we could not determine if they useprojection or not. Therefore the number of queries withprojection lies between 14.98% and 16.28%.

5. STRUCTURAL ANALYSISSPARQL patterns of Select or Ask queries using only triple

patterns and the operators And, Opt, and Filter (and, inparticular, not using subqueries or property paths) receivedconsiderable attention in the literature (see, e.g., [29, 20,6, 21, 22]). We refer to such patterns as And/Opt/Filterpatterns or, for succinctness, AOF patterns. Our corpushas 39.061.206 AOF patterns (74.83% of the Select- and Askqueries).

In Section 6 we investigate the graph- and hypergraphstructure of AOF patterns. The graph structure gives usa clear view on how such queries are structured and cantell us how complex such queries are to evaluate. For asignificant portion of queries, however, the graph structure isnot meaningful to capture their complexity (cf. Example 5.1)and we therefore need to turn to their hypergraph structure.

Since the graph structure may be easier to understand, weuse the graph structure whenever we can.

We provide some background on the relationship betweenthe (hyper)graph structure of queries and the complexityof their evaluation. Evaluation of CQs is NP-complete ingeneral [8], but becomes Ptime if their hypertree width isbounded by a constant [14]. Here, the hypertree width mea-sures how close the query is to a tree (the lower the width,the closer the query is to a tree). Several state-of-the-art joinevaluation algorithms (e.g., [1, 19]) effectively use the hy-pergraph structure of queries to improve their performance,even in the context of RDF processing [2]. We establish inSection 5.2 that there are significant performance differencesin today’s query engines, even when the hypertreewidth ofqueries just increases from one to two.

5.1 Graph and Hypergraph of a QueryWe first make more precise what we mean by the graph

and hypergraph of a query. An (undirected) graph G is apair (V,E) where V is its (finite) set of nodes and E is itsset of edges, where an edge e is a set of one or two nodes,i.e., e ⊆ V and |e| = 1 or |e| = 2. A hypergraph H consistsof a (finite) set of nodes V and a set of hyperedges E ⊆ 2V ,that is, a hyperedge is a set of nodes.

Most SPARQL patterns do not use variables as predicates,that is, they use triple patterns (s, p, o) where p is an IRI.We call such patterns graph patterns. Evaluation of graphpatterns is tightly connected to finding embeddings of thegraph representation of the query into the data.16 We definethe canonical graph of graph pattern P to be the followinggraph: E = {{x, y}) | (x, `, y) is a triple pattern in P and` ∈ I} and V = {x | (x, `, y) ∈ E or (y, `, x) ∈ E}.

Hypergraph representations can be considered for all AOFpatterns. The canonical hypergraph of a SPARQL patternP is defined as E = {X | there is a triple pattern t in P suchthat X is the set of blank nodes and variables appearing t}and V = ∪e∈Ee.

Example 5.1. Consider the following (synthetic) queries:

ASK WHERE {?x1 :a ?x2 . ?x2 :b ?x3 . ?x3 :c ?x4}

ASK WHERE {?x1 ?x2 ?x3 . ?x3 :a ?x4 . ?x4 ?x2 ?x5}

Figure 2 (top left) depicts the canonical graph of the firstquery, which is a sequence of three edges. (We annotatedthe edges with their labels in the query to improve under-standing.) The bottom left graph in Figure 2 shows why wedo not consider canonical graphs for queries with variableson the predicate position in triples. The topological structureof this graph is a sequence of three edges, just as for the firstquery. This completely ignores the join condition on ?x2.For this query, the canonical hypergraph in Figure 2 (right)correctly captures the cyclicity of the query.

5.2 Comparative Evaluation of Chain and Cy-cle Queries

We conducted a set of experiments aiming at comparingthe execution times of conjunctive queries whose canonicalgraphs exhibit specific shapes. We have chosen chain and cy-cle queries in this empirical study. A chain query (of length

16In particular, it consists of finding embeddings of the di-rected and edge-labeled variant of the graph, but we omitthe edge directions and -labels for simplicity. They do notinfluence the structure and cyclicity of graph patterns.

154

Page 7: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

x1 x2 x3 x4:a :b :c

x1 x3 x4 x5x2 :a x2

x2 x1 x3

x4

x5

Figure 2: Canonical graphs and hypergraph forqueries in Example 5.1.

k) is a CQ for which the canonical graph is isomorphic to theundirected graph with edges {x0, x1}, {x1, x2}, . . . , {xk−1, xk}.(The first query in Example 5.1 is a chain query of lengththree.) A cycle query (of length k) is a CQ for which thecanonical graph is isomorphic to {x0, x1}, . . . , {xk−1, x0}.These shapes have been selected as representatives of thequeries with hypertreewidth 1 and 2, respectively, and havealso been used to compare the performances of join algo-rithms in other studies, e.g., [19].

In order to generate query workloads containing the afore-mentioned types of queries, we have used gMark [4], a pub-licly available17 schema-driven generator for graph instancesand graph queries. We tuned gMark to generate diversequery workloads, each containing 100 chain and cycle queries,respectively.18 Each workload has been generated by usingchains and cycles of different length varying from 3 to 8. Inthese experiments, we have considered and contrasted twoopposite graph database systems, namely PostgreSQL [38],an open-source relational DBMS, and BlazeGraph [36], anhigh-performance SPARQL query engine powering the Wiki-media’s official query service [40] and thus used for Wikidatareal-world queries. We have run these experiments on 2-CPUs Intel Xeon E5-2630v2 2.6 GHz server19 with 128GBRAM and running Ubuntu 16.04 LTS. We used PostgreSQLv.9.3 and Blazegraph v.2.1.4 for the experimental setup. Weemployed the Bib use case in the gMark configuration [4] forthe schema of the generated graph (of size 100k nodes) andof the generated queries as well. We employed the queryworkloads in SQL and SPARQL as generated by gMark af-ter elimination of empty unions (since gMark is geared to-wards generating UCRPQs) and of the keyword Distinct inthe body of the queries. Since gMark allowed us to obtainmixed workloads of Select/Ask queries and we wanted to fo-cus on one query type at a time, we manually replaced theSelect clauses with compatible Ask clauses.

Figure 3 (top) depicts the average runtime (in ns, logscale)of our workloads of chain (cycle, resp.) queries with lengthfrom 3 to 8 on Blazegraph (BG) and PostgreSQL (PG). Wecan observe that the overall performance of BG is superiorto that of PG. Indeed, in PG many cycles queries are timedout (after 300s per query) and we expect that the real overallperformance of PG is even worse than the results reportedin Figure 3. Figure 3 (bottom) reports the reached timeoutsfor workloads of cycle queries of various sizes when executedin PG. It is worthwhile observing that for both systems the

17https://github.com/graphMark/gmark18We recall that gMark can generate queries of four shapes:chain, star, chain-star and cycle. We have thus cherry-picked chain queries as representatives of queries with hy-pertreewidth equal to 1.

19Every CPU has 6 physical cores and, with hyperthreading,12 logical cores.

W-3 W-4 W-5 W-6 W-7 W-8

108

109

1010

1011

Avg.

quer

yru

nti

me

per

work

load

W-x

(in

ns)

chainBG chainPG cycleBG cyclePG*

W-x W-3 W-4 W-5 W-6 W-7 W-8%t/o 18% 34% 43% 39% 43% 30%

Figure 3: Execution times (top) of diverse workloadof chain/cycle queries (of length 3,4,5,6) on Blaze-graph (BG) and Postgresql (PG). Number of time-outs per workload for CyclePG only (bottom). Cy-clePG times include t/o of 300s (per query).

difference between average runtime of chain query workloadsand cycle query workloads is non negligible, thus confirm-ing that we cannot ignore the graph representation and theshape of queries. This experiment also motivated us to digdeeper in the shape analysis of our query logs, which wereport in Section 6.

5.3 Classes of Queries for (Hyper)graphsWe now discuss the classes of queries for which we will in-

vestigate graph- and hypergraph structures in Section 6. Tothe best of our knowledge, all the literature relating (hy-per)graph structure of queries to efficient evaluation wasdone on AOF patterns, which is why we only consider AOFpatterns here. The simplest such queries are the CQs, whichmotivated the classical literature on query evaluation andhypertree structure [8, 14]. We discovered that 54.58% ofthe AOF patterns are CQs.

Next, we extend CQs with Filter and Opt such that therelationship between efficient query evaluation and their (hy-per)graph structure is still similar as for CQs. However, thisrequires some care, especially when considering Opt [6, 29].

We first define a fragment of CPF patterns that can bereadily translated to CQs and can be evaluated similarly.We say that a filter constraint R is simple if vars(R) containsat most one variable or is of the form ?x =?y.20 (An almostidentical class of queries was also considered in [30].)

Definition 5.2. A conjunctive query with filters (CQF

query) is a CPF pattern that only uses simple filters.

In our corpus, 84.08% of the AOF patterns are in CQF.

20If we encounter a filter constraint of the form ?x =?y, wecollapse the nodes ?x and ?y in the graph and hypergraphof the query.

155

Page 8: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

We now additionally consider Opt. Perez et al. [29] showedthat unrestricted use of Opt in SPARQL patterns makesquery evaluation Pspace-complete, which is significantlymore complex than the NP-completeness of CQF queries.They discovered that patterns that satisfy an extra condi-tion called well-designedness [29], can be evaluated moreefficiently. Letelier et al. show that, in the presence of pro-jection, evaluation of well-designed patterns is ΣP

2 -complete[22].

Definition 5.3. A SPARQL pattern P using only theoperators And, Filter, and Opt is well-designed if for everyoccurrence i of an Opt-pattern (P1 Opt P2) in P , the vari-ables from vars(P2) \ vars(P1) occur in P only inside i.21

In our corpus, 98.53% of the AOF patterns are well-designed(but do not necessarily have simple filters). Unfortunately,it is not yet sufficient for well-designed patterns to have ahypergraph of constant hypertreewidth for their evaluationto be tractable [6]. However, Barcelo et al. show that thiscan be mended by an additional restriction called boundedinterface width. We explain this notion by example and referto [6] for details.

Example 5.4. The following patterns come from [29, 22]:P1 = (((?A, name, ?N) Opt (?A, email, ?E))

Opt (?A, webPage, ?W))and P2 = ((?A, name, ?N)

Opt ((?A, email, ?E) Opt (?A, webPage, ?W)))Figure 4 has tree representations T1 and T2 for P1 and P2,respectively, called pattern trees. The pattern trees Ti areobtained from the parse trees of Pi by applying a standardencoding based on Currying [23, Section 4.1.1]. The encod-ing only affects the arguments of the Opt operators in thequeries. If the query also uses And, then it should first bebrought in Opt-normal form [29] and then turned into a pat-tern tree. The resulting pattern trees will then have a CQin each of its nodes.

Barcelo et al. define pattern trees to be well-designed if,for each variable, the set of nodes in which it occurs forms aconnected set. Notice that this is the case for T1 and T2. Itwould be violated in T1 if the root would not use the variable?A. Likewise, it would be violated in T2 if the node labeled(?A, email, ?E) would not use the variable ?A.

The interface width of the pattern trees is the maximumnumber of common variables between a node and its child.Both trees in Figure 4 (and both queries P1 and P2) thereforehave interface width one. (Common variables are bold inFigure 4.) If T1 would use variable ?W instead of ?N , thenits interface width would be two.

Definition 5.5. A SPARQL pattern P using only theoperators And, Filter, and Opt is in CQOF if it has a well-designed pattern tree with interface width 1.

Perhaps surprisingly, out of all queries that are well-designedand have simple filters, we only found 310 queries that hadan interface width more than one. In fact, 93.87% of theAOF patterns are CQOF queries.

21Perez et al.’s definition also has a safety condition on thefilter statements of the patterns, but the omission of thiscondition does not affect the results in this paper.

(?A, name, ?N)

(?A, email, ?E) (?A, webPage, ?W)

(?A, name, ?N)

(?A, email, ?E)

(?A, webPage, ?W)

T1: T2:

Figure 4: Pattern trees that correspond to thequeries in Example 5.4

CQ

CQF

CQO

F

0 %

20 %

40 %

60 %

80 %

100 %

11+

10

9

8

7

6

5

4

3

2

Figure 5: Size of CQ-like queries with at least twotriples.

6. SHAPE ANALYSISIn this section we analyze the shapes of the canonical

graphs and the tree- and hypertree width of CQ, CQF andCQOF queries. We start with a note on the size of thesequeries. Figure 5 shows the respective sizes of these queriesthat have at least two triples. The fractions of queries withone triple are 82%, 83.45%, and 75.52% for CQ, CQF, andCQOF, respectively. Unsurprisingly, small queries are morelikely to be in one of these fragments and, therefore, simplequeries are represented even more in these data sets than inthe overall data set. Nevertheless, we have CQs and CQF

queries with up to 81 triples and CQOF queries with up to229 triples.

6.1 Graph StructureWe analyse the graph structure of queries. Recall that

we only consider graph shapes for queries that do not usevariables in the predicate position of triples, for reasons ex-plained in Section 5.1. We therefore only consider graphpatterns in this subsection. We consider the remaining 6.96million queries in CQOF in Section 6.2.

We first recall or define the basic shapes of the canonicalgraphs that we will study in this section. The shapes chainsand cycle are already defined in Section 5.2. A chain setis a graph in which every connected component is a chain.(So, each chain is also a chain set.)

A tree is an undirected graph such that, for every pairof nodes x and y, there exists exactly one undirected pathfrom x to y. A forest is a graph in which every connectedcomponent is a tree.

A star is a tree for which there exists exactly one nodewith more than two neighbors, that is, there is exactly one

156

Page 9: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

node u such that there exist u1, u2, and u3, all pairwisedifferent and different from u, for which {u, ui} ∈ E foreach i = 1, 2, 3.

Inspired by the results obtained with gMark on syntheticqueries, we proceeded with the analysis of the query logsby looking at the encountered query shapes. Here, we con-sider queries as edge-labeled graphs, as defined in Section 5.In the next subsection we also investigate the hypergraphstructure.

We investigate CQs, CQF queries, and CQOF queries. Thelast two fragments are interesting in that they bring underscrutiny more queries than the plain CQ set of query logs (byan increase of roughly 40% and 47%, respectively). We firstwanted to identify classical query shapes, such as all vari-ants of tree-like shapes (single edges, chains, sets of chains,stars, trees, and forests). The results are summarized in thethree tables in Table 4. From the analysis, we can drawthe following observations. While tree-shaped queries evenin their simple forms (chain of length 1 or single edges) arevery frequent, the only observed exception occurs with starqueries, which have very low occurrence with respect to theother tree-like shapes.

Since simple queries are overrepresented in query logs (al-ready over 80% of CQF patterns uses only one triple, for ex-ample), it is no surprise that the overwhelming majority ofthe queries is acyclic, i.e., a forest. However, we also wantedto get a better understanding of the more complex queriesin the logs, so we also investigated the cyclic queries. Ourgoal is to obtain a cumulative shape analysis where simplershapes are subsumed by more sophisticated query shapes,with the latter reaching almost 100% coverage of the querylogs.

A first observation was that plain cycles are not very com-mon. By visually inspecting the remaining cyclic queries, weobserved that many of them could be seen as a node withsimple attachments, which we call flower.

Definition 6.1. A petal is a graph consisting of a sourcenode s, target node t, and a set of at least two node-disjointpaths from s to t. (For instance, a cycle is a petal that usestwo paths.) A flower is a graph consisting of a node x withthree types of attachments: chains (the stamens), trees thatare not chains (the stems), and petals.

An example of a real flower query posed by users in one ofour DBpedia logs is illustrated in Figure 6. It consists ofa central node with four petals (one of which using threepaths), ten stamens and zero stems attached.

We also considered sets of flowers, which we called flowersets, to further increase the ratio of queries that could beclassified from the original logs. The number of flowers andflower sets in the query logs overcome those of trees andforests by roughly 0.05%, respectively for CQ, CQF andCQOF , and for all the three fragments flowerSets queriescould get significantly closer to 100% coverage than plainforests.

In the above analysis, we have analyzed the shapes ofqueries when the latter are represented as canonical graphsas defined in Section 5, i.e., the nodes can be either variablesor constants. Constants are in fact necessary to fully char-acterize query shapes, even though they do not play a majorrole in query optimization, as variables do. For that reason,we have rerun the above analysis on queries excluding con-stants in order to identify the differences in the obtained

Figure 6: An example of a flower query found in ourDBpedia query logs (we added arrows to indicatethe edge directions in the query; labels are omittedfor confidentiality reasons).

shape classification. The most significant observation herewas that 9.66 million single edge CQs (78.70% of the singleedge CQs) uses constants.

For the queries with cycles, we also investigate what isthe length of the shortest cycle in the query. We discovered,for 39,471 queries, the shortest cycle has length three. For6,561 and 5,733 queries, the shortest cycles had length 4 and5, respectively. For 26 queries, the length was larger. Wefound two queries for which the shortest cycle was 14, whichis the largest value we found.

6.2 Tree- and HypertreewidthIt is well-known that the tree- or hypertreewidth of queries

are important indicators to gauge the complexity of theirevaluation. We therefore investigated the tree- and hyper-treewidth of CQ, CQF and CQOF queries. We do not for-mally define tree- or hypertreewidth in this paper but in-stead refer to an excellent introduction [13]. In the ter-minology of Gottlob et al., we investigate the generalizedhypertree width of the canonical hypergraphs of queries.

Treewidth. All shapes we discussed in Section 6.1 have tree-width at most two. Forests (and all subclasses thereof) havetreewidth one, whereas cycles, flowers, and flower sets havetreewidth two. We investigated the remaining queries byhand and discovered that one query had treewidth three(see Figure 7) and all others had treewidth two, see Table 4.From the treewidth perspective, it is interesting to note thatmany queries of treewidth two are flowers (Definition 6.1)or sets of flowers, which are a very restricted fragment.

Hypertree Width. We recall that we only considered canon-ical graphs for queries that do not use variables in the pred-icate position of triple patterns. In CQOF, 6,959,510 queriesused this feature and therefore we must consider the hyper-graph structure to correctly measure the cyclicity of thesequeries. We determined their (generalized) hypertree widthwith the tool detkdecomp from the Hypertree Decomposi-tions home page [10].

Our results are as follows. All the remaining queries hadhypertree width one, except for 86 queries with hypertreewidth two and eight queries with hypertree width three.

We also looked at the number of nodes in the hypertreedecompositions that the tool gave us, since this number can

157

Page 10: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

Table 4: Cumulative shape analysis of graph patterns in CQ, CQF , CQFO across all logsCQ

Shape #Queries Relative %

single edge 12,273,871 77.98%chain 15,561,944 98.87%

chain set 15,570,042 98.93%star 147,457 0.94%tree 15,723,163 99.90%

forest 15,731,535 99.95%

cycle 4,550 0.03%flower 15,730,043 99.94%

flower set 15,738,439 100.00%treewidth ≤ 2 15,739,056 100.00%

treewidth = 3 1 0.00%total 15,739,057 100.00%

CQF

#Queries Relative %

21,198,951 81.04%25,403,669 97.12%25,418,689 97.17%

702,228 2.68%26,127,544 99.88%26,143,128 99.94%

4,705 0.02%26,135,676 99.92%26,151,291 99.97%26,157,879 100.00%

1 0.00%26,157,880 100.00%

CQOF

#Queries Relative %

21,479,706 72.30%26,887,865 90.50%26,937,578 90.67%2,654,497 8.94%

29,599,539 99.63%29,651,600 99.81%

4,734 0.02%29,614,330 99.68%29,666,423 99.86%29,708,967 100.00%

1 0.00%29,708,968 100.00%

?subject nationality?subject birthPlace ?subject genre

?object genre?object birthPlace ?object nationality

Figure 7: The DBpedia query exhibiting tree widthequal to 3.

be a guide for how well caching can be exploited for queryevaluation [19] (the higher the number, the better cachingcan be exploited). For the queries with hypertree width one,the number of nodes in the decompositions corresponds totheir number of edges, which can already be seen in Fig-ure 5. (Nevertheless, we found several hundred queries withmore than 100 nodes in their hypertree decompositions, allof them occurring in DBpedia15 and DBpedia16.) Finally,we observed that the queries with hypertree width two orthree all had decompositions with up to ten nodes.

7. PROPERTY PATHSWe found 247,404 property paths in our corpus. Although

property paths are therefore rare in relation to the entirecorpus, this is not so for every data set: 92 queries (29.87%)in WikiData17 have property paths.

A large fraction of these property paths are extremelysimple. For instance, 63,039 property paths are !a (“followan edge not labeled a”) and 306 are ˆa (“follow an a-edgein reverse direction”). In the following, we focus on theremaining 184,059 property paths, which express queries onthe graph that do more than simply follow an edge (suchqueries are sometimes called navigational queries).

Here, 66,262 (36%) use reverse navigation, i.e., the oper-ator “ˆ”, within more complex expressions. In Table 5, wepresent an overview of the property paths different from !aand ˆa. In our classification, we treat ˆa and !a the same asa literal. For instance, we classify a/b, (ˆa)/b, and (!a)/b allas a1/ · · · /ak with k = 2. When ! appears in front of a morecomplex expression (as in !(a|b)), we treat it separately. Weonly found 10 expressions that use ! and are different fromthe expression !a.

Table 5: Structure of navigational property paths inour corpus

Expression Type Absolute Relative k

(a1| · · · |ak)∗ 72,009 39.12% 2–4a∗ 48,636 26.42%

a1/ · · · /ak 21,435 11.65% 2–6a∗/b 19,126 10.39%

a1| · · · |ak 16,053 8.72% 2–6a+ 3,805 2.07%

a1?/ · · · /ak? 2,855 1.55% 1–5a(b1| · · · |bk) 37 0.02% 2

a1/a2?/ · · · /ak? 31 0.02% 1–3(a/b∗)|c 15 0.01%a∗/b? 13 0.01%a/b/c∗ 11 0.01%!(a|b) 10 0.01%

(a1| · · · |ak)+ 10 0.01% 2(a1| · · · |ak)(a1| · · · |ak) 5 < 0.01% 2–6

a?|b 2 < 0.01%a∗|b 2 < 0.01%

(a|b)? 2 < 0.01%a|b+ 1 < 0.01%a+|b+ 1 < 0.01%(a/b)∗ 1 < 0.01%

Furthermore, each row represents the expression type listedon the left plus its symmetric form. For instance, when wewrite a∗/b, we count the expressions of the form a∗/b andb/a∗. The variant listed in the table is the one that occurredmost often in the data. That is, a∗/b occurred more oftenthan b/a∗.

Bagan et al. [5] proved a dichotomy on the data com-plexity of evaluating property paths under a simple pathsemantics, i.e., expressions can only be matched on pathsin the RDF graph in which nodes appear only once. Theyshowed that, although evaluating property paths under thissemantics is NP-complete in general, it is possible in Ptimeif the expressions belong to a class called Ctract. Remark-ably, we only found one expression in our corpus which isnot in Ctract, namely (a/b)∗. The complexity of enumerat-ing answers to property paths of the form as in Table 5 is

158

Page 11: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

studied in [24]. More precisely, the paper investigates enu-meration problems for simple transitive expressions, whichcapture most of the expressions in Table 5.

8. EVOLUTION OF QUERIES OVER TIMEIn a typical usage scenario of a SPARQL endpoint, a user

queries the data and gradually refines her query until the de-sired result is obtained. In this section, we analyse to whichextent such behavior occurs. The results are very prelimi-nary but show that, in certain contexts, it can be interestingto investigate optimization techniques for sequences of sim-ilar queries.

We consider a query log to be an ordered list of queriesq1, . . . , qn. We introduce the notion of a streak, which in-tuitively captures a sequence of similar queries within closedistance of each other. To this end we assume the exis-tence of a similarity test between two queries. We then saythat queries qi and qj with i < j match if (1) qi and qj aresimilar and (2) no query qi′ with i < i′ < j is similar toqi. A streak (with window size w) is a sequence of queriesqi1 , . . . , qik such that, for each ` = 1, . . . , k−1, we have thati`+1 − i` ≤ w and qi`+1 matches qi` .

In theory, it is possible for a query to belong to multiplestreaks. E.g., it is possible that q1 and q2 do not match,but query q3 is sufficiently similar to both. In this case, q3belongs to both streaks q1, q3 and q2, q3.

In the present study, we used Levenshtein distance as asimilarity test. More precisely, we said that two queriesare similar if their Levenshtein distance, after removal ofnamespace prefixes, is at most 25%.22 We removed names-pace prefixes prior to measuring their Levenshtein distance,because they introduce superficial similarity. As such, werequire queries to be at least 75% identical starting fromthe first occurrence of the keywords Select, Ask, Construct,or Describe. We took a window size of 30.

Streak Length. Since the discovery of streaks was extremelyresource-consuming, we only analysed streaks in three ran-domly selected log files from DBpedia14, DBpedia15, andDBpedia16. The sizes of these log files, each reflecting a sin-gle day of queries to the endpoint, were 273MiB, 803MiB,and 1004MiB respectively. For the ordering of the queries,we simply considered the ordering in the log files, since thelogs are sorted over time.

The results on streak length are in Table 6. Using windowsize 30, the longest streak we found had length 169 and wasin the 2016 log file. When we increased the window size,we noticed that it was still possible to obtain longer streaks.We believe that a more refined analysis on the encounteredstreaks can be carried out when tuning the window sizeand deriving more complex metrics on the similarity of thequeries within each streak. These issues are, however, sub-ject of further research, which we plan to pursue in futurework.

Evolution of Size and Structure. In addition to the lengthof streaks, we also investigated how the number of triplesand structure of queries in streaks change over time. To thisend, we needed to parse the queries in streaks. The three logfiles contain a combined amount of 510,361 streaks. Out of

22We normalized the measure by dividing the Levenshteindistance by the length of the longer string.

Table 6: Length of streaks in three single-day logs

Streak length #DBP’14 #DBP’15 #DBP’16

1–10 42,272 167,292 199,37511–20 3,732 24,001 37,40221–30 2,425 4,813 17,74931–40 884 667 5,84941–50 283 162 1,99851–60 88 40 71161–70 27 8 32271–80 15 4 12981–90 5 1 47

91–100 5 0 27>100 4 0 24

Table 7: Largest query occurring in streaks

Max Max#Triples #Streaks #Triples #Streaks

1 130,706 13–20 9,5092 41,811 21–30 5443 34,081 31–40 2334 9,990 41–50 865 3,325 51–60 446 1,733 61–70 327 8,465 71–80 178 10,604 81–90 119 7,837 91–100 3

10 1,080 101–110 911 51,521 > 110 712 43,819

these streaks, 321,042 have at least two queries and 234,627additionally have at least one query that parses. Remark-ably, in the latter set, only 1,402 streaks have an erroneousquery. Here, 1,202 have an erroneous query followed by acorrect one, and 789 have a correct query followed by anerroneous one.

We then investigated the number of triples of queries instreaks. We have 355,466 streaks for which have at least oneparseable query that contains at least one triple.23 Table 7contains, for each of the 355,466 streaks, what is the maxi-mal number of triples in any of its queries. We noticed thatthis number is quite stable: we only have 3,915 streaks inwhich this number changes during the streak.

Table 8 contains results on the shapes of queries in streaks.We considered chain queries, trees that branch (and there-fore are no chains), and cyclic queries, that is, queries thatcontain a cycle. Table 8 contains, for each subset S of thesethree shapes, the number of streaks that contain only shapesfrom S and the number of streaks that consist only of shapesfrom S.

Interestingly, we found a of correlation between streaklength and query shape and size. For instance, out of the526 streaks that contain a cyclic query, 472 (89.73%) onlyconsist of a single query. This strongly contrasts the entirelog, where only 189,319 streaks (37.10%) consist of a singlequery. Similarly, we have 1,378 streaks that contain a queryof at least 16 triples, but 1,332 of these streaks (96.66%)

23For 88,201 streaks, all queries had an empty body. Another31 streaks had a non-empty body, containing no triples.

159

Page 12: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

Table 8: Structures of queries appearing in the samestreak (chn = chain, bt = ’branching tree’, i.e., treethat is not a chain, cyc = cyclic)

Shapes #Streaks

containing chn 148,632consisting only of chn 147,106

containing bt 39,839consisting only of bt 39,810

containing cyc 526consisting only of cyc 493

containing bt and cyc 12consisting only of bt or cyc 40,315

containing bt and chn 2consisting only of bt or chn 186,918

containing chn and cyc 21consisting only of chn or cyc 147,620

consisting only of chn, bt, or cyc 187,444

only have a single query. This suggests that highly complexqueries are less likely to occur in longer streaks. We stressagain that the data sets used for this study only consistedof DPpedia query logs for three days, which is a very smallsample. We leave the evaluation on the total corpus forfuture work.

9. CONCLUSIONS AND DISCUSSIONWe have conducted an extensive analytical study on a

large corpus of real SPARQL query logs. Our corpus isinherently heterogeneous and consists of a majority of DB-pedia query logs along with query logs on biological datasets(namely BioPortal and BioMed datasets) and geological data-sets (LGD), query logs on bibliographic data (SWDF), andquery logs from a museum’s SPARQL endpoint (British Mu-seum). We have completed this corpus with the examplequeries from Wikidata (Feb. 2017), which are cherry pickedfrom real SPARQL queries on this data source.

Considerations About the Datasets. The majority of thedatasets exhibit similar characteristics, such as for instancethe simplicity of queries amounting to 1 or 2 triples. Theonly exception occurs with British Museum and Wikidatadatasets (Figure 1), where the former is a set of queries gen-erated from fixed templates and the latter is a query wikirather than a query log. Clearly, the DBpedia datasets arethe most voluminous and recent in our corpus, thus makingtheir results quite significant. For instance, despite the factthat single triple queries are numerous in these datasets,more complex queries (with 11 triples or more) have lotsof occurrences (up to 21% of the total number of queriesfor DBpedia13). Moreover, we observed that most of theanalyzed queries across all datasets are Select/Ask queries,which range between 91% and 99.88% for all datasets ex-cept DBpedia16 and LGD13, that have lower percentages.Therefore, we focused on such queries in the remainder ofthe paper since these queries turn out to be the queries thatusers most often formulate in SPARQL query endpoints.We have further examined the occurrences of operator dis-tributions and the number of projections and subqueries.This analysis lets us address a specific fragment, namely the

And/Opt/Filter patterns (AOF patterns). For such patterns,we derived the graph- and hypergraph structures and ana-lyzed the impact of the structure on query evaluation.

Benefits of Shape Analysis. We synthetically reproducedthe observed real chain and cycle query logs with a syntheticgenerator by building diverse workloads of Ask queries andmeasured their average runtime in two systems, Blazegraph,used by the Wikimedia foundation, and PostgreSQL. Inboth systems, the difference between average performancesof such different query shapes are perceivable. We decidedto dig deeper in the shape analysis in order to classify thesequeries under general query shapes as canonical graphs andcharacterize their tree-likeness as hypergraphs. We believethat this shape analysis can serve the need of fostering thediscussion on the design of new query languages for graphdata, as pursued by the LDBC Graph Query Language TaskForce [32, 11, 33]. It can also inspire the conception of novelquery optimization techniques suited for these query shapes,along with tuning and benchmarking methods. For instance,we are not aware of existing benchmarks targeting flowersand flower sets. The analysis on property paths showedthat these are not yet widely used in the entire corpus, eventhough they are numerous in the Wikidata corpus. A recentdiscussion (July 6th, 2017) in a Neo4J working group [37]concerned the support of full-fledged regular path queriesin OpenCypher. This discussion, and other discussions onstandard graph query languages [32, 11, 33] could benefitfrom our analysis, devoted to find which property paths areactually used most often when ordinary users have the powerof regular expressions.

Benefits of Streak Analysis. Finally, we performed a studyon the way users specify their queries in SPARQL query logs,by identifying streaks of similar queries. This analysis is forinstance crucial to understand query specification from realusers and thus usability of databases, which is a hot researchtopic in our community [18, 27].

Extensibility. Our analysis has been carried out with scriptsin different languages, amounting to a total of roughly 9, 000source lines of code (SLOC). We plan to make these scriptsopen-source and extensible to the new query logs that willbe produced by users on SPARQL endpoints in the nearfuture.

AcknowledgmentsWe would like to acknowledge USEWOD and Patrick vanKleef together with the entire team of OpenLink Softwarefor hosting the official DBPedia endpoint and granting usthe access to the large DBPedia query logs analysed in thispaper.

10. REFERENCES[1] C. R. Aberger, S. Tu, K. Olukotun, and C. Re.

EmptyHeaded: A relational engine for graphprocessing. In International Conference onManagement of Data (SIGMOD), pages 431–446,2016.

[2] C. R. Aberger, S. Tu, K. Olukotun, and C. Re. Oldtechniques for new join algorithms: A case study in

160

Page 13: An Analytical Study of Large SPARQL Query Logs · ious domains: DBpedia, Semantic Web Dog Food (SWDF), LinkedGeoData (LGD), BioPortal (BioP), OpenBioMed (Bio-Med), British Museum

RDF processing. In International Conference on DataEngineering (ICDE) Workshops, pages 97–102, 2016.

[3] M. Arias, J. D. Fernandez, M. A. Martınez-Prieto,and P. de la Fuente. An empirical study of real-worldSPARQL queries. CoRR, abs/1103.5043, 2011.

[4] G. Bagan, A. Bonifati, R. Ciucanu, G. H. L. Fletcher,A. Lemay, and N. Advokaat. gmark: Schema-drivengeneration of graphs and queries. IEEE Trans. Knowl.Data Eng., 29(4):856–869, 2017.

[5] G. Bagan, A. Bonifati, and B. Groz. A trichotomy forregular simple path queries on graphs. In Principles ofDatabase Systems (PODS), pages 261–272, 2013.

[6] P. Barcelo, R. Pichler, and S. Skritek. Efficientevaluation and approximation of well-designed patterntrees. In Principles of Database Systems (PODS),pages 131–144, 2015.

[7] A. Bonifati, W. Martens, and T. Timm. An analyticalstudy of large SPARQL query logs. CoRR,abs/1708.00363, 2017.

[8] A. Chandra and P. Merlin. Optimal implementation ofconjunctive queries in relational data bases. InSymposium on the Theory of Computing (STOC),pages 77–90, 1977.

[9] C. Chekuri and A. Rajaraman. Conjunctive querycontainment revisited. In International Conference onDatabase Theory (ICDT), pages 56–70, 1997.

[10] detkdecomp.wwwinfo.deis.unical.it/~frank/Hypertrees.Visited on August 10th, 2016.

[11] G. H. L. Fletcher, H. Voigt, and N. Yakovets.Declarative graph querying in practice and theory. InInternational Conference on Extending DatabaseTechnology (EDBT), pages 598–601, 2017.

[12] G. Gottlob, G. Greco, N. Leone, and F. Scarcello.Hypertree decompositions: Questions and answers. InPrinciples of Database Systems (PODS), pages 57–74,2016.

[13] G. Gottlob, G. Greco, and F. Scarcello. Treewidth andhypertree width. In Tractability: Practical Approachesto Hard Problems, pages 3–38. Cambridge UniversityPress, 2014.

[14] G. Gottlob, N. Leone, and F. Scarcello. Hypertreedecompositions and tractable queries. J. Comput.Syst. Sci., 64(3):579–627, 2002.

[15] X. Han, Z. Feng, X. Zhang, X. Wang, G. Rao, andS. Jiang. On the statistical analysis of practicalSPARQL queries. In WebDB, page 2, 2016.

[16] S. Harris and A. Seaborne. SPARQL 1.1 querylanguage. Technical report, World Wide WebConsortium (W3C), March 2013. https://www.w3.org/TR/2013/REC-sparql11-query-20130321.

[17] J. Huelss and H. Paulheim. What SPARQL query logstell and do not tell about semantic relatedness in LOD— or: The unsuccessful attempt to improve thebrowsing experience of DBpedia by exploiting querylogs. In ESWC Satellite Events, pages 297–308, 2015.

[18] H. V. Jagadish, A. Chapman, A. Elkiss,M. Jayapandian, Y. Li, A. Nandi, and C. Yu. Makingdatabase systems usable. In International Conferenceon Management of Data (SIGMOD), pages 13–24,2007.

[19] O. Kalinsky, Y. Etsion, and B. Kimelfeld. Flexiblecaching in trie joins. In International Conference onExtending Database Technology (EDBT), pages282–293, 2017.

[20] M. Kaminski and E. V. Kostylev. Beyondwell-designed SPARQL. In International Conferenceon Database Theory (ICDT), pages 5:1–5:18, 2016.

[21] M. Kroll, R. Pichler, and S. Skritek. On thecomplexity of enumerating the answers towell-designed pattern trees. In InternationalConference on Database Theory (ICDT), pages22:1–22:18, 2016.

[22] A. Letelier, J. Perez, R. Pichler, and S. Skritek. Staticanalysis and optimization of semantic web queries.ACM Trans. Database Syst., 38(4):25:1–25:45, 2013.

[23] W. Martens and J. Niehren. On the minimization ofXML schemas and tree automata for unranked trees.J. Comput. Syst. Sci., 73(4):550–583, 2007.

[24] W. Martens and T. Trautner. Enumeration problemsfor regular path queries. CoRR, abs/1710.02317, 2017.

[25] K. Moller, M. Hausenblas, R. Cyganiak,S. Handschuh, and G. Grimnes. Learning from linkedopen data usage: Patterns & metrics. In Web ScienceConference (WSC), 2010.

[26] M. Morsey, J. Lehmann, S. Auer, and A. N. Ngomo.DBpedia SPARQL benchmark — performanceassessment with real queries on real data. InInternational Semantic Web Conference (ISWC),pages 454–469, 2011.

[27] A. Nandi and H. V. Jagadish. Guided interaction:Rethinking the query-result paradigm. PVLDB,4(12):1466–1469, 2011.

[28] T. Neumann and G. Weikum. The RDF-3X engine forscalable management of RDF data. VLDB J.,19(1):91–113, 2010.

[29] J. Perez, M. Arenas, and C. Gutierrez. Semantics andcomplexity of SPARQL. ACM Trans. Database Syst.,34(3):16:1–16:45, 2009.

[30] F. Picalausa and S. Vansummeren. What are realSPARQL queries like? In International Workshop onSemantic Web Information Management (SWIM),pages 1–7, 2011.

[31] M. Saleem, I. Ali, A. Hogan, Q. Mehmood, and A.-C.Ngonga Ngomo. LSQ: The linked SPARQL queriesdataset. In International Semantic Web Conference(ISWC), pages 261–269, 2015.

[32] http://ldbcouncil.org.

[33] https://databasetheory.org/node/47.

[34] http://wikidata.org.

[35] http://wiki.dbpedia.org/datasets.

[36] http://www.blazegraph.com.

[37] http://www.opencypher.org/ocig2.

[38] http://www.postgresql.org.

[39] M. Vidal, E. Ruckhaus, T. Lampo, A. Martınez,J. Sierra, and A. Polleres. Efficiently joining grouppatterns in SPARQL queries. In Extended SemanticWeb Conference (ESWC), pages 228–242, 2010.

[40] D. Vrandecic and M. Krotzsch. Wikidata: a freecollaborative knowledgebase. Commun. ACM,57(10):78–85, 2014.

161