

Undefined 0 (2018) 1–0
IOS Press

A decade of Semantic Web research through the lenses of a mixed methods approach

Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country

Sabrina Kirrane a, Marta Sabou b, Javier D. Fernández a, Francesco Osborne c, Cécile Robin d, Paul Buitelaar d, Enrico Motta c, and Axel Polleres a

a Vienna University of Economics and Business, Austria
E-mail: [email protected]
b Vienna University of Technology, Austria
E-mail: [email protected]
c Knowledge Media Institute (KMi), The Open University, UK
E-mail: [email protected]
d Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
E-mail: [email protected]

Abstract. The identification of research topics and trends is an important scientometric activity, as it can help guide the direction of future research. In the Semantic Web area, initially topic and trend detection was primarily performed through qualitative, top-down style approaches that rely on expert knowledge. More recently, data-driven, bottom-up approaches have been proposed that offer a quantitative analysis of the evolution of a research domain. In this paper, we aim to provide a broader and more complete picture of Semantic Web topics and trends by adopting a mixed methods methodology, which allows for the combined use of both qualitative and quantitative approaches. Concretely, we build on a qualitative analysis of the main seminal papers, which adopt a top-down approach, and on quantitative results derived with three bottom-up data-driven approaches (Rexplore, Saffron, PoolParty), on a corpus of Semantic Web papers published between 2006 and 2015. In this process, we both use the latter for “fact-checking” on the former and also to derive key findings in relation to the strengths and weaknesses of top-down and bottom-up approaches to research topic identification. Although we provide a detailed study on the past decade of Semantic Web research, the findings and the methodology are relevant not only for our community but beyond the area of the Semantic Web to other research fields as well.

Keywords: Research Topics, Research Trends, Linked Data, Semantic Web, Scientometrics

1. Introduction

Scientometrics is an all-encompassing term for an emerging field of research that analyses and measures science, technology research and innovation [18]. Although the term is broad, in this paper we focus on one particular subfield of scientometrics that uses topic analysis to identify trends in a scientific domain over time [15]. Understanding topics and subsequently predicting trends in research domains are important tasks for researchers and represent vital functions in the life of a research community. Overviews of present and past topics and trends provide important lessons on how research interests evolve and allow research communities to better plan future work, whereas visions of future topics can inspire and channel the work of a research community.

Considering the critical role played by topic and trend analysis when it comes to identifying under-represented and emerging research topics, it is not surprising that there have been a number of works from Semantic Web researchers that take an introspective view of the community. Several papers endeavor to predict Semantic Web research topics and trends [1,2], or as the research advanced over the years, to analyse topics and trends within the community [13,16]. In parallel, several researchers [5,19,23,26,27,30] are actively working on tools and techniques that can be used to automatically uncover research topics and trends from scientific publications.

0000-0000/18/$00.00 © 2018 – IOS Press and the authors. All rights reserved

Most of the trend prediction/analysis papers in the Semantic Web area [1,2,13] adopt a top-down approach that primarily relies on the knowledge, intuition and insights of experts in the field. While undoubtedly these are very valuable assets, trend-papers that purely follow this approach risk focusing on major topics and trends alone while overlooking under-represented or emerging topics and trends. These shortcomings could potentially be addressed by (semi-) automatic, data-driven approaches, which identify research topics and trends in a bottom-up fashion from large corpora.

The primary goal of this paper is to provide a more complete picture of Semantic Web topics and trends in the last decade by relying on both top-down and bottom-up approaches. Our hypothesis is that there is a high correlation between expert-driven and data-driven topic and trend analyses, but that by combining both approaches it is possible to gain additional, valuable insights with respect to the Semantic Web research domain. Starting from this hypothesis, we devise two primary research questions:

(1) Is it possible to identify the predominant Semantic Web research topics using both expert-based predictions and topic and trend identification tools?

(2) What are the strengths and weaknesses of expert-driven and data-driven topic and trend identification methods?

In order to answer the aforementioned research questions we adopt a mixed methods research methodology [22], which involves the combination of quantitative and qualitative research methods, in order to gain better insights into Semantic Web topics and trends. Concretely, our study comprises three core tasks.

– Firstly, in a qualitative study we converge the findings of three top-down style seminal papers [1,2,13] at different points in time, into a unified Research Landscape.

– Secondly, we employ three alternative data-driven quantitative approaches in order to uncover topics and trends from a corpus of Semantic Web publications in a bottom-up fashion.

– Thirdly, we compare and contrast the topics derived from both the expert analysis and the data-driven approaches, in order to provide a more holistic picture of Semantic Web research.

In order to enable the Semantic Web community to further build upon the results of our study, additional information about the resources described in this paper is available via https://doi.org/10.5281/zenodo.1492693.

The remainder of the paper is structured as follows: Section 2 provides an overview of existing work on automatic topic and trend analysis in the Semantic Web community. Section 3 describes the mixed methods methodology that guided our analysis. Section 4 provides a snapshot of the Semantic Web research community based on the observations of several domain-specific experts [1,2,13]. This is followed by the presentation of the topic analysis of papers published in the main Semantic Web publishing venues over a 10 year period from 2006 to 2015 in Section 5. A discussion on the findings of our analysis is presented in Section 6. Finally, Section 7 concludes the paper and presents directions for future work.

2. Related Work

Detecting topics that accurately represent a collection of documents is an important task that has attracted considerable attention in recent years leading to a variety of relevant approaches from different media sources, such as news articles [10], social networks [7], blogs [25], emails [24], to name but a few. In this section, we discuss approaches for the more focused task of detecting research topics from scientific literature, and in particular topics that express methods employed by the Semantic Web community.

A classical way to model the topics of a document is to extract a list of significant terms [6] (e.g., using tf-idf) and to cluster them [32]. Another common solution is the adoption of probabilistic topic models, such as Latent Dirichlet Allocation (LDA) [3] or Probabilistic Latent Semantic Analysis (pLSA) [17]. However, these generic approaches suffer from a number of limitations that often hinder their application for the task of detecting scientific topics. Firstly, they produce unlabeled bags of words that are often difficult to associate with distinct research areas. Secondly, the number of topics to be extracted needs to be known a priori. Finally, using such methods it is not possible to distinguish research areas from other kinds of topics contained in a document.
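To make the term-extraction step concrete, the following is a minimal sketch of tf-idf scoring in plain Python. It is an illustration only and not the implementation used by any of the tools discussed in this paper; the function name `tfidf_top_terms`, the whitespace tokenisation and the three toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_top_terms(docs, k=3):
    """Return the k highest-scoring tf-idf terms for each document."""
    # Naive whitespace tokenisation (an assumption; real systems use NLP pipelines).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Rank by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        results.append([term for term, _ in ranked[:k]])
    return results


# Hypothetical toy documents standing in for paper abstracts.
docs = [
    "ontology reasoning with description logics and ontology alignment",
    "linked data publishing and linked data quality",
    "sparql query optimisation over linked data",
]
top_terms = tfidf_top_terms(docs, k=2)
```

The output is an unlabeled list of terms per document, which illustrates the first limitation noted above: nothing in the scores distinguishes a research area from any other distinctive word in the text.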

Therefore, several approaches were proposed to specifically address the problem of detecting research topics. For instance, Morinaga et al. [24] present a method that exploits a Finite Mixture Model to detect research topics and to track the emergence of new topics. Derek et al. [11] developed an approach that matches scientific articles with a manually curated taxonomy of topics that is used to analyse topics across different timescales. Chavalarias et al. [8] propose a tool known as CorText that can be used to extract a list of n-grams from scientific literature and to perform clustering analysis in order to discover patterns in the evolution of scientific knowledge. Other approaches exploit the citation graph. For example, Chen et al. [9] designed a tool called CiteSpace which combines co-citation analysis and burst detection [21] to identify new emerging trends, while Jo et al. [20] detect topics by combining distributions of terms with the citation graph related to publications containing these terms.

Public tools for the exploration of research data usually identify research areas by using keywords as proxies (e.g., DBLP++ [12], Scival¹), adopting probabilistic topic models (e.g., aMiner [33]) or exploiting handcrafted classifications (ACM², Microsoft Academic Search³). However, these solutions suffer from some limitations. For example, keywords are unstructured and usually noisy, since they include terms that are not research topics. In addition, the quality of keywords assigned to a paper varies a lot according to the authors and the venues. Probabilistic topic models produce bags of words that are often not easy to map to commonly known research areas within the community. Finally, handcrafted classifications are expensive to build, requiring multiple expertise, and tend to age very quickly, especially in a rapidly evolving field such as Computer Science.

The Semantic Web Community has also produced a number of tools and techniques that use semantic technologies for detecting and analyzing research topics. For instance, Bordea and Buitelaar [5] demonstrate how expertise topics extraction (with ranking and filtering) along with researcher relevance scoring can be used to build expert profiles for the task of expert finding. In a related work, Monaghan et al. [23] present their expertise finding platform Saffron based on the same principles, and demonstrate how it can be used to link expertise topics, researchers and publications, based on their analysis of the Semantic Web Dog Food (SWDF) corpus. The data is further enhanced with URIs and expertise topic descriptions from DBpedia and related information from the Linked Open Data (LOD) cloud. An alternative approach is adopted by the Rexplore system [27], an environment for exploring and making sense of scholarly data that integrates statistical analysis, semantic technologies, and visual analytics. Rexplore builds on Klink-2 [26], an algorithm which combines semantic technologies, machine learning and knowledge from external sources (e.g., the LOD cloud, web pages, calls for papers) to automatically generate large-scale ontologies of research areas. The resulting ontology is used to semantically enhance a variety of data mining and information extraction techniques, and to improve search and visual analytics. Hu et al. [19] demonstrate how Semantic Web technologies can be used in order to support scientometrics over articles and data submitted to the Semantic Web Journal as part of their open review process. Towards this end the authors provide external access to their semantified dataset, which is also linked to external datasets such as DBpedia and the Semantic Web Dog Food corpus. On top of this data they provide several interactive visualizations that can be used to explore the data, ranging from general statistics to depicting collaborative networks. Finally, Parinov and Kogalovsky [30] describe the Socionet research information system, which focuses on linking research objects in general and research outputs in particular. The authors argue that information inferred from the semantic linkage of research objects and actors can be used to derive new scientometric metrics.

¹ http://www.elsevier.com/solutions/scival
² https://www.acm.org/publications/class-2012
³ http://academic.research.microsoft.com/

An interesting case of data-driven analysis is that reported in Glimm and Stuckenschmidt [16], looking back at the last 15 years of Semantic Web research through the lens of papers published at ISWC conferences from 2002 to 2014. The authors adopt an empirical approach to better understand the topics and trends within the Semantic Web community, in which they identify 12 key topics that describe Semantic Web research and then manually classify papers published in ISWC conference proceedings according to these topics. This work can thus be categorized as a data-driven analysis of research topics and trends, albeit one that was performed completely manually.


Fig. 1. Conceptual overview of topics detection approaches: main steps and data sources

Although data-driven approaches have been evaluated on their own, to date there is a lack of works that compare and contrast existing approaches, or indeed evaluate them with respect to expert-driven approaches. This paper fills this gap by adopting a holistic approach to topic and trend analysis, by analyzing the results of three expert-based and data-driven topic-detection approaches in the context of Semantic Web research.

3. Background and Methodology

In order to gain a better understanding of the topics and trends in the Semantic Web community over a ten year period from 2006-2015, we adopt a mixed methods approach to topic extraction and analysis, which combines both expert-based and data-driven approaches. According to Leech and Onwuegbuzie [22], the mixed methods research methodology involves the combination of quantitative and qualitative research methods in order to gain knowledge about some phenomenon under investigation. The mixed methods approach that guided the work carried out in this study is illustrated in Figure 2.

3.1. Seminal paper qualitative topic analysis

The goal of the qualitative analysis of the seminal papers was primarily to identify research topics mentioned in [1,2,13]. The work was conducted in a two step process. In Step 1 each paper was read by three of the authors of this paper who were each tasked with identifying technical research topics mentioned in the three seminal papers (e.g., ontology, OWL). To keep the analysis as objective as possible, the authors extracted the exact wording used in the papers instead of using synonyms more familiar to them. Following on from this, the authors grouped extracted keywords into broader topic areas (e.g., ontologies and modeling, logic and reasoning). In order to reduce any bias, in Step 2 the results of the aforementioned analysis were discussed and aligned during a consensus workshop. Where disagreement occurred with respect to the grouping of keywords the seminal papers were consulted in order to better understand the context of the topic, such that it was possible to reach consensus as to its categorization. The final outcome of the qualitative analysis is the unified Research Landscape, shown in Table 2 and discussed in detail in Section 4.

3.2. Semantic Web publications quantitative topic analysis

Rather than using a single topic and trend identification tool, in Step 3 we elected to perform the analysis of a corpus of Semantic Web publications with three different tools (i.e., PoolParty⁴, Rexplore⁵, and Saffron⁶), such that we could compare and contrast the results obtained via the different tools.

Semantic Web Venues (SWVs) corpus: The corpus, which was analyzed by each of the tools, comprises papers from five enduring international publishing venues dedicated to Semantic Web research, namely: the International Semantic Web Conference (ISWC), the Extended Semantic Web Conference (ESWC), the SEMANTiCS conference, the Semantic Web Journal (SWJ) and the Journal of Web Semantics (JWS), over a 10 year period from 2006 to 2015 inclusive. These publishing venues were chosen as they are dedicated to Semantic Web research and have been running continuously for several years. Although the SEMANTiCS conference was traditionally seen as a more business oriented event, it also has a strong academic component, with high overlap between the organizing and program committee members and the various committees and boards of the other publishing venues. For ease of readability this corpus is simply referred to as the SWVs corpus in the rest of the paper.

⁴ PoolParty, https://www.PoolParty.biz/system-architecture/
⁵ Rexplore, https://technologies.kmi.open.ac.uk/Rexplore/
⁶ Saffron, http://saffron.insight-centre.org/

A conceptual topic extraction and analysis workflow: Generally speaking, the typical topic extraction and analysis workflow, as depicted in Figure 1, is composed of the following sequential steps:

Taxonomy creation involves the creation of a topic taxonomy that guides the analysis process. In practice, this step can be achieved manually by domain experts, or automatically with the taxonomy being learned either from the document corpus of interest or from a larger external document corpus.

Corpus Annotation concerns the annotation of the document corpus in terms of the taxonomy topics. Different annotation approaches range from manually assigning each paper in a corpus to the most representative topics, annotating the document abstracts with the relevant topics, or annotating the entire text of the paper based on a topic list or hierarchy.

Analytics refers to various analytical activities that can be conducted over the annotated document corpus. For instance, document classification, trend detection, expert profiling and recommendations.
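The three steps above can be sketched end to end in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not a description of how PoolParty, Rexplore or Saffron work: the taxonomy is a hand-written dictionary with hypothetical labels, annotation is plain substring matching, and the only analytic computed is a paper count per topic and year.

```python
from collections import defaultdict

# Step 1, taxonomy creation: a tiny hand-written taxonomy (hypothetical labels)
# that maps each topic to the surface forms used to recognise it.
TAXONOMY = {
    "Linked Data": ["linked data", "linked open data"],
    "Query Languages": ["sparql", "query language"],
    "Ontologies": ["ontology", "ontologies"],
}


def annotate(text, taxonomy=TAXONOMY):
    """Step 2, corpus annotation: return the sorted topics mentioned in text."""
    lowered = text.lower()
    return sorted(
        topic
        for topic, forms in taxonomy.items()
        if any(form in lowered for form in forms)
    )


def papers_per_topic_and_year(corpus, taxonomy=TAXONOMY):
    """Step 3, analytics: count annotated papers per (topic, year) pair."""
    counts = defaultdict(int)
    for paper in corpus:
        for topic in annotate(paper["text"], taxonomy):
            counts[(topic, paper["year"])] += 1
    return dict(counts)


# Hypothetical toy corpus; real input would be abstracts or full texts.
corpus = [
    {"year": 2006, "text": "An ontology for describing sensors."},
    {"year": 2015, "text": "Publishing Linked Data and querying it with SPARQL."},
    {"year": 2015, "text": "Linked Open Data quality assessment."},
]
trend = papers_per_topic_and_year(corpus)
```

Comparing the counts for a topic across years gives a crude trend signal; the choice of annotation scope (abstract versus full text) changes which papers are counted, which is one reason the tools compared in this paper can disagree.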

Data-driven topic extraction and analysis tools: Although all three tools conducted their analysis over the same corpus, each of them employs a different approach to topic extraction. An overview of the approaches adopted by PoolParty, Rexplore, and Saffron with respect to the main steps depicted in Figure 1 is summarized in Table 1 and described below:

PoolParty is a semantic technology suite that supports the creation and maintenance of thesauri by domain experts [31]. Although PoolParty is a commercial product, a free version, which was made available in the context of the PROPEL project⁷ [14], was used to perform the analysis described in this paper. In the case of the analysis described in this paper the taxonomy was created from conference and journal metadata (i.e., call for papers, sessions, tracks, special issues etc.), which have been manually curated by experts from the Semantic Web community (i.e., conference organizing committee and editorial board members). In order to reduce the potential for bias during the taxonomy construction, the classification, which was performed in the context of the PROPEL project, was collectively performed by five Semantic Web experts. The topic frequency analysis was subsequently conducted by PoolParty over the full text of the research articles from the SWVs corpus, without any parameterization.

⁷ PROPEL, https://www.linked-data.at/

Rexplore is an interactive environment for exploring scholarly data that leverages data mining, semantic technologies and visual analytics techniques [27]. In the context of this paper, we used Rexplore for tagging research papers with relevant research topics from the Computer Science Ontology (CSO), an existing ontology of research areas that was automatically generated from a large computer science corpus. The approach for tagging the publications, which took into consideration their title, keywords, abstract and citations, is a slight variation of the method adopted by Springer Nature for semi-automatically characterizing their Computer Science proceedings [28]. The analysis involved the generation of statistical information based both on the number of papers and the citations associated with a topic. Considering that the Rexplore analysis is conducted over title, keywords, abstract and citations it was also possible to analyse a broader dataset comprising Scopus Computer Science papers published from 2006-2015. No special parameterization was used by Rexplore in the context of this study.

Saffron is a topic and taxonomy extraction tool whose main applications include expert finding, document classification and search [23]. In the context of this paper, we used Saffron’s Natural Language Processing (NLP) techniques to extract domain-specific terms based solely on the full text of articles in the SWVs corpus, and a novel taxonomy generation algorithm that uses a global generality measure to direct the edges from generic concepts to more specific ones, in order to construct a topical hierarchy. Additional details on the algorithms used for term (topic) extraction and for extraction of a topic taxonomy can be found in [4]. The topic frequency and relatedness analysis was conducted automatically by Saffron over the SWVs corpus without the need of any additional corpora. Based on previous studies conducted by the Saffron team, in terms of parameterization the taxonomy was limited to 500 topics and to topics that appear in at least 3 papers.

Fig. 2. Overview of the mixed methods-based methodology.

Table 1. Comparison of the methods and data sets used by various topic and trend analysis tools.

| Tool | Taxonomy Creation | Topic Taxonomy | Document Corpus | Corpus Annotation | Topic Analysis | Other Analytics |
|---|---|---|---|---|---|---|
| PoolParty | Manual | Fairly broad/deep | SWVs 2006-2015 | Automatic (full-text) | Topic frequency in text | Taxonomy extension |
| Rexplore | Automatic from broader external corpus | 17K topics in CS, 96 topics in SW, 9 levels deep | SWVs 2006-2015; Scopus 2006-2015 | Automatic (abstracts, titles, keywords) | Number of papers and citations associated with a topic | Taxonomy learning, expert profiling |
| Saffron | Automatic from the document corpus | Fairly broad/deep | SWVs 2006-2015 | Automatic (full-text) | Topic frequency and semantic relatedness | Taxonomy learning, expert finding, document classification and search |
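The Saffron parameterization mentioned above (a cap of 500 topics, each appearing in at least 3 papers) can be pictured as a simple frequency filter. The sketch below is an assumption about what such a filter might look like, not Saffron's actual code; the function name `select_topics`, its parameters and the toy input are hypothetical.

```python
from collections import Counter


def select_topics(doc_topics, min_papers=3, max_topics=500):
    """Keep topics found in at least min_papers papers, capped at max_topics.

    doc_topics is a list with one collection of candidate topic labels per paper.
    Returns (topic, paper_count) pairs, most frequent first.
    """
    # Count each topic once per paper, however often it occurs in that paper.
    paper_counts = Counter(
        topic for topics in doc_topics for topic in set(topics)
    )
    frequent = [
        (topic, count)
        for topic, count in paper_counts.items()
        if count >= min_papers
    ]
    # Most frequent first; alphabetical order breaks ties deterministically.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return frequent[:max_topics]


# Hypothetical candidate topics extracted from four papers.
doc_topics = [
    {"linked data", "sparql"},
    {"linked data", "ontology"},
    {"linked data", "sparql"},
    {"sparql", "reasoning"},
]
selected = select_topics(doc_topics)
```

A threshold of this kind removes one-off terms, but it also means that topics mentioned in only one or two papers cannot surface, which is worth keeping in mind when looking for emerging topics.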

In Step 4 we performed a syntactic analysis of the top forty topics extracted by each of the data-driven tools. Both singular and plural representations of a topic were treated as the same topic. Additionally, topics with a high syntactic correlation were treated as the same, for instance knowledge base and knowledge based systems. A detailed description of the respective analysis performed by PoolParty, Rexplore and Saffron and the cross correlation of topics is presented in Section 5.
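The syntactic analysis in Step 4 was carried out by the authors; purely as an illustration, a rough automated approximation could look as follows. The naive singularisation rule, the use of `difflib.SequenceMatcher` from the Python standard library and the 0.7 similarity threshold are all assumptions chosen for this sketch, not values taken from the study.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lower-case a topic label and naively singularise each word."""
    words = []
    for word in label.lower().split():
        if word.endswith("ies") and len(word) > 3:
            word = word[:-3] + "y"      # ontologies -> ontology
        elif word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]            # systems -> system
        words.append(word)
    return " ".join(words)


def same_topic(label_a, label_b, threshold=0.7):
    """Treat two labels as one topic if their normalised forms are similar.

    The threshold is an assumed value for illustration only.
    """
    a, b = normalize(label_a), normalize(label_b)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


plural_merged = same_topic("Ontologies", "ontology")
near_match = same_topic("knowledge base", "knowledge based systems")
unrelated = same_topic("linked data", "query language")
```

The singularisation rule is intentionally crude (it would mangle a word such as "analysis"), so in practice any automatically merged pair would still need the kind of manual check described in the text.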

3.3. Cross correlation of results

The final stage of our analysis involved the alignment of the topics identified by Rexplore, PoolParty and Saffron with the Research Landscape topics emerging from the analysis of the seminal papers. In Step 5 the output of each of the three data-driven approaches was mapped by one of the authors of this paper to the topics of the Research Landscape. The principles used to guide the mapping process, which involved a combination of syntactic and semantic matching, can be summarized as follows:

Exact syntactic match: is the most straightforward case as topics that have exactly the same label (e.g., Linked Data) are already aligned.

Partial syntactic match: refers to cases where two topics have similar but not exactly matching labels, however clearly refer to the same body of research. For instance, Description Logics is a subtopic of Logic and Reasoning.


Semantic match: denotes topics that have syntactically completely disjoint labels but they are semantically related. Links between syntactically different labels are often recorded in our extended Research Landscape document, where several keywords were assigned to a larger overlapping topic. For example, we assigned keywords such as SPARQL to the Query Languages topic.

No match: is used to represent topics identified by the data-driven approaches that are completely new and cannot be related to any of the topics of the Research Landscape.
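The four mapping principles above were applied manually in this study. As a hedged illustration of how they relate to one another, the sketch below checks them in order of strictness. The `LANDSCAPE` dictionary, its keyword sets, the stop-word list and the token-overlap test for partial matches are assumptions made for the example and do not reproduce the authors' Research Landscape document.

```python
# Hypothetical excerpt of a Research Landscape: topic label -> assigned keywords.
LANDSCAPE = {
    "Linked Data": {"linked data"},
    "Logic and Reasoning": {"inference rules"},
    "Query Languages": {"sparql"},
}

STOPWORDS = {"and", "of", "the"}


def _tokens(label):
    """Lower-case words with trailing 's' stripped (a naive plural rule)."""
    return {word.rstrip("s") for word in label.lower().split()} - STOPWORDS


def match_type(topic, landscape=LANDSCAPE):
    """Classify a data-driven topic as exact, partial, semantic or none."""
    label = topic.lower()
    # 1. Exact syntactic match: identical labels.
    for name in landscape:
        if label == name.lower():
            return ("exact", name)
    # 2. Partial syntactic match: labels share at least one content word.
    for name in landscape:
        if _tokens(topic) & _tokens(name):
            return ("partial", name)
    # 3. Semantic match: disjoint labels linked through recorded keywords.
    for name, keywords in landscape.items():
        if label in keywords:
            return ("semantic", name)
    # 4. No match: the topic is new with respect to the landscape.
    return ("none", None)


examples = {
    topic: match_type(topic)
    for topic in ["Linked Data", "Description Logics", "SPARQL", "Blockchain"]
}
```

Checking the stricter rules first keeps the classification unambiguous; a shared word is of course only a weak signal, which is why the partial and semantic alignments in the study were cross-checked by additional authors.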

In order to reduce any bias, in Step 6 individual topic alignments were cross-checked by the two additional authors and further discussed during an analysis and cross-correlation workshop. The results of this workshop are depicted in Tables 3-6 and further discussed in Section 6.

4. Seminal Papers Topic and Trend Analysis

In the Semantic Web area, a handful of well-known papers identify research topics and discuss trends within the community [1,2,13]. Some of these papers predict future topics [1,2], while others reflect on research topics in the past years or in the present [2,13].

4.1. The seminal papers

At the turn of the millennium (2001), Berners-Lee et al. [1] coined the term "Semantic Web" and set a research agenda for a multidisciplinary research field around a handful of topics. Six years later, Feigenbaum et al. [13] analyzed the uptake of Semantic Web technologies in various domains as of 2007. In doing so, they provided a picture of the technologies available at that time as well as the main challenges that these technologies could solve. The authors took a reflective rather than predictive stance in their work. On the 15-year anniversary of the Semantic Web community, Bernstein et al. [2] provide their vision of research beyond 2016 by grounding their predictions in an overview of past and present research. Therefore, their paper is both reflective of past/present work and predictive in terms of future research.

Each of the vision papers mentioned above is primarily based on the expert knowledge of the authors and reflects their views, without aiming to be complete. Our objective is to use the topics identified in these seminal papers as a baseline for a comparison with the output of the three data-centric topic identification methods discussed in this paper. Note that, unlike in information retrieval research, the proposed Research Landscape (cf. Table 2) is by no means an absolute gold-standard that should be achieved, but rather acts as an intuitive comparison basis for understanding the strengths and weaknesses of expert-driven versus data-driven topic identification methods.

4.2. Core topics from the seminal papers

After manually annotating research topics discussed in each of the seminal papers, we aligned the identified topics across papers, and observed eleven core research topics that are mentioned by two or three of the seminal papers (cf. Table 2). All three papers agree on the following eight core research topics:

Knowledge representation languages and standards, such as XML, RDF and a so-called Semantic Web language, were considered crucial to enabling the vision of intelligent software agents by Berners-Lee et al. [1]. Work on the development of web-based knowledge representation languages (now also including OWL) continued over the next 7 years [13]. By 2016 this was seen as a core line of research extending also to the standardisation of representation languages for services [2]. As for the future, Bernstein et al. [2] predict that knowledge representation research will focus on representing lightweight semantics, dealing with diverse knowledge representation formats and developing knowledge languages and architectures for an increasingly mobile and app-based Web.

Knowledge structures and modeling. Berners-Lee et al. [1] consider knowledge structures such as ontologies, taxonomies and vocabularies as essential components of the Semantic Web. Follow-up papers confirm active research on the creation of ontologies [2,13], while Bernstein et al. [2] introduce knowledge graphs as novel knowledge representation structures.

Logic and Reasoning. Berners-Lee et al. [1] assumed that inference rules and expressive rule languages would enable logic-based automated reasoning on the Semantic Web. Their prediction was abundantly confirmed in follow-up papers: Feigenbaum et al. [13] report work on the development of inference engines for reasoning by 2007, and Bernstein et al. [2] confirm work on developing tractable and efficient reasoning mechanisms.

Search, retrieval, ranking, and question answering. Besides intelligent agents, Berners-Lee et al. [1]


Table 2. Research Landscape: Core and Marginal topics discussed in the seminal papers. Topics in () were only intuitively mentioned.

| Group | Berners-Lee et al. [1], Future | Feigenbaum et al. [13], Past (2000-2007) | Bernstein et al. [2], Past (2000-2016) | Bernstein et al. [2], Future from 2016 |
|---|---|---|---|---|
| Core topics | knowledge representation languages and standards | knowledge representation languages and standards | knowledge representation languages and standards | representing lightweight semantics |
| Core topics | ontologies and modeling, taxonomies, vocabularies | ontologies and modeling, taxonomies, vocabularies | ontologies and modeling, (PR) knowledge graphs | - |
| Core topics | logic and reasoning | logic and reasoning | logic and reasoning | - |
| Core topics | search and question answering | (ranking) | (PR) question answering systems | - |
| Core topics | (data integration) | (ontology matching) | (PR) needs-based, lightweight data integration | integration of heterogeneous data |
| Core topics | proof & trust | privacy, trust, access control | personal information, privacy | trust & data provenance (representation, assessment) |
| Core topics | databases | semantic web databases | database management systems | - |
| Core topics | decentralization | (decentralization) | vastly distributed heterogeneous data | (decentralization) |
| Core topics | - | (machine learning, prediction, analysis, automatic report) | knowledge extraction and discovery | latent semantics, knowledge acquisition, ontology learning |
| Core topics | - | query language (SPARQL) | developing efficient query mechanisms | - |
| Core topics | - | (linked data, DBpedia) | (PR) linked data (open government data), (social data) | - |
| Marginal topics | intelligent software agents | - | multilingual intelligent agents | - |
| Marginal topics | (Internet of Things) | - | - | high volume and velocity of data, e.g., streaming & sensor data |
| Marginal topics | - | (scalability, efficiency, robust semantic approaches) | - | scale changes drastically |
| Marginal topics | (semantic web services) | - | - | - |
| Marginal topics | - | visualization | - | - |
| Marginal topics | - | change management and propagation | - | - |
| Marginal topics | - | (social semantic web, FOAF) | - | - |
| Marginal topics | - | - | - | data quality, e.g., representation, assessment |

predicted that search and question answering programs would also benefit from the Semantic Web. In 2007, Feigenbaum et al. [13] indirectly refer to this topic in the context of ranking; however, this research topic becomes increasingly important according to Bernstein et al. [2], who describe work on question answering systems based on semantic markup and linked data from the Web (e.g., IBM Watson).

Matching and Data Integration. Ontology matching and data integration were already intuitively mentioned, but not concretely named, by Berners-Lee et al. [1]. Data integration played an important role in many commercial applications developed up until 2007 and opened up the need for change management and change propagation across integrated data sets [13]. By 2016, a new trend towards needs-based, lightweight data integration is observed [2]. For the future, Bernstein et al. [2] discuss the need to integrate heterogeneous data as part of the broader topic of data management.

Privacy, Trust, Security, and Provenance. Berners-Lee et al. [1] envision proofs and digital signatures as key aspects of the Semantic Web in order to enable more trustworthy data exchange, and the topic of privacy was also mentioned in 2007 [13]. According to Bernstein et al. [2], future work should focus on the representation and assessment of provenance information, as part of the broader topic of data management.

Semantic Web Databases. Similarly to Berners-Lee et al. [1], Feigenbaum et al. [13] discuss research topics around the development of Semantic Web tools as instrumental for commercial uptake, especially ontology editors (e.g., Protégé) and Semantic Web databases (e.g., triple stores). According to Bernstein et al. [2], many of these tools evolved into commercial tools by 2016.


Distribution, decentralization, and federation. Berners-Lee et al. [1] envisioned that the Semantic Web would be as decentralized as possible, bringing new interesting possibilities at the cost of losing consistency. Feigenbaum et al. [13] exemplified one of these novel scenarios by mentioning FOAF as an example of a decentralized social-networking system. Bernstein et al. [2] commented on this topic briefly, confirming that modern semantic approaches already integrate distributed sources in a lightweight fashion, even if the ontologies are contradictory.

Besides the aforementioned core topics, three important topics were not predicted by Berners-Lee et al. [1], but were mentioned by the other two papers:

Knowledge extraction, discovery and acquisition. In 2007, Feigenbaum et al. [13] hint at this topic with terms such as machine learning, prediction and analysis. Automatic knowledge acquisition was boosted by more powerful statistical and machine learning approaches as well as improved computational resources [2]. For the future, Bernstein et al. [2] identify a need for new techniques to extract latent, evidence-based models (ontology learning), to approximate correctness and to reason over automatically extracted ontologies/knowledge structures. Increasing importance is given to using crowdsourcing for capturing collective wisdom and complementing traditional knowledge extraction techniques.

Query Languages and Mechanisms. By 2007, research also focused on the development of query languages, most notably SPARQL [13], and on developing efficient query mechanisms [2].

Linked Data. By mentioning DBpedia, Feigenbaum et al. [13] intuitively pointed to the future research topic of Linked Data. This topic became well established by 2016 and a new wave of structured data available on the web (e.g., open government data, social data) further extended research on the Linked (open) Data topic [2].

4.3. Marginal topics from the seminal papers

Our analysis also identified several marginal topics, mentioned by two of the seminal papers (Table 2), as follows:

Intelligent software agents. The underpinning theme of the vision paper by Berners-Lee et al. [1] was intelligent software agents that would provide advanced functionality to users by being able to access the meaning of Semantic Web data. Interestingly, this topic was not mentioned again until recently, when Bernstein et al. [2] discussed work on training conversational intelligent agents based on multilingual textual data on the web.

Internet Of Things. The application of Semantic Web to physical objects within the context of the future Internet Of Things (IoT) was intuitively mentioned by Berners-Lee et al. [1]. This topic was not reported as past work by either of the follow-up papers, even though it is considered to play an important role in the future. Indeed, Bernstein et al. [2] predict that dealing with high volume and velocity data will be necessary due to the increased number of streaming data sources from sensors and the IoT. They envision techniques for the selection of streaming data (data triage), for decision-making on streaming sensor data as well as the integration of streaming sensor data with high quality semantic data.

Scalability, efficiency and robustness. Feigenbaum et al. [13] position scalability, efficiency and robust semantic approaches as key factors needed to address Semantic Web challenges, in particular integration, knowledge management and decision support. In turn, Bernstein et al. [2] recognize that new research is needed given that the scale changes drastically.

Semantic Web Services. Berners-Lee et al. [1] also envisioned the applicability of Semantic Web technologies for advertising and discovering web-services.

Human-Computer Interaction. Feigenbaum et al. [13] mention visualization as a feature of user-centric applications.

Change management and propagation. Feigenbaum et al. [13] mention or hint that change management and change propagation across integrated data sets is needed to accompany data integration research.

Social semantic web. Although predicting future trends was not their explicit goal, by mentioning FOAF, Feigenbaum et al. [13] intuitively pointed to the future research topic of the Social Semantic Web.

Data quality. Under the heading of data management, Bernstein et al. [2] group work on data integration, data provenance and new technologies that should allow representing and assessing data quality, such as task-focused quality evaluation (e.g., is a resource of sufficient quality for a task?).

4.4. Trends

Although the seminal papers focus primarily on research topic identification, they also offer some hints on the way these topics evolve over time (i.e., trends).


In 2001, Berners-Lee et al. [1] used a fictitious scenario to describe a vision of a web of data that can be exploited by intelligent software agents that carry out data-centric tasks on behalf of humans. Additionally, the paper identifies the infrastructure necessary to realize this vision, focusing on four broad areas of research, namely: expressing meaning, knowledge representation, ontologies and intelligent software agents.

In 2007, Feigenbaum et al. [13] reflected on the ideas presented in [1] and highlighted that although the original autonomous agent vision was far from being realized, the technologies themselves were proving to be highly effective in terms of tackling data integration challenges in enterprises, especially in the life sciences and health care domains. Furthermore, the authors highlighted that consumers were starting to adopt FOAF profiles and to embrace decentralized social-networking. However, they also point to new privacy concerns when linking disparate data sources.

In 2016, discussing present research topics, Bernstein et al. [2] noted a large spectrum between two opposite research lines: expressivity and reasoning on the Web on the one hand and ecosystems of Linked Data on the other. Particularly notable is the adoption of Semantic Web technologies in several large, more applied systems centered around knowledge graphs, which use Semantic Web representations yet ensure the functionality of applied systems, which resulted in less formal and precise representations than expected at the earlier stages of Semantic Web research. Based on these considerations, the authors predict a move from logic-based to evidence-based approaches in an effort to build truly intelligent applications using vast, heterogeneous, multi-lingual data.

5. Semantic Web Publications Topic and Trend Analysis

In this section we describe the results of topic and trend analysis by employing data-driven tools. The bottom-up analysis was performed with three different tools (i.e., PoolParty, Rexplore, and Saffron) that enable users to gain insights into the various research topics that appear in research papers published at popular Semantic Web publishing venues.

5.1. PoolParty quantitative analysis

The analysis conducted by PoolParty was based on a coarse-grained taxonomy of 3,420 unique dictionary topics (that were crowd-sourced from experts in the community in the form of conference and journal metadata), which was generated by assigning each topic to one or more foundational technologies worked on by the community.

The chart presented in Figure 3 provides details on the % coverage for each of the eighteen foundations, across the five venues for the 10-year timeframe under examination. As expected, Knowledge Representation & Data Creation/Publishing/Sharing is the top foundation, with almost 23% of the total occurrences in all documents. This foundation includes several topics that are fundamental to the Semantic Web community (i.e., the ability to represent semantic data and to publish and share such data). Next in order of importance, the management of such knowledge (Data Management) and the construction of feasible systems (System Engineering) constitute almost 16% and 11% of the occurrences, respectively. Important functional areas such as Searching, Browsing & Exploration, Data Integration and Ontology/Thesaurus/Taxonomy Management also figure strongly in comparison to the other foundations (all of them with more than 7.5% occurrences). In contrast, very specific topics, such as Formal Logic, Formal Languages, Description Logics & Reasoning, and Concept Tagging & Annotation represent a modest 4.4% and 2.6% respectively, and cross-topics, such as Human Computer Interaction & Visualization, Machine Learning, Computational Linguistics & NLP, Security & Privacy, Recommendations, and Analytics are only marginally represented. Topics that relate to Quality, Dynamic Data & Streaming, and Robustness, Scalability, Optimization & Performance are also under-represented (at around 2%).

In order to gain some insights into the research trends over the last decade, Figure 4 depicts the growth/decline of each of the foundations over the 10-year timeframe. Although the general trend for all topics shows year on year increases, we note that Robustness, Scalability, Optimization & Performance, Dynamic Data & Streaming, Searching, Browsing & Exploration, and Machine Learning have increased by more than 200% since 2005. In contrast, Security & Privacy, and Ontology/Thesaurus/Taxonomy Management have had marginal growth of only 30% for the same period.
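The growth figures quoted above are relative changes against a base year. A minimal sketch of that calculation is given below; the function name `percent_growth`, the foundation labels and the yearly counts are hypothetical placeholders chosen for illustration and are not values from the PoolParty analysis.

```python
def percent_growth(start_count: float, end_count: float) -> float:
    """Relative change from start_count to end_count, in percent."""
    if start_count <= 0:
        raise ValueError("start_count must be positive")
    return (end_count - start_count) / start_count * 100.0


# Hypothetical occurrence counts (first year, last year) per foundation.
example_counts = {
    "Foundation A": (50, 160),
    "Foundation B": (100, 130),
}

growth = {
    name: percent_growth(first, last)
    for name, (first, last) in example_counts.items()
}

for name, value in growth.items():
    print(f"{name}: {value:.0f}% growth")
```

Under this reading, a foundation whose occurrences more than triple over the period shows growth above 200%, whereas a foundation that grows from 100 to 130 occurrences shows growth of 30%.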

Figure 5 focuses on the growth/decline of the top 20 multi-word topics. Interestingly, results show a sharp increase of Linked Data at the expense of Semantic Web. Note also that Natural Language is in the top-10 multi-word topics, even though this is a cross topic


Fig. 3. PoolParty: % coverage per foundational technology across the 5 venues for the 10-year timeframe

Fig. 4. PoolParty: Growth/Decline of foundational technologies across the 5 venues for the 10 year timeframe

which may be more represented in a different community. Finally, the decrease in the occurrence of Web Services can also be seen here.

5.2. Rexplore quantitative analysis

Since it is interesting to compare the trends exhibited by high-tier domain conferences with the ones


Fig. 5. PoolParty: Growth/Decline of the (a) top 10 and (b) top 11-20 multi-word topics across the 5 venues for the 10 year timeframe.

Fig. 6. Rexplore: Frequent topics (excluding subtopic) in MV (blue) and FSW (red).

appearing in the full literature, Rexplore compares the corpus described in Section 3 (here labelled Main Venues, MV) with a second data set containing all papers that reference the Semantic Web in the Rexplore data set (here labelled Full Semantic Web, FSW). FSW includes 32,431 publications and it was produced by selecting all articles associated with the topic Semantic Web or with its 96 associated subtopics in CSO (e.g., Linked Data, RDF, Semantic Web Services) in both MV and in a dump of all Scopus papers in Computer Science for the interval 2000-2015. The analysis of both data sets was limited to the 2006-2015 period. The analysis presented here follows the Expert-Driven Automatic Methodology (EDAM) [29] for performing systematic reviews of scholarly articles. EDAM is a methodology that reduces the amount of tedious manual tasks involved in systematic reviews by 1) applying data-driven methods for automatically generating an ontology of research areas, 2) revising it with domain experts, and 3) using it to annotate papers and produce relevant analytics. We will first analyse the main topics appearing in each data set and then zoom in on Semantic Web subtopics.

Fig. 7. Rexplore: Frequent Semantic Web subtopics in MV (blue) and FSW (red).

Fig. 8. Rexplore: Number of publications associated with eight Semantic Web subtopics in MV.

Figure 6 shows the main research fields addressed by the Semantic Web papers in both MV and FSW, ranked by the percentage of their publications in the field of Semantic Web. We excluded from this view any super and sub areas of Semantic Web that will be discussed later in detail. Unsurprisingly, the topic Ontology appears in about 61.2% of the papers (55.3% for FSW), followed by Artificial Intelligence (35.1%, 27.2%), Information Retrieval (32.7%, 25.2%), Query Languages (26.5%, 17.1%) and Knowledge Base System (17.5%, 12.7%). Interestingly, these five core research areas appear more often in the main venues (+7.1% on average), but they are also very important areas for the FSW dataset. Other research areas appear more prominently in one of the datasets. The Query Languages area is much more frequent in the MV, probably due to the fact that the main venues traditionally


Fig. 9. Rexplore: Number of publications associated with eight Semantic Web subtopics in FSW.

are focused on Semantic Web query languages, such as SPARQL. Formal Logic has a similar behavior (10.9%, 6.6%), suggesting a stronger focus of the main venues on this topic. Conversely, other research fields appear more often in the FSW dataset. This is the case of Natural Language Processing (17.5%, 18.9%), Human Computer Interaction (9.8%, 15.5%), Web Services (5.6%, 13.4%), Electronic Commerce (3.2%, 4.3%) and Ubiquitous Computing (3.1%, 5.3%). This seems to suggest that there is a substantial amount of research in the intersection of these topics and Semantic Web that is not fully represented in the main venues.
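The average difference reported for the five core research areas can be checked directly from the percentages given in the text. The short sketch below does so; the variable names are our own, and the only inputs are the (MV, FSW) pairs quoted above.

```python
# Percentages reported in the text: (share in MV, share in FSW).
reported = {
    "Ontology": (61.2, 55.3),
    "Artificial Intelligence": (35.1, 27.2),
    "Information Retrieval": (32.7, 25.2),
    "Query Languages": (26.5, 17.1),
    "Knowledge Base System": (17.5, 12.7),
}

# Difference MV - FSW per research area, in percentage points.
differences = {area: mv - fsw for area, (mv, fsw) in reported.items()}

# Unweighted mean of the five differences.
average_difference = sum(differences.values()) / len(differences)

print(f"{average_difference:+.1f} percentage points")
```

The unweighted mean of the five differences (5.9, 7.9, 7.5, 9.4 and 4.8) is 7.1 percentage points, which agrees with the value stated for the main venues.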

The Semantic Web field subsumes several heterogeneous research areas dealing with different aspects of its vision. Figure 7 shows the popularity of the main Semantic Web direct subtopics in the two datasets. We include in this view also the area of Ontology Engineering, which is not formally a sub-topic of Semantic Web, since a very large portion of its outcomes are published in the main Semantic Web venues. It is again interesting to consider the difference between the datasets. The topics Linked Data (23.4%, 8.8%), Ontology Matching (9.5%, 2.1%), OWL (9.4%, 6.7%), and SPARQL (3.5%, 1.9%) are more frequent in the main venues. Conversely, publications addressing Semantic Search (4.0%, 10.6%) and Semantic Web Services (1.7%, 3.9%) are more popular outside these venues.

Figure 8 and Figure 9 show the popularity of the main sub-topics over the years. The two main dynamics, evident in both datasets, are the fading of Semantic Web Services and the rapid growth of Linked Data and, to a lesser extent, of SPARQL. Indeed, Semantic Web Services is one of the main areas in 2004, and an integral part of the initial Semantic Web vision [1]. However, the number of papers about this topic consistently decreases, and from 2013 there are almost no publications about it in the SW corpus and very few in FSW. The second trend is the steady growth of Linked Data from 2007. In 2015 about half of the Semantic Web papers in the main venues refer to this topic. Interestingly, both trends are first anticipated by the main venues, and only later evident also in the FSW dataset. It thus seems that the tendencies of the main venues influence, in time, all Semantic Web research.

5.3. Saffron quantitative analysis

Saffron employs a domain-independent approach to topic extraction, which is one of its biggest advantages compared to most systems in the area, in that it does not require external domain-specific classifications. Such information is often not readily available, especially in niche domains, and creating a classification is very costly in terms of time, human expertise, and maintenance. Saffron bypasses this barrier by automatically building a domain model from the input corpus itself, and by capturing the expert knowledge of the corpus by isolating its most generic concepts. The constructed hierarchical taxonomy can be visualized as a graph. We use Cytoscape, an open source software tool for the visualization of complex network graphs⁸. It allows us to perform a network analysis on the output provided by Saffron, and a customization of the layout. In our case, the size and the color of the nodes are proportional to the number of neighbors each topic is connected to.

Figure 10 shows the general picture of the graph displaying the interconnected topics from the results of the analysis. The size and the colour of the nodes in the graph are related to the number of edges that are connected to them, i.e., the bigger nodes with red shades are the most connected topics while the smaller and blue nodes are the leaves of the tree. The first and predominant node (i.e., the root of the taxonomy) is the Semantic Web topic itself. Around it, several main clusters with major keyphrases emerge, including: RDF Data and Linked Data, followed by Natural Language, Data Source and Reasoning Task. A strong focus is also put on Machine Learning, Ontology Engineering, Query Execution, and the mark-up language OWL-S. By concentrating on the clusters, we identify the importance of data in terms of its representation (RDF Data, RDF Graph, Linked Data), its accessibility (Open Data), and its querying (Query Execution, Query Processing). Some other main interests in the domain are visible, represented by a cluster made up of Natural Language and topics related to the querying of information such as Semantic and Keyword Search, Keyword Query, Semantic Similarity or Information Retrieval. Natural Language is also connected to another dominating topic, Machine Learning, which is associated with Ontology Matching and Mapping. One of the branches originating from the Semantic Web topic brings together concepts related to the structure and representation of the ontology (Knowledge Base, Knowledge Representation, Ontology Language, OWL Ontology), while a sub-branch leads to logic and reasoning related topics (Description Logic, Reasoning Task, Reasoning Algorithm). The Ontology Engineering node is related to topics such as User Interface, Ontology Development and Ontology Editing.
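The node sizing described above depends only on how many neighbours each topic has in the taxonomy graph. A minimal sketch of that count is given below. It does not reproduce Saffron's or Cytoscape's implementation: the function `neighbor_counts` and the small edge list are assumptions made for illustration, with topic names borrowed from the description of Figure 10.

```python
from collections import defaultdict


def neighbor_counts(edges):
    """Return {topic: number of distinct neighbours} for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


# Illustrative edges only; the real taxonomy is far larger.
example_edges = [
    ("Semantic Web", "Linked Data"),
    ("Semantic Web", "RDF Data"),
    ("Semantic Web", "Natural Language"),
    ("Natural Language", "Machine Learning"),
]

degrees = neighbor_counts(example_edges)

# Topics with exactly one neighbour correspond to the leaves of the tree.
leaves = sorted(topic for topic, degree in degrees.items() if degree == 1)
```

In this toy graph the root topic has the most neighbours and would be drawn largest, while the topics with a single neighbour would be drawn as the smallest nodes.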

As demonstrated above, the main nodes are at the centre of clusters of topics that are semantically related to them. In the following analysis, we thus focus on the evolution of those major terms, which are the most prominent for a cluster. We selected the top 20 topic terms (i.e., the most connected ones) and observe the distribution of their use in the SW corpus. The two charts in Figure 11 show the percentage of documents containing the aforementioned topics (i.e., the number of documents where the term appears at least once), per year. We observe that Semantic Web as a topic is on the decline, with a decrease of about 20 percentage points between the beginning and the end of the studied time frame. It is still the most used topic nonetheless, appearing in 91% of the documents in 2006 and dropping to 70% in 2015. The reasons for this decline could be manifold: the term/field may be so established that it is not named explicitly in the papers anymore or the community is trying to re-brand their research with new terms such as Linked Data.

⁸ Cytoscape, http://www.cytoscape.org/

Indeed, the most significant progression is the use of the term Linked Data. While it was completely missing in 2006, it experienced a very rapid growth, in particular between 2008 and 2010 where its rise was 9-fold, eventually appearing in 64% of the documents by 2015. Similarly, the Open Data topic increased from about 1% in 2006/2007 to 45% in 2015. Other emerging topics include: Query Execution, appearing in 2015 in 15% of the publications, as well as RDF Data and Data Source, which doubled their presence since 2006. Topics whose popularity increased by at least twice their initial proportion include RDF Graph (with two peaks in 2008 and 2014), Machine Learning (with a peak in 2012) and Query Processing (with a small peak in 2009 then a quite steady line).

Among the topics experiencing strong variations through time, the term Web Service is a declining one. After experiencing a peak of use in 2008, when it appeared in 40% of the documents, it dropped to less than 20% in 2015. Semantic Search experienced two small peaks in 2008 and 2011, and slight drops in 2010 and 2012, followed by a more steady curve thereafter. Some topics appear to be consistent over the years, such as Ontology Engineering, while some others are more volatile. The Natural Language topic, despite being equally cited in 2006 and in 2015, gradually dropped in the first half of the period examined, to gain in popularity again after 2011. Keyword Search shows quite a varied pattern, with drops in 2007, 2010 and 2012, and peaks in 2009 and 2011. As for Service Description, it increased slowly up to 13% by 2009, but gradually declined towards its initial value by 2015.
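The per-year percentages discussed in this section count a document once if the topic term appears in it at least once. A minimal sketch of such a count is given below. It is not Saffron's implementation: the function `yearly_document_share`, the case-insensitive substring matching and the toy corpus are all assumptions made for illustration.

```python
from collections import defaultdict


def yearly_document_share(documents, term):
    """Percentage of documents per year whose text contains `term` at least once.

    `documents` is an iterable of (year, text) pairs. Matching is a plain
    case-insensitive substring test, which is a simplification.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    needle = term.lower()
    for year, text in documents:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}


# Toy corpus with made-up texts, for illustration only.
toy_corpus = [
    (2006, "An ontology for the semantic web"),
    (2006, "Reasoning with description logics"),
    (2015, "Publishing linked data at scale"),
    (2015, "Query execution over RDF graphs"),
]

share = yearly_document_share(toy_corpus, "Linked Data")
```

With this toy corpus the term is absent in 2006 and present in half of the 2015 documents, so the resulting series has the same shape as the curves in Figure 11, although the values themselves are invented.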

5.4. Comparison of top forty topics extracted by each tool

Fig. 10. Saffron: Taxonomy of Semantic Web topics.

Fig. 11. Saffron: Topic term occurrence evolution over time for (a) top 10 topics and (b) top 11-20 topics.

Table 8 and Table 9 in Appendix A highlight the top 40 multi-word topics that were extracted by at least two data-driven tools and those that were only identified by a single data-driven tool, respectively, based on a simple syntactic matching of the topics. After normalizing the topic names across the sets, we found 86 unique topics. 12 of these were detected by all systems and 23 by at least two systems. We thus computed the Spearman’s Rank-Order Correlation on the intersection of the three sets. We found that Rexplore and PoolParty exhibit a moderate correlation (ρ = 0.61) and a statistically significant association (p (2-tailed) = 0.035). Conversely, the list produced by Saffron is not correlated with the ones of Rexplore (ρ = 0.01, p (2-tailed) = 0.966) or PoolParty (ρ = 0.01, p (2-tailed) = 0.681).
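For readers unfamiliar with the statistic, Spearman's rank-order correlation is the Pearson correlation computed on the ranks of two lists. The sketch below illustrates the calculation in plain Python; the function names and the two example lists are hypothetical, the rankings produced by the three tools are not reproduced here, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one list is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical positions of five shared topics in two tools' top-40 lists.
tool_a_positions = [1, 2, 3, 4, 5]
tool_b_positions = [2, 1, 4, 3, 5]
rho = spearman_rho(tool_a_positions, tool_b_positions)
```

Values of ρ close to 1 indicate that two tools order the shared topics similarly, while values close to 0, as reported for Saffron, indicate no monotonic relationship between the two orderings.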

The topics uncovered by all three tools could be categorized as reflecting the core focus of the community (knowledge base, linked data, semantic search, semantic web, web services, ontology matching, query languages) and several well established sub-communities (information retrieval, machine learning, natural language processing, ontology engineering). The topics uncovered by two or more tools further elaborate on core research topics within the community (data integration, data source, linked open data (LOD), ontology language, open data, query processing, social networks, user interfaces, web data, web semantics, web ontology language (OWL)). The topics uncovered by only one tool, by contrast, are a mix between supporting technology (e.g., rdf data, rdf graph, search engines, logic programming, SPARQL), very specific topics (e.g., human computer interaction, stream processing, data privacy, federated query processing), commonly used data sources (e.g., DBpedia, wikipedia), and frequently used terms (e.g., on the web, use cases, web of data).

Although in this paper we do not go into details of the specific algorithms employed by PoolParty, Rexplore and Saffron, it is possible to speculate as to why certain topics appear in the top forty list of the various tools. For instance, considering that the PoolParty taxonomy is created from conference and journal metadata, it is not surprising that topics such as case studies, use cases and references to on the web or web of data appear, as these terms could frequently occur in calls for papers. In the case of Rexplore we see evidence of broader topics, such as artificial intelligence and human computer interaction, that are reflective of the broader nature of the Rexplore taxonomy, which was generated from a more general computer science corpus. Finally, considering that Saffron not only learns the topics from the corpus, but also tries to identify distinguishing topics for papers, it is not surprising that we see evidence of specific topics such as federated query processing and stream processing.

6. Topic Alignment and Findings

In this section, we compare and contrast the core and marginal topics mentioned in the seminal Semantic Web papers (discussed in Section 4) with the primary topics identified by the three bottom-up data-driven approaches (Rexplore, Saffron, PoolParty) presented in Section 5. Initially we conducted the mapping exercise with the top 20 topics; however, after seeing that there were no mappings for several core topics, we elected to use the top 40 multi-word topics from PoolParty, Rexplore and Saffron (see Table 7 in Appendix A).

6.1. Core and marginal topic analysis

The analysis presented in this section is based on a comparison between the core and marginal topics mentioned in the seminal Semantic Web papers and the predominant topics uncovered by PoolParty, Rexplore and Saffron. In contrast to the aforementioned data-driven topic analysis, which was based primarily on the syntactic cross-correlation of topics extracted by PoolParty, Rexplore and Saffron, the analysis presented in this section is based on the clustering of similar topics.

Core topic analysis: As shown in Table 3, all three data-driven approaches uncovered eight of the eleven Research Landscape topics, and every topic was uncovered by at least one data-driven approach. Notable omissions include the distribution, decentralization and federation topic, which was not uncovered by PoolParty and Rexplore, the privacy, trust, security, and provenance topic, which did not figure in the primary topics uncovered by Saffron, and the semantic web databases topic, which was not ranked highly by Rexplore.

Marginal topic analysis: Comparing the output from the data-driven approaches to the marginal topics presented in Table 4, we observe reduced coverage, with the multilingual intelligent agents and change management and propagation topics not featuring in any of the top 40 topic lists produced by PoolParty, Rexplore and Saffron. Moreover, the scalability, efficiency, robust semantic approaches topic was identified only by PoolParty and not by Rexplore or Saffron.

Additional topics: To complete the analysis, Table 5 highlights the topics that were extracted by the data-driven approaches but were not mentioned in the seminal papers. All three tools identified topics that are very general in nature and as such could not be easily mapped to the primary topics appearing in the seminal papers. For instance, recommendations, use cases, case studies, open data, information systems, web data, semantic technology, and structured data in the case of PoolParty, computational linguistics, recommender systems, mobile devices, cloud computing, e-learning system, robotics, electronic commerce systems, and decision support systems in the case of Rexplore, and open data, web data, web technology in the case of Saffron. Several of the topics uncovered by Rexplore stand out from the others as they are not topics per se but rather application or use case oriented keywords that were not extracted from the seminal papers.

Table 3. Core research topics identified in the seminal papers and their coverage by the data-driven approaches.

| Core topic | PoolParty (matched topics) | Rexplore (matched topics) | Saffron (matched topics) |
| knowledge representation languages and standards | knowledge representation | knowledge based systems, knowledge representation, Resource Description Framework (RDF), Web Ontology Language (OWL) | rdf data, owl s, blank node, object property |
| knowledge structures and modeling | ontology/thesaurus/taxonomy management, web semantics, ontology engineering, ontology language, data models, ontology matching | ontology, ontology engineering | owl ontology, ontology engineering, rdf graph, data model, ontology language, ontology editing, web semantics, ontology development, ontology matching |
| logic and reasoning | description logic, formal logic/formal languages/description logics, logic programming | formal logic, description logic, Web Ontology Language (OWL) | reasoning task, description logic |
| search, retrieval, ranking, question answering | search engines, semantic search, web search, natural language, searching/browsing/exploration, computer linguistics & NLP systems, information retrieval | information retrieval, semantic search/similarity, computer linguistics | keyword search, semantic search, natural language, information retrieval |
| matching and data integration | ontology matching, ontology alignment, similarity measures, data integration | ontology matching, data integration | ontology matching, semantic similarity |
| privacy, trust, security, provenance | security & privacy | security of data, data privacy | - |
| semantic web databases | data sets, knowledge base, data source, knowledge management, data management | knowledge base systems | data source, relational database, knowledge base |
| distribution, decentralization, federation | - | - | federated query, federated query processing |
| query languages and mechanisms | query languages, query answering, query processing | query languages, SPARQL, SPARQL queries | query execution, keyword query, query processing, query language |
| linked data | linked data, linked open data, semantic web, web of data, data integration, data creation/publishing/sharing | linked data, semantic web, linked open data, data integration | linked data, semantic web |
| knowledge extraction, discovery and acquisition | information retrieval, machine learning, extraction, data mining, text mining, entity, extraction, analytics, machine learning | information retrieval, natural language processing, data mining, machine learning, natural language processing systems | machine learning, information retrieval |

6.2. Evidence of future topics

Besides using the data-driven approaches to look for evidence of the topics that the community has been actively working on, we also investigated whether the data-driven approaches could find evidence of future trends predicted in the seminal papers, in particular those mentioned by Bernstein et al. [2]. According to our mapping presented in Table 6, evidence with respect to each of the four main lines of future research topics was uncovered by at least one of the data-driven approaches. Interestingly, all approaches found topics relating to the Internet of Things, streaming and sensor data, indicating a rise in importance of this topic within the Semantic Web community. At the same time, however, the other three topics, which relate to scale, intelligent software agents and quality, were only weakly identified by the data-driven approaches.


Table 4. Marginal research topics identified in the seminal papers and their coverage by the data-driven approaches.

| Marginal topic | PoolParty (matched topics) | Rexplore (matched topics) | Saffron (matched topics) |
| multilingual intelligent agents | - | - | - |
| semantic web services | web service, semantic web service | web services, semantic web services | web service, service description |
| visualization, user interfaces and annotation | user interfaces, semantic annotation, human computer interaction & visualization, annotation, concept tagging | human computer interaction, visualization | user interface |
| (scalability, efficiency, robust semantic approaches) | robustness, scalability, optimization and performance | - | - |
| change management and propagation | - | - | - |
| (social semantic web, FOAF) | social network | social networks | social medium |

Table 5. Research topics covered by the data-driven approaches that were not identified by the seminal papers.

| PoolParty | Rexplore | Saffron |
| recommendations, use cases, case studies, open data, information systems, web data, semantic technology, structured data | computational linguistics, recommender systems, mobile devices, cloud computing, e-learning system, robotics, electronic commerce systems, decision support systems | open data, web data, web technology |

Table 6. Visionary research topics from the seminal papers and their coverage by the data-driven approaches.

| Future topic | PoolParty (matched topics) | Rexplore (matched topics) | Saffron (matched topics) |
| scale changes drastically | robustness, scalability, optimization and performance | - | - |
| intelligent software agents | - | artificial intelligence | - |
| (Internet of Things), high volume and velocity of data, e.g., streaming & sensor data | dynamic data / streaming | Internet of Things | stream processing |
| data quality, e.g., representation, assessment | quality | - | - |

6.3. Evidence of trends

In the following we summarize the analysis of the trends identified by PoolParty (cf. Figures 4-5), Rexplore (cf. Figures 8-9) and Saffron (cf. Figure 11). The foundational topic and trend analysis conducted via PoolParty did not yield any useful results, as generally speaking work on each of the foundational topics appears to be increasing year on year. A cross correlation of the trends highlighted by PoolParty, Rexplore and Saffron provides evidence that topics such as linked data, open data and data sources have an upward trend, while topics such as semantic web, web service, service description and ontology matching appear to be on a downward trend. When it comes to trend analysis using the data-driven approaches, it is clear that neither foundational topic analysis nor topic specific analysis provides us with enough evidence to confirm the visions outlined in the seminal papers. For this there is a need for a more focused analysis that maps visions to relevant research topics and uses year on year aggregate counts to depict trends. Although Fernandez Garcia et al. [14] made some initial attempts at mapping the trends identified by PoolParty to the visions from the seminal papers, such a mapping is not straightforward even when performed manually, and as such is left to future work.
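A more focused analysis of this kind could be sketched as follows. This is only an illustration: the vision-to-topic mapping loosely follows Table 6 but is our assumption, the paper records are invented toy data, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a vision to related topics (loosely after Table 6).
vision_topics = {
    "streaming and sensor data": {"stream processing", "internet of things", "sensors"},
    "data quality": {"quality"},
}

# Toy corpus (invented values): publication year and topics assigned to a paper.
papers = [
    (2012, {"stream processing", "linked data"}),
    (2013, {"sensors"}),
    (2013, {"internet of things", "quality"}),
    (2014, {"stream processing"}),
    (2014, {"sensors", "stream processing"}),
    (2014, {"ontology matching"}),
]


def yearly_counts(papers, vision_topics):
    """Count, per vision and year, the papers touching at least one related topic."""
    counts = {vision: defaultdict(int) for vision in vision_topics}
    for year, topics in papers:
        for vision, related in vision_topics.items():
            if topics & related:
                counts[vision][year] += 1
    return {vision: dict(sorted(c.items())) for vision, c in counts.items()}


def year_on_year_change(series):
    """Difference between consecutive years that are present in the series."""
    years = sorted(series)
    return {later: series[later] - series[earlier]
            for earlier, later in zip(years, years[1:])}


counts = yearly_counts(papers, vision_topics)
for vision, series in counts.items():
    print(vision, series, year_on_year_change(series))
```

Years without any matching paper are simply absent from a series here; a fuller analysis would fill them with zeros and normalise by the number of papers published each year.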

6.4. Mixed methods observations

The comparative analysis of the research topics identified with the qualitative and quantitative methods, discussed in the previous sections, reveals several interesting observations on the benefits and drawbacks of these approaches, as discussed next.

Qualitative vs. Quantitative approaches. Comparing the quality of topic detection using data-driven methods with that of expert-driven methods (cf. Table 3), we observe that data-driven approaches had a high recall when it comes to detecting core topics identified by experts in the seminal papers. Data-driven methods failed, however, to cover multidisciplinary topics (i.e., topics that cross boundaries between areas), such as distribution, decentralization, federation, or privacy, trust, security, provenance, or semantic web databases. These weakly covered topics are particularly interesting, as they indicate research areas that, although considered important by experts, have not yet attracted a critical mass of research to be reliably identified with quantitative methods.

Analyzing the coverage of marginal topics (cf. Table 4), we find the opposite phenomenon: research topics for which there is marginal agreement among experts, but strong data-driven evidence of work on those topics. Indeed, data-driven approaches confirm some of the marginal topics such as social semantic web and human computer interaction. These are topics on which a sufficient volume of work is performed to allow identification by data-driven approaches, but for which a core community has not yet been formed.

As expected, the coverage of visionary topics (cf. Table 6) was lower. Although these peripheral topics are addressed to some extent by the Semantic Web community, the data-driven analysis failed to represent them with the required fine-grained details. It is clear from the results of our analysis that further work on trend detection and analysis is needed in order to better detect emerging topics and to understand the research gaps with respect to the vision.

A major benefit of data-driven methods is that they are capable of providing evidence of the popularity of research areas and topics over time and consequently can be used to derive research trends (although these are somewhat sensitive to the available data and can be less accurate when data is missing, for instance towards the end of the analysis period). When it comes to topics that appear in the Research Landscape but are underrepresented according to our data-driven analysis, such information could be used to encourage publications on these topics via calls for papers of future conferences or via workshops or journal special issues.

Comparison of Quantitative Methods. For the quantitative analysis of our work, we employed data-driven methods that differed, among other things, in the way the topic taxonomy was created. In the case of PoolParty a manually built topic taxonomy was employed, which closely reflected the topics that the community looks for in calls for papers or in conference programs. Rexplore made use of the CSO ontology, a large-scale ontology of computer science extracted from a very large corpus and covering key research areas as well as associated research topics. Finally, Saffron extracted its taxonomy of topics entirely from the corpus under analysis and used clustering to identify topics that belong to a research area (without actually deriving research area names). These approaches to procuring the topic taxonomy are listed in decreasing order of cost, measured in terms of the time of expert involvement.

In terms of overall performance (cf. Tables 3, 4, 6), PoolParty identified 17/21 core, marginal and future topics (10/11 core topics; 4/6 marginal topics; 3/4 future topics). Together with Saffron, PoolParty identified the most core topics, while also achieving the highest recall for the other two topic categories (i.e., marginal and future topics). Closely after PoolParty, Rexplore identified 14 of the 21 topics of the Research Landscape (9/11 core topics; 3/6 marginal topics; 2/4 future topics), identifying in each category just one topic fewer than PoolParty. Finally, Saffron is overall very close in its coverage to that of the other two tools, identifying 13 out of 21 topics (10/11 core topics; 2/6 marginal topics; 1/4 future topics). While having a very good coverage of the core topics, Saffron's performance was notably lower than that of the other tools for the other topic categories, where it primarily identified those topics which were already identified by the other tools. From the above, we conclude that the use of a priori built taxonomies of research areas, while more expensive, leads to a better coverage of research topics, especially in the analysis of marginal or emerging research topics. Moreover, we attribute the high success of PoolParty in covering research topics to the fact that it relied on a high-quality, manually built topic taxonomy that was well aligned to the domain, as the topics were extracted from conference and journal metadata.
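The recall figures above follow directly from the per-category counts, as the short sketch below shows. The counts are those reported in the text; the variable and function names are ours and purely illustrative.

```python
from fractions import Fraction

# Number of Research Landscape topics per category (Tables 3, 4 and 6).
totals = {"core": 11, "marginal": 6, "future": 4}

# Topics identified by each tool, as reported in the text.
identified = {
    "PoolParty": {"core": 10, "marginal": 4, "future": 3},
    "Rexplore": {"core": 9, "marginal": 3, "future": 2},
    "Saffron": {"core": 10, "marginal": 2, "future": 1},
}


def recall_table(identified, totals):
    """Per-category and overall recall for each tool, kept as exact fractions."""
    table = {}
    for tool, hits in identified.items():
        row = {cat: Fraction(hits[cat], totals[cat]) for cat in totals}
        row["overall"] = Fraction(sum(hits.values()), sum(totals.values()))
        table[tool] = row
    return table


recall = recall_table(identified, totals)
for tool, row in recall.items():
    summary = ", ".join(f"{cat}: {float(value):.2f}" for cat, value in row.items())
    print(f"{tool}: {summary}")
```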

While the most cost-effective, Saffron identified a bag of topics that was less straightforward to align to research areas than the output of the other two approaches, which relied on taxonomies of research areas (and associated topics). The alignment and interpretation of Saffron topics required expert knowledge, and therefore Saffron should ideally be used in settings where such expert knowledge is available.


While PoolParty had the best performance in confirming research topics from the qualitative analysis, Rexplore provided the most additional topics (cf. Table 5), clearly identifying research topics at the intersection of the Semantic Web and other research communities (e.g., computational linguistics and cloud computing), thus providing invaluable support in positioning the work of our community in a broader research context.

7. Conclusion

The analysis of research topics and trends is an important aspect of scientometrics which is expanding from qualitative expert-driven approaches to also include data-driven methods. The Semantic Web community is no different, with several seminal papers reflecting on and predicting the work of the community and data-driven methods (based on Semantic Web technologies) trying to achieve similar topic and trend detection activities (semi-)automatically.

With this study, we aimed to go beyond the various views on our community's Research Landscape scattered in several papers and obtained with different methods. To that end, we proposed the use of a mixed methods approach that can converge, unify and also critically compare conclusions reached with both expert-driven and data-driven approaches. Finally, we conclude this study by revisiting the original research questions:

Is it possible to identify the predominant Semantic Web research topics using both expert based predictions and topic and trend identification tools? A key benefit and novelty of our work is that we identified and aligned core research topics mentioned in the seminal papers and then verified these using data-driven methods. After extracting, grouping and aligning the topics from the seminal papers, we arrived at eleven core Semantic Web topics (cf. Table 3), out of which eight were confirmed by all the data-driven approaches, while the remaining three indicate topics that are important but not sufficiently represented in papers at the key Semantic Web venues. Besides these core topics, we capture six marginal topics (cf. Table 4), out of which two are very strongly supported by evidence from data-driven methods.

From a trends perspective, it was clearly visible that topics such as linked data, open data and data sources have increased in importance over the years. At the same time, topics such as semantic web, web service, service description and ontology matching seem to appear less and less. Although we could speculate as to why this is the case (e.g., a push by the community towards using semantic technology to open up and link data may have caused a decline in work in relation to service based machine-to-machine interaction), a more in-depth analysis, involving sources other than research papers, would be needed in order to confirm our suspicions.

Looking into the future, we identify four future topics (cf. Table 6), of which the topic on IoT, sensor and streaming data has ample evidence in the analyzed research corpus. Finally, the Rexplore data-driven method provided insights into the interactions of our field with other research areas, highlighting its cross-disciplinary nature. Considering the growing interest in scientometrics within the Semantic Web community, our findings could be used as a baseline for benchmarking other topic and trend detection methods for the same time period, or extended to cater for more recent work by the community.

What are the strengths and weaknesses of expert-driven and data-driven topic and trend identification methods? Qualitative, expert-driven methods benefit from insights by experts who reflect on past or present research topics and trends and predict future directions. As such, they remain valuable assets in the scientometrics tool-box. Data-driven methods challenge expert analysis by providing a surprisingly high recall, especially for core research topics, and naturally less for marginal and emerging topics. However, a major benefit of data-driven methods is that their findings are backed up by quantitative data which can be used to perform a range of other analytics such as research trend detection or identifying connections between research topics.

A key element of the data-driven approaches considered here is the use of a topic taxonomy, which can be derived with costly, manual effort, semi-automatically or fully-automatically. Not entirely surprisingly, well-curated taxonomies lead to the best performance, but these naturally age very quickly and their maintenance is not sustainable. Therefore, semi-automatic or fully-automatic taxonomy construction methods offer a cheaper and more sustainable alternative with only a slight loss of recall.

In this paper, we proposed and demonstrated the use of a mixed methods approach, which combines both qualitative and quantitative methods in an attempt to overcome their respective weaknesses. This mixed methods approach has several strengths. Firstly, it allowed us to synchronize the results of several qualitative studies and propose a unified Research Landscape of the area. Secondly, by comparing and contrasting the Research Landscape with the results of the data-driven methods, we could: (1) confirm those topics that are both seen as important by experts and for which quantitative evidence can be gathered (these are clearly core topics in the community); (2) identify topics that experts consider important but for which data-driven methods do not (unanimously) find sufficient evidence in the corpus (these are topics that the community should encourage); (3) identify topics on which not all experts agree (which is natural given some bias inadvertently brought in by experts) but which are strongly represented in the research data (these topics could benefit from community building efforts). To summarize, a mixed methods approach allows for drawing interesting conclusions in areas where quantitative and qualitative methods agree or disagree. A weak point of the presented method is the use of manual extraction and alignment of topics, which could have introduced bias. We tried to minimize this by performing each of these steps with multiple experts and then reaching agreement where their opinions differed.
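The three kinds of conclusions listed above can be expressed as a simple decision rule. The sketch below is only illustrative: the threshold for "strongly represented" (a match by all three tools) is our assumption, the fourth fallback label is added for completeness, and the tool counts are read from the matched topics in Tables 3 and 4.

```python
def classify(expert_status, tools_found, n_tools=3):
    """Label a topic from expert agreement and data-driven evidence.

    expert_status: "core" or "marginal" in the seminal papers.
    tools_found: number of tools (0..n_tools) with at least one matched topic.
    """
    if expert_status == "core" and tools_found == n_tools:
        return "confirmed core topic"
    if expert_status == "core":
        return "important to experts, to be encouraged"
    if expert_status == "marginal" and tools_found == n_tools:
        return "data-backed, candidate for community building"
    # Fallback category, not discussed in the text (assumption).
    return "marginal, little evidence"


# (topic, expert status, number of tools with a matched topic)
examples = [
    ("linked data", "core", 3),
    ("privacy, trust, security, provenance", "core", 2),
    ("distribution, decentralization, federation", "core", 1),
    ("(social semantic web, FOAF)", "marginal", 3),
    ("change management and propagation", "marginal", 0),
]

labels = {topic: classify(status, found) for topic, status, found in examples}
for topic, label in labels.items():
    print(f"{topic}: {label}")
```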

In terms of future work, in this paper we have mainly focused on two approaches to analyse and reflect on the past and, to some extent, the future development of our research community, namely using expert opinions and applying our own data-driven methods. However, we could go a third way, adopting (as mentioned at the end of Section 4.2) emerging methods such as crowdsourcing for a similar reflective exercise. That is, based on the findings and topics presented here, we could let the community itself, on a larger scale than relying on the insights of a few of its established experts, assess the importance and future of topics for the community. Such an analysis would need to counteract biases, for instance by ensuring that researchers do not favor the (future) importance of their own field of research, but we would expect this to be an interesting future direction.

Additional avenues for further study include: a more focused analysis that maps visions to relevant research topics and generates the corresponding trends; the deepening of the work to better understand the type of coverage offered in each of the identified research topics; and a broadening of the work to consider not only the research topics but also the application areas and domains where these technologies are routinely applied.

Also, it would be interesting to test this method in other communities (e.g., Software Engineering) and to improve the topic alignment methods in order to further reduce bias.

References

[1] Tim Berners-Lee, James Hendler, Ora Lassila, et al. The semantic web. Scientific American, 284(5):28–37, 2001.

[2] Abraham Bernstein, James Hendler, and Natalya Noy. A new look at the semantic web. Communications of the ACM, 59(9):35–37, 2016.

[3] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

[4] Georgeta Bordea. Domain adaptive extraction of topical hierarchies for Expertise Mining. Thesis, National University of Ireland, Galway, Ireland, 2013. uri: http://hdl.handle.net/10379/4484.

[5] Georgeta Bordea and Paul Buitelaar. Expertise mining. In Proceedings of the 21st National Conference on Artificial Intelligence and Cognitive Science, Galway, Ireland, 2010.

[6] Michel Callon, Jean-Pierre Courtial, William A Turner, and Serge Bauin. From translations to problematic networks: An introduction to co-word analysis. Information (International Social Science Council), 22(2):191–235, 1983.

[7] Mario Cataldi, Luigi Di Caro, and Claudio Schifanella. Emerging topic detection on twitter based on temporal and social terms evaluation. In Proceedings of the tenth international workshop on multimedia data mining, page 4. ACM, 2010.

[8] David Chavalarias and Jean-Philippe Cointet. Phylomemetic patterns in science evolution - the rise and fall of scientific fields. PloS one, 8(2):e54847, 2013.

[9] Chaomei Chen. Citespace ii: Detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of the Association for Information Science and Technology, 57(3):359–377, 2006.

[10] Xiang-Ying Dai, Qing-Cai Chen, Xiao-Long Wang, and Jun Xu. Online topic detection and tracking of financial news based on hierarchical clustering. In Machine Learning and Cybernetics (ICMLC), 2010 International Conference on, volume 6, pages 3341–3346. IEEE, 2010.

[11] Sheron Levar Decker. Detection of bursty and emerging trends towards identification of researchers at the early stage of trends. PhD thesis, uga, 2007.

[12] Jörg Diederich, Wolf-Tilo Balke, and Uwe Thaden. Demonstrating the semantic growbag: automatically creating topic facets for faceteddblp. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, pages 505–505. ACM, 2007.

[13] Lee Feigenbaum, Ivan Herman, Tonya Hongsermeier, Eric Neumann, and Susie Stephens. The semantic web in action. Scientific American, 297(6):90–97, 2007.

[14] Javier David Fernandez Garcia, Elmar Kiesling, Sabrina Kirrane, Julia Neuschmid, Nika Mizerski, Axel Polleres, Marta Sabou, Thomas Thurner, and Peter Wetz. Propelling the potential of enterprise linked data in austria. roadmap and report, 2016. URL https://www.linked-data.at/wp-content/uploads/2016/12/propel_book_web.pdf.

[15] Volker Frehe, Vilius Rugaitis, and Frank Teuteberg. Scientometrics: How to perform a big data trend analysis with scienceminer. In GI-Jahrestagung, 2014.

[16] Birte Glimm and Heiner Stuckenschmidt. 15 years of semantic web: An incomplete survey. KI-Künstliche Intelligenz, 30(2):117–130, 2016.

[17] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.

[18] William Hood and Concepción Wilson. The literature of bibliometrics, scientometrics, and informetrics. Scientometrics, 52(2):291–314, 2001.

[19] Yingjie Hu, Krzysztof Janowicz, Grant McKenzie, Kunal Sengupta, and Pascal Hitzler. A linked-data-driven and semantically-enabled journal portal for scientometrics. In International Semantic Web Conference, pages 114–129. Springer, 2013.

[20] Yookyung Jo, Carl Lagoze, and C Lee Giles. Detecting research topics via the correlation between graphs and texts. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 370–379. ACM, 2007.

[21] Jon Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.

[22] Nancy L Leech and Anthony J Onwuegbuzie. A typology of mixed methods research designs. Quality & quantity, 43(2):265–275, 2009.

[23] Fergal Monaghan, Georgeta Bordea, Krystian Samp, and Paul Buitelaar. Exploring your research: Sprinkling some saffron on semantic web dog food. In Semantic Web Challenge at the International Semantic Web Conference, volume 117, pages 420–435, 2010.

[24] Satoshi Morinaga and Kenji Yamanishi. Tracking dynamics of topic trends using a finite mixture model. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 811–816. ACM, 2004.

[25] Mizuki Oka, Hirotake Abe, and Kazuhiko Kato. Extracting topics from weblogs through frequency segments. In Proc. of the Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.

[26] Francesco Osborne and Enrico Motta. Klink-2: integrating multiple web sources to generate semantic topic networks. In International Semantic Web Conference, pages 408–424. Springer, 2015.

[27] Francesco Osborne, Enrico Motta, and Paul Mulholland. Exploring scholarly data with rexplore. In International semantic web conference, pages 460–477. Springer, 2013.

[28] Francesco Osborne, Angelo Salatino, Aliaksandr Birukou, and Enrico Motta. Automatic classification of springer nature proceedings with smart topic miner. In International Semantic Web Conference, pages 383–399. Springer, 2016.

[29] Francesco Osborne, Patricia Lago, Henry Muccini, and Enrico Motta. Reducing the effort for systematic reviews in software engineering. In preparation, 2018.

[30] Sergey Parinov and Mikhail Kogalovsky. Semantic linkages in research information systems as a new data source for scientometric studies. Scientometrics, 98(2):927–943, 2014.

[31] Thomas Schandl and Andreas Blumauer. Poolparty: Skos thesaurus management utilizing linked data. In Extended Semantic Web Conference, pages 421–425. Springer, 2010.

[32] J Michael Schultz and Mark Liberman. Topic detection and tracking using idf-weighted cosine coefficient. In Proceedings of the DARPA broadcast news workshop, pages 189–192. San Francisco: Morgan Kaufmann, 1999.

[33] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 990–998. ACM, 2008.


Appendix

A. Additional results

Table 7. Extended topics: top-40 multiwords in PoolParty and top-40 topics in Rexplore (MV) and Saffron.

| # | PoolParty | Rexplore | Saffron |
| 1 | semantic web | semantic web | semantic web |
| 2 | linked data | ontology | rdf data |
| 3 | knowledge base | artificial intelligence | linked data |
| 4 | web service | information retrieval | natural language |
| 5 | web semantics | query languages | data source |
| 6 | data source | linked data | reasoning task |
| 7 | data sets | knowledge based systems | machine learning |
| 8 | description logic | natural language processing systems | query execution |
| 9 | on the web | Computational Linguistics | owl S |
| 10 | natural language | formal logic | ontology engineering |
| 11 | use cases | data mining | rdf Graph |
| 12 | social network | knowledge representation | User Interface |
| 13 | query languages | human computer interaction | service description |
| 14 | search engines | ontology matching | open data |
| 15 | query answering | web ontology language (OWL) | semantic search |
| 16 | user interfaces | description logic | query processing |
| 17 | semantic annotation | linked open data (LOD) | keyword search |
| 18 | information retrieval | data integration | keyword query |
| 19 | web of data | web services | owl ontology |
| 20 | open data | resource description framework (RDF) | web service |
| 21 | data models | security of data | query language |
| 22 | semantic search | ontology engineering | data model |
| 23 | ontology matching | semantic search/similarity | ontology matching |
| 24 | information systems | social networks | web data |
| 25 | query processing | SPARQL | federated query |
| 26 | machine learning | data privacy | stream processing |
| 27 | ontology language | recommender systems | relational database |
| 28 | semantic web service | electronic commerce | blank node |
| 29 | linked open data | sensors | information retrieval |
| 30 | logic programming | ubiquitous computing | ontology language |
| 31 | knowledge management | semantic information | description logic |
| 32 | data integration | SPARQL queries | federated query processing |
| 33 | ontology engineering | pattern recognition | semantic similarity |
| 34 | semantic technology | data visualization | object property |
| 35 | ontology alignment | knowledge acquisition | ontology editing |
| 36 | web search | information technology | social medium |
| 37 | web data | mobile devices | knowledge base |
| 38 | structured data | wikipedia | web technology |
| 39 | case studies | machine learning | web semantics |
| 40 | similarity measures | DBpedia | ontology development |



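Table 8. Extended topics extracted by two or more tools

| Topic | Extracted by |
|---|---|
| description logic | all three tools |
| information retrieval | all three tools |
| knowledge base | all three tools |
| linked data | all three tools |
| machine learning | all three tools |
| natural language processing | all three tools |
| ontology engineering | all three tools |
| semantic search | all three tools |
| semantic web | all three tools |
| web services | all three tools |
| ontology matching | all three tools |
| query languages | all three tools |
| data integration | two tools |
| data source | two tools |
| linked open data (LOD) | two tools |
| ontology language | two tools |
| open data | two tools |
| query processing | two tools |
| social networks | two tools |
| user interfaces | two tools |
| web data | two tools |
| web semantics | two tools |
| web ontology language (OWL) | two tools |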



Table 9. Extended topics extracted by only one tool

| Topic | Extracted by |
|---|---|
| artificial intelligence | Rexplore |
| blank node | Saffron |
| case studies | PoolParty |
| Computational Linguistics | Rexplore |
| data mining | Rexplore |
| data models | PoolParty |
| data privacy | Rexplore |
| data sets | PoolParty |
| data visualization | Rexplore |
| DBpedia | Rexplore |
| electronic commerce | Rexplore |
| engineering data model | Saffron |
| federated query | Saffron |
| federated query processing | Saffron |
| formal logic | Rexplore |
| human computer interaction | Rexplore |
| information systems | PoolParty |
| information technology | Rexplore |
| keyword query | Saffron |
| keyword search | Saffron |
| knowledge acquisition | Rexplore |
| knowledge management | PoolParty |
| knowledge representation | Rexplore |
| logic programming | PoolParty |
| mobile devices | Rexplore |
| object property | Saffron |
| on the web | PoolParty |
| ontology | Rexplore |
| ontology alignment | PoolParty |
| ontology development | Saffron |
| ontology editing | Saffron |
| owl ontology | Saffron |
| owl S | Saffron |
| pattern recognition | Rexplore |
| query answering | PoolParty |
| rdf data | Saffron |
| rdf Graph | Saffron |
| reasoning task | Saffron |
| recommender systems | Rexplore |
| relational database | Saffron |
| resource description framework (RDF) | Rexplore |
| search engines | PoolParty |
| security of data | Rexplore |
| semantic annotation | PoolParty |
| semantic information | Rexplore |
| semantic similarity | Saffron |
| semantic technology | PoolParty |
| semantic web service | PoolParty |
| sensors | Rexplore |
| service description | Saffron |
| similarity measures | PoolParty |
| social medium | Saffron |
| SPARQL | Rexplore |
| SPARQL queries | Rexplore |
| stream processing | Saffron |
| structured data | PoolParty |
| systems query execution | Saffron |
| ubiquitous computing | Rexplore |
| use cases | PoolParty |
| web of data | PoolParty |
| web search | PoolParty |
| web technology | Saffron |
| wikipedia | Rexplore |
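The split between topics extracted by two or more tools and topics extracted by only one tool can be illustrated with a minimal sketch. It is not the procedure used to build the tables in this appendix: the function names `normalize` and `group_by_tool_count` are hypothetical, and the matching rule (lowercasing plus stripping a trailing plural "s") is an assumption made purely for illustration. The sample input uses a handful of topics taken from Table 7.

```python
from collections import defaultdict


def normalize(topic: str) -> str:
    """Assumed normalization: lowercase, collapse spaces, drop a trailing plural 's'."""
    t = " ".join(topic.lower().split())
    if t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t


def group_by_tool_count(topics_by_tool):
    """Return (shared, single): topics found by two or more tools, and by exactly one."""
    tools_per_topic = defaultdict(set)
    for tool, topics in topics_by_tool.items():
        for topic in topics:
            tools_per_topic[normalize(topic)].add(tool)
    shared = {t: sorted(s) for t, s in tools_per_topic.items() if len(s) >= 2}
    single = {t: sorted(s) for t, s in tools_per_topic.items() if len(s) == 1}
    return shared, single


# A few entries from Table 7, used only as sample input.
sample = {
    "PoolParty": ["semantic web", "linked data", "social network", "use cases"],
    "Rexplore": ["semantic web", "linked data", "social networks", "ontology"],
    "Saffron": ["semantic web", "linked data", "rdf data"],
}

shared, single = group_by_tool_count(sample)
for topic, tools in sorted(shared.items()):
    print(f"{topic}: {', '.join(tools)}")
```

Under this assumed rule, "social network" and "social networks" are treated as one topic, whereas near-synonyms such as "linked data" and "linked open data (LOD)" remain separate. A stricter or looser matching rule would move topics between the two groups.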