
Crowdsourced conceptualization of complex scientific knowledge and discovery of discoveries

Karl Aberer1

Alexey Boyarsky1,2

Philippe Cudré-Mauroux3

Paolo De Los Rios4

1Distributed Information Systems Laboratory (LSIR), School of Computer and Communication Sciences

École Polytechnique Fédérale de Lausanne, Switzerland

2Instituut-Lorentz for Theoretical Physics, Leiden University, The Netherlands

3eXascale Infolab, University of Fribourg, Switzerland

4Laboratory of Statistical Biophysics (LBS), Institute of Theoretical Physics, School of Basic Sciences
École Polytechnique Fédérale de Lausanne, Switzerland


1 Summary of the research plan

What should be the structure and the semantic organization of scientific knowledge? How can it be built? What defines a "discovery"? Under which conditions can a discovery "emerge" from a scientific infrastructure? What parts of the discovery process can be automated? An inter-disciplinary team of physicists, complexity scientists and computer scientists is clearly required to answer those fundamental questions.

The nature of scientific discoveries is drastically changing. Fewer and fewer scientific advances are carried out by small groups working in their laboratories in isolation. In today's data-driven sciences (be it biology, physics, complex systems or economics), progress is increasingly achieved by collaborations of scientists with heterogeneous expertise, working in parallel and having a very contextualized, local view of their problems and results. The research process is often so complex that no single expert (inside or outside of the group) can claim to fully understand the setup of an experiment or all sources of systematic errors. We expect that this will result in a fundamental phase transition in how scientific results are obtained, represented, used, communicated and attributed. In contrast to the classical view of how science is performed, important discoveries will not only be the result of exceptional individual efforts and talents, but alternatively an emergent property of a complex community-based socio-technical system. This has fundamental implications for how we perceive the role of technical systems, and in particular of information processing infrastructures, in scientific work: they are no longer a subordinate instrument that facilitates the daily work of highly gifted individuals, but become an essential tool and enabler of scientific progress, and eventually might be the instrument within which scientific discoveries are made, represented, and brought to use.

A central element of such systems is a field-specific ontology, i.e., a structured (and therefore machine-readable) organization of the knowledge created by the research groups, as well as a formal description of the information and processes they exploit. However, for a heterogeneous and large group of knowledge workers, such ontologies tend to become rather non-trivial and non-hierarchical, often containing conflicting views. Manipulating scientific information automatically or semi-automatically requires a fundamentally new step in information management, through the development of new tools facilitating the elicitation and the automatic manipulation of complex and dynamic knowledge networks.

Our goal in this project is to design new scientific knowledge management methods that can implicitly and automatically follow the entire life cycle of modern large-scale scientific endeavors relying on participatory, self-organizing and decentralized interactions. As such, it significantly deviates from many of today's information systems supporting science, which often consist of rigid and centrally administered repositories. The new nature of scientific collaboration is, however, clearly reflected in our project through the four technical pillars underpinning our approach: i) crowdsourcing rather than harvesting of knowledge input; ii) self-organizing data integration; iii) decentralized trust management; and iv) analysis of emergent network properties.

Our key idea is hence to create a knowledge system that is continuously developed and updated by the experts as part of their daily activities. Collective efforts of the community, applied as part of their everyday work, get accumulated such that the environment hosting those activities becomes "intelligent" – capable of analyzing and processing the scientific knowledge that it manages. This gradually creates a "ROBOT-EXPERT" that can continuously analyze ("understand") scientific information, structure it and play the role of human experts in translating and communicating scientific results, facilitating information exchange and even scientific discoveries.


2 Research Plan

2.1 Formulation of the problem and current state of research in the field

Complexity of today's scientific process. One of the main challenges that our society is facing in its technological and social development is the increasing complexity of the systems that we are trying to organize, to manage or to model. In the case of social and financial systems, this challenge is there from the very beginning – the systems in question are intrinsically complex and strongly correlated by their nature. In the natural sciences, we are facing the challenge of emergent complexity. Progress in science today requires ever deeper competencies to use the sophisticated methods that are necessary to deepen our understanding of Nature. At the same time, every new step demands the combination of results obtained by different methods. The research output of each scientist therefore increasingly relies on the results of other groups (often far beyond the expertise of this scientist), resulting in collaborations of researchers with very different expertise (from a few to hundreds and even thousands, as in the case of the Large Hadron Collider at CERN). Moreover, in interdisciplinary research, cross-field collaborations (e.g., between computer science and sociology, to deal both with big data and with the human component) are key success factors. Even though scientific discoveries by individual scientists or individual teams still form the basis of the innovation process, a single scientist no longer has the capacity to process all the information necessary to fully understand and comprehend all the experimental facts and models of large-scale scientific endeavors. Moreover, it is becoming de facto impossible for individuals to understand the full implications of their local discoveries in today's scientific landscape.

The problem: complexity of the scientific output, not only of the data. Scientists are increasingly successful in tackling the challenge of the "data deluge": recent tools for managing scientific data concentrate mostly on scalable and distributed storage layers and on effective computational frameworks to automate the processing of large amounts of rigid, well-structured data [75, 113]. Large supercomputers, advanced numerical systems and distributed data processing frameworks (e.g., GRID systems [57]) allow the processing of billions of objects from astronomical catalogs or genomic databases, and of petabytes of records produced by the LHC detectors.

However, the problem of complexity concerns not only the experimental data, but also the generated scientific knowledge: complex scientific concepts, abstract notions, and their relation to the phenomena and data that put the experiments into their proper context. The growing complexity of scientific knowledge requires automated support at this level as well. We need tools that support the scientific workflow at essentially every step, starting from data collection and data analysis and ending with the most advanced reasoning and inference steps — dynamically (re)interpreting and combining disparate results scattered across heterogeneous networks of scientific papers, scientific data, semantic information, etc. into something novel and potentially useful to the scientific community.

We can even envisage, in such a context, discoveries of such complexity that individual scientists may no longer be able to fully appreciate the underlying models and methods used by the system. Humans might only be able to interpret a subset of the full results that indicate the existence of such a discovery in the scientific system, which would call for self-awareness qualities in the socio-technical system (i.e., mechanisms to observe the processes and semi-automatically detect and summarize the emergence of important properties, data and theories across the ecosystem).

The goal of this project is to make a significant step towards the semi-automated conceptualization of scientific knowledge and its large-scale analysis. This includes developing a participatory platform for knowledge elicitation as well as resolving a number of deep theoretical problems that have to be tackled prior to designing such a system. The results of these theoretical investigations will be implemented in the participatory platform, in order to increase its usability and data-harvesting power, while at the same time providing a validation framework for tuning the methods and algorithms.

The project will be based on an existing system, ScienceWISE.info (already used by many scientists for semantically importing, storing and searching scientific data), jointly created by the participants of the proposed project, and will develop it further using the results of the project.

While a number of recent projects have already tried to tackle the problems of automatic scientific knowledge acquisition and analysis (see e.g. [47, 79, 109, 111] or [73]), they all face intricate challenges related to the new participatory, decentralized,


and bottom-up nature of scientific collaboration. In that context, our project represents a significant step forward in the sense that it mimics the very nature of modern scientific endeavors by being built on crowdsourcing, self-organizing and emergent techniques.

The role of experts in the scientific workflow. Nowadays, it is human scientists ("experts") who play the role of connecting the different pieces of the complex research landscape outlined above, making cutting-edge research results accessible to colleagues from other domains, and even more so to various "end users" – such as businesses or technology companies. For the automated manipulation of conceptualized scientific information, these functions should be automated as well. Scientific experts possess the following essential skills:

A) understanding the exact meaning of scientific concepts within a given context; distinguishing between directly measured and derived quantities; understanding underlying assumptions and the fundamental differences between observed phenomena and their mathematical abstraction;

B) understanding the "mental map of a field of research", grasping the main conceptual ingredients of a given field (both of phenomenological and formal nature). This mental map provides a large-scale overview of a more detailed knowledge graph capturing inter-related concepts and their relations to scientific advances;

C) ranking results or experts on a case-by-case basis (for practical reasons it is often important to identify the best experts in a narrow, precisely defined field; for instance, young researchers, whose contributions to the field so far might be limited, may be the ideal candidates to summarize some advances or contribute to specific, rapidly evolving fields — see the sketch after this list);

D) understanding the methods of analysis within a given field of study and the ability to properly use them (or even develop them beyond the state of the art).
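To make skill C) concrete, here is a minimal sketch of how concept-specific expert ranking could be computed from annotation data. The recency-weighted scoring scheme, the half-life parameter, and all names are illustrative assumptions for this proposal, not the published ranking algorithm of [6]:

```python
from collections import defaultdict
from datetime import date

# Hypothetical annotation log: (expert, concept, date of activity).
annotations = [
    ("alice", "dark matter", date(2012, 5, 1)),
    ("alice", "dark matter", date(2012, 9, 3)),
    ("bob",   "dark matter", date(2009, 2, 7)),
    ("bob",   "galaxy formation", date(2012, 1, 15)),
]

def rank_experts(annotations, concept, today=date(2012, 12, 1), half_life_days=365.0):
    """Rank experts for one concept by recency-weighted activity.

    Recent annotations count more, so an active young researcher in a
    fast-moving niche can outrank a senior one who has left the topic.
    """
    scores = defaultdict(float)
    for expert, c, day in annotations:
        if c != concept:
            continue
        age = (today - day).days
        scores[expert] += 0.5 ** (age / half_life_days)  # exponential decay
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank_experts(annotations, "dark matter"))
# 'alice' outranks 'bob' thanks to her recent activity on the concept
```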

This project aims to formalize targeted parts of scientific activities, making it possible to crowdsource implicit "expert" knowledge and to automate its use. We plan to design a generic socio-technical system that will not only liberate scientists from routine work but also act as a giant crowdsourced conceptualization engine for complex scientific fields, continuously eliciting concepts and contextual information from the experts' daily activities.

How can we create an automated system that successfully imitates the functionalities of a human expert, or at least assists scientists by freeing them from most of the routine work? We see four major steps that should be undertaken (each of them facing challenges beyond the current state of the art) before a realization of such a ROBOT-EXPERT system can appear (cf. [139]):1

I. Knowledge Elicitation: One of the key elements in developing effective semantics-aware scientific tools is a comprehensive ontology of scientific knowledge. While in some important cases (e.g., in bioinformatics) it is possible to create large ontologies of sufficiently homogeneous concepts and automatically manipulate them using formal rules (see e.g. [75, 113]), we need a much more complex and broad ontology:

The goals of the project require an ontology that, on the one hand, adequately represents conceptualized scientific knowledge in a structured form and, on the other hand, is suitable for automated manipulation. Such an ontology should, in particular, (i) be heterogeneous, containing concepts of very different natures (theoretical ideas, methods, phenomena, data, instrumentation, etc.); (ii) allow for expressing and managing alternative (and even conflicting) points of view; (iii) support context-dependent concept meanings, non-hierarchical forms of relations, rules for combining different entities into composite concepts, etc.

This ontology must first be designed and then populated. For the ontology design, we need to develop complex yet flexible structures and formats for the adequate representation of these data. Self-organization of schema and instance data (i.e., decentralized noisy data integration) needs to be realized to incorporate and semantically relate the various knowledge graphs created by the knowledge elicitation component.
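As an illustration of requirements (i)–(iii) above, here is a minimal sketch of a data structure for such an ontology entry. The field names and the example content are our own assumptions for exposition, not the actual ScienceWISE schema:

```python
from dataclasses import dataclass, field

@dataclass
class Relation:
    predicate: str          # e.g. "is_a_model_of", "is_observed_in"
    target: str             # identifier of the related concept
    context: str = ""       # community or sub-field in which the relation holds
    endorsed_by: list = field(default_factory=list)  # experts backing this view

@dataclass
class Concept:
    name: str
    kind: str                                          # "theory", "method", "phenomenon", ...
    definitions: dict = field(default_factory=dict)    # context -> definition text
    relations: list = field(default_factory=list)      # may contain conflicting views

dm = Concept(
    name="Dark Matter",
    kind="phenomenon",
    definitions={
        "cosmology": "Non-baryonic matter inferred from structure formation.",
        "astrophysics": "Unseen mass inferred from galactic rotation curves.",
    },
)
# Two conflicting explanations coexist, each attributed to its supporters:
dm.relations.append(Relation("is_explained_by", "WIMP", "particle physics", ["alice"]))
dm.relations.append(Relation("is_explained_by", "MOND", "alternative gravity", ["bob"]))
```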

1See our award-winning paper [6], presented at ISWC 2011 in the “Outrageous ideas” track.


For the ontology population, we choose an approach based on a combination of existing automatic methods with novel approaches that will enable human-machine collaboration between scientists and the knowledge management infrastructure, relying on the implicit crowd-sourcing of the scientific community's input. The crowd-sourcing approach, while uniquely powerful for knowledge harvesting, implies a number of problems in knowledge integration and knowledge verification (trust) that will be discussed in the corresponding sections below. This makes our project distinct from existing successful approaches to semi-automatic ontology creation, such as YAGO [119] and DBpedia [19], which started from the human-generated Wikipedia.

To stimulate human-computer interaction at the desired level and efficiently assist experts in their everyday activities, a participatory platform should not only store manually introduced information but also continuously analyze it and complete it with whatever can be formally derived (knowledge structure analysis). This includes combining existing concepts into new composite concepts, generating new relations, recommendations, community detection, etc. These results are then proposed to users, who will verify them through usage, generate feedback and provide use cases for crowd-sourcing.

Therefore, the goal of creating a scientific knowledge elicitation participatory platform requires solving a number of problems in knowledge integration, verification, and structure analysis. Every such solution will have an impact beyond the scope of this project.

II. Knowledge Integration and Ranking: the global consistency of the crowd-sourced knowledge should be checked, to make sure that separate pieces do make sense when added together. Various semi-automated methods exist today to match instances or entities; however, this project will go well beyond the state of the art by developing integration for knowledge subgraphs, i.e., taking as input graphs of interconnected entities or concepts and creating non-trivial links between the subgraphs as output.
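A minimal sketch of one way such subgraph-level linking could start, matching concepts by the overlap of their graph neighborhoods (Jaccard similarity). The data, the threshold, and the neighborhood criterion are illustrative assumptions, not the integration method to be developed:

```python
import itertools

# Two hypothetical knowledge subgraphs: concept -> set of neighboring concepts.
subgraph_a = {
    "sterile neutrino": {"dark matter", "neutrino oscillation", "X-ray line"},
    "dark matter":      {"sterile neutrino", "galaxy rotation"},
}
subgraph_b = {
    "nuSterile":   {"dark matter", "neutrino oscillation", "keV line"},
    "dark matter": {"nuSterile", "structure formation"},
}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def propose_links(g1, g2, threshold=0.4):
    """Propose cross-subgraph identity links between concepts whose
    neighborhoods overlap strongly; human experts verify the proposals."""
    proposals = []
    for c1, c2 in itertools.product(g1, g2):
        sim = jaccard(g1[c1], g2[c2])
        if sim >= threshold and c1 != c2:
            proposals.append((c1, c2, round(sim, 2)))
    return sorted(proposals, key=lambda t: -t[2])

print(propose_links(subgraph_a, subgraph_b))
# [('sterile neutrino', 'nuSterile', 0.5)] -- candidate "same concept" link
```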

Also, new ranking and search functionalities will have to be developed in order to take advantage of the integrated knowledge graph. Search and mining systems on the web and in enterprises are currently shifting from a document-centric to an entity-centric perspective. That is, instead of retrieving documents given a text-oriented query, information is aggregated, retrieved, and presented around entities (e.g., persons, locations, organizations). Such a change influences the entire system architecture and calls for the design of a novel user experience. In the present project, we plan to develop novel search functionalities for end-users based on integrated knowledge graphs, lineage information and third-party knowledge bases.

III. Knowledge Verification: the unsupervised nature of knowledge acquired through crowd-sourcing and automated reasoning puts forward the problem of trust. If both arbitrary experts and the system itself relate existing information and produce new knowledge, how can one assess the validity of the research process? For instance, how could our system assess the credibility of scientific claims, results, or methods? Obviously, new trust mechanisms need to be developed, for example to automatically detect scientific fallacies emerging from our integrated knowledge system, or to discover systematic errors or bias. While computational trust mechanisms have been developed for the Web, our project will foster the development of a new breed of trust mechanisms based on the dynamic evolution of the scientific knowledge inside our system. The observation of the knowledge flow, and of the emergence and fall of concepts and theories inside our system, will be of particular importance in that context.
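As a toy illustration of trust propagation along derivations, the sketch below lets a derived claim inherit the product of the reputations of its supporting claims. The update rule and all numbers are illustrative assumptions, not the mechanisms to be developed in SP3:

```python
# Hypothetical claim graph: claim -> list of claims it directly depends on.
depends_on = {
    "new bound on DM mass": ["X-ray line detection", "halo mass model"],
    "X-ray line detection": [],
    "halo mass model": [],
}

# Reputation of independently verified (leaf) claims, in [0, 1].
base_trust = {"X-ray line detection": 0.9, "halo mass model": 0.7}

def trust(claim, memo=None):
    """Trust of a derived claim = product of the trust of its premises.
    A long chain of mildly uncertain steps thus yields low overall trust,
    flagging results whose weakest links deserve expert scrutiny."""
    memo = {} if memo is None else memo
    if claim in memo:
        return memo[claim]
    premises = depends_on.get(claim, [])
    if not premises:
        t = base_trust.get(claim, 0.5)  # unknown leaves get a neutral prior
    else:
        t = 1.0
        for p in premises:
            t *= trust(p, memo)
    memo[claim] = t
    return t

print(trust("new bound on DM mass"))  # 0.9 * 0.7 = 0.63
```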

IV. Knowledge Structure Analysis: The crowd-sourced knowledge has a complex structure and can be thought of as a multi-layer graph with different types of nodes (research subjects, research data, research publications, scientists, experiments, etc.) and links between them. The development of a participatory platform requires the quantitative analysis of such a multi-network (proximity and similarity measures, clustering algorithms, characterization of structural and temporal features, etc.). The theory of such multi-networks (which appear also in biological, social, and economic systems) is currently under active investigation in the field. The major obstacle that currently prevents a comprehensive understanding of such multi-level structures is the lack of an adequate theoretical framework. Therefore, in this project we plan to capture the origin of the coupling between the various layers of the network, and thus to characterize the system dynamically and to suggest generative (or predictive) models of it. These results will find practical applications in the platform we will develop.
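For concreteness, here is a minimal sketch of such a multi-layer structure with two coupled layers (paper-concept and paper-author), one possible cross-layer proximity measure, and a cross-layer query. Both the representation and the measure are illustrative assumptions:

```python
# Layers of a toy multi-network: typed nodes connected by typed edges.
paper_concept = {            # layer coupling papers to concepts
    "paper1": {"inflation", "dark energy"},
    "paper2": {"dark energy", "supernovae"},
    "paper3": {"inflation", "dark energy"},
}
paper_author = {             # layer coupling papers to scientists
    "paper1": {"alice"}, "paper2": {"alice", "bob"}, "paper3": {"carol"},
}

def concept_proximity(c1, c2):
    """Fraction of papers mentioning either concept that mention both.
    High values suggest the concepts belong to the same research topic."""
    either = {p for p, cs in paper_concept.items() if c1 in cs or c2 in cs}
    both = {p for p, cs in paper_concept.items() if c1 in cs and c2 in cs}
    return len(both) / len(either) if either else 0.0

def authors_of(concept):
    """Cross-layer query: scientists who wrote papers mentioning a concept."""
    return set().union(*(paper_author[p] for p, cs in paper_concept.items()
                         if concept in cs))

print(concept_proximity("inflation", "dark energy"))   # 2/3
print(sorted(authors_of("dark energy")))               # ['alice', 'bob', 'carol']
```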

At the second stage of the project, when more complex data become available within our system, the analysis of its dynamical evolution and topological structures will reveal many hidden connections. As one of the most ambitious goals of this project, we even envision the possibility of detecting latent scientific discoveries already present in the data before any human has recognized them — i.e., to foster the DISCOVERY OF DISCOVERIES.


2.2 Current state of our own research

ScienceWISE platform. As a first step towards tools for the automated support of the scientific process, members of the current proposal created ScienceWISE.info – a platform for semantically importing, storing and searching scientific data, and a semantic recommender system.2 It allows a community of scientists working in a specific domain to dynamically generate, as part of their everyday work, an interactive semantic environment consisting of highly structured meta-data (i.e., an ontology) with direct connections to the body of research papers, authors, and topics. Figure 2 on page 14 presents a high-level architecture of the system.

The platform includes a number of elements making it a preferred experimental framework for our approach: (a) an expanding collection of field-specific, expert-community-ranked encyclopedia articles; (b) an ontological structure (concepts and logical relations between them) encompassing this encyclopedia; (c) established connections of ontology entries to a vast collection of research papers: ArXiv.org [1, 65, 66] and the CERN Document Server (CDS) [2]; (d) an operational platform with a growing user community, allowing scientists to annotate and conceptually index (bookmark) research papers, link them against the ontology, validate and dynamically update the ontology through annotation, etc.3

The ScienceWISE Ontology4 underpins the whole system and is the result of a large crowdsourcing effort of the physics community. To create the initial version of the ontology, we performed a semi-automated import from many science-oriented ontologies and online encyclopedias. After this initial step, ScienceWISE users (who are the domain experts) are now allowed to edit elements of the ontology (e.g., adding new definitions or new relations) in order to improve its quality. Presently, the ScienceWISE ontology counts tens of thousands of entries, each with its own definitions, alternative forms, and semantic relations to other entries. The semantic relations are both of general (e.g., is a part of) and field-specific (is a model of, is observed in) nature.
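Such relations map naturally onto subject-predicate-object triples, consistent with the ontology's publication as Linked Open Data [18]. The snippet below is a minimal sketch with invented entries, not actual ontology content:

```python
# Ontology relations as (subject, predicate, object) triples.
triples = [
    ("Sterile neutrino", "is_a_model_of", "Dark matter"),
    ("Dark matter", "is_observed_in", "Galaxy rotation curves"),
    ("Inflaton", "is_a_part_of", "Inflationary cosmology"),
]

def objects(subject, predicate):
    """Query the triple store: all objects linked to a subject by a predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("Sterile neutrino", "is_a_model_of"))  # ['Dark matter']
```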

The ScienceWISE system is public and accessible to scientists via ArXiv.org as well as via several bibliographic systems (the CERN Document Server [2] and NASA ADS [3]). The system currently counts over 500 active users (with several new registrations daily) and tens of thousands of conceptually indexed and annotated papers. Many of the features discussed in this proposal were explicitly demanded by groups of ScienceWISE users, and their implementation will increase the system's usability and popularity.

State of research of the four groups participating in the project. The participants of the project have a long track record in the various fields relevant to the goals of the project.

I. Prof. A. Boyarsky is responsible for the first sub-project. He initiated the ScienceWISE project and has led it together with Prof. K. Aberer since 2009. Being an active researcher in physics and astronomy, he works closely with the group of active test users (physicists) and presents the system to the wider scientific community. Based on the demand and feedback generated by this interaction, he leads the design of the system and its new functionalities and supervises on a daily basis the team of interns, PhD students and postdocs who are developing the system. In collaboration with the other PIs of the current proposal, he has worked on a number of scientific projects triggered by ScienceWISE and its needs: (a) creation of the ScienceWISE ontology, its "RDFization", and its publication as Linked Open Data [18]; (b) together with P. De Los Rios, he worked on the dynamical clustering of heterogeneous graphs and ontology crowdsourcing, comparing approaches developed within the Computer Science and Physics communities [28]; (c) together with K. Aberer and P. Cudré-Mauroux, he developed novel algorithms for expert ranking [6], tag recommendation [105] and word sense disambiguation in the scientific domain [106], based on the ScienceWISE ontology. This joint work received the "Best Demo" award and one of the three awards in the "Outrageous ideas" track at ISWC 2011 [6, 7]. A. Boyarsky is an experienced physicist with a proven track record, working for many years at the interfaces between various subjects.5

II. Prof. P. Cudré-Mauroux is responsible for the second sub-project and is a pioneer in the fields of self-organizing information [9, 10, 40], decentralized data integration [36, 38], emergent properties of information [35], entity disambiguation [43, 105, 121], and scientific data management [31, 42, 44, 106, 116]. This pertinent background, as well as his recent work on large-scale Semantic Web data analytics [136], will be used to define novel methods for scientific data analysis and integration with the

2The system is publicly available at http://sciencewise.info.
3For more details see Ref. [7], which received the Best Demo award at ISWC 2011.
4Accessible for browsing via http://ScienceWISE.info/ontology
5See the citation summary in the high-energy physics bibliography system: http://inspirehep.net/search?p=author:A.Boyarsky.2&of=hcs


final goal of presenting the user with high-level analyses and recommendations such as, for example, future research directions, experimental suggestions, newly available datasets, etc. Previous entity search approaches from his collaborators [50] will be extended to entity-centric search in the context of scientific entities stored in the ScienceWISE collaborative platform. Finally, novel search and diversification techniques will be developed to support the everyday scientific workflow.

III. Prof. K. Aberer is responsible for the third sub-project and has long experience in computational trust and reputation frameworks. In [12] he presented one of the first approaches to decentralized trust management models. In [127, 130], he used belief propagation and clustering techniques to isolate lying agents in a market of web services. In [126, 132], graphical models were used to represent the dependencies among different agents' quality parameters and their associated contextual factors. In [131, 133], he analytically described the cooperation conditions for computational trust learning algorithms with reduced accuracy and cost, and experimentally verified these results in various settings with honest, malicious and strategic players. In [129] he analyzed and compared the adversarial cost of attacking ranking systems that use various trust measures to detect and eliminate malicious ratings against systems that use no such trust mechanism. K. Aberer also has long-standing experience in working on platforms supporting scientific work (see e.g. [45, 81, 103] or [33]).

IV. Prof. P. De Los Rios is responsible for the fourth sub-project. He is a statistical physicist working at the crossroads of physics, complexity science and biological physics. He has been working for several years in complex networks theory, both developing its foundations and applying it to problems arising in other disciplines. In relation to the present proposal, he has focused on using clustering techniques to extract information from networks, and has developed algorithms to assess the reliability of the detected communities [62]. He has also been aiming at renormalizing networks by finding ways to reduce their size while preserving some relevant structural or dynamical features [63, 64].

External collaborations. The participants of the project have multiple active collaborations that ensure their access to a wider network of experts in domains complementary to their expertise (such as social sciences, e-Science, digital libraries, statistics, etc.):
– K. Aberer and A. Boyarsky are involved in the FuturICT project, bringing together hundreds of experts from ICT, computational social sciences, game theory, complexity sciences, etc.
– K. Aberer is a member of the Network of Excellence in Internet Science, reuniting computer scientists, complexity scientists and social scientists (http://www.internet-science.eu).
– A. Boyarsky collaborates within the ScienceWISE project with the leading providers of scientific bibliographic meta-data and content: the ArXiv.org team at the Cornell University Library [1], the CERN Document Server [2], the NASA Astrophysics Data System [3], and the INSPIRE high-energy physics literature database [4].
– A. Boyarsky and P. Cudré-Mauroux collaborate with C. Guéret and the group of F. van Harmelen from the Vrije Universiteit Amsterdam, experts on knowledge representation and reasoning.

2.3 Research plan of the entire project

The ultimate goal of this project is to turn the ScienceWISE platform into a ROBOT-EXPERT — a comprehensive system scrutinizing all creative steps performed by individual scientists within it and generating new information (trying to classify, corroborate, enrich and ultimately combine local data, hypotheses, workflows and conclusions in light of all scientific artifacts contributed to the system), thus automating part of the expert input and assisting the human experts in their scientific activities.

To reach this goal, we need to develop next-generation knowledge management methods drawing both from Computer Science and from Complexity Science. The theoretical results of this project will be tested within ScienceWISE by implementing the new algorithms developed in this project as new features of the platform. This will in turn provide more data of better quality and allow us to further gauge the implemented approaches. The project therefore realizes a synergy between the theoretical and experimental approaches. Such "information cycles" will eventually lead to the creation of a system capable of (re)interpreting and combining disparate results scattered across heterogeneous networks of scientific papers, data, semantic information, etc. into a powerful and dynamic knowledge base useful to the whole scientific community.


[Figure 1 is a diagram: scientific papers, scientists' input and external knowledge bases feed SP1: Knowledge Elicitation (paper similarity & classification, entity extraction, large-scale reasoning, crowdsourcing), which exchanges data with SP2: Knowledge Integration (scientific concept integration, entity disambiguation, scientific lineage, semantic ranking), SP3: Trust in Discoveries (trust mechanisms, analysis of scientific fallacies, trust assessment, evaluation) and SP4: Multilevel Networks and Clustering (comparison of micro-networks, knowledge diversity, knowledge evolution, time-series analysis); together they deliver semantically organized scientific data and a Robot-Expert to the scientific community.]

Figure 1: Organisation of the SINERGIA project. The work is divided into four sub-projects (SP1–SP4) that tackle the problems outlined in Section 2.1. The ScienceWISE system provides the validation platform for the methods developed within the sub-projects and supplies the semantically organized data used by the other sub-projects to check and improve their methods. Successful applications of the deliverables of the sub-projects make knowledge acquisition by the ScienceWISE system more effective, providing in turn more data and "thought food" for the other sub-projects. The results of many iterative improvements will eventually lead to the development of automated tools, such as the ROBOT-EXPERT, that shall be delivered to the scientific community.

To realize the goals outlined in this proposal, one should start from within one particular scientific community, large enough to provide sufficient statistics and social power, but still small and homogeneous enough to collectively discuss non-trivial information and methods. We also believe that this project can only be realized bottom-up, starting from pre-existing concepts, data and scientists inside one particular community. We propose to perform the first experiment within the physics community, expecting that through inter-disciplinary subjects the system will gradually expand its boundaries to computer science, econophysics, life sciences, etc.

In the following, we sketch a research agenda to develop new models and algorithms facilitating the elicitation and the automatic manipulation and analysis of complex and dynamic knowledge graphs for scientific applications. Each of the following points is described in greater detail in Sections 4–7.

Semi-Automatic Knowledge Elicitation (SP1) takes care of "semantifying" all scientific input within the ScienceWISE system (whether user-entered or derived) and of crowdsourcing content. It hence creates the "subgraphs of knowledge" that all other sub-projects utilize as data. This will be achieved by combining state-of-the-art methods of information extraction, ontology building, automated reasoning, etc. with the power of crowdsourcing within the expert community. The system will log expert activity and, by means of data mining techniques, will continuously import new conceptualizations and learn new scientific workflows. The results of this human-powered knowledge elicitation activity will be represented and stored in newly developed formats, and will constitute the input of the subsequent aggregation, analysis, and discovery steps. The main objective will be to automatically generate all possible information, asking for expert input whenever necessary but sparing the experts from routine manual work.6
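As a toy illustration of mining such logged expert activity, the sketch below promotes a free-text annotation to a candidate ontology concept once enough independent users have applied it. The event format and the threshold are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical activity log: (user, paper, tag) from bookmarking sessions.
log = [
    ("alice", "arXiv:1112.2220", "galaxy morphology"),
    ("bob",   "arXiv:1112.2220", "galaxy morphology"),
    ("carol", "arXiv:1201.0001", "galaxy morphology"),
    ("alice", "arXiv:1201.0001", "misc"),
]

def candidate_concepts(log, min_users=3):
    """Tags independently used by at least `min_users` distinct experts
    become candidate ontology entries, queued for expert verification."""
    users_per_tag = defaultdict(set)
    for user, _paper, tag in log:
        users_per_tag[tag].add(user)
    return [t for t, users in users_per_tag.items() if len(users) >= min_users]

print(candidate_concepts(log))  # ['galaxy morphology']
```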

In particular, a formal ontological representation of composite scientific concepts will allow the system to automatically generate all admissible combinations of existing ontological entities and links between them for subsequent verification (thus providing input for SP2 and SP3).

6Currently, the ScienceWISE system mainly harvests manual user input through semantic bookmarking, annotation, etc. — a simple but important example of the described "symbiosis": scientists use the system for their work, and just by the design of the platform their activities enrich the scientific ontology and provide a service to the whole community.


Proximity and similarity measures on multi-networks (SP4) lead to the development of a semantic recommender system. Community detection in multi-networks (SP4) will allow us to pinpoint the topics of research papers, to determine the context-dependent meaning of ambiguous terms, citation contexts, etc. This will result in the creation of large amounts of high-quality data describing the scientific knowledge in greater detail; in particular, it will turn the coarse-grained concept-paper-author multi-networks (see Fig. 9) into much more fine-grained multi-networks where the structures of the documents, the contexts of references, mathematical equations, etc. are all taken into account. The improved quality of data will allow for large-scale semantic analysis, suitable for analyzing the trustworthiness of scientific discoveries, for the discovery-of-discoveries, and for building the functionalities of the ROBOT-EXPERT.
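A minimal sketch of the kind of concept-based recommendation this enables, using cosine similarity between papers' concept vectors. The vectorization and the data are illustrative assumptions, not the recommender to be built:

```python
import math

# Papers represented by concept weights extracted in SP1 (illustrative).
papers = {
    "p1": {"dark matter": 1.0, "sterile neutrino": 0.8},
    "p2": {"dark matter": 0.9, "galaxy formation": 0.5},
    "p3": {"quantum computing": 1.0},
}

def cosine(u, v):
    common = set(u) & set(v)
    dot = sum(u[c] * v[c] for c in common)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(paper_id, k=2):
    """Recommend the k papers most similar in concept space."""
    scores = [(other, cosine(papers[paper_id], vec))
              for other, vec in papers.items() if other != paper_id]
    return sorted(scores, key=lambda t: -t[1])[:k]

print(recommend("p1"))  # [('p2', ~0.68), ('p3', 0.0)]
```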

Self-Organizing Knowledge Integration and Enrichment (SP2). The machine-processable information elicited by the previous processes can then be used to create an integrated knowledge base for scientific activities and scientific discovery. In this context, the first research challenge will be to relate and semi-automatically curate the various knowledge sub-graphs created by the elicitation process, e.g., to integrate semantically similar but syntactically different concepts, to axiomatize (formalize) the relations between various conceptualizations, and to suggest non-trivial predicates to interconnect related scientific entities from the knowledge graph described above, both inside the system and in relation to third-party knowledge bases available on the Internet.

As scientific information can be created, manipulated, and re-published iteratively by arbitrary experts in our system, keeping track of the various data operations and data versions is essential. Developing new, efficient versioning and lineage abstractions, taking into account both the input generated by the knowledge elicitation component and the requirements defined by the third and fourth sub-projects, will constitute a significant research challenge in this context.
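A minimal sketch of what such a lineage abstraction could record: every derived item keeps pointers to the operation and inputs that produced it, so any result can be traced back to the raw contributions. The fields and identifiers are illustrative, not the abstractions to be designed in SP2:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LineageRecord:
    item_id: str
    version: int
    operation: str                 # e.g. "user_edit", "auto_merge", "inference"
    agent: str                     # expert name or algorithm identifier
    inputs: List[str] = field(default_factory=list)  # ids of source versions

store = {
    "claim:42@1": LineageRecord("claim:42", 1, "user_edit", "alice", []),
    "claim:42@2": LineageRecord("claim:42", 2, "auto_merge", "merger-v1",
                                ["claim:42@1", "claim:17@3"]),
    "claim:17@3": LineageRecord("claim:17", 3, "user_edit", "bob", []),
}

def provenance(version_id):
    """All upstream versions a given version depends on (transitively)."""
    rec = store[version_id]
    upstream = set(rec.inputs)
    for vid in rec.inputs:
        upstream |= provenance(vid)
    return upstream

print(sorted(provenance("claim:42@2")))  # ['claim:17@3', 'claim:42@1']
```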

Finally, we plan to leverage the large-scale network of semantic data created and consolidated in this project to design a novel end-user experience based on the available linked data. Specifically, we will develop entity-centric techniques for ranking and diversifying the content relevant to a given user or query. Instead of merely exploiting the textual content of the scientific documents, we will take advantage of our integrated knowledge graph to leverage structured information about the semantic resources in our system. Thus, search ceases to be about mere documents; rather, it becomes an interface for developing semantically rich solutions to contextualized user goals about any type of scientific entity.
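A minimal sketch of one standard way to combine relevance ranking with diversification: a greedy, MMR-style selection that trades relevance against topical novelty. The scoring functions, trade-off parameter, and data are illustrative assumptions, not the techniques to be developed:

```python
# Candidate entities with a relevance score for the query and a topic label.
candidates = {
    "Sterile neutrino": (0.95, "dark matter"),
    "WIMP":             (0.90, "dark matter"),
    "Axion":            (0.85, "dark matter"),
    "Modified gravity": (0.60, "alternative gravity"),
}

def diversified_ranking(candidates, k=3, lambda_rel=0.7):
    """Greedy selection: each step picks the entity maximizing a trade-off
    between relevance and novelty w.r.t. topics already shown (MMR-style)."""
    selected, shown_topics = [], set()
    pool = dict(candidates)
    while pool and len(selected) < k:
        def score(name):
            rel, topic = pool[name]
            novelty = 0.0 if topic in shown_topics else 1.0
            return lambda_rel * rel + (1 - lambda_rel) * novelty
        best = max(pool, key=score)
        selected.append(best)
        shown_topics.add(pool[best][1])
        del pool[best]
    return selected

print(diversified_ranking(candidates))
# ['Sterile neutrino', 'Modified gravity', 'WIMP'] -- diversity beats raw rank
```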

Trust in Discoveries (SP3). Trust is essential (and to some extent implicit) within the scientific community. However, so far no formal methods have been investigated to evaluate the trustworthiness of a scientific discovery per se (involving potentially many contributors, publications and experiments). This has not been considered a major problem, given that specialists have a good overview of their field and are able to assess the trustworthiness of a scientific discovery in a science community process.

This will, however, change radically as scientific discoveries, as outlined earlier, become complex phenomena, crowdsourcing tasks to numerous people and experiments connected through complex workflows, whose complexity may evade the grasp of individual researchers and even of whole specialized communities. In this project, we propose to develop formal methods to evaluate the trustworthiness of a scientific discovery by analyzing how such assessments are made today in scientific disciplines, comparing these processes to other trust mechanisms, in particular those investigated in computational trust, and developing, based on this analysis, a series of mechanisms that could support humans in assessing the trustworthiness of scientific discoveries in the future. We will design algorithms that track the provenance of a scientific discovery using the lineage model developed within SP2, assess all factors of uncertainty from the heterogeneous resources harvested within SP1 (e.g., uncertainty about participating researchers, experimental parameters, accidental errors, malicious behaviors), and infer the level of trust that can be placed in a result, or pinpoint potential weaknesses in the chain of reasoning leading to a result.
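Building on the lineage records sketched under SP2, the following toy example combines per-step uncertainty factors along a provenance chain and reports the weakest link. The factors and the aggregation rule (product, with min-step reporting) are illustrative assumptions, not the algorithms to be designed:

```python
# Provenance chain of a hypothetical discovery: ordered derivation steps,
# each with confidence factors from several heterogeneous sources.
chain = [
    ("raw detector data",    {"instrument calibration": 0.98, "operator record": 0.99}),
    ("event selection",      {"algorithm validation": 0.95, "parameter choice": 0.90}),
    ("statistical analysis", {"method peer-reviewed": 0.97, "reproduced by others": 0.80}),
]

def assess(chain):
    """Overall trust = product of per-step confidences; also report the
    weakest step so human experts know where to focus their scrutiny."""
    overall, weakest = 1.0, None
    for step, factors in chain:
        step_conf = 1.0
        for value in factors.values():
            step_conf *= value
        overall *= step_conf
        if weakest is None or step_conf < weakest[1]:
            weakest = (step, step_conf)
    return overall, weakest

overall, weakest = assess(chain)
print(round(overall, 3), weakest)  # weakest link: 'statistical analysis'
```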

Multi-layer complex networks of scientific information (SP4). Scientific information is naturally complex (e.g., Figure 9 on page 34 depicts an example of the structure of a multi-network within ScienceWISE, with many different types of links between different entities). This is the simplest possible scientific semantic structure, although more complicated networks (e.g., those using the actual experimental data, connecting papers through equations, or analyzing the internal structure of the papers) are also possible. There currently exists no formalism to analyze the dynamics of such multi-layer, complex knowledge graphs. For this reason, a significant part of the project will be devoted to developing novel theoretical frameworks for analyzing the evolving topology of such graphs.

To analyze the structural, static and dynamical properties of ScienceWISE networks, we first need to introduce network indexes that adequately capture the richness of our entire data and event history. Such a characterization represents a crucial step for the development of null models, that is, random multi-layered networks devoid of any correlations, and thus of any information, but still able to capture a few selected structural features of the real instances. Null models, in turn, are necessary to assess the reliability of the outcome of any procedure that extracts information based solely on the network structure, as relevant attributes should stand out significantly above the background expected from null cases. Ultimately, one of the major goals of this project is precisely the development of new techniques to interrogate ScienceWISE networks about the presence and evolution of concepts and of collaborative teams, or about the identity of the most authoritative scientists related to specific concepts, to mention just a few queries that could be answered by analyzing the network structure.
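To illustrate the role of null models, a minimal sketch on a single-layer toy network: degree-preserving edge rewiring generates randomized instances, and a z-score tells whether an observed statistic (here, the triangle count) stands out above the null background. The statistic, graph, and parameters are illustrative:

```python
import random
from statistics import mean, stdev

def triangles(edges):
    """Count triangles in an undirected edge set."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # Each triangle is seen once per edge, hence the division by 3.
    return sum(len(nbrs[u] & nbrs[v]) for u, v in edges) // 3

def rewire(edges, swaps=100, seed=None):
    """Degree-preserving double-edge swaps: (a,b),(c,d) -> (a,d),(c,b)."""
    rng = random.Random(seed)
    e = [tuple(sorted(x)) for x in edges]
    for _ in range(swaps):
        (a, b), (c, d) = rng.sample(e, 2)
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if len({a, b, c, d}) == 4 and e1 not in e and e2 not in e:
            e.remove((a, b)); e.remove((c, d))
            e += [e1, e2]
    return e

observed = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5), (5, 6)]
null_counts = [triangles(rewire(observed, seed=s)) for s in range(200)]
z = (triangles(observed) - mean(null_counts)) / (stdev(null_counts) or 1.0)
print(f"observed={triangles(observed)}, z-score vs null = {z:.1f}")
```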

It is also important to highlight that multi-networks are ubiquitous across several disciplines and yet have been, to date, mostly neglected. The results of this sub-project will thus prove valuable beyond the particular scope of this proposal.

2.4 Organization of the collaboration

A. Value added by the joint research approach. The ambitious goal of the project — to close the gap between the results produced by scientists and the automated "consumption", integration and analysis of such results — is strictly impossible to realize within one scientific discipline and hence requires the consolidated efforts of at least three scientific fields: i) Natural Science, tackling the elicitation of the source data and the expert verification of the resulting reasoning; ii) Computer Science, addressing the issues in developing new semantic web curation and integration algorithms as well as new scientific trust and verification models; and iii) Complexity Science, tackling the analysis of dynamically growing multi-layered interconnected networks of scientific knowledge. We strongly believe that we will have a strategic advantage by adopting this structure of collaboration, combining the power of analytic tools from physics and complexity science with fundamental advances in information integration, search and computational trust analysis. Through existing national and international collaborations and networking, we will also have access to expertise in the social sciences, human-computer interaction, and knowledge representation methods.

The research goals of SP2, SP3 and SP4 will strongly benefit from the experimental and validation platform provided by SP1, where intermediate and final results of all sub-projects will be applied, hence generating feedback and a consolidated experimental framework.

B. Competence, complementarity and collaboration of the groups involved. K. Aberer and P. Cudre-Mauroux bring expertise in computer science; they work on two different aspects of the project: trust management and self-organizing knowledge integration. A. Boyarsky and P. De Los Rios are theoretical physicists. Prof. De Los Rios brings a unique and necessary expertise in networks and complexity science that is crucial for the project. Prof. Boyarsky is the initiator and technical project leader of ScienceWISE.info, the main data source and validation platform for the project. He has been collaborating for several years with K. Aberer and P. Cudre-Mauroux and, being a theoretical physicist, shares a "common language" with P. De Los Rios. He is responsible for the identification and detailed formulation of the problems for the sub-projects, the application of the results within the platform, and collecting and formalizing feedback.

Participants have published six papers together [6, 7, 18, 28, 105, 106] related to this project. These papers represent a first concrete indication of the strong potential of the consortium as a whole.

K. Aberer and P. Cudre-Mauroux have a long track record of collaboration (Prof. Cudre-Mauroux is a former PhD student of Prof. Aberer). A. Boyarsky has co-authored papers with all three other participants in the last 2 years: three papers [6, 7, 105] with Aberer and Cudre-Mauroux; paper [28] with De Los Rios and members of the group of Aberer; paper [18] with students of Cudre-Mauroux and Aberer; and paper [106] with Cudre-Mauroux and members of his group. He has been involved in the research related to all sub-projects.

C. Project Organization.

Interactions between the sub-projects. The project splits into 4 sub-projects. The nature of the scientific interaction between the sub-projects is summarized in Fig. 1 and described in Section 2.3.

Integration of results. The starting point of the project is the existing ScienceWISE system with its hundreds of active users,


and many tens of thousands of semantically annotated research documents. At the first stage of the project, the technologies and, correspondingly, the amount and the quality of the data are expected to be continuously improved by the results of the other sub-projects, applied to the ScienceWISE system. In this context, SP1 serves as a connection between the sub-projects. Apart from its own goals (development of formats and knowledge elicitation), it generates a number of tasks for the projects SP2–SP4. As deliverables of the other sub-projects become available, they get iteratively integrated into the ScienceWISE platform, which results in an improvement of the overall quality of the data, an increase in its amount, etc. This brings the project to its second stage when, based on these new data, all sub-projects are able to address more challenging tasks, feeding the results into the ScienceWISE system and obtaining larger volumes of "experimental data".

Practical Organization. The practical organization of the collaboration between the groups will be orchestrated as follows. The development of the ScienceWISE experimental software platform will be carried out by the doctoral students and interns from EPFL and UniFr, and supervised by postdoctoral researchers at each institution. The development of the experimental platform will follow modern agile software development methods based on iterative and incremental development steps. Planning and collaboration will be handled through a state-of-the-art and free task management platform (http://asana.com/) that UniFr is already using. Source control and issue tracking will be handled on GitHub.

Risk management. The project coordinator will perform a continuous risk management evaluation throughout the project, identifying any possible delay/problem w.r.t. the work-plan described in this document at an early stage, so that solutions can be elaborated in time. A systematic approach will be adopted for monitoring resource spending against the project budget and achievements against schedule.

Meetings and Communication. The project will be launched by a plenary kick-off meeting. The meeting will be the first opportunity to refine the common and shared understanding of tasks and resources and to build up an operational team spirit. We plan to hold plenary project meetings at least biannually and commit to meeting on a bimonthly basis in the first six months.

Furthermore, each sub-project may organize additional face-to-face meetings for dedicated intra-sub-project and cross-sub-project communication when required to speed up the development and integration process. Apart from face-to-face meetings, there will be weekly telephone/Skype conferences. Knowledge exchange will be based on those meetings, on email and wiki communication, and on the task management platform. All meetings and calls shall have public minutes.

Personnel hired within this project, in all groups, will work closely with the ScienceWISE data, use it as a validation platform, and participate in these regular teleconferences. We will also conduct regular (bi-monthly) research-focused seminars at EPFL and at UniFr with both internal and external speakers.

Since 2012, A. Boyarsky (after a long stay at EPFL and CERN, 2003–2011) has worked as a professor of physics at Leiden University (the Netherlands), still spending part of his time as an Invited Professor at EPFL and a scientific associate at CERN. He will continue to co-lead the Swiss-based team together with Prof. Aberer and will create a group in Leiden, thus expanding his involvement in this project. He plans to bring additional support for the project from Leiden University and from the Dutch science foundation (FOM).

D. Promotion of young researchers. Young researchers from all involved groups participated in all project-related papers (A. Astafiev [18] is a PhD student of K. Aberer; R. Prokofyev [18, 105, 106] is a PhD student of P. Cudre-Mauroux; in [28], O. Zozulya was a postdoc with K. Aberer and M. Charlaganov was a postdoc with A. Boyarsky; Z. Yang was a Master's student of P. De Los Rios; G. Demartini [6, 7, 105, 106] is a postdoc with P. Cudre-Mauroux). These papers, as well as the subprojects of this proposal, combine different methods from computer science and theoretical physics together with their practical applications. Young researchers get exposed to those various methods and have the opportunity to work with all senior members of these projects, as our preliminary papers demonstrate. We plan to continue this practice with the new personnel, and also plan to systematize our interactions through regular research meetings involving all the researchers involved in this project (see above). Those research meetings will be organized mainly with the students as the main audience.

On a more general basis, a Sinergia research grant would represent a unique opportunity for us to intensify, integrate and sustain


our current interactions, as well as to bring the whole project to the next level. We are convinced that the rich collaboration framework that we sketch in this project will provide opportunities to acquire interdisciplinary knowledge and first-hand experience combining cutting-edge methods from several fields with their practical applications. Our project's distinct combination of theoretical approaches, digital models and complexity science, along with the immediate and practical applications to real-world systems, will provide young researchers with a unique scientific setting, in which they will be able to invent the next-generation scientific infrastructures that are needed today and to gain first-rate research experience both in physics (e.g., complex systems) and in computer science.

2.5 Relevance and impact

The results of this project can be applied in many areas of science, industry and business where there exists an expert community with an electronic document space and a minimal field-specific ontology. We expect the successful realization of this project to bring several important breakthroughs, in particular in the domain of ICT. One of the main challenges that our society is facing in its technological and social development is the increasing complexity of the systems that we are trying to organize, to manage and to model. In the case of social and financial systems, this challenge is there from the very beginning – such systems are intrinsically complex and strongly correlated by their nature. In the natural sciences, on the other hand, we are facing the challenge of emergent complexity.

A typical starting point in, say, bioinformatics, astronomy or other fields operating with huge amounts of data is to identify sets of concepts describing elements of some natural "Lego set" that are well defined and have a finite set of easily classifiable properties [46, 75, 113]. These are relatively easy to formalize in an ontology and to manipulate automatically using two main abstractions: hierarchies and keywords. This approach is very fruitful and continues to lead to numerous advances in the respective scientific fields. These systematic efforts, however, are mostly limited to information with a naturally "structured controlled vocabulary".

However, a significant part of the scientific discourse is much harder to formally conceptualize in the form of such simple abstractions, for at least two reasons. First, concepts in science often cannot be categorized in rigid hierarchies (e.g., in the natural sciences the same names are routinely used to describe an observed phenomenon and its mathematical abstraction). Scientific concepts increasingly revolve around sets of complex, sometimes abstract conceptualizations and their non-trivial relationships. Scientific information is originally produced in unstructured form, is often not presented within any single resource, but is distributed across many sources of different nature. This information is often incomplete, is dynamically generated, and needs to be constantly updated, with relations between various components changing over time. Second, while existing ontology curation methods typically aim to semi-automatically resolve all conflicts as soon as they arise, conflicting information often needs to be preserved in a scientific context, where categorical answers are not always available. Thus, manipulating complex scientific knowledge automatically requires a fundamentally new step in information management, e.g., through the development of new algorithms and models facilitating the elicitation and the automatic manipulation of complex and dynamic knowledge graphs. The successful realization of the present project will hence open the door to numerous breakthroughs in science.

Impact for society. If the above goal is achieved, this project will also address an important problem: making research results accessible to other groups of scientists and, more importantly, to non-experts (e.g. businesses, entrepreneurs). It helps in contextualizing and searching for research data and results that would otherwise end up in the "data graveyard". Research data can thus remain valid for a long time after its initial generation and be reused by other individuals. As all science disciplines are becoming more and more data-intensive, models and methods like those described above are urgently required. The overall approach, once established, can be extended to the activities of other (non-research) communities of knowledge workers. It therefore has a direct relation to many areas: society, economy, technology, and education.


3 Interdisciplinary Nature of the Proposal

This project is a collaboration between computer scientists and physicists. The involvement of physicists is two-fold: (A) the analysis of the complex data discussed in this project requires advanced methods of Complexity Science, developed within the physics community; and (B) the physics community was chosen for the first experiment in developing a semantic participatory platform. This choice is justified by several considerations: (i) the physics community is relatively compact (compared to most of the other natural sciences); (ii) it is well organized and structured; (iii) it already possesses a number of field-specific ontologies; and (iv) it possesses a centralized repository of research articles (ArXiv.org). The choice is also determined by the existence of the validation platform, ScienceWISE, adopted by the community of physicists and having a dynamically growing ontology. Through inter-disciplinary subjects, the system gradually expands its boundaries to computer science, econophysics, life sciences, etc.

It is therefore important that physics experts are involved, capable of understanding and "semantifying" the input and of collecting and analyzing user demands and feedback. The results of the project will make information management in physics more effective and will therefore have a direct impact on research in this field.

Significant parts of the methodology and goals of the project are cutting-edge in Computer Science, as explained in the work description. As examples, knowledge elicitation through semi-automated crowdsourcing, entity management (including storage, indexing, search, disambiguation and integration), as well as ontology mapping, are all extremely relevant topics for Computer Science, as demonstrated by our recent publications in top Computer Science venues.


4 Sub-Project 1: ScienceWISE – the scientific Knowledge Elicitation platform


Figure 2: The structure of the ScienceWISE system, the interactions between its various modules and the ArXiv.org document server. Scientists access the system through ArXiv.org (or several other scientific research repositories, not shown). As a result of semantic annotation and conceptual indexing (bookmarking) of the research papers within the ScienceWISE system, the semantic space of research documents gets expanded and the ontology gets updated. Automatic population of the ontology, based on continuous processing of the available WWW resources, gets verified through direct user input. Automated reasoning on the ontological statements is performed and the results are fed back to ScienceWISE for user verification. The existing ScienceWISE ontology can then be published on the web and connected to the "Linked Open Data" cloud.8

8 Linked Open Data project: http://linkeddata.org — a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.

To realize the goals outlined in this proposal, one should start from within a particular scientific community, large enough to provide useful statistics and social power, but still small and homogeneous enough to collectively discuss non-trivial information and methods. We propose to perform the first experiments within the physics community, a choice justified by several considerations: (i) the physics community is relatively compact (compared to most of the other natural sciences); (ii) it is well organized and structured; (iii) it already possesses a number of field-specific ontologies; and (iv) it possesses a centralized repository of research articles (ArXiv.org).

Our ScienceWISE system will be used as a validation platform for the project. ScienceWISE is already used by several hundred physicists around the world. It is composed of three main elements, all crucial for achieving the goals of this proposal (see the schematic representation of its high-level architecture in Fig. 2):

1. the complex knowledge-graphs which capture the advances/concepts in physics and their complex relationships (see the left panel of Fig. 3)

2. the system itself, which consolidates all local inputs with the current ontology and attempts to create a comprehensive, global and dynamic knowledge system


Figure 3: The ScienceWISE concept ontology and its use for semantic annotation. Left panel: visual representation of a concept in the ontology, together with its parent category (top), semantic relations to other concepts (left and right arrows) and alternative definitions (bottom). The definitions are authored and, unlike in the case of Wikipedia, can stress different aspects of a complex problem or even represent conflicting views. Right panel: the ScienceWISE ontology allows users to semantically annotate bookmarked papers. The scientists are free to use any concept from the ontology as tags and can add new concepts to the ontology if needed when bookmarking a paper.

Figure 4: Weekly number of visits to the ScienceWISE system over the last two years (based on Google Analytics). The system became open to the public in April 2011; before that, it was used by specially invited test users.

3. the social community of experts in a field, which gives local, noisy and incomplete knowledge on some parts of the knowledge-graph (Figure 5).

We start below by introducing related systems and the current state of the ScienceWISE platform. We then describe the specific research challenges related to this work package, which focuses on crowdsourced knowledge acquisition techniques.

Related Systems. A number of interesting efforts were recently started towards the organization of scientific information. The CERN-developed Invenio system, for instance, is a platform for storing scientific documents along with additional metadata. Another related system is Arnetminer.org [120], where the goal is to mine academic social networks by means of author ranking. Other recent systems like BibSonomy.org [77], Connotea, CiteULike, or Mendeley allow users to publish and organize papers using classical information retrieval techniques based on static schemas, keywords, and term-frequency ranking. Many bibliography management systems (including the CERN Document Server) allow users to add private or public comments to any record. MathSciNet9 has a long-standing tradition of open, peer-run short reviews of mathematical papers, written by colleagues from adjacent fields. Additionally, there have been a number of attempts to create "scientific blogs" and/or "discussion forums" to supplement scientific publications with expanding and clarifying notes.10 However, the additional scientific information generated within these resources is not easily accessible outside of the immediate groups of resource users, due to the absence of clear structure/categorization and to natural language variability.

9 From the American Mathematical Society: http://www.ams.org/mathscinet

The ScienceWISE system today. The ScienceWISE platform takes an audacious approach to knowledge management: it relies on a user-curated and dynamic ontology in order to assign user-defined concepts to papers rather than unstructured tags. This difference is crucial for the functionality of scientific bookmarking, as it makes it easy to recall bookmarked papers using the ontological neighborhood of a paper (cf. Figure 9) and to exploit relations between concepts to explore semantically related contributions (as in the right panel of Figure 6).

ScienceWISE users expand the semantic space of the research papers through annotation and scientific bookmarking. To annotate a paper, its author is presented with the list of concepts and definitions automatically identified in the paper and ordered by their relevance. The user can choose some of the concepts and the system will produce a hyperlinked version of the manuscript, inserting hyperlinks to relevant definitions/resources, thus expanding the paper with additional details, comments or pedagogical materials. Competing scientific viewpoints are represented as alternative resources and definitions about the same concept (Figure 3, left panel).

Users can bookmark any ArXiv.org paper using the ScienceWISE ontology (conceptual indexing). The system automatically selects the most relevant concepts for the characterization of a paper, to be further fine-tuned by the user. A concept navigation panel (see Figure 3, right panel) allows users to classify bookmarked papers, create collections and easily navigate to any bookmarked paper in a few clicks.11

Both for annotation and bookmarking, users can add concepts, definitions, resources, and relations that they deem necessary. This occurs exclusively through the ScienceWISE ontology, thus creating a mechanism to expand it manually and to validate the results of automated expansion. The ability of users to expand the ontology makes this "restriction of natural language flexibility" quite mild, while in return part of the scientist's work is performed automatically: the researcher then just has to check and tune the suggested set of concepts for indexing, resources for annotation, etc. In addition, the user is empowered with ontology-based methods that allow them to perform semantic search and recommendation, or to navigate using the ontological neighborhood of a paper.

The ScienceWISE datasets. The project builds on large research databases: papers on ArXiv.org [1] (more than 700 000 preprints); millions of preprints and papers from the CERN Document Server (CDS) [2]; hundreds of thousands of records from the HEP INSPIRE bibliographic database [4] (850 000 bibliographic records, tens of thousands of scientists' records, thousands of records on experiments in high-energy physics, plus the data on user searches); and the collection of semantically annotated papers on ScienceWISE with user logs.

4.1 Interactions with the overall project

The ScienceWISE system plays a dual role in this project. On one hand, it provides the data and "experimental material" on which the other sub-projects are built. It is important to highlight that the ScienceWISE platform has already been operational for almost one year and that large amounts of semantic data on which the other sub-projects shall build are already present. Therefore, there is no "initial bootstrap problem" and no danger that a late start of the ScienceWISE system would jeopardize the success of the whole project. On the other hand, ScienceWISE is the validation framework where the approaches and methods developed within sub-projects SP2–SP4 will be tested. It provides semantically organized subgraphs used by the other sub-projects as input, and uses user feedback to check and improve their methods. Successful applications of the deliverables of the sub-projects will make the knowledge acquisition process within the ScienceWISE system more effective, providing in turn more data of higher quality and refinement. Unlike in harvesting knowledge acquisition systems, this positive user experience will serve to improve user-system interactions and the quality of the data.

10 See e.g. CosmoCoffee.info – a researchers' forum related to cosmology; PhysicsForums.com – a general question-and-answer forum, student-oriented; scientific blogs of individual scientists or groups (e.g. http://blogs.discovermagazine.com/cosmicvariance).

11 Our present algorithm ensures that any paper can be reached in a minimal number of clicks, which scales essentially as log(number of papers).

Figure 5: Heterogeneous scientific information within the ScienceWISE system. The scientific concepts (blue boxes) are organized into ontologies through semantic relations (red arrows). The scientific articles are linked with each other through citations (purple arrows). The authors are connected into a network through co-authorship of papers. Each concept is at the same time related to the multiple articles that mention this concept (green dash-dotted lines). The scientific publications thus introduce new types of relations between concepts through co-occurrences (red dashed line), while the ontology of concepts induces another network of authors – people who write about the same concept (not shown).

One of the main goals of SP1 is to accumulate the results of the whole project and incorporate them into the system. The other focus of this sub-project is on novel knowledge elicitation mechanisms to organize the user input, through clustering and semantic reasoning. In broad terms, SP1 will make it possible to produce an initial scientific semantic context for any paper, concept, experimental dataset or other entity imported into the system.

4.2 Detailed research plan

Large-scale knowledge elicitation and reasoning. ScienceWISE gives researchers the opportunity to collectively edit an ontology of the concepts pertinent to their work. The relations between these concepts, latent or explicit, are of great importance for the scientific process as a driver for innovation and new findings.

It has come to a point where the collectively created knowledge base is too big to be fully grasped by any single researcher. Instead, a middle-man mechanism must be introduced to digest the data, perform some initial analyses and present researchers with more accessible content. Achieving this requires the machine to be able to "understand" the problems researchers are working on. This intelligent assistant has to be able to understand the different concepts at hand in order to find the relations among them and to guide the scientist in exploring the data.

In the context of this sub-project, we will work on this middle-man mechanism bridging the gap between the community of scientists and the ontology. The goal here is to design a mechanism through which scientists can query the ontology (e.g., asking how two concepts in the ontology are related), and at the same time react to the results provided by the system in order to improve it.

Figure 6: Establishing a relation between entities of different nature through the heterogeneous network of Figure 5. Left panel: the relation between two scientific concepts ("Dark matter" and "Gauge field") does not exist in the ontology of concepts. It gets established, however, through their co-occurrence in several research papers (in the middle). Right panel: establishing a relation between two authors who did not co-author a paper but have written about the same subject ("Dark Matter"). The figures are actual screenshots from RELFINDER12 operating on the RDF version of the ScienceWISE ontology.

Reasoning [54] will play an important role in this context, to discover the hidden logical relations between two concepts or to highlight conflicting axioms in the knowledge base. Standard reasoning techniques, however, face performance and scalability issues because of their computationally intensive nature, especially when run over large instances (A-boxes). In order to provide a response time suitable for the human-computer interaction sketched above, advanced or approximate reasoning techniques are necessary. One of the leading projects in that respect is the LarKC platform.13

13 The Large Knowledge Collider project, http://www.larkc.eu — a platform for massive distributed incomplete reasoning removing the scalability barriers of currently existing reasoning systems for the Semantic Web; see e.g. [125].

One of the first tasks of this work package will thus be to identify, and then to adapt, a relevant subset of the algorithms developed in the LarKC context to enable near real-time interactions between the scientists interrogating the ontology and the knowledge base. This integration will make it possible to provide advanced knowledge browsing capabilities to the scientists, as well as a feedback loop between the community of scientists and the knowledge base.

Composite scientific concepts. One of the first objectives of SP1 is related to the development of a formal ontological representation of composite concepts. Abstract concepts in science are often expressed as composites of other ontological concepts. For example, the concept mass of particle is a composite of two basic scientific concepts: mass and particle. An important property of a composite concept is that it is not several concepts "glued" together. Rather, its constituents are related through some special property which is non-symmetric and has a limited domain and range. Moreover, the relation "Concept A is a composite of concepts B and C" should not be thought of as several triples of the form "A is related to B" plus "A is related to C". Rather, this is a new type of semantic relation, connecting a concept A with several other concepts simultaneously. The notion of composite concepts brings several challenges at the level of knowledge representation languages and reasoning that will be addressed in this proposal: (a) development of rules, and methods for their formal representation, that tell reasoners which concepts can be meaningfully combined; (b) application of these rules to the automated generation of composite concepts with subsequent user validation; (c) derivation of the semantic relations of newly created concepts based on the semantic properties of their sub-concepts, with subsequent user validation. These results will make it possible to automatically generate all possible combinations of existing ontological entities (drastically increasing the size of the ontology) and links between them for subsequent verification (thus providing input data for the other sub-projects, SP2 and SP3).
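As an illustration, the sketch below (in Python with rdflib; the namespace and the head/argument property names are our own illustrative assumptions, not a fixed ScienceWISE vocabulary) represents mass of particle as a first-class node carrying one dedicated n-ary relation with role-distinguished arguments, rather than as two independent "related-to" triples:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    SW = Namespace("http://example.org/sciencewise/")  # hypothetical namespace
    g = Graph()
    g.bind("sw", SW)

    # "Mass of particle" as a single composite node: the roles of its
    # sub-concepts are distinguished (head vs. argument), so the relation is
    # non-symmetric and connects several concepts simultaneously.
    composite = SW.MassOfParticle
    g.add((composite, RDF.type, SW.CompositeConcept))
    g.add((composite, SW.head, SW.Mass))          # "mass of ..."
    g.add((composite, SW.argument, SW.Particle))  # "... particle"
    g.add((composite, RDFS.label, Literal("mass of particle")))

    # One possible derivation rule of type (c) above: a composite specializes
    # its head concept; the inferred relation awaits user validation.
    g.add((composite, RDFS.subClassOf, SW.Mass))

    print(g.serialize(format="turtle"))

Rules of type (a) would then constrain which classes may fill the head and argument roles (e.g. via domain and range axioms), and automatically generated composites would be queued for user validation.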

Semantic recommendation system. Every day, between 400 and 800 articles are published on ArXiv.org. Each paper on ArXiv.org belongs to one or several Subject classes, chosen by the authors of the paper. One of these classes is chosen as primary and is mandatory, while all the others (optional) are known as secondary classes. All new submissions are presented to the readers of ArXiv.org in random order, making it very difficult to keep up to date with the new preprints.

Our current recommendation system reorders these daily submissions according to the user's interests. In order to specify his/her interests, the user chooses a set of concepts, marking some of them as "not-to-be-missed" (thus boosting their priority in the ranking). However, to fully describe a topic of interest one should specify hundreds of concepts (ideally each with its own weight), which is prohibitively hard to do manually.

Figure 7: Dependence of the proximity of two nodes in a multi-layered graph on the number of paths. The shortest path from the user to paper B has 3 edges, while the shortest path to paper A has 4 edges. However, there is more than one path to paper A, and the nodes that form these paths are interconnected. Based on shortest-path analysis we would decide that paper B is more relevant than paper A, but in fact it is possible that paper A is much more relevant. Circles denote concepts; orange edges denote ontological relations, purple edges citations between papers, and black edges the occurrence of a concept within a paper.

The objective is to establish user interests automatically by analyzing the available data about the user: (a) papers the user has co-authored (together with the scientific concepts they contain, the papers they refer to, as well as citations to those papers); (b) papers the user has bookmarked within the ScienceWISE system. To this end we need to determine the proximity of any newly added paper to the user, using the multi-layered author-paper-concept graph (cf. Figure 9). For graphs with only one type of edges/vertices there are well-established algorithms (such as e.g. [52]); however, the situation is completely different in this case. There are three types of edges in such a network: "paper-to-paper" (citations), "paper-to-concept" (occurrence of concepts within a paper), and "concept-to-concept" (ontological relations). All these edges have their own weights ascribed to them.14 Clearly, the proximity depends not only on the shortest path from a paper to a user, but also on the number and length of the different existing paths (as Fig. 7 demonstrates). The different nature of the edges of this graph represents an additional challenge, as the relative weights of links of different nature (or of paths of different length) need to be determined.

Thus, we are facing two main problems: developing an algorithm that properly takes into account weights of different "nature", and weighting paths of different length. These problems are unsolved within network theory, and additional input is needed. This additional information is provided by the ScienceWISE platform, which already holds data on (i) manually defined topics of interest for our users and the rankings based on them; (ii) bookmarked articles with manually assigned tags; (iii) user clicks while working with the ranked list of papers.
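A toy sketch of one candidate proximity measure — a Katz-style index that sums over all walks between the user and a paper — illustrates how multiple paths can be taken into account beyond the shortest one. The per-layer weights and the decay parameter below are illustrative assumptions; determining them from the usage data (i)–(iii) is exactly the open problem stated above:

    import numpy as np

    # Toy node set: 0 = user, 1-2 = papers, 3-4 = concepts.
    n = 5
    layers = {  # per-layer edges and assumed relative weights
        "authorship": ([(0, 1)], 1.0),                  # user-to-paper
        "citation":   ([(1, 2)], 1.0),                  # paper-to-paper
        "occurrence": ([(1, 3), (2, 3), (2, 4)], 0.8),  # paper-to-concept
        "ontology":   ([(3, 4)], 0.5),                  # concept-to-concept
    }

    # Collapse the layers into a single weighted adjacency matrix.
    A = np.zeros((n, n))
    for edges, w in layers.values():
        for i, j in edges:
            A[i, j] = A[j, i] = w

    # Katz-style proximity K = sum_l beta^l A^l counts all walks, discounting
    # longer ones; beta < 1/spectral_radius(A) guarantees convergence.
    beta = 0.5 / max(abs(np.linalg.eigvals(A)))
    K = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

    print("proximity(user, paper 2):", K[0, 2].round(4))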

Dynamic context determination and generation. Dynamic generation of ontologies by crowdsourcing faces the challenge of "optimizing the user's input". This is why, although Wikipedia contains tens of thousands of articles on physics and related subjects, its semantic form — DBpedia [19] — would return almost zero results for any physics-related query: the crowd-sourced input in this case is completely unstructured. To overcome this, one needs a dynamically adapting scheme that continuously asks users for the most relevant information about the concepts they enter. To achieve this goal, the ScienceWISE system should be able to perform both statistical analysis and deductive reasoning on the fly. The adoption of LarKC and other semantic web technologies, together with the results of SP2, will make it possible to determine the most efficient mechanism to use.

14 Concept-to-paper edge weights are proportional to the "term frequency"; weights in the paper-to-paper citation network can be assigned using PageRank [29, 101] or its generalizations, such as e.g. LeaderRank [90]. Finally, different ontological relations between two concepts in the ontology define different measures of proximity. For example, "is an instance of" or "is a specialization of" is much more relevant than a mere "is related to".

Matching mathematical context. Similar (and even identical) mathematical structures can appear in very different research contexts. It is almost impossible to recover by traditional means a paper from a different domain where a mathematical structure of interest has already been analyzed. The set of mathematical formulae in a given article defines its mathematical context and realizes yet another sub-network associated with the papers and concepts. Automated analysis of the mathematical context is an attractive possibility, since mathematical formulae are much more rigid syntactically and semantically than the sentences of a natural language. There also exists a set of formal transformation rules establishing the equivalence of two equations having different syntax. By implementing methods of symbolic algebra and sub-graph matching (derived from the SP2 approach), we will be able to realize "Equation search": finding papers that analyze or use the same mathematical equations, even though the underlying meaning of the variables in these equations can be different. This idea was formulated some time ago already [84], but was never realized for large-scale data. Along similar lines, we will also implement "matching" of articles that use similar equations.
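A minimal sketch of such syntax-independent matching, assuming formulae have already been parsed into SymPy expressions (the brute-force variable renaming below is only viable for equations with a few symbols; a scalable version would rely on canonical hashing and the sub-graph matching mentioned above):

    import itertools
    import sympy as sp

    def same_equation(eq1, eq2):
        """Check whether two expressions coincide up to a renaming of their
        variables, using symbolic simplification to absorb syntactic
        differences."""
        syms1 = sorted(eq1.free_symbols, key=lambda s: s.name)
        syms2 = sorted(eq2.free_symbols, key=lambda s: s.name)
        if len(syms1) != len(syms2):
            return False
        fresh = sp.symbols(f"x0:{len(syms1)}")
        target = eq1.xreplace(dict(zip(syms1, fresh)))
        # Try every assignment of the second equation's variables to the
        # canonical symbols (factorial cost: a sketch, not production code).
        for perm in itertools.permutations(fresh):
            candidate = eq2.xreplace(dict(zip(syms2, perm)))
            if sp.simplify(target - candidate) == 0:
                return True
        return False

    # E = m*c**2 in one paper vs W = M*v**2 in another: the same structure,
    # even though the variables mean different things.
    E, m, c = sp.symbols("E m c")
    W, M, v = sp.symbols("W M v")
    print(same_equation(m * c**2 - E, M * v**2 - W))  # True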

Community finding and clustering within the ScienceWISE ontology. A significant part of this project will address the issues of network clustering, community finding techniques, etc. A number of functionalities of the ScienceWISE system hinge on our ability to effectively determine such structures within the ontology. While developing these methods is part of the corresponding sub-projects, we list below several use-cases that will be realized and tested once the methods and algorithms of the other sub-projects are implemented.

a. Concept clustering. To be able to navigate efficiently through the collections of research papers, we want to determine the collections (sub-topics, etc.) dynamically and then provide the users with the easiest way to navigate via:

All papers → Subtopic → Specialization within the subtopic

Based on the objectives of sub-projects SP2 and SP4, we will develop a method for the dynamic determination of "sub-topics" within a given corpus of research papers and determine the minimal set of concepts that allows selecting such a sub-topic; a toy illustration is sketched below.
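The following sketch (Python with networkx; the concept co-occurrence edges and weights are our own toy assumptions) shows the flavor of such sub-topic determination via modularity-based community finding:

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy concept co-occurrence graph: edge weights count the papers in which
    # the two concepts are annotated together.
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("Dark matter", "Sterile neutrino", 12),
        ("Dark matter", "Galaxy rotation curve", 8),
        ("Sterile neutrino", "Neutrino oscillation", 9),
        ("Gauge field", "Spontaneous symmetry breaking", 11),
        ("Gauge field", "Higgs boson", 10),
        ("Higgs boson", "Spontaneous symmetry breaking", 7),
        ("Dark matter", "Higgs boson", 1),  # weak cross-topic link
    ])

    # Each detected community is a candidate sub-topic; the strongest concept
    # inside it can serve as the minimal concept selecting that sub-topic.
    for community in greedy_modularity_communities(G, weight="weight"):
        label = max(community, key=lambda c: G.degree(c, weight="weight"))
        print(f"sub-topic '{label}':", sorted(community))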

b. Similarity between research papers. One of the serious practical problems encountered by the modern scientist is finding articles relevant to a given article. This problem may arise in various situations, such as performing the literature search before the submission of a new manuscript, or looking for an answer to some auxiliary question emerging en route to a research objective. The main approach today – manually searching databases for relevant titles, keywords, terms and references – does not work satisfactorily, and often crucial information does not reach the interested party. It is therefore important to develop methods of automatic context analysis for the calculation of the mutual relevance of scientific articles (see Figures 6 and 9 for our current realization of such matching); a simple baseline is sketched below.
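One simple baseline for such mutual relevance, sketched here under the assumption that papers are already annotated with ontology concepts (the actual measure will also exploit ontological relations and citations, as discussed above), weights shared concepts by their rarity:

    import math
    from collections import Counter

    papers = {  # toy annotations: paper id -> set of ontology concepts
        "arXiv:1001.0001": {"Dark matter", "Sterile neutrino", "Gauge field"},
        "arXiv:1001.0002": {"Dark matter", "Gauge field", "Higgs boson"},
        "arXiv:1001.0003": {"Quantum computing", "Entanglement"},
    }

    # Rare concepts are more informative: weight them by inverse document
    # frequency over the corpus.
    doc_freq = Counter(c for concepts in papers.values() for c in concepts)
    idf = {c: math.log(len(papers) / df) for c, df in doc_freq.items()}

    def similarity(p1, p2):
        """IDF-weighted Jaccard overlap of the two papers' concept sets."""
        c1, c2 = papers[p1], papers[p2]
        total = sum(idf[c] for c in c1 | c2)
        return sum(idf[c] for c in c1 & c2) / total if total else 0.0

    print(similarity("arXiv:1001.0001", "arXiv:1001.0002"))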

c. Subjects of research papers. Modern identification of the subjects of research papers is typically based on rigid, ad hoc predefined schemes (be it the Physics and Astronomy Classification Scheme (PACS)15, the Mathematics Subject Classification16, the ACM classification system in computer science17, or any other research field or journal). Such categorizations are often simplistic, do not represent the full spectrum of a research field at any point in time, and are thus essentially useless. On the other hand, if a comprehensive algorithm for classifying research papers were found, it would provide a significant advance in the organization of scientific publications (or in the task of any other classification of research documents).

d. Finding an expert for a topic. Finding the best expert for a given topic is very important in many different situations (interdisciplinary research, understanding or interpreting research results, collaborations, etc.). In the literature there are essentially two main approaches to the expert finding problem: 1) building author profiles and searching over those, or 2) retrieving papers first and then ranking the authors of those papers [23]. This is of course highly unsatisfactory.18

15 http://aip.org/pacs
16 http://www.ams.org/mathscinet/msc/msc2010.html
17 http://www.acm.org/about/class/
18 For example, imagine a narrow specialized community with several dozens of researchers. A paper with, say, 50 citations would then mean that it is very "influential".

It is clear that the correct way to rank experts is by considering all available relations between the concepts identified in the documents, their semantic neighborhoods, the authors who contributed to the documents, citation graphs, etc. This requires an analysis of the structure of the concepts-papers-authors networks (Figure 5). The theoretical aspects of such an analysis will be explored in sub-project SP4 (Section 7 on page 33), using lineage and trust information from SP2 and SP3. Based on the ranking functionalities of SP2, we aim at developing a method to identify the "expert on a given subject" by taking advantage of those pieces of information. Arnetminer.org [120] is a related system whose goal is to mine academic social networks, also by means of author ranking. Compared to this system, the present approach will take advantage of complex knowledge networks to achieve semantically much better results.
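As a toy illustration of ranking over the full concepts-papers-authors structure (the node names, weights and the use of personalized PageRank are our own assumptions for the sketch, standing in for the SP2/SP4 machinery):

    import networkx as nx

    # Toy concepts-papers-authors network (cf. Figure 5).
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("c:Dark matter", "p:1", 1.0), ("c:Dark matter", "p:2", 1.0),
        ("c:Gauge field", "p:2", 1.0),
        ("p:1", "p:2", 0.5),                       # citation
        ("p:1", "a:Alice", 1.0), ("p:2", "a:Alice", 1.0),
        ("p:2", "a:Bob", 1.0),
    ])

    # Personalized PageRank seeded at the query concept spreads relevance
    # through papers (and citations) to the authors behind them.
    seed = {n: 1.0 if n == "c:Dark matter" else 0.0 for n in G}
    scores = nx.pagerank(G, alpha=0.85, personalization=seed, weight="weight")

    experts = sorted((n for n in G if n.startswith("a:")),
                     key=scores.get, reverse=True)
    print("experts on 'Dark matter':", experts)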

4.3 Schedule and milestones

SP1 will continuously integrate the various methods and techniques developed within the other sub-projects, verify those methods and algorithms, and provide feedback.

After the first year of the project, we plan to provide a new ScienceWISE release, with a first set of enhanced functionalities based on the activities of SP2, SP3 and SP4, along with a first version of the large-scale reasoning component.

The second year will be devoted to the further expansion of the functionalities and to the analysis of the users' feedback, usage statistics, and algorithms to find experts in a given context. The previous developments implemented within the system will bring data of a new quality, which will be funneled to sub-projects SP2, SP3 and SP4.

The third year will bring together all the main developments of the project and should eventually lead to a first robot-expert prototype.

Month 4: Development of the composite-concept representation for the scientific ontology.

Month 8: Development of the recommender system based on proximity and connectivity measures in the author-concept-paper multi-network.

Month 12: Major release of the ScienceWISE system, incorporating the new functionalities.

Month 16: Determining the contexts of citations. Creating a more detailed multi-network of scientific papers, authors and ontology, using the internal structure of research papers.

Month 20: Implementation of scalable reasoning within the ScienceWISE system.

Month 24: Major release of the ScienceWISE system, incorporating the new functionalities.

Month 28: Clustering of research papers within the ScienceWISE ontology. Author matching based on the clustering. Providing author disambiguation capabilities for ArXiv.org.

Month 30: Finding an expert for a topic.

Month 32: Analysis of the structure of mathematical equations and equation matching between papers.

Month 36: New release of the ScienceWISE system, incorporating the new functionalities. Summary of the results.

4.4 Personnel

We ask for one PhD student, to be employed at Leiden University and working on equation search and mathematical content matching (the position is financed in Leiden in compliance with the regulations laid out in Article 8, Chapter 2 of the "Regulations of the Swiss National Science Foundation on research grants" of 14 December 2007), and for a postdoc working on various aspects of the semantization of the ScienceWISE scientific data. We also ask for funding of one full-time engineer position at EPFL, in order to ensure the sustainability of the ScienceWISE system for the duration of this SINERGIA project. Apart from this, we ask for continuous funding of one intern position throughout the project. Our previous experience (supporting the infrastructure development of the ScienceWISE system in 2009–2011) shows that interns (students at the Master's level, willing to enter the PhD programme and considering this work as a trial period) are very well suited for this work, which combines development with computer science-related research. Out of the 5 interns that participated in our project, two are currently PhD students, one is an engineer on a project at ETHZ, and one is entering graduate school this year.


5 Sub-Project 2: Self-Organizing Knowledge Integration and Enrichment

5.1 Interaction with the Overall Project

Integrating and organizing all pieces of knowledge inside the system is essential for the overall project, given the inherent heterogeneity and lack of consistency of the data considered, which is either automatically extracted from research papers, imported from external knowledge bases, or created by arbitrary (and potentially conflicting) experts. The goal of this sub-project is to design new algorithms and methods to automatically relate and interconnect all pieces of information inside our platform. In that sense, it will take advantage of the data gathered by the first sub-project (Knowledge Elicitation), and will organize the pieces of data into an integrated knowledge graph that can be meaningfully analyzed by the third and fourth sub-projects (Trust in Discoveries, and Multilevel Networks and Clustering) and by the search and ranking applications built on top of our system.

5.2 Interaction with Current SNSF Project

Prof. Cudre-Mauroux was awarded a SNSF Professorship Grant in 2010 (PP00P2 128459: Infrastructures for Community-Based Data Management). The output of the first sub-project of his Professorship grant (Data Storage), such as the diplodocus[RDF] storage system [137], will be used in this project to store the various pieces of knowledge of the ScienceWISE platform. Some of the results of his second sub-project (Knowledge Abstractions), such as the generic entity linking framework [49], will be used as a foundation for designing new models in the context of this project (i.e., for scientific data). Thus, the present project can be considered as an ideal expansion of his previous project to a different field, and as a means of solidifying his presence and collaboration in the Swiss research landscape.19

5.3 Detailed Research Plan

This subproject revolves around four main tasks: the (semi-)automated curation of the ScienceWISE ontology, the integration and self-organization of the instance data (knowledge subgraphs) created by the Knowledge Elicitation component, the automated tracking of all operations applied to the data (semantic lineage), and the ranking of concepts and entities. Overall, the main goal of this subproject is to meaningfully consolidate the various knowledge sub-graphs that are created by the first sub-project into an integrated knowledge graph that serves as a basis for the subsequent sub-projects. Each of these tasks is described in more detail in the following.

5.3.1 Ontology Curation

The main objective of this task is to provide a uniform interface for querying heterogeneous but semantically related pieces of data inside the system. Given the decentralization of our context, and the diversity of the various scientific sources from which knowledge is extracted, it is highly unlikely that all pieces of information stored in our system will be based on the same ontology, even if some common data model (say, RDF/S or OWL) is supported by all. In practice, the knowledge elicitation sub-project creates knowledge subgraphs from the analysis of research papers, the input of the scientists, and the partial import of external sources (such as scilink.com). Those subgraphs can each refer to some common ontology for a subset of the data, but typically introduce new terms of their own, which may or may not be axiomatized (i.e., formally defined) depending on their origin.

The aim of this task is three-fold: first, to relate semantically overlapping but syntactically different concepts (e.g., "journal publication" in one ontology and "article" in a second ontology); second, to analyze the resulting graph of concepts and ontologies and to merge identical concepts or even get rid of unneeded ones whenever necessary; finally, to identify and axiomatize those concepts that need a more formal description, for example by automatically generating OWL descriptions of the concepts or by requiring some manual input. This process is iteratively repeated every time new concepts or ontologies are entered in the system by the first sub-project.

We get as a result not one but a complex and evolving network of ontologies, where concepts are related using formal constructs (such as "owl:sameAs", "owl:unionOf" or "owl:subClassOf"). This network can however be treated as an integrated conceptualization, and can be used as a single entry point for all semantic searches and queries in the system using iterative query rewriting mechanisms [11, 91]. This ontology curation process is illustrated in Figure 8 (bottom part) below.

19 Prof. Cudre-Mauroux called the SNSF last year to confirm those points.

[Figure 8 panels: Schema-Level (ontology); Instance-Level (entities); LOD Open Data Cloud]
Figure 8: Knowledge integration: semi-automated processes continuously take care of integrating and curating the ontology (bottom part of the knowledge graph), as well as of relating similar entities and linking them to external knowledge bases (upper part of the knowledge graph, where the component at the extreme right represents knowledge from third-party sources).

State of the Art in Ontology Curation. Ontology or schema matching, i.e., relating some concepts in one ontology to related concepts in a second ontology, is an active area of research [55, 108]. A number of interesting and automated approaches exist to relate concepts, using various edit-distances, taxonomic analyses or networked approaches [95]. The situation is very different for ontology management, ontology cleaning or concept merging [74], which are today still mostly handled in an ad-hoc and manual way. In that sense, the current project will go well beyond the state of the art by proposing and evaluating semi-automatic and self-organizing ontology cleaning methods. The idea of having a network of ontologies treated as a single conceptualization using query rewriting relates to the relatively recent developments of Peer Data Integration techniques and Peer Data Management Systems. A Peer Data Management System (PDMS) [8, 26, 68] is a distributed data integration system providing transparent access to heterogeneous databases without resorting to a centralized logical schema. Instead of imposing a uniform query interface over a mediated schema, PDMSs let the peers define their own schemas and allow for the reformulation of queries through mappings relating pairs of schemas. PDMSs typically exploit the schema mappings transitively in order to retrieve results from the entire network [16, 69, 92]. The formal methods used to define the mappings in PDMSs may vary, but are typically derived from older view-based data integration techniques (e.g., LAV, GAV, or GLAV formalisms [60]).

Methods. The first part of this task, ontology matching, will reuse state-of-the-art techniques in that domain [55], but will combine them using a systematic and formal framework. The combination will be based on established techniques to integrate concept mappings [53], but will do so using a probabilistic framework [39] to encode all automatically-generated mappings in a formal way that can then be leveraged for semantic search functionalities (see below) or by the third sub-project when assessing the trust of a new concept or discovery linked to a conceptualization. All mappings created in this step that are deemed uncertain (with a probability below a given threshold) will be sent to expert scientists for verification.
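A schematic sketch of this routing step (the matcher names, weights and verification threshold are illustrative assumptions, not the project's calibrated framework):

    # Combine per-matcher similarity scores into a single mapping probability
    # and route uncertain mappings to expert verification.
    MATCHER_WEIGHTS = {"edit_distance": 0.3, "taxonomy": 0.3, "instances": 0.4}
    THRESHOLD = 0.8  # mappings below this probability go to an expert

    def mapping_probability(scores):
        """Weighted combination of per-matcher scores, each in [0, 1]."""
        return sum(MATCHER_WEIGHTS[m] * s for m, s in scores.items())

    def route(concept_a, concept_b, scores):
        p = mapping_probability(scores)
        if p >= THRESHOLD:
            return ("store", concept_a, concept_b, p)  # kept as probabilistic mapping
        return ("verify", concept_a, concept_b, p)     # queued for expert review

    print(route("journal publication", "article",
                {"edit_distance": 0.2, "taxonomy": 0.9, "instances": 0.95}))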

Concept curation will be based on semi-automatic methods. The various mappings created in the preceding phase will be ranked [61]. Then, candidates for merging will be identified; candidates will typically be pairs of concepts, one well-defined and one poorly-defined and rarely used, where the latter could advantageously be replaced by the former. Also, superfluous concepts that were extracted from papers or defined in one of the imported ontologies, but that are deemed uninteresting or are not used, will be marked as candidates for deletion. All those suggestions will be presented to expert scientists for inspection.

The semi-automated axiomatization will be instance-based; we plan to statistically analyze the data to infer plausible axioms, and to add those new axioms to the knowledge base. The set of axioms that will be generated in this way will be chosen from one of the OWL 2 sub-profiles, in cooperation with the first and third sub-projects, to ensure large-scale reasoning and trust management capabilities. Also, we will use the various mappings between the concepts to port axioms from one subgraph to the other. We will use the crowd-sourcing approach when conflicts arise (for example if new data not conforming to a newly-generated axiom are entered, or if new conflicting conceptualizations are imported).

5.3.2 Entity Disambiguation and Enrichment

In our context, where data can be created, extracted, or imported by arbitrary parties, determining whether or not two scientific resources refer to the same object (or referent) is of utmost importance. The rapid multiplication of distinct identifiers referring to the same referent (both inside and outside of our system), and the misuse of identifiers out of the original context in which they were defined, are currently two central problems impeding the correct development of the Web of data [43]. Three specific solutions will be investigated to tackle those issues in the context of this sub-project: subgraph matching, taking as input the knowledge sub-graphs created by the knowledge elicitation part and trying to find matching pairs; networked disambiguation, which will analyze the resulting mappings from a global perspective to spot and potentially amend inconsistencies in the networks of related entities; and entity enrichment, matching local entities to their online counterparts (see below for details).

State of the Art in Entity Resolution and Enrichment. Entity disambiguation (also called entity resolution) techniques have been studied extensively in the past few years (see [85] or [86] for recent tutorials) and consist in providing logical integration solutions at the instance (or entity) level, for example by associating different pieces of data relating to the same object. In most solutions, a metric is first proposed to capture the similarity between pairs of entities. Machine learning or lexicographic algorithms are then used to determine whether a pair of entities should be matched or not according to the metric. Several recent pieces of work follow this classical approach in the context of Web data. Jaffri et al. [80], for example, recently investigated entity disambiguation in two popular portals (DBLP [70] and DBpedia [19]) and found that a significant percentage of online entities were either conflated or incorrectly linked. Tackling this problem at Web scale or for very large knowledge bases is today an important research issue. Entity enrichment (i.e., linking local entities to third-party entities) is also gaining popularity with the increasing importance of the Linked Data movement.20 DBpedia Spotlight [96] or Wikipedia linkers (e.g., [97]) are recent examples of tools creating links to particular online knowledge sources.

20 http://linkeddata.org/

Methods The first task, subgraph matching, will take advantage of the knowledge subgraphs defined above in the first sub-project (and potentially also of similarity links imported from other sources) to take into consideration the neighborhood of data around a given piece of information being analyzed. We expect the scope of this process, i.e., the number and type of links to be followed in the knowledge subgraphs, to be specified declaratively in the query (only relevant information should be retrieved in this manner). Based on this and on third-party resources, we will create links between related entities in an automated, bottom-up fashion by running graph isomorphism queries [41] on pairs of potentially related knowledge subgraphs.
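
A minimal sketch of this bottom-up step, assuming the knowledge subgraphs are materialized as labeled networkx graphs (the node-label convention is our own illustration, not the system's actual data model):

    import networkx as nx
    from networkx.algorithms.isomorphism import GraphMatcher

    def candidate_matches(g1, g2):
        """Yield node mappings under which g2 is isomorphic to a subgraph of g1;
        each mapped pair of nodes is a candidate equivalence link."""
        gm = GraphMatcher(
            g1, g2,
            node_match=lambda a, b: a.get("label") == b.get("label"))
        yield from gm.subgraph_isomorphisms_iter()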

The second research challenge tackled in this context will deal with system-scale automatic entity disambiguation based on the links relating some of the entities, as created above. The notion of graph-based disambiguation was introduced by our idMesh [43] approach. idMesh takes advantage of the transitivity of equivalence links (such as owl:sameAs [94]) to disambiguate pairs of entities. The focus of this task will be to identify the potential links that could be created in our scientific infrastructure context, and to extend the idMesh approach to those types of links whenever possible.
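
The transitivity argument can be illustrated with a simple union-find pass over equivalence links (a toy sketch only; idMesh itself additionally weighs conflicting and uncertain links rather than trusting them all):

    def sameas_clusters(links):
        """links: iterable of (id1, id2) owl:sameAs pairs.
        Returns groups of identifiers presumed to share one referent."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for a, b in links:
            parent[find(a)] = find(b)  # union: equivalence is transitive

        clusters = {}
        for x in list(parent):
            clusters.setdefault(find(x), set()).add(x)
        return list(clusters.values())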

Finally, the third research challenge of this task, entity enrichment, will try to match local entities to their online counterparts, for example by taking advantage of the data freely available on the Linked Open Data cloud. We plan to tackle this problem by first manually identifying subparts of the Linked Open Data cloud that are interesting in our context, and by building indices to summarize those subparts locally. Then, we will use a combination of algorithmic and crowdsourcing techniques [49] to semi-automatically and iteratively create links to third-party knowledge bases, and to assess their quality dynamically.

5.3.3 Semantic Versioning and Lineage

As scientific information can be created, manipulated, and re-published iteratively by arbitrary experts in our system, keeping track of the various data operations and data versions is essential. Keeping track of the various versions of the pieces of data and conceptualization, as well as of the origin of the data and of the various operations applied to it, is for example a prerequisite for many trust-related operations (see sub-project 3). Although standard versioning and lineage techniques (e.g., [13, 79]) could be used, they would generate an enormous amount of additional data and metadata in our context. Choosing the right versioning and lineage abstractions and techniques, taking into account both the input generated by the knowledge elicitation component and the requirements defined by the third and fourth sub-projects, will constitute a significant research challenge.

20 http://linkeddata.org/

Related Techniques Data versioning has a long history in computer science, and several practical solutions exist to manage the various versions of a given piece of data. However, as our recent work shows [116], those solutions are mostly efficient for textual data and are inadequate in a scientific data context. Data lineage (also called data provenance) pertains to the storage and querying of the logical history of the data, starting from its original source. It is especially important in open environments where data can be discovered, modified, or mashed up by arbitrary parties. There exist today many different ways to model data lineage [117]. Lineage is typically recorded about the data itself, but can also be directly related to the processes or workflows operating on the data [138]. The granularity at which lineage data is collected can vary widely. While many approaches generate lineage for individual tuples or values [13, 135], others record provenance on statistical aggregates [104] or even on entire data sets [58]. There is currently no standard format for representing lineage information across disciplines. Recently, a few efforts (such as the Open Provenance Model [99]) suggested graph models to encode lineage information in a generic way. Today, however, most lineage information is still encoded using application-specific formats and tools, such as FITS headers in astronomy [134], GenePattern [110] in computational biology, or Vesta [76] in software development.

Methods The first research topic tackled in this context revolves around the creation of a formal model to version the pieces of data (entities, concepts, and links) in our system. We plan to base this versioning model on knowledge subgraphs (such as RDF molecules [137]) and to design algorithms to identify such subgraphs automatically in our local knowledge base. The second part of this research effort will revolve around the definition of efficient and effective compression methods to compactly encode series of related knowledge subgraphs. Specific delta-compression schemes for our knowledge subgraphs will hence be researched in this context. They will be conceptually related to our recent efforts on array data [116], albeit focused on a very different data type (sets of subgraphs instead of arrays).
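
The delta idea itself is simple; a minimal sketch, assuming each version of a knowledge subgraph is a set of RDF-like triples (the representation and names are illustrative):

    def make_delta(old_triples, new_triples):
        """Encode a new version as its difference from the predecessor."""
        old, new = set(old_triples), set(new_triples)
        return {"added": new - old, "removed": old - new}

    def apply_delta(old_triples, delta):
        """Reconstruct the new version from the old one plus the delta."""
        return (set(old_triples) - delta["removed"]) | delta["added"]

    v1 = {("paper:42", "mentions", "concept:DarkMatter")}
    v2 = v1 | {("paper:42", "cites", "paper:57")}
    delta = make_delta(v1, v2)
    assert apply_delta(v1, delta) == v2  # lossless round trip

Storing only such deltas (rather than full snapshots) keeps the versioning overhead proportional to the actual changes; the research question is how to do this efficiently for long series of overlapping subgraphs.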

The second part of this task will be devoted to the formal definition of a dedicated scientific lineage format for our system. We will initially base our efforts on the Open Provenance Model [99], but will derive a new version of that model, customized for our data types. More precisely, we plan to focus our lineage model on the concepts of scientist, scientific data, scientific concept, scientific process, and scientific discovery, in order to capture lineage information that can then be reused by the third sub-project (Trust in Discoveries).

5.3.4 Semantic Ranking and Semantic Search

In ScienceWISE, we will leverage the large-scale network of semantic data created and consolidated using the techniques described above to design a novel end-user experience based on the available linked data. Specifically, we will develop entity-centric techniques for ranking and diversifying the content relevant to a given user or query.

The workflow of current search engines consists roughly of the following steps: crawling, indexing, retrieval, and ranking. In order to go beyond the current state of the art, next-generation search engines will need to perform a deeper analysis of the available content in order to present the user with a more structured and comprehensive search result that, more than a ranked list of links, better answers the user's information need.

Instead of merely exploiting the textual content of the scientific documents, we plan to take advantage of our integrated knowledge graph to leverage structured information about semantic resources in our system. Thus, search ceases to be about mere documents; rather, it becomes an interface for developing solutions for contextualized user goals about any type of entity.

Related Techniques Semantic Search is a relatively novel research area. Search over very large document collections is traditionally performed by creating inverted indices from textual data and by ranking relevant documents using, for instance, term and document frequencies (TF-IDF) [22]. Sindice [123] is a scalable and efficient approach to create simple indices over large collections of Semantic Web data. Beyond those indexing methods, new methodologies are starting to appear that take advantage of knowledge bases in the search process, such as the ontology-based interpretation of keywords for semantic search [122], the use of shallow semantics to improve Web search [21], or hybrid approaches combining both keyword search and ontology-based search [27].

Semantic Ranking and Diversification Knowledge and its articulation are strongly influenced by the diversity in academic backgrounds, schools of thought, experimental approaches, and temporal contexts. Judgments and assessments, which play a crucial role in many areas of academic research, including physics and computer science, reflect this diversity in perspectives and goals.

As for the information available and the way it is expressed and structured, diversity is the reason for diverging viewpoints and conflicts in the available data. Clearly, to build upon and use the vast amount of scientific data, which increases at an incredible rate, this diversity has to be taken into account to achieve a deeper understanding and a reliable interpretation of the available information and knowledge.

In this task we will exploit the outcome of the distributed reasoning and integration, where different aspects of the available data are highlighted. The goal of ranking and diversification is to make it possible for the user to identify all those different aspects (for example, by automatically following relevant links iteratively in the knowledge graph) without losing any possible perspective and without requiring the user to look at all of the available data.

Starting from current ranking and diversification algorithms, which are mainly available for unstructured web search, we will develop novel techniques to diversify the available content in ScienceWISE. Such approaches will exploit the ontology-based annotations and the results from the clustering and aggregation steps. Diversified results will be ranked according to the size, the quality, and the origin of the diversified knowledge subgraphs. Finally, a ranked list of diversified results will be presented to the end-user.
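
One classical instance of such diversification in flat web search, which we could take as a starting point, is maximal marginal relevance (MMR); a sketch, with relevance and similarity treated as given functions over knowledge subgraphs (assumed inputs, not components defined here):

    def mmr(candidates, relevance, similarity, k=10, lam=0.7):
        """Greedy maximal-marginal-relevance selection: trade off each result's
        relevance against its redundancy w.r.t. results already chosen."""
        selected, pool = [], list(candidates)
        while pool and len(selected) < k:
            def score(c):
                redundancy = max((similarity(c, s) for s in selected), default=0.0)
                return lam * relevance(c) - (1 - lam) * redundancy
            best = max(pool, key=score)
            selected.append(best)
            pool.remove(best)
        return selected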

Semantic Search over Large-scale Structured Data The ranking and diversification results from the previous step will be exploited to create novel semantic search functionalities. ScienceWISE will allow users to search for entities as well as for information, experimental findings, scientific publications, and any other related concept.

In addition to keyword or ontological search, we plan to base our semantic search functionalities on novel graph isomorphism and subgraph matching techniques (for example, to find relations between various diversified knowledge subgraphs). Also, transitive properties in the knowledge graph will be used at this stage to aggregate all interesting results relating to a given query using efficient reachability (i.e., transitive closure) queries [41].
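
Such a reachability query reduces to a graph traversal; a minimal breadth-first sketch over an adjacency-list representation (illustrative, ignoring the indexing techniques a real system would need at scale):

    from collections import deque

    def reachable(graph, source):
        """graph: dict mapping a node to its neighbors for one link type.
        Returns all nodes reachable from source, i.e., one row of the
        transitive closure."""
        seen, queue = {source}, deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen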

The final goal of such semantic search applications is to enable easy and fast access to the large amount of linked data which is being created and integrated by scientists with the support of the system. We will allow scientists to search for such data in a user-friendly way, leveraging our recent advances in large-scale entity search techniques [121]. Our application will allow scientists to obtain up-to-date information about scientific findings, novel research directions, and open research problems that still need to be investigated. In this way, ScienceWISE can also be seen as a provider of recommendations for scientists about the most promising activities to follow and to work on.

5.3.5 Performance Evaluation

The evaluation of the methods developed in this sub-project will be based on standard metrics in information retrieval and data management. We will use precision and recall metrics, as well as relevance analyses based on a gold standard, to evaluate the effects of ontology curation, entity enrichment, and semantic search and ranking in our system. The baseline will be measured on the knowledge subgraphs that are created by the knowledge elicitation part: queries will be issued on those sub-networks first, and then on our integrated and enriched networks. The performance of our versioning method will be assessed by taking into consideration the overhead incurred by the method (i.e., the CPU overhead for compressing and decompressing data) as well as its performance in terms of compression ratio. The evaluation of the lineage framework will be based on its granularity (i.e., what types of data are taken into account), its expressivity (what types of operations can be captured), and the overhead (CPU, disk) incurred by the framework. We will use data from the current incarnation of ScienceWISE for our experiments. For the search and integration parts, a set of queries and a gold standard of answers will be created from a subset of the concepts and data available in the system.
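
For concreteness, the core retrieval metrics reduce to set comparisons against the gold standard (a routine sketch):

    def precision_recall(retrieved, relevant):
        """retrieved: results returned for a query; relevant: gold-standard answers."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall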

5.4 Schedule and Milestones

The important milestones for this sub-project can be summarized as follows:

Month 3: Ontology matching and curation algorithms


Month 8: Semi-automatic axiomatization framework

Month 10: Graph-based entity matching and disambiguation algorithms

Month 16: Prototype of entity disambiguation + entity enrichment using third-party resources

Month 18: Semantic versioning model

Month 22: Prototype of semantic versioning + lineage

Month 26: Semantic diversification and ranking algorithms

Month 32: Prototype of semantic search

5.4.1 Personnel

This subproject will be carried out mainly by one postdoctoral researcher and one Ph.D. student. The postdoctoral researcher will focus on the formal models (i.e., the semi-automated axiomatization model, the model for scientific data lineage, and the graph matching and graph enrichment frameworks), while the Ph.D. student will tackle algorithmic issues and will be responsible for most of the performance evaluation tasks.


6 Sub-Project 3: Trust in Discoveries

Establishing trust is key to enabling cooperation, in particular in open environments such as crowd-sourcing systems. The basic idea underlying the notion of trust is that of an expectation about a certain behavior of another agent [20]. This idea has been adopted in the area of computational trust [93, 112].

The question of trust has received significant attention in many recent information systems, such as social networks, e-commerce, Web services, and peer-to-peer content sharing, and has resulted in a rich body of results on computational trust. In particular, reputation-based trust, where the trustworthiness of agents and contents is evaluated based on past experience, has become a common paradigm in that field. Reputation determines trustworthiness within a specific context by revealing hidden quality and behavior [14].

Scientific research is very specific with respect to the issue of trust. Trust is essential (and to some extent implicit) within the scientific community. When assessing the correctness of scientific results, it is typically not assumed that a deliberate fraud (data falsification, misrepresentation of the results, etc.) has occurred. Cases of deliberate scientific fraud and data falsification (such as the “Schön scandal”21, which resulted in the withdrawal of about 25 papers from leading research journals, including Science and Nature) are therefore sufficiently rare and generate a strong discussion in the scientific community. However, it is increasingly understood that in the “publish or perish” epoch such frauds may become more and more widespread (maybe at a lesser scale than in Schön's case) and that the scientific community should not be left unprotected against such abuses [124].

In the context of scientific research, trust management aims at evaluating the trustworthiness and reputation of the scientists claiming the results, rather than of the results themselves. The reputation of scientists is usually established through citation analysis [78], or even through their position and seniority level.

The main method of evaluating the trustworthiness of a specific scientific result, e.g., a scientific publication or experiment, is based on the peer review system and on the principle of reproducibility. Both of these principles have their limitations.

The peer-review system (or implicit peer review by experts) is supposed to check that the work is free from “basic errors” (mathematical or logical) and strives to answer (to the best of the referee's knowledge) whether all the assumptions, explicitly or implicitly specified in the work, are correct and consistent with each other and with the other results the referee is aware of. The peer-review system has been under continuous fire in recent decades (see, e.g., [32, 34] for reviews). However, even putting aside the “human factor” of the peer-review system, its intrinsic limitations are clear. The verification of almost any “beyond the state of the art” result may well be beyond the mental or experimental abilities of one person (or a small group). This is definitely true for most experimental works, but it can also be the case for purely theoretical, even mathematical, results – see the history of the verification of G. Perelman's proof of the Poincaré conjecture.22

Recent efforts aim at finding new publication forms and formats that intensify the scientific discourse while decreasing the size of the unit of publication, which is becoming possible and increasingly attractive due to the possibilities of electronic publishing. For example, LiquidPub (http://project.liquidpub.org) is a project that aims to change the way scientific knowledge is produced, disseminated, evaluated, and consumed, using concepts such as liquid books supporting multi-author collaboration and instant communities for knowledge sharing.

The reproducibility of scientific results has its own drawbacks as a method of verification. Some experiments are extremely difficult to replicate: most experiments in high-energy particle physics (such as the LHC experiments) because of their enormous costs, or experiments in some areas of mesoscopic physics where highly refined expertise is required.

Credibility of generated knowledge. Most importantly, so far no formal methods have been investigated to evaluate the trustworthiness of a scientific discovery per se (involving potentially many contributors, publications, and experiments). Until today, this has not been considered a major problem, given that specialists have (or believe they have) a good overview of their field and are able to assess the trustworthiness of a scientific discovery in a science community process.

21 See, e.g., Nature Physics 5, 451–452 (2009), doi:10.1038/nphys1316.

22 Three papers by G. Perelman were posted on arXiv in November 2002. Only in 2006 did Bruce Kleiner and John Lott, both of the University of Michigan, post a paper on arXiv that fills in the details of Perelman's proof of the Geometrization conjecture. John Lott said at ICM 2006: “It has taken us some time to examine Perelman's work. This is partly due to the originality of Perelman's work and partly to the technical sophistication of his arguments. All indications are that his arguments are correct.”


This may, however, radically change as scientific discoveries, as outlined earlier, are becoming complex phenomena, involving numerous people and experiments connected through complex workflows, whose complexity may evade the grasp of individual researchers and even of whole specialized communities.

Indeed, in the traditional “trust and reputation” use cases, users do not try to estimate the credibility of their own statements or conclusions themselves. In physics, by contrast, this is standard practice. Estimates of potential errors (both due to the statistical nature of data and due to limitations of the measurement procedure), probabilities that a model or a result is correct (“statistical significance”), hidden or explicit assumptions under which a statement is correct – all of this is part of normal work. However, as results (discoveries) become more and more complex, social aspects become increasingly important for the analysis of credibility as well. A combination of the two extreme approaches (the social one, with “blind” users, and the scientific one, with “smart” users but no social analysis) would be appropriate.

Therefore, procedures to support the evaluation of credibility would be extremely helpful. In this research we plan to exploit the rich body of semantic knowledge generated by ScienceWISE, starting from the existing body of work in computational trust, to progress towards this objective. We propose to

• develop formal methods to evaluate the trustworthiness of a scientific discovery by analyzing how such assessments are made today in scientific disciplines,

• compare these processes to other trust mechanisms, in particular those investigated in computational trust,

• and, based on this analysis, develop mechanisms that could support humans in assessing the trustworthiness of scientific discoveries in the future.

We expect that many of today's good scientific practices, such as the repetition of experiments, the conduct of independent experiments, peer review, etc., are precursors of such more formalized methods of trust assessment.

6.1 Interaction with the Overall Project

Modeling and integrating scientific knowledge in common ontologies is the basis for automating its semantics-oriented processing. From a pragmatics-oriented perspective, this opens the question of evaluating whether specific facts and statements, and in particular scientific discoveries, are credible, trustworthy, and can be relied on. This process of evaluating trust in science bears strong similarities with other problems of automated trustworthiness evaluation that have been studied in the area of computational trust in online systems. The goal of this sub-project is to leverage the opportunities offered by the creation of a knowledge platform, as envisaged by the overall project, and by the existence of a concrete instance of such a system, as provided by ScienceWISE, to investigate how methods of computational trust are applicable in the science context, and how they need to be further developed for the purpose of trust evaluation in science. The results of this sub-project will be an essential element in taming the complexity of evaluating trust in scientific discovery and will contribute concrete trust models and algorithms to the ScienceWISE platform. Concretely, the interactions with the other subprojects will concern the following issues:

• For subproject 1, we will ensure that Knowledge Elicitation does not bring false or non-credible information into the system. We will also have to ensure that the trust mechanisms are non-intrusively integrated at the system and user interface levels and are acceptable to the system users.

• For subproject 2, we will in particular jointly study the propagation of trust-related features during the process of knowledge integration and take advantage of data lineage mechanisms to track trust-related features. This reflects the fact that semantic alignment is inherently a social process requiring and producing trust [37].

• For subproject 4, we will jointly investigate which static and dynamic graph-related features would be indicators of the quality of discoveries in the semantic knowledge graph, both in a positive and in a negative sense. This extends existing computational trust methods for social network analysis to a completely novel context [51].

6.2 Detailed Research Plan

The research will be based on the existing body of work in computational trust for online systems. Central to this is the notion of reputation-based trust. Reputation determines trustworthiness within a specific context by revealing hidden quality (adverse selection, i.e., advertising a higher quality than the true one) [14] and hidden behavior (moral hazard, i.e., post-contractual opportunism) [17]. Reputation systems alleviate adverse selection by acting as signaling mechanisms towards products or services of high quality. They can also deal with moral hazard by acting as sanctioning mechanisms revealing the untrustworthy behavior of a participant, such that others will not be willing to transact with him. According to [51], reputation-based trust mechanisms can be classified into the following categories: social network formation, probabilistic estimation techniques, and game-theoretic models.

Social network formation implies the existence of a trust graph among entities, with link weights dynamically re-calculated and aggregated together for the calculation of end-to-end trust. Probabilistic estimation techniques calculate trust values as the output of probability distributions (or at least the most likely outcome) over the set of possible behaviors of the trusted agents [82, 100]. Game-theoretic trust mechanisms consider trustworthy behavior of agents as the result of rational behavior, e.g., a Nash equilibrium; see, e.g., [48, 87].
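
As one concrete instance of the probabilistic family (a textbook beta-reputation sketch, shown for orientation rather than as a committed design choice):

    def beta_reputation(positive, negative):
        """Expected trustworthiness under a Beta(p+1, n+1) posterior over the
        agent's behavior, starting from a uniform prior: E = (p+1)/(p+n+2)."""
        return (positive + 1) / (positive + negative + 2)

    # An agent with 8 positive and 2 negative past interactions:
    print(beta_reputation(8, 2))  # 0.75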

Building on these basic notions, we organize the research in the following steps:

6.2.1 Analysis of the concept of trust in the science context

We first compare today's computational trust mechanisms to current science practice. Trust in scientific results is a backbone for the progress of human society and an implicit assumption which makes our daily life possible. Indeed, the way trust in scientific discovery is built bears resemblance to the models of computational trust we discussed before. The system of peer review and other social mechanisms in scientific communities are the analog of social network formation techniques. The practice of assessing the statistical significance of a result and the potential errors in data is the analog of the probabilistic evaluation of reputation data in probabilistic estimation techniques. And sanctioning mechanisms for fraud in scientific communities can be well understood in the framework of game-theoretic trust mechanisms. Of course, the domains of online systems and science are sufficiently different that beyond analogies we also expect to find significant differences. Identifying those will be a first step in our research.

In this analysis we will in particular focus on typical mistakes and fallacies in today's science and on how they can be rapidly amplified by the increasing complexity of the scientific discovery process. We can observe several typical problems today that would be considered major mistakes in computational trust evaluation. For example, when reusing results from different studies, non-statistical (so-called systematic) uncertainties about the reused results are typically dismissed in the statistical analysis of the validity of an experiment, even in cases where they were carefully discussed in the original research paper. Moreover, in many fields, such as astrophysics, the very practice of estimating systematic uncertainties is very complicated and non-formalized. Thus, the interfaces in a complex scientific workflow are agnostic of some of the uncertainties. This would certainly be unacceptable in a probabilistic reasoning model for trust. Also, independence assumptions (e.g., among experiments) might not hold; statistical estimates might thus be biased, and their impact on future complex discovery processes becomes hard to estimate. Note that all these problems might not be perceived as severe today, as specialists and communities still feel competent to correct them through carefully designed social processes. But as discoveries become increasingly complex, these processes risk failing (similarly to how today's finance system starts to fail as a result of complexity and strong-coupling phenomena [115]). One notable example of this complexity in data and data processing is the increasing use of neural networks in analyzing the data of high-energy experiments (see, e.g., [5, 59]).

6.2.2 Identifying and tracking features relevant to scientific trust

The first step in developing a trust model is the identification of relevant features. In online science platforms we find several well-known sources that serve as reputation data, both for scientists and for scientific findings. These include in particular peer-reviewed publications (where the fact of acceptance is a strong indicator of the quality of the publication itself, assuming an underlying review process) and citation graphs (where a citation is considered as positive feedback or a recommendation on a publication). With a science knowledge platform such as ScienceWISE, we may expect to identify many other sources of reputation data, such as:

• Comments and annotations on papers and authors, as well as on experiment descriptions, others' comments, etc.

• Ontological annotations, which allow the correctness of statements to be validated through reasoning and consistency checking.

• Data on the statistical significance of experimental data and the quantification of error sources.

• Complex graph features that go beyond the analysis of citation or authorship graphs and include the semantic annotations, the relations among experiments, datasets, and results, and other content and relationships that are made available.


• Textual features obtained through analyzing the style and sentiment of texts [102].

These are just a few examples of what we might identify as useful reputation features. An important aspect will be to distinguish which of those are amenable to formal analysis (such as statistical features) and which are heuristic by nature, such as the features related to social interaction and natural language.

Once such features are identified, we will also analyze how they can be tracked using data lineage mechanisms. Such mechanisms have been investigated in the areas of data processing [83] and e-science [118], but typically limited to specific science workflows or data processing problems. In our case, lineage may span whole discovery processes involving large communities, a high variety of methods and models, and potentially long periods of time. This requires more generic and universally accessible mechanisms to track the lineage of scientific statements, similar to what has recently been proposed with the nanopublications concept [67].

6.2.3 Developing trust models for scientific discoveries

Based on our understanding of the role of trust in science and on the availability of pertinent reputation features, we will develop trust models that allow us to infer trust values from reputation data. The trust model aims to assess all factors of uncertainty from heterogeneous sources (e.g., uncertainty about participating researchers, experimental parameters, accidental errors, malicious behaviors).

Given the relative importance of statistics in modern data-driven science, the probabilistic models that have found wide application in computational trust are a natural first candidate. We will in particular consider the problem of how to deal with the propagation and aggregation of uncertainty across experiments in scientific discoveries, and with the independence (or correlation) of results in the scientific discovery process. Graphical probabilistic models have been applied to that end in many settings by us and others [107, 127, 130], and we will use them as a starting point. Since science is also a social agreement process, a second class of algorithms we intend to apply are social network-based and, more generally, graph-based analysis models to determine trust. Game-theoretic methods will be investigated only at later stages of the project, given that fraud still seems to be (easily?) contained by severe sanctioning mechanisms.
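
A toy Monte Carlo illustration of the independence issue discussed above (all numbers are invented): two "independent" measurements that in fact share a systematic error yield a combined estimate whose real spread exceeds the naive 1/sqrt(2) reduction.

    import random, statistics

    def combined_spread(shared_sys=0.5, stat=1.0, trials=100_000):
        """Average two measurements that share a systematic offset and
        report the empirical spread of the combination."""
        combos = []
        for _ in range(trials):
            sys_err = random.gauss(0, shared_sys)   # common to both experiments
            m1 = sys_err + random.gauss(0, stat)
            m2 = sys_err + random.gauss(0, stat)
            combos.append((m1 + m2) / 2)
        return statistics.stdev(combos)

    naive = (1.0 ** 2 / 2) ** 0.5          # ~0.71 if the runs were independent
    print(naive, combined_spread())        # observed spread is ~0.87 instead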

A particular aspect that makes this research interesting and challenging is that, in contrast to classical online reputation systems, science deals not only with large numbers of small transactions, but also with small numbers of large transactions (the scientific discoveries). Traditional computational reputation mechanisms have been designed with the former problem in mind (e.g., selling large numbers of products), and this case is also relevant to science (e.g., publishing large numbers of papers). But in science, the aggregate view is also of importance. For (big) scientific discoveries, establishing trust is not a matter of entering or not entering into a transaction (the research), but of assessing whether the aggregation of many trusted small research actions results in a trustworthy global result, and where potential problems in arriving at the result might occur. We may expect this difference to have a profound impact on the type of trust models and algorithms we consider.

Finally, in developing the trust models, we will also have to consider the problem of trust heterogeneity. As explained, trust in scientific discovery depends on many factors, and different computational trust mechanisms may be employed to assess these trust factors. This might result, on the one hand, in the requirement of integrating reputation data from many, potentially heterogeneous reputation sources, and, on the other hand, in the requirement of integrating trust values produced by many, potentially heterogeneous trust models. To illustrate, consider the problem of integrating the result of a citation graph analysis (e.g., the h-factor) with the statistical significance of an experiment. This problem is today completely open, apart from efforts on the syntactic integration of reputation data (e.g., the OASIS Open Reputation Management Systems (ORMS) technical committee, http://www.oasis-open.org/committees/orms) and an initial analysis performed in our group [128].

6.2.4 Evaluation

The evaluation will be performed using the ScienceWISE platform. It will require the following steps, which can be executed in several iterations, depending on the progress of the theoretical work and the data acquisition in ScienceWISE.

1. Identification of the available features that can be used as reputation data, and extension by new features. We can rely on the existence of authorship and citation data, feedback given by ScienceWISE users on ScienceWISE content, and features derived from the ontology annotation graph. These will be incrementally extended. In particular, we need to extract from scientific publications, or obtain through annotation, data on the statistical significance of experiments and on the dependencies among experimental results. The ScienceWISE system will also serve as the platform to analyze the provenance of scientific discoveries.

2. Identification of one or more scientific discoveries as case studies, and extraction of the related data from ScienceWISE. An example of such a discovery could be superluminal neutrinos, where the best efforts of professionals have not yet explained physically unacceptable outcomes, and the final decision has to be taken after assessing the quality and quantity of the available data. This data will potentially be complemented through targeted feedback from the expert community for baseline evaluation, which can be relatively easily collected using the ScienceWISE platform.

3. Offline experimentation with the data related to the case study. In this step we will compare different algorithms and their quality and performance, in order to identify suitable candidate algorithms to be deployed on the live ScienceWISE platform.

4. Development and live deployment of the selected algorithms. This step will allow us to gather additional real reputation data, such as user feedback, but also to evaluate the quality of the algorithms by directly or indirectly gathering feedback from the user community. The feedback will be used to further improve the trust models, but also to provide novel requirements to the other subprojects, e.g., on how to maintain data provenance or how to improve the science ontologies with concepts relevant for trust assessment.

6.3 Schedule and Milestones

This subproject will be carried out by one postdoctoral researcher and one Ph.D. student. The Ph.D. student will primarily develop and test specific trust algorithms, whereas the postdoc will focus on integrating this work into the overall architecture, on the design of case studies, and on the experimental work. The main milestones are the following:

1. Month 6: Paper on the analysis of the concept of trust in science.
2. Month 12: Set of reputation features identified and extracted.
3. Month 18: Algorithms for trust evaluation designed and evaluated offline.
4. Month 24: Deployment of trust algorithms on ScienceWISE.
5. Month 30: Data from real deployment gathered.
6. Month 36: Summary of results.

Naturally, we expect the development to be less linear than outlined in this schedule and to contain many iterative cycles. But in view of obtaining real experimental data, we will have to maintain this schedule for a selected subset of the problems and methods we consider in this research.


7 Sub-Project 4: Multilevel Networks and Clustering

The nature of our project not only requires expert insight from complex systems theory, but actually challenges the current state of the art of the scientific understanding of complexity. For this reason, a significant part of the project will be devoted to the study of novel theoretical frameworks for creating, integrating, assessing, and evolving complex knowledge graphs.

7.1 Interactions with the overall project

One of the objectives of this project is to develop large-scale automated semantic analysis, which in particular requires advancing our fundamental knowledge in managing complex heterogeneous interdependent networks. This sub-project aims to derive new knowledge graphs from the data generated by the sub-projects Knowledge Elicitation (SP1) and Knowledge Integration (SP2). The hidden connections discovered through those analyses and the ranking algorithms developed here can in turn be used by the integration and search parts of SP2, and by the trust analysis of the Knowledge Verification sub-project (SP3).

As a matter of fact, the ScienceWISE network (the main data structure provided by SP1) exhibits a complexity, but also a completeness, that is almost unprecedented in network theory, which will allow us to push the boundaries of the present concepts and techniques that are customarily used to understand and analyze networks. Indeed, the ScienceWISE data might represent one of the first amenable instances of a real complex system composed of multiple networks, each capturing the relations between entities belonging to a specific category; connections, moreover, can be of several different varieties even between the same entities. At the same time, entities in different networks are poised to show mutual relations, giving rise to multi-layered networks. Instances of such systems can be found across several disciplines: social networks of individuals connected by a variety of human relationships, dealing with goods and interests that are in turn connected by their own functional or cultural relations; biological networks of proteins physically interacting with each other, and apt to bind genes that are in turn mutually connected by shared functions; scientists connected by their collaborations, writing together papers that are in turn connected by bibliographic citations, and dealing with concepts in mutual semantic relations. The latter example is precisely pertinent to the present proposal and can be described as a three-layer network where the connections between authors are undirected, just as the ones between concepts, whereas the paper network is directed because journal articles are time-ordered (cf. Figure 5 on page 17). The major obstacle that currently prevents a comprehensive understanding of the structure and dynamics of such multiply connected systems is the lack of an adequate theoretical framework to study how different types of connections relate to and interact with each other. Even if a few pioneering studies of multi-layer networks already exist, they have generally applied standard techniques (developed for networks with a single type of links) to each of the network layers separately, or at most measured phenomenological correlations (and sporadically their consequences) between different layers. These approaches fail to capture the origin of the coupling between layers, and therefore cannot characterize the system dynamically or suggest generative (or predictive) models of it. From a seemingly unrelated perspective, real complex systems are also often characterized by a multi-level structure, where two or more structural levels interact with each other, giving rise to multi-scale structural and dynamical properties. Many approaches, including renormalization techniques, community detection methods, and other coarse-graining schemes, have tried to understand what can be inferred about the properties of one structural level from the knowledge of the properties of other levels. However, in networked complex systems, this type of analysis has been carried out mostly ignoring the multiply layered and connected nature of networks. In general, dealing with multiple layers/connections has been merely regarded as a sophistication of the simpler singly-connected case, and has therefore not received much emphasis when dealing with a multi-level description of complex systems.

ScienceWISE as a web of inter-related networks. Our suggestion is to dynamically map the ScienceWISE data and its flat knowledge networks onto evolving multilayered networks, where authors, papers, and concepts represent the three main levels (Figure 5 on page 17). Each level is a complex network in itself: authors are mutually connected if, for instance, they have coauthored papers; there are edges between papers according to bibliographic citations; concepts are linked by semantic relations. The author and concept networks are undirected, whereas the paper network is directed. A further degree of complexity is brought by the dynamic nature of the network, which grows by the constant addition of new authors, papers, and concepts: when a new paper is published and/or bookmarked (within ScienceWISE), it connects its co-authors with each other, and it creates links to the papers it cites and to the concepts relevant to it. This kind of multilayered network represents a new challenge in the science of complex networks, which must be tackled by using a multipronged approach and by carefully deploying novel descriptive indicators and analysis algorithms.

Figure 9: The inter-connected network of concepts and papers in ScienceWISE (for simplicity, the additional structures provided by the authors are not shown). All papers that mention a specific concept are organized into a graph, with the distance to the paper being proportional to the “relevance” of the concept in the paper. (The figure shows the graphs related to the concepts “Dark matter”, “Galactic Center”, and “Lepton number”.) As one zooms onto any paper that mentions all three of these concepts (“Dark Forces at the Tevatron”), one discovers the cloud of concepts associated with it (again, their distance encodes relevance). Co-occurrence within the same paper imposes a network structure on the concepts, different from the ScienceWISE ontology.

7.2 Detailed research plan

The detailed research plan will proceed along the following directions:

7.2.1 Structural features and null models

The goal of this part is to adequately describe the structural features of the ScienceWISE network, first by adapting the standard topological indexes (degree, clustering coefficient, assortativity index, etc.) to multi-layered networks, and in parallel by devising new quantities that are able to take into account at once both the intra- and inter-layer structural characteristics. The probability distributions of such quantities will then be used as constraints to build null models using, for example, multi-layer generalizations of the Configuration Model [98]. The time-dependent nature of the ScienceWISE network allows us to reconstruct the patterns of activity, and from that it will be possible to propose dynamical growth models able to correlate the macroscopic network structure with the microscopic dynamics of publishing scientists.
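
As a baseline that any multi-layer generalization has to extend, each layer can be randomized separately while preserving its degree sequence (a sketch using networkx; preserving inter-layer correlations, the actual research question, is deliberately not handled here):

    import networkx as nx

    def layerwise_null_model(layers, swaps_per_edge=10, seed=42):
        """layers: dict name -> nx.Graph.  Returns degree-preserving
        randomizations of each layer, obtained via double-edge swaps.
        Assumes each layer has at least four nodes and a few edges."""
        null = {}
        for name, g in layers.items():
            h = g.copy()
            nswap = swaps_per_edge * h.number_of_edges()
            nx.double_edge_swap(h, nswap=nswap, max_tries=20 * nswap, seed=seed)
            null[name] = h
        return null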

7.2.2 Scientific activity dynamics

Using the ScienceWISE system, we will study the joint temporal distribution of all types of events (bookmarking, publication, citation, appearance of new concepts, etc.). Recent results have highlighted the tendency of several complex systems to evolve through bursts of intense activity that interrupt longer periods of otherwise quiet activity [24]. In particular, in many cases the distributions of waiting times and of the intensity of activity are both found to be power laws, without a typical scale. However, these analyses have been carried out on independent time series. The possibility to study simultaneous time series of events (bookmarking, publication, citation, appearance of new concepts, etc.) on a system of which we also know the internal structure is almost unprecedented. We will study empirically the response of one aspect of our system to changes in other aspects, and test these observations against out-of-equilibrium dynamical models used in queuing theory, self-organization, etc. We will also extend and exploit advanced techniques that have been introduced in order to infer, from multiple time series, the separation between internal and external contributions to the dynamics of complex systems.
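
The basic empirical object in such an analysis is the inter-event (waiting-time) distribution; a minimal sketch of extracting it from one timestamped event stream (log-binned, as is customary when probing for power-law tails; the input format is assumed):

    import math
    from collections import Counter

    def waiting_time_histogram(timestamps, n_bins=30):
        """timestamps: sorted event times of one activity stream.
        Returns logarithmically binned counts of inter-event gaps."""
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:]) if b > a]
        if not gaps:
            return []
        lo, hi = math.log(min(gaps)), math.log(max(gaps))
        width = (hi - lo) / n_bins or 1.0   # degenerate case: all gaps equal
        bins = Counter(min(int((math.log(g) - lo) / width), n_bins - 1)
                       for g in gaps)
        return sorted(bins.items())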

We will develop a semantic and evolutionary approach to scientometrics. While most scientometric methods focus on quantitative patterns of scientific production (i.e., numbers of publications, citations, etc.), we will focus on content-based, out-of-equilibrium patterns of scientific discovery. In particular, we will focus on the ScienceWISE article annotation and semantic analysis system developed within the online article repository ArXiv.org. By considering the entire event history of ScienceWISE, we will look for equilibrium and out-of-equilibrium regimes in the system. In particular, we will identify key structural changes, aiming at the automated detection of patterns of scientific discovery, which is one of the main quests of ScienceWISE. Our approach will be tested against the data, using “objective” external information singling out major scientific discoveries, e.g., identifying publications leading to the award of Nobel Prizes.

7.2.3 Clustering and coarse graining

Being able to construct null models of the network is a stepping stone to extracting information from it. Indeed, to be of any relevance, detected features have to stand out of the background produced by null models (that is, networks with most of the same structural features as the real ones, but devoid of any relevant information content). One of the prominent techniques to squeeze signals out of a network structure is community detection [56]. In a nutshell, community detection, also known as network clustering, aims at discovering groups of nodes that are more tightly knit with each other than with the rest of the graph, the underlying assumption being that such groups gather vertices that share some common properties. Most community detection protocols proposed in the past can be applied only to single networks, or at most to the simplest layered networks, that is, bipartite graphs (a bipartite graph is a network where two different classes of entities are connected, without links within each class) [25, 88, 114]. As a consequence, we will have to devise new schemes able to deal with the full stratified nature of the ScienceWISE network and to assess the reliability of their outcome. An outstanding challenge that we will need to address in this context is, moreover, that authors and papers can belong to several different communities, so that novel clustering algorithms need to allow for this possibility (soft vs. hard clustering).
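
As the single-layer baseline that such new schemes will have to generalize, modularity-based clustering is readily available (a sketch using networkx; the multilayer, soft-membership variants are the open problem):

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def layer_communities(g):
        """Hard, modularity-based clustering of one network layer."""
        return [set(c) for c in greedy_modularity_communities(g)]

    g = nx.karate_club_graph()            # stand-in for one ScienceWISE layer
    print([len(c) for c in layer_communities(g)])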

A different but related approach is network coarse graining, whereby the size of the network is reduced by clumping together groups of nodes so that some relevant quantities are preserved in the process. Ideally, this technique should highlight the basic building blocks of the network and how they develop in time (in some sense seizing the relevant degrees of freedom of the network) [15, 63, 64]. Besides being a potentially powerful information retrieval tool, coarse graining promises to be useful for network visualization, as it allows one to zoom in and out of the network at will, making any graphical representation not only possible (given the network size and complexity, a node-level representation might be impractical), but also interactive (as shown, e.g., in Figure 9).

7.2.4 Authorities and information retrieval

Just as community detection and coarse graining reveal the intimate organization of the system, finding ways to exploit the network structure to improve the scientific workflow can result in game-changing procedures. We are first going to focus on finding the most authoritative scientists and/or papers related to a given concept. For this goal, which is tightly linked to recommendation systems, we can exploit the network structure itself by generalizing algorithms à la PageRank [30] to multilayered networks, alongside the various approaches proposed in SP3. Information retrieval is also going to play a crucial role in improving how scientists work. As an example, an author could start from a number of references that she aims to use for her work, and interrogate the network via its structure, rather than by keywords, to extract further relevant papers to consider and/or researchers to contact that had escaped her initial search (possibly due to different keywording or to non-intuitive connections at the level of the concepts), as in the examples in Figures 9 and 6.
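
The most naive generalization of PageRank to a multilayered network, useful mainly as a strawman against which genuinely multilayer variants can be compared, collapses the layers into one weighted graph (a sketch; the layer weights are free parameters of our illustration):

    import networkx as nx

    def aggregate_pagerank(layers, weights):
        """layers: list of nx.DiGraph over a shared node set; weights: the
        per-layer importances.  Ranks nodes by PageRank on the weighted union."""
        agg = nx.DiGraph()
        for g, w in zip(layers, weights):
            for u, v in g.edges():
                prev = agg[u][v]["weight"] if agg.has_edge(u, v) else 0.0
                agg.add_edge(u, v, weight=prev + w)
        return nx.pagerank(agg, weight="weight")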


7.3 Schedule and Milestones

7.3.1 Milestones

• Month 6: Null model for the multigraph
• Month 15: Model for the multigraph growth
• Month 20: PageRank-like algorithm to search the multigraph
• Month 24: Multigraph clustering algorithms
• Month 30: Multigraph coarse-graining and, possibly, multigraph visualization schemes
• Month 36: Summary of the project

7.3.2 Task assignment for the requested personnel

This subproject will be carried out by one postdoctoral researcher and one Ph.D. student. Initially, the postdoctoral researcher will mostly focus on the definition of structural indexes for multilayer networks and on the formulation of new clustering schemes, whereas the Ph.D. student will analyze the large-scale structure of the network and will implement the various algorithms. As the Ph.D. student's experience increases, he/she will become progressively more involved in the abstract part. Toward the end of the project, the postdoc will also explore the extent to which the progress made on the ScienceWISE network can be exported to other multilayer networks.

7.4 Overlap with other projects

P. De Los Rios currently has one project about the clustering and coarse-graining of networks, funded by the Swiss National Science Foundation. It will end in October 2013, so it will have no temporal overlap with the present proposal. Since that ongoing project focuses on the intrinsic properties of clustering algorithms, the present proposal fits nicely into the current research of Prof. De Los Rios.


References

[1] The preprint server http://ArXiv.org provides open access to preprints of research papers in the natural sciences, mathematics, and CS. In certain areas of physics and mathematics it contains close to 100% of all manuscripts.

[2] The CERN Document Server provides access to articles, reports, and multimedia content in HEP. It contains more than a million unique records. Accessible at http://cds.cern.ch.

[3] The NASA Astrophysical Data System provides a single interface for searching both the metadata and the contents of the full-text archive, which now totals over 2.5 million documents. Accessible at http://labs.adsabs.harvard.edu.

[4] A comprehensive bibliographic database in high-energy physics maintained by the CERN, DESY, FNAL, and SLAC collaborations. Accessible at http://inspirehep.net.

[5] T. Aaltonen et al. A Search for the Higgs Boson Using Neural Networks in Events with Missing Energy and b-quark Jets in p anti-p Collisions at √s = 1.96 TeV. Phys. Rev. Lett., 104:141801, 2010.

[6] K. Aberer, A. Boyarsky, P. Cudre-Mauroux, G. Demartini, and O. Ruchayskiy. An integrated socio-technical crowdsourcing platform for accelerating returns in eScience. 2011.

[7] K. Aberer, A. Boyarsky, P. Cudre-Mauroux, G. Demartini, and O. Ruchayskiy. ScienceWISE: a Web-based Interactive Semantic Platform for scientific collaboration. 2011.

[8] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. A Framework for Semantic Gossiping. SIGMOD Record, 31(4), December 2002.

[9] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. Start making sense: The Chatty Web approach for global semantic agreements. Journal of Web Semantics, 1(1), 2003.

[10] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. The Chatty Web: Emergent Semantics Through Gossiping. In International World Wide Web Conference (WWW), 2003.

[11] K. Aberer, P. Cudre-Mauroux, and M. Hauswirth. The Chatty Web: Emergent Semantics Through Gossiping. In International World Wide Web Conference (WWW), 2003.

[12] K. Aberer and Z. Despotovic. Managing trust in a peer-2-peer information system. In Proceedings of the tenth international conferenceon Information and knowledge management, CIKM ’01, pages 310–317, New York, NY, USA, 2001. ACM.

[13] P. Agrawal, O. Benjelloun, A. D. Sarma, C. Hayworth, S. U. Nabar, T. Sugihara, and J. Widom. Trio: A System for Data, Uncertainty,and Lineage. In International Conference on Very Large Data Bases (VLDB), 2006.

[14] G. Akerlof. The market for” lemons”: Quality uncertainty and the market mechanism. The quarterly journal of economics, pages488–500, 1970.

[15] A. Arenas, J. Duch, A. Fernandez, and S. Gomez. Size reduction of complex networks preserving modularity. New J Phys, 9:176, 2007.

[16] M. Arenas, V. Kantere, A. Kementsietsidis, I. Kiringa, R. J. Miller, and J. Mylopoulos. The Hyperion Project: From Data Integrationto Data Coordination. SIGMOD Record, Special Issue on Peer-to-Peer Data Management, 32(3), 2003.

[17] K. Arrow. The economics of moral hazard: Further comment. The American economic review, 58(3):537–539, 1968.

[18] A. Astafiev, R. Prokofyev, C. Gueret, A. Boyarsky, and O. Ruchayskiy. Sciencewise: A web-based interactive semantic platformfor paper annotation and ontology editing. ESWC2102, 3, May 2012.

[19] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. The SemanticWeb, pages 722–735, 2007. Available at http://dbpedia.org/.

[20] R. Axelrod. The Evolution of Cooperation. Basic Books, New York, 11 edition, 1984.

[21] R. A. Baeza-Yates, M. Ciaramita, P. Mika, and H. Zaragoza. Towards semantic search. In NLDB, pages 4–11, 2008.

[22] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston,MA, USA, 1999.

[23] K. Balog, L. Azzopardi, and M. de Rijke. Formal models for expert finding in enterprise corpora. In E. N. Efthimiadis, S. T. Dumais,D. Hawking, and K. Jrvelin, editors, SIGIR, pages 43–50. ACM, 2006.

[24] A. Barabasi. Bursts: the hidden pattern behind everything we do. Dutton, 2010.

[25] M. J. Barber. Modularity and community detection in bipartite networks. Physical Review E, 76(6):066102, 2007.

[26] P. Bernstein, F. Giunchiglia, A. Kementsietsidis, J. Mylopoulos, L. Serafini, and I. Zaihrayeu. Data Management for Peer-to-Peer Computing: A Vision. In International Workshop on the Web and Databases (WebDB), 2002.

[27] R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, and D. Petrelli. Hybrid search: effectively combining keywords and semantic searches. In Proceedings of the 5th European Semantic Web Conference (ESWC'08), pages 554–568, Berlin, Heidelberg, 2008. Springer-Verlag.

[28] A. Boyarsky, O. Ruchayskiy, Z. Yang, O. Zozulya, M. Charlaganov, and P. De Los Rios. From scientific papers to the scientific ontology: dynamical clustering of heterogeneous graphs and ontology crowdsourcing. Joint Workshop on Large and Heterogeneous Data and Quantitative Formalization in the Semantic Web (LHD+SemQuant 2012), 2013.

[29] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1):107–117, 1998.

[30] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Computer Networks and ISDN Systems, pages 107–117, 1998.

[31] A. Budura, P. Cudre-Mauroux, and K. Aberer. From bioinformatic web portals to semantically integrated data grid networks. Future Generation Computer Systems, 23(3):485–496, 2007.

[32] F. Casati, F. Giunchiglia, and M. Marchese. Publish and perish: why the current publication and review model is killing research and wasting your money. Ubiquity, 2007:3:1–3:1, January 2007.

[33] D. Che, Y. Chen, and K. Aberer. A query system in a biological database. In Eleventh International Conference on Scientific and Statistical Database Management (SSDBM), pages 158–167, 1999.

[34] G. Cormode. How not to review a paper: the tools and techniques of the adversarial reviewer. SIGMOD Rec., 37:100–104, March 2009.

[35] P. Cudre-Mauroux. Emergent Semantics. EPFL & CRC Press, 2008.

[36] P. Cudre-Mauroux and K. Aberer. A Necessary Condition For Semantic Interoperability in the Large. In Ontologies, DataBases, and Applications of Semantics for Large Scale Information Systems (ODBASE), 2004.

[37] P. Cudre-Mauroux, K. Aberer, A. I. Abdelmoty, T. Catarci, E. Damiani, A. Illarramendi, M. Jarrar, R. Meersman, E. J. Neuhold, C. Parent, K.-U. Sattler, M. Scannapieco, S. Spaccapietra, P. Spyns, and G. De Tré. Viewpoints on Emergent Semantics. Journal on Data Semantics, (VI):1–27, 2006.

[38] P. Cudre-Mauroux, K. Aberer, and A. Feher. Probabilistic Message Passing in Peer Data Management Systems. In International Conference on Data Engineering (ICDE), 2006.

[40] P. Cudre-Mauroux, S. Agarwal, and K. Aberer. GridVine: An infrastructure for peer information management. IEEE Internet Computing, 11(5), 2007.

[41] P. Cudre-Mauroux and S. Elnikety. Graph Data Management Systems for New Application Domains. PVLDB, 4(12), 2011.

[42] P. Cudre-Mauroux, J. Gaugaz, A. Budura, and K. Aberer. Analyzing Semantic Interoperability in Bioinformatic Database Networks. In International Workshop on Semantic Network Analysis (SNA), 2005.

[43] P. Cudre-Mauroux, P. Haghani, M. Jost, K. Aberer, and H. de Meer. idMesh: Graph-Based Disambiguation of Linked Data. In International World Wide Web Conference (WWW), 2009.

[44] P. Cudre-Mauroux, K. Lim, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. DeWitt, B. Heath, D. Maier, S. Madden, J. M. Patel, M. Stonebraker, and S. Zdonik. A Demonstration of SciDB: A Science-Oriented DBMS. Proceedings of the VLDB Endowment (PVLDB), 2(2):1534–1537, 2009.

[45] N. Dawes, K. A. Kumar, S. Michel, K. Aberer, and M. Lehning. Sensor Metadata Management and its Application in Collaborative Environmental Research. In 4th IEEE International Conference on e-Science, 2008.

[46] D. De Roure and J. Frey. Three perspectives on collaborative knowledge acquisition in e-science. In Workshop on Semantic Web for Collaborative Knowledge Acquisition, volume 20. Citeseer, 2007.

[47] E. Deelman, D. Gannon, M. Shields, and I. Taylor. Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Computer Systems, 25(5):528–540, 2009.

[48] C. Dellarocas. Sanctioning reputation mechanisms in online trading: Environments with moral hazard. MIT Sloan Working Paper No. 4297-03, 2004. Available at SSRN: http://ssrn.com/abstract=393043 or DOI: 10.2139/ssrn.393043.

[49] G. Demartini, D. Difallah, and P. Cudre-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In International World Wide Web Conference (WWW), 2012.

[50] G. Demartini, C. S. Firan, T. Iofciu, R. Krestel, and W. Nejdl. Why finding entities in Wikipedia is difficult, sometimes. Inf. Retr., 13(5):534–567, 2010.

[51] Z. Despotovic and K. Aberer. P2P reputation management: Probabilistic estimation vs. social networks. Computer Networks, 50(4):485–500, 2006.

[52] E. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.

[53] H. Do and E. Rahm. COMA - a system for flexible combination of schema matching approaches. In International Conference on Very Large Data Bases (VLDB), 2002.

[54] F. Donini. Complexity of reasoning. In The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.

[55] J. Euzenat and P. Shvaiko. Ontology matching. Springer-Verlag, Heidelberg (DE), 2007.

[56] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[57] I. Foster and C. Kesselman. The grid: blueprint for a new computing infrastructure. Morgan Kaufmann, 2004.

[58] I. Foster, J. Vockler, M. Wilde, and Y. Zhao. Chimera: A Virtual Data System For Representing, Querying, and Automating Data Derivation. In Scientific and Statistical Database Management (SSDBM), pages 37–46, 2002.

[59] J. Freeman, J. Lewis, W. Ketchum, S. Poprocki, A. Pronko, et al. An artificial neural network based b-jet identification algorithm at the CDF Experiment. Nucl. Instrum. Meth., A663:37–47, 2012.

[60] M. Friedman, A. Levy, and T. Millstein. Navigational plans for data integration. In National Conference on Artificial Intelligence (AAAI), 1999.

[61] A. Gal. Managing Uncertainty in Schema Matching with Top-K Schema Mappings. Journal on Data Semantics, VI, 2006.

[62] D. Gfeller, J.-C. Chappelier, and P. De Los Rios. Finding instabilities in the community structure of complex networks. Physical Review E, 72(5 Pt 2):056135, December 2005.

[63] D. Gfeller and P. De Los Rios. Spectral coarse graining of complex networks. Physical Review Letters, 99(3):038701, July 2007.

[64] D. Gfeller and P. De Los Rios. Spectral coarse graining and synchronization in oscillator networks. Physical Review Letters, 100(17):174104, 2008.

[65] P. Ginsparg. First steps towards electronic research communication. Computers in Physics, 8:390, 1994.

[66] P. Ginsparg. The global-village pioneers. Physics World, pages 22–26, 2008.

[67] P. Groth, A. Gibson, and J. Velterop. The anatomy of a nanopublication. Information Services and Use, 30:51–56, 2010.

[68] A. Halevy, Z. Ives, P. Mork, and I. Tatarinov. Piazza: Data Management Infrastructure for Semantic Web Applications. In International World Wide Web Conference (WWW), 2003.

[69] A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation for large-scale semantic data sharing. VLDB Journal, 14(1), 2005.

[70] H. Haller, F. Kugel, and M. Völkel. iMapping wikis - towards a graphical environment for semantic knowledge management. In SemWiki, 2006.

[71] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In Proceedings of the 4th International Conference on Semantic and Digital Media Technologies (SAMT 2009), pages 182–187, Berlin/Heidelberg, 2009. Springer.

[72] P. Heim, S. Lohmann, and T. Stegemann. Interactive relationship discovery via the semantic web. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 303–317, Berlin/Heidelberg, 2010. Springer.

[73] D. Helbing. FuturICT — A Knowledge Accelerator to Explore and Manage Our Future in a Strongly Connected World. arXiv preprint arXiv:1108.6131, 2011.

[74] M. Hepp, P. D. Leenheer, A. de Moor, and Y. Sure, editors. Ontology Management, Semantic Web, Semantic Web Services, andBusiness Applications, volume 7 of Semantic Web And Beyond Computing for Human Experience. Springer, 2008.

[75] T. Hey and A. Trefethen. The data deluge: An e-science perspective. In Grid computing, pages 809–824. Wiley Online Library, 2003.

[76] A. Heydon, R. Levin, T. Mann, and Y. Yu. The Vesta Approach to Software Configuration Management. Technical Report 168, Compaq Systems Research, 2001.

[77] A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme. BibSonomy: A social bookmark and publication sharing system. In A. de Moor, S. Polovina, and H. Delugach, editors, Proceedings of the First Conceptual Structures Tool Interoperability Workshop at the 14th International Conference on Conceptual Structures, pages 87–102, Aalborg, 2006. Aalborg Universitetsforlag.

[78] L. I. Meho. The rise and rise of citation analysis. Physics World, 20:32–36, 2007.

[79] Z. G. Ives, T. J. Green, G. Karvounarakis, N. E. Taylor, V. Tannen, P. P. Talukdar, M. Jacob, and F. Pereira. The Orchestra collaborative data sharing system. SIGMOD Record, 37(3):26–32, 2008.

[80] A. Jaffri, H. Glaser, and I. Millard. URI Disambiguation in the Context of Linked Data. In Workshop on Linked Data on the Web (LDOW), 2008.

[81] H. Jeung, S. Sarni, I. Paparrizos, S. Sathe, K. Aberer, N. Dawes, T. G. Papaioannou, and M. Lehning. Effective Metadata Management in Federated Sensor Networks. In Proceedings of the Third IEEE International Conference on Sensor Networks, Ubiquitous, and Trustworthy Computing, 2010.

[82] A. Jøsang, S. Hird, and E. Faccer. Simulating the effect of reputation systems on e-markets. Trust Management, pages 1072–1072, 2003.

[83] G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. In Proceedings of the 2010 International Conference on Management of Data (SIGMOD '10), pages 951–962, 2010.

[84] M. Kohlhase and I. A. Sucan. A search engine for mathematical formulae. In Proc. of Artificial Intelligence and Symbolic Computation, number 4120 in LNAI, pages 241–253. Springer, 2006.

[85] N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In ACM SIGMOD International Conference on Management of Data, 2006.

[86] N. Koudas and D. Srivastava. Approximate Joins: Concepts and Techniques. In International Conference on Very Large Data Bases (VLDB), 2005.

[87] D. Kreps and R. Wilson. Reputation and imperfect information. Journal of Economic Theory, 27(2):253–279, 1982.

[88] X. Liu and T. Murata. Detecting Communities in K-Partite K-Uniform (Hyper)Networks. Journal of Computer Science and Technology, 26(5):778–791, 2011.

[89] S. Lohmann, P. Heim, T. Stegemann, and J. Ziegler. The RelFinder user interface: Interactive exploration of relationships between objects of interest. In Proceedings of the 14th International Conference on Intelligent User Interfaces (IUI 2010), pages 421–422, New York, NY, USA, 2010. ACM.

[90] L. Lu, Y.-C. Zhang, C. H. Yeung, and T. Zhou. Leaders in social networks, the Delicious case. PLoS ONE, 6(6):e21202, June 2011.

[91] J. Madhavan and A. Halevy. Composing Mappings Among Data Sources. In SIGMOD International Conference on Management of Data, 2003.

[92] J. Madhavan and A. Halevy. Composing Mappings Among Data Sources. In International Conference on Very Large Data Bases (VLDB), 2003.

[93] S. Marsh. Formalising trust as a computational concept. PhD thesis, University of Stirling, Dept. of Computing Science and Mathematics, 1994.

[94] D. McGuinness and F. van Harmelen (Ed.). OWL Web Ontology Language Overview. W3C Recommendation, February 2004.

[95] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In International Conference on Data Engineering (ICDE), 2002.

[96] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), 2011.

[97] R. Mihalcea and A. Csomai. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM '07), pages 233–242, New York, NY, USA, 2007. ACM.

[98] M. Molloy and B. Reed. The size of the giant component of a random graph with a given degree sequence. Combinatorics, Probability & Computing, 7(3):295–305, 1998.

[99] L. Moreau, J. Freire, J. Futrelle, R. McGrath, J. Myers, and P. Paulson. The Open Provenance Model: An Overview, pages 323–326. 2008.

[100] L. Mui, M. Mohtashemi, and A. Halberstadt. A computational model of trust and reputation. In Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS), pages 2431–2439. IEEE, 2002.

[101] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.

[102] B. Pang and L. Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2:1–135, January 2008.

[103] T. G. Papaioannou, S. Sarni, K. Aberer, S. Simoni, M. Parlange, M. Bavay, and M. Lehning. Automated Model-driven Simulation and Visualization of Field Sensor Data. In Proceedings of the European Geosciences Union General Assembly 2011, Earth & Space Science Informatics, 2011.

[104] B. Plale, J. Alameda, B. Wilhelmson, D. Gannon, S. Hampton, A. Rossi, and K. Droegemeier. Active Management of Scientific Data. IEEE Internet Computing, 9(1):27–34, 2005.

[105] R. Prokofyev, A. Boyarsky, O. Ruchayskiy, K. Aberer, G. Demartini, and P. Cudre-Mauroux. Tag recommendation for large-scale ontology-based information systems. In International Semantic Web Conference (ISWC), pages 325–336, 2012.

[106] R. Prokofyev, G. Demartini, A. Boyarsky, O. Ruchayskiy, and P. Cudre-Mauroux. Ontology-based word sense disambiguation for scientific literature. In ECIR, 2013.

[107] D. Quercia, S. Hailes, and L. Capra. B-trust: Bayesian trust framework for pervasive computing. Volume 3986 of Lecture Notes in Computer Science, pages 298–312, 2006.

[108] E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4), 2001.

[109] A. Ranabahu, P. Anderson, and A. P. Sheth. The cloud agnostic e-science analysis platform. IEEE Internet Computing, 15(6):85–89, 2011.

[110] M. Reich, T. Liefeld, J. Gould, J. Lerner, P. Tamayo, and J. Mesirov. GenePattern 2.0. Nature Genetics, 38(5):500–501, 2006.

[111] D. De Roure, C. Goble, S. Aleksejevs, S. Bechhofer, J. Bhagat, D. Cruickshank, P. Fisher, N. Kollara, D. Michaelides, P. Missier, D. Newman, M. Ramsden, M. Roos, K. Wolstencroft, E. Zaluska, and J. Zhao. The evolution of myExperiment. In IEEE International Conference on e-Science, pages 153–160, 2010.

[112] J. Sabater and C. Sierra. Review on computational trust and reputation models. Artif. Intell. Rev., 24:33–60, September 2005.

[113] S. Sahoo, A. Sheth, and C. Henson. Semantic provenance for eScience: Managing the deluge of scientific data. IEEE Internet Computing, 12(4):46–54, 2008.

[114] E. N. Sawardecker, C. A. Amundsen, M. Sales-Pardo, and L. A. N. Amaral. Comparison of methods for the detection of node group membership in bipartite networks. European Physical Journal B, 72(4):671–677, 2009.

[115] F. Schweitzer, G. Fagiolo, D. Sornette, F. Vega-Redondo, and D. White. Economic networks: What do we know and what do we need to know? Advances in Complex Systems, 12(4-5):407–422, 2009.

[116] A. Seering, P. Cudre-Mauroux, S. Madden, and M. Stonebraker. Efficient Versioning for Scientific Array Databases. In International Conference on Data Engineering (ICDE), 2012.

[117] Y. L. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance Techniques. Technical Report TR-618, Computer Science Department, Indiana University, Bloomington, 2005. ftp://ftp.cs.indiana.edu/pub/techreports/TR618.pdf.

[118] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34:31–36, September 2005.

[119] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: A Core of Semantic Knowledge. In 16th International World Wide Web Conference (WWW 2007), New York, NY, USA, 2007. ACM Press. Available at http://www.mpi-inf.mpg.de/yago-naga/yago/.

[120] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08), pages 990–998, New York, NY, USA, 2008. ACM.

[121] A. Tonon, G. Demartini, and P. Cudre-Mauroux. Combining inverted indices and structured search for ad-hoc object retrieval. InSIGIR, pages 125–134, 2012.

[122] T. Tran, P. Cimiano, S. Rudolph, and R. Studer. Ontology-based interpretation of keywords for semantic search. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC'07/ASWC'07), pages 523–536, Berlin, Heidelberg, 2007. Springer-Verlag.

[123] G. Tummarello, E. Oren, and R. Delbru. Sindice.com: Weaving the open linked data. In K. Aberer, K.-S. Choi, N. Noy, D. Allemang, K.-I. Lee, L. J. B. Nixon, J. Golbeck, P. Mika, D. Maynard, G. Schreiber, and P. Cudre-Mauroux, editors, Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC/ASWC 2007), Busan, South Korea, volume 4825 of LNCS, pages 547–560, Berlin, Heidelberg, November 2007. Springer Verlag.

[124] E. van der Wall. Summa cum fraude: how to prevent scientific misconduct. Netherlands Heart Journal, 19:57–58, 2011.

[125] F. van Harmelen, A. ten Teije, and H. Wache. Knowledge engineering rediscovered: Towards reasoning patterns for the semantic web. In N. Noy, editor, Proceedings of The Fifth International Conference on Knowledge Capture, pages 81–88. ACM, September 2009.

[126] L. Vu and K. Aberer. A probabilistic framework for decentralized management of trust and quality. Cooperative Information Agents XI, pages 328–342, 2007.

[127] L. Vu, M. Hauswirth, and K. Aberer. QoS-based service selection and ranking with trust and reputation management. On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, pages 466–483, 2005.

[128] L. Vu, T. Papaioannou, and K. Aberer. Synergies of different reputation systems: challenges and opportunities. In World Congress on Privacy, Security, Trust and the Management of e-Business (CONGRESS '09), pages 218–226. IEEE, 2009.

[129] L. Vu, T. Papaioannou, and K. Aberer. Impact of trust management and information sharing to adversarial cost in ranking systems. Trust Management IV, pages 108–124, 2010.

[130] L. Vu, F. Porto, K. Aberer, and M. Hauswirth. An extensible and personalized approach to QoS-enabled service discovery. In 11th International Database Engineering and Applications Symposium (IDEAS 2007), pages 37–45. IEEE, 2007.

[131] L.-H. Vu and K. Aberer. Effective usage of computational trust models in rational environments. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT '08), volume 1, pages 583–586, December 2008.

[132] L.-H. Vu and K. Aberer. Towards probabilistic estimation of quality of online services. In IEEE International Conference on Web Services (ICWS 2009), pages 99–106, July 2009.

[133] L.-H. Vu and K. Aberer. Effective usage of computational trust models in rational environments. ACM Trans. Auton. Adapt. Syst., 6:24:1–24:25, October 2011.

[134] D. Wells, E. Greisen, and R. Harten. FITS: A flexible image transport system. Astron. Astrophys. Suppl., 44:363–370, 1981.

[135] A. Woodruff and M. Stonebraker. Supporting Fine-grained Data Lineage in a Database Visualization Environment. In International Conference on Data Engineering (ICDE), 1997.

[137] M. Wylot, J. Pont, M. Wisniewski, and P. Cudre-Mauroux. dipLODocus[RDF]: short and long-tail RDF analytics for massive webs of data. In International Semantic Web Conference (ISWC), pages 778–793, 2011.

[138] J. Zhao, C. A. Goble, R. Stevens, and S. Bechhofer. Semantically Linking and Browsing Provenance Logs for E-science. In International Conference on Semantics in a Networked World (ICSNW), pages 158–176, 2004.

[139] H. Zhuge. China's e-science knowledge grid environment. IEEE Intelligent Systems, 19(1):13–17, 2004.
