42
© 2016 FutureTDM | Horizon 2020 |GARRI-3-2014 – 665940 REDUCING BARRIERS AND INCREASING UPTAKE OF TEXT AND DATA MINING FOR RESEARCH ENVIRONMENTS USING A COLLABORATIVE KNOWLEDGE AND OPEN INFORMATION APPROACH Deliverable D3.1 Research Report on TDM Landscape in Europe

Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

©2016FutureTDM|Horizon2020|GARRI-3-2014–665940

REDUCING BARRIERS AND INCREASING UPTAKE OF TEXT AND DATA MINING FOR RESEARCH

ENVIRONMENTSUSINGACOLLABORATIVEKNOWLEDGEANDOPENINFORMATIONAPPROACH

DeliverableD3.1

ResearchReportonTDMLandscapein

Europe

Page 2: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

2

Project

Acronym: FutureTDM

Title: Reducing Barriers and Increasing Uptake of Text and Data Mining for Research

EnvironmentsusingaCollaborativeKnowledgeandOpenInformationApproach

Coordinator: SYNYOGmbH

Reference: 665940

Type: Collaborativeproject

Programme: HORIZON2020

Theme: GARRI-3-2014-ScientificInformationintheDigitalAge:TextandDataMining(TDM)

Start: 01September,2015

Duration: 24months

Website: http://www.futuretdm.eu

E-Mail: [email protected]

Consortium: SYNYOGmbH,Research&DevelopmentDepartment,Austria,(SYNYO)

StichtingLIBER,TheNetherlands,(LIBER)

OpenKnowledge,UK,(OK/CM)

RadboudUniversity,CentreforLanguageStudies,TheNetherlands,(RU)

TheBritishLibraryBoard,UK,(BL)

UniversiteitvanAmsterdam,Inst.forInformationLaw,TheNetherlands,(UVA)

Athena Research and Innovation Centre in Information, Communication and

Knowledge Technologies, Inst. for Language and Speech Processing, Greece,

(ARC)

UbiquityPressLimited,UK,(UP)

FundacjaProjekt:Polska,Poland,(FPP)

Page 3: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

3

Deliverable

Number: D3.1

Title: ResearchreportonTDMLandscapeinEurope

Leadbeneficiary: RU

Workpackage: WP3:ASSESS:Studies,Publications,LegalRegulations,PoliciesandBarriers

Disseminationlevel: Public(PU)

Nature: Report(RE)

Duedate: 31.01.2016

Submissiondate: 29.04.2016

Authors: MariaEskevich,RU

AntalvandenBosch,RU

Contributors: MarcoCaspers,UVA

LucieGuibault,UVA

AlessioBertone,SYNYO

SusanReilly,LIBER

CarmenMunteanu,SYNYO

PeterLeitner,SYNYO

SteliosPiperidis,ARC

Review: MarcoCaspers,UVA

LucieGuibault,UVA

AlessioBertone,SYNYO

Acknowledgement: This project has received

funding from the European Union’s Horizon 2020

Research and Innovation Programme under Grant

AgreementNo665940.

Disclaimer: The content of this publication is the

sole responsibility of the authors, and does not in

any way represent the view of the European

Commissionoritsservices.

ThisreportbyFutureTDMConsortiummemberscan

bereusedundertheCC-BY4.0license

(https://creativecommons.org/licenses/by/4.0/).

Page 4: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

4

TableofContents

1 Introduction....................................................................................................................................6

1.1.Whatistextanddatamining?.....................................................................................................6

1.2.ThegrowingimportanceofTDM.................................................................................................7

1.3.Goalandstructureofthereport..................................................................................................9

2 TDMframework............................................................................................................................11

2.1.Datatobemined........................................................................................................................11

2.2.TechnologyandR&D..................................................................................................................16

3 TDMacrosseconomicsectors.......................................................................................................21

3.1.Primarysector............................................................................................................................22

3.2.Secondarysector........................................................................................................................22

3.3.Tertiarysector............................................................................................................................23

3.4.Quaternarysector......................................................................................................................25

3.5.Quinarysector............................................................................................................................26

4 Summaryoftheliterature.............................................................................................................27

5 Conclusion.....................................................................................................................................40

References.............................................................................................................................................41

Page 5: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

5

LISTOFFIGURES

Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography........8

Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite........9

Figure3.MacroviewofTDMframework.............................................................................................11

Figure4.MicroviewonTDMenvironment..........................................................................................17

Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors........22

Figure6.Timelineofreports.................................................................................................................29

Page 6: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

6

1.INTRODUCTION

Recentyearshaveseenanexponentialgrowthinthevolumeofdigitaldataavailabletoalllevelsof

society.Afurtherproliferationofdigitalordigitisedcontentisexpectedinthis,theeraofBigData.A

numberofinterrelatedfactorshaveboostedthisdatathrustandtheaccompanyingdevelopmentof

technologiestoexploitthisnewtypeofinformation-richgroundmaterialwithtextanddatamining

(TDM). First, the costs for data storage and cloud computing facilities continue to decrease,while

computing power increases. Second, digitisation has moved from specific processes, such as

archiving and optimizing industrialworkflows, to nearly all daily aspects of societal activities. This

processpavesthewayfordataficationofthisinformation,i.e.thecreationofnewinformationand

knowledge, potentially with new value, on the basis of machine-readable content. TDM is the

umbrellatermfortechnologiesthatenablethisgoal.

Thecontinuousgrowthofdata1increasinglyemphasisesthequestionofhowtobestmakeuseofit.

Societyexpects,withgoodreason,thatthe implementationof intelligentmethodstodealwithBIg

Datawill leadtoimprovedinformationaccess,fasterknowledgediscovery,higherproductivity,and

competitiveness.TheseexpectationshaveinspiredTDMresearchersandcompaniesacrosstheworld

to explore novel algorithms for extracting information, discovering knowledge, and improving the

ability topredict fromdataona large scale.Asdata is amanifold concept, so is TDM.This report

aimstochartandstructuretheTDMlandscape.

1.1.WhatisTextandDataMining?

Text and Data Mining was initially defined as “the discovery by computer of new, previously

unknown information, by automatically extracting and relating information from different (…)

resources,torevealotherwisehiddenmeanings”(Hearst,1999),inotherwords,“anexploratorydata

analysisthatleadstothediscoveryofheretoforeunknowninformation,ortoanswersforquestions

for which the answer is not currently known” (Hearst, 1999). Nowadays, this umbrella term

encompassesdiversetechniquesthatallow interpretationofcontentofanytyperangingfromraw

data, e.g. sensor data, text, images and multimedia, to processed content, e.g. diagrams, charts,

tables,references,maps,formulas,chemicalstructures,andmetadatafromsemi-structuredsources,

onalargescalethroughtheidentificationofpatterns.

TDM algorithms harbour vast potential for nearly all scientific fields and for a wide diversity of

practical industrial and societal applications. However, this broad potential and variety of TDM

applications complicates the landscape view, as even the technology itself is referred to using

different terms within different fields. Within the business and managerial context, TDM can be

referred to asBusiness Intelligence Solutions orQualitative Data Analysis. To highlight the central1Itisexpectedthattherewillbe16trilliongigabytesofdataby2020,whichcorrespondstoanannualgrowthrateof236% indatageneration (Motion fora resolution further toQuestion forOralAnswerB8-0116/2016pursuant to Rule 128(5) of the Rules of Procedure on “Towards a thriving data-driven economy”(2015/2612(RSP),2016):http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN

Page 7: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

7

positionofvastquantitiesofdata,thetermBigData(Processing)isused.Whenthefocusisonthe

discovery of hitherto undiscovered information, Content Mining appears to be a term of choice,

whileDataAnalyticsandTextAnalyticsemphasisethedata-drivenaspectofperforminganalysesand

afocusonseekinganalyticalsolutionstochallenges.Intheacademicworld,TDMisoftenconsidered

tobecomposedofMachineLearningmethodscoupledwithExploratoryDataAnalysismethodssuch

as visualisation.Machine Learning, in turn, arose from thepost-war fields ofArtificial Intelligence,InformationTheory,andPatternRecognition.

1.2.TheGrowingImportanceofTDM

Withtheincreaseofcomputationalpower,theriseofBigData,andthedigitisationofvastamounts

ofdata,theuseofTDMpromisestoyieldvaluablesocietalandeconomicbenefits intermsofnew

insights and cost savings that would otherwise not be possible: more scientific breakthroughs, a

greaterunderstandingof society, and countlessbusinessopportunities. Thepotential of TDMas a

researchtechniqueandasaneconomicopportunityhasbeenrecognisedatapoliticallevelintheEU

(HargreavesI.etal.,2014).2ThebenefitsofTDMhavealsobeennotedbytheresearchcommunity

andsocietalstakeholderse.g.viatheHagueDeclaration3.

Originallyrootedinacademicresearch,TDMhasleftitsacademicchildhoodyearsandisnowaglobal

informationtechnology(IT)industry.Asanindustry,TDMisfuelledbydirectaccesstolargeamounts

ofdata.CompaniescaneitherdevelopandruntheirownTDMsoftware,oroutsourcethisworkto

external experts who can customise TDM solutions. Industry is now rapidly rolling out the

implementationofTDMforBigData,ascomputingpowergrowsandbusinesses findnewways to

gather and access data. According to the International Data Corporation’s Worldwide Big Data

TechnologyandServicesForecastfor2013-2017,thegrowthrateoftheBigDatamarketuntil2017

willbesixtimesfasterthanthatoftheoverallinformationandcommunication(ICT)market.TheBig

DataValuePublic-PrivatePartnership2predictsthatTDMwillreachatotalofEUR50billionby2017,

andmayresult in3.75millionnewjobs.WhilemanycompaniesseethepotentialofTDMfortheir

business development, only 1.7% of companies are currentlymaking full use of advanced digital

technologieslikeTDM,despitethebenefitsthatdigitaltoolscanbring.2

TheapplicationofTDMtoscienceholdsthepromiseofacceleratingscientificdiscoveries.Miningthe

contentpresent in researchpublicationdatabases and theassociated researchdata sets, as far as

they are available for harvesting, will boost scientific research by increasing the productivity of

researchers and leaving part of the scientific discovery process to computers, a trend calledDataScience. The size of scientific publication databases varies across domains, but in each case, the

number is beyond a value thatwould be feasible for a human to read and assesswithout digital

support. Figures 1-2 give examples of the volume of content that has to be dealt with by the

2 (Motion for a resolution further toQuestion forOralAnswerB8-0116/2016pursuant toRule 128(5) of theRules of Procedure on “Towards a thriving data-driven economy” (2015/2612(RSP), 2016):http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN;http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf3TheHagueDeclarationonKnowledgeDiscovery:http://thehaguedeclaration.com/the-hague-declaration-on-knowledge-discovery-in-the-digital-age/

Page 8: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

8

academia and industry researchers in the Computer Science4 and Life Sciences domains5,

respectively. TDMmay help researchers in all scientific fields to regain control over the academic

process,toaggregateinformationandtofollowtrendsacrossfieldsbeyondtheirown.Estimationsof

addedeconomicvalueinthiscase,basedonthetimesavingbenefitofTDMalone,areanincreaseof

2percent,orEUR5.3billion,withlong-termimpactofoverallgainintherangeofatleastEUR32.5

billion(Filippov,2014).

Figure1.NumberofdifferenttypesofpublicationsintheDBLPComputerSciencebibliography.6

4http://dblp2.uni-trier.de5https://europepmc.org6http://dblp2.uni-trier.de/statistics/publicationsperyear.html[accessedonMarch2016]

Page 9: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

9

Figure2.NumberofdifferenttypesofpublicationsontheEuropeanLifeSciencesPMCwebsite.7

1.3.Goalandstructureofthereport

Inthisreport,weoutlinethetextanddatamininglandscape.Westartbyprovidinganoverviewof

types of data and technologies that the TDM experts are working with. We look at the global

economicstructureoftheknowledge-basedsocietyandhighlightusecasesforTDMuptakeacrossall

economic sectors. Following recentdiscussionsonTDMperspectiveswithin the scientific,political,

andeconomiccontext,wehighlighttheimportanceofthetechnologyanditspotential.

TDMalgorithmsaredevelopedglobally. It is thereforeour task to list and structure the important

technology definitions, data types, underlying algorithms and leading research communities on a

worldwide scale. However, the actual uptake of TDM and ease of practical implementation vary

considerablydependingonthecountryandtheeconomicsectorinquestion.Inthisreport,wegivea

broadperspectiveon the general EuropeanUnion (EU) landscape, andquoteexceptional country-

leveldirectivesonlywhenthesecountrieshavelegalregulationsspecificallyaddressingTDM,e.g.the

caseofstatutoryexceptiononcopyrightintheUK8.

TDM is becoming increasingly ubiquitous on the global market, as each company that tracks its

activities,products,andcommunicationsinsomedigitalformhasthepotentialtoreusethisdatafor

furtheranalysisandimprovement.Forsomesectors,countriesandapplications,itisalreadypossible

toestimate thepotential profit or at least the sizeof themarket.Other fields suchas journalism,

7https://europepmc.org/About[accessedonMarch2016]8 "Section 29A and Schedule 2(2)1D of the Copyright, Designs and Patents Act 1988":http://www.legislation.gov.uk/uksi/2014/1372/regulation/3/made

Abstracts30.9million,25.9millionfromPubMed

Patents4.2million

Agricolarecords617244

Fulltextarticles3.6million

NHSguidelines765

Page 10: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

10

which is in the process of embracing the concept ofData Journalism, have to first agree on the

conceptualchangesrequiredtotheircurrentworkbehaviourinordertomergethetraditionalwork

patternwithTDM(LewisS.C.andWestlundO.,2014).

This report is structured as follows: in Section 2,we introduce the definitions of the overall TDM

framework, classifications of data types, and the technologies that can be used in text and data

miningprocess.InSection3,weshowhowTDMispotentiallyrelatedtoallsectorsofeconomics,and

we illustrate this assumptionwith examples and potential use cases of TDM implementation and

integration across sectors. In Section 4, we draw a timeline of scientific, industrial and political

documents thatemphasise the importanceofand interest inTDMand theneed foruptakeacross

sectors,andwesummarisethekeymessages.WeconcludethereportwithSection5,outliningthe

importanceofTDMforthefutureoftheeconomyandscience.

Page 11: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

11

2.TDMFRAMEWORK

Thequantityofdigitalcontentcurrentlyproducedbysocietyandthepotentialeaseofaccesstothis

dataatany locationviaeasy-to-usedevices (withreliable Internetconnection,andmobiledevices)

increasetheneedforcontentminingtechnologiesthatboosthumancapacitiesintermsofspeedof

dataprocessing (Mayer-SchoenbergerandCukier,2013).TDMtechnologiescansupport individuals

aswellaslargerinstitutionsandcompaniesindecision-makingprocessesandinthedevelopmentof

novel ideas and solutions. In this sectionwe discuss diverse data classifications, technologies that

underlieTDM implementations,and the researchcommunities thatbring together the researchers

anddevelopersfrombothacademiaandindustrytoimproveandadvanceTDMalgorithms.

2.1.Datatobemined

TDMisnotdefinedthroughasinglescientificdomainortheory; it isacomplexsetoftechnologies

with different underlying theories and assumptions, consisting of components that are developed

withindiversedisciplines.AsFigure3shows,atamacrolevel,theTDMframeworkiscentredaround

data.AsweentertheBigDataera(Mayer-SchoenbergerandCukier,2013),theefficientstorageand

managementofcontentrepresentasignificantchallengeoftheirown,bothintermsoftheamount

ofinformationandintermsoftheheterogeneousnatureofthedata,whichcanbenumeric,textual,

multimodal,etc.Themorecomplexthedata,themorecreativeandinnovativetheTDMcomponents

have to be. Data can be structured according to its size and legal accessibility, according to its

sources,orbywhoproducesitthecontentthatconstitutesthedatacollections.

InSection2.1.1,weprovideanoverviewofthetypesofsourcesthatcangenerateandholddata,to

illustratethebreadthofthesourcesonwhichTDMmayoperate. InSection2.1.2,wesummarisea

newlyproposedclassificationofdatasetsaccordingtodatasizeandtypesoflegalaccess,inorderto

identifythetypesofdatasetsthatcouldbemoreeasilyaccessedandminedbyfor-profitandnon-

profitorganisations,inthecasethatcopyrightrestrictionsareadjusted.

Figure3.MacroviewofTDMframework.9

9ThisFigure(‘MacroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license.

Page 12: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

12

2.1.1.Datasourcesanddataholders

Datacomesincountlessforms.It iseasytothinkofdataasintentionallycollectedtextorscientific

data,butdatamaybealsoproducedbyhumansandbusinessesastheycarryoutdatafiedactivities,

leavingso-called“dataexhaust”asaby-productof theiractionsandmovements in theworld,e.g.

users’ online interactions, corporate and personal websites, and released digital documents and

archives (Mayer-Schoenberger and Cukier, 2013). Another type of data is represented by

measurements of uncontrolled events happening in nature as recorded by weather sensors,

telescopes,andotherinstrumentsacrosstheglobe.

Belowwelistdifferentdatasources,rangingfrompersonaldata,createdortrackedfromindividuals,

to professionally gathered and created content in formsof cultural knowledgedata, scientific and

businessdata,publicsectorinformation.AllthisdatacanbeeithersharedpubliclyontheInternet,

andstoredoffline.Whensharedpublicly,theoriginalcontentisenrichedwithadatadescriptionand

furtherinformationthatareavailableonawebsite.

Personaldata

ThetracesthateachInternetuserleavesformamassiveamountofinformationthathasvaluewhen

gathered and structured on amass scale. Currently this information ismostly given by individuals

eitherwithorwithouttheirconsent;inthefirstcaseoftenwithoutthemrealisinghowprofitablethis

information is,and in thesecondcase,without themeven realizing that their information isbeing

collectedinanyway.

States also collect information about individuals in the country, and can often compel people to

providethemwithinformation,ratherthanhavingtopersuadethemtodosooroffersomethingin

return, as private companies would have to. This special status allows a state to gather large

amountsofinformationthatcanbepotentiallyminedbyprivateorpublicinstitutions.Thiscanthen

beusedtooptimiseinteractionsbetweenthestateandindividuals,toimprovetheservicesprovided

bythestate,andtocreatebusinessmodelsaroundthisdataortousethisknowledgeforthegreater

goodofthesociety.Atthesametime,alongwiththeaccesstopersonaldata,statesareresponsible

forprotectingtheprivacyofthedataproducerswhenminingthisdataandreleasingittothegeneral

public.

Culturalheritagedata

There has been enormous investment in the digitisation of heritage data, i.e. data that are

consideredworthwhilearchiving,studying,orexhibiting.In2010,itwasestimatedthatitwouldtake

EUR10billiontodigitiseEurope’sculturalheritageholdings.3By2012,itwasestimatedthatatleast

20%of Europe’s culturalheritage collectionshadbeendigitised10.At the same time librarieswere

investing in preserving digital collections and exploring ways to increase the accessibility of this

content. Libraries use TDM to structure digital content, and to develop and explore further

possibilities for knowledge discovery e.g. via topic modelling. Improvements in the accuracy of

10http://pro.europeana.eu/enumerate/statistics/results

Page 13: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

13

optical character recognition (OCR), including for handwritten texts11will increase the impact and

efficiencyof applying TDM todigital content inorder to enhancemetadataor analyse collections.

Google Books12 and Google Art13 operate on a worldwide scale, and digitise cultural (printed)

heritage for theirown technologydevelopment,aswellas tomake thisdataaccessible toawider

audience.TheHathiTrustDigitalLibraryisapartnershipmodelbringingtogetherdigitisedcollections

fromacrossUSinstitutionsandmakingthemavailabletoUSresearchersforanalysis.14AtaEuropean

level, Europeana, which collects metadata about heterogeneous data such as digitised artworks,

artefacts,books,videosandsounds,providesaccesstothismetadataandlimiteddatasetsviaAPIs15.

The DARIAH16 and CLARIN17 infrastructures illustrate the value and potential of access to these

extremely heterogeneous corpora for the research community. Both infrastructures support TDM

acrossawide-rangingcorpora,e.g.CLARINsupportsnaturallanguageprocessingandanalysisoftext

(includingnewspapers,webfora,books),andspeech(newsbroadcasts,signlanguage).

Scientificdata

Scientific data consist of twomain types: data that have been collected through experiments and

studies(measurementsofanykind,texts,videos,etc.),andthescientificpublicationsthatdescribe

the work after analysis has been performed (containing data description, methods, experimental

results,etc.).

Data used in experiments may be released to a general audience, along with the publication

describingthestudy.Thisjointreleaseofdata,withcorrespondingpublicationandrelatedmetadata,

is considered to have a beneficial side-effect: it promotes scientific research reproducibility and

scientific integrity. An increasing number of researchers and international research organisations

perceivethistobeapositivestepforwards(ICSU,ISSC,TWAS,IAP,2015).18

Scientific data can be released purely as a data set collected for general purposes. In this way,

researchers in business or academia can access the data and decide themselves on the type of

experimentstorun.AnexampleofthisapproachistheopenaccessreleaseoftheEUsatellitedata

fromCopernicusEarthobservation19.The initiativetoaskthequestionsandtoanswerthemlies in

thehandsoftheaudience.OpendataanddatasharinginresearchisontheincreaseinEuropedue

to funder-mandates such as the H2020 Open Data Pilot, and the increasing availability of

infrastructurefordatasharinge.g.genericservicessuchasEUDAT20andZenodo21.Onaworldwide

11http://transcriptorium.eu12https://books.google.com13https://www.google.com/culturalinstitute/project/art-project

14https://sharc.hathitrust.org15http://labs.europeana.eu/api16https://de.dariah.eu/tatom17http://www.clarin.eu18 FORCE11 MANIFESTO: Improving Future Research Communication and e-Scholarship:https://www.force11.org/about/manifesto19http://www.eea.europa.eu/highlights/eu-satellite-data-to-be20http://www.eudat.eu

Page 14: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

14

scale,anexampleistheResearchDataAlliance(RDA)22,aforumwheretheinfrastructuresfordata

sharingacrossdisciplines,institutions,andcountries,arediscussed,andguidelinesareformulated.In

addition, industryisalsobeginningtomaketheirscientificdatasetsavailable,e.g.clinicaltrialdata,

oftenforethicalortransparencyreasons.

Scientificpublicationscanalsobeconsideredasavaluabledataset forTDM.Thesetexts represent

potentialsourcesofnewknowledgewhentheinformationtheycontainissummarisedandanalysed

incomparisonwithotherstudiesandlinesofenquiryfromotherdisciplines.Oftencontainingfigures

anddatavisualisations,articlesareofvaluenot justbecauseof thetext theycontain,butbecause

they are important sources of raw data and context. TDM analysis of publications can help find

connections between studies in different domains that are implicitly or indirectly connected

(Swanson, 1986). In the field of bioinformatics, the literature has been mined to analyse the

interactions of genes andproteins in various diseases.Often this typeof analysis is performedon

Medline23abstracts,astheyarefreelyavailablebutithasbeennotedthatthisisextremelylimiting.

VanHaagenetal.foundthatonly32%ofknownProteinProteinInteractions(PPIs)couldbefoundin

Medline abstracts, and therefore the rest could only be found by accessing the full text literature

(vanHaagenetal.,2009).

Scientific publications that are released via openaccess are, bydefinition, accessible for TDMand

TDM-basedapplications.PubMedCentral24hasbecomeawidelyusedsourceforTDM.TheEuropean

open access publication infrastructure OpenAire25 not only supports TDM by encouraging storing

articles under open access licences and in interoperablemachine readable formats, it also utilises

TDM to infer links between data and publications, and to provide funders with data regarding

publications arising from research funding and research trends26. Subscription content is less

accessibleas,withtheexceptionoftheUK,Europeanresearcherscanonlyminethiscontentunder

conditions specified by licences. In contrast to this, in theUS, researchersmaymine content that

theyhavelegalaccessto,withouttheneedforadditionallicencepermissions.

Businessdata

Businessesarecreatingandcollectingdatafromallpossiblesources,asminingdatafromcombined

sources improves insights into internal processes, strategies, and themarket. These data can vary

fromenergy consumption by business agents, such as shops and carriers, to personal information

about customers and their digital and financial transactions. Most of this data constitutes value

withinthecontextofaknowledge-basedeconomy,andissensitivetodataprotectionissues,thus,is

usedinternallybycompaniesandisnotreleasedpublically.

21https://zenodo.org22https://rd-alliance.org23https://www.nlm.nih.gov/bsd/pmresources.html24http://www.ncbi.nlm.nih.gov/pmc25https://www.openaire.eu

26https://blogs.openaire.eu/?p=88

Page 15: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

15

Publicsectorinformation(PSI)

PSI isdataproducedandmadeavailablebygovernmental institutions.Today,agrowingnumberof

governmentsrealisethatthisdatacanbringmoreprofitandbenefit tosocietywhenreleasedtoa

general audience. PSI data release initiatives allow commercial companies that are already

experienced in datamining, to use this additional information source to build better products for

theircustomers,consequentlycreatingmorejobsandboostingcommercialsuccess.

The European Union Open Data Portal27 gives access to a growing range of PSI data from the

institutions and other bodies of the European Union (EU); it is free for both commercial or non-

commercial purposes. The EU Open Data Portal is managed by the Publications Office of the

EuropeanUnion.ImplementationoftheEU'sopendatapolicyistheresponsibilityoftheDirectorate-

GeneralforCommunicationsNetworks,ContentandTechnologyoftheEuropeanCommission.

Opendataportalshavealsobeensetupby individualEUMemberStates.28Thedatasetsofpublic

documentsthattheseportalsshare,aremostlyunderCC-0 license, i.e. theownersdidnotreserve

anyrights,andthedataisfreeforsharinganduseinanycountry,byanyinstitutions.Thesedatasets

representthedocumentsinthenativelanguagesofeachspecificcountry.29,30,31,32,33

Internetdata

Sincetheintroductionofthefirstwebsitein1991,theWorldWideWebhadexpandedtomorethan

45 billion pageswith uniqueURLs in the late 2010s (Bosch et al., 2016).Manywebpages contain

valuable information that can be crawled for further mining. More generally, the Internet (the

infrastructureonwhichtheWorldWideWebisbased,butalsoservicessuchasinternet,email,and

filetransfer)allowsaccesstolargeamountsofdataofalltypes,thatcanbecrawledwiththeproper

credentials. Many data remains, of course, hidden behind passwords and firewalls. The so-called

DeepWeb, the part of theWorldWideWeb that is not directly accessible for indexing by search

engines,hasbeenestimatedtoholdatleast500timesasmuchdataastheindexed(‘surface’)web.34

2.1.2.Datatypes

AnytypeofdataconsideredtobeproperlyavailableforTDMmustfirstbeavailableas,orconverted

to,astandardisedandmachine-readableformsothatitcanbeeasilyprocessed.Intheir2014report

to the European Commission entitled 'Standardisation in the area of innovation and technological

development,notably in the fieldofTextandDataMining',Hargreavesetal. (2014) identified five

27https://open-data.europa.eu/en/data28http://ec.europa.eu/newsroom/dae/document.cfm?action=display&doc_id=8340 29https://data.gov.uk30https://data.overheid.nl31https://www.data.gouv.fr32https://www.govdata.de33http://data.norge.no34https://en.wikipedia.org/wiki/Deep_web

Page 16: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

16

typesofdatabasesthatlendthemselvestoTDM,accordingtotheirsizeandaccesspolicies,aswellas

theirpotentialforrevenuestructurechanges:35

1.XXL:alldatabehindfirewalls,i.e.companiesandorganisations’internaldatabases,notaccessible

tothepublic.36

2.XL:allpubliclyaccessibledatanotbehinda firewallorpaywall.e.g. freelyaccessiblenewspaper

websites.

3. L: all publicly accessible data located behind a paywall. e.g. online newspaper pages that are

accessiblewithasubscriptionfee.

4 M: all publicly accessible data behind a paywall whose clients are mainly researchers and

companiesinneedofreports.e.g.Reuters,Bloomberg.

5.S:scientificpublishers’databehindapaywall.

Thisclassificationcoversdatacollectionsofdifferentsizesthatcontaincontentofadiversenature.

Wecanmakeadistinctionbetweenhigh-leveldatathatislikelytobesubjecttocopyrightprotection,

and low-level data that is highly unlikely to be protected under copyright law (although data

protectionlawmayapplyincertaincases):

- Protected:text,image,sound,multimedia;

- Non-protected,unlessthecontentbecomespartofthedatabasethatthecompanyor

institutiondidputsubstantialinvestmentinto:e.g.clicks,likes,GPScoordinates,timestamps,

andmeasurementsofadifferentkind.

2.2.TechnologyandR&D

Textanddatamininghasgraduallyevolvedintoalargedomainwithcomplexapproachesforarange

ofapplications.Historically,thedevelopmentofminingtechnologieshasitsrootsinthemid-1700s,

withtheinventionofmathematicalmodels,regressionmethods,andstatisticalanalysis.Duetothe

adventof computersand systems for storing largerquantitiesofdata, thepossibility tominevery

largedatasetsandfilterrelevantinformationhasgreatlyexpandedwhatusedtobelaboriousexpert

analysiswork. Since the beginning of the 1990s,modern datamining tools have appeared on the

market and in labs, allowing more advanced analysis, to discover hidden patterns and to extract

implicit knowledge,asenvisagedand testedbySwansonD.R. (Swanson,1986).With the invention

andadoptionoftheWorldWideWeb,researchersexpandedtheireffortstodeveloptextanddata

miningmethodstosearchandexploretheweb.This ledtothecreationofsearchengines,starting

from simple interface-based browsers like Mosaic and Netscape, and developing into large-scale

systems likeGoogleandBing. It represented theopeninggambit towardsmorecomplexmethods,

not only based on keyword search, but also on their correlation, their similarity, their joint

probabilities,andtotheapplicationoftextminingtodifferentfields(suchasthebiomedicaldomain,

35Theoriginaldocumentusestheterm‘database’;wereplacedthisby‘data’.36ThesecollectionsareoutsidetheFutureTDMscope.

Page 17: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

17

economics, social sciences, or humanities), and types of data and content (such as scientific

publicationsandresearchdata,socialmedia,multimedia,orpublicsectorinformation).

Figure 4 shows a zoom-in view of the TDM environment, outlining the main research fields that

develop,test,andfinallyprovidethealgorithmsforTDMimplementationsforeachspecificdomain

of use. The core scientific areas aremachine learning,which ismachine learningof tasks and the

discoveryofpatternsandstructureindata;informationretrievalandsearch,whichfocusesontheoptimisation of search methods that index vast amounts of content and allow the location and

extraction of the most relevant facts for any question and user type; and natural languageprocessing techniques, which deal with human language in its textual and spoken form, covering

humanity’s past as well as reflecting current trends and topics. This scheme does not include an

exhaustive listof techniquesand researchdirections; itsaim is to illustrate thescaleandpotential

scientific interaction,and thebroadknowledge required tocreatea successful, scientifically sound

TDMapplication.

Below,webrieflyintroducethemainresearchdirectionsthatconstitutethecomponentsoftextand

dataminingapplications.

Figure4.MicroviewonTDMenvironment.37

Machine Learning (ML) studies the automated recognition of patterns in data, and develops

algorithmsabletolearntaskswithoutexplicit(human)instructionorprogramming.Learningisbased

37ThisFigure(‘MicroviewonTDMenvironment’)byFutureTDMcanbereusedundertheCC-BY4.0license.

Information Retrieval

and Search

Artificial Intelligence

Classification

Clustering

Regression

Association Rules

Statistics

Information Extraction Concept

Extraction

Opinion Mining

Summarisation

Negation and Modality

Detection

TextualEntailment

Natural LanguageProcessing

Machine Learning

Machine Translation

Predictive Analytics

Multimedia Processing

Page 18: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

18

ondataexplorationandisaimedatgeneralizingknowledgeandpatternsbeyondtheexamplesthat

arefedtothealgorithmattrainingstage.

Informationretrieval(IR)canvaryfromsimpleretrievalofalltheitems(e.g.fulldocuments)withina

collection that contain user-requestedwords inmetadata fields or structured databases, tomore

sophisticatedsearchandrankingoftheresultsbasedonthepresenceofcombinationsofimportant

wordsandnumbersindifferentsectionsoftextsordatafields(Baeza-YatesandRibeiro-Neto,1999).

It is important to note that, although TDM uses information retrieval algorithms as one of the

componentswithintheTDMframework,itgoesbeyondthesimpleretrievalofwholedocumentsor

their parts. Within the IR framework, the task of knowledge processing and of new knowledge

creation still resideswith the user, while in the case of full TDM technology implementation, the

extractionofnovel,never-beforeencounteredinformationisthemaingoal(Hearst,1999).

Naturallanguageprocessing(NLP)aimsatdevelopingcomputationalmodelsfortheunderstanding

andgenerationofnaturallanguage.NLPmodelscapturethestructureoflanguagethroughpatterns

that map linguistic form to information and knowledge or vice versa; these patterns are either

inspiredbylinguistictheoryorautomaticallydiscoveredfromdata.TheNLPcommunityissubdivided

intogroupsthatfocusonspokenorwrittencontent,i.e.speechtechnologies(ST)andcomputational

linguistics (CL). ST researchers work with audio signals, synthesising and recognising speech and

paralinguisticphenomena.Oncespeechcontenthasbeentranscribedinatextualrepresentation,it

canbefurthertreatedastextforTDM.CLresearcherstargetsyntacticanddiscoursestructures;their

algorithmshelpusunderstand thebasic concepts, relations, facts,andeventsexpressed in textual

content.

ThevisualisationoftheoutputoftheML,IR,andNLPalgorithmsisanintrinsicpartofalltheTDM

constituenttechnologies.Visualisationisvaluableforresearchersinfieldslikebioinformaticswhere,

forexample, theminingofproteinbehaviourdescribed inasetofpublications isvisualised inone

scheme;orintextmining,whenawordcloud,withwordsindifferentsizeandcoloursdependingon

their frequencies, gives a general impression of the most frequent terms used in a collection of

documents.

Statistics, as a field of mathematics in general and machine learning in particular, provides its

practitionerswithawidevarietyoftoolsfordatacollection,analysis, interpretationorexplanation,

andpresentation.

Classification/Categorisation/Clustering of content within a collection in order to split (in asupervisedorunsupervisedmanner)theseintogroupsofcertaintypesbasedoncontent,representa

TDM preprocessing step at the collection level, as these categories are used for further specific

miningofthecollection’scontent.Machinelearning(ML)algorithmsrepresentadominantsolution

forthistask,e.g.theNaïveBayesapproach(Lewis,1998),decisionrules(Cohen,1995),andsupport

vectormachines(Joachims,1998).Whileclassificationalgorithmsidentifywhetheranobjectbelongs

toacertaingroup,regressionalgorithmsestimateorpredictaresponseasavaluefromacontinuous

set.

Page 19: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

19

Association rule-learning algorithms allow for the discovery of interesting relations between

variables within large databases (Agrawal et al., 1993). For example, when a large database of

customer shoppingbehaviour ismined, association ruleshelp todefinewhat typesofproductsor

services are usually sold together. This information can help to adjust and tailor marketing

campaigns,aswellasmakeproductarrangementsinthestoresmoreefficient.

Informationextraction(IE) isawideconceptthatcoversthe inferenceof informationfromtextual

data.Examplesoftypesofextractedinformationarestatisticsontheusageofterms,theoccurrence

of named entities in text and their relations or associations according to certain classification

schemes,andthepresenceofspecificfactsorlogicalinferencesexpressedintexts(e.g.interactions

betweengenes,logicalproofs,ortheinferenceoflogicalconsequencesandentailmentexpressedin

thetext).

Term or Concept extraction is a first step for further IE, as it is necessary to define the terms of

importance for thecorpus to start itsprocessing.Once these terms,being specific for the topic to

conceptualiseagivenknowledgedomain,areextracted,anontologyofthefieldcanbecreatedfor

furtherIEtaskimplementations.

Negationandmodality(orhedge)detectionisanimportantaspectofTDM,asnegationandhedging

constructionsinthetext(suchas“OurresultscastdoubtonfindingXofAuthorZ”or“Wefoundthat

findingZdoesnotapplytocontextY”)offercrucialevaluationsofreportedfacts.

Sentimentanalysis,oropinionmining,aimstodeterminetheattitudeofaspeakerorawriterwith

respect tosometopicor theoverallcontextualpolarityofadocument.Theattitudemaybehisor

her judgmentorevaluation, theaffective stateof theauthor,or the intendedemotion theauthor

intendstoconveytothereader.

Summarisationofagiventextorasetoftextsisthetaskofcreatingashortertextthatcapturesthemost crucial information from the original texts. First, the crucial content to be summarised is

defined,andthen,dependingonthetask,itiscondensedintoanewrepresentationthatmayconsist

of sentences or parts from the original documents (extractive summarisation), or that may be a

newly created text ormultimedia document that rephrases the summarisedmessage (abstractivesummarisation).

Predictiveanalytics covers themethods that analyseprevioususageor transactions statistics and,

basedontheseresultsandtrends,offerspredictionsonfurthersystemsdevelopment.

Machine translation targets human quality level of translation ofwritten or spoken content from

onelanguagetoanotherusingcomputers.Althoughitoriginallystartedasrule-basedsystemstaking

linguistic knowledge into account, nowadays thepioneers in the fielduseBigData size collections

anddeeplearningalgorithmstotraintheirsystemsandtoachieveperformanceimprovements.

Multimediaprocessingallowsfurtherminingofallaspectsofmultimediacontent,varyingfromthe

audio stream of a given video to visual concepts extraction and annotation, person and object

discovery and localisation on the screen. This direction of research and applications has gained a

Page 20: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

20

greatdealof importance,due to the fact that contemporary Internetusers shareandcreate large

quantities ofmultimedia content that, for example, reflect their opinions and spread information

aboutevents.

As depicted in Figure 4, the methods listed here are not insulated domains of research and

application;theyrepresentacomplexinteractivesetoftoolsandmethodsthatcanberecombinedin

multipleways,dependingontheneedsofthespecifictaskandtheavailabilityofdataandexpertise.

Page 21: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

21

3.TDMACROSSECONOMICSECTORS

Since the beginning of humanity, the development of novel technologies has defined societal

structure and its employment distribution. Agriculture and natural resource extraction form a

primary sector, as this is an initial type of human joint-effort to produce food for survival. The

inventionofmachinesandelaboratetoolsledtothedevelopmentofthesecondarysector,whichhas

evolved intocomplexmachineryengineering,construction,andspaceexploration.Thethirdsector

comprises all the services that society members deliver to each other, varying from retail to

transportation, and from medical healthcare support to entertainment. Until recently, this three

componentstructurerepresentedmostofthesocietiesintheworld.However,thegrowthofdata-

driven business, services and research, has led to a discussion on the changes in the economic

structurethatshouldreflectanoveltypeofknowledge-basedeconomy,whereknowledgeanddata

become a separate valuable commodity (OCDE, 1996). Thus recently, the research that supports

knowledge sharing and growth, aswell as education of the professionals that can carry out these

activities,havebeentermedaquaternaryeconomicsector.Moreover,thehighleveldecisionmakers

in governments, large industry companies, and education, who have the potential to shape the

futureoftheentiresectorswiththeirvisionandfollowingdecisions,havebeenplacedinaseparate,

quinarysector.

Forthepurposeofourreport,weareinterestedinthecurrentandpotentialpresenceofTDMinall

economicsectors.WedepicttheserelationsinFigure5.Theprimary,secondary,andtertiarysectors

followroughlythesametypeofinteractionwithTDMtechnologies.Thesesectorsrepresentsources

of all possible types of data that are available for mining, and at the same time, have tasks and

challenges that require smart solutions. The quaternary sector is particularly central to TDM

development and implementation, as it produces the TDM experts and advances the technology

itself.Expertsatthedecision-maker leveluseTDMtoolstotestandprovetheirvisionofeconomic

development,aswellas tohypothesisewherea specificdecisionwill lead to,and tomeasureand

assessitsprogress.

In the sections below,we list examples per sector and subfield on how TDM is used, in order to

illustratethebroadnessofTDMuptakeanditspotentialfordevelopment.

Page 22: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

22

Figure5.GeneraleconomicstructureandconnectionbetweenTDMandalleconomicsectors.38

3.1.Primarysector

Making the best use of limited natural resources with a constantly growing population is one of

humanity’s ultimate challenges. Taking on the challenge requires the understanding of changing

patternsinclimateandenvironment;TDMhasthepotentialtodiscoversuchpatternsautomatically

from data. The primary data that can be used for TDM are collected in the form of satellite

measurements,weathersensorsonthegroundandbelow, intheair,and inthewater.Otherdata

arederivedfromeconomic indicators fromtheprimarysectors:productionnumbers,profits,stock

marketvaluepatternsofminedgoodssuchasoil,coal,andgold,andcropssuchaswheat39.

3.2.Secondarysector

Anylarge-scalemanufacturingplantordistributednetworkofpipelinesispackedwithvarioustypes

of sensors that produce a continuous stream of measurements that are crucial for maintenance

analysis,suchastemperature,pressurelevels,leakdetections,etc.Whenanalysedonalargescale,

this information offers a basis for discovering and understanding strategies for cost savings and

safety measures. For example, calculating optimal routes for product delivery can boost fuel

efficiency,leadingtoasmallerecologicalfootprintaswellastocostssavings;knowledgeofelectrical

gridusageorgaspipelinesintegrityhelpstoschedulemaintenanceofthemostsensitivepartsofthe

38 This Figure (‘General economic structure and connection between TDM and all economic sectors’) byFutureTDMcanbereusedundertheCC-BY4.0license.39http://www.agroknow.com/agroknow

PRIMARY SECTORAgricultureCoal MiningNatural Resources IndustriesWholesale Trade

TERTIARY SECTOREntertainmentFinance, Banking, InsuranceHealth and Social CareIT ServicesPublic AdministrationRetail/TradeTelecommunications and JournalismTourismTransportation

QUARTENARY SECTOREducationResearch and Development (e.g. Tech)

TEXT AND DATA MINING

QUINARY SECTORHigh Level Decision Makers in Government, Industry, Commerce, and Education

SECONDARY SECTORAerospaceAgro-IndustryBio-TechnologyConstruction

EngineeringManufacturingMicroelectronics

Across Economic Sectors

Technology

Data &Tasks

Technology

Data &Tasks

Tools for Analysis

Industry or Gov. Experts

Data &TasksTechnology

Technology TDM Experts

Page 23: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

23

system (Mayer-Schoenberger and Cukier, 2013), (Auclair D., 2016); data-driven smart grid

applicationscansignificantlylowerCO2emissions(ICSU,ISSC,TWAS,IAP,2015).

3.3.Tertiarysector

ThesectorofprovidedservicesembracesTDMatadifferentpacedependingontheTDMexpertise

andawarenessinthefield,BIgDataavailability,andthesensibilityofthisdataintermsofprivacy,as

muchmorepersonaldataisminedwithinthissector.

Companiesallocategreateffort to interactingwith theirexistingcustomers,as thisallows themto

monitorcustomerfeedbackontheproducts,andatthesametimetoprofilecustomersforfurther

targetedproductpromotion.Ontheotherhand,informationaboutusers’Internetsurfingbehaviour

isgatheredintheformofclicklogs,websiteviewsandsearchqueries.Oncethisdataiscategorised

and profiled, the advertising network industry can match the right advertisement from the right

company to the suitable client profile. Sentiment analysis of customer comments in global social

networks,suchasTwitterandinternetforums,allowscompaniestotrackcustomerinterestintheir

products,aswellastousethesechannelsfortargetedadvertisingofservicesandproducts.

3.3.1.Finance,Banking,Insurance

In the financedomain,predictiveanalyticsplayacrucial role.TDMbasedanalyseshelptomanage

risksandtomake informeddecisions.Banksand insurancecompaniesuseTDMtobuildconsumer

profiles anddevelop automatic classifiers todetermineeligibility for financial products, and to set

insurancepremiums.Customerparameterscan includediversetypesof informationgatheredfrom

different sources, ranging from consumers’ zip codes, employment background and educational

history,totheirshoppinghistory,socialmediausageandfriendshipnetwork(RamirezE.etal.,2016).

Financial institutions typically purchase this information from consumer reporting agencies (CRA)

thatcrawldatafromtheInternetandfromspecialgovernmentandprivatedatabases.

3.3.2.Healthandsocialcare

Both health care providers and consumers benefit when treatment plans andmedical advice are

tailoredtoindividualpatients.TDMhasgraduallybecomepartofthehealthcaresystem:inrouting

and treatment tailoring, treatment monitoring, as well as in claims processing and insurance

payments(AuclairD.,2016).Inaddition,BigDataisusedbydifferentmedicalinstitutionstopredict

the likelihood of hospital readmission for different categories of patients and types of disease

combinations.

MedicaltreatmentsandpredictionssuggestedbyartificialintelligenceandTDMarebeingintegrated

indecision support systems fordoctors (e.g.Babylon40 in theUK, IBMWatson41 in theUSA). Such

systems,poweredbyAI,canhelptolowerthecostsinregionswithashortageofspecialtyproviders.

40http://www.wired.co.uk/news/archive/2014-04/28/babylon-ali-parsa41http://www-03.ibm.com/press/us/en/presskit/27793.wss

Page 24: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

24

Moreover, medical TDM systems can flag prescription anomalies based on patient records, thus

helpingtopreventmedicationerrors(AuclairD.,2016).

3.3.3.Publicadministration

Publicadministrationcollectdataaboutdifferentaspectsofthelivesofthemembersofsociety,such

asownershipofmovableandimmovableassets,useofthehealthcaresystem,employmentstatus,

and family situation.This information isused tokeep trackof societaldevelopmentsandchanges,

and to comeupwithadjustments topoliciesand reallocationof services.A substantial amountof

this informationstaysofflineandisnot interconnected. Ifthiscontentwouldbemadeopentothe

public throughopengovernment initiatives,manytypesofbusinessesandnon-profitorganisations

wouldhaveachance toput thisdata touse,e.g.byconnecting thePSIdata to theirowndata,or

building new products and services on top of the released PSI (Mayer-Schoenberger and Cukier,

2013).

Likethefinance industry,publicadministrationcanbenefit frominternalTDMimplementations,as

intelligent mining and automated auditing of the data can help to detect and to prevent fraud,

currently leading to financial losses in tax collection; TDM may generally help to optimise the

workflow.

3.3.4.Retail

Inretail,trackingandminingofpurchasetransactionsbycustomersallowscompaniestoadjustthe

rangeofproductsondisplayaccording to thepreferencesof thecustomers inaparticulararea. In

additiontobetterfittedproductselection,foodretailerscanadjusttheirfacilitiesarrangements,for

example,refrigeratortemperatures,basedonlarge-scalestatisticsofalltheshopsinachain,which

inturnhelpstosaveontheenergybills.

Somemarketplacesaresetupassalesplatformsforindividualsandcompaniesincompletelyvirtual

environments,likeinternationalgiantseBay42andAmazon43ortheFrenchsmallersizeequivalent,Le

BonCoin44.Inthecontextofonlineshopping,theminingofuserbehaviour(clicks,productsselected

orbought,timespentonthepage,browsersused,timeofthedaywhenthewebsiteisaccessed)is

increasingly being used for product recommendations and reminders, which eventually increases

sales.

3.3.5.Telecommunicationsandjournalism

Thetelecommunicationsindustrygeneratessomeofthelargestmultimediadatasets,andTDMhelps

to filter and to play already available content that is of interest to clients depending on their

preferences.Mining customer preferences, in turn, influences decisions on allocating resources to

thecreationofnewcontentthatwillpotentiallybesuccessful.

42http://www.ebay.com43https://www.amazon.com44http://www.leboncoin.fr

Page 25: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

25

Journalists always needed diverse and reliable information sources to report news events and to

bring an analytical perspective to a general audience. Nowadays, journalists have to dealwith an

avalancheofdatasources,thusminingisseenasavaluabletoolforinvestigativejournalismtofind,

confirm,andeventuallytoprovestoriesinamoretransparentway(DiakopoulosN.,2014).Miningof

blogs,websitesandtwitter feedsmakes journalistsawareof theiraudience’s interestsand levelof

knowledge, which helps them to shape their social interaction accordingly and to select relevant

news pieces (Flaounas I. et al., 2013). At the same time, the journalists have to be careful when

redefiningtheiragendausingTDM.TheyneedtocumulatetheknowledgegainedfromTDMwiththe

actualstateofsociety,asnotallsocietymembersarerepresentedinthesocialmediaorwebdata

availableforcrawling.Intheirworkanduseofdata,journalistshavetofindabalancebetweendata

personalisationandtheecologyofcommonknowledge(LewisS.C.andWestlundO.,2014).

3.4.Quaternarysector

This information and knowledge-rich sector focuses on knowledge gathering, processing, and

creation, via research practices and teaching societymembers. In this sector, TDM algorithms are

boththesubjectofresearchandasetoftoolsthatenableresearch.

3.4.1.Research

The speed of scientific progress depends to a large extent on the framework that supports the

sharing of novel ideas and key findings within and across communities, the reproducibility and

comparabilityofresults,andtheresearchimpactmeasurements.TDMtechniqueshavethepotential

toaddvalueandincreaseefficiencyforallthesecomponents.

AsmentionedinSection1andillustratedwithexamplesinFigures1-2,researchersnowadayshave

tokeeptrackofanenormousnumberofpublicationstoretainanup-to-dateunderstandingofthe

research in their field. Moreover, a growing amount of research happens at the intersection of

severaldomains,ormethodsarebeingadoptedacrossdomains.Thus,thecurrentvolumesofcross-

domaincontent render them infeasible tobe readbyhumans.TDMcanhelpresearchersdiscover

newconnectionsandtrends,allowingthemtomoreoptimallyusetheirtimeandotherresources.

Thequalityofscientificresultssharedwithinthescientificcommunityreliesandheavilydependson

thesystematicreviewingofsubmittedarticlesforpublication.Thisrigorousprocessguaranteesthat

theresearchoutcome,whenaccepted,bringsnovel informationtothefieldofknowledge,andthe

results have been achieved following the standard procedures that safeguard reproducibility and

reliability. Ideally,highqualityreviewingreliesonfullawarenessofall thepublications inthefield.

However,withtheincreasingnumberofpublishedstudiesatagloballevel,itbecomesunfeasiblefor

anyhumanresearcher/reviewertokeeptrackofallstudiesthathavebeenreported.Therefore,TDM

canprovidevaluablesupportforreviewers,asitcanreducetheworkloadandminimiseanypotential

bias.

Page 26: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

26

Furthermore,thequalityofpublishedresearchcanbemeasuredwithusingTDM.Forexample,smart

mining of publication references that takes into account the discourse structure in case of each

quotation,yieldsamoresubtleandinformativeoverviewofthewholecitationsnetwork.

Clinicalstudiesrequireameaningfulsamplesizeofpatientswithcertainconditionsinordertotest

theeffectivenessofnewdrugsormedicalprocedure.Pharmaceuticalcompaniesthatdevelopthese

drugsandtreatmentsrelyonintermediarycompaniesthatcarryouttheactualscreeningofmedical

recordsforsuitablepotentialcandidates.TheseintermediariesworkonalargescaleandrelyonTDM

technologiestoestablishpatient-clinicalmatches,aswellastosupportpost-trialpharmacovigilance

throughearlydetectionofadversedrugevents.

3.4.2.Education

Traditional educational institutions can use BIg Data techniques to identify students for advanced

classes,whileat thesametimedeterminestudentswhoareat riskofdroppingoutormightbe in

needofearlyinterventionstrategies.TheGatesFoundationhasinvestedinanumberofprogrammes

to improve the quality and availability of educational data with the aim of developing policy and

practicetoimprovetheimpactofeducation,andtobooststudentachievement45.

Massive open online courses (MOOC), introduced in the late 2000s, are currently reshaping the

wholeconceptofdistanceeducation,andhaveapotentialtoimpactthewholeeducationalsystem,

as theymake it possible tomonitor and assess individual class performance based on large-scale

miningofstudentprofiles,attendance,taskcompletion,andforumdiscussions.

3.5.Quinarysector

Funders and high-level decision makers rely on their general visions of technology and societal

development paths, as well as the impact-metrics that assess potential economic, political, and

societalvalue.TheseexpertscanuseTDMtominethecurrentstatusofaffairs,andtopredict the

outcomes of diverse types of interventions. Moreover, sentiment analysis of social networks

activitiescanprovidethemwithfastfeedbackfromthepopulationonactionstaken.

45 http://www.gatesfoundation.org/Media-Center/Press-Releases/2009/01/Foundation-Invests-in-Research-and-Data-Systems-to-Improve-Student-Achievement

Page 27: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

27

4.SUMMARYOFTHELITERATURE

Astextanddataminingharbouralargepotentialacrosspracticallyallsectorsoftheeconomy,their

legal,economic,andpoliticalimplicationshavebeendiscussedinseveralexpertreportsandfromthe

pointofviewofvariousstakeholders.Thisreportisbasedonseveralofthesereports;inthissection

we summarise the most prominent reports published in recent years for further reading and

reference.Wehighlightthepointswhichcallforurgentlegalandeconomicchangesthatwouldallow

TDMtounleashitsfullpotential.Wealsohighlightcaseswhereuptakesuccessismentioned,andwe

outlinethedirection inwhichdiscussionsdevelop.Figure6depictsthegeneraltimelineofreports,

their research questions and focus (Legal aspects, Economic factors, Examples, Technology). It is

worthnotingthatreportscreatedonbehalfofgovernmentsandacademiaaremoreoftenreleased

via open access, even when produced by private companies. Therefore, they are available for

analysis,whileprivate sector reports, usually createdas aproduct for sale, staybehindapaywall,

restrictingouraccesstotheircontent.

Page 28: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

28

Page 29: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

29

Figure6.Timelineofreports.46

Year Title Authors Researchquestions/Focus

Dataandmethods Keyfindings

2011 DigitalOpportunity.AReviewofIntellectualPropertyandGrowth.47

I.Hargreaves ThisreportaimstofindeconomicevidencetosupportchangesintheIntellectualPropertylawsattheleveloftheUK,andfurtherataunifiedEUlevel.

Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.

TwoimportantareasofutilityofTDM:(1)costsavings,(2)widereconomicimpactgeneration

2011 BIgData:Thenextfrontierforinnovation,competition,andproductivity.48

McKinseyGlobalInstitute(MGI)J.Manyika,M.Chui,B.Brown,J.Bughin,R.Dobbs,C.Roxburgh,A.HungByers

ToexploreBigDatapotentialwhenusedineveryeconomysector,itsinnovationpotential,andtheaddedvalue.

Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices,aswellasinternalMGIresearch.

ThisreportgivesageneralperspectiveontheBigDataTDMbeingincorporatedintoalleconomicsectors,asdataficationreachesallsectorsoftheglobaleconomy.Theexamplesofpotentialprofits,aswellasraisingissues,arelistedforbothEUandUSAmarkets.-Issues:datapoliciesinthecontextofBigData,privacyprotection;storageandaccessfacilitiesandnoveltechniquesthatneedtobeimplemented.-BigDataanditsprocessingcreateadditionalvalueforbusinessesinanumberofways:improvedandmoreusertargetedservicequality,overalltransparencyofperformance,andthesupportofhumandecisionswithreliableanalytics.

2012 TheValueandBenefits D.McDonald,U. Toexplorecosts, Consultationswithstakeholdersand -Theavailabilityofmaterialfortextminingislimited.Evenin

46ThisFigure(‘Timelineofreports’)byFutureTDMcanbereusedundertheCC-BY4.0license.47https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/32563/ipreview-finalreport.pdf48http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation

Page 30: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

30

ofTextMiningtoUKFurtherandHigherEducation.DigitalInfrastructure.JISC.49

Kelly benefits,barriersandrisksassociatedwithtextminingwithinUKfurtherandhighereducation

casestudies(literaturereviewinsystemsbiology,textminingtoexpediteresearch,increaseofaccessibilityandrelevanceofscholarlycontent.

caseofearlyadoptersinthefieldsofbiomedicalchemistryresearch,theminingislimitedtotheOpenAccessdocuments.-Mainbenefitsareincreasedresearchers’efficiency,improvedresearchprocessandquality.Thereportdemonstrateshowmuchmoneyintermsofpersontimecanbesavedorusedforproperresearchtaskswhenthescientificcontentisminedautomatically,leavingtheresearcherswithanalysischallengesandpureresearchquestions.-Barriers:legaluncertainty,inaccessibleinformation.

2012 BusinessintelligenceandAnalytics:FromBigDatatoBigImpact.50

MISQuarterly

ThemainfocusofthisrevueistodescribeanddiscussBusinessintelligenceandAnalytics(BI&A)implementedfordiversemarketapplicationsthatusemanydatatypes(fromuser-generatedcontenttoFederalReserveWireNetworkinformation)

-BibliometricstudyofcriticalBusinessintelligenceandAnalyticspublications

-BI&Ainitscurrentstate(2.0)processesweb-basedunstructureddata,andthenextstepforthetechnology(3.0)willbetousemobileandsensor-basedcontent.

2012 ResearchTrends.SpecialIssueonBigData.51

Thisspecialissuehasabroadperspectivevaryingfromapplicationsscenarios

-- -Bibliometricsanalysisconfirmsthat‘ComputerScience’and‘Engineering’aretheleadingtopicsinBigDatapublications,andUSleadsthepublicationsraceinthefield.

49https://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining50http://hmchen.shidler.hawaii.edu/Chen_big_data_MISQ_2012.pdf51http://www.researchtrends.com/wp-content/uploads/2012/09/Research_Trends_Issue30.pdf

Page 31: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

31

toinfrastructureissuesforTDMinBigDatacontext

2012 META-NET.StrategicResearchAgendaforMultilingualEurope2020.52

METATechnologyCouncil

Thisreportfocusesonthecurrentstateofandthefurtherpotentialfordevelopmentoflanguagetechnologiesonaglobalscale,andmorespecifically,theEuropeanlandscape.

-MultilingualismisaanimportantfeatureofEurope,asitmeansthatthereisneedforthediversificationandlocalisationofthetoolsthatwillbeabletoprovideservicesatthesamelevelforthewholevarietyofEuropeanlanguagesandcustomers.

2013 e-IRGWhitePaper.53 e-IRG ThisreportfocusesontheinfrastructureinitiativesthathavealreadybeentakenandespeciallyonthosethatareurgentlyneededinordertosupportTDMonalargescaleacrosscommunities.

-InfrastructureneedssupportatnationalandEUlevel,andrequirescollaborationwithbusiness.-Openscienceshouldbesupportedthroughandviasharedinfrastructures.

2014 Bridgingthegap.Howdigitalinnovationcandrivegrowthandcreate

P.Hofheinz,M.Mandel

-Thisdocumentoutlinesthegrowthofdataataglobalscale,

Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.

-LargerrevolutioninalleconomicsectorsandaspectsofsocietallifearepredictedtobeduetotheBigDataera,butmostimportantlythesewillbeshapedbydataanalyticsforthis

52http://www.meta-net.eu/vision/reports/meta-net-sra-version_1.0.pdf53http://e-irg.eu/documents/10920/11274/annex_5.2_e-irg_white_paper_2013_-_final_version.pdf

Page 32: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

32

jobs.LisbonCouncilSpecialBriefing.54

andaspectsofdata-driveneconomylikebetterconditionsforcreatinganinclusiveeconomyforallsocietymembers,betterpotentialforcheaperindividualeducation,andtrainingforadvancedmanufacturing.-Theauthorsfocusonthe‘datagap’betweenEuropeandNorthAmerica.-StrongfocusisgiventothediscussiononpolicyregulationsthatneedtobeadjustedacrosstheAtlantic,bothintheEUandtheUSA.

data.-EuropeisbehindcountrieslikeSouthKorea,US,Canada,intermsofdatause.-Therearesuccessfulexamplesofopenaccesspolicyintroductionforpublicdata.InEstonia,thestatereleasedpubliclyowneddataincludingutilitiesandcellphonenetworksafterhavingstrippedprivateinformationandanonymization,whichhadpositiveimpactonthedevelopmentofdata-drivenbusinesses.-TheauthorsarguethatiftheEuropeandataprotectionstatutesarenotlightened,thecostofdatausagewillonlyrise,whichconsequentlywillwidenthegapbetweenEuropeandtheUSA,aswellastherestoftheworld.

2014 MappingTextandDataMininginAcademicandResearchCommunitiesinEurope.LisbonCouncilSpecialBriefing.55

S.Filippov -Thisreportcontinuestheabove-mentionedworkongatheringinformationontheimportanceofTDMfortheglobaleconomy

Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices,aswellasinterviews,andbibliographyandpatentbasedanalysisonScienceDirectdatabase.

-Thedifferenceincopyrightaccessacrossthecountriesisreflectedbythenumberofpublicationsontextanddataminingsubjects.TheUSisleadingthefield,whiletheonlyEuropeancountryatthetopofthelististheUK,whichhasanexceptionforcontentminingforacademicpurposes.-ThenumberofpublicationsonTDMinEnglishisseveral

54http://www.progressivepolicy.org/wp-content/uploads/2014/04/LISBON_COUNCIL_PPI_Bridging_the_Data_Gap.pdf55http://www.lisboncouncil.net/publication/publication/109-mapping-text-and-data-mining-in-academic-and-research-communities-in-europe.html

Page 33: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

33

development,andfocusesonthereasonswhyEUlagsbehindtheUSandsomeAsiancountriesinthisdomain,andhypothesiseshowthiscanchangeeitherpositivelyornegatively,dependingonthemeasurestaken.

ordershigherthaninanyotherlanguages,covering97%ofthepublications.-TDMispatentedasalgorithmsintheUSandotherjurisdictions,whileinEuropeTDMispatentedasbeingembeddedinthesoftware.Chinaisrisingstronglyinthismarketintermsofpatents.

2014 Standardisationintheareaofinnovationandtechnologicaldevelopment,notablyinthefieldofTextandDataMining.ReportfromtheExpertGroup,EuropeanCommission.56

ExpertGroupI.Hargreaves,L.Guibault,C.Handke,P.Valcke,B.Martens,R.Lynch,S.Filippov

-ThisreportliststhestakeholdertypesandthelegalandeconomicissuesthattheyencounterbothintheEUandintherestoftheworld.-TheexpertshypothesiseonhowthevaluechainmightbeadjustedwhenTDMisfullyincorporatedandlegallyallowed,arguingthatitshouldhaveamorepositiveimpactoverall.Thedataiscategorisedaccordingtobothsizeandlegalaccesspattern.

2014 AssessingtheeconomicimpactsofadaptingcertainlimitationsandexceptionstocopyrightandrelatedrightsintheEU.Analysisofspecificpolicyoptions.57

CharlesRiverAssociates(CRA)J.Boulanger,A.Carbonnel,R.DeConinck,G.Langus.

Thisreportfocusesoneconomicimpactsofspecificpolicyoptionsinseveraltopicsofinterest(digitalpreservationofandremoteaccessto

-IntroductionofIPexceptionsforthecaseofTDMforscienceshouldnotsignificantlyadverselyaffectrightholders’incentivesforcontentcreation,contentqualityandTDM-specificinvestments.

56http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf57http://ec.europa.eu/internal_market/copyright/docs/studies/140623-limitations-economic-impacts-study_en.pdf

Page 34: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

34

culturalheritage,e-lendingforlibraries,TDMforscientificpurposes),inviewofprovidingpolicyguidanceonthesetopics.

2014 Studyonthelegalframeworkoftextanddatamining(TDM).58

DeWolf&PartnersfortheEuropeanCommission

ThefocusisonthelegalaspectsofTDM.

Dataanalysisisthemostencompassingtermandismorepreferableforuseinalegalcontext.Classificationof4differentlevelsofaccess:“alltoall”(web-date)/“manytomany”(socialnetworks)/“onetomany”(contractualclauses)/“onetoone”(confidentialagreements)

2014 TheDataHarvest:Howsharingresearchdatacanyieldknowledge,jobsandgrowth.59

RDAEuropeF.Genova,H.Hanahoe,L.Laaskonen,C.Morais-Pires,P.Wittenburg,J.Wood

ThereportfocusesonthemeasuresthatshouldbetakeninordertosupportresearchusingBigData.

-Mainincentivesofthereportinclude:--dataplanmanagement;--promotionofliteracyintermsofalltheaspectsofBigDatauseacrosssociety;--developtoolsandpoliciesthatsupportdatasharing.

2015 OECDScience,TechnologyandIndustryScoreboard2015.INNOVATIONFOR

OrganisationforEconomicCo-operationandDevelopment,

-Thisreportaimstoprovidepolicymakersandanalystswiththemeanstocompare

-Thisisabroadlyscopedreportthathassectionsthatdiscusstrendsandfeaturesofknowledgeeconomies.-ItgivesperspectivesontheoverallscientificexcellenceofEUcountriescomparedtotherestoftheworldacrossallfieldsof

58http://ec.europa.eu/internal_market/copyright/docs/studies/1403_study2_en.pdf59https://rd-alliance.org/sites/default/files/attachment/TheDataHarvestFinal.pdf

Page 35: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

35

GROWTHANDSOCIETY.60

Europe

economieswithothersofasimilarsizeorwithasimilarstructure,andtomonitorprogresstowardsdesirednationalorsupranationalpolicygoals.-Themainfocusofthisreportisontheknowledge-basedeconomies’developmenttrends.

science,andtheamountofinvestmentindifferentdomains,withspecialhighlightsonthenewgenerationofICT-related,BigDatabasedtechnologiesandpatents.-Acrossmanymetrics,EuropeancountrieswhencombinedarecomparabletotheresultsofUSA,Japan,China,SouthKorea.

2015 Skillsofthedatavores.TalentandtheDatarevolution.61

Nesta,UKJ.Mateos-Garcia,H.Bakhshi,G.Windsor

IntroducesitsownclassificationofthecompaniesthatworkwithBigData,andthenanalysesthetypesofskillsdatascientistsandengineershavetolearnandmastertoboosttheeconomyandscientificdevelopment.TheanalysisfocusesontheUKmarketwhilethecompany

InternalNESTAanalysisofthedataanalyticsmarketandhiringstrategiesandchallengesinthefieldthatarebasedonmorethan50in-depthinterviews(inthefieldsofmanufacturing,retail,ICT,creativemedia,financialservices,pharmaceuticals),collaborationwithCreativeSkillset.

-Classificationofcompaniesconnectedtodataprocessing:--‘Data-active’companiesthatareeitherdata-driven(Datavores),workwithlargevolumesofdata(DataBuilders),orcombinedatafromdifferentsources(DataMixers),--Dataphobes:othercompaniesthatrestrictthemselvestosmallsizedatasets.-Thesedifferenttypesofcompanieshavedifferentbehaviourswhilesettinguptheirrequirements,andthenhiringthedataexperts.Forexample,datavoresanddatabuildersrecruitmostoftheircandidatesfromComputerSciences,Engineering,andMathematics,whileBusinessdisciplinesarethesourceformanagementanalyticspositions.

60http://www.oecd.org/science/oecd-science-technology-and-industry-scoreboard-20725345.htm61https://www.nesta.org.uk/sites/default/files/skills_of_the_datavores.pdf

Page 36: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

36

classificationcanbeusedglobally.

2015 BriefingPaper.TextandDataMiningandtheNeedforaScience-friendlyEUCopyrightReform.62

ScienceEurope Summarisestheargumentsinthedebateforcopyrightlawreformthatshouldensurethatlegally-accessedcontentcouldbefreelyminedwithoutadditionalpermissionandcostwithintheEU.

Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.

-EUregulationsfordatabaseprotection(Directive1996/9/EC)seriouslyimpederesearcherswhoaimtoaccessthecontenttobeminedviaadatabase,asdownloadingasubstantialpartofthedatabasecontentneedstobeauthorisedbythedatabaseowner,evenifnoneofthecontentisundercopyright.-Scientificpublishersstarttointroducelicencesregulatingtheminingofsubscribedcontent,whichfurtherlimitstheresearchersintheirimmediateinteractionwiththecontent.-ExceptionsincopyrightlawincountrieslikeUK,Japan,USA,whichallowtheirrepresentativestoavoiddifficultieswiththelegalityofcontentmining,whichputsEUresearchersatadisadvantage,andcomplicatespotentialforinternationalcollaborations.-Theauthorssuggestthefollowingstepsforresearchorganisations:--AvoidrequestingthattheirresearchersabstainfromTDM;--RefusetosignTDM-licensesaspartofsubscriptioncontracts,ornegotiatemorefavourablecondition;--EmpowertheiremployeesinternallythroughtrainingandTDMsupport,andexternallybyadvocatingcopyrightchangesatnationalandEuropeanlevel.

2015 IsEuropefallingbehindinDataMining?Copyright’sImpacton

C.Handke,L.Guibault,J.-J.Vallbe

Thispaperfocusesonthecopyrightarrangements

Theauthorsmeasureresearchoutputinthenumberofpublicationsinacademicjournal

DemonstratesthatresearchersfromtheEU/EEAcountrieshaveweakerperformanceintermsofpublicationsinthegrowingfieldofdatamining.Thisispartiallyduetothecomparatively

62http://www.scienceeurope.org/uploads/PublicDocumentsAndSpeeches/WGs_docs/SE_Briefing_Paper_textand_Data_web.pdf

Page 37: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

37

DataMininginAcademicresearch.63

impactingdataminingbyacademicresearchers.

articlesinThomsonReuters’sWebofScience(WoS)database.

strongcopyrightprotectionlawsinthesecountries.

2015 ScienceInternational:OpenDatainaBigDataWorld.64

InternationalCouncilforScience(ISCU),InternationalScienceCouncil(ISSC),TheWorldAcademyofSciences(TWAS),InterAcademyPartnership(IAP)

-ThisreportfollowsanaccordofInternationalScientificorganisationsandfocusesontheopportunitiesthattheBigDatarealityrepresents.-Itdiscusseshowtextanddataminingofthisscientificcontentonamassscalecanhelpwhenthescienceaddressesglobalchallengessuchasinfectiousdisease,energydepletion,migration,sustainabilityandtheoperationoftheglobaleconomyingeneral.

Basedonanalysisofpreviouspublications,reports,reportedlegalandeconomicpractices.

-Unprecedentedopportunitiesforscientificdevelopmentincurrentdigitalagerelyon2pillars:BigData(largevolumesofcomplexanddiversedatastreaminginrealtime)andLinkedData(separatedatasetslogicallyrelatedtoaparticularphenomenonallowcomputerstoinferdeeperrelationships).-Forthescientists,theopenaccesstothescientificdataandpublicationsisamuch-neededimperative.Therefore,authorsinsistonderegulatingopendataforscientificdatausage.Datashouldbe“intelligentlyopen”,i.e.discoverableontheWeb,accessibleandintelligibleforfurtheruse,andassessableintermsofdataproducers’quality.Thisopennesswillalsohelpreplicabilityteststhattestandscrutinisethereportedresults.-Theauthorsnotetheongoingdiscussionwhetherpublicdatashouldbefreelyavailabletoeveryone,orjusttothenon-profitsector.Theyarguethatitisprimarilynotappropriateorproductivetointroducethisdiscriminationindataaccess,andmoreimportantly,robustevidenceisaccumulatingtosupporttheclaimthatthisaccesstodataforeveryonebringsdiversebenefits,andbroadereconomicandsocietalvalue.-Theauthorsgiveexamplesofcross-countrycollaborationsinopenaccessinitiativesinSouthAmericaandAfrica,whilenotingthatsharingdataallowsmoreuniformdevelopmentoftechnologiesandtheirimplementationonaglobalscale.

63http://elpub.architexturez.net/system/files/15_HANDKE_Elpub2015_Paper_23.pdf64http://www.icsu.org/science-international/accord/open-data-in-a-big-data-world-short

Page 38: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

38

2016 TextandDataMining:TechnologiesUnderConstruction.65

Outsellmarketperformancereport,USAD.Auclair

-FocusofthisreportisonTDMaspectsforscientificresearch,healthcare,andengineering.

Theauthorsbasethereportonapproximately20in-depthinterviewswitharangeofstakeholderslikecontentvendorsandtechnologyandtoolproviders,aswellasonpreviouslypublishedinformation.

-Theauthorshighlightthefactthatdatamininghasbeenusedlongerformarketresearchandcommercialactivities,ascorrespondingcommercialwebsitescontainlargequantitiesofstructureddata,whileacademiaandindustryhavetodealwithhighvolumeofunstructuredtextsanddata,whichmadetheirprogressinuptakeslower.-ThereportprovidesalistofprovidersofTDM-relatedfunctionsandexamplesoftypesofbusinessthattheyareserving.-TheauthorslistanumberofactionsforcontentprovidersandTDMtoolvendorstofollow,thatcouldhelpthemtobenefitfromTDMuptakeinthenearfuture:--TDMsolutionsshouldbescalable--Storeddatashouldfollowthestandardsatcollectionandstoragestages.--Privacyandsecurityregulationsaretobeensured.

2016 BigData.AToolforInclusionorExclusion?UnderstandingtheIssues.66

FederalTradeCommission(FTC)Report,USAE.Ramirez,J.Brill,M.K.Ohlhaussen,T.McSweeny

-ThisreportarguesthatitisnotaquestionofwhethercompaniesshoulduseBigDataanalytics,buthowtoitthatcorrectly,minimisinglegalandethicalrisks.-HighlightsneedtoidentifyandavoidpotentialbiasesandinaccuraciesinTDM

FTCheldapublicone-dayworkshop‘BigData:AToolforInclusionorExclusion?’InordertogetafeedbackfrominvolvedstakeholdersonthepotentialofBigDataintermsofcreatedopportunitiesforconsumersandpolicyissues,namelysecuritybreaches,andbiasesandinaccuraciesinproperpopulation/eventsrepresentationinthedata.

-ThereportincludesalistofquestionsthataretobeconsideredandinvestigatedwhencompaniesareusingBigDataanalyticsintheirbusinessmodels.Thisprocedurecouldhelpthemtoavoidviolatinglawsintermsofdiscrimination,andtopreventerrorsthatmightoccurduetotrainingdatasetinconsistenciesthatcouldleadtopotentialbiasesinoffersandservices.Listofquestions:-Howrepresentativeisyourdataset?-Doesyourdatamodelaccountforbiases?-HowaccurateareyourpredictionsbasedonBigData?-DoesyourrelianceonBigDataraiseethicalorfairness

65http://www.copyright.com/wp-content/uploads/2016/01/Outsell-Market-Performance-22jan2016-Data-and-Text-Mining-Licensed.pdf66https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf

Page 39: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

39

applications,sothattheBigDataisusedtocreatenovelopportunitiesforsocietalimprovements,andespeciallytosupportlow-incomeandunderservedcommunities.

concerns?

2016 EuropeanParliamentresolutionon‘Towardsathrivingdata-driveneconomy’.67

EUParliamentJ.BuzekonbehalfoftheCommitteeonIndustry,ResearchandEnergy

Thisdocumentaimstostructuretheknowledge-basedeconomylandscape,withafocusontheEU,andtooutlinewhatareurgentmeasuresneedtobetakeninordertobetheleadersinthefieldataglobalscale.

Basedonanalysisofpreviouspublications,reports,andinternallyproducedfinancialestimationsofthemarketpotential.

-ThisresolutionoutlinesBigDataanddata-driveneconomypotentialtill2020,withafocusontheEuropeanissuesandneeds.-AnnouncementoftheEuropean‘FreeFlowofData’initiative-StressestheneedforamodernICTinfrastructureataEuropeanscale.

67http://www.europarl.europa.eu/sides/getDoc.do?type=MOTION&reference=B8-2016-0308&language=EN

Page 40: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

40

5.CONCLUSION

Theuptakeoftextanddataminingisgrowingacrossallsectorsofknowledge-basedeconomies,asthestrengthsofTDManditseconomicadvantages,suchastimesavingandscaling,areincreasinglywellrecognised(ICSU,ISSC,TWAS,IAP,2015),(AuclairD.,2016).WhilethedevelopmentofTDMbythe larger IT companies is directed at global markets, there is a long tail of TDM research anddevelopment in industry aimedat specificmarkets andniches.MostTDM industriesmakesuseofprovenTDMtechnologies,suchassupervisedmachine-learningalgorithmsforclassificationorrisksassessment(Ittooetal.,2016),(FleurenandAlkema,2015),howeverthereisahealthyrelationshipbetween industry and academic research on the development and uptake of newmethods fromfieldslikeArtificialIntelligence(e.g.deeplearningalgorithms).

Ingeneral,industrieshavenotroublefindingaccesstodata,andmanagetouseTDMforeconomicprofit.However,different typesofbarriersstilloccur. Ifbarriers to theavailabilityof full textsanddatasetsfromacademicdomainsandpublicsectorinformationwerelifted,thefirstapplicationsthatcouldbemadepossiblewouldfinallyreviveoldvisionsfromthesecondhalfofthe20thcenturyonaccomplishing automatic unconstrained scientific discovery from observational data (Simon et al.,1981),(Simonetal.,2007).

Large-scaleglobalcompanieslikeAlphabet(Google), IBM,andMicrosoft,currently investheavily inTDM algorithms and infrastructure development, as much of the value they accrue nowadays isleveragedfromthedataprovidedtothembytheircustomers,incombinationwithallthecrawlablecontent on the web. For companies and research institutes based in the European Union, acompetitivestrategynecessarilyinvolvesawideruptakeofTDM,basedonabetterawarenessofthepotentialofTDMacrossalleconomicsectors.

Page 41: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

41

REFERENCES

Agrawal,R., Imielinski,T.,Swami,A.,1993.Miningassociationrulesbetweensetsof items in largedatabases, in: Proceedings of the 1993 ACM SIGMOD International Conference onManagementofData.ACM,Washington,D.C.,USA,pp.207–216.

Auclair D., 2016. Text and Data Mining: Technologies Under Construction. (Outsell marketperformancereport).

Baeza-Yates, R.A., Ribeiro-Neto, B., 1999.Modern InformationRetrieval. Addison-Wesley LongmanPublishingCo.,Inc.,Boston,MA,USA.

Cohen,W.W.,1995. Learning to classifyEnglish textwith ILPmethods.Advances in inductive logicprogramming32,124–143.

DiakopoulosN.,2014.Algorithmicaccountability.Journalistic investigationofcomputatiionalpowerstructures.DigitalJournalism3,398–415.

Filippov,S.,2014.MappingTextandDataMininginAcademicandResearchCommunitiesinEurope.FlaounasI.,AliO.,Landsdall-WelfareT.,DeBieT.,MosdellN.,LewisJ.,CristianiniN.,2013.Research

methodsintheageofdigitaljournalism.Massive-scaleautomatedanalysisofnews-content-topics,style,gender.DigitalJournalism1,102–116.

Fleuren,W.W.M.,Alkema,W.,2015.Applicationoftextmininginthebiomedicaldomain.Methods74,97–106.doi:10.1016/j.ymeth.2015.01.015

Hargreaves I., Guibault L., Handke C., Valcke P., Martens B., Lynch R., Filippov S., 2014.Standardisationintheareaofinnovationandtechnologicaldevelopment,notablyinthefieldofTextandDataMining.ReportfromtheExpertGroup.

Hearst,M.A., 1999.Untangling TextDataMining. CollegePark,Maryland, Proceedingsof the37thAnnual Meeting of the Association for Computational Linguistics on ComputationalLinguistics(ACL’99)3–10.

ICSU, ISSC, TWAS, IAP, 2015. Science International: Open Data in a Big DataWorld. InternationalCouncil for Science, International Science Council, The World Academy of Sciences,InterAcademyPartnership.

Ittoo, A.,Nguyen, L.M., van denBosch, A., 2016. Text analytics in industry: Challenges, desiderataandtrends.ComputersinIndustry.doi:10.1016/j.compind.2015.12.001

Joachims,T.,1998.TextCategorizationwithSuportVectorMachines:LearningwithManyRelevantFeatures, in:Proceedingsof the10thEuropeanConferenceonMachineLearning.Springer-Verlag,pp.137–142.

Lewis,D.D.,1998.Naive(Bayes)atForty:TheIndependenceAssumptioninInformationRetrieval,in:Proceedingsofthe10thEuropeanConferenceonMachineLearning.Springer-Verlag,pp.4–15.

Lewis S.C., Westlund O., 2014. Big data and journalism: Epistemology, expertise, economics, andethics.DigitalJournalism3,447–466.

Mayer-Schoenberger,V.,Cukier,K.,2013.BigData:ARevolutionThatWillTransformHowWeLive,WorkandThink.JohnMurrayPublishers.

MotionforaresolutionfurthertoQuestionforOralAnswerB8-0116/2016pursuanttoRule128(5)oftheRulesofProcedureon“Towardsathrivingdata-driveneconomy”(2015/2612(RSP),2016.

OCDE, 1996. The knowledge-based economy. Organisation for Economic Co-operation andDevelopment,Paris.

RamirezE.,BrillJ.,OhlhaussenM.K.,McSweenyT.,2016.BigData.AToolforInclusionorExclusion?UnderstandingtheIssues.(FederalTradeCommission(FTC)Report).

Page 42: Deliverable D3.1 Research Report on TDM Landscape in Europe · Research Report on TDM Landscape in Europe. D3.1 RESEARCH REPORT ON TDM LANDSCAPE IN EUROPE ... As data is a manifold

D3.1RESEARCHREPORTONTDMLANDSCAPEINEUROPE

©2016FutureTDM|Horizon2020|GARRI-3-2014|665940

42

Simon,H.A., Langley,P.W.,Bradshaw,G.L., 1981. Scientificdiscoveryasproblemsolving. Synthese47,1–27.doi:10.1007/BF01064262

Simon,Uren,V.,Li,G.,Sereno,B.,Mancini,C.,2007.Modelingnaturalisticargumentationinresearchliteratures:Representationandinteractiondesignissues:ResearchArticles.Int.J.Intell.Syst.22,17–47.doi:10.1002/int.v22:1

Swanson, D.R., 1986. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. PerspectBiolMed30.

vanHaagen,H.H.H.B.M.,’tHoen,P.,Bovo,A.B.,deMorrée,A.,vanMulligen,E.,Chichester,C.,Kors,J.,denDunnen, J.,vanOmmen,G.J.B.,vanderMaarel,S.,Kern,V.M.,Mons,B.,Schuemie,M., 2009.Novel protein-protein interactions inferred from literature context. PLoSONE 4.doi:10.1371/journal.pone.0007894