Dacura: A New Solution to Data Harvesting and Knowledge Extraction for Archaeology

Peter N. Peregrine, Rob Brennan, Thomas Currie, Kevin Feeney, Pieter François, Peter Turchin, and Harvey Whitehouse

SFI WORKING PAPER: 2017-07-023

SFI Working Papers contain accounts of scientific work of the author(s) and do not necessarily represent the views of the Santa Fe Institute. We accept papers intended for publication in peer-reviewed journals or proceedings volumes, but not papers that have already appeared in print. Except for papers by our external faculty, papers must be based on work done at SFI, inspired by an invited visit to or collaboration at SFI, or funded by an SFI grant.

©NOTICE: This working paper is included by permission of the contributing author(s) as a means to ensure timely distribution of the scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the author(s). It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may be reposted only with the explicit permission of the copyright holder.

www.santafe.edu

SANTA FE INSTITUTE




Peter N. Peregrine, Lawrence University, 711 E. Boldt Way, Appleton WI 54911 and Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM, 87501

Rob Brennan, ADAPT & Knowledge and Data Engineering Group, School of Computer Science and Statistics, Trinity College Dublin, Ireland

Thomas Currie, Department of Biosciences, University of Exeter, Penryn Campus, Cornwall, TR10 9FE, UK

Kevin Feeney, Knowledge and Data Engineering Group, School of Computer Science and Statistics, Trinity College Dublin, Ireland

Pieter François, School of Humanities, De Havilland Campus, University of Hertfordshire, Hatfield, AL10 9EU, UK and Institute of Cognitive and Evolutionary Anthropology, Oxford University, Oxford OX4 1QH, UK

Peter Turchin, Department of Ecology and Evolutionary Biology, University of Connecticut, 75 N. Eagleville Road, Storrs, CT 06269-3042

Harvey Whitehouse, Institute of Cognitive and Evolutionary Anthropology, Oxford University, Oxford OX4 1QH, UK

Abstract

Archaeologists are both blessed and cursed by the information now available through the Internet. We are blessed by the pure abundance of articles, images, and data that we can discover with a simple search, but we are also cursed by the difficult process of parsing those discoveries down to those of scholarly quality that relate to our specific interests. As an example of how new advances in computer science address these problems we introduce Dacura, a dataset curation platform designed to assist researchers from any discipline in harvesting, evaluating, and curating high-quality information sets from the Internet and other sources. We provide an example of Dacura in practice as the software employed to populate and manage the massive Seshat databank of historical and archaeological information.

Archaeologists are both blessed and cursed by the information now available through the Internet. We are blessed by the sheer abundance of articles, images, and data that we can discover with a simple search, but we are also cursed by the difficult process of sifting those discoveries down to those of scholarly quality that relate to our particular interests. As a cure for this curse we introduce Dacura, a dataset curation platform designed to help researchers from any discipline harvest, evaluate, and curate high-quality information sets from the Internet and other sources. We offer an example of Dacura in practice as the software used to populate and manage the massive Seshat databank of historical and archaeological information.

Current developments in computer science provide new ways of harvesting, storing, and retrieving data from the Internet that have the potential to transform how archaeological literature reviews and data harvesting are done. Dacura is a data curation platform that reflects two of these developments: a “graphical” data structure (as opposed to the standard column and row data structure) and an automated process for weeding out the thousands of on-line and database hits not directly related to a problem of interest and/or of dubious accuracy. Dacura was built using the Seshat databank, which identifies and coordinates historical and archaeological information derived in part from the Internet, as a working focus. We introduce both Dacura and Seshat here as concrete examples of how the advances in computer science might be employed by archaeologists.

We begin with the basic problem the Dacura data curation platform is intended to address: the overabundance of unevaluated information available to researchers. As an example, consider a researcher who wants to build a database on a particular topic, such as population estimates for the Big Island of Hawaii from the time of colonization to the reign of Kamehameha II. If she were to simply type “ancient Hawaii population” into Google, she would obtain nearly 250,000 results (some discussing modern demographics) with no easy way of knowing which of the many thousands of results on ancient Hawaii would provide the information she needs, nor which of them would provide reliable information (the Wikipedia page on “Ancient Hawaiian Population”, for example, provides only high estimates and apparently from only one source; the inability to clearly identify the source of the data is itself a serious problem). If this researcher were to use Google Scholar instead, the results would be fewer (around 165,000), and although she could expect somewhat better quality, there would remain the daunting task of identifying papers and books directly relevant to her interests. Even JSTOR, with quality-ensured content, would proffer around 60,000 articles to churn through.

The example above illustrates a central problem in contemporary research: the Internet and open-access publishing provide researchers abundant information on virtually any topic of interest, but there is no quality assurance for Internet search results, and even where quality can be assumed (as in peer-reviewed open-access publications), the amount of information is often overwhelming. What is needed is a search tool that provides a middle ground: easy searching, an assurance of quality, and a manageable body of results. Such a search tool requires a carefully designed hierarchical structure (ontology) to allow a scholar to easily dig down through results to those that are directly relevant to his or her research. This search tool also requires detailed indexing across result domains so that a search for “apples” recovers all information on “apples” but does not retrieve “oranges” when applied to particular domains. In other words, such a search tool must be able to apply an integrated thesaurus or set of thesauri as part of the basic search routine.

There are a number of extant search tools that provide this functionality: rapid retrieval of specific, quality information across domains. For example, eHRAF (Human Relations Area Files; hraf.yale.edu) maintains two archives of documents (ethnographic and archaeological, respectively) organized using detailed ontologies (the Outline of World Cultures and Outline of Archaeological Traditions) and employing a rich thesaurus (the Outline of Cultural Materials). Individual paragraphs from nearly three-quarters of a million pages of archaeological and ethnographic primary and secondary source documents are indexed in eHRAF and can be easily searched and retrieved at varying levels of detail using hierarchical and Boolean search strategies. The results are specific, of excellent quality, and manageable in number. However, the range of results is limited to the documents that have been included in the eHRAF archives. The reason eHRAF provides such excellent information retrieval is that the information has been extensively pre-processed, to the extent that every document has been individually placed into the ontology and every paragraph in every document individually indexed by Ph.D.-holding anthropologists. In short, a huge amount of work is required to make search and retrieval easy, and that means the data provided by eHRAF grow slowly and eHRAF cannot afford to be open-source.

An alternative model of a search tool providing rapid retrieval of specific, quality information across domains is tDAR (the Digital Archaeological Record; www.tdar.org). Like eHRAF, entire documents (including raw datasets, shapefiles, and the like) are available through tDAR, and are organized within a basic ontology. Unlike eHRAF, these documents are not processed by tDAR staff (although there is review of the processing to ensure it has been done correctly), but rather the individuals who submit documents complete a metadata form which is attached to the document (Watts 2011). This allows the number of documents in tDAR to increase relatively rapidly, and also allows tDAR to remain open source (there are modest fees for contributing documents). However, because contributors provide the ontological and indexing information themselves, the level of detail and accuracy vary, meaning that searches may not retrieve all relevant documents. And, like eHRAF, the available information is limited to the documents within the database.

Open Context (www.opencontext.org) is another excellent open-source data repository for archaeology that is similar to tDAR, but which provides several additional features that expand its range beyond archaeological data. Like tDAR, archaeological data are contributed for a modest fee. Unlike tDAR, Open Context editors work with contributors to create the metadata and clean the data sources for publication on the web, and the data sources themselves are evaluated for their importance; that is, not all data sources are published, only those that peer reviewers think will be of use to the broader field. Once incorporated into Open Context, data sources are linked to related data sources on the web by Linked Data standards (Kansa 2010). This allows Open Context to expand beyond the archived data, overcoming a limitation of both eHRAF and tDAR.

We present here what we argue is a more comprehensive approach to the problem of retrieving specific, quality information across domains than the three outlined above (and there are many other excellent programs and data repositories we could have cited): a set of data harvesting, evaluation, assembly, and output processes that has been implemented in Dacura (dacura.cs.tcd.ie) and which itself is being employed as the managing software for the Seshat databank (seshatdatabank.info). By being developed and implemented as an integral part of a data-heavy research initiative, Dacura has benefitted from the ongoing identification of problems and shortcomings that gathering and managing large and complex data entail, and thus serves as a good example of a resource that would be of use to academic researchers. We do not intend this article to be simply an advertisement for Dacura, but rather we use Dacura and the Seshat databank to illustrate an approach to harvesting, evaluating, and retrieving data from the Internet or any “big data” source that has been made possible by new advances in computer science and that we believe will have a profound impact on archaeological research.

Dacura

Dacura is a data curation platform designed to assist researchers from any discipline in creating and curating high-quality datasets. The basic idea is simple: the researcher starts by defining the precise structure of the dataset that they would like to collect. The system uses this detailed information to support the user in discovering, harvesting, filtering, correcting, refining and analyzing information from the Internet in order to compile the highest quality information possible with which to populate the dataset. The details provided by the researcher include basic information such as the definition of the fundamental entities of interest (e.g. Hawaii), the properties of those entities in which she or he is interested (e.g. population estimates), the datatypes and desired units of each property (e.g. descriptions, counts, kilos), and relationships with other entities both within and beyond the dataset itself (e.g. Polynesia incorporates Hawaii; Hawaii isDescribedBy http://wikipedia/Hawaii).
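To make the idea of a dataset specification concrete, the following is a minimal sketch in Python of how the entities, properties, datatypes, units, and relationships described above might be written down. It is an illustration only, not Dacura's actual input format: the class names `PropertySpec` and `EntitySpec`, the field names, and the datatype labels are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PropertySpec:
    """One property to collect for an entity (hypothetical structure)."""
    name: str
    datatype: str   # e.g. "integer" or "string" (labels assumed here)
    unit: str = ""  # desired unit; empty when not applicable


@dataclass
class EntitySpec:
    """An entity of interest with its properties and relationships."""
    name: str
    properties: List[PropertySpec] = field(default_factory=list)
    # Each relationship is a (predicate, target) pair; the target may be
    # another entity in the dataset or an external URL.
    relationships: List[Tuple[str, str]] = field(default_factory=list)


# The examples from the text: Hawaii with a population estimate and a
# description, and Polynesia incorporating Hawaii.
hawaii = EntitySpec(
    name="Hawaii",
    properties=[
        PropertySpec("population_estimate", "integer", "count"),
        PropertySpec("description", "string"),
    ],
    relationships=[("isDescribedBy", "http://wikipedia/Hawaii")],
)
polynesia = EntitySpec(
    name="Polynesia",
    relationships=[("incorporates", "Hawaii")],
)


def property_names(spec: EntitySpec) -> List[str]:
    """List the property names a harvester would need to populate."""
    return [p.name for p in spec.properties]


print(property_names(hawaii))
```

Writing the specification as explicit data rather than as a free-text search string is what allows later stages to check harvested values against declared datatypes and units.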

The process of defining the structure of the desired dataset is one of Dacura’s strengths. In order for Dacura to search an existing database or the Internet to harvest data, the desired data must be carefully considered by the researcher, as well as the structure and type of data desired. This process forces the researcher to carefully design his or her research question and to carefully consider what types of analyses are necessary to answer that question. It is important to point out, however, that because Dacura provides such a flexible search structure, data harvesting can be an iterative process, changing as data are evaluated and as questions become more focused.

Dacura encodes the structure of the dataset defined by the researcher as a semantic web ontology according to the Web Ontology Language (OWL) standard of the World Wide Web Consortium (W3C), the main international standards body for the web. OWL is a rich and flexible language which allows a wide variety of constraints and inference rules to be specified on the data to be collected (e.g. the population of a site should not be greater than the population of the region that it is in). In contrast with the unstructured natural language strings that drive search engine results, the highly structured and precisely specified nature of ontological dataset specifications can be exploited by the computer to provide much greater specificity in results. The richer the structural specification, the easier it is for the system to automate the harvesting of data and the generation of useful tools with which to analyze, improve and curate it over time.
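The constraint given as an example above (a site's population should not exceed that of its containing region) can be sketched as a simple check. The code below is plain Python rather than OWL, and it is not how Dacura evaluates constraints; the function name, the dictionaries, and all figures are hypothetical placeholders chosen only to show the logic.

```python
from typing import Dict, List, Tuple

# Placeholder figures for illustration only; they are not real data.
population: Dict[str, int] = {"RegionA": 1000, "SiteA1": 400, "SiteA2": 1500}
located_in: Dict[str, str] = {"SiteA1": "RegionA", "SiteA2": "RegionA"}


def constraint_violations(
    population: Dict[str, int], located_in: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (site, region) pairs where the site outnumbers its region."""
    violations = []
    for site, region in located_in.items():
        if population[site] > population[region]:
            violations.append((site, region))
    return violations


print(constraint_violations(population, located_in))
```

In this sketch a violation is reported rather than silently corrected, which leaves the decision about which value is wrong to a later review step.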


Dacura is based on semantic web technology. At its core is a Resource Description Framework (RDF) triplestore, a specific form of graph database (as opposed to a two-dimensional column and row database used in most spreadsheets) in which data are identified by a subject-predicate-object combination like “Hawaii is Polynesia”, “Hawaii has Island”, or “Polynesia has Island” (www.w3.org/TR/rdf11-concepts/). The subject-predicate-object structure can be understood as nodes-edges-properties within a three-dimensional graph which represents and stores data. The graph structure of an RDF triplestore allows for index-free adjacency, meaning every subject-predicate-object triple directly links to related subject-predicate-object triples so that no index lookups are necessary. In the example above, Polynesia, Hawaii, and Island are all linked so that no indexed search is required to identify Hawaii as a Polynesian Island.
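The three example triples can be written out and traversed in a few lines of Python. This is a toy sketch, not a triplestore: the function names are assumptions, and a Python dictionary stands in for the direct node-to-node links of a real graph database purely for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]

# The subject-predicate-object examples given in the text.
triples: List[Triple] = [
    ("Hawaii", "is", "Polynesia"),
    ("Hawaii", "has", "Island"),
    ("Polynesia", "has", "Island"),
]


def build_adjacency(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Attach to each subject its outgoing (predicate, object) edges."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph: Dict[str, List[Tuple[str, str]]], start: str) -> Set[str]:
    """Follow edges outward from start and collect every node reached."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _predicate, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


graph = build_adjacency(triples)
print(reachable(graph, "Hawaii"))
```

Starting from Hawaii, the traversal reaches both Polynesia and Island by following links from node to node, which is the intuition behind identifying Hawaii as a Polynesian island without a separate search over the whole dataset.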

Figure 1. An overview of the Dacura data curation platform. Source: https://youtu.be/AEb1wF3jAgk.

OWL ontologies are used in Dacura to enable semantic reasoning in quality control and data harvesting; that is, if there are conflicts between triples, Dacura identifies and marks them as conflicts for further evaluation (see dacura.scss.tcd.ie/ontologies/dacura-130317.ttl). Dacura is designed to produce and consume data in line with the linked open data principles. This makes it easy to import information from existing structured information sources and to enrich curated datasets by interlinking them with publicly available Linked Data sources (e.g. DBpedia or Wikidata, the linked data versions of Wikipedia), and datasets curated by Dacura can similarly be easily linked. An example of Dacura’s operation is presented in Figure 1.
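One simple kind of conflict between triples arises when a property that should have a single value for a subject is given two different values. The sketch below flags such cases for review. It is an assumption-laden illustration, not Dacura's reasoning engine: the function name, the idea of passing in a set of single-valued predicates, and the placeholder triples are all invented for this example.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Triple = Tuple[str, str, Hashable]


def find_conflicts(
    triples: Iterable[Triple], single_valued: Set[str]
) -> Dict[Tuple[str, str], Set[Hashable]]:
    """Group objects by (subject, predicate) and keep disagreeing groups.

    Only predicates listed in single_valued are checked, since other
    predicates may legitimately take several values.
    """
    values: Dict[Tuple[str, str], Set[Hashable]] = {}
    for subject, predicate, obj in triples:
        if predicate in single_valued:
            values.setdefault((subject, predicate), set()).add(obj)
    return {key: objs for key, objs in values.items() if len(objs) > 1}


# Placeholder triples, not real data: two sources disagree about SiteA.
harvested = [
    ("SiteA", "hasPopulation", 400),
    ("SiteA", "hasPopulation", 900),
    ("SiteA", "locatedIn", "RegionA"),
    ("SiteB", "hasPopulation", 250),
]

print(find_conflicts(harvested, {"hasPopulation"}))
```

The function only marks the disagreement; consistent with the text, resolving it is left to further evaluation.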

The Dacura workflow breaks the process of dataset creation and curation down into four stages, as illustrated in Figure 2. The first stage is data harvesting: identifying sources of high quality information with which to populate the dataset. Dacura supports a number of approaches to data harvesting: from identifying relevant data in known public data sources, to deploying agents to search the Internet, to manual specification of information sources by curators. The goal of the system is to automate, as much as possible, the identification of the sources of information that will be needed in order to populate the dataset. In this stage, the goal is not to find documents about the entities in which one is interested, but to find specific sources of information which can populate the properties and relationships that a researcher has defined in their dataset specification.

Figure 2. The four stages of the Dacura data curation process.


The second stage in the Dacura dataset creation and curation process is knowledge extraction. This involves extracting the precise information from harvested sources into the structure required by the researcher’s dataset specification. Although Natural Language Processing and other artificial intelligence technologies continue to improve all the time, they remain error prone and thus, in order to produce high quality data, some human input is normally required to filter out false positives. Dacura employs tools to support both human users and automated agents in screening, filtering, improving, annotating and interlinking candidate records to produce knowledge reports; that is, authoritative accounts of the relevant knowledge contained in a source, enriched through links into the web of data.

The third stage in the Dacura process is perhaps the most important for ensuring data quality: expert analysis. Dacura focuses strongly on dataset quality, providing both automatic and manual tools to ensure that datasets provide accurate and complete data that conform to the dataset specifications. Initial data evaluation is performed through automated tools, which use semantic consistency checking and validity testing to reconcile various data points into a composite account that represents the tools’ best estimate of authoritative data which accurately represents reality. These composite accounts are reviewed by experts in the data domain (like our hypothetical researcher interested in Big Island Hawaiian population), allowing the expert to correct misinterpretations and identify disagreements between the expert and the automated tools. Experts can create their own personal interpretations (for example, by specifying that only particular sources should be trusted) and overlay this on the dataset to produce a custom dataset, representing their view on what the data should be. The experts currently volunteering as data evaluators for the Seshat databank are listed at http://seshatdatabank.info/seshat-about-us/contributor-database/. The number (77 at the time of this writing) and range of expertise of these volunteers illustrate that it is quite feasible to incorporate expert evaluation into a data harvesting system like Dacura.
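The idea of a composite account with an optional expert overlay can be sketched as follows. The text does not say how Dacura combines data points, so the use of the median here is purely an assumption for illustration, as are the function name, the source labels, and the placeholder values.

```python
from statistics import median
from typing import Iterable, Optional, Set, Tuple


def composite_estimate(
    data_points: Iterable[Tuple[str, float]],
    trusted_sources: Optional[Set[str]] = None,
) -> Optional[float]:
    """Combine (source, value) pairs into a single estimate.

    With no overlay, every source contributes. When an expert supplies
    trusted_sources, only values from those sources are considered.
    The median is an assumed stand-in for whatever reconciliation rule
    a real system applies.
    """
    values = [
        value
        for source, value in data_points
        if trusted_sources is None or source in trusted_sources
    ]
    return median(values) if values else None


# Placeholder values, not real data.
points = [("SourceA", 100), ("SourceB", 120), ("SourceC", 300)]

print(composite_estimate(points))
print(composite_estimate(points, trusted_sources={"SourceA", "SourceB"}))
```

Keeping the overlay as a filter over unchanged data points means the expert's custom dataset and the automated composite can coexist, so disagreements between them remain visible.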

Finally, Dacura supports a variety of output tools to make datasets available to third parties in a range of formats. Dacura publishes its curated datasets as Linked Data and provides an endpoint for SPARQL, a query language for RDF graphs, which supports sophisticated filtering and retrieval of data. This allows intelligent applications to interact with the datasets in unforeseen ways. These datasets are produced in accordance with the principles of Linked Data, which allows them to interact with the wider semantic web. For human users, Dacura can produce graphs, charts, maps and other visualizations to provide users with easy-to-understand insights into the data in a dataset. Data for graphs or other outputs can be browsed, searched, and selected, providing users with the ability to access the sections of datasets they find most useful. Dacura also allows datasets or subsets of them to be exported in a wide range of formats for external analysis, including geographic information systems and statistical packages such as SPSS and R.
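The filtering and retrieval that SPARQL provides rests on matching patterns against triples, with variables standing for unknown terms. The toy matcher below illustrates that idea in plain Python for a single pattern. It is not SPARQL and does not query a Dacura endpoint; the function name and the `?name` variable convention are assumptions borrowed loosely from SPARQL's notation.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

triples: List[Triple] = [
    ("Hawaii", "is", "Polynesia"),
    ("Hawaii", "has", "Island"),
    ("Polynesia", "has", "Island"),
]


def match(triples: List[Triple], pattern: Triple) -> List[Dict[str, str]]:
    """Match one triple pattern; terms starting with '?' are variables.

    Returns one dictionary of variable bindings per matching triple.
    """
    results = []
    for triple in triples:
        binding: Dict[str, str] = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind consistently.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# "Which subjects have an Island?"
print(match(triples, ("?x", "has", "Island")))
```

A full query language combines many such patterns and adds filters, ordering, and aggregation, but each step reduces to this kind of binding of variables against stored triples.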

Implementing Dacura: The Seshat Meta-model

As an example of how Dacura works in practice, Figure 3 shows the meta-model being used to implement Seshat: Global Historical Databank (Turchin et al. 2015, 2016; see also dacura.scss.tcd.ie/ontologies/seshat-130317.ttl). Seshat (seshatdatabank.info) is intended to bring together into one place a comprehensive body of knowledge about human history and prehistory for the purpose of empirically testing hypotheses about cultural evolution (e.g. “How and under what circumstances does prosocial behavior evolve in large societies?” “What roles do religion and ritual activities play in group cohesion and cultural development?” “What is the impact of climatic and environmental factors in societal advance?”). Testing these hypotheses with appropriate statistical techniques requires data that are both valid and reliable; that is, data that define what the researcher thinks they define (“apples” rather than “oranges”) and that are measured in the same manner across cases. Cases must also be chosen carefully to ensure that units of analysis are equivalent across cases and domains.

There are two fundamental pieces of information upon which Seshat cases are based: a Location and a Duration. A Location is a point or polygon anywhere on the earth’s surface, and defines an entity called a Territory. Three entity classes of Territory have been defined in Seshat (more may be defined later as Seshat expands):

1. Natural-Geographic Areas (NGA), each of which is a contiguous area of roughly 100 by 100 kilometers encompassing a reasonably homogenous ecological region.

2. Biomes, which encompass a contiguous biotic region or region of similar climatic conditions.

3. World Regions, which may be pre-defined entities such as nations or states, or can be defined by other specific criteria.

A Duration can be a single date or a date range. Adding a Duration to a Territory entity class defines one of two temporally bounded entities: (1) a Human Population, which is a group of humans in a defined territory during a specified period of time; and (2) an Event, defined as an occurrence taking place in a specific territory in a specific period of time.
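The relationship between Location, Duration, Territory, Human Population, and Event can be sketched as a handful of Python data classes. This is an illustration of the definitions above, not the Seshat ontology itself: the field names, the use of integer years, and the coordinate representation are assumptions, and the coordinates given for Hawaii are approximate.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A location is one (latitude, longitude) pair for a point, or several
# pairs for a polygon (representation assumed for this sketch).
Location = Tuple[Tuple[float, float], ...]


@dataclass(frozen=True)
class Duration:
    """A single date (end is None) or a date range, in years CE."""
    start: int
    end: Optional[int] = None

    def __post_init__(self) -> None:
        if self.end is not None and self.end < self.start:
            raise ValueError("end must not precede start")


@dataclass(frozen=True)
class Territory:
    """A location plus its entity class: 'NGA', 'Biome', or 'WorldRegion'."""
    name: str
    entity_class: str
    location: Location


@dataclass(frozen=True)
class HumanPopulation:
    """A group of humans in a territory during a period of time."""
    territory: Territory
    duration: Duration


@dataclass(frozen=True)
class Event:
    """An occurrence in a territory during a period of time."""
    label: str
    territory: Territory
    duration: Duration


# Approximate point location for the Big Island; the date range follows
# the 1200 to 1700 CE span used in the paper's population example.
big_island = Territory("Big Island Hawaii", "NGA", ((19.6, -155.5),))
population = HumanPopulation(big_island, Duration(1200, 1700))

print(population.duration.end - population.duration.start)
```

Because both Human Population and Event are built from the same Territory and Duration pieces, cases of either kind can be compared on a common spatial and temporal footing.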


Figure 3. Metamodel for Seshat: The Global History Databank. [Diagram labels: Location (point or polygon); Duration (start-end); Territory; NGA; World Region; Biome; Human Population; Tradition; Cultural Group; Polity; Settlement; Identity Group; Language Group; Event; Inter-group Conflict; Socio-natural Disaster; Natural Disaster; Societal Collapse; Transition Ritual; Social Movement; Technology.]

Seshat provides the ability to create entity classes within Human Populations and Events for specific research questions. Within the Human Population entity, current entity classes are:

1. Tradition, which is defined as a human population “sharing similar subsistence practices, technology, and forms of socio-political organization that are spatially contiguous over a relatively large area and which endure temporally for a relatively long period of time” (Peregrine and Ember, 2001:ix). For this entity class there is a formal sampling universe for selecting cases, the Outline of Archaeological Traditions (hereafter OAT) (hraf.yale.edu/online-databases/ehraf-archaeology/outline-of-archaeological-traditions-oat/) and a formal thesaurus for coding data, the Outline of Cultural Materials (hereafter OCM) (hraf.yale.edu/online-databases/ehraf-world-cultures/outline-of-cultural-materials/).

2. Cultural Group, which is a human population sharing norms, beliefs, behaviors, values, attitudes, etc. The primary sampling universe for this entity class is the Outline of World Cultures (hereafter OWC) (Murdock, 1983) and the thesaurus is the OCM.

3. Polity, which is a human population that is a politically independent unit with a shared system of governance. This is an example of an entity class created for a specific research project. The sample consists of 30 cases selected for characteristics of sociopolitical organization and geographic location (Turchin et al., 2015). The primary thesaurus for this entity class is the OCM.

4. Settlement, which is a human population occupying a physical location and material facilities ranging in size and complexity from a temporary camp to a great metropolis. Because of the great range of settlements that could be coded, there is no defined sampling universe for the entity class. The primary thesaurus is again the OCM.

5. Identity Group, which is a human population with a shared sense of being part of the same group. Like Polity, this entity class was created for a specific set of research projects and the sample is opportunistic (see Whitehouse, François, and Turchin, 2015). There is no formal thesaurus, though the OCM is used for some domains.

6. Linguistic Group, which is a human population with a common language. The sampling universe for this entity class is Ethnologue (www.ethnologue.com), but there is no formal thesaurus (again, the OCM is being used for some domains).

In addition, subclasses can be added to entity classes to provide for more specific sets of data. Figure 4 shows entity subclasses that have been created for the current entity classes listed above.

The Event entity obviously encompasses an almost infinite range of possible entity classes and subclasses. To maintain some order, the event class in DBpedia is used (mappings.dbpedia.org/server/ontology/classes/) as a basic ontology. As shown in Figure 3, the current entity classes for the Event entity include:

1. Inter-group Conflict, such as a war, a battle, a feud, or the like.

2. Socio-Natural Disaster, such as a famine or epidemic.

3. Natural Disaster, such as a drought, a flood, an infestation, a volcanic eruption, etc.

4. Societal Collapse

5. Transition Ritual, such as a marriage, a coronation, or an initiation.

6. Social Movement, including physical movements like migration, but also social movements such as revitalization, millenarianism, strikes, etc.

7. Technological, such as inventions, discoveries, innovations, and the like.


Figure 4. Detail of the Human Population entity, showing current entity classes and subclasses. [Diagram labels: Human Population; Polity; Quasi-Polity; Sub-Polity; Religious Group; Cultural Group; Tradition; Sub-tradition; Settlement; City; Neighborhood; Local Group; Identity Group; Coalition; Ethnic Group; Interest Group; Language Group; Nation; Territory; Location (point or polygon); Duration (start-end).]

Populating Seshat: The Dacura Workflow Model

Figure 5 illustrates how archaeological data for the Tradition entity class are incorporated into the Seshat databank through Dacura. The area in the blue rectangle can be entirely automated, while the area outside the blue rectangle requires analysts and experts to ensure the Seshat data are valid and reliable. Starting at the top of the blue rectangle, a Human Population entity is defined by a Duration within a Territory. The characteristics of the Human Population entity are then classified through the OAT thesaurus to define a Tradition entity class. Data mining begins by searching the Internet for Cultural Domain information as classified through the OCM thesaurus. At this point a researcher can also search for Cultural Domain information through both Internet and print sources. Information on a specific Cultural Domain, identified in Figure 5 as Archaeological Data, is compared with values in DBpedia to determine if linked values should be included from other sources, and then output from Dacura and evaluated by an analyst for consistency. The output is next sent to the researcher or an expert on the Cultural Group or Cultural Domain for evaluation. The researcher or expert either decides upon a canonical value for the Cultural Domain or, if there are conflicts that cannot be resolved, a non-canonical value is given. In either case, the value is included in Seshat, and marked as either canonical or non-canonical. Canonical values are also exported to DBpedia to assist other researchers and future searches.
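The final steps of this workflow (an expert settles on a canonical value where possible, every value is stored with its status, and only canonical values are exported) can be sketched as below. The function names, the record layout, and the placeholder values are assumptions for illustration; the sketch does not reproduce Dacura's actual data model or its DBpedia export.

```python
from typing import Dict, List, Optional


def resolve(candidates: List[float], expert_choice: Optional[float] = None) -> Dict:
    """Mark a value as canonical or non-canonical.

    If all candidate values agree, or the expert has chosen one, the
    result is canonical. Otherwise the unresolved candidates are kept
    together and marked non-canonical.
    """
    if not candidates:
        raise ValueError("at least one candidate value is required")
    distinct = sorted(set(candidates))
    if expert_choice is not None:
        return {"value": expert_choice, "status": "canonical"}
    if len(distinct) == 1:
        return {"value": distinct[0], "status": "canonical"}
    return {"value": distinct, "status": "non-canonical"}


def exportable(records: List[Dict]) -> List[Dict]:
    """Keep only canonical records, mirroring the export step in the text."""
    return [r for r in records if r["status"] == "canonical"]


# Placeholder candidate values, not real data.
records = [
    resolve([500, 500]),                      # sources agree
    resolve([400, 900], expert_choice=400),   # expert resolves the conflict
    resolve([400, 900]),                      # conflict left unresolved
]

print(records)
print(len(exportable(records)))
```

Storing unresolved values rather than discarding them matches the statement that the value is included in either case, with the status flag recording how much confidence it carries.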

Figure 5. Workflow for the incorporation of numerical data for the Archaeological Tradition entity class into Seshat through Dacura.

Using Seshat: Outputs from Dacura

Our researcher interested in Big Island Hawaii population estimates would be able to quickly identify accurate and fully referenced estimates through Seshat. She would open the Seshat page (http://dacura.scss.tcd.ie/seshat/), select the Natural-Geographic Area for Hawaii, select the Polity sub-class of the Human Population inhabiting Hawaii for the time period of interest, and then select the population variable (Figure 6). The data on population she would obtain in this case would be taken from the Seshat data repository created with Dacura through the data harvesting and verification process described above. But our researcher could also create a new ontology using Dacura to conduct her own unique search, as discussed above and illustrated in Figure 1.

Figure 6. An overview of data harvesting using Seshat. Source: https://youtu.be/ERCcPN97850.

Our researcher would have a wide range of possible outputs from her search. As noted earlier, Dacura publishes datasets as Linked Data and employs SPARQL for output. SPARQL is a query language for RDF graphs which can produce documents and raw datasets but also graphs, charts, maps and other visualizations. Important for archaeologists, SPARQL works with GeoSPARQL to allow data integration into geographic information systems using well-understood OGC query standards (GML, WKT, etc.). Raw textual or numeric data produced through Dacura can be browsed, searched, and selected, allowing our researcher to access the sections of texts or datasets she finds most useful. Dacura also allows texts or datasets (or subsets of them) to be exported in a wide range of formats for external analysis. For example, our researcher might want numerical data on population estimates as output for statistical analysis. Dacura would produce a comma-delimited file that could be ported directly into a spreadsheet or statistical package and our researcher could then run any analysis she required to answer her question. Figure 7 shows a simple line graph of Big Island Hawaii population estimates derived through Dacura and Seshat with data output to an Excel spreadsheet. This is not a particularly impressive result in itself, but consider that our researcher would have been able to compile these data in a matter of minutes, be confident of their quality, and have access to all the metadata attached to them.
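The comma-delimited output described above can be sketched with Python's standard `csv` module. The column names and every value in the rows are placeholders invented for this example; they are not estimates from Seshat and should not be read as data.

```python
import csv
import io
from typing import Dict, List


def to_csv(rows: List[Dict[str, object]], fieldnames: List[str]) -> str:
    """Serialize rows to comma-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical column names and placeholder values only.
fields = ["year_ce", "population_estimate", "source"]
rows = [
    {"year_ce": 1200, "population_estimate": 0, "source": "placeholder"},
    {"year_ce": 1700, "population_estimate": 0, "source": "placeholder"},
]

print(to_csv(rows, fields))
```

Keeping a source column alongside each value is one way to carry the attached metadata into a spreadsheet or statistical package, so that each estimate remains traceable after export.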

Figure 7. Population dynamics on Big Island Hawaii from 1200 to 1700 CE.

Conclusions

The Internet provides scholars abundant information, but often the information is too abundant, and usually lacks quality control. Dacura was designed to address these problems. It provides a way to harvest information from the Internet easily, with an assurance of quality, and with a manageable body of results. Dacura’s carefully designed ontology (dacura.scss.tcd.ie/ontologies/dacura-130317.ttl) allows researchers to immediately identify and retrieve information directly relevant to their research. Dacura’s integrated thesauri and RDF triplestore structure removes the need for detailed indexing across result domains so that all information on a given subject, even information that might not be obviously related or indexed as related, is retrieved. And Dacura offers a wide range of possible outputs, from texts to visualizations to spreadsheets. Dacura is not the only data harvesting and curation package available, but because it has been developed hand-in-hand with the Seshat databank, it provides a unique model for new computer-based methods of archaeological data handling.

In this way, Dacura represents an important new tool for archaeologists. As Kintigh et al. (2015:3) have recently pointed out, “archaeologists are increasingly challenged as they acquire, manage, and analyze large volumes of disparate data.” Dacura provides one answer to this problem. Dacura allows archaeologists to harvest data from established resources like tDAR and HRAF as well as input their own data and to identify data sources on the Internet that might not be otherwise discovered. Dacura allows the enormous volume of data available to archaeologists to be quickly and easily reduced to the most important information on a given question, and then output in a variety of useful formats. Dacura also provides a flexible data management tool that, perhaps most importantly, provides access for other researchers on the Internet to data that have already been harvested and evaluated (such as those in Seshat).

In addition, Dacura represents a way to provide useful and accurate archaeological data to scholars who are not archaeologists. It has long been a frustration among archaeologists that our data, which can provide both a diachronic record of cultural stability and change and empirical examples of practices that have been successful or unsuccessful in human societies, have not been widely used outside of archaeology. But it is also not surprising, as archaeological data can be hard for non-archaeologists to access and understand (Kintigh et al. 2015:2). By providing a semi-automated means of harvesting, evaluating, and exporting archaeological data that have been evaluated for accuracy, Dacura provides both a means and a model for economists, political scientists, ecologists, geographers, and others to access and explore the rich and valuable record of the human past.

It is important to note, however, that there is also a danger in databanks like Seshat and programs like Dacura that populate them. Because data are readily available through established databanks, researchers may choose to use the extant information rather than to create a new dataset structure and perform a new search. This tends to codify existing data, when in fact those data can be erroneous or superseded by new information. One of the benefits of tools like Dacura is that because they can make specifying dataset structure and performing searches both highly specific and relatively simple, researchers will choose to create new dataset specifications rather than using existing ones that might be related to, but not specific to, their research questions. Researchers will thus be encouraged to develop new information for databanks like Seshat. This will prevent extant data from being codified and provide an ever-expanding realm of usable information on the human past for both archaeologists and non-archaeologists to employ.

Acknowledgements

The authors wish to thank the participants of a workshop held at the Santa Fe Institute May 4-6, 2015, during which the needs for harvesting and integrating quality information were discussed and the Seshat meta-model developed. This work was supported by a John Templeton Foundation grant to the Evolution Institute, entitled "Axial-Age Religions and the Z-Curve of Human Egalitarianism," a Tricoastal Foundation grant to the Evolution Institute, entitled "The Deep Roots of the Modern World: The Cultural Evolution of Economic Growth and Political Stability," an ESRC Large Grant to the University of Oxford, entitled "Ritual, Community, and Conflict" (REF RES-060-25-0085), a grant from the European Union Horizon 2020 research and innovation program (grant agreement No 644055 [ALIGNED, www.aligned-project.eu]), and a European Research Council Advanced Grant to the University of Oxford, entitled "Ritual Modes: Divergent modes of ritual, social cohesion, prosociality, and conflict." We gratefully acknowledge the contributions of our team of research assistants, post-doctoral researchers, consultants, and experts. Additionally, we have received invaluable assistance from our collaborators. Please see the Seshat website (www.seshatdatabank.info) for a comprehensive list of private donors, partners, experts, and consultants and their respective areas of expertise. Finally, we want to thank the anonymous reviewers whose insightful comments allowed us to substantially improve this paper.

Data Availability Statement

The Seshat data bank can be accessed at http://dacura.scss.tcd.ie/seshat/index.html. Information on Dacura can be obtained at http://dacura.cs.tcd.ie/. Both the Dacura and Seshat ontologies are available at http://dacura.scss.tcd.ie/seshat/downloads.html.

References

Murdock, George Peter. 1983. Outline of World Cultures, 6th edition. Human Relations Area Files, New Haven, CT.

Kansa, Eric C. 2010. Open Context in Context: Cyberinfrastructure and Distributed Approaches to Publish and Preserve Archaeological Data. SAA Archaeological Record 10(5):12-16.

Kintigh, Keith W., Jeffrey Altschul, Ann Kinzig, W. Fredrick Limp, William K. Michener, Jeremy Sabloff, Edward Hackett, Timothy Kohler, Bertram Ludäscher, and Clifford Lynch. 2015. Cultural Dynamics, Deep Time, and Data. Advances in Archaeological Practice 3(1):1-15.

François, Pieter, Joseph Manning, Harvey Whitehouse, Rob Brennan, Thomas Currie, Kevin Feeney, and Peter Turchin. 2016. A Macroscope for Global History: Seshat Global History Databank. Digital Humanities Quarterly 10(4).

Peregrine, Peter N. and Melvin Ember (editors). 2001. Encyclopedia of Prehistory, 9 vols. Kluwer Academic/Plenum Publishers, New York.

Turchin, Peter, Rob Brennan, Thomas Currie, Kevin Feeney, Pieter François, Daniel Hoyer, Joseph Manning, Arkadiusz Marciniak, Daniel Mullins, Alessio Palmisano, Peter Peregrine, Edward A. L. Turner, and Harvey Whitehouse. 2015. Seshat: The Global History Databank. Cliodynamics 6(1):77-107.

Watts, Joshua. 2011. Building tDAR: Review, Reduction, and Ingest of Two Reports Series. Reports in Digital Archaeology 1:1-15.

Whitehouse, Harvey, Pieter François, and Peter Turchin. 2015. The Role of Ritual in the Evolution of Social Complexity: Five Predictions and a Drum Roll. Cliodynamics 6(2):199-216.
