World Sense-Making using Linked Data. Tope Omitola (joint work with Prof. Nigel Shadbolt). Faculty Research Seminar Talk, Birmingham City University, UK, Thurs 8 Dec. 2011

World Sense-Making using Linked Data. Faculty Seminar Talk at Birmingham City University, Birmingham, United Kingdom

1. World Sense-Making using Linked Data
  • Tope Omitola (joint work with Prof. Nigel Shadbolt)
  • Faculty Research Seminar Talk, Birmingham City University, UK
  • Thurs 8 Dec. 2011

2. Thank You
  • Thank you for inviting me.

3. World Sense-Making using Linked Data
  • Tope Omitola

4. Talk Outline
  • EnAKTing: Its story
  • From the Web to Semantic Web to Linked Data
  • Public Sector Datasets: Publication and Consumption
  • Findability of Appropriate Data Sources: Service Descriptions
  • Provenance and Trust in Linked Data

5. What is EnAKTing?
  • EPSRC-funded project.
  • Addressing 3 key research problems: (1) how to build ontologies quickly that are capable of exploiting the potential of large-scale user participation, (2) how we query an unbounded web of linked data, (3) how to visualise, explore, browse and navigate this mass of data.
  • Project Leaders: Prof. Sir Tim Berners-Lee, Prof. Dame Wendy Hall, and Prof. Nigel Shadbolt.

6. From the Web to Semantic Web to Linked Data
  • The Web of Data
  • Problems with the Web of Documents
  • RDF
  • Linked Data

7. The Web of Data (a.k.a. Semantic Web/Linked Data)
  • Traditional Web of Documents: Internet, Documents, Links
  • Documents in HTML
  • Links using URLs
  • HTTP for document access and transfer

8. Data Silos on the Current Web
  • [Figure: separate data silos exposed as API, HTML, HTML, XML]

9. Some more problems with Web of Documents
  • Difficult to integrate data
  • Example use case: making a travel plan
  • Data integration by looking and typing: slow, unproductive workflow
  • Difficult for apps to make sense of HTML text

10. Solutions
  • Use RDF to give some structure to the data
  • RDF: subject predicate object
  • RDF links things, not just documents, and the links are typed

11. RDF is a language (for data)
      Words              | URIs and literal text
      Nouns and Verbs    | Classes and Properties
      Sentence structure | RDF Statements (triples)
      Paragraphs         | RDF Graphs
      Footnotes          | URIs [Domain Name Service]
      Dictionaries       | RDF Schemas
  • Generic grammar for languages of description
  • Functions as native language, second language, or pidgin.

12. RDF and Ontology
  • The AAA Slogan: Anyone can say Anything about Any topic.
  • s p o . (subject predicate object .), e.g. a triple whose object is the literal "Tony Benn" .
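The subject predicate object pattern above can be sketched in a few lines of plain Python. This is only an illustration, not the tooling used in the project: the `URI`, `Literal`, and `Triple` types, the helper names, and the `http://example.org/...` identifier are all hypothetical, while the FOAF and OWL namespaces are the published ones. It serialises triples in an N-Triples-like form and shows that a typed link to another resource is just one more triple.

```python
from typing import NamedTuple, Union


class URI(str):
    """A resource identifier; written as <...> when serialised."""


class Literal(str):
    """A plain text value; written as "..." when serialised."""


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


def term_to_nt(term):
    """Serialise one term: URIs in angle brackets, literals in quotes."""
    if isinstance(term, URI):
        return f"<{term}>"
    escaped = term.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def triple_to_nt(triple):
    """Serialise a triple as 'subject predicate object .'"""
    return " ".join(term_to_nt(t) for t in triple) + " ."


def objects_of(graph, subject, predicate):
    """Follow a typed link: all objects for a given subject and predicate."""
    return [t.obj for t in graph if t.subject == subject and t.predicate == predicate]


FOAF_NAME = URI("http://xmlns.com/foaf/0.1/name")
OWL_SAME_AS = URI("http://www.w3.org/2002/07/owl#sameAs")

# Hypothetical local identifier, used only for this sketch.
person = URI("http://example.org/person/tony-benn")

graph = [
    Triple(person, FOAF_NAME, Literal("Tony Benn")),
    # A typed link from our resource to a resource in another dataset.
    Triple(person, OWL_SAME_AS, URI("http://dbpedia.org/resource/Tony_Benn")),
]

for t in graph:
    print(triple_to_nt(t))
```

Because every statement has the same three-part shape, graphs from different publishers can be merged by simply concatenating their triples.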
  • RDF is used to build ontologies: a formal representation of shared knowledge as a set of concepts within a domain and the relationships between them
  • Examples: Finance ontology; MusicBrainz, music ontology; GO, gene ontology, etc.

13. What is Linked Data?
  • Data, data, everywhere: we are surrounded by data: school performance, car fuel efficiency, etc.
  • Data help us to make better decisions
  • You can discern the shape and structure of an entity by looking at the data it generates
  • Data shapes conversations and markets

14. What is Linked Data?
  • Linked Data: framework where data is a first-class citizen on the Web
  • Evolving the current Web into a Global Data Space
  • TimBL: 4 principles of Linked Data: (1) use URIs as names for things, (2) use HTTP URIs, (3) when someone looks up a URI, provide useful information, using the standards (RDF, etc.), (4) include links to other URIs, so that they can discover more things

15. The Web of Linked Data
  • Link everything. No silos.
  • [Figure: Things linked to other Things]

16. The Web of Linked Data
  • Linked Data (Semantic Web) is a graph database.

17. Linked Data
  • Advantage comes from linking the RDF(s) together.

18. Some Linked Datastores
  • BBC
  • NY Times
  • Guardian
  • DBpedia
  • Geonames

19. Linking (Linked) Open Data cloud
  • linkeddata.org
  • Many of the datastores are being linked together to form a network/graph.

20. Linked Data
  • In summary, Linked Data provides:
      • RDF
      • A standardized data access mechanism, HTTP
      • Hyperlink-based data discovery, using URIs
      • Self-descriptive data, through using shared vocabularies

21. Government Linked Data
  • Explosion of Government (Linked) Open Data efforts and projects.
  • Examples: data.gov, data.gov.uk, data.gov.au

22. Public Sector Datasets
  • Inherent value in opening up public government data
  • Systems and services can be tailored to citizens' priorities.
  • Likely questions citizens may need answers to are: Where can I find a good school, a good investment advisor, a good employer?

23. Public Sector Datasets (contd.)
  • Integration of datasets enables more complex questions to be asked and answered
  • Some examples: http://www.planningalerts.com/ and http://ishortman.com/projects/expendituremap/
  • Governments freeing up their data.
  • Holy grail is information integration: Meshing.

24. Issues we focus on
  • Findability of appropriate data sources
  • SEARCH: Look at the data sources
  • EXTRACT: Slicing of data sources
  • INTEGRATE: Unifying the views
  • EXPLORE: Answering the questions.

25. Examples of Government Public Data (csv)

26. Examples of Government Linked Data (rdf)

27. Workflow
  • Identify Dataset → Design/Select Vocabularies → Extract and convert data into RDF → Publish as Linked Data → Consume Linked Data (Application)

28. Publishing your data as Linked Data: Some Things to Consider
  • How do you choose a good URI to name things? There are guidelines for this. Examples:
      • http://dbpedia.org/resource/Wildlife_photography
      • Tope Omitola @ Univ of Southampton: http://id.ecs.soton.ac.uk/person/24123
  • Describing a data set using voiD (the Vocabulary of Interlinked Datasets)
  • Choosing and using vocabularies to describe data (SKOS, RDFS, OWL, scovo)
  • Sourcing datasets: where do you get the datasets from (e.g. Semantic Web search engines, manual search, etc.)
  • Choice of join points: when you have different datasets, where do you join them together?
  • Data normalization: using RDF makes things easier.
  • Alignment of datasets

29. Architecture
  • [Figure: architecture diagram. Labels: Data Sources; Gatherers and Extractors; RDF; RDF Triplestore (4store); SPARQL; Data Integration; Infer new concepts and relationships; Services]

30. Data Publication Challenges and Solutions
  • Research Questions:
      • In our case, dealing with data that are centred around the United Kingdom's democratic system,
      • Using geography data from the UK's Ordnance Survey as the join-point with data for criminal statistics, Members of Parliament, mortality rates, etc.
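The idea of a geographic join-point can be sketched as follows. This is a simplified illustration under stated assumptions: plain area names stand in for Ordnance Survey URIs, the function names are hypothetical, and the totals are the 2008/09 recorded-crime figures from Home Office Table 7.03 quoted later in this talk. Grouping the three police force areas by their English region reproduces the regional total given in that table.

```python
# Dataset 1: total recorded crime per police force area (Table 7.03, 2008/09).
crime_totals = {
    "Cleveland": 55094,
    "Durham": 45074,
    "Northumbria": 105234,
}

# Dataset 2: geography, mapping each area to its English region.
# In the project this role was played by Ordnance Survey data;
# here a plain dictionary is assumed for illustration.
region_of = {
    "Cleveland": "North East Region",
    "Durham": "North East Region",
    "Northumbria": "North East Region",
}


def join_on_area(stats, geography):
    """Join two datasets on the shared geographic key (the join-point)."""
    joined = []
    for area, total in stats.items():
        region = geography.get(area)
        if region is None:
            continue  # no join-point for this area, so it cannot be integrated
        joined.append((area, region, total))
    return joined


def total_by_region(rows):
    """Aggregate the joined rows per region."""
    totals = {}
    for _area, region, total in rows:
        totals[region] = totals.get(region, 0) + total
    return totals


rows = join_on_area(crime_totals, region_of)
print(total_by_region(rows))
```

Any further dataset keyed by the same areas (MPs, mortality rates, and so on) could be joined the same way, which is why the choice of join-point matters.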
  • Sourcing the datasets
      • Many government data sets are in pdf, html, or xls files, so automatic discovery methods are not possible (yet),
      • Went through manual discovery process, searching for them,
      • We found some in pdf, html, and in xls,
      • We decided against pdf and html

31. Data Publication Challenges and Solutions (contd.)
  • We went for data in xls format. Why? Ability to source from a wider range of public sector domains.

      Data Source            | Format      | Dataset
      Publicwhip.org.uk      | HTML        | MP votes records, etc.
      Theyworkforyou.com     | XML dump    | Parliament, Parliament expenses
      Homeoffice.gov.uk      | Excel       | Recorded crime (England, 2008/09)
      Statistics.gov.uk      | Excel       | Hospital Waiting List (England 2008/09)
      Performance.doh.gov.uk | Excel       | Mortality rates (England 2008/09)
      Ordnancesurvey.co.uk   | Linked Data | UK's mapping agency

32. Data Publication Challenges and Solutions (contd.)
  • Data normalisation. RDF as our standard model.
  • Data conversion to RDF. Python + Java.
  • Modelling the datasets: multi-dimensional, used SCOVO.

33. Data Publication Challenges and Solutions (contd.)
  • Crime dataset: Table 7.03 Recorded crime by offence group by police force area, English region and Wales, 2008/09 (numbers of recorded crimes)

      Police force area, English region and Wales | Total | Violence against the person | Sexual offences | Robbery | Burglary | Offences against vehicles | Other theft offences | Fraud and forgery | Criminal damage | Drug offences | Other offences
      Cleveland         | 55,094  | 10,662 | 566   | 404   | 6,175  | 5,224  | 13,697 | 905   | 13,746 | 2,636 | 1,079
      Durham            | 45,074  | 7,435  | 476   | 170   | 6,226  | 4,940  | 9,674  | 835   | 13,027 | 1,327 | 964
      Northumbria       | 105,234 | 19,147 | 989   | 732   | 11,418 | 11,620 | 24,042 | 2,909 | 27,178 | 5,166 | 2,033
      North East Region | 205,402 | 37,244 | 2,031 | 1,306 | 23,819 | 21,784 | 47,413 | 4,649 | 53,951 | 9,129 | 4,076

  • Modelled with SCOVO dimensions:

      :TimePeriod rdf:type owl:Class ;
          rdfs:subClassOf scovo:Dimension .
      :TP2008_09 rdf:type :TimePeriod .
      :GeographicalRegion rdfs:subClassOf scovo:Dimension ;
          dc:title "Police force area, English region and Wales" .
      :CriminalOffenceType rdf:type owl:Class ;
          rdfs:subClassOf scovo:Dimension .

34. Some Issues in Linked Data
  • Co-referencing, i.e. different sources referring to the same entities by different names.
  • Cardiff in DBpedia: http://dbpedia.org/resource/Cardiff or http://dbpedia.org/resource/Cardiff_City
  • Cardiff in Geonames: http://sws.geonames.org/2172349/
  • Which Cardiff shall we use?
  • Solution: sameas service from Southampton

35.

36. Alignment of datasets

37. Alignment of Datasets (contd.)
  • Asserted owl:sameAs relations between dataset geo and O.S. (using string matching)
  • For example, the English county of Cumbria was aligned as the following: http://www.w3.org/2002/07/owl#sameAs.
  • A few special cases: "Yorkshire and the Humber Region" vs "Yorkshire & the Humber".
  • NHS Trusts were labelled differently: e.g. South Tyneside NHS Trust had no equivalent in the OS. So used Google Maps.

38. Examples of Government Public Data (csv)

39. Examples of Government Linked Data (rdf)

40. Recap: Data Publication
  • Sourcing: many not in RDF yet. Some in html, pdf, and xls. We chose xls.
  • Selection of RDF as the normal form.
  • Used scovo to model multidimensional data.
  • We used owl:sameAs to assert equivalences between geo regions.
  • We used string matching. Some did not work, e.g. Yorkshire and the Humber. Some have no equivalent OS entities, so we had to go via Google Maps API.

41. Consuming Linked Data
  • How do you visualize linked data sets?
  • Linked Data browsers, e.g. Disco, Tabulator.
  • Linked Data search engines, e.g. Sig.ma, Falcons, Sindice.
  • Domain-specific applications and mashups, e.g. dayta.me (from Southampton), US Global Foreign Aid Mashup.

42. Data Consumption
  • Application acts as an aggregator of information based on user's postal (zip) code.
  • Generates data views based on geographical region of postal code.
  • Shows political representatives (MPs) for constituencies, their voting records, and their expenses.

43. Data Consumption (contd.)

44. Data Consumption (contd.)
  • Challenges:
      • The lack of UIs to quickly browse, search or visualise views on a wide range of differently modelled data,
      • Lack of suitable tools which allow efficient aggregation and presentation of data to the UI from multiple datasets,
      • Data consumers having partial knowledge of the domain and finding it difficult to understand the domain and the data being modelled. This points to the need for a toolset that helps developers give a better description of the domain being modelled.

45. Recap: Publish and Consume
  • Information integration: one of the holy grails
  • Problems with data sources. Different formats, etc. RDF can act as a standard model.
  • Publication to RDF. Challenges. Solutions.
      • scovo for multi-dimensional data
      • string matching and its complexities
  • Consuming the data. Challenges. Solutions.
      • Aggregating data based on zip code
      • Complexities of geo boundaries
  • We have re-published the data we generated into the linked data cloud: EnAKTing datasets, www.enakting.org/enakting/datasets

46. Some of our Outputs
  • http://geoservice.psi.enakting.org: service to discover geographical resources,
  • http://map.psi.enakting.org/: integrate different PSI Linked Data sources by querying Backlinking service,
  • http://backlinks.psi.enakting.org: service to discover back-links in PSI,
  • http://void.rkbexplorer.com/: describes the contents of data sets, enabling discovery and reuse of resources,
  • http://bagatelles.ecs.soton.ac.uk/psi/: platform for integrating several PSI catalogues from the Web
  • http://4sreasoner.ecs.soton.ac.uk/: Scalable Reasoning in 4store; 4sr is a branch of 4store where backward chained reasoning is implemented
  • http://apps.seme4.com/see-uk/: visualization tool for some UK data

47. Our solutions/apps

48. Our solutions/apps

49. Findability of Appropriate Data Sources: Service Descriptions
  • How do you tell the world about your new linked data sets?
  • Provide good service descriptions of your datasets
  • Use Vocabulary of Interlinked Datasets

50. Vocabulary of Interlinked Datasets (VoID)
  • allows description of datasets and their interlinking, e.g. "there are 200k links of type gr: predicates between dataset X and dataset Y; and dataset Y mainly offers data about homes and X about mortgages".
  • A dataset: a set of RDF triples published, maintained or aggregated by a single provider, and accessible on the Web, e.g.
      :DBpedia a void:Dataset .
  • allows the description of RDF links between datasets (using void:Linkset).

51. Three Areas of voiD
  • General Metadata
  • Access Metadata
  • Structural Metadata

52. voiD (contd.)
  • General metadata: the dataset's title, description, date of creation, the creator, publisher, licence, subject(s), etc.

      :DBpedia a void:Dataset ;
          dcterms:title "DBPedia" ;
          dcterms:description "RDF data extracted from Wikipedia" ;
          dcterms:contributor :FU_Berlin ;
          dcterms:modified "2008-11-17"^^xsd:date ;
          dcterms:contributor :OpenLink_Software .

53. Access metadata: describes how the RDF data(set) can be accessed
  • using sparql, e.g.
      :DBpedia a void:Dataset ; void:sparqlEndpoint .
  • using URI lookup,
      :Sindice a void:Dataset ; void:uriLookupEndpoint .
  • using rdf dumps,
      :NYTimes a void:Dataset ; void:dataDump .

54. Structural metadata
  • describes the structure and schema of datasets
  • naming some representative example entities for a dataset
  • stating if a dataset's entities share common URIs
      :DBpedia a void:Dataset ; void:uriSpace "http://dbpedia.org/resource/" .
  • stating the vocabularies used in a dataset
      :LiveJournal a void:Dataset ; void:vocabulary .
  • providing statistics about datasets, e.g. expressing the number of RDF triples or the number of entities of a dataset.
      :DBpedia a void:Dataset ; void:triples 1000000000 ; void:entities 3400000 .

55. Publishing voiD files
  • As void.ttl in the root directory of the site, with a local hash URI for the dataset, e.g. http://example.com/void.ttl#MyDataset.
  • Using the root URI of the site, such as http://example.com/, as the dataset URI, and serving both HTML and an RDF format via content negotiation from that URI.
  • Embedding the VoID description as HTML+RDFa into the homepage of the dataset, with a local hash URI for the dataset, yielding a URI such as http://example.com/#MyDataset.

56. Why is voiD useful: voiD Discovery
  • By enabling the discovery and usage of linked datasets.
  • A sitemap such as http://www.yoursite.com/sitemap.xml references void.ttl, and sitemap.xml is added to robots.txt.
  • A search engine crawls the website, indexing void.ttl plus a cache of the rdf triples referenced in this void file.
  • Through backlinks: void:inDataset.
  • Through a well-known URI: void.ttl can be placed in /.well-known/void on any Web server, e.g. http://www.example.com/.well-known/void .

57.
      @prefix void: .
      @prefix scovo: .
      a void:Dataset ;
          foaf:homepage ;
          rdfs:label "crime.psi.enakting.org Linked Data Repository" ;
          dcterms:date "2010-09-13T11:30:29"^^xsd:date ;
          dcterms:title "crime.psi.enakting.org Linked Data Repository" ;
          foaf:nick "crime" ;
          dcterms:description "United Kingdom's crime statistics per region for the year 2008/09, provided by the United Kingdom Home Office. Dataset provenance: http://www.homeoffice.gov.uk/rds/pdfs09/hosb1109chap7.xls" ;
          dcterms:publisher ;
          void:statItem [ scovo:dimension void:numberOfTriples ; rdf:value 4988 ; rdfs:label "4,988 triples" ; ] ;
          void:subset [ a void:Linkset ;
              rdfs:label "crime.psi.enakting.org CRS -> http://data.ordnancesurvey.co.uk/" ;
              void:subjectsTarget ;
              void:objectsTarget ;
              void:linkPredicate coref:duplicate ;
              void:statItem [ rdfs:label "133 URI equivalences" ; rdf:value 133 ; scovo:dimension void:numberOfTriples ; ] ] .

58. Provenance and Trust in Linked Data
  • Whom do you trust on the Web?

59. Provenance and Trust
  • Mash-ups, aggregation, integration, data re-use.
  • How do you elicit reliability and accuracy?
  • Generate trust by revealing as much information about yourself as possible.
  • Enables consumers to decide the quality and trustworthiness of your data.
  • Useful for Data Discovery/Mining + Query Planning.

60. Different kinds of Provenance
  • When was x derived (when-provenance).
  • How was x derived (how-provenance).
  • What data was used to derive x (what-provenance).
  • Who carried out the transformation(s) from whence x came (who-provenance).

61. Provenance Models for Linked Datasets
  • Provenance Vocabulary Ontology

62. Provenance Models for Linked Datasets (contd)
  • Open Provenance Model

63. Provenance Models for Linked Datasets (contd)
  • Provenance for Datasets (voidp): http://www.enakting.org/provenance/voidp/

64. voiD Provenance Extension: voidp
  • Designed to be simple and lightweight.
  • Mainly for (RDF) data publishers.
  • Includes necessary information of the process, its inputs, and outputs.
  • Basis is simple: an agent runs a process on a data (or dataset) to get another data (or dataset).
  • [Figure: Agent, Process, Data, Data]
  • @prefix voidp: .

65. voidp Classes and Predicates
  • voidp:ProvenanceEvent: items under provenance control.
  • voidp:actor: actor, person, group, software or physical artifact, involved in this provenance event.
  • voidp:certification: used to contain dataset signature elements
  • voidp:contact: contact details of whom to contact should people have queries about this dataset.
  • voidp:item: the provenance characteristics of a data item under provenance control.
  • voidp:processType: the type of transformation or conversion procedure carried out on the item's source
  • voidp:resultingDataset: dataset that is the result of this provenance event.
  • voidp:sourceDataset: source dataset for the data item under provenance control.

66. voidp: A Concrete Example

      @prefix voidp: .
      a void:Dataset
      voidp:activity [ a voidp:Provenance ;
          voidp:item [ foaf:name ;
              rdf:type scovo:Dataset ;
              rdfs:label "RECORDED CRIME STATISTICS 1898 - 2001/02"@en ;
              prv:createdBy [ rdf:type prv:Actor prv:performedBy ; ] ;
              voidp:originatingSource ;
              voidp:hashValue "12335353535"^^xsd:string ;
              voidp:processType ;
              to:hasBeginning "2010-10-24T21:32:52"^^xsd:dateTime ;
              to:hasEnd "2010-10-25T09:32:00"^^xsd:dateTime ;
      ] .

67. voidp in the Wild
  • The Datalift project: http://data.lirmm.fr/ontologies/vdpp
  • data.southampton.ac.uk:
      • http://graphite.ecs.soton.ac.uk/browser/?uri=http%3A%2F%2Fid.southampton.ac.uk%2Fdataset%2Fjargon%2Flatest.rdf
      • http://graphite.ecs.soton.ac.uk/browser/?uri=http%3A%2F%2Fdata.southampton.ac.uk%2Fdumps%2Fjargon%2F2011-11-10%2Fjargon.rdf

68. Conclusion
  • The Web of Data is real
  • The Web of Data is here
  • It's time to get on board

69.
  • http://www.enakting.org/enakting/
  • [email protected]
  • Slides at: http://www.slideshare.net/TopeOmitola/omitola-birmingham-cityuniv
  • Questions?
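As a closing illustration, the string-matching alignment described in slides 37 and 40 might look roughly like the sketch below. It is not the project's code: the `http://example.org/...` URIs, the function names, and the choice to ignore the word "region" during matching are assumptions made for this example. Labels that normalise to the same string yield an owl:sameAs link, and labels with no counterpart (such as an NHS Trust) are returned separately for manual handling.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Words dropped before comparing labels (an assumption for this sketch).
IGNORED_WORDS = {"region"}


def normalise(label):
    """Lower-case, spell out '&', strip punctuation and ignored words."""
    text = label.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(w for w in text.split() if w not in IGNORED_WORDS)


def align(source, target):
    """Return (owl:sameAs links, labels from source with no match in target)."""
    index = {normalise(label): uri for label, uri in target.items()}
    links, unmatched = [], []
    for label, uri in source.items():
        match = index.get(normalise(label))
        if match is None:
            unmatched.append(label)
        else:
            links.append((uri, OWL_SAME_AS, match))
    return links, unmatched


# Hypothetical URIs standing in for the geo dataset and the Ordnance Survey.
geo = {
    "Cumbria": "http://example.org/geo/cumbria",
    "Yorkshire & the Humber": "http://example.org/geo/yorkshire-humber",
    "South Tyneside NHS Trust": "http://example.org/geo/south-tyneside-nhs",
}
os_regions = {
    "Cumbria": "http://example.org/os/cumbria",
    "Yorkshire and the Humber Region": "http://example.org/os/yorkshire-and-the-humber",
}

links, unmatched = align(geo, os_regions)
for subject, predicate, obj in links:
    print(f"<{subject}> <{predicate}> <{obj}> .")
print("No equivalent found for:", unmatched)
```

Exact matching after normalisation is deliberately conservative; the unmatched list corresponds to the cases that, in the talk, had to be resolved another way.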