Matteo Brunati @dagoneye 22/07/2015 Data Curation @SpazioDati 33° Nexa Lunch Seminar http://nexa.polito.it/lunch-33

Data Curation @ SpazioDati - NEXA Lunch Seminar


Page 1: Data Curation @ SpazioDati - NEXA Lunch Seminar

Matteo Brunati @dagoneye

22/07/2015

Data Curation @SpazioDati

33° Nexa Lunch Seminar
http://nexa.polito.it/lunch-33

Page 2: Data Curation @ SpazioDati - NEXA Lunch Seminar

Big Data - Linked Data - Machine Learning

spaziodati.eu

Page 3: Data Curation @ SpazioDati - NEXA Lunch Seminar

a lot of European projects
http://www.spaziodati.eu/en/#research

Page 4: Data Curation @ SpazioDati - NEXA Lunch Seminar

Data Curation?
https://www.google.com/search?q=data+curation&ie=utf-8&oe=utf-8

Page 5: Data Curation @ SpazioDati - NEXA Lunch Seminar


Data curation is the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.

http://strataconf.com/stratany2014/public/schedule/detail/36021

Page 6: Data Curation @ SpazioDati - NEXA Lunch Seminar

a lot of things involved


* ETL (Extract-Transform-Load) tools
* Data Science tools
* Linked Data tools
* Big Data tools
* Domain Knowledge

Page 7: Data Curation @ SpazioDati - NEXA Lunch Seminar

why do we need a data curation process?


Page 8: Data Curation @ SpazioDati - NEXA Lunch Seminar

it’s our mantra: ALL YOU NEED IS DATA

Page 9: Data Curation @ SpazioDati - NEXA Lunch Seminar

:) accessible

for everyone

lat 00° 00’ 00” -> GPS -> Smartphones -> UI iPhone / Android

it’s our mantra: ALL YOU NEED IS DATA

Page 10: Data Curation @ SpazioDati - NEXA Lunch Seminar

we are building two products

Page 11: Data Curation @ SpazioDati - NEXA Lunch Seminar

Dandelion API: Text Analytics as a service

www.dandelion.eu

Page 12: Data Curation @ SpazioDati - NEXA Lunch Seminar

Sales Intelligence

www.atoka.io

Page 13: Data Curation @ SpazioDati - NEXA Lunch Seminar

<Powered by SpazioDati> codename

2014

2015

data platform

Page 14: Data Curation @ SpazioDati - NEXA Lunch Seminar

Why a knowledge graph?

Page 15: Data Curation @ SpazioDati - NEXA Lunch Seminar

Our Entity Extraction API is based on a graph

[Figure: entity graph connecting Brussels, Paris, Berlin, Eiffel Tower, 2009 World Championships in Athletics, King Baudouin Stadium and Champ de Mars; edges carry relatedness weights ranging from 0.42 to 0.80]

https://dandelion.eu/docs/api/datatxt/nex/v1/
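To illustrate why a graph helps entity extraction, here is a minimal sketch of graph-based disambiguation: each mention has candidate entities, and the combination with the highest total pairwise relatedness wins. It is not Dandelion's actual algorithm. The weights, the extra candidate "Paris (mythology)" and all function names are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical, symmetric relatedness weights between entities.
# The values are illustrative and are not taken from the real graph.
RELATEDNESS = {
    frozenset({"Paris", "Eiffel Tower"}): 0.80,
    frozenset({"Paris", "Champ de Mars"}): 0.63,
    frozenset({"Eiffel Tower", "Champ de Mars"}): 0.59,
    frozenset({"Paris (mythology)", "Eiffel Tower"}): 0.05,
    frozenset({"Paris (mythology)", "Champ de Mars"}): 0.02,
}


def relatedness(a, b):
    """Return the edge weight between two entities, 0.0 if not connected."""
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def disambiguate(candidates):
    """Pick one entity per mention so that total pairwise relatedness is maximal.

    candidates maps each mention to its list of candidate entities.
    Exhaustive search: fine for a toy example, too slow for long texts.
    """
    mentions = list(candidates)
    best, best_score = None, -1.0
    for combo in product(*(candidates[m] for m in mentions)):
        score = sum(
            relatedness(a, b)
            for i, a in enumerate(combo)
            for b in combo[i + 1:]
        )
        if score > best_score:
            best, best_score = combo, score
    return dict(zip(mentions, best)), best_score


if __name__ == "__main__":
    mentions = {
        "Paris": ["Paris", "Paris (mythology)"],
        "Eiffel Tower": ["Eiffel Tower"],
        "Champ de Mars": ["Champ de Mars"],
    }
    chosen, score = disambiguate(mentions)
    print(chosen, round(score, 2))
```

The ambiguous mention "Paris" resolves to the city because the other entities in the same text are strongly connected to it in the graph.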

Page 16: Data Curation @ SpazioDati - NEXA Lunch Seminar

CONTEXTUAL DATA

Page 17: Data Curation @ SpazioDati - NEXA Lunch Seminar

why a knowledge graph

* different sources; different semantics
* companies, people, Wikipedia topics, POI…
* simple to query on traversals
* global statistics
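The points about traversals and global statistics can be sketched with a tiny in-memory graph. This is only an illustration in plain Python, not the platform's data model: the node identifiers, edge labels and helper names below are all hypothetical.

```python
from collections import Counter, deque

# Hypothetical nodes (id -> type) mixing entities from different sources.
NODES = {
    "acme_srl": "Company",
    "mario_rossi": "Person",
    "trento": "AdministrativeDivision",
    "acme.example": "Website",
}

# Hypothetical directed, labelled edges: (source, label, target).
EDGES = [
    ("mario_rossi", "worksFor", "acme_srl"),
    ("acme_srl", "locatedIn", "trento"),
    ("acme_srl", "hasWebsite", "acme.example"),
]


def neighbours(node):
    """Return (label, target) pairs for the outgoing edges of a node."""
    return [(label, dst) for src, label, dst in EDGES if src == node]


def traverse(start, max_hops):
    """Breadth-first traversal: map each reachable node to its hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, dst in neighbours(node):
            if dst not in seen:
                seen[dst] = seen[node] + 1
                queue.append(dst)
    return seen


def type_counts():
    """A simple global statistic: number of nodes per entity type."""
    return Counter(NODES.values())


if __name__ == "__main__":
    print(traverse("mario_rossi", 2))
    print(type_counts())
```

Starting from a person, two hops are enough to reach the company's location and website, regardless of which source each node came from.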

Page 18: Data Curation @ SpazioDati - NEXA Lunch Seminar

let’s start with some details on the “Powered by SpazioDati” data platform…

Page 19: Data Curation @ SpazioDati - NEXA Lunch Seminar

http://blog.spaziodati.eu/en/2014/10/21/spaziodati-at-iswc-2014-visit-our-booth-research-plans-available/

“Powered by SpazioDati” data platform backstage

PWR-BY-SD

Page 20: Data Curation @ SpazioDati - NEXA Lunch Seminar

Tools involved

* OpenRefine
* Azkaban: Open Source Workflow Manager (https://azkaban.github.io/)
* Silk: The Linked Data Integration Framework
* Titan graph db
* Apache Cassandra

Page 21: Data Curation @ SpazioDati - NEXA Lunch Seminar

http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/

starting from OpenRefine to clean up the data easily, for example

* reconcile and clean up the data
* align the data model to our internal ontologies, using RDF skeletons
* export the RDF modelled using our rules
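The three steps above can be sketched outside OpenRefine in plain Python: clean the values, map each row onto a fixed triple pattern, and export N-Triples. This is not how OpenRefine or its RDF export works internally. The namespaces, property names and sample row are hypothetical, and the slug-based identifier is only a stand-in for real reconciliation.

```python
import re

# Hypothetical namespaces; the real internal ontology is not shown in the slides.
BASE = "http://example.org/resource/"
ONTO = "http://example.org/ontology/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def clean(value):
    """Clean-up step: collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()


def slug(value):
    """Build a URL-safe identifier (a stand-in for real reconciliation)."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(row):
    """Align one row to a fixed triple pattern and return N-Triples lines."""
    name = clean(row["name"])
    city = clean(row["city"])
    subject = f"<{BASE}company/{slug(name)}>"
    return [
        f"{subject} <{RDF_TYPE}> <{ONTO}Company> .",
        f'{subject} <{ONTO}name> "{escape_literal(name)}" .',
        f"{subject} <{ONTO}locatedIn> <{BASE}place/{slug(city)}> .",
    ]


if __name__ == "__main__":
    raw = {"name": "  ACME   S.r.l. ", "city": "Trento "}
    print("\n".join(to_ntriples(raw)))
```

The fixed pattern inside `to_ntriples` plays the role of the RDF skeleton: every cleaned row is poured into the same shape, so the output stays consistent with the target ontology.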

Page 22: Data Curation @ SpazioDati - NEXA Lunch Seminar

in other words…

Rexster: JSON-based REST interface to Titan

Page 23: Data Curation @ SpazioDati - NEXA Lunch Seminar

Our internal ontology: a sample

Page 24: Data Curation @ SpazioDati - NEXA Lunch Seminar

and now

www.atoka.io

Page 25: Data Curation @ SpazioDati - NEXA Lunch Seminar

★ ~5.9 million companies

★ >10 million persons

900k

updated weekly

★ Weekly web crawl of the Italian corporate

Page 26: Data Curation @ SpazioDati - NEXA Lunch Seminar

★ Real-time data collection from company social accounts

★ ~1600 online & offline newspapers (updated daily)

updated weekly

Page 27: Data Curation @ SpazioDati - NEXA Lunch Seminar


www.atoka.io

Page 28: Data Curation @ SpazioDati - NEXA Lunch Seminar

Search: how it works

1. Direct search of one particular company through its name or “partita iva” (VAT number)

2. Content search into company websites [*]

3. Keyword search among extracted and refined entities from company resources [*]

[*] Dandelion API is the extraction engine!

Page 29: Data Curation @ SpazioDati - NEXA Lunch Seminar

Corporate page

atoka.io

Page 30: Data Curation @ SpazioDati - NEXA Lunch Seminar

Some details on

• Five main “types”:
– Company
– Person
– Site
– Administrative Division
– Website

Page 31: Data Curation @ SpazioDati - NEXA Lunch Seminar

our infrastructure to crawl the Web for ATOKA

Page 32: Data Curation @ SpazioDati - NEXA Lunch Seminar

other details

Cerved
• Company
• People
• Site
• Position+Share

ISTAT
• AdminDiv

ES

DBpedia
• Company

cluster computing

Page 33: Data Curation @ SpazioDati - NEXA Lunch Seminar

something really interesting on OpenRefine

Page 34: Data Curation @ SpazioDati - NEXA Lunch Seminar

OpenRefine as usual

Page 35: Data Curation @ SpazioDati - NEXA Lunch Seminar

OpenRefine on Spark

Page 36: Data Curation @ SpazioDati - NEXA Lunch Seminar

it rocks! :)

more background details on http://blog.spaziodati.eu/wp-content/uploads/2015/07/RefineOnSpark.pdf

Page 37: Data Curation @ SpazioDati - NEXA Lunch Seminar

Thanks :)

@spaziodati


Page 38: Data Curation @ SpazioDati - NEXA Lunch Seminar

References

1) From raw data to dataGEMs for developers - http://ceur-ws.org/Vol-1268/paper1.pdf
2) Knowledge Graph ovunque - http://www.slideshare.net/dagoneye/knowledge-graphs-ovunque-un-quadro-di-insieme-e-le-implicazioni-per-uno-sviluppo-condiviso-del-web-of-data
3) Linking Enterprise Data - https://www.springer.com/it/book/9781441976642
4) Using OpenRefine - https://www.packtpub.com/big-data-and-business-intelligence/using-openrefine
5) Why Your Business Needs A Customer Data Knowledge Graph - http://www.dataversity.net/business-needs-customer-data-knowledge-graph/
6) Enabling parallel processing for OpenRefine: Spark vs Akka - http://refinepro.com/blog/enabling-parallel-processing-for-openrefine-spark-vs-akka/