Boost Your Data Analytics with Open Data and Public News Content

Ontotext Webinar, 24 Mar 2016

Open Data & News Analytics #2

Presentation Outline – PART I

• Quick news-analytics case
• Technology approach
• FactForge-News: Data architecture
• Sample queries on Linked Open Data
• News analytics examples
• Today’s News Map

Mar 2016


Quick news-analytics case


• Our Dynamic Semantic Publishing platform already offers linking of text with big open data graphs

• One can navigate from text to concepts and get trends, related entities and news

• Try it at http://now.ontotext.com


Our approach to Big Data

1. Integrate relevant data from many sources
   − Build a Big Knowledge Graph from proprietary databases and taxonomies integrated with millions of facts of Linked Data

2. Infer new facts and unveil relationships
   − Performing reasoning across data from different sources

3. Interlink text with big data
   − Using text-mining to automatically discover references to concepts and entities

4. Use a NoSQL graph database for metadata management, querying and search


NoSQL Graph Database

[Figure: example RDF graph. Labels recoverable from the diagram: myData:Maria, myData:Ivan, ptop:Agent, ptop:Person, ptop:Woman, ptop:childOf, ptop:parentOf, owl:relativeOf, rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf, owl:SymmetricProperty; some edges are marked "inferred".]

• The hottest NoSQL trend
• W3C standards
• Efficient Data Integration
  − Using logical inference
  − For data integration and BI
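To make "logical inference" concrete, here is a minimal Python sketch of forward chaining over a handful of triples. It is not how GraphDB implements reasoning. The single explicit fact, the direction of ptop:childOf, and the use of ptop:relativeOf as a symmetric super-property are assumptions made for illustration, loosely based on the labels in the diagram.

```python
# Toy forward chaining over (subject, predicate, object) triples.
# All rule tables and the explicit fact below are illustrative assumptions.

INVERSE_OF = {"ptop:childOf": "ptop:parentOf", "ptop:parentOf": "ptop:childOf"}
SYMMETRIC = {"ptop:relativeOf"}
SUB_PROPERTY_OF = {
    "ptop:childOf": "ptop:relativeOf",
    "ptop:parentOf": "ptop:relativeOf",
}


def infer(explicit):
    """Return the closure of `explicit` under the three rule tables above."""
    facts = set(explicit)
    while True:
        new = set()
        for s, p, o in facts:
            if p in INVERSE_OF:          # owl:inverseOf
                new.add((o, INVERSE_OF[p], s))
            if p in SYMMETRIC:           # owl:SymmetricProperty
                new.add((o, p, s))
            if p in SUB_PROPERTY_OF:     # rdfs:subPropertyOf
                new.add((s, SUB_PROPERTY_OF[p], o))
        if new <= facts:                 # fixed point: nothing new was derived
            return facts
        facts |= new


explicit = {("myData:Maria", "ptop:childOf", "myData:Ivan")}
inferred = infer(explicit) - explicit
for triple in sorted(inferred):
    print(triple)
```

One explicit statement yields several inferred ones, which is the effect that separates explicit from inferred statement counts in a repository.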


Analyzing Text


• Full spectrum of NLP weaponry

• Semantic indexing
  − Tag references with entity IDs
  − Generate semantic metadata descriptions of documents
  − Store metadata in GraphDB
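The semantic indexing steps can be sketched in a few lines of Python. This is a deliberately naive dictionary lookup, not the NLP pipeline described in the webinar; the gazetteer entries, the `tag` and `document_metadata` helpers, and the metadata layout are hypothetical, and storing the result in GraphDB is left out.

```python
import re

# Hypothetical gazetteer: surface form -> entity URI (illustrative only).
GAZETTEER = {
    "IBM": "http://dbpedia.org/resource/IBM",
    "International Business Machines": "http://dbpedia.org/resource/IBM",
    "London": "http://dbpedia.org/resource/London",
}


def tag(text):
    """Return mention annotations sorted by offset; longer labels win overlaps."""
    annotations = []
    taken = set()  # character offsets already covered by an annotation
    for label in sorted(GAZETTEER, key=len, reverse=True):
        for match in re.finditer(r"\b" + re.escape(label) + r"\b", text):
            span = range(match.start(), match.end())
            if taken.isdisjoint(span):
                taken.update(span)
                annotations.append({
                    "start": match.start(),
                    "end": match.end(),
                    "text": label,
                    "entity": GAZETTEER[label],
                })
    return sorted(annotations, key=lambda a: a["start"])


def document_metadata(doc_id, text):
    """Build a metadata description: all mentions plus the distinct entity IDs."""
    mentions = tag(text)
    return {
        "document": doc_id,
        "mentions": mentions,
        "entities": sorted({m["entity"] for m in mentions}),
    }


meta = document_metadata(
    "doc-1",
    "IBM opened a lab in London. International Business Machines confirmed it.",
)
print(meta["entities"])
```

Two different surface forms resolve to the same entity ID, which is what lets later queries count mentions per entity rather than per string.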


The Web of Linked Data in 2007

[Figure: diagram of Linked Data datasets in 2007. Callouts on the diagram: structured database version of Wikipedia; database of all locations on Earth; product reviews; semantic synonym dictionary.]

Note: Each bubble represents a dataset. Arrows represent mappings across datasets; e.g. dbpedia:Paris owl:sameAs geo:2988507


The Web of Linked Data is Gaining Mass


The Web of Data is Gaining Mass (2011)


The Web of Linked Data is Gaining Mass


• 2013 stats: 2 289 public datasets
  − http://stats.lod2.eu/
• Growing exponentially − see the dotted trend line
• Structured markup
  − Schema.org; semantic SEO
• Enables better semantic tagging!
  − As there are more concepts and richer descriptions to refer to

Linked Data Datasets (chart values by year)

Year       2007   2008   2009   2010   2011   2012   2013
Datasets     27     43     89    162    295    822  2,289
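The "growing exponentially" claim can be checked with a little arithmetic. The sketch below assumes that the seven chart values map to the years 2007 to 2013 in order; it computes year-over-year ratios, the compound annual growth factor, and a least-squares fit of ln(count) against the year, whose slope gives the average yearly growth factor.

```python
import math

# Assumption: chart values correspond to the years in order.
datasets = {2007: 27, 2008: 43, 2009: 89, 2010: 162,
            2011: 295, 2012: 822, 2013: 2289}
years = sorted(datasets)

# Year-over-year growth ratios.
ratios = {y: datasets[y] / datasets[y - 1] for y in years[1:]}

# Compound annual growth factor over the whole period.
span = years[-1] - years[0]
cagr = (datasets[years[-1]] / datasets[years[0]]) ** (1 / span)

# Least-squares fit of ln(count) = a + b * year; exp(b) is the fitted
# yearly growth factor (the slope of a straight trend line on a log scale).
n = len(years)
mean_x = sum(years) / n
mean_y = sum(math.log(datasets[y]) for y in years) / n
slope = (
    sum((y - mean_x) * (math.log(datasets[y]) - mean_y) for y in years)
    / sum((y - mean_x) ** 2 for y in years)
)
fitted_factor = math.exp(slope)

for y in years[1:]:
    print(y, round(ratios[y], 2))
print("compound annual growth factor:", round(cagr, 2))
print("fitted yearly growth factor:", round(fitted_factor, 2))
```

Both estimates come out at roughly a doubling of the number of datasets per year over this period.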


The FactForge Data

• DBpedia (the English version only): 496M statements
• Geonames: 150M statements
  − SameAs links between DBpedia and Geonames: 471K statements
• NOW data – metadata about news: 128M statements
• Total size: 938M statements
  − 656M explicit statements + 281M inferred statements
  − RDFRank and geo-spatial indices enabled to allow for ranking and efficient geo-region constraints


News Metadata

• Metadata from Ontotext’s Dynamic Semantic Publishing platform
  − Automatically generated as part of the NOW.ontotext.com semantic news showcase
• News corpus from Google since Feb 2015, about 10k news/month
• ~70 tags (annotations) per news article
• Tags link text mentions of concepts to the knowledge graph
  − Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases


Category                  Count
International             52 074
Science and Technology    23 201
Sports                    20 714
Business                  15 155
Lifestyle                 11 684
Total                     122 828

Mentions / entity type    Count
Keyphrase                 2 589 676
Organization              1 276 441
Location                  1 260 972
Person                    1 248 784
Work                      309 093
Event                     258 388
RelationPersonRole        236 638
Species                   180 946


News Geographic Coverage


• Quite focused on USA!


Class Hierarchy Map (by number of instances)

Left: The big picture
Right: dbo:Agent class (2.7M organizations and persons)


Sample queries

• There is a rich set of sample queries that allow exploration of this combination of DBpedia, GeoNames and news metadata

• We will showcase a few of those, starting from the simple ones

• In bold we marked the “parameters” of the queries


Query: Big Cities in Eastern Europe

# benefits from inference over transitive gn:parentFeature
# benefits from owl:sameAs mapping between DBpedia and Geonames

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX onto: <http://www.ontotext.com/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX dbo: <http://dbpedia.org/ontology/>
select *
from onto:disable-sameAs
where {
    ?loc gn:parentFeature dbr:Eastern_Europe ;
         gn:featureClass gn:P .
    ?loc dbo:populationTotal ?population ;
         dbo:country ?country .
    FILTER(?population > 300000)
} order by ?country
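Why transitivity matters for this query can be shown with a toy hierarchy in Python. A city is usually linked directly to its country or administrative unit, not to the region, so the region match only succeeds once the parent links are followed transitively. All identifiers prefixed with ex: and all population figures below are made-up placeholders, and the single-parent assumption is a simplification.

```python
# Direct parent links only (hypothetical data; one parent per feature).
PARENT_FEATURE = {
    "ex:CityA": "ex:CountryX",
    "ex:CityB": "ex:CountryX",
    "ex:CityC": "ex:CountryY",
    "ex:CountryX": "dbr:Eastern_Europe",
    "ex:CountryY": "ex:OtherRegion",
}

# Placeholder population figures, not real data.
POPULATION = {"ex:CityA": 1_200_000, "ex:CityB": 90_000, "ex:CityC": 2_000_000}


def ancestors(node):
    """Follow parent links upwards; this is the transitive closure for one node."""
    found = []
    current = PARENT_FEATURE.get(node)
    while current is not None and current not in found:
        found.append(current)
        current = PARENT_FEATURE.get(current)
    return found


def big_cities_in(region, min_population):
    """Cities above the population threshold whose ancestors include `region`."""
    return sorted(
        city
        for city, population in POPULATION.items()
        if population > min_population and region in ancestors(city)
    )


print(big_cities_in("dbr:Eastern_Europe", 300_000))
```

Without following the chain, no city would match, because none of them names the region as its direct parent.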


Query: People and Organizations related to Google

# benefits from inference over transitive dbo:parent
# RDFRank makes it easy to see the “top suspects” in a list of 93 entities

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
PREFIX dbr: <http://dbpedia.org/resource/>
select distinct ?related_entity ?rank
where {
    BIND (dbr:Google as ?entity)
    {
        ?related_entity a dbo:Person ;
                        ?p ?entity .
    } UNION {
        ?related_entity a dbo:Organisation ;
                        dbo:parent ?entity .
    }
    ?related_entity rank:hasRDFRank ?rank
} order by desc(?rank)


Query: Airports near London

# GraphDB’s geo-spatial plug-in allows efficient evaluation of the nearby constraint
# RDFRank brings the top 6 passenger airports to the top of a list of 80

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX gdb-geo: <http://www.ontotext.com/owlim/geo#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX gdb: <http://www.ontotext.com/owlim/>

SELECT distinct ?airport ?rrank
WHERE {
    {
        SELECT * {
            dbr:London geo-pos:lat ?lat ;
                       geo-pos:long ?long .
        } LIMIT 10
    }
    ?airport gdb-geo:nearby(?lat ?long "50mi") ;
             a dbo:Airport ;
             gdb:hasRDFRank ?rrank .
} ORDER BY DESC(?rrank)
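What a "within 50 miles" constraint computes can be sketched with the haversine great-circle formula. This brute-force Python version only illustrates the semantics; the geo-spatial plug-in answers such constraints from an index rather than by scanning every point. The coordinates are approximate values supplied for illustration, and `haversine_miles` and `nearby` are hypothetical helper names.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearby(center, places, radius_miles):
    """Names of all places within `radius_miles` of `center`, sorted by name."""
    lat, lon = center
    return sorted(
        name
        for name, (plat, plon) in places.items()
        if haversine_miles(lat, lon, plat, plon) <= radius_miles
    )


# Approximate coordinates, for illustration only.
LONDON = (51.5074, -0.1278)
AIRPORTS = {
    "Heathrow": (51.4700, -0.4543),
    "Manchester": (53.3537, -2.2750),
}

within_50mi = nearby(LONDON, AIRPORTS, 50)
print(within_50mi)
```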


Query: Top-level Industries by number of companies

# benefits from mapping and consolidation of industry classifications
# and predicates in DBpedia (ff-map)

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
select distinct ?topIndustry (count(?company) as ?companies)
where {
    ?company dbo:industry ?industry .
    ?industrySum ff-map:industryVariant ?industry .
    ?industrySum ff-map:industryCenter ?topIndustry .
} group by ?topIndustry order by desc(?companies)


Semantic Press-Clipping

• We can trace references to a specific company in the news
  − This is fairly standard; however, we can also handle syntactic variations in names, because state-of-the-art Named Entity Recognition technology is used
  − More importantly, we correctly determine for each mention whether “Paris” refers to Paris (the capital of France), Paris in Texas, Paris Hilton, or Paris (the Greek hero)

• We can trace and consolidate references to daughter companies (subsidiaries)

• We have a comprehensive industry classification
  − The one from DBpedia, but refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank will also be considered classified as dbr:FinancialServices)
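The specialization rule in the last bullet can be sketched as a walk up a broader-industry table. Only the dbr:Bank to dbr:FinancialServices link comes from the slide; the second link, the company identifiers, and the helper names are hypothetical.

```python
# Narrower industry -> broader industry.
# Only the Bank -> FinancialServices link is taken from the slide;
# the other entry is a made-up example.
BROADER = {
    "dbr:Bank": "dbr:FinancialServices",
    "ex:RetailBank": "dbr:Bank",
}

# Hypothetical companies with their directly asserted industry.
COMPANY_INDUSTRY = {
    "ex:CompanyA": "ex:RetailBank",
    "ex:CompanyB": "dbr:Bank",
    "ex:CompanyC": "ex:Software",
}


def with_broader(industry):
    """The industry itself plus every broader industry above it."""
    chain = [industry]
    while chain[-1] in BROADER and BROADER[chain[-1]] not in chain:
        chain.append(BROADER[chain[-1]])
    return chain


def companies_in(industry):
    """Companies classified under `industry`, directly or via specialization."""
    return sorted(
        company
        for company, asserted in COMPANY_INDUSTRY.items()
        if industry in with_broader(asserted)
    )


print(companies_in("dbr:FinancialServices"))
```

A query for the broad industry then also returns companies that were only tagged with a narrower one.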


Query: News Mentioning IBM

# technical example to demonstrate how news metadata can be accessed

PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?news ?title ?date ?pub_entity
where {
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch dbr:IBM .
    ?news pub-old:creationDate ?date ;
          pub-old:title ?title .
    FILTER ( (?date > "2015-10-01T00:02:00Z"^^xsd:dateTime) &&
             (?date < "2015-11-01T00:02:00Z"^^xsd:dateTime) )
} limit 100


Query: News Mentioning Gazprom and Its Related Entities

# benefits from inference over transitive dbo:parent relation and mappings to it
# (prefixes as declared in the earlier queries)

PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

select distinct ?news ?title ?date ?related_entity
where {
    {
        select distinct ?related_entity {
            BIND (dbr:Gazprom as ?entity)
            {
                ?related_entity a dbo:Person ;
                                ?p ?entity .
                FILTER NOT EXISTS { ?related_entity dbo:club ?entity }
            } UNION {
                ?related_entity a dbo:Organisation ;
                                dbo:parent ?entity .
            } UNION {
                BIND(?entity as ?related_entity)
            }
        }
    }
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?related_entity .
    ?news pub-old:creationDate ?date ;
          pub-old:title ?title .
} order by desc(?date) limit 1000


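Query: Most Popular Automotive Companies in the News

# benefits from mapping and consolidation of industry classifications
# (prefixes as declared in the earlier queries)

PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?pub_entity (max(?entity_label) as ?label) (count(?news) as ?news_count)
where {
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?entity ;
                pub:preferredLabel ?entity_label .
    dbr:Automotive ff-map:industryVariant ?industry .
    ?entity dbo:industry ?industry .
    ?news pub-old:creationDate ?date .
} group by ?pub_entity order by desc(?news_count)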


Query: Most Popular in the News, including children

# benefits from mapping and consolidation of industry classifications
# (prefixes as declared in the earlier queries)

PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?parent (count(?news) as ?news_count)
where {
    {
        select distinct ?parent ?entity {
            BIND(dbr:Software as ?industry)
            ?industry ff-map:industryVariant ?industryVar .
            ?parent dbo:industry ?industryVar .
            ?parent a dbo:Company .
            FILTER NOT EXISTS { ?parent dbo:parent / dbo:industry / ff-map:industryVariant ?industry }
            {
                ?entity dbo:parent ?parent .
            } UNION {
                BIND(?parent as ?entity)
            }
        }
    }
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?entity .
    ?news pub-old:creationDate ?date .
} group by ?parent order by desc(?news_count)
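The idea of crediting a parent company with the news about its children can be sketched in Python as a roll-up over (article, entity) pairs. All identifiers and mention pairs are made up. Unlike the query above, this sketch counts each article at most once per parent, which is a simplification chosen for the example, not a description of how the SPARQL aggregate behaves.

```python
from collections import defaultdict

# Hypothetical ownership links: child -> parent.
PARENT = {"ex:SubA": "ex:GroupX", "ex:SubB": "ex:GroupX"}


def top_parent(entity):
    """Climb the ownership chain to the topmost known parent."""
    seen = {entity}
    while entity in PARENT and PARENT[entity] not in seen:
        entity = PARENT[entity]
        seen.add(entity)
    return entity


def rollup(mentions):
    """Count distinct articles per top-level parent, most mentioned first."""
    news_by_parent = defaultdict(set)
    for news_id, entity in mentions:
        news_by_parent[top_parent(entity)].add(news_id)
    return sorted(
        ((parent, len(news)) for parent, news in news_by_parent.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Made-up (article, mentioned entity) pairs.
mentions = [
    ("n1", "ex:GroupX"),
    ("n2", "ex:SubA"),
    ("n2", "ex:SubB"),
    ("n3", "ex:SubB"),
    ("n4", "ex:Other"),
]

ranking = rollup(mentions)
print(ranking)
```

Here ex:GroupX is credited with three articles although only one of them names it directly, which is the kind of shift visible between the two rankings in the popularity tables.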


News Popularity Ranking: Automotive

Rank  Company                     News #    Rank  Company incl. mentions of controlled  News #
1     General Motors              2722      1     General Motors                        4620
2     Tesla Motors                2346      2     Volkswagen Group                      3999
3     Volkswagen                  2299      3     Fiat Chrysler Automobiles             2658
4     Ford Motor Company          1934      4     Tesla Motors                          2370
5     Toyota                      1325      5     Ford Motor Company                    2125
6     Chevrolet                   1264      6     Toyota                                1656
7     Chrysler                    1054      7     Renault-Nissan Alliance               1332
8     Fiat Chrysler Automobiles   1011      8     Honda                                 864
9     Audi AG                     972       9     BMW                                   715
10    Honda                       717       10    Takata Corporation                    547


News Popularity: Finance

Rank  Company           News #    Rank  Company incl. mentions of controlled  News #
1     Bloomberg L.P.    3203      1     China Merchants Bank                  40940
2     Goldman Sachs     1992      2     Alphabet Inc.                         24219
3     JP Morgan Chase   1712      3     Capital Group Companies               4379
4     Wells Fargo       1688      4     Bloomberg L.P.                        3893
5     Citigroup         1557      5     Exor (company)                        2775
6     HSBC Holdings     1546      6     JP Morgan Chase                       2715
7     Deutsche Bank     1414      7     Nasdaq, Inc.                          2178
8     Bank of America   1335      8     Oaktree Capital Management            1757
9     Barclays          1260      9     Goldman Sachs                         1085
10    UBS               694       10    Sentinel Capital Partners             1064

Note: Including investment funds, stock exchanges, agencies, etc.


News Popularity: Banking

Rank  Company           News #    Rank  Company incl. mentions of controlled  News #
1     Goldman Sachs     996       1     China Merchants Bank *                38288
2     JP Morgan Chase   856       2     JP Morgan Chase                       1972
3     HSBC Holdings     773       3     Goldman Sachs                         1030
4     Deutsche Bank     707       4     HSBC                                  966
5     Barclays          630       5     Bank of America                       771
6     Citigroup         519       6     Deutsche Bank                         742
7     Bank of America   445       7     Barclays                              681
8     Wells Fargo       422       8     Citigroup                             630
9     UBS               347       9     Wells Fargo                           428
10    Chase             126       10    UBS                                   347

Note: including investment funds, stock exchanges, agencies, etc.


Today’s News Map: Business


Today’s News Map: International


Expect in Part II

• Mentions of an entity and its related entities by month

• Most relevant co-occurring entities

• Most relevant co-occurring entities per month

• Related News

• and more


Thank you!

Experience the technology with NOW: Semantic News Portal
http://now.ontotext.com

Start using GraphDB and text-mining with S4 in the cloud
http://s4.ontotext.com

Learn more at our website or simply get in touch [email protected], @ontotext
