Boost your data analytics with open data and public news content

Ontotext Webinar, 24 Mar 2016

Open Data & News Analytics #2

Presentation Outline – PART I

• Quick news-analytics case
• Technology approach
• FactForge-News: Data architecture
• Sample queries on Linked Open Data
• News analytics examples
• Today’s News Map

Mar 2016

Quick news-analytics case

• Our Dynamic Semantic Publishing platform already offers linking of text with big open data graphs

• One can navigate from text to concepts and get trends, related entities and news

• Try it at http://now.ontotext.com

Our approach to Big Data

1. Integrate relevant data from many sources
− Build a Big Knowledge Graph from proprietary databases and taxonomies integrated with millions of facts of Linked Data

2. Infer new facts and unveil relationships
− Performing reasoning across data from different sources

3. Interlink text with big data
− Using text-mining to automatically discover references to concepts and entities

4. Use a NoSQL graph database for metadata management, querying and search

NoSQL Graph Database

[Figure: example RDF graph. Individuals myData:Maria and myData:Ivan; classes ptop:Agent, ptop:Person and ptop:Woman; properties ptop:childOf, ptop:parentOf and owl:relativeOf. Edges are labeled rdf:type, rdfs:range, rdfs:subPropertyOf, owl:inverseOf and owl:SymmetricProperty, and some edges are marked as inferred.]

• The hottest NoSQL trend
• W3C standards
• Efficient Data Integration
− Using logical inference
− For data integration and BI
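To make "logical inference" concrete, here is a minimal Python sketch of forward chaining over triples. It is not how GraphDB implements reasoning; it only applies three simple rules (owl:inverseOf, owl:SymmetricProperty, rdfs:subPropertyOf) to a toy graph. The triples, and the use of ptop:relativeOf as the symmetric property, are assumptions made for illustration.

```python
# Toy forward-chaining reasoner over (subject, predicate, object) triples.
# The data and the small rule set are illustrative assumptions only.

def infer(triples):
    """Return the closure of `triples` under three simple OWL/RDFS rules."""
    facts = set(triples)
    while True:
        new = set()
        inverse = {(s, o) for s, p, o in facts if p == "owl:inverseOf"}
        inverse |= {(o, s) for s, o in inverse}  # inverseOf works both ways
        symmetric = {s for s, p, o in facts
                     if p == "rdf:type" and o == "owl:SymmetricProperty"}
        sub_prop = {(s, o) for s, p, o in facts if p == "rdfs:subPropertyOf"}
        for s, p, o in facts:
            for p1, p2 in inverse:
                if p == p1:
                    new.add((o, p2, s))      # inverse property rule
            if p in symmetric:
                new.add((o, p, s))           # symmetric property rule
            for child, parent in sub_prop:
                if p == child:
                    new.add((s, parent, o))  # sub-property rule
        if new <= facts:                     # nothing new: fixpoint reached
            return facts
        facts |= new

explicit = {
    ("ptop:childOf", "owl:inverseOf", "ptop:parentOf"),
    ("ptop:childOf", "rdfs:subPropertyOf", "ptop:relativeOf"),
    ("ptop:relativeOf", "rdf:type", "owl:SymmetricProperty"),
    ("myData:Maria", "ptop:childOf", "myData:Ivan"),
}
inferred = infer(explicit) - explicit
for triple in sorted(inferred):
    print(triple)
```

From the single explicit fact that Maria is a child of Ivan, the sketch derives that Ivan is a parent of Maria and that the two are relatives of each other.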

Analyzing Text

• Full spectrum of NLP weaponry

• Semantic indexing
− Tag references with entity IDs

− Generate semantic metadata descriptions of documents

− Store metadata in GraphDB
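The semantic indexing steps can be illustrated with a deliberately simple Python sketch. It uses a plain dictionary lookup rather than real Named Entity Recognition or disambiguation, so it is only a stand-in for the actual pipeline. The gazetteer, the document identifier ex:news/1 and the predicate ex:mentions are hypothetical names chosen for the example.

```python
import re

# Hypothetical gazetteer: surface form -> entity URI (illustrative only).
GAZETTEER = {
    "IBM": "http://dbpedia.org/resource/IBM",
    "Google": "http://dbpedia.org/resource/Google",
    "London": "http://dbpedia.org/resource/London",
}

def tag_mentions(text):
    """Return (start, end, surface, uri) for each gazetteer hit, in text order."""
    pattern = r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b"
    return [(m.start(), m.end(), m.group(0), GAZETTEER[m.group(0)])
            for m in re.finditer(pattern, text)]

def to_metadata(doc_uri, mentions):
    """Describe the document as (subject, predicate, object) triples."""
    uris = sorted({uri for _, _, _, uri in mentions})
    # ex:mentions is a made-up predicate, not the platform's real schema.
    return [(doc_uri, "ex:mentions", uri) for uri in uris]

text = "IBM opened a new lab in London, close to a Google office."
mentions = tag_mentions(text)
metadata = to_metadata("ex:news/1", mentions)
for triple in metadata:
    print(triple)
```

The resulting triples are the kind of document-level metadata that would then be stored in the graph database and joined with the open data at query time.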

The Web of Linked Data in 2007

[Figure: Linked Data cloud diagram with callouts: structured database version of Wikipedia; database of all locations on Earth; product reviews; semantic synonym dictionary]

Note: Each bubble represents a dataset. Arrows represent mappings across datasets; e.g. dbpedia:Paris owl:sameAs geo:2988507

The Web of Linked Data is Gaining Mass

[Figures: later snapshots of the Web of Linked Data, including 2011]

• 2013 stats: 2 289 public datasets
− http://stats.lod2.eu/
• Growing exponentially
− see the dotted trend line
• Structured markup
− Schema.org; semantic SEO
• Enables better semantic tagging!
− As there are more concepts and richer descriptions to refer to

Linked Data Datasets (chart data)

Year      2007  2008  2009  2010  2011  2012  2013
Datasets    27    43    89   162   295   822  2,289

The FactForge Data

• DBpedia (the English version only): 496M statements
• Geonames: 150M statements
− SameAs links between DBpedia and Geonames: 471K statements
• NOW data – metadata about news: 128M statements
• Total size: 938M statements
− 656M explicit statements + 281M inferred statements
− RDFRank and geo-spatial indices enabled to allow for ranking and efficient geo-region constraints

News Metadata

• Metadata from Ontotext’s Dynamic Semantic Publishing platform
− Automatically generated as part of the NOW.ontotext.com semantic news showcase
• News corpus from Google since Feb 2015, about 10k news/month
• ~70 tags (annotations) per news article
• Tags link text mentions of concepts to the knowledge graph
− Technically these are URIs for entities (people, organizations, locations, etc.) and key phrases

News Metadata

Category                 Count
International            52 074
Science and Technology   23 201
Sports                   20 714
Business                 15 155
Lifestyle                11 684
Total                    122 828

Mentions / entity type   Count
Keyphrase                2 589 676
Organization             1 276 441
Location                 1 260 972
Person                   1 248 784
Work                     309 093
Event                    258 388
RelationPersonRole       236 638
Species                  180 946

News Geographic Coverage

• Quite focused on USA!

Class Hierarchy Map (by number of instances)

Left: The big picture
Right: dbo:Agent class (2.7M organizations and persons)

Sample queries

• There is a rich set of sample queries that allow exploration of this combination of DBpedia, GeoNames and news metadata

• We will showcase a few of those, starting from the simple ones

• The “parameters” of the queries are marked in bold

Query: Big Cities in Eastern Europe
# benefits from inference over transitive gn:parentFeature
# benefits from owl:sameAs mapping between DBpedia and Geonames

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX onto: <http://www.ontotext.com/>
PREFIX gn: <http://www.geonames.org/ontology#>
PREFIX dbo: <http://dbpedia.org/ontology/>
select *
from onto:disable-sameAs
where {
    ?loc gn:parentFeature dbr:Eastern_Europe ;
         gn:featureClass gn:P .
    ?loc dbo:populationTotal ?population ;
         dbo:country ?country .
    FILTER(?population > 300000)
} order by ?country
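The query above can match cities directly against a region because the parent relation is treated as transitive. The Python sketch below shows the idea on made-up data: a city is linked only to its country, the country to a region, and the transitive closure lets a region-level filter find the city. All identifiers and population figures are hypothetical, and each feature is assumed to have a single parent for simplicity.

```python
# Direct parent links (child -> parent); made-up identifiers, one parent each.
PARENT = {
    "ex:CityA": "ex:CountryX",
    "ex:CityB": "ex:CountryX",
    "ex:CityC": "ex:CountryY",
    "ex:CountryX": "ex:Eastern_Europe",
    "ex:CountryY": "ex:Western_Europe",
}
# Hypothetical population figures, for illustration only.
POPULATION = {"ex:CityA": 1_200_000, "ex:CityB": 50_000, "ex:CityC": 900_000}

def ancestors(feature):
    """All features reachable by following parent links (transitive closure)."""
    seen = []
    while feature in PARENT and PARENT[feature] not in seen:
        feature = PARENT[feature]
        seen.append(feature)
    return seen

def big_cities(region, min_population):
    """Cities anywhere below `region` with more than `min_population` people."""
    return sorted(city for city, pop in POPULATION.items()
                  if pop > min_population and region in ancestors(city))

result = big_cities("ex:Eastern_Europe", 300_000)
print(result)
```

In the triple store this closure is materialized by the reasoner, so the query does not need to spell out the intermediate country level.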

Query: People and Organizations related to Google
# benefits from inference over transitive dbo:parent
# RDFRank makes it easy to see the “top suspects” in a list of 93 entities

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
PREFIX dbr: <http://dbpedia.org/resource/>
select distinct ?related_entity ?rank
where {
    BIND (dbr:Google as ?entity)
    {
        ?related_entity a dbo:Person ;
                        ?p ?entity .
    } UNION {
        ?related_entity a dbo:Organisation ;
                        dbo:parent ?entity .
    }
    ?related_entity rank:hasRDFRank ?rank
} order by desc(?rank)

Query: Airports near London
# GraphDB’s geo-spatial plug-in allows efficient evaluation of near-by
# RDFRank brings the top 6 passenger airports to the top of a list of 80

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX geo-pos: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX gdb-geo: <http://www.ontotext.com/owlim/geo#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX gdb: <http://www.ontotext.com/owlim/>

SELECT distinct ?airport ?rrank
WHERE {
    {
        SELECT * {
            dbr:London geo-pos:lat ?lat ;
                       geo-pos:long ?long .
        } LIMIT 10
    }
    ?airport gdb-geo:nearby(?lat ?long "50mi") ;
             a dbo:Airport ;
             gdb:hasRDFRank ?rrank .
} ORDER BY DESC(?rrank)
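As a rough illustration of what a "nearby" constraint computes, the Python sketch below filters places by great-circle distance using the haversine formula. This is a naive linear scan, not the indexed evaluation the geo-spatial plug-in performs, and the coordinates are approximate values supplied for the example rather than data from the deck.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

def haversine_mi(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

def nearby(center, places, max_miles):
    """Names of places within `max_miles` of `center`, checked one by one."""
    lat, lon = center
    return [name for name, (plat, plon) in places.items()
            if haversine_mi(lat, lon, plat, plon) <= max_miles]

# Approximate coordinates, for illustration only.
LONDON = (51.5074, -0.1278)
AIRPORTS = {
    "Heathrow": (51.4700, -0.4543),
    "Manchester": (53.3537, -2.2750),
}
print(nearby(LONDON, AIRPORTS, 50))
```

A geo-spatial index avoids computing the distance to every candidate, which is what makes the constraint efficient on a graph with millions of locations.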

Query: Top-level Industries by number of companies
# benefits from mapping and consolidation of industry classifications
# and predicates in DBpedia (ff-map)

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>
select distinct ?topIndustry (count(?company) as ?companies)
where {
    ?company dbo:industry ?industry .
    ?industrySum ff-map:industryVariant ?industry .
    ?industrySum ff-map:industryCenter ?topIndustry .
} group by ?topIndustry order by desc(?companies)

Semantic Press-Clipping

• We can trace references to a specific company in the news
− This is pretty much standard; however, we can deal with syntactic variations in the names, because state-of-the-art Named Entity Recognition technology is used
− What’s more important, we correctly distinguish whether a mention of “Paris” refers to Paris (the capital of France), Paris in Texas, Paris Hilton or Paris (the Greek hero)

• We can trace and consolidate references to daughter companies

• We have a comprehensive industry classification
− The one from DBpedia, but refined to accommodate identifier variations and specialization (e.g. a company classified as dbr:Bank will also be considered classified as dbr:FinancialServices)

Query: News Mentioning IBM
# technical example to demonstrate how news metadata can be accessed

PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

select distinct ?news ?title ?date ?pub_entity
where {
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch dbr:IBM .
    ?news pub-old:creationDate ?date ;
          pub-old:title ?title .
    FILTER ( (?date > "2015-10-01T00:02:00Z"^^xsd:dateTime) &&
             (?date < "2015-11-01T00:02:00Z"^^xsd:dateTime) )
} limit 100

Query: News Mentioning Gazprom and Its Related Entities
# benefits from inference over transitive dbo:parent relation and mappings to it

# prefixes as declared in the earlier queries
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

select distinct ?news ?title ?date ?related_entity
where {
    {
        select distinct ?related_entity {
            BIND (dbr:Gazprom as ?entity)
            {
                ?related_entity a dbo:Person ;
                                ?p ?entity .
                FILTER NOT EXISTS { ?related_entity dbo:club ?entity }
            } UNION {
                ?related_entity a dbo:Organisation ;
                                dbo:parent ?entity .
            } UNION {
                BIND(?entity as ?related_entity)
            }
        }
    }
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?related_entity .
    ?news pub-old:creationDate ?date ;
          pub-old:title ?title .
} order by desc(?date) limit 1000

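Query: Most Popular in the News Automotive Companies
# benefits from mapping and consolidation of industry classifications

# prefixes as declared in the earlier queries
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?pub_entity (max(?entity_label) as ?label) (count(?news) as ?news_count)
where {
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?entity ;
                pub:preferredLabel ?entity_label .
    dbr:Automotive ff-map:industryVariant ?industry .
    ?entity dbo:industry ?industry .
    ?news pub-old:creationDate ?date .
} group by ?pub_entity order by desc(?news_count)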

Query: Most Popular in the News, including children
# benefits from mapping and consolidation of industry classifications

# prefixes as declared in the earlier queries
PREFIX pub-old: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

select distinct ?parent (count(?news) as ?news_count)
where {
    {
        select distinct ?parent ?entity {
            BIND(dbr:Software as ?industry)
            ?industry ff-map:industryVariant ?industryVar .
            ?parent dbo:industry ?industryVar .
            ?parent a dbo:Company .
            FILTER NOT EXISTS { ?parent dbo:parent / dbo:industry / ff-map:industryVariant ?industry }
            {
                ?entity dbo:parent ?parent .
            } UNION {
                BIND(?parent as ?entity)
            }
        }
    }
    ?news pub-old:containsMention / pub-old:hasInstance ?pub_entity .
    ?pub_entity pub:exactMatch ?entity .
    ?news pub-old:creationDate ?date .
} group by ?parent order by desc(?news_count)
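The roll-up performed by this query, crediting mentions of subsidiaries to their top-level parent, can be sketched in a few lines of Python. The parent links and mention counts below are invented for the example, and the sketch simply sums per-entity counts, which is a simplification of what the query counts.

```python
from collections import Counter

# Hypothetical parent links (subsidiary -> parent) and per-entity mention counts.
PARENT = {"ex:BrandA": "ex:GroupX", "ex:BrandB": "ex:GroupX", "ex:UnitC": "ex:BrandB"}
MENTIONS = {"ex:GroupX": 10, "ex:BrandA": 7, "ex:BrandB": 5, "ex:UnitC": 2, "ex:SoloY": 9}

def top_parent(entity):
    """Follow parent links upward until an entity without a parent is reached."""
    seen = {entity}
    while entity in PARENT and PARENT[entity] not in seen:
        entity = PARENT[entity]
        seen.add(entity)
    return entity

def rolled_up(mentions):
    """Sum mention counts per top-level parent, most mentioned first."""
    totals = Counter()
    for entity, count in mentions.items():
        totals[top_parent(entity)] += count
    return totals.most_common()

ranking = rolled_up(MENTIONS)
print(ranking)
```

Here ex:GroupX moves ahead of ex:SoloY only once its brands are counted, which is the same effect visible when comparing the two halves of the ranking tables.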

News Popularity Ranking: Automotive

Rank  Company                    News #    Rank  Company incl. mentions of controlled  News #
1     General Motors             2722      1     General Motors                        4620
2     Tesla Motors               2346      2     Volkswagen Group                      3999
3     Volkswagen                 2299      3     Fiat Chrysler Automobiles             2658
4     Ford Motor Company         1934      4     Tesla Motors                          2370
5     Toyota                     1325      5     Ford Motor Company                    2125
6     Chevrolet                  1264      6     Toyota                                1656
7     Chrysler                   1054      7     Renault-Nissan Alliance               1332
8     Fiat Chrysler Automobiles  1011      8     Honda                                 864
9     Audi AG                    972       9     BMW                                   715
10    Honda                      717       10    Takata Corporation                    547

News Popularity: Finance

Rank  Company          News #    Rank  Company incl. mentions of controlled  News #
1     Bloomberg L.P.   3203      1     China Merchants Bank                  40940
2     Goldman Sachs    1992      2     Alphabet Inc.                         24219
3     JP Morgan Chase  1712      3     Capital Group Companies               4379
4     Wells Fargo      1688      4     Bloomberg L.P.                        3893
5     Citigroup        1557      5     Exor (company)                        2775
6     HSBC Holdings    1546      6     JP Morgan Chase                       2715
7     Deutsche Bank    1414      7     Nasdaq, Inc.                          2178
8     Bank of America  1335      8     Oaktree Capital Management            1757
9     Barclays         1260      9     Goldman Sachs                         1085
10    UBS              694       10    Sentinel Capital Partners             1064

Note: Including investment funds, stock exchanges, agencies, etc.

News Popularity: Banking

Rank  Company          News #    Rank  Company incl. mentions of controlled  News #
1     Goldman Sachs    996       1     China Merchants Bank *                38288
2     JP Morgan Chase  856       2     JP Morgan Chase                       1972
3     HSBC Holdings    773       3     Goldman Sachs                         1030
4     Deutsche Bank    707       4     HSBC                                  966
5     Barclays         630       5     Bank of America                       771
6     Citigroup        519       6     Deutsche Bank                         742
7     Bank of America  445       7     Barclays                              681
8     Wells Fargo      422       8     Citigroup                             630
9     UBS              347       9     Wells Fargo                           428
10    Chase            126       10    UBS                                   347

Note: including investment funds, stock exchanges, agencies, etc.

Today’s News Map: Business

Today’s News Map: International

Expect in Part II

• Mentions of an entity and its related entities by month

• Most relevant co-occurring entities

• Most relevant co-occurring entities per month

• Related News

• and more

Thank you!

Experience the technology with NOW: Semantic News Portal
http://now.ontotext.com

Start using GraphDB and text-mining with S4 in the cloud
http://s4.ontotext.com

Learn more at our website or simply get in touch: info@ontotext.com, @ontotext
