A Cloud-Based Marketplace for Information and Analyses

Peter Adolphs, Project Manager R&D, Neofonie GmbH

6th Data Science Day, 8 May 2014

MIA at the 6th Data Science Day Berlin

DESCRIPTION

Presentation of MIA – A Cloud-Based Marketplace for Information and Analyses – at the 6th Data Science Day in Berlin.

Page 1: MIA at the 6th Data Science Day Berlin

A Cloud-Based Marketplace for Information and Analyses

Peter Adolphs, Project Manager R&D, Neofonie GmbH

6th Data Science Day, 8 May 2014

Page 2: MIA at the 6th Data Science Day Berlin

Research Project MIA

Data Mining Social Media Monitoring

Data Enrichment

Data Security, Reliable Applications

Data Acquisition & Enrichment, Text Mining

Media Publisher Services

Scalable Database Technologies, Data Cleansing

Real-Time Data Mining

Funding Period: 2012-2014

Page 3: MIA at the 6th Data Science Day Berlin

Who owns the Web?

Image source: ©iStock.com/ahlobystov (Stock Photo: 4619850)

Page 4: MIA at the 6th Data Science Day Berlin

Usage

Page 5: MIA at the 6th Data Science Day Berlin

Usage Scenarios

Market Research

Brand Monitoring

Reputation Management

Internet

Data Extraction

MIA

Page 6: MIA at the 6th Data Science Day Berlin

MIA Platform and Marketplace

Technology (Apps) Stored Queries €

Providers of Application

Providers of Analysis Algorithms

Technology (Algorithms) €

Analysts

Ad-Hoc Questions €

Analysis Results

German-Speaking Web

Data Providers

Data & Aggregation

Trust / Certification

Infrastructure: Cloud/Real-Time Processing + Storage

Marketplace: Distribution of Technologies, Purchase of Computing Capacity

Acquisition, Cleansing, Enrichment, Aggregation, Data Mining, Storage

Page 7: MIA at the 6th Data Science Day Berlin

Marketplace

Page 8: MIA at the 6th Data Science Day Berlin

Using MIA as a Service

Neofonie Individual Configuration

Application Developers

Multiple Application Scenarios

Web Annotation Tool (WATT)

ZEITMASCHINE: a Search App for News Archives

Dashboards with Current Data

Page 9: MIA at the 6th Data Science Day Berlin

Using MIA for Ad-Hoc Queries

Page 10: MIA at the 6th Data Science Day Berlin

Using MIA for Providing Data / Applications

Page 11: MIA at the 6th Data Science Day Berlin

Text Analysis

Page 12: MIA at the 6th Data Science Day Berlin

Boilerplate Removal & Document Structure Analysis

Goals

✱ Extract the document core

✱ Remove ads and navigation

✱ Determine document structure

Approach

✱ Determine text core and title using SVMs

✱ Features: text characteristics, linguistic properties, DOM structure, link/anchor structure
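The feature-based approach above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Block` structure, the four features, and the hand-set `WEIGHTS` and `BIAS` are hypothetical stand-ins for a trained linear SVM, not the classifier used in MIA.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One candidate text block taken from a page's DOM (hypothetical structure)."""
    text: str          # visible text of the block
    anchor_text: str   # the part of the text that sits inside links
    tag: str           # name of the enclosing DOM element, e.g. "p" or "li"


def features(block: Block) -> list:
    """Turn a block into a small numeric feature vector."""
    words = block.text.split()
    n_words = len(words)
    link_words = len(block.anchor_text.split())
    # Text characteristics: capped, normalised length.
    length = min(n_words, 100) / 100.0
    # Link/anchor structure: share of words that are link text.
    link_density = link_words / n_words if n_words else 1.0
    # Linguistic property: does the block end like a sentence?
    ends_like_sentence = 1.0 if block.text.rstrip().endswith((".", "!", "?")) else 0.0
    # DOM structure: is the block a paragraph element?
    is_paragraph = 1.0 if block.tag == "p" else 0.0
    return [length, link_density, ends_like_sentence, is_paragraph]


# Hand-set weights and bias standing in for a trained linear SVM (assumption).
WEIGHTS = [2.0, -3.0, 1.0, 0.5]
BIAS = -1.0


def is_core_text(block: Block) -> bool:
    """Linear decision function: the block is core text if w . x + b > 0."""
    score = sum(w * x for w, x in zip(WEIGHTS, features(block))) + BIAS
    return score > 0


navigation = Block("Home News Sport Contact", "Home News Sport Contact", "li")
article = Block(
    "Yahoo parts ways with its chief executive after questions about his "
    "academic record were raised by an activist investor this week.",
    "",
    "p",
)
print(is_core_text(navigation), is_core_text(article))
```

In a real setting the weights would be learned from labeled pages rather than set by hand.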

Page 13: MIA at the 6th Data Science Day Berlin

Named Entities

✱ Recognition of Names for Real-World Entities like

✱ People

✱ Locations

✱ Organizations

✱ Products

✱ ...

✱ Named Entity Recognition (NER)

Page 14: MIA at the 6th Data Science Day Berlin

References for Names

✱ Names alone are not really useful

✱ The entity type alone is often not enough either

✱ We want a reference (e.g. a URI) with respect to some world model.

Image sources: 1) Bundesarchiv, B 145 Bild-F074398-0021 / Engelbert Reineke / CC-BY-SA 3.0 Germany. Shortlink: http://goo.gl/hTzdkH 2) Brassica oleracea convar. capitata var. alba, spitskool (2).jpg by user Rasbak / CC-BY-SA 3.0 Unported. Shortlink: http://goo.gl/IQhqQC

Page 15: MIA at the 6th Data Science Day Berlin

Knowledge Bases

Page 16: MIA at the 6th Data Science Day Berlin

Ambiguities

“Peter Müller”
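To illustrate how an ambiguous name such as "Peter Müller" might be resolved to a reference, here is a minimal sketch that picks the candidate whose context keywords overlap most with the surrounding text. The candidate list, the `example.org` URIs, and the keyword sets are made up for illustration, and keyword overlap is a deliberately simple stand-in, not the disambiguation method used in MIA.

```python
from typing import Optional

# Hypothetical mini knowledge base: placeholder URIs plus a few context keywords.
CANDIDATES = {
    "Peter Müller": [
        {"uri": "http://example.org/entity/peter_mueller_politician",
         "keywords": {"cdu", "saarland", "ministerpräsident", "landtag"}},
        {"uri": "http://example.org/entity/peter_mueller_skier",
         "keywords": {"ski", "abfahrt", "weltcup", "schweiz"}},
    ],
}


def link_entity(name: str, context: str) -> Optional[str]:
    """Return the URI of the candidate whose keywords overlap most with the context."""
    tokens = {t.strip(".,;:!?\"'").lower() for t in context.split()}
    best_uri, best_overlap = None, 0
    for candidate in CANDIDATES.get(name, []):
        overlap = len(candidate["keywords"] & tokens)
        if overlap > best_overlap:
            best_uri, best_overlap = candidate["uri"], overlap
    return best_uri  # None when the name is unknown or no keyword matches


print(link_entity("Peter Müller", "Peter Müller gewann die Abfahrt im Weltcup."))
print(link_entity("Peter Müller", "Die CDU im Saarland nominierte Peter Müller."))
```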

Page 17: MIA at the 6th Data Science Day Berlin

Conditional Random Fields for NER

✱ Supervised machine learning: requires labeled training data

✱ Sequence learning method
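Labeled training data for a sequence model is commonly written as one tag per token. The sketch below assumes the BIO scheme (B = begin, I = inside, O = outside), which the slides do not specify, and shows how a tagged sequence is turned back into entity spans. The function name `bio_to_spans` is hypothetical; the sentence is the example used elsewhere in this talk.

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from BIO-tagged tokens."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts (a stray I- tag is treated like B-).
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continue the entity that is currently open.
            current.append(token)
        else:
            # "O": outside any entity, so close the open one if there is one.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["Yahoo", "trennt", "sich", "von", "CEO", "Scott", "Thomson"]
tags = ["B-ORG", "O", "O", "O", "O", "B-PERSON", "I-PERSON"]
print(bio_to_spans(tokens, tags))
```

A sequence model such as a CRF predicts the tag column; the span extraction itself is independent of the model.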

Page 18: MIA at the 6th Data Science Day Berlin

Dependency Parsing

Token: Yahoo trennt sich von CEO Scott Thomson

Lemma: Yahoo trennen sich von CEO Scott Thomson

NE: Yahoo = ORG, Scott Thomson = PERSON

(English: "Yahoo parts ways with CEO Scott Thomson")

Page 19: MIA at the 6th Data Science Day Berlin

Relation Extraction

Token: Yahoo trennt sich von CEO Scott Thomson

Lemma: Yahoo trennen sich von CEO Scott Thomson

NE: Yahoo = ORG, Scott Thomson = PERSON

(English: "Yahoo parts ways with CEO Scott Thomson")
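A rough sketch of how the token, lemma, and NE layers shown on this slide could feed a relation extractor. It assumes a hypothetical `TRIGGERS` table of verb lemmas and pairs every ORG with every PERSON in the sentence; a real system would more likely use the dependency parse to decide which entities the verb actually connects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    lemma: str
    ne: Optional[str] = None  # named-entity type, or None outside entities


# Hypothetical trigger lemmas mapped to relation names (assumption).
TRIGGERS = {"trennen": "parts_with"}


def entities(tokens):
    """Merge adjacent tokens with the same NE type into (text, type) entities."""
    result, i = [], 0
    while i < len(tokens):
        if tokens[i].ne is None:
            i += 1
            continue
        j = i
        while j + 1 < len(tokens) and tokens[j + 1].ne == tokens[i].ne:
            j += 1
        result.append((" ".join(t.text for t in tokens[i:j + 1]), tokens[i].ne))
        i = j + 1
    return result


def extract_relations(tokens):
    """Emit (ORG, relation, PERSON) triples when a trigger lemma occurs."""
    ents = entities(tokens)
    orgs = [text for text, kind in ents if kind == "ORG"]
    persons = [text for text, kind in ents if kind == "PERSON"]
    relations = []
    for token in tokens:
        relation = TRIGGERS.get(token.lemma)
        if relation:
            relations.extend((o, relation, p) for o in orgs for p in persons)
    return relations


sentence = [
    Token("Yahoo", "Yahoo", "ORG"), Token("trennt", "trennen"),
    Token("sich", "sich"), Token("von", "von"), Token("CEO", "CEO"),
    Token("Scott", "Scott", "PERSON"), Token("Thomson", "Thomson", "PERSON"),
]
print(extract_relations(sentence))
```

Matching on the lemma "trennen" rather than the surface form "trennt" is what lets one trigger entry cover all inflected forms of the verb.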

Page 20: MIA at the 6th Data Science Day Berlin

Sentiment Analysis

✱ Goal: determine positive or negative sentiments

✱ Base: SentiWS (sentiment lexicon from Leipzig University; freely available for research)

✱ Simple approach: sentiment of a sentence = average of the sentiment weights of its words

Lemma: Stem|PoS Sentiment

Abhängigkeit|NN -0.3653

abfällig|ADJX -0.3197

abgedroschen|ADJX -0.1839

absolut|ADJX 0.2418

Ablehnung|NN -0.5118

Ablenkung|NN -0.0435

Anerkennung|NN 0.0855

anspruchsvoll|ADJX 0.2216

Freispruch|NN 0.0040

Freude|NN 0.6502

Freund|NN 0.0116
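The simple averaging approach can be sketched with the lexicon excerpt above. Two choices here are assumptions not stated on the slide: only lemmas found in the lexicon enter the average, and a sentence without any match scores 0.0. The function name `sentence_sentiment` is hypothetical.

```python
# Weights copied from the SentiWS excerpt above, keyed by lemma.
SENTIWS = {
    "Abhängigkeit": -0.3653, "abfällig": -0.3197, "abgedroschen": -0.1839,
    "absolut": 0.2418, "Ablehnung": -0.5118, "Ablenkung": -0.0435,
    "Anerkennung": 0.0855, "anspruchsvoll": 0.2216, "Freispruch": 0.0040,
    "Freude": 0.6502, "Freund": 0.0116,
}


def sentence_sentiment(lemmas, lexicon=SENTIWS):
    """Average the weights of the lemmas found in the lexicon (0.0 if none match)."""
    weights = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    return sum(weights) / len(weights) if weights else 0.0


# (0.6502 + 0.0855) / 2 is positive; (-0.5118 + 0.6502) / 2 is close to neutral.
print(sentence_sentiment(["Freude", "Anerkennung"]))
print(sentence_sentiment(["Ablehnung", "und", "Freude"]))
```

The input is a list of lemmas rather than surface forms, so the text has to be lemmatized before the lookup.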

Page 21: MIA at the 6th Data Science Day Berlin

Text Analysis Components

✱ Topic Classification

✱ Sentence Segmentation

✱ Boilerplate Removal

✱ PoS Tagging

✱ Tokenization

✱ Lemmatization

✱ Subjectivity Recognition

✱ Dependency Parsing

✱ NER & NERD

✱ Quote Recognition

✱ Relation Extraction

✱ Sentiment Analysis

Usage scenarios: Brand Monitoring, Data Extraction, Reputation Management, Market Research

Page 22: MIA at the 6th Data Science Day Berlin

An Application

Page 23: MIA at the 6th Data Science Day Berlin

Are there Political Tendencies in the Coverage of German Online Media?

Page 24: MIA at the 6th Data Science Day Berlin

Selected Data

Page 25: MIA at the 6th Data Science Day Berlin

Some Statistics

Investigation period: 12 months

Documents: >18,000,000

Politicians: 6,543

Page 26: MIA at the 6th Data Science Day Berlin

Method

✱ Recognize names using named entity recognition, including reference linking and disambiguation

✱ Join the recognized references with a Freebase subset of German politicians and their parties

✱ Aggregate and count

✱ Inspect & Visualize in Excel
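The join-and-count steps above can be sketched as follows. The `kb:` identifiers and the mention stream are made up for illustration (they are not Freebase identifiers or data from the study); only the politician-to-party assignments reflect real party membership.

```python
from collections import Counter

# Hypothetical identifiers standing in for knowledge-base references.
POLITICIAN_PARTY = {
    "kb:angela_merkel": "CDU",
    "kb:peer_steinbrueck": "SPD",
    "kb:philipp_roesler": "FDP",
    "kb:wolfgang_schaeuble": "CDU",
    "kb:horst_seehofer": "CSU",
}


def party_shares(mentions):
    """Join entity references with the party table and return each party's share."""
    counts = Counter(
        POLITICIAN_PARTY[ref] for ref in mentions if ref in POLITICIAN_PARTY
    )
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()} if total else {}


# Made-up mention stream; real input would come from the entity linker.
mentions = [
    "kb:angela_merkel", "kb:peer_steinbrueck", "kb:angela_merkel",
    "kb:horst_seehofer", "kb:unknown_person",
]
print(party_shares(mentions))
```

References that do not appear in the party table are dropped by the join, so they do not affect the shares.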

Page 27: MIA at the 6th Data Science Day Berlin

Mentions of Politicians

[Bar chart: share of mentions per politician, axis from 0% to 14%. Politicians shown: Angela Merkel, Peer Steinbrück, Philipp Rösler, Wolfgang Schäuble, Horst Seehofer.]

12%: Chancellor Angela Merkel

184 online portals, 1 August 2012 to 31 July 2013

Page 28: MIA at the 6th Data Science Day Berlin

Media Coverage on Average

63% about current coalition

[Pie chart: coverage by party. Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, others. Slice values as extracted: 7.92%, 39.85%, 8.72%, 2.15%, 14.26%, 0.12%, 0.01%, 26.95%.]

184 online portals, 1 August 2012 to 31 July 2013

Page 29: MIA at the 6th Data Science Day Berlin

Media Coverage in Particular News Sources

[Three pie charts: coverage by party per news source. Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, others.]

Bild: 7%, 39%, 7%, 2%, 13%, 32%

die tageszeitung: 15%, 34%, 6%, 5%, 9%, 30%

Neues Deutschland: 8%, 25%, 6%, 20%, 7%, 34%

184 online portals, 1 August 2012 to 31 July 2013

Page 30: MIA at the 6th Data Science Day Berlin

Sentiments in Online Media

[Chart: sentiment per party (B90/Grüne, CDU, CSU, Linke, FDP, NPD, Piraten, SPD), averaged over all mentions in all news articles.]

Page 31: MIA at the 6th Data Science Day Berlin

Conclusions

Page 32: MIA at the 6th Data Science Day Berlin

Summary & Conclusions

✱ Processing Web-scale amounts of textual data is a real challenge

✱ Requires the right tools, data and infrastructure

✱ MIA sketches a marketplace & execution platform which allows users to basically apply SQL to the Web

✱ Marketplace allows algorithm developers and data providers to share (& monetize) their assets

Page 33: MIA at the 6th Data Science Day Berlin

Save the Date!

Data Talk

Common Crawl meets MIA: Gathering and Crunching Open Web Data

Monday, May 26, 2014, 7 PM / 19:00

CINIQ, Einsteinufer 37, Berlin, Tickets on Eventbrite

http://ow.ly/wC1Fm

Page 34: MIA at the 6th Data Science Day Berlin

Peter Adolphs Project Manager R&D [email protected] T: +49 30 246 27 525

Neofonie GmbH Robert-Koch-Platz 4 10115 Berlin www.neofonie.de

Thank You For Your Attention!