MIA at the 6th Data Science Day Berlin


Presentation of MIA, a Cloud-Based Marketplace for Information and Analyses, at the 6th Data Science Day in Berlin.


A Cloud-Based Marketplace for Information and Analyses

Peter Adolphs, Project Manager R&D, Neofonie GmbH

6th Data Science Day, 8 May 2014

Research Project MIA

Data Mining, Social Media Monitoring

Data Enrichment

Data Security, Reliable Applications

Data Acquisition & Enrichment, Text Mining

Media Publisher Services

Scalable Database Technologies, Data Cleansing

Real-Time Data Mining

Funding Period: 2012-2014

Who owns the Web?

Image source: ©iStock.com/ahlobystov (Stock Photo: 4619850)

Usage


Usage Scenarios

Market Research

Brand Monitoring

Reputation Management

Internet

Data Extraction

MIA

MIA Platform and Marketplace

Technology (Apps) Stored Queries €

Providers of Application

Providers of Analysis Algorithms

Technology (Algorithms) €

Analysts

Ad-Hoc Questions €

Analysis Results

German Speaking

Web

Data Providers

Data & Aggregation

Trust / Certification

Infrastructure: Cloud/Real-Time Processing + Storage

Marketplace: Distribution of Technologies, Purchase of Computing Capacity

Acquisition, Cleansing, Enrichment, Aggregation, Data Mining, Storage

Marketplace

Using MIA as a Service

Neofonie Individual Configuration

Application Developers

Multiple Application Scenarios

Web Annotation Tool (WATT)

ZEITMASCHINE: a Search App for News Archives

Dashboards with Current Data

Using MIA for Ad-Hoc Queries

Using MIA for Providing Data / Applications

Text Analysis


Boilerplate Removal & Document Structure Analysis

Goals

✱ Extract the document core

✱ Remove ads and navigation

✱ Determine document structure

Approach

✱ Determine text core and title using SVMs

✱ Features: text characteristics, linguistic properties, DOM structure, link/anchor structure
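
The feature families listed above can be illustrated with a minimal sketch. It covers only the feature-extraction step for two of them (text characteristics and link/anchor structure); the SVM itself is left out. The set of block-level tags, the class name `BlockFeatureExtractor`, and the feature names `length` and `link_density` are assumptions made for this illustration, not details taken from the MIA implementation.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit the text blocks that get classified.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}


class BlockFeatureExtractor(HTMLParser):
    """Collect simple per-block features: text length and link density."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks as feature dicts
        self._text = []        # text pieces of the current block
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth of <a> elements

    def _flush(self):
        # Normalize whitespace, then store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append({
                "text": text,
                "length": len(text),
                "link_density": min(1.0, self._link_chars / len(text)),
            })
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


# Made-up sample page: a navigation block followed by an article paragraph.
html = (
    '<div><a href="/">Home</a> <a href="/news">News</a></div>'
    "<p>Yahoo parts ways with its CEO after a short tenure.</p>"
)
extractor = BlockFeatureExtractor()
extractor.feed(html)
extractor.close()
blocks = extractor.blocks

for block in blocks:
    print(block["length"], round(block["link_density"], 2), block["text"])
```

Navigation blocks tend to have a high link density and short text, while the document core tends to show the opposite pattern. A trained classifier would receive such feature vectors, together with linguistic and DOM features, as input.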

Named Entities

✱ Recognition of Names for Real-World Entities like

✱ People

✱ Locations

✱ Organizations

✱ Products

✱ ...

✱ Named Entity Recognition (NER)

References for Names

✱ Names alone are not really useful

✱ Entity type is often also not enough

✱ We want a reference (e.g. a URI) w.r.t. some world model.

Image sources: 1) Bundesarchiv, B 145 Bild-F074398-0021 / Engelbert Reineke / CC-BY-SA 3.0 Germany. Shortlink: http://goo.gl/hTzdkH 2) Brassica oleracea convar. capitata var. alba, spitskool (2).jpg by user Rasbak / CC-BY-SA 3.0 Unported. Shortlink: http://goo.gl/IQhqQC

Knowledge Bases

Ambiguities

“Peter Müller”

✱ Supervised Machine Learning: requires labeled training data

✱ Sequence Learning Method

Conditional Random Fields for NER
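
As a hedged illustration of what labeled training data for a sequence learner can look like, the sketch below encodes the example sentence used in these slides as one label per token. The BIO scheme (B = begin, I = inside, O = outside) is a common convention and an assumption here; the slides do not say which label encoding was used. The helper name `to_bio` is hypothetical.

```python
def to_bio(tokens, entities):
    """Turn entity spans into one BIO label per token.

    entities: list of (start, end_exclusive, type) token spans.
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in entities:
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


tokens = ["Yahoo", "trennt", "sich", "von", "CEO", "Scott", "Thomson"]
entities = [(0, 1, "ORG"), (5, 7, "PERSON")]
labels = to_bio(tokens, entities)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A sequence model such as a CRF is trained on many such token/label sequences and predicts the label sequence for unseen sentences.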

Dependency Parsing

Token: Yahoo trennt sich von CEO Scott Thomson (English: Yahoo parts ways with CEO Scott Thomson)

Lemma: Yahoo trennen sich von CEO Scott Thomson

NE: Yahoo = ORG, Scott Thomson = PERSON

Relation Extraction

Token: Yahoo trennt sich von CEO Scott Thomson (English: Yahoo parts ways with CEO Scott Thomson)

Lemma: Yahoo trennen sich von CEO Scott Thomson

NE: Yahoo = ORG, Scott Thomson = PERSON

Sentiment Analysis

✱ Goal: determine positive or negative sentiments

✱ Base: SentiWS (sentiment lexicon of Leipzig University; freely available for research)

✱ Simple approach: sentiment of sentence = average sentiment weights of the words

Lemma (Stem|PoS)       Sentiment
Abhängigkeit|NN        -0.3653
abfällig|ADJX          -0.3197
abgedroschen|ADJX      -0.1839
absolut|ADJX            0.2418
Ablehnung|NN           -0.5118
Ablenkung|NN           -0.0435
Anerkennung|NN          0.0855
anspruchsvoll|ADJX      0.2216
Freispruch|NN           0.0040
Freude|NN               0.6502
Freund|NN               0.0116
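
The simple averaging approach can be sketched as follows. The weights are copied from the lexicon excerpt above; the function name `sentence_sentiment`, the sample lemma list, and the decision to skip words that are missing from the lexicon (instead of counting them as 0) are assumptions made for this illustration.

```python
# Weights copied from the SentiWS excerpt on the slide (lemma -> weight).
LEXICON = {
    "Abhängigkeit": -0.3653,
    "abfällig": -0.3197,
    "abgedroschen": -0.1839,
    "absolut": 0.2418,
    "Ablehnung": -0.5118,
    "Ablenkung": -0.0435,
    "Anerkennung": 0.0855,
    "anspruchsvoll": 0.2216,
    "Freispruch": 0.0040,
    "Freude": 0.6502,
    "Freund": 0.0116,
}


def sentence_sentiment(lemmas, lexicon=LEXICON):
    """Average the weights of the lemmas that appear in the lexicon.

    Assumption: lemmas without a lexicon entry are skipped. A sentence
    without any known lemma gets a neutral score of 0.0.
    """
    weights = [lexicon[lemma] for lemma in lemmas if lemma in lexicon]
    if not weights:
        return 0.0
    return sum(weights) / len(weights)


# Made-up, already lemmatized sample input.
score = sentence_sentiment(["Freude", "Anerkennung", "und", "Freund"])
print(round(score, 4))
```

The input is expected to be lemmatized already, which is why lemmatization appears upstream of sentiment analysis in the component overview.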


Text Analysis Components

Topic Classification

Sentence Segmentation

Boilerplate Removal

PoS Tagging

Tokenization

Lemmatization

Subjectivity Recognition

Dependency Parsing

NER & NERD

Quote Recognition

Relation Extraction

Brand Monitoring

Data Extraction

Reputation Management

Market Research

Sentiment Analysis

An Application

Are there Political Tendencies in the Coverage of German Online Media?

Selected Data

Some Statistics

Investigation Period: 12 Months

>18,000,000 Documents

6,543 Politicians

Method

✱ Recognize names using named entity recognition with reference linking and disambiguation

✱ Join the recognized references with a Freebase subset of German politicians and their parties

✱ Aggregate and count

✱ Inspect & Visualize in Excel
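
The join and aggregation steps above can be sketched in a few lines. All identifiers and mention records below are made up for illustration (only the party memberships are real); the actual study ran on the MIA platform over more than 18 million documents and is not reproduced here.

```python
from collections import Counter

# Hypothetical stand-in for the Freebase subset: entity reference -> party.
# The reference IDs are invented for this sketch.
POLITICIAN_PARTY = {
    "ref:angela_merkel": "CDU",
    "ref:peer_steinbrueck": "SPD",
    "ref:philipp_roesler": "FDP",
    "ref:horst_seehofer": "CSU",
}

# Hypothetical NER output: one (portal, entity reference) pair per mention.
mentions = [
    ("portal-a", "ref:angela_merkel"),
    ("portal-a", "ref:peer_steinbrueck"),
    ("portal-b", "ref:angela_merkel"),
    ("portal-b", "ref:unknown_person"),  # not in the subset: dropped by the join
    ("portal-b", "ref:horst_seehofer"),
]


def party_shares(mentions, party_of):
    """Join mentions with the party table, count, and normalize to shares."""
    counts = Counter(party_of[ref] for _, ref in mentions if ref in party_of)
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()} if total else {}


shares = party_shares(mentions, POLITICIAN_PARTY)
for party, share in sorted(shares.items()):
    print(f"{party}: {share:.0%}")
```

Grouping by portal before counting would give the per-source breakdown instead of the overall average.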

Mentions of Politicians

12% Chancellor Angela Merkel

[Bar chart: share of mentions for Angela Merkel, Peer Steinbrück, Philipp Rösler, Wolfgang Schäuble, Horst Seehofer; axis from 0% to 14%]

184 online portals, 1 Aug 2012 to 31 Jul 2013

Media Coverage on Average

63% about current coalition

[Pie chart: share of mentions per party. Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, Übrige (others). Slice values: 7.92%, 39.85%, 8.72%, 2.15%, 14.26%, 0.12%, 0.01%, 26.95%]

184 online portals, 1 Aug 2012 to 31 Jul 2013

Media Coverage in Particular News Sources

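[Three pie charts: share of mentions per party in individual news sources. Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, Übrige (others)]

Bild: 7%, 39%, 7%, 2%, 13%, 32%

die tageszeitung: 15%, 34%, 6%, 5%, 9%, 30%

Neues Deutschland: 8%, 25%, 6%, 20%, 7%, 34%

184 online portals, 1 Aug 2012 to 31 Jul 2013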

Sentiments in Online Media

[Chart: sentiment per party (B90/Grüne, CDU, CSU, Linke, FDP, NPD, Piraten, SPD), averaged over all mentions in all news articles]

Conclusions

✱ Processing Web-scale amounts of textual data is a real challenge

✱ Requires the right tools, data and infrastructure

✱ MIA sketches a marketplace & execution platform which allows users to basically apply SQL to the Web

✱ Marketplace allows algorithm developers and data providers to share (& monetize) their assets

Summary & Conclusions

Monday, May 26, 2014, 7 PM / 19:00

Data Talk

Common Crawl meets MIA:

Gathering and Crunching Open Web Data

http://ow.ly/wC1Fm

CINIQ, Einsteinufer 37, Berlin, Tickets on Eventbrite

Save the Date!

Peter Adolphs Project Manager R&D peter.adolphs@neofonie.de T: +49 30 246 27 525

Neofonie GmbH Robert-Koch-Platz 4 10115 Berlin www.neofonie.de

Thank You For Your Attention!
