Presentation of MIA – A Cloud-Based Marketplace for Information and Analyses – at the 6th Data Science Day in Berlin.
A Cloud-Based Marketplace for Information and Analyses
Peter Adolphs, Project Manager R&D, Neofonie GmbH
6th Data Science Day, 8 May 2014
Research Project MIA
Data Mining Social Media Monitoring
Data Enrichment
Data Security, Reliable Applications
Data Acquisition & Enrichment, Text Mining,
Media Publisher Services
Scalable Database Technologies, Data Cleansing
Real-Time Data Mining
Funding Period: 2012-2014
Who owns the Web?
Image source: ©iStock.com/ahlobystov (Stock Photo: 4619850)
Usage Scenarios
Market Research
Brand Monitoring
Reputation Management
Internet
Data Extraction
MIA
MIA Platform and Marketplace
Technology (Apps) Stored Queries €
Providers of Application
Providers of Analysis Algorithms
Technology (Algorithms) €
Analysts
Ad-Hoc Questions €
Analysis Results
German Speaking
Web
Data Providers
Data & Aggregation
€
Trust / Certification
Infrastructure: Cloud/Real-Time Processing + Storage
Marketplace: Distribution of Technologies, Purchase of Computing Capacity
Acquisition Cleansing
Enrichment Aggregation
Data Mining Storage
Marketplace
Using MIA as a Service
Neofonie Individual Configuration
Application Developers
Multiple Application Scenarios
Web Annotation Tool (WATT)
ZEITMASCHINE: a Search App for News Archives
Dashboards with Current Data
Using MIA for Ad-Hoc Queries
Using MIA for Providing Data / Applications
Text Analysis
Boilerplate Removal & Document Structure Analysis
Goals
✱ Extract the document core
✱ Remove ads and navigation
✱ Determine document structure
Approach
✱ Determine text core and title using SVMs
✱ Features: text characteristics, linguistic properties, DOM structure, link/anchor structure
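A minimal sketch of the feature-based idea above, assuming each page has already been split into text blocks. The feature names, weights and bias are hypothetical stand-ins for a trained linear SVM and are not taken from the MIA system; only text and link features are shown, not DOM or linguistic ones.

```python
import re

def block_features(text: str, anchor_text: str = "") -> dict:
    """Compute a few illustrative features for one text block of a page."""
    words = re.findall(r"\w+", text)
    anchor_words = re.findall(r"\w+", anchor_text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "n_words": n_words,
        # Share of the block's words that sit inside links.
        "link_density": len(anchor_words) / n_words if n_words else 0.0,
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
    }

# Hypothetical weights; a real system would learn them from labeled blocks.
WEIGHTS = {"n_words": 0.05, "link_density": -3.0, "avg_sentence_len": 0.1}
BIAS = -1.0

def is_content(features: dict) -> bool:
    """Linear decision function: a positive score means 'document core'."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return score > 0

nav = block_features("Home News Sport Contact",
                     anchor_text="Home News Sport Contact")
body = block_features(
    "Yahoo has parted ways with its chief executive. "
    "The board announced the decision on Sunday evening after a short meeting."
)
print(is_content(nav), is_content(body))
```

Link-heavy, short blocks score low and long running text scores high, which is the intuition behind using link/anchor structure and text characteristics as features.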
Named Entities
✱ Recognition of Names for Real-World Entities like
✱ People
✱ Locations
✱ Organizations
✱ Products
✱ ...
✱ Named Entity Recognition (NER)
References for Names
✱ Names alone are not really useful
✱ Entity type is often also not enough
✱ We want a reference (e.g. a URI) w.r.t. some world model.
Image sources: 1) Bundesarchiv, B 145 Bild-F074398-0021 / Engelbert Reineke / CC-BY-SA 3.0 Germany. Shortlink: http://goo.gl/hTzdkH 2) Brassica oleracea convar. capitata var. alba, spitskool (2).jpg by user Rasbak / CC-BY-SA 3.0 Unported. Shortlink: http://goo.gl/IQhqQC
Knowledge Bases
Ambiguities
“Peter Müller”
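To illustrate why an ambiguous name needs a reference into a knowledge base, here is a toy sketch of context-based disambiguation. The knowledge-base identifiers, the context word sets and the overlap scoring are all invented for illustration; the slides do not describe how MIA resolves ambiguities.

```python
from typing import List, Optional

# Toy knowledge base: URIs and context words are invented for illustration.
KNOWLEDGE_BASE = {
    "Peter Müller": [
        {"uri": "kb:Peter_Mueller_politician",
         "context": {"politics", "party", "minister", "election"}},
        {"uri": "kb:Peter_Mueller_skier",
         "context": {"ski", "race", "downhill", "championship"}},
    ]
}

def link_entity(name: str, context_words: List[str]) -> Optional[str]:
    """Pick the candidate whose context words overlap most with the text."""
    candidates = KNOWLEDGE_BASE.get(name, [])
    if not candidates:
        return None
    words = {w.lower() for w in context_words}
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["uri"]

print(link_entity("Peter Müller", ["The", "minister", "won", "the", "election"]))
print(link_entity("Peter Müller", ["downhill", "race"]))
```

The same surface name resolves to different references depending on the surrounding words, which is the point of linking names to a world model rather than stopping at the entity type.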
✱ Supervised machine learning: requires labeled training data
✱ Sequence Learning Method
Conditional Random Fields for NER
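A sequence labeller such as a CRF predicts one tag per token. The sketch below does not implement a CRF; it only shows how per-token tags can be turned into entity spans, assuming the common BIO encoding (B- begins an entity, I- continues it, O is outside). The slides do not state which tag scheme is used.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O" tag: close any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["Yahoo", "trennt", "sich", "von", "CEO", "Scott", "Thomson"]
tags = ["B-ORG", "O", "O", "O", "O", "B-PERSON", "I-PERSON"]
print(bio_to_spans(tokens, tags))
```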
Dependency Parsing
Token: Yahoo trennt sich von CEO Scott Thomson
Lemma: Yahoo trennen sich von CEO Scott Thomson
NE: ORG (Yahoo), PERSON (Scott Thomson)
(English: "Yahoo parts ways with CEO Scott Thomson")
Relation Extraction
Token: Yahoo trennt sich von CEO Scott Thomson
Lemma: Yahoo trennen sich von CEO Scott Thomson
NE: ORG (Yahoo), PERSON (Scott Thomson)
(English: "Yahoo parts ways with CEO Scott Thomson")
Sentiment Analysis
✱ Goal: determine positive or negative sentiments
✱ Base: SentiWS (Sentiment-Lexicon of University Leipzig; freely available for research)
✱ Simple approach: sentiment of sentence = average sentiment weights of the words
Lemma: Stem|PoS Sentiment
Abhängigkeit|NN -0.3653
abfällig|ADJX -0.3197
abgedroschen|ADJX -0.1839
absolut|ADJX 0.2418
Ablehnung|NN -0.5118
Ablenkung|NN -0.0435
Anerkennung|NN 0.0855
anspruchsvoll|ADJX 0.2216
Freispruch|NN 0.0040
Freude|NN 0.6502
Freund|NN 0.0116
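The simple approach can be sketched as follows, using the lexicon entries listed above. One assumption is made here that the slide leaves open: only lemmas found in the lexicon contribute to the average, and a sentence without any lexicon hit gets a neutral 0.0.

```python
# Sample of SentiWS weights as listed on the slide (lemma -> weight).
SENTIWS_SAMPLE = {
    "Abhängigkeit": -0.3653,
    "abfällig": -0.3197,
    "abgedroschen": -0.1839,
    "absolut": 0.2418,
    "Ablehnung": -0.5118,
    "Ablenkung": -0.0435,
    "Anerkennung": 0.0855,
    "anspruchsvoll": 0.2216,
    "Freispruch": 0.0040,
    "Freude": 0.6502,
    "Freund": 0.0116,
}

def sentence_sentiment(lemmas):
    """Average the lexicon weights of all lemmas that have an entry."""
    weights = [SENTIWS_SAMPLE[lemma] for lemma in lemmas if lemma in SENTIWS_SAMPLE]
    return sum(weights) / len(weights) if weights else 0.0

print(sentence_sentiment(["Freude", "und", "Anerkennung"]))
print(sentence_sentiment(["die", "Ablehnung"]))
```

The input is a list of lemmas rather than raw tokens because the lexicon is keyed by lemma, so lemmatization has to run before the lookup.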
Text Analysis Components
Components: Topic Classification, Sentence Segmentation, Boilerplate Removal, PoS-Tagging, Tokenization, Lemmatization, Subjectivity Recognition, Dependency Parsing, NER & NERD, Quote Recognition, Relation Extraction, Sentiment Analysis
Applications: Brand Monitoring, Data Extraction, Reputation Management, Market Research
An Application
Are there Political Tendencies in the Coverage of German Online Media?
Selected Data
Some Statistics
Investigation Period: 12 Months
Documents: >18,000,000
Politicians: 6,543
Method
✱ Recognize names using named entity recognition, then link each name to a reference and disambiguate it
✱ Join recognized references with Freebase subset of German politicians and their party
✱ Aggregate and count
✱ Inspect & Visualize in Excel
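The join-and-count steps of the method can be sketched as below. The small `PARTY_OF` dictionary is a hypothetical stand-in for the Freebase subset, and the list of mentions is invented toy input rather than data from the study.

```python
from collections import Counter

# Stand-in for the Freebase subset: politician -> party.
PARTY_OF = {
    "Angela Merkel": "CDU",
    "Peer Steinbrück": "SPD",
    "Philipp Rösler": "FDP",
    "Wolfgang Schäuble": "CDU",
    "Horst Seehofer": "CSU",
}

def party_shares(mentions):
    """Join recognized names with the party table and return mention shares."""
    counts = Counter(PARTY_OF[name] for name in mentions if name in PARTY_OF)
    total = sum(counts.values())
    return {party: n / total for party, n in counts.items()} if total else {}

# Toy input: one linked name per recognized mention.
mentions = ["Angela Merkel", "Peer Steinbrück", "Angela Merkel",
            "Horst Seehofer", "Unknown Person"]
shares = party_shares(mentions)
print(shares)
```

Names without an entry in the party table are dropped by the join, so the shares refer to known politicians only.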
184 Online-Portals, 1.8.2012-31.7.2013
12% Chancellor Angela Merkel
[Bar chart, axis from 0% to 14% of mentions: Angela Merkel, Peer Steinbrück, Philipp Rösler, Wolfgang Schäuble, Horst Seehofer]
Mentions of Politicians
Media Coverage on Average
63% about current coalition
[Pie chart: share of mentions by party. Values in extraction order: 7.92%, 39.85%, 8.72%, 2.15%, 14.26%, 0.12%, 0.01%, 26.95%]
Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, Übrige (others)
Media Coverage in Particular News Sources
[Pie charts: share of mentions by party per news source, values in extraction order]
Bild: 7%, 39%, 7%, 2%, 13%, 32%
die tageszeitung: 15%, 34%, 6%, 5%, 9%, 30%
Neues Deutschland: 8%, 25%, 6%, 20%, 7%, 34%
Legend: CDU, CSU, SPD, Grüne, FDP, Linke, Piraten, NPD, Übrige (others)
Sentiments in Online Media
[Chart: sentiment by party (B90/Grüne, CDU, CSU, Linke, FDP, NPD, Piraten, SPD)]
Averaged over all Mentions in all News Articles
Conclusions
✱ Processing Web-scale amounts of textual data is a real challenge
✱ Requires the right tools, data and infrastructure
✱ MIA sketches a marketplace & execution platform that lets users query the Web much as they would query a database with SQL
✱ Marketplace allows algorithm developers and data providers to share (& monetize) their assets
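As a rough illustration of the "SQL over the Web" idea, the sketch below runs an aggregate query over a table of extracted mentions. The `mentions` table, its columns and its rows are hypothetical, and SQLite is used only as a convenient stand-in; the slides do not describe MIA's actual schema or query interface.

```python
import sqlite3

# In-memory stand-in for analysis results stored on the platform.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mentions (source TEXT, entity TEXT, party TEXT)")
conn.executemany(
    "INSERT INTO mentions VALUES (?, ?, ?)",
    [
        ("portal-a", "Angela Merkel", "CDU"),
        ("portal-a", "Peer Steinbrück", "SPD"),
        ("portal-b", "Angela Merkel", "CDU"),
    ],
)

# Ad-hoc question: how often is each party mentioned?
rows = conn.execute(
    "SELECT party, COUNT(*) AS n FROM mentions "
    "GROUP BY party ORDER BY n DESC, party"
).fetchall()
print(rows)
conn.close()
```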
Summary & Conclusions
Monday, May 26, 2014, 7 PM / 19:00
Data Talk
Common Crawl meets MIA:
Gathering and Crunching Open Web Data
http://ow.ly/wC1Fm
CINIQ, Einsteinufer 37, Berlin, Tickets on Eventbrite
Save the Date!
Peter Adolphs Project Manager R&D [email protected] T: +49 30 246 27 525
Neofonie GmbH Robert-Koch-Platz 4 10115 Berlin www.neofonie.de
Thank You For Your Attention!