34
STREAM REASONING AN APPROACH TO TAME THE VELOCITY AND VARIETY DIMENSIONS OF BIG DATA Emanuele Della Valle Politecnico di Milano http://emanueledellavalle.org @manudellavalle Oslo, Norway - 15.6.2017

Stream reasoning: an approach to tame the velocity and variety dimensions of Big Data

Embed Size (px)

Citation preview

STREAM REASONINGAN APPROACH TO TAME THE VELOCITY AND VARIETY DIMENSIONS OF BIG DATA

Emanuele Della VallePolitecnico di Milano

http://emanueledellavalle.org@manudellavalle

Oslo, Norway - 15.6.2017

BIG DATA TECHS

CAN TAME VOLUME

▸ Hadoop, MapReduce, HIVE

▸ “schema on read” methodology

▸ spark (x100 faster)

▸ “data lake” concept

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

BIG DATA TECHS

CAN TAME VELOCITY

▸ Storm

▸ Kafka

▸ Spark Streaming

▸ Flink

▸ paradigmatic change

▸ from persistent data and transient queries

▸ to persistent queries and transient data

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

BIG DATA TECHS

CANNOT TAME VOLUME AND VELOCITY SIMULTANEOUSLY

ZB

EB

PB

TB

GB

MB

KB

months days hours min. sec. ms. Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

BIG DATA TECHS

CAN TAME VARIETY USING SEMANTIC TECHNOLOGIES

▸ RDF data model

▸ SPARQL query language

▸ OWL ontological language

▸ R2RML mapping language

▸ Ontology Based Data Access methodology

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

BIG DATA TECHS

VARIETY MAKES PROBLEMS HARDER

ZB

EB

PB

TB

GB

MB

KB

months days hours min. sec. ms.

VARIETY

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

OFF-SHORE OIL OPERATIONS

‣ When sensors on a drilling pipe in an oil-rig indicate that it is about to get stuck, how long — according to historical records — can I keep drilling?

‣ 400,000 sensors from 10s of differente producers

‣ 10,000 observations per second, many out-of-operational-ranges

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

SMART CITIES

▸ Can you suggest where to spend my next hours given my interests, the presence of people and what their doing?

▸ 100,000s people generating 10,000s information items per second

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

SOCIAL MEDIA ANALYSIS

▸ Who are the current top influencer users that are driving the discussion about the top emerging topics across all the social networks

▸ billions of active users (facebook, 1.86 bln in February 2017)

▸ millions of actions (facebook, 2.92 mln post per minute)

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

REQUIREMENT ANALYSIS

A system able to answer those queries must be able to

▸ handle massive datasets x

▸ process data streams on the fly x

▸ cope with heterogeneous datasets x

▸ cope with incomplete data x x

▸ cope with noisy data x

▸ provide reactive answers x

▸ support fine-grained information access x x

▸ integrate complex domain models x

Volu

me

Velo

city

Varie

ty

VERA

CITY

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

(PARTIAL) SOLUTIONS: STREAM PROCESSING

▸ A paradigmatic change!window

input streams streams of answerRegistered Continuous Query

Dynamic System

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

STREAM PROCESSING VS. REQUIREMENTS

Requirement SPmassive datasets

data streamsheterogeneous dataset incomplete datanoisy data

reactive answers fine-grained information access complex domain models

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

(PARTIAL) SOLUTIONS: SEMANTIC TECHS

▸ Given an ontology O (an information model), a query Q and a set of ground facts A contained in multiple heterogenous databases …,

▸ use O to rewrite Q as Q’ so that

▸ answer(Q,O,A) = answer(Q’,!,A) The answer of the query Q using the ontology O for any set of ground facts A is equal to answer of a query Q’ without considering the ontology O

▸ Use mapping M to map Q’ to multiple SQL queries to the various databases

Rewrite

O

Q Q’Map

SQL

M

answerA

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

SEMANTIC TECHS VS. REQUIREMENTS

Requirement SP STmassive datasetsdata streamsheterogeneous dataset incomplete datanoisy data reactive answers fine-grained information access complex domain models

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

Is it possible to make sense in real time of multiple, heterogeneous, gigantic and inevitably noisy and incomplete data streams in order to support the decision processes of extremely large numbers of concurrent users?

E. Della Valle, S. Ceri, F. van Harmelen & H. Stuckenschmidt, 2010

STILL THERE ARE USERS WHOSE DECISIONS NEED TO TAME ALL Vs

STREAM REASONING RESEARCH QUESTION

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

(,13),(,12),(,8),(,8)

STREAM REASONING

THEORY: STREAM PROCESSING

time

1minutewidewindow

Which are the top-4 most frequent colours in the last minute?

Is there a followed by a in the last minute yes, many

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

THEORY: STREAM PROCESSING + SEMANTIC TECHS

time

1minutewidewindow

An ontology of colours

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

(,13),(,8),(,8)

STREAM REASONING

THEORY: STREAM REASONING

time

1minutewidewindow

Which are the top-2 most frequent cool colours in the last minute?

Is there a primary cool colour followed by a secondary warm one

yes, followed by .

An ontology of colours

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

THEORY: STREAM REASONING

time

1minutewidewindow

A better ontology of colours

Which are the most frequent sentiments in the last minute?

Is there a impulsive, irritating colour followed by an happy one

The better is the ontology of the colours we are using the more expressive are the queries we can register

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

THEORY: 1000 SCIENTIFIC PAPERS IN 10 YEAR

▸ It is possible extend the Semantic Web stack in order to represent heterogeneous data streams (RDF streams), continuous queries (C-SPARQL, CQELS-QL, … RSP-QL), and continuous reasoning (LARS, STARQL, …) tasks

▸ The ordered nature of data streams and the possibility to forget old enough information allow to optimise continuous querying (C-SPARQL Engine, CQELS, MorphStream, … RSP Engine) and continuous reasoning (IMaRS, RDFox, StreamRule, ETALIS…) tasks so to provide reactive answers

▸ Semantic Web and Machine Learning technologies can be jointly employed to cope with the noisy and incomplete nature of data streams

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

Traditional

STREAM REASONING

THEORY: STREAM REASONING PARADIGMATIC CHANGE ENABLED

TRADITIONAL APPROACH

Data “in-motion” Data

“in-motion”Registered

analysisInsights

“in-motion”

Data put“at-rest”in DWH

Analysis

Analysis

Insight

PANOPTIQUE APPROACH

Ontology+

Mappings

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

Traditional Stream Reasoning

STREAM REASONING

THEORY: STREAM REASONING PARADIGMATIC CHANGE ENABLED

TRADITIONAL APPROACH

Data “in-motion” Data

“in-motion”Registered

analysisInsights

“in-motion”

Data put“at-rest”in DWH

Analysis

Analysis

Insight

PANOPTIQUE APPROACH

Ontology+

Mappings

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

(MY) APPLICATIONS

BOTTARI Winner of Semantic Web Challenge 2011

URBAN BIG DATA SCIENCE

Winner of IBM faculty award 2013Funded by 8 EIT Digital yearly grants

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

URBAN BIG DATA SCIENCE: CITYSENSING PROJECT

STREAM REASONING

URBAN BIG DATA SCIENCE: CROWDINSIGHTS PROJECT

October July

1000

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

PRODUCTS: I STARTED UP

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

PRODUCTS: I STARTED UP

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

PRODUCTS: I STARTED UP

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

STREAM REASONING VS. REQUIREMENTS

Requirement Stream Reasoningmassive datasetsdata streamsheterogeneous dataset incomplete datanoisy data reactive answers fine-grained information access complex domain models

not specifically treated so far treated but not resolved universally addressed by all studies

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

NOW WHAT?

▸ Focus on languages and abstractions able to easily capture user needs

▸ Analytic queries

▸ Which electricity-producing turbine has sensor readings similar (i.e., Pearson correlated by at least 0.75) to any turbine that subsequently had a critical failure in the past year?

▸ Advance analytics (Machine Learning) tasks

▸ Where am I likely going to run into a traffic jam during my commute tonight and how long will it take, given current weather and traffic conditions?

▸ … many more …

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

▸ Find the sweet-spot between scalability and expressive semantics

▸ the data access layers are clear (enough)

▸ … but, what kind of reasoning should we put at the top?

▸ Rule language? Answer set programming? Temporal logic?

STREAM REASONING

NOW WHAT?

ComplexityRaw Stream Processing

Semantic Streams

DL-Lite

???Abstraction

Selection

Interpretation

Reasoning

Re-writing

Mapping

Change Frequency

PTIME

NEXPTIME

104 Hz

1 Hz

Complexity vs. Dynamics AC0

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

NOW WHAT?

▸ Used semantics to model more than the data access

▸ Data are imperfect, get over it!

STREAM REASONING

ARE YOU INTERESTED TO LEARN MORE?

▸ the official stream reasoning community web site

▸ http://streamreasoning.org/

▸ the RDF Stream Processing W3C community

▸ https://www.w3.org/community/rsp/

▸ my personal pages

▸ http://emanueledellavalle.org/ + twitter: @manudellavalle

▸ my company page

▸ http://fluxedo.com/en/

Emanuele Della Valle - http://emanueledellavalle.org - @manudellavalle

STREAM REASONING

THANK YOU! ANY QUESTION?

Emanuele Della VallePolitecnico di Milano

http://emanueledellavalle.org@manudellavalle

Oslo, Norway - 15.6.2017