Multi-technique data analytics workflow using a Logical Data Warehouse architecture: web mining use case

Antonio Laureti Palma, ISTAT, …@istat.it

Summary:
- A Logical Data Warehouse schema
- Predictive modelling
- Use case: SBS-ICT by web mining

daWos, Amsterdam, 11-12 September 2018


ESS Vision 2020

Total Quality Management


Data Warehouse 2.0

Visions:

B. Inmon: “The data warehouse of next-generation, while still building on the founding principles of an enterprise version of truth and a “single” data repository must address the needs of data of new types, new volumes, new data-quality levels, new performance needs, new metadata, and new user requirements.”

K. Krishnan: “The next-generation data warehouse architecture will be complex from a physical architecture deployment, consisting of a myriad of technologies, and will be data-driven from an integration perspective, extremely flexible, and scalable from a data architecture perspective.”


Logical DWH

New sources increase the complexity of IT components and move DWH architectures toward logical architectures

The Logical DWH is a new management architecture combining the strengths of traditional repository warehouses with alternative data management and access strategies

A Logical DWH is an evolution and augmentation of DWH practices, not a replacement

Data Virtualization enables Logical DWH


Logical DWH example: a possible data virtualization architecture

[Figure: architecture diagram. Labels: Analysis / Data Mining / Reports; data virtualization; S-DWH (RDBMS); Stat-DWH (RDBMS); distributed data store (NoSQL/Spark/Hadoop); machine learning; collect; web scraper.]


LSDW - Logical Statistical Data Warehouse

Logical Statistical Data Warehouse: a virtual central statistical data store, based on logical layers, for managing all available data of interest. It improves the ability to:
- produce the necessary information
- (re)use data to create new data and new outputs
- perform data analytics
- execute analysis
- produce reports
- support dashboard tools


LSDW Architecture domains:
- Functional domain
- Technology domain
- Data domain

(Picture from: Krish Krishnan, Data Warehousing in the Age of Big Data)


LSDW functional domains

Functional layers: processes, actions or tasks

[Figure: the STATISTICAL DATA WAREHOUSE, divided into DATA WAREHOUSING and OPERATIONAL DATA, with four layers: ACCESS LAYER, INTERPRETATION LAYER, INTEGRATION LAYER, SOURCES LAYER.]


LSDW - functional layers vs data sources

[Figure: the STATISTICAL DATA WAREHOUSE layers (SOURCES LAYER, INTEGRATION LAYER, INTERPRETATION AND ANALYSIS LAYER, ACCESS LAYER), covering OPERATIONAL DATA and the DATA WAREHOUSE, set against the phases COLLECT, PROCESS, ANALYZE and DISSEMINATE for each of three data sources: SURVEY, ADMIN and BIG DATA.]


Flow diagram example of predictive modelling

[Figure: three stages (preprocessing, learning, prediction). Diagram labels: dataset, labeled dataset, training, test, learning algorithm, evaluations, final model.]
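The flow in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset is invented, and the nearest-centroid classifier is a stand-in chosen because it needs no external library, not the learning algorithm used in the presentation.

```python
import random
from statistics import mean

def split(labeled, test_fraction=0.25, seed=0):
    """Shuffle the labeled dataset and split it into training and test parts."""
    rows = labeled[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def train(training):
    """Learning step: a nearest-centroid classifier (one mean vector per label)."""
    by_label = {}
    for features, label in training:
        by_label.setdefault(label, []).append(features)
    return {label: [mean(col) for col in zip(*vectors)]
            for label, vectors in by_label.items()}

def predict(model, features):
    """Prediction step: return the label of the closest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def evaluate(model, test):
    """Evaluation step: share of test rows whose label is predicted correctly."""
    hits = sum(predict(model, f) == y for f, y in test)
    return hits / len(test)

# Toy labeled dataset (invented): two well-separated groups.
labeled_dataset = [
    ([0.0, 0.1], "no"), ([0.2, 0.0], "no"), ([0.1, 0.3], "no"), ([0.3, 0.2], "no"),
    ([1.0, 0.9], "yes"), ([0.8, 1.0], "yes"), ([0.9, 0.7], "yes"), ([0.7, 0.8], "yes"),
]

training, test = split(labeled_dataset)          # preprocessing
final_model = train(training)                    # learning
accuracy = evaluate(final_model, test)           # evaluation on held-out rows
new_prediction = predict(final_model, [0.95, 0.85])  # prediction on unseen data
```

Keeping the test rows out of training is what makes the evaluation figure meaningful before the final model is applied to unlabeled data.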


LSDW layers: predictive modelling

[Figure: the predictive modelling stages (preprocessing, learning, prediction, analysis) mapped onto the SOURCE, INTEGRATION, INTERPRETATION and ACCESS layers. Other diagram labels: surveys, admins, big data, scraper, ETL, primary, labels, learning, data mining, analysis, data mart, reports, dashboard.]


Case Study: SBS-ICT by Web Mining

The case study uses survey data as ground truth to build a classification model that predicts variables of the Enterprises ICT Survey.

Items:
- analysis units: ICT Enterprises
- ICT variables involved: web ordering, presence in social media, job advertisements
- web-scraped content from a URL list
- predictor terms for the target variables: add to cart; shop online; account; order; job opportunities; career; job; …
- supervised ML models for data classification
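As a rough illustration of how the predictor terms listed above could become model inputs, the sketch below turns scraped page text into 0/1 term indicators. The grouping of terms by ICT variable, the function name `term_features` and the sample page are assumptions made for this example, not part of the original workflow.

```python
import re

# Predictor terms quoted on the slide, grouped by the ICT variable they hint at.
# The grouping itself is an assumption made for this sketch.
TERMS = {
    "web_ordering": ["add to cart", "shop online", "account", "order"],
    "job_advertisements": ["job opportunities", "career", "job"],
}

def term_features(text):
    """Return one 0/1 feature per term: 1 if the term occurs as whole words."""
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    padded = f" {normalized} "
    return {term: int(f" {term} " in padded)
            for terms in TERMS.values() for term in terms}

# Invented example of scraped page text.
page = "Shop online today: add to cart, then check your order. Careers page coming soon."
features = term_features(page)
```

Note that "Careers" does not match the term "career" here because matching is on whole words; this is the kind of gap that a lemmatization step is meant to close.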


Web Mining: SBS-ICT data processing


Web Mining on LSDW layers

[Figure: the web mining workflow (preprocessing, learning, prediction, analysis) mapped onto the SOURCE, INTEGRATION, INTERPRETATION and ACCESS layers. Diagram labels:
- URL handling: URLs retrieval, URLs validation, web scraping, text documents
- NLP (text mining): tokenization, lemmatization, POS tagging, summarization
- learning: classifications (LR, SVM, RF), learning models evaluation matrix, ML data analysis
- data stores: Register, DW-Thematic, Data Mart - ICT
- tools: Python, R, SAS]
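Two of the preprocessing steps named in the diagram, URL validation and tokenization, can be sketched with the Python standard library alone. The validation rule (http or https scheme, host containing a dot) and the sample inputs are assumptions for illustration; a production pipeline would typically also check that the URL actually responds.

```python
import re
from urllib.parse import urlparse

def is_valid_url(url):
    """Syntactic check only: http(s) scheme and a host containing a dot."""
    parts = urlparse(url.strip())
    return parts.scheme in ("http", "https") and "." in parts.netloc

def tokenize(text):
    """Lower-case the text and split it into runs of letters (accents kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())

# Invented candidate list, as it might come out of a URL retrieval step.
candidate_urls = [
    "https://www.example.com/shop",
    "ftp://example.com",
    "not a url",
    "http://localhost",
]
valid_urls = [u for u in candidate_urls if is_valid_url(u)]
tokens = tokenize("Job opportunities: join our Team!")
```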


Thank you for your attention


LSDW: Flow diagram of predictive modelling

[Figure: the predictive modelling stages (preprocessing, learning, prediction, analysis) mapped onto the SOURCE, INTEGRATION, INTERPRETATION and ACCESS layers, spanning the operational data store and the data warehouse. Other diagram labels: surveys, admins, big data, scraper, ETL, primary, labels, preparation, learning, data mining, analysis, data mart, reports, dashboard, distributed database.]

Antonio Laureti Palma, Q2018 - Kraków, Poland, 26-29 June 2018


Big Data analytics

My question: what is the difference between analytics and analysis?

Analysis is “a careful study of something to learn about its parts, what they do and how they are related to each other”.

Analytics is “the method of logical analysis”.

-> Therefore, we do analysis using analytics. Big Data analytics is the method of logical analysis applied to Big Data.

It introduces epistemological changes in the design of new official statistical production processes, which could force a significant infrastructure change.


Big Data processing

Data processing can be defined as the collection, processing, and management of data resulting in information generation to end consumers.

Traditional data processing life cycle:
- first, the transactional data is analyzed and a set of requirements is created, which leads to data discovery and data model creation,
- then, a database structure is created to process the data.

Big Data processing life cycle:
- first, the data is collected and loaded to a target platform, where a data structure for the content is created and a metadata layer is applied to the data,
- the data is then transformed and analyzed to provide insights into the data and any associated context.
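The contrast between the two life cycles can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite table for the traditional case and a couple of invented JSON records for the Big Data case; the field names are hypothetical.

```python
import json
import sqlite3

# Traditional life cycle: the structure is designed first, then data is loaded into it.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (enterprise_id TEXT, amount REAL)")
db.execute("INSERT INTO orders VALUES (?, ?)", ("E1", 120.0))
total = db.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Big Data life cycle: raw records are collected and loaded as they come ...
raw_lines = [
    '{"enterprise_id": "E1", "text": "add to cart"}',
    '{"enterprise_id": "E2", "url": "https://example.com", "text": "careers"}',
]
records = [json.loads(line) for line in raw_lines]

# ... then a structure and a metadata layer are derived from the loaded content.
metadata = {}
for record in records:
    for field, value in record.items():
        info = metadata.setdefault(field, {"type": type(value).__name__, "count": 0})
        info["count"] += 1
```

In the second half the set of fields is discovered after loading, which is why a record with an extra `url` field does not break the load.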

Antonio Laureti Palma, CoE S-DWH, CM Rome April 12th 2017


Big Data processing life cycle

The first step after acquisition of big data is to perform “data discovery”; this can be automated using algorithms:

- Text mining

- Data mining

- Pattern processing

- Statistical models

- Mathematical models


Analytics layer

To create the foundational structure for data analysis, you need subject-matter experts who understand the different layers of data being integrated and what granularity levels of integration can be completed to create the holistic picture.

Big Data analytics can be defined as the combination of traditional analytics and data mining techniques along with large volumes of data.

Data discovery for analytics can be defined in these distinct steps:
- Data tagging is the process of creating an identifying link on the data for metadata integration.
- Data classification is the process of creating subsets of value pairs for data processing and integration.
- Data modeling is the process of creating a model for data visualization or analytics.


Big Data and S-DWH integration: inbound data processing


Big Data integration strategies

1°) S-DWH data bus based: a data bus is developed using metadata and semantic technologies, which creates a data integration environment for data exploration and processing. It can be a simple layer or an overwhelmingly complex layer of processing.

Pros:
- Scalable design for RDBMS and Big Data processing.
- Reduced overload on processing.
- Heterogeneous physical architecture deployment.

Cons:
- Data bus architecture can become increasingly complex.
- Possible poor metadata architecture due to multiple layers of data processing.
- Data integration can become a performance bottleneck.


Big Data integration strategies

2°) S-DWH data connector: this connector is a bridge to exchange data between the two platforms.

Pros:
- Scalable design for RDBMS and Big Data processing.
- Modular data integration architecture.
- Heterogeneous physical architecture deployment, providing best-in-class integration at the data processing layer.
- Metadata and MDM solutions can be held with relative ease across the solution.

Cons:
- Performance of the Big Data connector is the biggest area of weakness.
- Data integration and query scalability can become complex.


Big Data integration strategies

3°) S-DWH based on Big Data appliances: these appliances are configured to handle the rigors of workloads and the complexities of Big Data and the current RDBMS architecture.

Pros:
- Scalable design and modular data integration architecture.
- Heterogeneous physical architecture deployment, providing best-in-class integration at the data processing layer.
- Custom configured to suit the processing rigors as required for each organization.

Cons:
- Customized configuration can be maintenance-heavy.
- Data integration and query scalability can become complex as the configuration changes over a period of time.


Big Data integration strategies

4°) S-DWH based on data virtualization: this solves the data integration challenge while leveraging all the investments in the current infrastructure, through a semantic data integration architecture.

Pros:
- Extremely scalable and flexible architecture.
- Workload optimized.
- Easy to maintain.
- Lower initial cost of deployment.

Cons:
- Lack of governance can create too many silos and degrade performance.
- Complex query processing can become degraded over a period of time.
- Performance at the integration layer may need periodic maintenance.
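The idea behind the data virtualization strategy can be illustrated with a toy virtual view that integrates two sources at query time without copying them into a new repository. This is a sketch under assumptions: the in-memory SQLite table stands in for the S-DWH RDBMS, the JSON lines stand in for a distributed data store, and all table, field and function names are hypothetical.

```python
import json
import sqlite3

# Source 1: a relational store (stands in for the S-DWH RDBMS).
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE register (enterprise_id TEXT, name TEXT)")
rdbms.executemany("INSERT INTO register VALUES (?, ?)", [("E1", "Alpha"), ("E2", "Beta")])

# Source 2: semi-structured documents (stands in for the distributed data store).
documents = [
    '{"enterprise_id": "E1", "web_ordering": true}',
    '{"enterprise_id": "E2", "web_ordering": false}',
]

def virtual_enterprise_view():
    """Yield integrated rows on demand; nothing is copied into a new repository."""
    flags = {}
    for line in documents:
        doc = json.loads(line)
        flags[doc["enterprise_id"]] = doc["web_ordering"]
    query = "SELECT enterprise_id, name FROM register ORDER BY enterprise_id"
    for enterprise_id, name in rdbms.execute(query):
        yield {"enterprise_id": enterprise_id, "name": name,
               "web_ordering": flags.get(enterprise_id)}

rows = list(virtual_enterprise_view())
```

Because the join happens each time the view is read, the sources stay where they are; the price, as the cons above note, is that performance depends on the integration layer.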


Big Data definitions…

Big Data can be defined as volumes of data available in varying degrees of complexity, generated at different velocities and varying degrees of ambiguity, that cannot be processed using traditional technologies, processing methods, algorithms, or any commercial off-the-shelf solutions.

In statistics we may speak about the “four Vs” (by Diego Kuonen):
- volume
- variety
- velocity
- veracity

[Figure: the four Vs grouped into IT items and Stat items.]


Big Data definitions…

- Volume: amount of data with respect to the number of observations (size of the data), but also with respect to the number of variables (dimensionality of the data);
- Variety: data in many forms, i.e. different types of data (e.g. structured, semi-structured and unstructured); data sources (e.g. internal, external, open, public); data resolutions and data granularities;
- Velocity: data in motion, i.e. the speed by which data are generated and need to be handled (e.g. streaming data from machines, sensors and social data);
- Veracity: data in doubt, i.e. the varying levels of noise and processing errors, including the reliability, capability and validity of the data.


New class of challenges and issues on Big Data 1/2.

i. Data does not have a finite architecture.

ii. Data can have multiple formats, semi-structured or unstructured.

iii. Data is self-contained and needs several external business rules to interpret and process the data.

iv. Data has no specificity with volume or complexity.

v. Data is not relational.

vi. Data has a minimal or zero concept of referential integrity.

vii. Data depends on metadata for creating context.


New class of challenges and issues on Big Data. 2/2

viii. Data needs more analytical processing.

ix. Data needs multiple cycles of processing, but each cycle needs to be processed in one pass due to the size of the data.

x. Data needs business rules for processing like we handle structured data today, but these rules need to be created in a rules engine architecture rather than the database or the ETL tool.

xi. Data needs more governance than data in the database.

xii. Data has no defined quality.
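Point x. above, business rules held in a rules engine rather than in the database or the ETL tool, can be sketched as rules expressed as plain data and applied by a small generic loop. The rule names, conditions and sample records below are invented for illustration.

```python
# Each rule is data: a name, a condition and an action. The rules live outside
# any database or ETL job, so they can be changed without touching either.
RULES = [
    ("missing_text", lambda r: not r.get("text"), lambda r: r.update(status="rejected")),
    ("has_cart", lambda r: "add to cart" in r.get("text", ""), lambda r: r.update(web_ordering=True)),
]

def apply_rules(record, rules=RULES):
    """Run every rule whose condition holds; return the names of the fired rules."""
    fired = []
    for name, condition, action in rules:
        if condition(record):
            action(record)
            fired.append(name)
    return fired

record = {"url": "https://example.com", "text": "add to cart and pay"}
fired = apply_rules(record)

empty = {"url": "https://example.org", "text": ""}
fired_empty = apply_rules(empty)
```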


Big Data workloads

The major areas where workload definitions are important include:

- Data is file based for acquisition and storage.
- Data processing will happen in three steps:
  • Discovery: in this step the data is analyzed and categorized. The data will need to be processed and computed where it is and not moved across the network.
  • Analytics: in this step the data is converted to metrics and a structured format, and extracted for processing in the data warehouse or analytical engines.
  • Analysis: in this step the data is associated with master data and metadata. This will require minimal transformation and movement of data across the network.
- File-system-driven consistency must be maintained, since no database is involved in the processing of Big Data.
- Big Data query workloads are mostly program executions of MapReduce code, which is the opposite of executing SQL and optimizing for SQL performance.
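The last point, program-style MapReduce workloads versus declarative SQL, can be made concrete with the same word count written both ways. This is a single-process sketch with invented input; a real MapReduce job would distribute the map and reduce phases across nodes.

```python
import sqlite3
from collections import defaultdict

documents = ["shop online shop", "online order"]

# MapReduce style: a program with explicit map, shuffle and reduce phases.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

pairs = [pair for doc in documents for pair in map_phase(doc)]
mr_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# SQL style: the same count stated declaratively; the engine plans the execution.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE words (word TEXT)")
db.executemany("INSERT INTO words VALUES (?)", [(w,) for d in documents for w in d.split()])
sql_counts = dict(db.execute("SELECT word, COUNT(*) FROM words GROUP BY word"))
```

In the first form the developer writes and tunes the execution steps; in the second the optimizer does, which is why the two kinds of workload are tuned so differently.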


New DWH, key IT challenges

The users of a data warehouse and the downstream business intelligence and analytics applications measure efficiency and effectiveness in units of speed, on both the inbound and outbound sides of the data warehouse.

- Data loading: data quality, slowly changing dimensional data, master data management (MDM), metadata management, transformation and processing.
- Availability: a benchmark, depending both on the loading process and on the infrastructure as a whole.
- Data volumes, driven by: analytics, compliance requirements, legal requirements, data security, business users, social media, nonspecific requirements.
- Storage performance: the issue lies in both the data architecture and the storage architecture.
- Query performance: for ad-hoc queries and analytical queries, due to their nondeterministic nature.
- Data transport: the aspect of performance concerning the efficient transport of data from one layer to another and its subsequent availability.


Components of the new DWH
- Analytics layer
- Technology layer
- Data layer


Data layer (1/2)

The data layer in the new platform includes:

i. Legacy data, which includes structured and semi-structured formats of data, stored online or offline (census, socioeconomic, urban planning, etc.).

ii. Transactional (OLTP) data: in the new platform all transactional data can be loaded, and all these segments of data can be used to create a powerful back-end data platform that analyzes data and organizes it at every processing step.

iii. Unstructured data: the next-generation platform will provide interfaces to investigate the content by navigating it based on user-defined rules for processing. The output of content processing will be used in defining and designing analytics for exploration mining of unstructured data.


Data layer (2/2)

The data layer in the new platform includes:

iv. Video: there are three components in a video: the content, the audio, and the associated metadata. The new data platforms provide the infrastructure necessary to process this data (e.g. automobile traffic analysis).

v. Audio: extracted data can be processed and stored as contextual data associated with the metadata in the next-generation data warehouse, e.g. data from call centers.

vi. Images: static images carry a lot of data that can be very useful in government agencies (geospatial integration) and other areas.

vii. Numerical/patterns/graphs: sensor data, stock market data, scientific data, cellular tower data, GPS data and other such data occur and repeat at periodic time intervals. Processing such data and integrating the results with the data warehouse will provide analytical opportunities to perform correlation or cluster analysis.


Technology layer

i. RDBMS

ii. Hadoop

iii. NoSQL

iv. MDM solutions (Master Data Management)

v. Metadata solutions

vi. Semantic technologies

vii. Rules engines

viii. Data mining algorithms

ix. Text mining algorithms

x. Data discovery technologies

xi. Data visualization technologies

xii. Reporting and analytical technologies


Case Studies

- population statistics from mobile phone traffic: “Persons and Places” project, OD matrix by mobile phone data
- business statistics produced by web mining: ICT survey, variable estimation using internet data

DWH IT environment (distributed computing platform):
- Oracle Exadata Database Machine
- Software languages: PySpark, Spark MLlib, scikit-learn
- HUE (Hadoop User Experience): editors for Hive, Impala, Spark, SQL; browser and scheduler of jobs and workflows for HDFS, SQL tables, ...
- Hadoop/Spark infrastructure based on 8 nodes

Invest in new IT tools and methodology


Case study 1: population statistics from mobile phone traffic

The case study focuses on the ISTAT project “Persons and Places”, which compares two approaches to mobility profile estimation:
- based on administrative archives
- based on mobile phone data

Items:
- analysis units: resident, embedded and daily city users
- OD matrix of daily mobility at municipality level
- calling data from mobile phone CDRs (Call Detail Records)
- classification based on an unsupervised learning process
- comparison of estimates


Case study 1: Logical S-DWH processing lifecycle

[Figure: the CDR processing lifecycle across the operational layers and the data warehouse layers. Diagram labels:
- stages: source (collect stage), integration (data discovery stage), interpretation (analysis stage), access
- processing steps: CDR, Individual Call Profiles, prototype extractions (K-Means algorithm), archetype definitions, prototype labelling, label propagation (1-Nearest-Neighbor)
- data stores and outputs: HDFS, RDD, ICP DWH, population register, population DWH, MPT-OD matrix, P&P-OD matrix
- platforms and tools: distributed database, distributed computing platform, R, SAS, Plotly PyLib]
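The three learning steps named in the diagram (prototype extraction with K-Means, prototype labelling, label propagation with 1-Nearest-Neighbor) can be sketched in plain Python. Everything specific here is an assumption for illustration: the two-feature call profiles are invented, the labelling rule is a stand-in for an analyst's judgement, and the project itself ran on a Spark platform rather than on code like this.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to the point (the 1-Nearest-Neighbor rule)."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k prototype vectors."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its cluster; keep it if the cluster is empty.
        centers = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers

# Invented call profiles: (share of calls at night, share of calls on weekdays).
profiles = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.85], [0.1, 0.9], [0.2, 0.8], [0.15, 0.95]]

# Prototype extraction.
prototypes = k_means(profiles, k=2)
# Prototype labelling: done by an analyst in practice; here a stand-in rule.
prototype_labels = ["resident" if p[0] > 0.5 else "daily city user" for p in prototypes]
# Label propagation: each profile takes the label of its nearest prototype.
labels = [prototype_labels[nearest(p, prototypes)] for p in profiles]
```

Labelling a handful of prototypes and propagating those labels is what lets an unsupervised clustering step end in named user categories without labelling every individual profile.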


Flow diagram of predictive modelling

[Figure: four stages (Preprocessing, Learning, Evaluation, Prediction). Diagram labels: dataset, labeled dataset, training, test, labels, learning algorithm, final model.]


Logical DWH

Data Virtualization enables Logical DWH

focusing more on the logic of information than data structures means adding semantic data abstraction based on:

virtual (any data) management

high quality level of metadata

active system self-monitoring

distributed processes (parallel-processing)

service level tracking


Logical S-DWH layered architecture: ML Flow diagram
