CONCEPTUALIZING ANALYTICS
Conceptual Modeling and Data Analytics – A Tutorial

Christoph G. Schuetz, Michael Schrefl

Thanks to Ilko Kovacic, Median Hilal, and Georg Grossmann (UniSA) for material that served as the basis for parts of this tutorial.

Table of Contents

Introduction and Background

Acquisition and Recording

Extraction, Cleaning, and Annotation

Integration, Aggregation, and Representation

Analysis and Modeling

Interpretation and Action

Open Issues


INTRODUCTION AND BACKGROUND

Scope of this Tutorial

How may conceptual modeling facilitate data analytics?

What is Data Analytics?

Definition from industry [32, p. 37]: "data-based applications of quantitative analysis methods"

Another definition from industry [29, p. 16]: "examination of information to uncover insights that give a business person the knowledge to make informed decisions"; "Analytics tools enable people to query and analyze information"

Definition from academia [31, p. 329]: "discovery and communication of meaningful patterns in data"; "Organizations apply analytics to their data in order to describe, predict, and improve organizational performance."

Data Analytics

- Descriptive: What happened? Multidimensional analysis (OLAP), statistical analysis of the past. Dashboards, scorecards, key performance indicators.

- Predictive: Use of statistical methods in an attempt to predict what will happen in the future.

- Prescriptive: What actions should be taken? Alerts and actions triggered by analysis results. Active data warehouse.

Predictive and prescriptive analytics are commonly referred to as advanced analytics.

Data analytics must be viewed in the broader context of business intelligence.

What is Business Intelligence?

[Figure: Integrate & cleanse data from multiple sources (Data Integration) → store integrated data (Data Warehousing) → present & analyze information (Business Intelligence)]

Figure: The relationship between data integration, data warehousing, and business intelligence [29, p. 15]

What about big data analytics?

Compared to traditional business intelligence (BI), analysis of big data is really not so different from a conceptual point of view.

Acquisition/Recording → Extraction/Cleaning/Annotation → Integration/Aggregation/Representation → Analysis/Modeling → Interpretation/Action

Figure: The (big) data analysis pipeline (adapted from [3])

One may argue that business intelligence has always been about the analysis of what constituted "big" data at the time [32]. The specific technologies, however, may differ.

Scope of this Tutorial

How may conceptual modeling facilitate data analytics?

This tutorial follows the steps of the (big) data analysis pipeline and illustrates selected examples of conceptual modeling supporting each step.

Running Example: Precision Dairy Farming

From this ...

... to that!

The AgriProKnow Project

Joint research effort between various companies and research institutions on data analytics in dairy farming

- Smartbow develops smart animal eartags to track activity.
- Wasserbauer develops automated feeding machines.
- The University of Veterinary Medicine Vienna provides the domain knowledge.
- Johannes Kepler University (JKU) Linz has statistical and business intelligence (BI) knowledge for data analysis.

Project goal: Building an active semantic data warehouse for precision dairy farming [28]

Further Reading: C. G. Schuetz, S. Schausberger, M. Schrefl. Building an active semantic data warehouse for precision dairy farming. Journal of Organizational Computing and Electronic Commerce, 28(2), 122-144, 2018.

ACQUISITION AND RECORDING

Acquisition and Recording

Interesting data originate from various sources such as operational databases, sensors, or the web.

Possible storage forms for (big) data with support for data analysis are:

- Data Warehouse: A clean and integrated database providing data of interest in a format fit for analysis
- Data Lake: Store the raw data as-is, possibly with additional metadata to help retrieve datasets. Data are transformed when needed for the analysis.

The Data Warehouse is Dead!

"Barely a company uses a data warehouse anymore because it's too cumbersome to build it. (But the multidimensional model remains relevant.)"
— An actual colleague from a consulting firm

"You're using a data warehouse to analyze sensor data? Really? But everyone uses data stream processing for that."
— An actual attendee of EDOC 2016

Is The Data Warehouse Dead?

"What is a data warehouse? Our view is that the data is the warehouse, and our data just happens to be managed with a relational database today. Our data could be managed on a non-relational platform, and it would still be a warehouse. (...) The idea that Hadoop would replace a warehouse is misguided because the data and its platform are two non-equivalent layers of the data warehouse architecture. It's more to the point to conjecture that Hadoop might replace an equivalent data platform, such as a relational database management system." [26, p. 15]

⇒ Be flexible regarding implementation technology!

Is The Data Warehouse Dead?

"There have been numerous times when vendors proclaim that data warehousing is no longer needed. (...) There is no 'silver bullet' that helps an enterprise avoid the hard work of data integration. Information that is clean, comprehensive, consistent, conformed, and current is not a happenstance; it requires thought and work." [29, p. 12]

The data warehouse remains a relevant concept also in the age of big data, storing clean data in a format and granularity suitable for analysis.

⇒ Conceptual modeling to the rescue!

Data Lake

A data lake serves to store raw data for later analysis, once the analysts have figured out what to do with the data.

A data lake may complement a traditional data warehouse, especially in the presence of high-velocity data streams.

Sensor Data Warehousing (Dobson et al. [5])

[Figure: architecture combining a data lake, stream processing, event processing, and a data warehouse to support real-time analysis and business intelligence]

Conceptual Model: Sensor Measurements

The domain model comprises the following classes (UML class diagram):

- Measurement: receptionTimestamp, sensingTimestamp, value, accuracy
- MeasurementType: name, unit
- Transformation: name
- Location: latitude, longitude
- Agent

Each Measurement originates from exactly one Agent and has exactly one MeasurementType; a Measurement may optionally be associated with a Transformation and a Location.

Figure: A domain model for sensor measurements [5]
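The classes in the domain model can be sketched as Python dataclasses. This is an illustrative rendering, not part of the original work; the Agent identifier attribute is an assumption here, since the diagram refines agents in a separate model.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class MeasurementType:
    name: str
    unit: str

@dataclass
class Transformation:
    name: str

@dataclass
class Location:
    latitude: float
    longitude: float

@dataclass
class Agent:
    id: str  # identifier assumed for illustration; subtypes follow in the agent-types model

@dataclass
class Measurement:
    reception_timestamp: datetime
    sensing_timestamp: datetime
    value: float
    accuracy: float
    type: MeasurementType                            # exactly one measurement type
    agent: Agent                                     # exactly one agent
    transformation: Optional[Transformation] = None  # optional (0..1)
    location: Optional[Location] = None              # optional (0..1)
```

The optional association ends map naturally to `Optional` fields with a `None` default.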

Conceptual Model: Agent Types

The domain model specializes Agent into Person, Process, and AssignedDevice (UML class diagram):

- Person: name, age, position
- Process: id; each Process has exactly one ProcessType (name)
- AssignedDevice: assignmentTimestamp; records the assignment of a PhysicalDevice (id, nominalAccuracy) to a LogicalDevice (id)
- LogicalDevice is specialized into Stationary and Mobile; a Stationary device may be associated with a Location

Figure: A domain model for sensor agents [5]
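A possible Python rendering of the specialization hierarchy (illustrative; the association ends follow the reading of the diagram given above, and the Stationary/Mobile subtypes are omitted for brevity):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Agent:
    """Common supertype of persons, processes, and assigned devices."""

@dataclass
class Person(Agent):
    name: str
    age: int
    position: str

@dataclass
class ProcessType:
    name: str

@dataclass
class Process(Agent):
    id: str
    type: ProcessType  # each process has exactly one process type

@dataclass
class PhysicalDevice:
    id: str
    nominal_accuracy: float

@dataclass
class LogicalDevice:
    id: str

@dataclass
class AssignedDevice(Agent):
    """Assignment of a physical device to a logical device at a point in time."""
    assignment_timestamp: datetime
    physical: PhysicalDevice
    logical: LogicalDevice
```

Separating physical and logical devices lets a broken physical sensor be replaced without breaking the continuity of the logical device's measurement series.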

Dimensional Fact Model: Measurements

Figure: Multidimensional model for measurements [5]

Example Sensor Readings

S. Time           Meas. Type  Agent  Transform.  Acc.  Value
2018/10/02 14:00  3           1      AVG10       0.1   22.2
2018/10/02 14:10  3           1      AVG10       0.1   22.4
2018/10/02 14:05  2           2      AVG5        0.1   61.3
2018/10/02 14:15  2           2      AVG5        0.2   60.9
2018/10/03 10:20  2           3                        62

Meas. Type ID  Meas. Type           Unit
1              Milk yield           kg
2              Rumination activity  Chews/Cud
3              Temperature          °C

Agent ID  Agent  Agent Type
1         THE01  Device
2         EAR23  Device
3         VET01  Person

Phys. Dev.  Log. Dev.           Loc.          Dev. Type
THE01232    Temp. Feed Area #1  Feed Area #1  Thermo.
EAR03143                                      Earmark
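The dimension keys in the readings table can be resolved against the lookup tables in plain Python, e.g., to list each reading with its measurement-type name, unit, and agent. This is a sketch using the example values from the slides:

```python
# Example readings from the slides: (time, meas. type ID, agent ID, transform, accuracy, value)
readings = [
    ("2018/10/02 14:00", 3, 1, "AVG10", 0.1, 22.2),
    ("2018/10/02 14:10", 3, 1, "AVG10", 0.1, 22.4),
    ("2018/10/02 14:05", 2, 2, "AVG5", 0.1, 61.3),
    ("2018/10/02 14:15", 2, 2, "AVG5", 0.2, 60.9),
]
# Lookup tables: measurement types and agents.
measurement_types = {
    1: ("Milk yield", "kg"),
    2: ("Rumination activity", "Chews/Cud"),
    3: ("Temperature", "°C"),
}
agents = {1: ("THE01", "Device"), 2: ("EAR23", "Device"), 3: ("VET01", "Person")}

# Resolve the foreign keys: each row becomes (time, type name, unit, agent, value).
resolved = [
    (time, measurement_types[mt][0], measurement_types[mt][1], agents[ag][0], value)
    for time, mt, ag, _transform, _acc, value in readings
]
print(resolved[0])  # ('2018/10/02 14:00', 'Temperature', '°C', 'THE01', 22.2)
```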

The Need for Shared Conceptualization

In order to allow for comparison of results, a shared conceptualization is vital.

Example: Activity Tracking in AgriProKnow
Sensors track movement activity of animals within a farm. In order to allow for a comparison of movement activity data between farms, it is more important to capture the function area, e.g., feeding area, resting area, milking area, than the precise location. But first, common function areas across farms must be identified, and then captured during data acquisition and recording.

The Need for Shared Conceptualization

Furthermore, a shared conceptualization may help to reduce the data load early on, during data acquisition and recording. Later computations might be expensive or even infeasible.

Example: Activity Tracking in AgriProKnow
Sensors may track position and activity every two seconds.
30 × 60 × 24 = 43,200 readings per day and animal
Large farm: 1,000 animals
⇒ 15,768,000,000 readings per year for one farm
The ultimate vision of AgriProKnow is to collect data from thousands of farms for inter-farm data analysis.
But: All those readings are often not needed. A more abstract level is more interesting.

The Need for Shared Conceptualization

Furthermore, a shared conceptualization may help to reduce the data load early on, during data acquisition and recording. Later computations might be expensive or even infeasible.

Example: Activity Tracking in AgriProKnow
Rather than storing thousands of location points and movement patterns for each animal, a higher level of abstraction is more useful. For each animal, the walking distance and duration as well as the lying and standing durations within each hour of the day are more important.
⇒ a shared conceptualization of activity types, which should be known upon recording to be able to reduce data early on

24/131
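The early data reduction described above can be sketched as follows. This is a minimal illustration, not the AgriProKnow implementation; the reading format and the activity labels are assumptions.

```python
from collections import defaultdict

# Each raw reading: (animal_id, epoch_seconds, activity), taken every 2 seconds.
# Reduce to: duration in seconds per (animal, hour of day, activity type).
READ_INTERVAL = 2  # seconds between sensor readings

def reduce_readings(readings):
    """Aggregate 2-second activity readings into per-hour activity durations."""
    durations = defaultdict(int)
    for animal_id, epoch, activity in readings:
        hour = (epoch // 3600) % 24  # hour of day, 0-23
        durations[(animal_id, hour, activity)] += READ_INTERVAL
    return dict(durations)

readings = [
    ("AT23464", 13 * 3600 + 0, "standing"),
    ("AT23464", 13 * 3600 + 2, "standing"),
    ("AT23464", 13 * 3600 + 4, "walking"),
]
print(reduce_readings(readings))
# {('AT23464', 13, 'standing'): 4, ('AT23464', 13, 'walking'): 2}
```

Three raw readings collapse into two compact duration facts; a year of per-second data shrinks accordingly, which is exactly why the activity types must be agreed upon before recording.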


Dimensional Fact Model: Movement Activity

For example, animal AT23464 on the Kremesberg farm site may have spent 0 minutes lying, 10 minutes standing, and 5 minutes walking in a feeding area on 10 October 2018 in the 13th hour of the day.


Data Lake

A data lake serves to store raw data for later analysis, once the analysts have figured out what to do with the data.

Of course, the data sets need to be organized such that the analysts can find them.

⇒ Structured data lake approach

26/131


Semantic Data Containers [28]

The semantic container approach makes it possible to organize data sets along spatial, temporal, and other semantic dimensions (or facets). The dimensions/facets consist of concepts, which are hierarchically organized.

Example: ATM Information Cubes
Modern air traffic management (ATM) heavily relies on the timely exchange of accurate information. ATM stakeholders require information at various granularities and levels of detail. ATM information cubes are structured repositories of ATM messages, where each cell is a semantic container.

27/131


ATM Information Cube: Operations

[Figure: an ATM information cube with cells for the message categories Operational Restriction, Flight Critical, and Essential Briefing Package over the airspaces EDUU-01, EDUU-02, and EDUU.]

• Merge: Change granularity of the cube by merging the contents of the cells.

• Abstract: Replace entities inside a cell with more abstract entities.

ATM Information Cube: Example

[Figure: example cube with cells for the categories Operational Restriction, Flight Critical, Potential Hazard, and Additional Information, holding messages such as TS-LOWW-01, TS-LOWW-02, TS-LZIB-01, and TS-LZIB-02.]

ATM Information Cube: Merge

[Figure: merging cells yields an Essential Briefing Package and a Supplementary Briefing Package per airspace (LOVV, LZBB).]

EXTRACTION, CLEANING, AND ANNOTATION

Process Modeling and ETL

Extract, transform, and load (ETL) processes feed the data from the sources into the data warehouse.

Traditionally, the implementation of ETL processes involves a lot of low-level programming.

Process modeling approaches with support for code generation may facilitate the implementation of ETL processes and also serve as documentation.

Besides proprietary modeling languages, the Business Process Model and Notation (BPMN) or UML activity diagrams may serve for ETL process modeling.

31/131

BPMN Models of ETL Processes (El Akkaoui et al. [7, 6])

Two perspectives on ETL processes:

• Control process (process orchestration): Handle branching and synchronizing of the data flow

• Data process: Specify precisely how the input data transforms into output data

32/131

Control Process: Example

Before animal movement data can be loaded into the AgriProKnow data warehouse, the animal dimension and the function areas at specific farms must be loaded.

[Figure: BPMN control process in the AgriProKnowDWH pool — FarmFunctionAreaLoad and AnimalDimLoad run before AnimalMovementLoad.]
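The ordering constraint enforced by the control process can be sketched as a dependency graph with a topological sort; the task names come from the example above, but the scheduling code itself is an illustrative assumption (a real orchestrator would handle this).

```python
# Dependencies among the ETL tasks of the control process:
# AnimalMovementLoad must run after FarmFunctionAreaLoad and AnimalDimLoad.
DEPENDS_ON = {
    "FarmFunctionAreaLoad": [],
    "AnimalDimLoad": [],
    "AnimalMovementLoad": ["FarmFunctionAreaLoad", "AnimalDimLoad"],
}

def run_order(deps):
    """Return a task order respecting all dependencies (topological sort).

    Assumes the dependency graph is acyclic, as a BPMN control flow is.
    """
    order, done = [], set()
    def visit(task):
        if task in done:
            return
        for d in deps[task]:
            visit(d)       # load prerequisites first
        done.add(task)
        order.append(task)
    for task in deps:
        visit(task)
    return order

print(run_order(DEPENDS_ON))
```

Running this yields an order in which the movement load always comes last, mirroring the synchronizing gateway in the BPMN model.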

Data Process: Animal Movement

The data process reads raw position readings and turns them into hourly movement facts:

• Input Data — File: EAR34-Movement.csv, Type: CSV

• Convert Column — Column: Timestamp : Date

• Add Column — Column: Coordinates, Expression: SDO_POINT_TYPE(Lat, Long, NULL)

• Lookup — Retrieve: FunctionAreaType, Database: AgriProKnowDWH, Table: FarmFunctionArea, Where: Area Contains Coordinates; NotFound → Insert Data (File: BadCoordinates.txt, Type: Text); Found → continue

• Aggregate — Group By: EXTRACT(Hour FROM Date), EXTRACT(Year FROM Date), EXTRACT(Month FROM Date), EXTRACT(Day FROM Date), NationalID, FunctionAreaType; Columns: Duration = COUNT(*)

• Insert Data — Database: AgriProKnowDWH, Table: Movement

The data evolve along the process as follows.

Raw input (CSV):
NationalID,Lat,Long,Timestamp
AT-12,5,10,1537348997000
AT-12,6,10,1537348998000
AT-23,7,15,1537348997000
AT-23,7,15,1537348998000

After converting the timestamp:
NationalID | Lat | Long | Timestamp
AT-12 | 5 | 10 | Sep 19, 2018 09:23:17
AT-12 | 6 | 10 | Sep 19, 2018 09:23:18
AT-23 | 7 | 15 | Sep 19, 2018 09:23:17
AT-23 | 7 | 15 | Sep 19, 2018 09:23:18

After adding the coordinates column:
NationalID | Coordinates | Timestamp
AT-12 | (5, 10) | Sep 19, 2018 09:23:17
AT-12 | (6, 10) | Sep 19, 2018 09:23:18
AT-23 | (7, 15) | Sep 19, 2018 09:23:17
AT-23 | (7, 15) | Sep 19, 2018 09:23:18

Lookup table FarmFunctionArea:
FunctionArea | Area | FunctionAreaType
1stFarmFeeding | [(0,0);(6,12)] | Feeding
1stFarmResting | [(5,14);(10,20)] | Resting

After the lookup:
NationalID | FAType | ... | Timestamp
AT-12 | Feeding | ... | Sep 19, 2018 09:23:17
AT-12 | Feeding | ... | Sep 19, 2018 09:23:18
AT-23 | Resting | ... | Sep 19, 2018 09:23:17
AT-23 | Resting | ... | Sep 19, 2018 09:23:18

After aggregation:
NatlID | FAType | Hour | Year | Month | Day | Dur
AT-12 | Feeding | 10 | 2018 | 9 | 19 | 2
AT-23 | Resting | 10 | 2018 | 9 | 19 | 2
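A minimal Python sketch of the same data process is shown below: convert the epoch timestamp, look up the function-area type by coordinate containment, and count readings per animal, area type, and hour. The rectangle test stands in for the spatial `Area Contains Coordinates` predicate, the hour is taken in UTC, and the in-memory tables replace the database; all of these are simplifying assumptions, not the modeled implementation.

```python
import csv
import io
from datetime import datetime, timezone

# Function areas as rectangles [(x1, y1); (x2, y2)], as in the lookup table.
FARM_FUNCTION_AREAS = [
    ("1stFarmFeeding", (0, 0, 6, 12), "Feeding"),
    ("1stFarmResting", (5, 14, 10, 20), "Resting"),
]

def area_type(lat, lon):
    """Lookup step: which function-area type contains the coordinates?"""
    for _name, (x1, y1, x2, y2), fa_type in FARM_FUNCTION_AREAS:
        if x1 <= lat <= x2 and y1 <= lon <= y2:
            return fa_type
    return None  # NotFound branch: would go to BadCoordinates.txt

def process(csv_text):
    """Run the data process over a CSV string; return (facts, bad rows)."""
    durations, bad = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Convert Column: epoch milliseconds -> date (UTC for determinism).
        ts = datetime.fromtimestamp(int(row["Timestamp"]) / 1000, tz=timezone.utc)
        fa = area_type(int(row["Lat"]), int(row["Long"]))
        if fa is None:
            bad.append(row)
            continue
        # Aggregate: Duration = COUNT(*) per group.
        key = (row["NationalID"], fa, ts.hour, ts.year, ts.month, ts.day)
        durations[key] = durations.get(key, 0) + 1
    return durations, bad

raw = """NationalID,Lat,Long,Timestamp
AT-12,5,10,1537348997000
AT-12,6,10,1537348998000
AT-23,7,15,1537348997000
AT-23,7,15,1537348998000
"""
durations, bad = process(raw)
print(durations)
```

With the example input, AT-12's two readings fall into the feeding area and AT-23's into the resting area, each yielding one fact with a duration count of 2.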

ETL Patterns (Oliveira et al. [23, 24])

Identify conceptual models for a set of standard ETL processes such as change data capture, slowly changing dimensions, and surrogate key pipelining [23].

The goal is to foster code reusability.

Oliveira et al. [24] also extend the BPMN metamodel with concepts specific to ETL processes.

42/131
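One of the named standard patterns, the type-2 slowly changing dimension, can be sketched as follows; the row layout and surrogate-key handling are illustrative assumptions, not the conceptual models of [23].

```python
from datetime import date

# Type-2 slowly changing dimension: on an attribute change, close the
# current row and insert a new version with a fresh surrogate key.
dim = []        # dimension rows
next_sk = [1]   # surrogate key counter

def scd2_upsert(natural_id, attrs, today):
    """Insert or version a dimension row; return its surrogate key."""
    current = next((r for r in dim
                    if r["natural_id"] == natural_id and r["valid_to"] is None),
                   None)
    if current is not None:
        if current["attrs"] == attrs:
            return current["sk"]       # no change: keep the current version
        current["valid_to"] = today    # change: close the old version
    sk = next_sk[0]; next_sk[0] += 1
    dim.append({"sk": sk, "natural_id": natural_id, "attrs": attrs,
                "valid_from": today, "valid_to": None})
    return sk

scd2_upsert("AT-12", {"breed": "Holstein"}, date(2018, 1, 1))
scd2_upsert("AT-12", {"breed": "Holstein"}, date(2018, 6, 1))   # unchanged
scd2_upsert("AT-12", {"breed": "Simmental"}, date(2018, 9, 1))  # new version
print(len(dim))  # 2 versions of AT-12
```

The surrogate keys returned here are what a surrogate key pipeline would substitute for natural keys in incoming fact rows.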

Mining ETL Patterns (Theodorou et al. [30])

ETL process modeling also facilitates a comprehensive analysis of ETL processes based on mining for ETL patterns, using the Workflow Patterns Initiative as a guide.

What are the use cases for such mined ETL patterns?

• Identify recurring patterns in existing ETL processes to subsequently redesign the ETL processes.

• Apply quality metrics on ETL process models at a higher level of abstraction.

• Show a higher-level summary of the ETL process to foster understanding.

43/131


ETL Processes for Big Data (Bala et al. [4])

Massive distribution and parallelization is key to handling big data processing.

Employ distribution and parallelization for ETL processes with big data!

• Describe the ETL process in terms of core functionalities

• Distribute processing of these core functionalities to multiple nodes

⇒ Conceptual modeling is key to effective optimization of ETL processes in the age of big data

44/131


ETL Processes for Big Data (Bala et al. [4])

An ETL library contains a list of ETL functionalities, which can be used to design ETL processes.

[Figure: Example description of the LookUp functionality in the ETL library [4] — source, target, lookup table, error and regular outputs, which data will be stored, and the cache mode.]

45/131

Modeling Transformations for Data Mining (Ordonez et al. [25])

Data mining algorithms require the source data in a very specific format.

The source data, however, are often scattered across multiple datasets/relations (even in a data warehouse).

Transformations include denormalizations and aggregations, where denormalization is a rather broad term that also includes applying complex expressions on attributes.

Modeling transformations as separate entities along with an SQL query makes it possible to track the lineage of the data.


Example: Source Schema

Animal(NationalID PK, Name, Sex, Breed)
MilkYield(AnimalNationalID FK PK, Date PK, Time PK, MilkYield)
Feeding(AnimalNationalID FK PK, Date PK, Time PK, Quantity, FeedMixID FK)
FeedMix(FeedMixID PK, FeedMixType, PercentRoughage, PercentSilage)

Example: Transformation for Data Mining

Can the feed intake serve as a predictor for milk yield on the next day?

A data mining algorithm may answer that question.

But first, we need to obtain a data set that contains the animal's milk yield on a particular date along with the feed intake from the day before.
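The joining of each day's milk yield with the previous day's feed intake can be sketched as follows. The record layouts are illustrative assumptions; in the tutorial this is achieved by the SQL-based transformation entities T1-T5.

```python
from datetime import date, timedelta

# Milk yields per animal and day (cf. the aggregation in T3).
milk = [
    ("AT-12", date(2018, 9, 20), 28.5),
    ("AT-12", date(2018, 9, 21), 30.1),
]
# Feed quantities per animal and day (cf. the aggregation in T4).
feed = [
    ("AT-12", date(2018, 9, 19), 41.0),
    ("AT-12", date(2018, 9, 20), 43.5),
]

def mining_table(milk, feed):
    """Pair each milk yield with the same animal's previous-day feed intake."""
    feed_by_key = {(a, d): q for a, d, q in feed}
    rows = []
    for animal, d, yield_kg in milk:
        q = feed_by_key.get((animal, d - timedelta(days=1)))
        if q is not None:  # keep only days with a known previous-day intake
            rows.append((animal, d, yield_kg, q))
    return rows

print(mining_table(milk, feed))
# two rows, each pairing a milk yield with the prior day's feed intake
```

The result is the flat, one-row-per-observation data set that a mining algorithm expects.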

Transformation Entity: Denormalization

[Figure: T1 (denormalization, via SQL) joins Animal and MilkYield into a relation (AnimalNationalID FK PK, Date FK PK, Time FK PK, MilkYield, AnimalBreed).]

Transformation Entity: Denormalization

[Figure: T2 (denormalization, via SQL) joins Feeding and FeedMix into a relation (AnimalNationalID FK PK, Date FK PK, Time FK PK, FeedMixID FK, FeedMixType, QuantityRoughage, QuantitySilage), where the quantities are derived from Quantity and the feed-mix percentages.]

Transformation Entity: Aggregation

[Figure: T3 (aggregation, via SQL) aggregates the result of T1 per (AnimalNationalID FK PK, AnimalBreed PK, Date PK), yielding MilkYield. T4 (aggregation, via SQL) aggregates the result of T2 per (AnimalNationalID FK PK, FeedMixID FK PK, FeedMixType PK, Date PK), yielding QuantityRoughage and QuantitySilage.]

Transformation Entity: Target

[Figure: T5 (denormalization) combines the results of T3 and T4 into the target relation (AnimalNationalID FK PK, MilkingDate PK, FeedingDate PK, FeedMixID FK PK, AnimalBreed PK, MilkYield, QuantitySilage, QuantityRoughage).]

Superimposed Multidimensional Schemas

In some cases, it may be impractical to extract the data from the source systems.

⇒ Volume/Velocity/Volatility

Rather, a multidimensional schema with mapping rules may be superimposed over the sources.

Further Reading
M. Hilal, C. G. Schuetz, M. Schrefl. Using superimposed multidimensional schemas and OLAP patterns for RDF data analysis. Open Computer Science, 8(1), 18-37, 2018.

53/131


Example: Superimposed Multidimensional Schemas for Linked Data Analysis [12]

Repositories of linked data such as Wikidata can be an important resource for data analysis.

• RDF data do not follow a structure suitable for OLAP-style data analysis.

• These data are not under the analyst's control.

• Exploiting these data is not an easy task for casual analysts and requires knowledge of SPARQL.

⇒ Superimposition of multidimensional schemas renders these data accessible for OLAP

54/131
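To make the idea concrete, the following sketch maps schema elements onto SPARQL graph patterns and assembles an aggregate query from them. The schema layout is an illustrative assumption, not the mapping formalism of [12]; the Wikidata identifiers used (P31 = instance of, Q11424 = film, P577 = publication date) should be double-checked against Wikidata.

```python
# A superimposed multidimensional schema maps cube elements onto graph
# patterns in the source, so OLAP-style queries can be generated without
# extracting the data.
SCHEMA = {
    "fact": "film",
    "fact_pattern": "?film wdt:P31 wd:Q11424 .",  # instance of: film
    "dimensions": {
        "year": "?film wdt:P577 ?date . BIND(YEAR(?date) AS ?year)",
    },
    "measure": "(COUNT(?film) AS ?films)",
}

def generate_query(schema, dimension):
    """Generate an aggregate SPARQL query from the superimposed schema."""
    return (
        f"SELECT ?{dimension} {schema['measure']} WHERE {{\n"
        f"  {schema['fact_pattern']}\n"
        f"  {schema['dimensions'][dimension]}\n"
        f"}} GROUP BY ?{dimension}"
    )

print(generate_query(SCHEMA, "year"))
```

An analyst who thinks in terms of the cube (films per year) never has to write the SPARQL by hand; the mapping rules do it.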

Analytical SPARQL Query over Wikidata

Film Cube over Wikidata

INTEGRATION, AGGREGATION, AND REPRESENTATION

Data Integration

Most ETL processes integrate data from multiple sources.

The presented techniques for conceptual ETL process modeling account for that fact.

With the emergence of the (semantic) web and social media, the data generated on web platforms have become a valuable resource for analysis.

57/131

Fusion Cubes (Abelló et al. [2])

The vision:

• Complement existing data cubes with fusion cubes that include external data from RDF and linked data sources.

• Provide a drill-beyond operator to allow the user to define how and where an existing cube should be extended with external data.

• Business intelligence should become truly self-service.

⇒ A uniform representation format might help

58/131


QB4OLAP: BI Vocabulary for Linked Data (Etcheverry et al. [8])

QB4OLAP extends the W3C's Data Cube (QB) vocabulary with concepts required for OLAP, e.g., hierarchies.

⇒ Representation of statistical linked data.

In AgriProKnow, QB4OLAP serves for the semantic description of the data warehouse schema, where elements can be linked to domain ontologies and websites.

QB4OLAP may also serve as the vocabulary for superimposed multidimensional schemas [12].

59/131

Social Business Intelligence

Combines data from companies (e.g., sales) with data generated by users on social media.

Often, social business intelligence involves sentiment analysis of user content based on natural language processing.

The results of such analysis may be stored in cubes for further analysis [9].

Example query: What is the average sentiment towards smartphones?

60/131
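The example query can be sketched as an aggregation over sentiment-annotated posts; the post records, the sentiment scale in [-1, 1], and the scores are made-up assumptions, standing in for the output of an NLP step.

```python
from collections import defaultdict
from statistics import mean

# Posts annotated with a sentiment score by some upstream NLP step.
posts = [
    {"category": "smartphone", "brand": "A", "sentiment": 0.6},
    {"category": "smartphone", "brand": "B", "sentiment": -0.2},
    {"category": "tablet", "brand": "A", "sentiment": 0.1},
]

def avg_sentiment(posts, dimension):
    """Average sentiment rolled up along one cube dimension."""
    groups = defaultdict(list)
    for p in posts:
        groups[p[dimension]].append(p["sentiment"])
    return {value: mean(scores) for value, scores in groups.items()}

print(avg_sentiment(posts, "category"))
```

Storing the scores in a cube means the same roll-up works along any dimension, e.g. `avg_sentiment(posts, "brand")` for a per-brand view.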

ANALYSIS AND MODELING

Business Intelligence Model (BIM) (Horkoff et al. [13])

Representation of business strategies:

• Goals, which are selected from the Balanced Scorecard dimensions (financial, customer, processes, learning), at strategic, tactical, or operational level.

• Situations represent internal and external factors that influence goals positively or negatively.

• Processes aim to achieve the goals.

• Key Performance Indicators (→ later in this tutorial).

61/131

Example: BIM for AgriProKnow

[Figure: BIM diagram — the Farmer desires the goal "Maximize milk yield", evaluated by the indicator "Milk yield". "Prevent animal illness" AND "Optimize feed intake" contribute (++) to it; the process "Automatic Feeding" supports the goals. Situations: "Well-fed animals" (Strength, indicator: Body Condition Score) and "Antibiotics resistance" (Threat, indicator: # of known resistant germs).]

Business-Driven Data Analytics (Nalchigar and Yu [20])

Requirements analysis and design of data analytics systems has multiple, complementary views.

• Business view: Starting from the business goals, the data analytics goals are defined.

• Data analytics design view: Explore different methods to achieve the data analytics goals by comparing their strengths and weaknesses.

• Data preparation view: Define what data sets and data preparation steps are required to perform the chosen analytics → similar to ETL models but at a higher level

63/131


Business View: Animal Illness

Figure: Business view for the animal-illness example (goal model). The farmer desires to maximize milk yield (evaluated by the indicator Milk yield), which decomposes (AND) into the goals Prevent animal illness and Optimize feed intake. Decisions: feed change and calling the veterinarian. Questions: Which animals are at risk? Which feed mix is best for the animals? The question on animals at risk is answered by the "Animals at risk" predictive model (type = Predictive model, input = movement and health data, output = Alert, usageFrequency = daily, updateFrequency = quarterly, learningPeriod = 12 months).

Data Analytics Design View: Animal Illness

Figure: Data analytics design view for the animal-illness example. The analytics goal Predict animal illness is addressed by a classification of animals, realized as the "Animals at risk" predictive model (type = Predictive model, input = movement and health data, output = Alert, usageFrequency = daily, updateFrequency = quarterly, learningPeriod = 12 months). The design is evaluated by the quality indicators Precision and Recall; the candidate algorithms Logistic Regression and Deep Learning achieve different precision/recall values (0.85, 0.75, 0.65, 0.55 in the diagram) and differ in their tolerance to missing values (++ vs. -).

66/131

Reference Modeling: The BIRD Approach [27]

The idea stems from a small industry project we had a couple of years ago.

Lightweight reference models for OLAP cubes, calculated measures, and business terms should be customizable for different small and medium-sized companies within an industry, or for large companies with multiple divisions.

Calculated measures and business terms are represented using snippets of SQL code.

Further Reading
C. G. Schuetz, B. Neumayr, M. Schrefl, T. Neuböck. Reference Modeling for Data Analysis: The BIRD Approach. International Journal of Cooperative Information Systems, 25(2):1–46, 2016.

Example: Reference Model

Figure: BIRD reference model with the fact classes MaterialUsedForProduct (measures plannedQuantity, actualQuantity «mandatory», plannedCosts, actualCosts «mandatory»; derived measures /plannedCostsPerUnit, /actualCostsPerUnit «mandatory», /actualCostsYTD, /actualCostsToPreviousDay) and MaterialInSupplyOrder (costs «mandatory», shippingCosts; derived /totalCosts, /totalCostsYTD, /totalCostsToPrevWeek). Dimensions: Product (productOrder, productCategory, with the subclasses DurableProduct with minLifeTime, ColdResistantProduct with minTemperature, HeatResistantProduct with maxTemperature), Time (day, week, month, quarter, year), Material (material, materialCategory, with ColdResistantMaterial and HeatResistantMaterial), Factory (building, site, country), and Customer (customer, customerRegion, industry, consumerGroup). «mandatory» marks elements that every customization must retain.

Example: Reference Model Customization

Figure: Customization of the reference model. Elements prefixed with + are company-specific additions: the measures orderedQuantity and deliveredQuantity and the derived measure /totalCostsPerUnit on MaterialInSupplyOrder, as well as a new Supplier dimension (supplier, supplierRegion). The remaining fact classes, measures, and dimensions (Product, Time, Material, Factory, Customer) are taken over from the reference model; the «mandatory» markers are resolved.

69/131
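The customization mechanism can be pictured as merging a company-specific delta into the reference model: mandatory elements are always kept, optional elements are selected, and additions (the elements prefixed with + above) extend the model. A minimal Python sketch, with the dictionary encoding invented for illustration (BIRD itself works on DFM models with XML customization files):

```python
# Toy encoding of one reference fact class: mandatory measures must be
# kept by every customization, optional measures may be selected.
REFERENCE = {
    "MaterialInSupplyOrder": {
        "mandatory": ["costs"],
        "optional": ["shippingCosts", "/totalCosts", "/totalCostsYTD",
                     "/totalCostsToPrevWeek"],
    }
}

def customize(reference, fact, keep_optional, additions):
    """Build a company-specific fact class from the reference model."""
    ref = reference[fact]
    unknown = set(keep_optional) - set(ref["optional"])
    if unknown:
        raise ValueError(f"not in reference model: {unknown}")
    # Mandatory elements are always kept; optional ones only on request;
    # additions (prefixed '+') extend the reference model.
    return ref["mandatory"] + list(keep_optional) + ["+ " + a for a in additions]

measures = customize(REFERENCE, "MaterialInSupplyOrder",
                     keep_optional=["shippingCosts", "/totalCosts"],
                     additions=["orderedQuantity", "/totalCostsPerUnit"])
print(measures)
```

A real customization would also cover dimensions, hierarchy levels, and redefinitions of calculated measures.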

Tool Support [27]

Implementation using Indyco Builder as the modeling tool, XML for the specification of business terms and calculated measures as well as customizations/redefinitions, and XQuery to apply the transformations to the reference model.

70/131

Applicability of BIRD to AgriProKnow

AgriProKnow aims to integrate data from multiple farms in order to generate new process knowledge, e.g., early indicators of, as well as influence factors for, animal illness.

The generated knowledge should be applied for rule-based process monitoring and control, e.g., by calling a veterinarian when danger of illness is detected.

Thus, once operational, the AgriProKnow data warehouse could consist of two parts:

• Inter-farm data warehouse: Integrates the data from various sources in order to generate new knowledge.

• Farm-specific data warehouses: A data warehouse for each farm, built through customization of a reference model; the analysis rules are executed over the farm-specific data warehouses.


AgriProKnow Reference Model [27]

72/131

AgriProKnow Reference Model [27]

73/131

AgriProKnow Reference ModelCustomization [27]

74/131

Fact Table [27]

A fact table is generated by Indyco Builder based on a customized reference model.

75/131

Query [27]

Queries are formulated using analysis situations (or OLAP patterns) and then automatically translated into SQL.

XQuery functions take the customized reference model and the SQL DDL statements to generate SQL queries for analysis situations.

76/131

OLAP Patterns: Basic Idea

77/131

OLAP Patterns: Basic Idea

78/131

OLAP Patterns: Basic Idea

79/131

OLAP Patterns: Definition

80/131

OLAP Patterns: Examples

81/131

Enhanced Dimensional Fact Model (eDFM)

82/131

eDFM in QB/QB4OLAP + Extension

83/131

OLAP Patterns: Description Form

84/131

OLAP Patterns: Description

85/131

OLAP Patterns: Framework

86/131

OLAP Patterns: RDF Definition

:HomogeneousIndependentSetComparison a pl:Pattern ;
    pl:name "Homogeneous independent-set comparison"@en ;
    pl:situation "Compare SI and SC with the same ..."@en ;
    pl:solution "The fact class, dimensions, grouping ..."@en ;
    pl:structure "SI: 1 fact class, 1..* selection ..."@en ;
    pl:example "Calculate the delta (comparative ..."@en ;
    pl:hasElement :base, :baseSlice, :measure, :dimensionLevel,
        :dimension, :measureNotNull, :siSlice, :scSlice,
        :compMeasure, :compHaving, :SetOfInterest, :SetOfComparison ;
    pl:result :compMeasure, :dimensionLevel,
        [pl:element :measure ; pl:elementPrefix "SI_"],
        [pl:element :measure ; pl:elementPrefix "SC_"] .

87/131

OLAP Patterns: Instantiation

:DeltaMilkYield a pl:QbPatternInstance ;
    pl:instanceOf :HomogeneousIndependentSetComparison ;
    :base agri:Milk ;
    :baseSlice :DateIn2017 ;
    :measure :SumOfMilkYield ;
    :dimension agri:Animal, agri:FarmSite ;
    :dimensionLevel agri:Animal, agri:FarmSite ;
    :siSlice :today ;
    :scSlice :prior5days ;
    :compMeasure :DeltaMilkYield ;
    :compHaving :positiveDeltaMilkYield .

88/131

OLAP Patterns: Measures and Predicates

89/131

OLAP Patterns: Measures and Predicates

90/131

Pattern Expression

For each pattern, a generic query template in a target language is defined – the pattern expression.

That target language can be SQL but also another language such as SPARQL [12].

Upon pattern instantiation, predicate and measure expressions are inserted into the placeholders of the pattern expression.

91/131
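The placeholder mechanism can be sketched with Python's string templates; the SQL template and the bindings below are invented toy examples, not actual pattern expressions from [12]:

```python
from string import Template

# Toy pattern expression: a generic SQL query template with placeholders
# for the fact table, the measure expression, and a predicate expression.
# Upon instantiation, the concrete expressions replace the placeholders.
PATTERN_EXPRESSION = Template(
    "SELECT $level, $measure AS result\n"
    "FROM $fact\n"
    "WHERE $predicate\n"
    "GROUP BY $level"
)

def instantiate(bindings):
    # substitute() raises KeyError on unbound placeholders, which surfaces
    # incomplete pattern instantiations instead of emitting broken SQL.
    return PATTERN_EXPRESSION.substitute(bindings)

query = instantiate({
    "fact": "Milk",
    "level": "animal",
    "measure": "SUM(milkYield)",
    "predicate": "year = 2017",
})
print(query)
```

Real pattern expressions are more elaborate (set-of-interest vs. set-of-comparison subqueries, comparative measures), but the principle is the same: one template per pattern, filled per instantiation.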


OLAP Patterns: Guided Instantiation

The RDF representation of pattern and multidimensional model elements, as well as the relationships among those elements, may serve to build a "wizard" for guided query instantiation.

A demonstration video can be found here:
https://www.youtube.com/watch?v=BLt6heO7WKY

92/131

Analysis Graphs (Neuböck et al. [21, 27])

Analysis graphs explicitly represent knowledge about analysis processes.

Potential applications:

• Documentation of analysis processes

• Tool support for exploratory OLAP

• Automation of complex analysis processes

• Representation format for analysis process mining

93/131

Example: Analysis Graph (Bird's-Eye View)

Figure: An unrefined analysis graph for analysis in the event of order cancellation [27]. Nodes include: Material Order Canceled; Quantity and Expected Delivery Time of Undelivered Material; Orders from other Suppliers that Contain Undelivered Material; Products that Contain Undelivered Material; Ordered Material with Properties Similar to Undelivered Material; List of Customer Orders Affected by Material Order Canceling.

94/131

Example: Analysis Graph

Figure: Example navigation step between analysis situations [27]. The analysis situation MonthlyCostsOfMaterialUse (factClass = MaterialUsedForProduct, measure = {SUM(actualCosts)}, granularity = Time.month, with open material and time parameters ?mat, ?tmLevel, ?tm) is connected via the navigation step FocusOnProperty – which applies addSliceCondition(Material, ?prop) and moveToNode(Product, Product.productCategory, ?prodCat) – to the analysis situation MonthlyCostsOfMaterialSupplyOrderWithProperty, which additionally has sliceCondition = ?prop and dices on product category ?prodCat.
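The navigation step above can be sketched in a few lines of Python: an analysis situation is an immutable record of query parameters, and a navigation operator derives a new situation from an existing one. The reduced set of parameters and the operator name are illustrative:

```python
from dataclasses import dataclass, replace
from typing import Optional

# Sketch of an analysis situation, loosely following the example above.
@dataclass(frozen=True)
class AnalysisSituation:
    fact_class: str
    measure: str
    granularity: str
    slice_condition: Optional[str] = None
    dice_node: Optional[str] = None

def add_slice_condition(situation, condition):
    """Navigation operator: derive a new situation with an extra slice
    condition; the original situation stays unchanged (frozen dataclass)."""
    return replace(situation, slice_condition=condition)

start = AnalysisSituation("MaterialUsedForProduct", "SUM(actualCosts)",
                          "Time.month")
focused = add_slice_condition(start, "material.minTemperature < 0")
print(focused.slice_condition)
```

Immutability mirrors the analysis-graph idea: each navigation step yields a new node rather than mutating the previous one, so the whole analysis process remains reconstructible.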

OLAP Endpoints for Linked Data [11]

Linked data repositories could provide an endpoint for OLAP analysis, based on superimposed multidimensional schemas and analysis graphs, in order to facilitate exploration and analysis of the data repository.

OLAP Endpoints for Linked Data [11]: Video

A demonstration video of a preliminary version can be found here: https://youtu.be/ymhkqla8J1I

We have since improved the appearance and are currently preparing a user study.

97/131

INTERPRETATION AND ACTION

Summarizability

What is summarizability about? Correct interpretation.

Conceptual modeling may help to ensure summarizability, or at least make issues with summarizability explicit:

• Concepts in the modeling language [22, 17, 15]

• Constraint-based approaches [16, 14, 1]

98/131


Conditions for Summarizability

Figure: Condition 1: Disjointness (Strict Hierarchies). The product DaVinciCode (profit = 10) rolls up to both the Book and the Entertainment category, while HonoluluSkirt (profit = 20) belongs only to Clothing. Summing the category profits therefore double-counts DaVinciCode: 10 + 10 + 20 = 40 at the category level versus the correct total of 10 + 20 = 30 at the All level.

99/131

Conditions for Summarizability

Figure: Condition 2: Completeness (Balanced Hierarchies). In the geography hierarchy city → canton → country, the Swiss cities Lausanne (profit = 10) and Montreux (profit = 5) roll up to the canton Vaud (profit = 15), but the Austrian cities Salzburg (profit = 15) and Vienna (profit = 40) have no canton. Summing profits at the canton level therefore misses the Austrian contribution to the total at the All level (profit = 70).

100/131
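Both conditions lend themselves to mechanical checks over the dimension data. A sketch using the numbers from the two figures above (the dictionary encoding is illustrative):

```python
# Condition 1 (disjointness): each child rolls up to exactly one parent.
product_to_categories = {
    "DaVinciCode": ["Book", "Entertainment"],   # member of two categories!
    "HonoluluSkirt": ["Clothing"],
}

def is_disjoint(mapping):
    return all(len(parents) == 1 for parents in mapping.values())

# Summing category totals double-counts DaVinciCode's profit:
profit = {"DaVinciCode": 10, "HonoluluSkirt": 20}
category_sum = sum(profit[p] for p, cats in product_to_categories.items()
                   for _ in cats)
assert category_sum == 40        # 10 + 10 + 20, not the true total of 30

# Condition 2 (completeness): every child has a parent at the next level.
city_to_canton = {"Lausanne": "Vaud", "Montreux": "Vaud",
                  "Salzburg": None, "Vienna": None}  # no Austrian cantons

def is_complete(mapping):
    return all(parent is not None for parent in mapping.values())

print(is_disjoint(product_to_categories), is_complete(city_to_canton))
```

Constraint-based approaches [16, 14, 1] formalize checks of this kind as integrity constraints on the dimension tables.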

Attribute Groups in DFM

101/131

Generalized/Hetero-Homogeneous Hierarchies

Figure: Hetero-homogeneous dimension hierarchy of agents. A generic schema Sensor : ⟨T⟩ with levels ⟨agent⟩ → ⟨agentType⟩ is stepwise concretized: Person : ⟨agentType⟩ (with position and age), Process : ⟨agentType⟩ (with ⟨processType⟩), and Device : ⟨agentType⟩ (with levels ⟨logicalDevice⟩ → ⟨deviceType⟩ and attribute nominalAccuracy); MilkingParlor : ⟨deviceType⟩, in turn, is a concretization of Device (with level ⟨milkingParlorType⟩ and attribute measuredIngredients).

Key Performance Indicators (KPIs)

The base measures are typically combined into more comprehensive indicators of economic success.

Definition [31, p. 362]
"KPIs are complex measurements used to estimate the effectiveness of an organization in carrying out their activities and to monitor the performance of their processes and business strategies. KPIs are traditionally defined with respect to a business strategy and business objectives"

103/131
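As a toy illustration of how base measures combine into an indicator that is monitored against a target, consider the calculated measure /actualCostsPerUnit from the reference model earlier; the target, tolerance, and traffic-light classification below are invented for illustration:

```python
# Sketch: a KPI combines base measures and is assessed against a target.
# The measure names echo the reference model; thresholds are illustrative.

def costs_per_unit(actual_costs, actual_quantity):
    """Calculated measure: /actualCostsPerUnit = actualCosts / actualQuantity."""
    return actual_costs / actual_quantity

def kpi_status(value, target, tolerance=0.10):
    """Classify a KPI value relative to its target (traffic-light style):
    at or below target is green, within the tolerance band is yellow,
    beyond the band is red."""
    if value <= target:
        return "green"
    if value <= target * (1 + tolerance):
        return "yellow"
    return "red"

cpu = costs_per_unit(actual_costs=1050.0, actual_quantity=100)
print(cpu, kpi_status(cpu, target=10.0))   # 10.5 yellow
```

Approaches such as BIM (next slide) aim to derive such KPI definitions systematically from business goals rather than ad hoc.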


Goal Modeling and KPIs (Maté et al. [18])

The Business Intelligence Model (BIM) can be employed for the systematic derivation of KPIs that are in line with business strategy.

104/131

Example: BIM for AgriProKnow

Figure: BIM goal model for AgriProKnow. The farmer desires to maximize milk yield (evaluated by the indicator Milk yield), which decomposes (AND) into the goals Prevent animal illness and Optimize feed intake. The process Automatic Feeding leads to the situation Well-fed animals, a strength evaluated by the indicator Body Condition Score; the situation Antibiotics resistance is a threat, evaluated by the indicator # of known resistant germs.

Goal Modeling and KPIs (Maté et al. [18])

The Business Intelligence Model (BIM) can be employed for the systematic derivation of KPIs that are in line with the overall business strategy.

Using the Semantics of Business Vocabulary and Rules (SBVR), the KPIs are subsequently precisely specified in Structured English.

The KPIs defined in SBVR then translate into executable MDX queries over a multidimensional schema.

106/131

Goal-Based Selection of Visualizations (Golfarelli et al. [10])

Idea: The user specifies their analysis goals and other parameters, which are subsequently used to recommend (or recommend against) certain types of visualizations.

Users may hence declare:

• Goal: composition, order, cluster, distribution, etc.

• Interaction: overview, zoom, filter, details-on-demand

• Experience: lay or tech person

• Dimensionality: n-dimensional, tree, graph

• Cardinality: low, high

• Type of measure: nominal, ordinal, interval, ratio

107/131

Goal-Based Selection of Visualizations (Golfarelli et al. [10])

Given a single criterion, a visualization may be fit, acceptable, neutral, discouraged, or unfit.

For example, a pie chart is fit for composition whereas a heat map is unfit. A pie chart is fit for giving an overview whereas a bubble chart is acceptable.

Given selections for multiple criteria, the optimal visualization types can be calculated based on the qualitative suitability of each visualization for the different criteria.

108/131
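The selection procedure can be sketched as a scoring problem: map the qualitative suitability levels to numbers and rank visualizations by their total over the declared criteria. Only the pie-chart/heat-map/bubble-chart ratings restate the examples above; the numeric scores and the simple sum are assumptions (Golfarelli et al. compute optimal visualizations from the qualitative profiles):

```python
# Map the five qualitative suitability levels to scores.
SCORE = {"fit": 2, "acceptable": 1, "neutral": 0,
         "discouraged": -1, "unfit": -2}

# Per-visualization suitability per criterion; unrated criteria
# default to neutral. Ratings beyond the slide's examples are invented.
SUITABILITY = {
    "pie chart":    {"composition": "fit", "overview": "fit"},
    "heat map":     {"composition": "unfit", "overview": "neutral"},
    "bubble chart": {"composition": "neutral", "overview": "acceptable"},
}

def rank(criteria):
    """Rank visualizations by total suitability over the given criteria."""
    totals = {viz: sum(SCORE[ratings.get(c, "neutral")] for c in criteria)
              for viz, ratings in SUITABILITY.items()}
    return sorted(totals, key=totals.get, reverse=True)

print(rank(["composition", "overview"]))   # pie chart ranks first
```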

VizDSL (Morgan et al. [19])

Goals:

• Platform-independent and extensible modeling language

• Non-IT experts are able to quickly and easily describe, model, and create interactive visualizations

Figure: Pipeline from structured source code via a VizDSL model to an interactive visualization

109/131

VizDSL (Morgan et al. [19])

Extension of the Interaction Flow Modeling Language (IFML) with modeling elements for interactive visualization of data.

Figure: The visual notation for VizDSL

110/131

VizDSL (Morgan et al. [19])

111/131

VizDSL (Morgan et al. [19])

112/131

VizDSL (Morgan et al. [19])

113/131

Analysis Rules

In AgriProKnow, we have implemented analysis rules based on the notion of OLAP patterns.

An action, e.g., calling a vet, can be triggered by noteworthy results of analyses that are periodically carried out.

The analyses are specified using OLAP patterns.

114/131
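A minimal sketch of such an analysis rule, using the delta-of-milk-yield comparison from the DeltaMilkYield pattern instance earlier; the threshold, data values, and alert mechanism are invented for illustration:

```python
# Sketch of an analysis rule: a periodically evaluated OLAP pattern
# instance yields per-animal deltas in milk yield; a noteworthy drop
# triggers an action (e.g., notifying the vet).

def delta_milk_yield(today, prior_average):
    """Comparative measure: today's yield minus the prior-period average."""
    return {animal: today[animal] - prior_average[animal] for animal in today}

def analysis_rule(deltas, threshold=-5.0, action=print):
    """Trigger the action for every animal whose delta is noteworthy."""
    for animal, delta in deltas.items():
        if delta <= threshold:
            action(f"ALERT: {animal} milk yield dropped by {-delta} kg")

deltas = delta_milk_yield({"cow-17": 18.0, "cow-42": 29.5},
                          {"cow-17": 26.0, "cow-42": 30.0})
analysis_rule(deltas)   # alerts only for cow-17 (delta -8.0)
```

In the actual system, the comparison is computed by the SQL query generated from the pattern instance, and the rule runs on a schedule over the farm-specific data warehouse.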

OPEN ISSUES

Open Issues

• Integration of conceptual models (or knowledge graphs) with machine/deep learning
→ overlap between ER and Semantic Web communities

• ... any thoughts?

115/131


References I

[1] Combining objects with rules to represent aggregation knowledge in data warehouse and OLAP systems.

[2] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen. Fusion cubes: towards self-service business intelligence. International Journal of Data Warehousing and Mining.

117/131

References II

[3] D. Agrawal, P. Bernstein, E. Bertino, S. Davidson, U. Dayal, M. Franklin, and others. Challenges and opportunities with big data – a community white paper developed by leading researchers across the United States. Technical report, 2012. https://cra.org/ccc/resources/ccc-led-whitepapers/ (last access: 2 October 2018).

118/131

References III

[4] M. Bala, O. Boussaid, and Z. Alimazighi. A fine-grained distribution approach for ETL processes in big data environments. Data & Knowledge Engineering, 111:114–136, 2017.

[5] S. Dobson, M. Golfarelli, S. Graziani, and S. Rizzi. A reference architecture and model for sensor data warehousing. IEEE Sensors Journal, 18(18):7659–7670, 2018.

119/131

References IV

[6] Z. El Akkaoui, J. Mazón, A. A. Vaisman, and E. Zimányi. BPMN-based conceptual modeling of ETL processes. In A. Cuzzocrea and U. Dayal, editors, DaWaK 2012, volume 7448 of LNCS, pages 1–14. Springer, 2012.

[7] Z. El Akkaoui and E. Zimányi. Defining ETL workflows using BPMN and BPEL. In Proceedings of the ACM 12th International Workshop on Data Warehousing and OLAP, pages 41–48, 2009.

120/131

References V

[8] L. Etcheverry, A. Vaisman, and E. Zimányi. Modeling and querying data warehouses on the semantic web using QB4OLAP. In L. Bellatreche and M. K. Mohania, editors, DaWaK 2014, volume 8646 of LNCS, pages 45–56. Springer, 2014.

[9] M. Golfarelli. Design issues in social business intelligence projects. In E. Zimányi and A. Abelló, editors, eBISS 2015, volume 253 of LNBIP, pages 62–86. Springer, 2016.

121/131

References VI

[10] M. Golfarelli, T. Pirini, and S. Rizzi. Goal-based selection of visual representations for big data analytics. In S. de Cesare and U. Frank, editors, ER 2017 Workshops, volume 10651 of LNCS, pages 47–57. Springer, 2017.

[11] M. Hilal, C. G. Schuetz, and M. Schrefl. An OLAP endpoint for RDF data analysis using analysis graphs. In Proceedings of the ISWC 2017 Posters & Demonstrations and Industry Tracks, 2017.

122/131

References VII

[12] M. Hilal, C. G. Schuetz, and M. Schrefl. Using superimposed multidimensional schemas and OLAP patterns for RDF data analysis. Open Computer Science, 8(1):18–37, 2018.

[13] J. Horkoff, D. Barone, L. Jiang, E. Yu, D. Amyot, A. Borgida, and J. Mylopoulos. Strategic business modeling: representation and reasoning. Software & Systems Modeling, 13(3):1015–1041, 2014.

123/131

References VIII

[14] C. Hurtado, C. Gutierrez, and A. Mendelzon. Capturing summarizability with integrity constraints in OLAP. ACM Transactions on Database Systems, 30:854–886, 2005.

[15] Indyco. Attribute groups, 2015. http://indyco.freshdesk.com/support/solutions/articles/1000212913-attribute-groups [Online; accessed 7-October-2018].

124/131

References IX

[16] J. Lechtenbörger and G. Vossen. Multidimensional normal forms for data warehouse design. Information Systems, 28:415–434, 2003.

[17] E. Malinowski and E. Zimányi. A conceptual model for temporal data warehouses and its transformation to the ER and the object-relational models. Data & Knowledge Engineering, 64(1):101–133, 2008.

[18] A. Maté, J. Trujillo, and J. Mylopoulos. Specification and derivation of key performance indicators for business analytics: A semantic approach. Data & Knowledge Engineering, 108:30–49, 2017.

125/131

References X

[19] R. Morgan, G. Grossmann, M. Schrefl, M. Stumptner, and T. Payne. VizDSL: A visual DSL for interactive information visualization. In J. Krogstie and H. A. Reijers, editors, CAiSE 2018, volume 10816 of LNCS, pages 440–455. Springer, 2018.

[20] S. Nalchigar and E. Yu. Business-driven data analytics: A conceptual modeling framework. Data & Knowledge Engineering, 117:359–372, 2018.

126/131

References XI

[21] T. Neuböck, B. Neumayr, M. Schrefl, and C. G. Schütz. Ontology-driven business intelligence for comparative data analysis. In E. Zimányi, editor, eBISS 2013, volume 172 of LNBIP, pages 77–120. Springer, 2014.

[22] B. Neumayr, M. Schrefl, and B. Thalheim. Hetero-homogeneous hierarchies in data warehouses. In S. Link and A. Ghose, editors, APCCM 2010, volume 110 of CRPIT, pages 61–70. Australian Computer Society, 2010.

127/131

References XII

[23] B. Oliveira and O. Belo. BPMN patterns for ETL conceptual modelling and validation, 2012.

[24] B. Oliveira, V. Santos, and O. Belo. Pattern-based ETL conceptual modelling, 2013.

[25] C. Ordonez, S. Maabout, D. S. Matusevich, and W. Cabrera. Extending ER models to capture database transformations to build data sets for data mining. Data & Knowledge Engineering, 89:38–54, 2014.

128/131

References XIII

[26] P. Russom. Hadoop for the enterprise. Technical report, TDWI, 2015. https://www.cloudera.com/content/dam/cloudera/Resources/PDF/Reports/TDWI-Best-Practices-Report_Hadoop-for-the-Enterprise.pdf (last access: 28 June 2016).

[27] C. G. Schuetz, B. Neumayr, M. Schrefl, and T. Neuböck. Reference modeling for data analysis: The BIRD approach. International Journal of Cooperative Information Systems, 25(2):1–46, 2016.

129/131

References XIV

[28] C. G. Schuetz, S. Schausberger, and M. Schrefl. Building an active semantic data warehouse for precision dairy farming. Journal of Organizational Computing and Electronic Commerce, 28(2):122–141, 2018.

[29] R. Sherman. Business Intelligence Guidebook. Morgan Kaufmann, 2015.

[30] V. Theodorou, A. Abelló, M. Thiele, and W. Lehner. Frequent patterns in ETL workflows: An empirical approach. Data & Knowledge Engineering, 112:1–16, 2017.

130/131

References XV

[31] A. Vaisman and E. Zimányi. Data Warehouse Systems – Design and Implementation. Springer, 2014.

[32] S. Williams. Business Intelligence Strategy and Big Data Analytics. Morgan Kaufmann, 2016.

131/131
