(Open) data analysis for decision support: challenges and essentials
With examples from Open Coesione

Antonio Vetrò, Technische Universität München, Germany
01 September 2014, Matera (Italy), RENA Summer School
@phisaz

With material from a joint work with:
Lorenzo Canova, Marco Torchiano (PoliTO - Politecnico di Torino)
Federico Morando, Raimondo Iemma (NEXA Center for Internet and Society - PoliTO)
Aline Pennisi (Ministero dell’Economia e delle Finanze)

Feedback from:
Andrea Milan (United Nations University)
Daniel Méndez Fernández (Technische Universität München)


This slideset was presented at the 2014 RENA Summer School on Good Government and Open Citizenship. It uses examples from an open dataset on EU funding in Italy to show the essentials and challenges of using open data to support decisions.


RENA Summer School 2014

• Deciding and implementing together
• Monitoring together
• Planning together

Outline

• Data analysis: a philosophical perspective, empiricism

• Data analysis challenges: examples with Open Data


Data

Knowledge Representation: World, Model, and Formal Theory

[Figure: diagram relating World, Model, and Theory. Labels: observation, simulation, deduction; approximation: {good, sufficient, insufficient}; interpretation: {true, false}]

Source: Klaus Mainzer, Modern Aspects of Philosophy of Science in Informatics and its Applications, Lehrstuhl für Philosophie und Wissenschaftstheorie, Carl von Linde-Akademie, Munich Center for Technology in Society, Technische Universität München


Data analysis: a philosophical perspective, empiricism

[Figure: diagram with the elements Theory/System of theories, Questions / Hypotheses, Observations / Evaluations, and Study population. Labels: Theory building, Pattern building, Falsification / support, Deductive logic, Inductive logic]

See also: Runeson et al., Case Study Research in Software Engineering: Guidelines and Experiments

• Each empirical method…
  • has a specific purpose
  • relies on a specific data type
  • has a specific setting

Purpose
• Exploratory
• Descriptive
• Explanatory / confirmatory
• Improving

Data Type
• Qualitative
• Quantitative


Data analysis: a philosophical perspective

[Figure: diagram with the elements Theory/System of theories, (Tentative) Hypotheses, Observations / Evaluations, and Study population. Labels: Theory building, Pattern building, Falsification / support, Deductive logic, Inductive logic]

Research methods placed on the diagram:
• Formal / conceptual analysis
• Grounded theory
• Exploratory: case/field studies, experiments, data analysis
• Survey / interview research
• Confirmatory: case studies, experiments, data analysis, …
• Ethnographic studies

• Deciding and implementing together
• Monitoring together
• Planning together

See also: Vessey et al., A unified classification system for research in the computing disciplines


Opportunities

Figure: Mike Lemansky, Open Data


& challenges


Open Coesione


Dataset “progetti” - FSE 2007/2013
Snapshot 31/12/2013
Focus on funding and dates, 22 of 74 columns

> colnames(subsetProgetti)
 [1] "FINANZ_UE"                        "FINANZ_STATO_FONDO_DI_ROTAZIONE"
 [3] "FINANZ_STATO_FSC"                 "FINANZ_STATO_PAC"
 [5] "FINANZ_STATO_ALTRI_PROVVEDIMENTI" "FINANZ_REGIONE"
 [7] "FINANZ_PROVINCIA"                 "FINANZ_COMUNE"
 [9] "FINANZ_ALTRO_PUBBLICO"            "FINANZ_STATO_ESTERO"
[11] "FINANZ_PRIVATO"                   "FINANZ_DA_REPERIRE"
[13] "FINANZ_TOTALE_PUBBLICO"           "DPS_DATA_INIZIO_PREVISTA"
[15] "DPS_DATA_FINE_PREVISTA"           "DPS_DATA_INIZIO_EFFETTIVA"
[17] "DPS_DATA_FINE_EFFETTIVA"          "DPS_FLAG_CUP"
[19] "DPS_FLAG_PRESENZA_DATE"           "DPS_FLAG_COERENZA_DATE_PREV"
[21] "DPS_FLAG_COERENZA_DATE_EFF"       "DATA_AGGIORNAMENTO"

Gallery of challenges: Guided Tour

Figure: Milepost5, 850 NE 81st Ave, Portland, OR 97213, http://milepost5.net/galleries/

Challenge #1: Errors in data


Errors can be introduced at:
- the source (observation, sensor)
- manual insertion
- ETL*

* extraction, transformation, and loading

Be careful before claiming errors: they might be “just” accuracy problems

Challenge #2: accuracy


Unione Europea       22.50
Fondo di Rotazione   21.76
Regione               0.74
                     -----
                     45.00

43

» Always refer to raw data
» If not possible, estimate accuracy on analysis (e.g., about 5% in the example above)
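The arithmetic behind an estimate of "about 5%" can be sketched as follows. This is a hedged reconstruction, not the author's calculation: it assumes that the "43" shown on the slide is the total displayed for the same project whose three components sum to 45.00, which the slide text does not state explicitly. The variable names are illustrative.

```python
from decimal import Decimal

# Component amounts as printed on the slide.
components = {
    "Unione Europea": Decimal("22.50"),
    "Fondo di Rotazione": Decimal("21.76"),
    "Regione": Decimal("0.74"),
}

# ASSUMPTION: "43" is the total displayed for the same project.
displayed_total = Decimal("43")

# Decimal avoids binary floating-point rounding in the sum.
component_sum = sum(components.values(), Decimal("0"))
discrepancy = component_sum - displayed_total
relative_discrepancy = discrepancy / displayed_total

print(f"sum of components:    {component_sum}")
print(f"absolute discrepancy: {discrepancy}")
print(f"relative discrepancy: {relative_discrepancy:.1%}")
```

Under that assumption the gap between the displayed value and the sum of the components is a little under 5%, which is the kind of margin to carry into any analysis that cannot go back to the raw data.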


Challenge #3: missing data


Completeness

[Figure: completeness chart; axis label: Value]

pcc = percentage of complete cells
pcrp = percentage of complete rows
sub: analysis on 22 attributes
int: analysis on the whole dataset
general: both sub and int
NA = NAs belong to the domain

Annotations on the chart:
• No dates
• NA in “finanziamenti”
• NA in “finanziamenti”, “pagamenti”, missing values in Ateco and other columns
• In 89% of projects dates are present
• Codes and descriptions: Ateco + other descriptions (shown twice)
• No rows are complete
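The two completeness measures named on the slide, pcc and pcrp, can be sketched as below. This is a minimal illustration and not the analysis behind the chart: the sample rows, the function names, and the rule for what counts as missing (None, an empty string, or the literal "NA") are assumptions. The `na_is_valid` switch mirrors the slide's note that NA may belong to the domain.

```python
def is_missing(value):
    """Treat None, empty strings, and the literal 'NA' as missing (assumed rule)."""
    return value is None or str(value).strip() in ("", "NA")


def completeness(rows, na_is_valid=False):
    """Return (pcc, pcrp) as percentages for a non-empty list of rows.

    pcc  = percentage of complete cells
    pcrp = percentage of complete rows
    """
    def missing(value):
        # When NA belongs to the domain, it counts as a valid value.
        if na_is_valid and str(value).strip() == "NA":
            return False
        return is_missing(value)

    total_cells = sum(len(row) for row in rows)
    complete_cells = sum(1 for row in rows for v in row if not missing(v))
    complete_rows = sum(1 for row in rows if not any(missing(v) for v in row))
    pcc = 100.0 * complete_cells / total_cells
    pcrp = 100.0 * complete_rows / len(rows)
    return pcc, pcrp


# Hypothetical rows: funding amount, start date, end date.
rows = [
    ["100.0", "2010-01-01", "2011-01-01"],
    ["NA", "2010-03-01", ""],
    ["250.5", "", ""],
    ["80.0", "2012-05-01", "2013-05-01"],
]

pcc, pcrp = completeness(rows)
print(f"pcc = {pcc:.1f}%, pcrp = {pcrp:.1f}%")
print("with NA as a valid value:", completeness(rows, na_is_valid=True))
```

Reporting both numbers is useful because they can diverge sharply: a dataset can have a high share of complete cells and still contain no complete rows.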


What to do with missing data

1. Understand the domain, e.g., NA or 0?
2. Find the motivation (e.g., a missing start date is o.k. if the project hasn’t started yet)
3. Understand how much they impact your analysis
4. You might also:
   – exclude rows with missing values
   – use imputation techniques:
     – mean substitution
     – regression substitution
     – group mean substitution
     – hot deck imputation
     – multiple imputation

Source: A. Mockus, Missing data in software engineering, Guide to Advanced Empirical Software Engineering, 2008
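Two of the listed imputation techniques, mean substitution and group mean substitution, can be sketched in a few lines. This is an illustrative sketch only: the record layout, the field names `region` and `funding`, and the sample values are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import mean


def mean_substitution(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def group_mean_substitution(records, group_key, value_key):
    """Replace each None in value_key with the mean of the record's own group.

    Raises KeyError if a group has no observed value at all.
    """
    observed_by_group = {}
    for rec in records:
        if rec[value_key] is not None:
            observed_by_group.setdefault(rec[group_key], []).append(rec[value_key])
    group_means = {g: mean(vs) for g, vs in observed_by_group.items()}
    return [
        {**rec, value_key: group_means[rec[group_key]]}
        if rec[value_key] is None else dict(rec)
        for rec in records
    ]


# Hypothetical records with missing funding amounts.
records = [
    {"region": "A", "funding": 100.0},
    {"region": "A", "funding": None},
    {"region": "A", "funding": 300.0},
    {"region": "B", "funding": 10.0},
    {"region": "B", "funding": None},
]

overall = mean_substitution([r["funding"] for r in records])
by_group = group_mean_substitution(records, "region", "funding")
print(overall)
print([r["funding"] for r in by_group])
```

The overall mean pulls every gap toward the same value, while the group mean keeps differences between groups; either way the imputed values reduce variance, so the share of imputed cells is worth reporting next to the results.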


Challenge #4: outliers


» Outliers can point to interesting facts


» … or to something which deserves a second look


[Figure: chart; axis label: Value]

pcvc = percentage of cells with correct value

Annotations on the chart:
• ca. 50000 funding amounts < 1€
• ca. 210000 < 5€
• ca. 360000 < 55€
• ca. 430000 < 89€
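Counts like the ones annotated on the chart come from tallying how many amounts fall below each threshold. A minimal sketch, using the thresholds from the slide (1, 5, 55, and 89 €) on a small hypothetical list of amounts; the function name and the sample data are assumptions.

```python
def count_below(amounts, thresholds):
    """Map each threshold to the number of amounts strictly below it."""
    return {t: sum(1 for a in amounts if a < t) for t in thresholds}


# Hypothetical funding amounts in euro.
amounts = [0.5, 0.0, 3.2, 40.0, 70.0, 1200.0, 15000.0]

counts = count_below(amounts, [1, 5, 55, 89])
for threshold, n in counts.items():
    share = 100.0 * n / len(amounts)
    print(f"< {threshold} EUR: {n} amounts ({share:.0f}%)")
```

The counts are cumulative, as on the slide: every amount below 1 € is also counted below 5 €, and so on.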


What to do with outliers

1. Retention
   – Check the distribution of data: if heavy tailed, keep them but don’t apply techniques which require normality
2. Exclusion
   – Remove them in case you think it is a measurement error or an exceptional case
3. Sensitivity analysis
   – compare results with and without outliers
   – reason on the motivations
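A sensitivity analysis can be as simple as computing the same summary twice, with and without the flagged values. The sketch below flags outliers with the common 1.5 × IQR rule; that rule is an assumption chosen for illustration, since the slides do not prescribe a detection method, and the sample values are hypothetical.

```python
from statistics import mean, median, quantiles


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed detection rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def sensitivity(values):
    """Compare mean and median with and without the flagged outliers."""
    mask = iqr_outlier_mask(values)
    kept = [v for v, is_outlier in zip(values, mask) if not is_outlier]
    return {
        "n_outliers": sum(mask),
        "mean_all": mean(values),
        "mean_without": mean(kept),
        "median_all": median(values),
        "median_without": median(kept),
    }


# Hypothetical amounts with one extreme value.
values = [10, 12, 11, 13, 12, 11, 500]
report = sensitivity(values)
for name, value in report.items():
    print(f"{name}: {value}")
```

If the conclusion changes depending on whether the flagged values are included, as the mean does here while the median barely moves, that dependence is itself a finding and should be reported together with the reasoning for keeping or excluding them.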


Challenge #5: Drawing proper conclusions

» Knowledge is more than statistical significance

» Context and domain knowledge are fundamental

» Consider both qualitative and quantitative aspects

» Triangulate data with other sources


Summing up and additional suggestions

Challenges: watch out for
- Errors
- Data accuracy
- Missing data
- Outliers
- Drawing proper conclusions

Verify data collection
- sampling
- reference time and context
- appropriateness to goals
- transformations

Check first what the data looks like
Most programs (Excel, SPSS, STATA, R, …) offer predefined functions

Keep track of:
- modifications and reasons
- different versions
- raw data
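The advice to look at the data before analysing it usually starts with a descriptive summary. A minimal sketch using only Python's standard library; the function name `describe`, the convention that None marks a missing value, and the sample values are assumptions.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return a small descriptive summary; None marks a missing value."""
    observed = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(observed),
        "min": min(observed),
        "median": median(observed),
        "mean": mean(observed),
        "max": max(observed),
        "stdev": stdev(observed),  # sample standard deviation
    }


# Hypothetical funding amounts with one gap and one extreme value.
summary = describe([10.0, None, 12.0, 11.0, 500.0])
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean far above the median, as in this sample, is an early hint of the outlier and distribution issues listed among the challenges.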


Interesting readings


Gallery of challenges: End of Guided Tour
