Data Service Centre and Apache Spark at Statistics Netherlands

Statistical process and DSC

Diagram labels: Website, StatLine, Open data, Articles, Books; DSC; Microdata services; RIN (four times)

DSC: not a traditional data warehouse

• Technical backend: the document management system Documentum (OpenText)
• Only statistical data that can be stored in rows and columns (no documents, images, etc.)
• Data stored as text files (CSV, fixed-width): future-proof
• Primary focus was archiving, but now increasingly on data exchange
• Data are retrieved and processed in SPSS, R, Python and custom-built systems
• Almost 14,000 datasets, mostly microdata
• Covers all domains: social statistics, business statistics, national accounts, health statistics, energy statistics, agricultural statistics, etc.
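Because the data are plain text files in rows and columns, they can be read with standard tooling. The sketch below illustrates this in Python. The column layout, the `;` delimiter and the sample records are assumptions made up for the example; they do not describe an actual DSC dataset, and the helper names `parse_fixed_width` and `parse_csv` are hypothetical.

```python
import csv
import io

# Hypothetical layout: (column name, start offset, width). Not an actual DSC layout.
LAYOUT = [("RINPERSOONS", 0, 1), ("RINPERSOON", 1, 9), ("GEM", 10, 4)]


def parse_fixed_width(text, layout):
    """Cut each non-empty line into named columns using fixed offsets."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        rows.append(
            {name: line[start:start + width].strip() for name, start, width in layout}
        )
    return rows


def parse_csv(text, delimiter=";"):
    """Read delimited text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter=delimiter)]


# Made-up sample records holding the same content in both formats.
fixed_text = "R0000000011111\nR0000000022222\n"
csv_text = "RINPERSOONS;RINPERSOON;GEM\nR;000000001;1111\nR;000000002;2222\n"

fixed_rows = parse_fixed_width(fixed_text, LAYOUT)
csv_rows = parse_csv(csv_text)
```

Both calls yield the same list of row dictionaries, which is the point of storing data in simple text formats: the content stays readable independent of the tool used.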

DSC Catalogue

System of Social statistical Datasets (SSD)

• Subset of data in DSC
• Highly coordinated
• Mostly based on administrative sources, some surveys
• ‘Backbones’ (persons, buildings, households, companies)
• Linkable datasets
• Widely used for statistical production and research: longitudinal, small groups, intergenerational, networks
• SSD tool set on top of DSC
• https://www.cbs.nl/NR/rdonlyres/98BFF618-D7A7-4897-85D6-6293CFB8EA75/0/systemofsocialstatisticaldatasets.pdf

Proof of concept ‘Data lake’

Diagram labels:

• Data: DSC, Raw data, Big data, Other SN data, Other data
• Data virtualisation (Denodo)
• Users (four)
• Statistics Netherlands and the ‘outside’
• Metadata, Governance, Governance+, Organisation

BIG DATA is of all times

ca. 1981–1975 B.C.

1888 Hackathon US Census Bureau

Contest: the person who could process and tabulate the data fastest would earn a contract for Census 1890

Participant    Process    Tabulate
A              144 hrs    44 hrs
B              100 hrs    55 hrs
C               72 hrs     5 hrs

Herman Hollerith

1896 Tabulating Machine Company

1911 Computing-Tabulating-Recording Company

1924 International Business Machines Corporation

1908

2018: DSC contains about 14 thousand datasets (≈5 TB). Retrieving and processing data should go faster.

Can we build a tabulating machine based on contemporary technology?

Apache Spark

Test case

Diagram labels: DSC (meta data); Authentication; Authorisation control; Data control; Spark; Spark programming (PySpark)

User story

After a CBS press release about average capital per municipality*, a journalist asks whether the top 10 would be the same when one looks at average wage per municipality.

Top 10 average capital per municipality, 2016

• Laren (NH.)
• Blaricum
• Bloemendaal
• Wassenaar
• Rozendaal
• Heemstede
• Bergen (NH.)
• Alphen-Chaam
• De Bilt
• Westvoorne

*https://www.cbs.nl/nl-nl/nieuws/2018/06/vermogen-huishoudens-bijna-10-procent-hoger-in-2016

DSC Datasets

SPOLIS2015: all jobs in NL in 2015
• Variables: SBASISLOON (wage), SREGULIEREUREN (hours)
• Filter: SDATUMAANVANGIKO >= 20150101, SDATUMAANVANGIKO <= 20150131
• Link by: RINPERSOONS, RINPERSOON

GBAADRESOBJECT2015: all addresses 2015
• Filter: GBADATUMAANVANGADRESHUISHOUDING <= 20150101, GBADATUMEINDEADRESHUISHOUDING >= 20150101
• Link by: RINPERSOONS, RINPERSOON; SOORTOBJECTNUMMER, RINOBJECTNUMMER

VSLGWB2015: municipality-district-neighbourhood code of all addresses
• Variable: GEM, derived from GWBCODE2016 [1-4]
• Link by: SOORTOBJECTNUMMER, RINOBJECTNUMMER

Dataset sizes: 10 mln records, 1.74 GB; 61 mln records, 3.45 GB; 110 mln records, 68.76 GB

Aggregate on GEM (MUN)

UURLOON (HOURLYWAGE) = Sum(SBASISLOON) / Sum(SREGULIEREUREN)
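The filter, link and aggregate steps of this test case can be sketched in plain Python. This is not the PySpark syntax that was run on the cluster; it is a small stand-in for the same logic, using the column names from the slide. The toy records, the municipality codes, the function name `hourly_wage_per_municipality`, and the assumptions that dates are integers in yyyymmdd form and that GWBCODE2016 is a string whose first four characters are the municipality code are all made up for illustration.

```python
from collections import defaultdict


def hourly_wage_per_municipality(jobs, addresses, codes,
                                 ref_date=20150101,
                                 start=20150101, end=20150131):
    """Return {GEM: Sum(SBASISLOON) / Sum(SREGULIEREUREN)}."""
    # Address object -> municipality: GEM is taken as characters 1-4 of GWBCODE2016.
    gem_by_object = {
        (c["SOORTOBJECTNUMMER"], c["RINOBJECTNUMMER"]): c["GWBCODE2016"][:4]
        for c in codes
    }
    # Person -> municipality, using only the address valid on the reference date.
    gem_by_person = {}
    for a in addresses:
        valid = (a["GBADATUMAANVANGADRESHUISHOUDING"] <= ref_date
                 <= a["GBADATUMEINDEADRESHUISHOUDING"])
        obj = (a["SOORTOBJECTNUMMER"], a["RINOBJECTNUMMER"])
        if valid and obj in gem_by_object:
            gem_by_person[(a["RINPERSOONS"], a["RINPERSOON"])] = gem_by_object[obj]
    # Filter jobs on start date, link to the municipality, and sum wage and hours.
    wage, hours = defaultdict(float), defaultdict(float)
    for j in jobs:
        if not (start <= j["SDATUMAANVANGIKO"] <= end):
            continue
        gem = gem_by_person.get((j["RINPERSOONS"], j["RINPERSOON"]))
        if gem is None:
            continue  # behaves like an inner join: unmatched jobs are dropped
        wage[gem] += j["SBASISLOON"]
        hours[gem] += j["SREGULIEREUREN"]
    return {g: wage[g] / hours[g] for g in wage if hours[g] > 0}


def job(person, date, wage, hours):
    return {"RINPERSOONS": "R", "RINPERSOON": person, "SDATUMAANVANGIKO": date,
            "SBASISLOON": wage, "SREGULIEREUREN": hours}


def address(person, obj, begin, end):
    return {"RINPERSOONS": "R", "RINPERSOON": person,
            "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": obj,
            "GBADATUMAANVANGADRESHUISHOUDING": begin,
            "GBADATUMEINDEADRESHUISHOUDING": end}


# Invented toy records; none of these values come from DSC.
jobs = [job("001", 20150101, 3000.0, 150.0),
        job("002", 20150101, 2000.0, 100.0),
        job("003", 20150101, 1200.0, 80.0),
        job("001", 20150201, 9999.0, 1.0)]      # starts after January: filtered out
addresses = [address("001", "10", 20100101, 20151231),
             address("002", "10", 20100101, 20151231),
             address("003", "10", 20100101, 20141231),  # ended before reference date
             address("003", "20", 20150101, 20151231)]
codes = [{"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "10", "GWBCODE2016": "11110101"},
         {"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "20", "GWBCODE2016": "22220203"}]

result = hourly_wage_per_municipality(jobs, addresses, codes)
```

Note that the hourly wage is a ratio of sums per municipality, not an average of individual hourly wages, so people who work more hours weigh more heavily in the result.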

User interface

Processing time syntax on Spark cluster: Approx. 1 minute

Other advantages:
- Open source
- Modern tool set
- Syntax based
- Sharing code
- Visualisations
- Commonly used, documentation

Disclaimer: data shown are for demo purposes only, they are not official outcomes
