Data Service Centre and Apache Spark
at Statistics Netherlands
Statistical process and DSC
[Diagram: the DSC within the statistical process. Labels: RIN (four times). Outputs: Website, StatLine, Open data, Articles, Books, Microdata services.]
DSC not a traditional data warehouse
• Technical backend: document management system Documentum (Open Text)
• Only statistical data that can be stored in rows and columns (no documents, images, etc.)
• Data stored as text files (csv, fixed-width): future proof
• Primary focus was archiving, but now more and more on data exchange
• Retrieve data and process it in SPSS, R, Python, or custom-built systems
• Almost 14,000 datasets, mostly microdata
• Covers all domains: social statistics, business statistics, national accounts, health statistics, energy statistics, agricultural statistics, etc.
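Because the data are kept as plain text (csv, fixed-width), they can be read with standard tooling. The following is a minimal sketch in Python, not the DSC's own retrieval code: the fixed-width layout, the `;` delimiter and the sample records are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical fixed-width layout: (column name, start offset, end offset).
# This is an assumption for illustration, not an actual DSC record layout.
LAYOUT = [("RINPERSOON", 0, 9), ("GEM", 9, 13), ("SBASISLOON", 13, 21)]


def parse_fixed_width(lines, layout):
    """Slice each fixed-width line into a dict of stripped string fields."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        records.append(
            {name: line[start:end].strip() for name, start, end in layout}
        )
    return records


def parse_csv(text, delimiter=";"):
    """Read delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Made-up sample records (not real data).
fixed_sample = [
    "0000000010363    2500\n",
    "0000000020599    3100\n",
]
csv_sample = "RINPERSOON;GEM;SBASISLOON\n000000001;0363;2500\n"

fixed_records = parse_fixed_width(fixed_sample, LAYOUT)
csv_records = parse_csv(csv_sample)
```

Fields are kept as strings so that leading zeros in identifiers and codes survive; numeric conversion can be done per column afterwards.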
DSC Catalogue
System of Social statistical Datasets (SSD)
• Subset of data in DSC
• Highly coordinated
• Mostly based on administrative sources, some surveys
• ‘Backbones’ (persons, buildings, households, companies)
• Linkable datasets
• Widely used for statistical production and research: longitudinal, small groups, intergenerational, networks
• SSD tool set on top of DSC
• https://www.cbs.nl/NR/rdonlyres/98BFF618-D7A7-4897-85D6-6293CFB8EA75/0/systemofsocialstatisticaldatasets.pdf
Proof of concept ‘Data lake’
[Diagram: data sources (DSC, raw data, big data, other SN data, other data) are offered to users through a data virtualisation layer (Denodo). The sources span Statistics Netherlands and the ‘outside’. Further labels: Metadata, Governance, Organisation, Governance+.]
Big data is of all times
ca. 1981–1975 B.C.
1888 Hackathon US Census Bureau
Contest: the person who could process and tabulate the data fastest would earn a contract for the 1890 Census.

Participant    Process    Tabulate
A              144 hrs    44 hrs
B              100 hrs    55 hrs
C               72 hrs     5 hrs
Herman Hollerith
1896 Tabulating Machine Company
1911 Computing-Tabulating-Recording Company
1924 International Business Machines Corporation
1908
2018: DSC contains about 14 thousand datasets (≈5 TB). Retrieving and processing data should be faster.
Can we build a tabulating machine based on contemporary technology?
Apache Spark
Test case
[Diagram: test-case architecture. Components: DSC (data and metadata), Authentication, Authorisation control, Data control, Spark, Spark programming (PySpark).]
User story
After a CBS press release about average capital per municipality*, a journalist asks whether the top 10 would be the same for average wage per municipality.
Top 10 average capital per municipality, 2016
Laren (NH.)
Blaricum
Bloemendaal
Wassenaar
Rozendaal
Heemstede
Bergen (NH.)
Alphen-Chaam
De Bilt
Westvoorne
*https://www.cbs.nl/nl-nl/nieuws/2018/06/vermogen-huishoudens-bijna-10-procent-hoger-in-2016
DSC Datasets
SPOLIS2015: all jobs in NL in 2015
  Variables: SBASISLOON (wage), SREGULIEREUREN (hours)
  Filter: SDATUMAANVANGIKO >= 20150101, SDATUMAANVANGIKO <= 20150131
  Link by: RINPERSOONS, RINPERSOON

GBAADRESOBJECT2015: all addresses 2015
  Variables: -
  Filter: GBADATUMAANVANGADRESHUISHOUDING <= 20150101, GBADATUMEINDEADRESHUISHOUDING >= 20150101
  Link by: RINPERSOONS, RINPERSOON
  Link by: SOORTOBJECTNUMMER, RINOBJECTNUMMER

VSLGWB2015: municipality-district-neighbourhood code of all addresses
  Variables: GEM, derived from GWBCODE2016 [1-4]
  Link by: SOORTOBJECTNUMMER, RINOBJECTNUMMER

Dataset sizes, in the order shown on the slide: 10 mln records, 1.74 GB; 61 mln records, 3.45 GB; 110 mln records, 68.76 GB
Aggregate on GEM (MUN)
UURLOON (HOURLYWAGE) = Sum(SBASISLOON) / Sum(SREGULIEREUREN)
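The filter, link and aggregate steps above can be sketched in plain Python. This is an illustration of the logic only, not the PySpark syntax used in the test case (which the slides do not show). The rows are made up, and reading "GWBCODE2016 [1-4]" as the first four characters of the code is an assumption.

```python
from collections import defaultdict

# Toy rows; every value below is made up for illustration.
spolis = [
    {"RINPERSOONS": "R", "RINPERSOON": "000000001",
     "SDATUMAANVANGIKO": 20150101, "SBASISLOON": 3000.0, "SREGULIEREUREN": 150.0},
    {"RINPERSOONS": "R", "RINPERSOON": "000000002",
     "SDATUMAANVANGIKO": 20150115, "SBASISLOON": 2000.0, "SREGULIEREUREN": 100.0},
    {"RINPERSOONS": "R", "RINPERSOON": "000000003",  # outside January: dropped
     "SDATUMAANVANGIKO": 20150201, "SBASISLOON": 9999.0, "SREGULIEREUREN": 1.0},
    {"RINPERSOONS": "R", "RINPERSOON": "000000004",
     "SDATUMAANVANGIKO": 20150110, "SBASISLOON": 1200.0, "SREGULIEREUREN": 80.0},
]
adres = [
    {"RINPERSOONS": "R", "RINPERSOON": "000000001",  # expired address: dropped
     "GBADATUMAANVANGADRESHUISHOUDING": 20100101,
     "GBADATUMEINDEADRESHUISHOUDING": 20141231,
     "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "B"},
    {"RINPERSOONS": "R", "RINPERSOON": "000000001",
     "GBADATUMAANVANGADRESHUISHOUDING": 20150101,
     "GBADATUMEINDEADRESHUISHOUDING": 20991231,
     "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "A"},
    {"RINPERSOONS": "R", "RINPERSOON": "000000002",
     "GBADATUMAANVANGADRESHUISHOUDING": 20120101,
     "GBADATUMEINDEADRESHUISHOUDING": 20991231,
     "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "A"},
    {"RINPERSOONS": "R", "RINPERSOON": "000000004",
     "GBADATUMAANVANGADRESHUISHOUDING": 20130101,
     "GBADATUMEINDEADRESHUISHOUDING": 20991231,
     "SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "B"},
]
vslgwb = [
    {"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "A", "GWBCODE2016": "03630101"},
    {"SOORTOBJECTNUMMER": "O", "RINOBJECTNUMMER": "B", "GWBCODE2016": "05990202"},
]


def hourly_wage_per_municipality(spolis, adres, vslgwb,
                                 start=20150101, end=20150131, ref_date=20150101):
    """Return {GEM: Sum(SBASISLOON) / Sum(SREGULIEREUREN)}."""
    # Filter jobs on SDATUMAANVANGIKO.
    jobs = [j for j in spolis if start <= j["SDATUMAANVANGIKO"] <= end]
    # Keep addresses valid on the reference date, keyed by person.
    person_to_object = {}
    for a in adres:
        if (a["GBADATUMAANVANGADRESHUISHOUDING"] <= ref_date
                <= a["GBADATUMEINDEADRESHUISHOUDING"]):
            key = (a["RINPERSOONS"], a["RINPERSOON"])
            person_to_object[key] = (a["SOORTOBJECTNUMMER"], a["RINOBJECTNUMMER"])
    # GEM: assumed to be the first four characters of GWBCODE2016.
    object_to_gem = {
        (v["SOORTOBJECTNUMMER"], v["RINOBJECTNUMMER"]): v["GWBCODE2016"][:4]
        for v in vslgwb
    }
    # Link jobs -> address -> municipality, then aggregate on GEM.
    wage, hours = defaultdict(float), defaultdict(float)
    for j in jobs:
        obj = person_to_object.get((j["RINPERSOONS"], j["RINPERSOON"]))
        gem = object_to_gem.get(obj)
        if gem is None:
            continue  # no valid address or no municipality code
        wage[gem] += j["SBASISLOON"]
        hours[gem] += j["SREGULIEREUREN"]
    return {g: wage[g] / hours[g] for g in wage if hours[g] > 0}


result = hourly_wage_per_municipality(spolis, adres, vslgwb)
```

Note that UURLOON is a ratio of sums per municipality, not an average of per-job hourly wages; the sketch follows the formula on the slide.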
User interface
Processing time of the syntax on the Spark cluster: approx. 1 minute
Other advantages:
- Open source
- Modern tool set
- Syntax based
- Sharing code
- Visualisations
- Commonly used, documentation
Disclaimer: data shown are for demo purposes only, they are not official outcomes