Transcript
Page 1: Scaling up research infrastructure to meet cross ... · • Integrated data and metadata flows/modelling/services. Success Factors. Types of new “disclosive” data (2013) OECD

Scaling up research infrastructure to meet cross disciplinary data needs

Nathan Cunningham

Director of Innovation for Research Data & IT

Knowledge Exchange Symposium: 21st Century Data Infrastructure for Research

11th July 2017 : UCT


New UK Landscape 2018/19

https://www.gov.uk/government/publications/uk-research-and-innovation-business-case


Drivers for change

UK Research and Innovation will be formed in April 2018 and will:

• maximise the value and benefit from Government’s investment of over £6 billion per annum in research and innovation;
• enable cross-cutting funds to be held, managed and distributed at arm’s length from Government, while avoiding administrative overheads and working around current legal structures;
• eliminate duplication to ensure the new arrangements are efficient and effective, and to ensure all available funding is directed to support research, translation and innovation, not on administrative overheads; and
• establish a system that balances autonomy and independence with cross-cutting ability and flexibility, with decisions delegated to the experts best able to take them for the benefit of their research discipline or distinctive area of expertise.


A working definition of Big Data

Presentation notes: i.e. not an RDBMS solution, but a distributed, cloud-based solution that is massively scalable in terms of volume.

Infrastructure Challenges

• Infrastructure Sprawl
  • Islands of investment, which produce governance and maintenance challenges.
  • Implications of running big data services:
    • Limited IT infrastructure resources and staff;
    • Relatively little IT experience and skillsets in Hadoop or Spark;
    • Increasing IT overhead for managing multiple environments;
    • The need to onboard multiple users with access to their own dedicated Hadoop/Big Data environment.
• Governance and Security
  • Empowering end users across multiple teams.
• Integrated data and metadata flows/modelling/services


Success Factors


Types of new “disclosive” data

(2013) OECD report on New Data for Understanding the Human Condition:

Category A: Data stemming from the transactions of government, for example, tax and social security systems.

Category B: Data describing official registration or licensing requirements.

Category C: Commercial transactions made by individuals and organisations.

Category D: Internet data, deriving from search and social networking activities.

Category E: Tracking data, monitoring the movement of individuals or physical objects subject to movement by humans.

Category F: Image data, particularly aerial and satellite images but including land-based video images.


Hybrid Cloud


Common problems for a hybrid cloud service


High Level Data Flow of DSaaP

Diagram labels: Data/Metadata; Flatten; Semantic Enrichment; Data Products; Discover; Access; Information Products; Use/ReUse; End User Interactions.


Tailoring our Public / Private Platform

• PRODUCTS (i.e. applications) come with predefined functions that narrow their ultimate breadth of scope. Conversely, PLATFORMS separate out the functions of applications so that an IT structure can be built for change. Because things change all the time.
• A coherent, simplified way of thinking about data services, not necessarily bounded by organisational structures (e.g. OAIS UKDA)
• Divides the business into 2 complementary entities: Services, which depend on the Repository (0.5 PB on-prem and AWS)
• Producers and Consumers only interact with Services.
• Services have 3 major platforms: Deposit, Discovery, Information.
• Repository has 4: Semantics, Access, Data and Preservation
• Not a completely original idea: NSD (Norwegian Open Data Research Infrastructure) are thinking in these terms, also see http://www.big-data-europe.eu


Reference Architecture of DSaaP

• Open source because we can have meaningful common conversations with the community
• Hadoop is…..


Implementation Architecture of DSaaP

Diagram labels:

Consumers and Producers

Services: Deposit Platform, Discovery Platform, Information Platform

Repository: Semantic Platform, Access Platform, Data Platform, Preservation Platform

Security

Support and Maintenance


Data Services as a Platform

• Deposit - Kylo, Apache NiFi, HDFS

• Data – HDFS, HBase, DDI4, Parquet, Spark

• Semantics – Protégé, SKOS, auto-classification

• Discovery - Logstash and ElasticSearch

• Access – SDC/Privacy/Secure Linkage

• Information – Zeppelin, Kibana, rich SPARQL-based querying (possibly ELDA)
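
As a minimal sketch (not the UK Data Service's actual configuration), the platform-to-tool mapping on this slide can be written as a small Python structure. The Services/Repository grouping follows the earlier slide on tailoring the public/private platform; the names PLATFORMS, tools_for_layer and platforms_using are hypothetical, and Preservation is left empty because the slide lists no tools for it.

```python
# Hypothetical lookup table built only from the tools named on the slides.
PLATFORMS = {
    "Services": {
        "Deposit": ["Kylo", "NiFi", "HDFS"],
        "Discovery": ["Logstash", "ElasticSearch"],
        "Information": ["Zeppelin", "Kibana", "SPARQL-based querying"],
    },
    "Repository": {
        "Semantics": ["Protégé", "SKOS", "auto-classification"],
        "Access": ["SDC", "Privacy", "Secure Linkage"],
        "Data": ["HDFS", "HBase", "DDI4", "Parquet", "Spark"],
        "Preservation": [],  # no tools are listed on the slide
    },
}


def tools_for_layer(layer):
    """Return the sorted, de-duplicated tools used by one layer."""
    tools = set()
    for platform_tools in PLATFORMS[layer].values():
        tools.update(platform_tools)
    return sorted(tools)


def platforms_using(tool):
    """Return (layer, platform) pairs whose tool list includes `tool`."""
    return [
        (layer, platform)
        for layer, platforms in PLATFORMS.items()
        for platform, tools in platforms.items()
        if tool in tools
    ]
```

A lookup such as platforms_using("HDFS") shows that the same storage layer appears in both the Deposit and Data platforms, which is the kind of shared component a platform approach is meant to expose.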


The ‘Five Safes’ of data access

Safe projects: Is the use of the data appropriate, lawful, ethical, sensible?

Safe people: Are people likely to use it appropriately?

Safe settings: Is the environment in which it is used appropriate?

Safe data: Is the data appropriate?

Safe outputs: Are the outputs appropriate?

Things to note:

• ‘appropriate’, not ‘right’
• ‘safe’ is a scale, not a limit
• explicitly subjective
• multiple ways to achieve the same outcome: “safe data access”
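
One way to picture the point that safe is a scale, not a limit, is a sketch in which each of the five dimensions gets a score and weaker controls on one dimension can be offset by stronger controls on another. Everything here is an assumption made for illustration: the 0 to 1 scoring, the AccessRequest class, the additive rule and the 3.5 threshold are hypothetical and are not part of the Five Safes framework itself.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AccessRequest:
    """Scores from 0.0 (no control) to 1.0 (strong control); hypothetical scale."""
    safe_projects: float
    safe_people: float
    safe_settings: float
    safe_data: float
    safe_outputs: float

    def total(self):
        """Sum the five scores into one overall figure."""
        return sum(getattr(self, f.name) for f in fields(self))


def is_safe_access(request, threshold=3.5):
    """Judge the combination of controls, not any single dimension."""
    return request.total() >= threshold


# Two different mixes of controls that reach the same outcome.
secure_lab = AccessRequest(0.75, 1.0, 1.0, 0.25, 0.75)    # detailed data, strict setting
open_download = AccessRequest(0.75, 0.5, 0.25, 1.0, 1.0)  # heavily anonymised data
weak_request = AccessRequest(0.5, 0.25, 0.25, 0.25, 0.5)  # weak on every dimension
```

Here secure_lab and open_download both pass with very different profiles, while weak_request does not, which mirrors the slide's note that there are multiple ways to achieve safe data access.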


5 Safes at Scale

Diagram labels: DSaaP Hybrid Service Instances; access tiers Open, Safeguarded and Secure; one AWS instance and two on-premise instances; Common Service Authentication (Kerberos).


Delivering joined up queries


Enabling Analytics

• Zeppelin provides a dataframe API.
• Dataframes present a view of data similar to a spreadsheet.
• A standard way of working for data science, routinely used in packages such as R and Python.
• Zeppelin provides interpreters allowing users to query with Python, R, Scala, SQL and Angular syntax.
• Dataframes facilitate subsetting, querying, filtering, joining, statistical modelling, machine learning, and visualisation.
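
The dataframe operations listed above (filtering, joining, subsetting) can be illustrated without any particular library. The sketch below uses plain Python lists of dictionaries as a stand-in for a dataframe; in a Zeppelin notebook the same steps would normally be written against a real dataframe library. The table contents and helper names are invented for the example.

```python
# Stand-in "dataframes": each row is a dict, each table a list of rows.
respondents = [
    {"id": 1, "age": 34, "region": "North"},
    {"id": 2, "age": 61, "region": "South"},
    {"id": 3, "age": 47, "region": "North"},
]
incomes = [
    {"id": 1, "income": 28000},
    {"id": 2, "income": 19500},
    {"id": 3, "income": 41000},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def select_columns(rows, columns):
    """Subset each row to the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


def inner_join(left, right, key):
    """Join two tables on a shared key column, dropping unmatched rows."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


north = filter_rows(respondents, lambda r: r["region"] == "North")
joined = inner_join(north, incomes, "id")
result = select_columns(joined, ["id", "income"])
mean_income = sum(r["income"] for r in result) / len(result)
```

The pipeline filters to one region, joins on the shared id column, keeps two columns and computes a simple summary statistic, which is the typical shape of an exploratory notebook cell.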


Data in motion (federation at scale)

• Hortonworks distribution of NiFi - technology used at scale within the NSA for the last 8 years and made available to the Apache Software Foundation through the NSA Technology Transfer Program
• Why?
  • Big data ingest – one tool for real-time data collection from all sources and destinations
  • Digital Security – get rid of the tedium of custom scripts and manual processes for managing dataflows
  • Real-time GUI and control of dataflows – provides fast drag and drop GUI
  • Data provenance – easy access to comprehensive audit trails
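
To make the data provenance point concrete, here is a small sketch of an append-only audit trail in which every step of a dataflow records an event, so the history of any item can be replayed later. It is not NiFi's API: ProvenanceEvent, ProvenanceLog, record and lineage are hypothetical names, and the event types and item identifiers are illustrative labels only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceEvent:
    item_id: str         # identifier of the data item being tracked
    event_type: str      # illustrative label, e.g. "RECEIVE" or "SEND"
    component: str       # the step in the flow that handled the item
    timestamp: datetime  # when the event was recorded (UTC)


@dataclass
class ProvenanceLog:
    _events: list = field(default_factory=list)

    def record(self, item_id, event_type, component):
        """Append one event; existing events are never modified."""
        event = ProvenanceEvent(item_id, event_type, component,
                                datetime.now(timezone.utc))
        self._events.append(event)
        return event

    def lineage(self, item_id):
        """Return the events for one item in the order they were recorded."""
        return [e for e in self._events if e.item_id == item_id]


log = ProvenanceLog()
log.record("survey-001", "RECEIVE", "ingest")
log.record("survey-002", "RECEIVE", "ingest")
log.record("survey-001", "CONTENT_MODIFIED", "anonymise")
log.record("survey-001", "SEND", "publish")
trail = [(e.event_type, e.component) for e in log.lineage("survey-001")]
```

Because events are only ever appended, the trail for an item doubles as a reproducibility record: it states which components touched the data and in what order.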


Reproducibility - audit trail


Sample Data Provenance Event


UK Data Service’s IT Strategy

Strategic Aims (table columns): National Digital Repository; Single Point of Access; Provide Support to Owners & Users; Dynamic and Responsive; Extend Use of Data Holdings; Maximise Impact of Data Holding

IT Themes (table rows; each ● marks a strategic aim the theme supports):

Advancing the Service ● ● ● ● ● ●

Data and Information Stewardship ● ● ● ●

Operational Excellence ● ● ● ●

IT Complexity Reduction ● ● ●

IT Functional Excellence ● ● ●


Key approach to embed new IT strategy

• Driving IT complexity reduction to release resources for new initiatives and innovation. Scaling to tier 0, 1, 2 & 3 data services with common security and governance.
• Securing a sustainable funding basis for DSaaP and HPC, augmented by winning new funding awards, based on a commitment to open source innovation and asset services for science with disclosive data.
• Creating an Innovation Panel to encourage and steer risk-taking IT service innovation initiatives for linking data, and to scale out to UKRI / G Cloud, e.g. with AWS Public Sector Cloud.
• Periodically regrouping and reskilling resources to deliver the IT Strategy programme, e.g. Hadoop stack, R, Scala, Python.


Questions

[email protected]

https://aws.amazon.com/solutions/case-studies/uk-data-service/

https://www.cloudwick.com/our-use-cases/aws-ukds/

