Towards a Community-driven Data Science Body of Knowledge
FAIR 2016, Florence, 14-15 November 2016
Andrea Manieri, Engineering Ingegneria Informatica S.p.A.
EDISON: Education for Data Intensive Science to Open New science frontiers
Grant 675419 (INFRASUPP-4-2015: CSA)
Credits: Yuri Demchenko (UvA), Steve Brewer (SOTON), Kim Hee (GOETHE), Adam Belloum (UvA), Spiros Koulozis (UvA)

Towards a Community-driven Data Science Body of Knowledge – Data Management Skills and Competences



A sense of urgency – dated 2013

“Europe faces up to 700,000 unfilled ICT jobs and declining competitiveness. The number of digital jobs is growing – by 3% each year during the crisis – but the number of new ICT graduates and other skilled ICT workers is shrinking. Our youth need actions not words, and companies operating in Europe need the right people or they will move operations elsewhere.” EC press release, 25 Jan 2013

The Grand Coalition for Digital Jobs and the EU eSkills strategy for 2020 are becoming the Digital Skills and Jobs Coalition (launch conference on 1 Dec 2016 in Brussels)

Data Scientist shortage:

- Gartner, 2012

- McKinsey, 2013

- Forbes, 2013 https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Who needs data skills?

• As a student – I need recommendations on data-driven careers
• As a researcher – I need to cover gaps with respect to eScience
• As a librarian – I want to promote my competences
• As an employee – I want to reskill for data-driven jobs

Who needs data skills?

• As a scholar/lecturer – I need to update my background
• As a training manager – I want to innovate my offering
• As a course designer – I have to define the right topics and know-how to be taught

Who needs data skills?

• As an HR manager – I want to find fit-for-purpose candidates
• As a team leader – I need to cover the know-how and skills for a task/project
• As an employer – I want to define re-skilling plans for my workforce

EDISON CSA: serving its customer base

VISION and Background

How it all began

Visionaries and Drivers

The Fourth Paradigm: Data-Intensive Scientific Discovery. By Jim Gray, Microsoft, 2009. Edited by Tony Hey, et al.
http://research.microsoft.com/en-us/collaboration/fourthparadigm/

Riding the wave: How Europe can gain from the rising tide of scientific data. Final report of the High Level Expert Group on Scientific Data. October 2010.
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf

The Data Harvest: How sharing research data can yield knowledge, jobs and growth. An RDA Europe Report. December 2014.
https://rd-alliance.org/data-harvest-report-sharing-data-knowledge-jobs-and-growth.html

https://www.rd-alliance.org/

NIST Big Data Working Group (NBD-WG)

https://www.rd-alliance.org/ (since 2013)

ISO/IEC JTC1 Big Data Study Group (SGBD)

http://jtc1bigdatasg.nist.gov/home.php (2014)

EDISON & RDA

• 1st RDA Plenary meeting – 18-20 March 2013
  – 1st BoF on Education and Skills Development in Data Intensive Science
  – Attended by 16 representatives from universities, libraries, e-Science, data centers, research coordination bodies
• 3rd RDA Plenary meeting – 26-28 March 2014, Dublin
  – 3rd BoF on Education and Skills Development in Data Intensive Science
  – EDISON (Education for Data Intensive Science to Open New science frontiers) Initiative announced
• 4th RDA Plenary meeting – 22-24 September 2014, Amsterdam
  – IG Education and Training on Handling of Research Data (ETHRD) established
  – EDISON Workshop – 21 Sept 2014, Science Park Amsterdam
  – Decision to form a consortium and submit a proposal to the INFRASUPP-4-2015 call
• 8th RDA Plenary meeting – 15-17 September 2016, Denver, USA
  – BoFs and IG meetings – now developing a Certification and Accreditation proposal

EDISON Data Science Framework (EDSF)

[Figure: EDSF diagram. Labels: CF-DS, DS-BoK, MC-DS, DS Prof Profiles, Taxonomy and Vocabulary, eLearning Platform, Datasciencepro.eu, Community Portal (CP), Professional certification, Data Science career & prof development, Roadmap & Sustainability. Columns: Foundation & Concepts, Services, Biz Model.]

• EDISON Framework components

– CF-DS – Data Science Competence Framework

– DS-BoK – Data Science Body of Knowledge

– MC-DS – Data Science Model Curriculum

– DSP - Data Science Professional profiles definition

– Data Science Taxonomies and Scientific Disciplines Classification

Data Science Body-of-Knowledge

A shared vision of the knowledge corpus

Data Scientist Definition

• Based on the definition by the NIST Big Data WG (NIST SP1500, 2015)
• A Data Scientist is a practitioner who has sufficient knowledge in the overlapping regimes of expertise in business needs, domain knowledge, analytical skills, and programming and systems engineering expertise to manage the end-to-end scientific method process through each stage in the big data lifecycle
  – …till the delivery of the expected scientific and business value to science or industry
• Other definitions add such features as
  – Ability to solve a variety of business problems, tell “stories”, and provide input to decision making
  – Optimize performance and suggest new services for the organisation
  – Develop a special mindset and be statistically minded, understand raw data and “appreciate data as a first class product”

• Data science is the empirical synthesis of actionable knowledge and technologies required to handle data from raw data through the complete data lifecycle process.
• Big Data is the technology to build systems and infrastructures to process large volumes of structurally complex data in a time-effective way.

[ref] Legacy: NIST BDWG definition of Data Science

Data Science Competence Groups

• Commonly accepted Data Science competences/skills groups include
  – Data Analytics or Business Analytics or Machine Learning
  – Engineering or Programming
  – Subject/Scientific Domain Knowledge
• EDISON identified 2 additional competence groups demanded by organisations
  – Data Management, Curation, Preservation
  – Scientific or Research Methods and/vs Business Processes/Operations
• Other skills commonly recognized, aka “soft skills” or “social/professional intelligence”
  – Inter-personal skills or team work, cooperativeness
• Important aspects of integrating the Data Scientist into the organisation structure
  – General Data Science (and Data) literacy for all involved roles and management
  – Commonly agreed and understandable way of communication and information/data presentation
  – Role of Data Scientist: provide a kind of literacy advice and guidance to the organisation

Data Science Skills/Experiences

• Group 1: Skills/experience related to competences
  – Data Analytics and Machine Learning
  – Data Management/Curation (both general and scientific)
  – Data Science Engineering (hardware and software) skills
  – Scientific/Research Methods or Business Process Management
  – Application/subject domain related (research or business)
  – Mathematics and Statistics
• Group 2: Big Data (Data Science) tools and platforms
  – Big Data Analytics platforms
  – Mathematics & Statistics applications & tools
  – Databases (SQL and NoSQL)
  – Data Management and Curation platform
  – Data and applications visualisation
  – Cloud based platforms and tools
• Group 3: Programming and programming languages and IDE
  – General and specialized development platforms for data analysis and statistics
• Group 4: Soft skills or Social Intelligence
  – Personal, inter-personal communication, team work, professional network

Comparing with relevant BoKs

• ACM Computer Science Body of Knowledge (ACM CS-BoK)
• ICT professional Body of Knowledge (ICT-BoK)
• Business Analysis Body of Knowledge (BABOK)
• Software Engineering Body of Knowledge (SWEBOK)
• Data Management Body of Knowledge (DAMA-BoK) by Data Management Association International (DAMAI)
• Project Management Professional Body of Knowledge (PM-BoK)

Data Science BoK (DS-BoK)

• DS-BoK Knowledge Area Groups (KAG)
  – KAG1-DSA: Data Analytics group including Machine Learning, statistical methods, and Business Analytics
  – KAG2-DSE: Data Science Engineering group including Software and infrastructure engineering
  – KAG3-DSDM: Data Management group including data curation, preservation and data infrastructure
  – KAG4-DSRM: Scientific/Research Methods group
  – KAG5-DSBP: Business process management group
• Data Science domain knowledge to be defined by related expert groups

Process Groups – knowledge at work

• Data Identification and Creation:
  – how to obtain digital information from in-silico experiments and instrumentation, how to collect and store it in digital form, and any techniques, models, standards and tools needed to perform these activities, depending on the specific discipline.
• Data Access and Retrieval:
  – tools, techniques and standards used to access any type of data from any type of media and retrieve it in compliance with IPRs and established legislation.
• Data Curation and Preservation:
  – includes activities related to data cleansing, normalisation, validation and storage.
• Data Fusion (or Data Integration):
  – the integration of multiple data and knowledge representing the same real-world object into a consistent, accurate, and useful representation.
• Data Organisation and Management:
  – how to organise the storage of data for the various purposes required by each discipline; tools, techniques, standards and best practices (including IPR management, compliance with laws and regulations, and metadata definition and completion) to set up ICT solutions that achieve the required Service Level Agreement for data conservation.
• Data Storage and Stewardship:
  – how to enhance the use of data by using metadata and other techniques to establish long-term access to and extended use of that data, including by scientists and researchers from other disciplines and long after the data was produced.
• Data Processing:
  – tools, techniques and standards to analyse different and heterogeneous data coming from various sources and different scientific domains, and of a variety of sizes (up to exabytes); it includes the notion of programming paradigms.
• Data Visualisation and Communication:
  – techniques, models and best practices to merge and join various data sets; techniques and tools for data analytics and visualisation, depending on the significance of the data and the discipline.

Data Science Data Management Group (DSDM)

KAG3-DSDM: Data Management group including data curation, preservation and data infrastructure

DAMA-BoK selected KAs
(1) Data Governance
(2) Data Architecture
(3) Data Modelling and Design
(4) Data Storage and Operations
(5) Data Security
(6) Data Integration and Interoperability
(7) Documents and Content
(8) Reference and Master Data
(9) Data Warehousing and Business Intelligence
(10) Metadata
(11) Data Quality

General Data Management KAs
• Data Lifecycle Management
• Data archives/storage compliance and certification

New KAs to support RDA recommendations and community data management models (Open Access, Open Data, etc.)
• Data type registries, PIDs
• Data infrastructure and Data Factories

Data Science Competence Framework

A bottom-up approach

Data Science Professions Family

• Professional profile groups are defined in compliance with the ESCO taxonomy

Mapping DS-BoK GAs to DSP profiles

• Relevance of a competence to a DSP profile: 5 – high, 1 – low

E-CO2 Classification

• Text Filtering
• Find overlapping terms
• Calculate TF-IDF of terms
• For each category vector calculate cosine similarity
• The output is a CSV with the similarity for each category
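
The slide lists the classification steps but not their implementation, so what follows is only a minimal sketch under stated assumptions. The stopword list, the smoothed IDF formula, the category descriptions and the sample job advertisement are all hypothetical and are not taken from the EDISON tool; only the sequence of steps (filter, TF-IDF, cosine similarity per category, CSV output) follows the slide.

```python
import csv
import io
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used by EDISON.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "with", "on"}


def filter_text(text):
    """Text filtering: lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Map each document name to a {term: TF-IDF weight} dictionary."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    # Assumption: smoothed IDF so that terms present everywhere keep weight 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: (c / len(tokens)) * idf[t] for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity; only overlapping terms contribute to the dot product."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(document, categories):
    """Return the similarity of the document to each category description."""
    docs = {name: filter_text(text) for name, text in categories.items()}
    docs["__document__"] = filter_text(document)
    vectors = tfidf_vectors(docs)
    doc_vec = vectors.pop("__document__")
    return {name: cosine(doc_vec, vec) for name, vec in vectors.items()}


def to_csv(scores):
    """Write one CSV row per category with its similarity score."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["category", "similarity"])
    for name, score in sorted(scores.items()):
        writer.writerow([name, f"{score:.4f}"])
    return buf.getvalue()


# Hypothetical category descriptions and job advertisement, for illustration only.
CATEGORIES = {
    "DSDA": "machine learning statistical methods data analytics predictive models",
    "DSDM": "data management curation preservation metadata storage",
}
JOB_AD = "We need experience with metadata, data curation and long-term preservation."

scores = classify(JOB_AD, CATEGORIES)
print(to_csv(scores))
```

In this toy run the job advertisement shares more terms with the data management description than with the analytics one, so the DSDM row carries the higher similarity.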

Education offered vs. Market requests

CV vs. Job offering

[Charts not recoverable. Legend: DSDA: Data Science Analytics; DSDK: DS Domain Knowledge; DSEN: Data Science Engineering; DSRM: Scientific/Research Methods; DSDM: Data Management]

Model Curricula in Data Science

Supporting EU Academy in excelling and matching the market needs

Data Science Model Curriculum (MC-DS)

• Data Science Model Curriculum includes
  – Learning Outcomes (LO) definition based on CF-DS
    • LOs are defined for CF-DS competence groups and for all enumerated competences
  – LOs mapping to Learning Units (LU)
    • LUs are based on CCS(2012) and universities' best practices
    • Data Science university programmes and courses inventory (interactive) http://edison-project.eu/university-programs-list
  – LU/course relevance: Mandatory Tier 1, Tier 2, Elective, Prerequisite
  – Learning methods and learning models (in progress)
    • Based on Bloom’s Taxonomy, Outcome Based Learning, etc.

Using the MC-DS

Engineering IT & Management school

Some numbers (2015)

• A portfolio of more than 300 courses
• 200 trainers and experts
• 5 offices and 16 classrooms
• 18,000 person-hours of training
• New on-line platform

[Map: Aosta, Roma, Padova, Milano, Frosinone]

Data Science Master at Univ. of Perugia

http://masterds.unipg.it/en/index.html

Accreditation and Certification – RDA BoF

Aim: contribute to the sustainable development of the data science profession.

Goal: deliver a report that presents a concise but representative picture of the various accreditation and certification schemes that exist around the world.

Outcome: need to develop a 9-month working group proposal centered on supporting the members of RDA to develop their own professional career paths around their own skills, interests and contexts.

Career development and reskilling

Mind the gap!

Get practical recommendations

Training and Education Inventory

Sharing education events and experiences

What’s next?

Putting theory into practice and supporting service delivery

EDISON Community portal

What we can do with you

1. Improve and Validate EDSF
   1. Identifying the “soft skills”: how to ask a research/business question?
   2. Identifying the Community need: from stewards to scientists, any market, any discipline
   3. Validate completeness of BoK, coverage of CF, usability of MC
   4. Promote National workshops for bottom-up adoption of EDSF
2. Career Development
   1. Specifications for DSP job positions in Data Management and Librarian teams, and engagement mechanisms between employers and DSP candidates
   2. Links and recommendations for placing students to gain DSP work experience
   3. Facilitate cross-institutional agreements on DSP career paths
   4. Supporting Training through DataSciencePro.eu
   5. Mapping and comparing career paths and learning opportunities for the Personal Competence Portfolio (PCP)
   6. Advise on events, courses and tools for Community training
   7. Develop Virtual Labs that are re-usable and promoted beyond your Community
   8. Certification: from badges to professions – the how-to of a Community-driven Data Science Certification (RDA)

Promoting the Data Science Profession