RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION

BALÁZS KÉGL, Université Paris-Saclay / CNRS
Center for Data Science Paris-Saclay


WHO AM I? Balázs Kégl

• Senior researcher, CNRS
• machine learning (20 years) interfacing with particle physics (10 years)
• Head of the Paris-Saclay Center for Data Science
• interfacing with biology, economy, climatology, chemistry, etc. (4 years)

The Paris-Saclay Center for Data Science: Data Science for Scientific Data

250 researchers in 35 laboratories

• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Visualization: INRIA, LIMSI
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Domain science: human society, life, brain, earth, universe
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Roles: data scientist, applied scientist, domain scientist, data engineer, software engineer

Center for Data Science Paris-Saclay

datascience-paris-saclay.fr

@SaclayCDS

LIST/CEA

Center for Data Science Paris-Saclay

A multi-disciplinary initiative, building interfaces, matching people, helping them launch projects

345 affiliated researchers, 50 laboratories

THE DATA SCIENCE ECOSYSTEM
https://medium.com/@balazskegl

• Roles: data scientist, data value architect, domain expert, software engineer, data engineer
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

WHAT DOES MACHINE LEARNING DO

• Classification problem y = f(x)

[Figure: an input x is mapped by f to a label y, e.g. ‘Stomorhina’ or ‘Scaeva’]

[Figure: data X and labels y]
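As a rough illustration of the classification problem y = f(x), the sketch below learns f from labelled examples with a nearest-centroid rule. The two-number feature vectors, the helper names `fit_centroids` and `predict`, and the choice of a nearest-centroid rule are all assumptions made for illustration; only the labels ‘Stomorhina’ and ‘Scaeva’ come from the slide.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def fit_centroids(X, y):
    """Learn one mean feature vector (centroid) per class label."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in groups.items()
    }


def predict(centroids, x):
    """f(x): return the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


# Hypothetical toy data: two made-up measurements per insect.
X_train = [(1.0, 1.2), (0.8, 1.0), (3.0, 3.1), (3.2, 2.9)]
y_train = ["Stomorhina", "Stomorhina", "Scaeva", "Scaeva"]

centroids = fit_centroids(X_train, y_train)
label_a = predict(centroids, (0.9, 1.1))
label_b = predict(centroids, (3.1, 3.0))
```

Here X plays the role of the training inputs and y the known labels; once fitted, `predict` stands in for f on new inputs.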

INTERMEDIARIES: THE GROWING INTEREST IN « CROWDS » -> EXPLOSION OF TOOLS (Olga Kokshagina, 2015)

• Crowdsourcing is a model that leverages novel technologies (web 2.0, mobile apps, social networks)
• to build content and a structured set of information by gathering contributions from large groups of individuals

Most classical data challenges are HR and publicity events

We decided to turn them into a tool for

1. Collaborative prototyping
2. Teaching aid
3. Data science process management

Funded by Université Paris-Saclay and CNRS


RAMP.STUDIO DATA CHALLENGE WITH CODE SUBMISSION

Code submission

1. lets us deliver a working prototype
2. lets the participants collaborate
3. makes the backend challenging to run (cloud management)
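To make the idea of code submission concrete, here is a minimal sketch of what a backend might do with a submitted file: load the code, train the model it defines, and collect predictions. The class name `Classifier`, the `fit`/`predict` interface, and the helper `run_submission` are assumptions for illustration, not a description of the actual RAMP backend.

```python
# Hypothetical submission, received by the backend as source text.
SUBMISSION_SOURCE = '''
class Classifier:
    """Predicts the most frequent training label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]
'''


def run_submission(source, X_train, y_train, X_test):
    """Load submitted code, train it, and return its test predictions."""
    namespace = {}
    # A real backend would sandbox this step; exec is used only for brevity.
    exec(source, namespace)
    model = namespace["Classifier"]()
    model.fit(X_train, y_train)
    return model.predict(X_test)


y_pred = run_submission(
    SUBMISSION_SOURCE,
    X_train=[[0], [1], [2]],
    y_train=["a", "b", "b"],
    X_test=[[3], [4]],
)
```

Because the organizers hold the code rather than a file of predictions, the trained model is itself a working prototype, and any participant's submission can be read and reused by the others.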

RAPID ANALYTICS AND MODEL PROTOTYPING

functional data, time series, data augmentation, deep learning, learning on simulations, nonstandard and multi-objective losses

CURRENT RAMPS


DATA SCIENCE THEMES


DATA DOMAINS

RAMP.STUDIO DATA CHALLENGE WITH CODE SUBMISSION

• 21 challenges, 39 events, 1205 users, 5952 predictive models
• 12 hackathons, 6 remote data challenges, 11 course data camps

CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA

[Diagram: chemotherapy drug in elastic pocket → laser spectrometer → molecular spectra; the spectra feed feature extractor 1 and feature extractor 2, a classifier that outputs the drug type, and a regressor that outputs the concentration]
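The two-output workflow above can be sketched as two small pipelines sharing one input spectrum. Everything below is a toy under stated assumptions: the pairing of feature extractor 1 with the classifier and feature extractor 2 with the regressor, the features themselves, the `peak_to_drug` lookup, and the `slope` value are hypothetical and not taken from the slide.

```python
def extract_shape(spectrum):
    """Feature extractor 1 (assumed): normalise intensities so only shape remains."""
    total = sum(spectrum)
    return [value / total for value in spectrum]


def classify(shape, peak_to_drug):
    """Classifier stand-in: map the position of the highest peak to a drug type."""
    peak_index = max(range(len(shape)), key=shape.__getitem__)
    return peak_to_drug[peak_index]


def extract_intensity(spectrum):
    """Feature extractor 2 (assumed): total intensity as a single feature."""
    return sum(spectrum)


def regress(intensity, slope):
    """Regressor stand-in: concentration proportional to total intensity."""
    return slope * intensity


spectrum = [0.5, 4.0, 0.5, 1.0]            # hypothetical spectrum
peak_to_drug = {1: "drug_A", 3: "drug_B"}  # hypothetical lookup table

drug_type = classify(extract_shape(spectrum), peak_to_drug)
concentration = regress(extract_intensity(spectrum), slope=0.5)
```

Splitting the workflow into four replaceable pieces is what lets participants submit, for example, a better feature extractor without touching the regressor.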

FORECASTING EL NIÑO SIX MONTHS AHEAD

[Diagram: … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 … → time series feature extractor → x (a fixed length feature vector) → regressor]
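One simple way to turn a time series into fixed-length feature vectors is a sliding window, sketched below. The helper `make_windows`, the window length of 2, and the reading of the eight values as consecutive monthly measurements are assumptions for illustration; the numbers themselves are the ones shown on the slide.

```python
def make_windows(series, window, horizon):
    """Turn a series into (fixed-length feature vector, future target) pairs."""
    X, y = [], []
    for end in range(window, len(series) - horizon + 1):
        X.append(series[end - window:end])   # the last `window` observations
        y.append(series[end - 1 + horizon])  # value `horizon` steps after the window
    return X, y


# Values from the slide, assumed here to be consecutive monthly readings.
series = [300.14, 299.83, 298.76, 299.87, 299.82, 300.15, 300.10, 299.50]

# Forecast six steps ahead from the two most recent observations.
X, y = make_windows(series, window=2, horizon=6)

# A shorter horizon yields more training pairs from the same series.
X_short, y_short = make_windows(series, window=3, horizon=2)
```

Any regressor can then be trained on the pairs (X, y), since every row of X has the same length regardless of how long the original series is.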

THE POWER OF THE (COLLABORATING) CROWD

[Figure annotations: what you achieved with a well tuned deep net; the diversity gap; the human blender gap; competitive phase; collaborative phase]
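One common reading of the gain from diversity is that blending models with different errors can beat the best single model. The toy below illustrates this with plain averaging; the numbers are made up, and simple averaging is an assumption, not necessarily the blending method used in RAMP.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and two models that err in opposite directions.
y_true = [1.0, 2.0, 3.0]
model_a = [1.4, 2.4, 3.4]  # overestimates
model_b = [0.8, 1.8, 2.8]  # underestimates

# Blend: average the two models' predictions point by point.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

error_a = mean_squared_error(y_true, model_a)
error_b = mean_squared_error(y_true, model_b)
error_blend = mean_squared_error(y_true, blend)
```

In this toy the blend has a lower error than either model alone because the two sets of errors partly cancel.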

THE DYNAMICS OF COLLABORATION

[Figure labels: starting kit, the crowd, early influencers, inventors]

OPEN PHASE LETS PARTICIPANTS CATCH UP: THE GOAL OF TEACHING

COMMUNICATION AND REUSE

ONGOING CHALLENGES: CLASSIFY POLLINATING INSECTS

• 4.5K€ for the competitive phase
• 3K€ for the collaborative phase
• 50 GPU hours per participant

https://www.ramp.studio/events/pollenating_insects_3_JNI_2017

ONGOING CHALLENGES: DETECTING MARS CRATERS

ONGOING CHALLENGES: FAKE NEWS (PREDICT THE TRUTHFULNESS OF NEWS)

ONGOING CHALLENGES: KAGGLE SEGURO (PREDICT INSURANCE CLAIMS)

UPCOMING CHALLENGE: PREDICT AUTISM FROM BRAIN SCANS

Setting up the RAMP was long and hard.

Separate workflow building and workflow optimization

Before solving the problem, set it up (even put it into production)


FRUGAL DATA SCIENCE PROCESS MANAGEMENT

THE DATA FLOW

[Diagram: data → connectors → X → predictive workflow (CAL, FE, CLF) → ypred → full automation / production, or dashboard / decision support]

THE IDEAL SEQUENCE

[Diagram labels: POC report; y from an expert labeler, Amazon Turk, or a simulator; ypred, score type, score; cross-validation scheme; RAMP; full automation / production; dashboard / decision support; data flow; data connectors; X; workflow (CAL, FE, CLF); "works on": data scientists, business unit, domain scientists, data value architects, data engineers]

RAMP-WORKFLOW & RAMP-KITS

• toolkit: https://github.com/paris-saclay-cds/ramp-workflow
  • for designing workflows
  • set of ready-made metrics, workflows, CV schemes, data readers
  • a single command-line test script
• examples: https://github.com/ramp-kits
  • a zoo of problems, experiments, workflows
  • (at least) one initial solution

A SINGLE SCRIPT TO DEFINE THE BUNDLE

[Diagram: data connectors → X → workflow (FE, CLF) → ypred → score type → score, evaluated under a cross-validation scheme]
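The pieces of such a bundle (data connector, workflow, score type, cross-validation scheme) can be sketched in one short script. This is a plain-Python analogue written for illustration only: the names `get_data`, `k_fold`, `accuracy`, `Workflow`, and `evaluate`, as well as the toy data, are hypothetical and are not the ramp-workflow API.

```python
def get_data():
    """Data connector (hypothetical): return features X and labels y."""
    X = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.1]]
    y = [0, 0, 0, 1, 1, 1]
    return X, y


def k_fold(n_samples, n_folds):
    """Cross-validation scheme: yield (train_indices, test_indices) per fold."""
    for fold in range(n_folds):
        test = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, test


def accuracy(y_true, y_pred):
    """Score type: fraction of correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


class Workflow:
    """FE followed by CLF: a feature extractor feeding a threshold classifier."""

    def fit(self, X, y):
        features = [row[0] for row in X]  # FE: keep the first column
        zeros = [f for f, label in zip(features, y) if label == 0]
        ones = [f for f, label in zip(features, y) if label == 1]
        # CLF: threshold halfway between the two class means
        self.threshold_ = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
        return self

    def predict(self, X):
        return [int(row[0] > self.threshold_) for row in X]


def evaluate(workflow_cls, n_folds=3):
    """Train and score the workflow on every fold; return the mean score."""
    X, y = get_data()
    scores = []
    for train, test in k_fold(len(X), n_folds):
        model = workflow_cls().fit([X[i] for i in train], [y[i] for i in train])
        y_pred = model.predict([X[i] for i in test])
        scores.append(accuracy([y[i] for i in test], y_pred))
    return sum(scores) / len(scores)


mean_score = evaluate(Workflow)
```

Keeping the score and the cross-validation scheme in the bundle, outside the submitted workflow, means every submission is evaluated in exactly the same way.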


A SINGLE EXECUTABLE TO TEST THE SUBMISSIONS

• Keep your different submissions in a simple file structure

• Communicate them on git

• Execute them also from the notebook

You can

1. Participate in upcoming RAMPs
2. Use RAMP in teaching or training
3. Use the toolkit for your own workflows
4. Submit it to us if you want to run a data challenge

toolkit: github.com/paris-saclay-cds/ramp-workflow

examples: github.com/ramp-kits

blogs: medium.com/@balazskegl

slack: ramp-studio.slack.com

frontend: www.ramp.studio

mail: [email protected]