Project acronym: EDSA Project full name: European Data Science Academy Grant agreement no: 643937 D2.2 Data Science Curricula 2 Deliverable Editor: Chris Phethean (SOTON) Other contributors: Elena Simperl (SOTON) Module curricula prepared by partners as noted in document: ODI, FRAUNHOFER, TU/E, PERSONTYLE, JSI, KTH Deliverable Reviewers: Shatha Jaradat (KTH), Angi Voss (Fraunhofer) Deliverable due date: 31/07/2016 Submission date: 29/07/2016 Distribution level: P Version: 1.0 This document is part of a research project funded by the Horizon 2020 Framework Programme of the European Union



Change Log

Version | Date | Amended by | Changes
0.1 | 03/06/2016 | Chris Phethean | Structure of document and initial content.
0.2 | 15/06/2016 | Chris Phethean | Linked Data module added
0.3 | 16/06/2016 | Chris Phethean | Revised v2 curriculum detailed
0.4 | 04/07/2016 | Chris Phethean | New modules for M18 added and detailed using contributions from relevant partners.
0.5 | 05/07/2016 | Chris Phethean | Curricula revisions to M6 modules made.
0.6 | 07/07/2016 | Chris Phethean | Skills assignment added. Conclusions written.
0.7 | 07/07/2016 | Chris Phethean | Changes and updates in advance of review. Final content added.
0.8 | 08/07/2016 | Chris Phethean | Final version of skills classification added.
0.9 | 19/07/2016 | Chris Phethean | Reviewer comments addressed
0.9a | 19/07/2016 | Elena Simperl | Scientific Review
1.0 | 27/07/2016 | Aneta Tumilowicz | Final QA


2016 © Copyright lies with the respective authors and their institutions.

Table of Contents

Change Log
Table of Contents
List of Tables
1. Executive Summary
2. Introduction
2.1 Recap of curricula version 1
2.1.1 The core EDSA curriculum
2.1.2 Modules released in Year 1
2.2 Delivery of learning resources
3. Insights from demand analysis
4. New curricula released for version 2
4.1 Overview of version 2 release
4.2 Revised modules in version 2
4.2.1 Foundations of data science (SOTON)
4.2.2 Essentials of Data Analytics and Machine Learning (Persontyle)
4.2.3 Distributed computing (KTH)
5. New modules in version 2
5.1 Statistical and mathematical foundations - JSI
5.1.1 Learning Objectives
5.1.2 Syllabus and Topic Descriptions
5.1.3 Existing Courses
5.1.4 Existing Materials
5.1.5 Descriptions of Exercises
5.1.6 Further Reading
5.2 Data management and curation - TU/E
5.2.1 Learning Objectives
5.2.2 Syllabus and Topic Descriptions
5.2.3 Existing Courses
5.2.4 Existing Materials
5.2.5 Example Quizzes and Questions
5.2.6 Description of Exercises
5.2.7 Further Reading
5.3 Big data analytics - Fraunhofer
5.3.1 Learning Objectives
5.3.2 Syllabus and Topic Descriptions
5.3.3 Existing Courses
5.3.4 Existing Materials
5.3.5 Example Quizzes and Questions
5.3.6 Further Reading
5.4 Data visualisation and storytelling - SOTON/ODI
5.4.1 Learning Objectives
5.4.2 Syllabus and Topic Descriptions
5.4.3 Existing Courses
5.4.4 Existing Materials
5.4.5 Example Quizzes and Questions
5.4.6 Description of Exercises
5.4.7 Further Reading
5.5 Linked Data and the Semantic Web - SOTON
5.5.1 Learning Objectives
5.5.2 Syllabus and Topic Descriptions
5.5.3 Existing Courses
5.5.4 Existing Materials
5.5.5 Example Quizzes and Questions
5.5.6 Description of Exercises
5.5.7 Further Reading
6. Data science skill groups
7. Conclusion

List of Tables

Table 1: Core EDSA Curriculum, version 1
Table 2: Topic Schedule and Partner Allocation
Table 3: Recommendations for EDSA curriculum development, originally presented in D1.4
Table 4: Core EDSA Curriculum, version 2.0
Table 5: Existing courses that include a module similar to Statistical and Mathematical Foundations
Table 6: Skills that each module curricula targets


1. Executive Summary

This deliverable presents the second version of the EDSA curriculum, which comprises 15 core data science topics that are being delivered by the 3-year project. Previously, we released the first version of the curriculum containing 6 of the 15 modules, each of which had a specific curriculum containing learning objectives, topic descriptions and various resources and materials that had been identified. Modules are separated into four stages: foundations, data storage and processing, data analysis, and data interpretation and use. These categories remain in version 2 and provide structure to our courses. We present new modules in this deliverable, reflecting a number of changes to the curriculum's list of topics, along with revisions to the previously released modules based on learner feedback, teaching experience and further demand analysis.

2. Introduction

Before presenting the new version of the EDSA curriculum, we first revisit the initial version released in July 2015 based on preliminary demand analysis. This section recaps the main curriculum, comprising fifteen topics, and discusses the progress made to date in implementing the modules released in Year 1 through learning resources now available on the EDSA courses portal.

2.1 Recap of curricula version 1

In D2.1 (Data Science Curricula 1), we released the first version of the EDSA curriculum. 15 core topics were presented, with the intention to deliver these in groups over the course of the 3-year project. At that time (Month 6), the first 6 modules were released, each provided by a different EDSA partner. The following subsections give an overview of the curriculum (2.1.1) and then revisit the 6 modules that were released in version 1 (2.1.2).

2.1.1 The core EDSA curriculum

We presented the 15 topics that make up the core data science curriculum for EDSA, which are reproduced below:

Table 1: Core EDSA Curriculum, version 1

Module | Topic | Stage
1 | Foundations of Data Science | Foundations
2 | Foundations of Big Data | Foundations
3 | Statistical / Mathematical Foundations | Foundations
4 | Programming / Computational Thinking (R and Python) | Foundations
5 | Data Management and Curation | Storage and Processing
6 | Big Data Architecture | Storage and Processing
7 | Distributed Computing | Storage and Processing
8 | Stream Processing | Storage and Processing
9 | Machine Learning, Data Mining and Basic Analytics | Analysis
10 | Big Data Analytics | Analysis
11 | Process Mining | Analysis
12 | Data Visualisation | Interpretation and Use
13 | Visual Analytics | Interpretation and Use
14 | Finding Stories in Open Data | Interpretation and Use
15 | Data Exploitation including data markets and licensing | Interpretation and Use


Modules were structured and classified according to the stage or theme in the general data science process that they were most suited to (foundations, storage and processing, analysis, and interpretation and use). By covering these stages equally, we ensure a broad focus that allows different audiences and stakeholders with varying needs to access our materials and find relevance and value in them.

2.1.2 Modules released in Year 1

Table 2: Topic Schedule and Partner Allocation

Topic | Schedule | Allocated Partner
Foundations of Data Science | M6 | Soton
Foundations of Big Data | M6 | JSI
Big Data Architecture | M6 | Fraunhofer
Distributed Computing | M6 | KTH
Machine Learning, Data Mining and Basic Analytics | M6 | Persontyle
Process Mining | M6 | TU/e
Statistical / Mathematical Foundations | M18 | JSI
Data Management and Curation | M18 | TU/e
Big Data Analytics | M18 | Fraunhofer
Data Visualisation | M18 | Soton
Finding Stories in Open Data | M18 | ODI
Programming / Computational Thinking (R and Python) | M30 | Soton
Stream Processing | M30 | KTH
Visual Analytics | M30 | Fraunhofer
Data Exploitation including data markets and licensing | M30 | ODI

Following the schedule outlined in Table 2, curricula for the first six modules were released in M6:

● Foundations of Data Science
● Foundations of Big Data
● Big Data Architecture
● Distributed Computing
● Machine Learning, Data Mining and Basic Analytics
● Process Mining

Each curriculum contains the following details:

● Learning Objectives
● Syllabus and Topic Descriptions
● Existing Courses
● Existing Materials
● Example Quizzes and Questions
● Description of Exercises
● Further Reading

Each module is released as a syllabus which can then be used to deliver various types of courseware, in various delivery formats. This means that partners can target different audiences, sectors or levels of experience by delivering different types of courses, for example a face-to-face training course, an online MOOC, or a set of self-study learning resources.


2.2 Delivery of learning resources

Following the release of the first version of the curriculum in M6, partners produced learning resources for the first six modules and released them in M12, as outlined in D2.4. We have made these courses available on the EDSA courses portal (http://courses.edsa-project.eu), where a range of delivery channels and formats is available:

● Self-study courses, allowing learners to study open educational resources at their own pace.
● MOOCs, hosted on external platforms such as FutureLearn and Coursera, allowing social learning in large, online courses.
● Blended courses, combining face-to-face elements with online materials, which are offered by EDSA partners and associates.
● Face-to-face courses, which are delivered in person by EDSA partners and associates.

Each of the M6 modules has at least a self-study course option available on the site, while courses such as TU/e's 'Process Mining' module also have further options such as MOOCs. The courses portal also includes additional courses from the project partners that are related to the EDSA curriculum.

3. Insights from demand analysis

The demand analysis carried out in WP1 and reported in D1.4 yielded a number of insights into the curriculum, which this deliverable builds upon. As summarised in D1.4, it can be argued that the EDSA curriculum presented in M6 is already generally in line with industry demands across Europe. As such, the current focus on comprehensively training the technical and analytical aspects of data science should be maintained.

The demand analysis revealed scope for innovation in integrating this technical and analytical training into a more holistic approach to skills development, expanding to cover supporting skills that can increase the impact of data science within organisations. The recommendations from the demand analysis, initially presented in D1.4, are reproduced in Table 3. The highlighted rows (recommendations 2-4 inclusive) represent those which influence the curriculum design; some are reflected in this deliverable, while the others will guide our revisions over the next year before the final release in M30.

Table 3: Recommendations for EDSA curriculum development, originally presented in D1.4

Title | Intervention level | Summary description
Holistic training approach | General training approach | Refine the EDSA's training approach and curriculum cycle to strengthen skills along the full data exploitation chain.
Open source based training | Existing curriculum design | Continue current technical and analytical training based on open source technologies; apply cross-tool focus to deliver overarching training.
Soft skills training | Expansion of curriculum | Implement soft skill training to increase performance and organisation impact of data scientists / data science teams.
Basic data literacy training | Expansion of curriculum | Develop basic data literacy and data science training for non-data scientists to improve basic skills across organisations and facilitate uptake of data-driven decision making and operations.
Blended training | Course delivery | Develop blended training approaches including sector-specific exercises and examples to increase effectiveness of training delivery.
Data science skills framework | Training approach and delivery | Implement data science skills framework to structure skills requirements, assess skills of data scientists, and identify individual skills needs.
Navigation and guidance | Training market | Develop quality assessment of third party courses; provide navigation support to identify relevant trainings from EDSA and third parties.


4. New curricula released for version 2

The following sections discuss the revisions to the curriculum based on the feedback discussed in Section 3, and any changes that have been made to the individual modules following their initial release.

4.1 Overview of version 2 release

Building on our experience of releasing the first version of the curriculum, together with further study of the data science landscape in Europe, we have made a number of modifications to the curriculum.

Linked Data

After reviewing the curriculum and key technological developments around data science, we observed that there was no content covering linked data or semantic web technologies. Having assessed the availability of learning resources in this area, we decided that, as well as improving our coverage of data management skills, this topic could be used as an initial module to test the FutureLearn platform for delivering MOOCs. Drawing on the previous EU project 'EUCLID' [1], whose consortium included a number of the EDSA partners, a further module was added to the EDSA data science curriculum, adapted from the syllabus and learning resources created in EUCLID.

The linked data module, titled 'An Introduction to Linked Data and the Semantic Web', both fills a gap in our curriculum and allowed us to trial the running of a course on FutureLearn based on existing, high-quality, open learning materials. This meant that we were able to quickly adapt the content from the EUCLID iBook for release as a MOOC on FutureLearn in April 2016 and provide insights into the platform to the rest of the project before moving forward with using it for the other modules. Statistics regarding the number of learners who registered on the course are presented in D3.2 (Report on delivery of video lectures, webinars and face-to-face trainings 2).

Modifications to curriculum structure

According to the original schedule, the M18 curriculum would include the following new modules:

● Statistical / Mathematical Foundations
● Data Management and Curation
● Big Data Analytics
● Data Visualisation
● Finding Stories in Open Data

After reviewing the intended aims for the 'Data Visualisation' and 'Finding Stories in Open Data' modules, a significant overlap was identified, which meant that many of the concepts would be repeated across the two modules in order to convey their narrative effectively. As such, the leaders of these modules (Southampton for Data Visualisation, and the ODI for Finding Stories) collaborated on designing a merged curriculum for 'Data Visualisation and Storytelling'. Southampton and the ODI will jointly release resources as a self-study option for this course as planned in M24, and will additionally collaborate to provide either an online course or eBook covering the entire curriculum.

In addition, the 'Visual Analytics' module due to be delivered by Fraunhofer in M30 was replaced with 'Social Media Analytics', in order to teach crucial expertise on how to extract value from the ever-growing amounts of social data. This broadens our curriculum to ensure that we cover a greater range of essential data science topics, rather than focusing intently on topics where overlap would have been unavoidable.

[1] http://euclid-project.eu/

Revised core curriculum

Given the changes discussed above, this deliverable sees the release of version 2.0 of the core EDSA curriculum, which is presented below. As the data visualisation and finding stories modules have been merged into one topic, and because we have added the linked data topic, the curriculum still comprises 15 modules.

Table 4: Core EDSA Curriculum, version 2.0

Module | Topic | Stage | Status as of D2.2
1 | Foundations of Data Science | Foundations | Released and revised
2 | Foundations of Big Data | Foundations | Released
3 | Statistical / Mathematical Foundations | Foundations | Newly Released
4 | Programming / Computational Thinking (R and Python) | Foundations | To be released M30
5 | Data Management and Curation | Storage and Processing | Newly Released
6 | Big Data Architecture | Storage and Processing | Released
7 | Distributed Computing | Storage and Processing | Released and revised
8 | Stream Processing | Storage and Processing | To be released M30
9 | Linked Data and the Semantic Web | Storage and Processing | (Released)*
10 | Machine Learning, Data Mining and Basic Analytics | Analysis | Released and Revised
11 | Big Data Analytics | Analysis | Newly Released
12 | Process Mining | Analysis | Released
13 | Social Media Analytics | Analysis | To be released M30
14 | Data Visualisation and Storytelling | Interpretation and Use | Newly Released
15 | Data Exploitation including data markets and licensing | Interpretation and Use | To be released M30

* Module based on EUCLID curriculum, and released as FutureLearn MOOC in April 2016.

4.2 Revised modules in version 2

This section outlines changes to the topics covered in modules that have been updated since their curricula were released in M6: 'Foundations of Data Science', 'Distributed Computing', and 'Machine Learning, Data Mining and Basic Analytics'.

4.2.1 Foundations of data science (SOTON)

The syllabus for foundations of data science was revised to incorporate a small change of topics following the experiences of delivering the course as a face-to-face module within Southampton's MSc Data Science programme. Topics were also re-organised to fit within four categories that correspond to stages of the typical data science pipeline:


● The Foundations of Data Science
    ○ Introduction
        ■ This section covers the introduction to the course, and details the learning outcomes as originally laid out in D2.1.
    ○ Data Science in a nutshell
    ○ Terminology
    ○ The data science process
    ○ A data science toolkit
    ○ Types of data
    ○ Example applications
● Data Collection and Management
    ○ Data collection as the first stage of data science
    ○ Where data comes from
    ○ Pre-processing
    ○ Cleaning and filling in missing data
        ■ Fixing data
        ■ Noisy data
    ○ Data integration
    ○ Using the cloud and principles of cloud computing
    ○ Document related storage
        ■ MongoDB
● Fundamentals of Statistics and Data Analysis
    ○ Introduction to data mining
        ■ Disciplinary components
        ■ Machine learning and statistics
        ■ Applications of data mining
            ● Real-world examples
    ○ Introduction to statistics
        ■ Nature of statistics and introduction
        ■ Central tendencies and distributions
        ■ Variance
        ■ Distribution properties and arithmetic
        ■ Samples / CLT
        ■ Linear regression
    ○ Using the R statistical package
    ○ Introduction to Machine Learning
        ■ What is machine learning? What can it do?
            ● Example applications
        ■ Supervised v unsupervised learning
        ■ Prerequisites for Machine Learning
        ■ Regression problems
        ■ Classification problems
            ● Linearly separable v nonlinearly separable
            ● Naive Bayes
            ● Support Vector Machine
● Data Visualisation
    ○ Introduction to data visualisation
    ○ Types of data visualisation
        ■ Exploratory
        ■ Explanatory
    ○ Data for visualisation
        ■ Data types
        ■ Data encodings
        ■ Retinal variables
        ■ Mapping variables to encodings
        ■ Visual encodings
    ○ Chart junk and the beauty paradox
    ○ Technologies for visualisation
        ■ Web Technologies and their role in visualisation
            ● Hypertext
            ● CSS
            ● JavaScript
        ■ D3.js
            ● D3 chain syntax
            ● D3 Libraries
            ● Alternatives to D3: Python, high-level tools etc.
    ○ Data-driven journalism and Storytelling


4.2.2 Essentials of Data Analytics and Machine Learning (Persontyle)

This module is a renamed and revised version of Persontyle's 'Machine Learning, Data Mining and Basic Analytics' module that launched in M6, based on the experiences of developing the learning materials and delivering the course. The list of topics and their associated descriptions have been modified as follows:

Lesson 1: R REFRESHER

○ This module provides an overview of the R programming language. It is intended to be much more than a quick introduction or refresher in R. It is designed to function as a source that you can reference when working with R on the examples and exercises in the remainder of the guide. It contains not only a lot of information about the R programming language, but also general information that is important to know when programming in any language.

Lesson 2: FEATURE INITIALIZATION

○ We begin our pre-processing modules by considering the initial steps that must be taken to transform the raw data that has been collected into a set of features that can be examined and transformed in the further pre-processing steps, and eventually used in statistical modelling.

Lesson 3: FEATURE SELECTION

○ Once quantitative data has been generated we should examine the features that we have available. In machine learning tasks, we often have access to far more features than we can hope to use, and an important sub-problem is selecting an information-rich subset of features to use to build our statistical models. It is important to note that more features are not necessarily better: features may or may not contain information about the target variable, but they will certainly contain noise. Further, working with additional variables tends to entail additional parameters in our models, which can increase the potential for overfitting.

Lesson 4: FEATURE TRANSFORMATION

○ Feature transformation is the name given to replacing our original features with functions of these features. The functions can either be of individual features, or of groups of them. The result of these functions is itself a random variable; by definition the function of any random variable is itself a random variable. Utilizing feature transformations is equivalent to changing the bases of our feature space, and it is done for the same reason that we change bases in calculus: we hope that the new bases will be easier to work with.
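As a minimal sketch of the two kinds of transformation mentioned in Lesson 4 (a function of a single feature and a function of a group of features), the Python snippet below may help. The course itself works in R; the feature names, the data and the choice of a logarithm and a ratio are assumptions made purely for illustration.

```python
import math

def log_transform(values):
    """Function of an individual feature: replace each positive value by its natural log."""
    return [math.log(v) for v in values]

def ratio_feature(numerators, denominators):
    """Function of a group of features: combine two features into one ratio."""
    return [n / d for n, d in zip(numerators, denominators)]

# Hypothetical data: incomes spanning several orders of magnitude.
incomes = [1_000.0, 10_000.0, 100_000.0]
household_sizes = [1.0, 2.0, 4.0]

log_incomes = log_transform(incomes)            # evenly spaced after the transform
income_per_person = ratio_feature(incomes, household_sizes)
```

On the log scale the three incomes are equally spaced, which is one sense in which a transformed feature can be easier to work with than the original.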

Lesson 5: MISSING DATA

○ This module contains many references to later modules. This is because imputing missing data generally requires the use of statistical models with which to do the imputation. It is included here as it is part of pre-processing, but readers may wish to return to this module after reading the statistical modeling modules.

Lesson 6: BASIC REGRESSION MODELS

○ In this module we will look at some of the simplest regression models: what they are, how they work, how they can be built from data and how they can be evaluated.
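To make "built from data" and "evaluated" concrete, here is a small sketch of one of the simplest regression models, a straight line fitted by ordinary least squares and scored by mean squared error. The lesson does not say which models it covers, so the choice of model, the function names and the toy data are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the model y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mean_squared_error(xs, ys, intercept, slope):
    """Average squared difference between observed and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data lying exactly on y = 1 + 2x, so the fit should recover those values.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_line(xs, ys)
mse = mean_squared_error(xs, ys, intercept, slope)
```

Evaluating on the same data used for fitting, as done here for brevity, gives an optimistic error estimate.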

Lesson7:MODELEVALUATION,SELECTIONANDREGULARIZATION

o Whencreating,selectingandevaluatingstatisticalmodels,youshouldthinkintermsofathreestepprocess:

Training:Trainmodelsusingdedicatedtrainingdata.

Evaluation/Validation:Selectthebestperformingmodel,eitherusingdedicatedvalidationdataorstatisticalmethods.Wewillcallthisstepevaluationandreservethetermvalidationfortheevaluation/testingtechniquediscussedinthefollowingsections.

Testing: Obtain unbiased estimates of the performance of the selected model using dedicated test data or statistical methods.

o The aim of this module is to clarify why these three steps are required and provide you with the means of performing the second and third steps.
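
The three-step process can be sketched as follows, assuming dedicated training, validation and test subsets. The data, the interleaved split and the two candidate models (a constant predictor and a least-squares line) are illustrative assumptions, not the models used in the lesson.

```python
def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Candidate 2: ordinary least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: y is roughly 2x + 1 with small fixed perturbations.
xs = list(range(12))
ys = [2 * x + 1 + (0.3 if x % 2 else -0.3) for x in xs]

# Interleaved split into three disjoint subsets.
train_x, train_y = xs[0::3], ys[0::3]
valid_x, valid_y = xs[1::3], ys[1::3]
test_x, test_y = xs[2::3], ys[2::3]

# Step 1, training: fit every candidate on the training data only.
candidates = {"mean": fit_mean, "line": fit_line}
models = {name: fit(train_x, train_y) for name, fit in candidates.items()}

# Step 2, evaluation: select the candidate with the lowest validation error.
best_name = min(models, key=lambda name: mse(models[name], valid_x, valid_y))

# Step 3, testing: estimate the selected model's error on data used for nothing else.
test_error = mse(models[best_name], test_x, test_y)
```

Because the test subset plays no part in fitting or selection, its error is not biased by either of the first two steps.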

Lesson 8: BASIC CLASSIFICATION METHODS

o In this module we look at a number of basic classification methods. We first look at cases where the independent variables are real, and examine the following algorithms:

Linear and Quadratic Discriminant Analysis

The Perceptron Algorithm

Logistic Regression

o We then look at the case where both the independent and target variables are discrete, which we will term discrete-discrete classification.

Lesson 9: ANALYSIS OF CLASSIFICATION MODELS

o We introduced misclassification error in module 8, where we also discussed the use of mean squared error for probability estimates. These are the most common error scores used for fitting parameters. You should note that misclassification error is the complement of the simple accuracy statistic discussed in the next section, and hence shares its problems.

Lesson 10: LOCAL/NON-PARAMETRIC METHODS

o Local methods are so named because, when estimating a new case, they work with a subset of the training data found to be near to the new case, using some distance metric or spatial division.
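
A k-nearest-neighbour regressor is one simple instance of a local method. The sketch below assumes a single real-valued feature, absolute difference as the distance metric and hypothetical training data.

```python
def knn_predict(train, query, k):
    """Average the targets of the k training points nearest to the query."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical training data as (feature, target) pairs.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0), (4.0, 16.0)]
prediction = knn_predict(train, query=2.2, k=2)
```

Only the two training cases closest to the query contribute to the estimate; the remaining cases are ignored entirely.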

Lesson 11: BASIC KERNEL METHODS

o The order of topics is as follows: we will provide a basic overview of the kernel trick. We then list some common kernels, followed by an example of an application of the kernel trick with kernel regression. We then provide background on the theory behind the kernel trick, which delves into quite complex mathematics and which may be skipped by the intimidated reader. Finally, we examine kernel PCA.

Lesson 12: SPLINES

o Splines are a clever development of the more basic regression techniques that can be understood as implementing two ideas:

Instead of generating a single model for the entire feature space, we can divide it into different sections and model each of these with an individual model.

To combine these models elegantly and plausibly we will require that the regression lines meet with a degree of smoothness at the section boundaries.

Lesson 13: GAUSSIAN PROCESSES (New lesson to be delivered in August 2016)

o A stochastic process is a collection of random variables, often representing the indeterminate evolution of some system over time.

o Stochastic processes are:

Infinite variate extensions of multivariate distributions

Models of stochastic evolution of systems over time

Distributions over functions

Lesson 14: TREES & ENSEMBLE METHODS (New lesson to be delivered in August 2016)

o In this module we introduce decision tree based methods and use them to examine a number of ensemble methods. This combination of topics is not accidental – tree models are almost always used as part of ensembles, and the most popular ensemble methods are tree based.

o Tree based methods are sometimes described as radically non-linear. In fact, they perform a form of discretization, where the feature space is divided into regions and a value estimated for the target variable in each region.
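
The discretization idea can be seen in the smallest possible tree, a single-split regression "stump". The sketch below assumes one real-valued feature, squared error as the split criterion and hypothetical data; full tree learners repeat this step recursively within each region.

```python
def fit_stump(xs, ys):
    """Find the threshold that splits the data into two regions with least squared error.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict(stump, x):
    """Return the value estimated for the region that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


# Hypothetical data with two clearly separated regimes.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = fit_stump(xs, ys)
```

Every query in the same region receives the same estimate, which is the sense in which the feature space has been discretized.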

Lesson 15: SUPPORT VECTOR MACHINES (New lesson to be delivered in August 2016)

o In this module we introduce another well-known and very high performing machine learning technique: support vector machines (SVMs). We will explain SVMs in terms of solving the optimization problem for linear classifiers known as finding the maximally separating hyperplane when classes are inseparable. We examine the basic case, known as support vector classifiers, before looking at how we can obtain a more powerful procedure by transforming the variables we work with using kernels.

Lesson 16: NEURAL NETWORKS AND DEEP LEARNING

o Artificial neural networks (ANNs or just NNs) are perhaps the most famous machine learning technique. They have ridden repeated waves of hype and disappointment since first being formulated in the 1940s. With the development of deep neural networks (DNNs) they have proven to perform very well on a large number of tasks and we are currently in another period of strong interest in neural network techniques. This interest can be negative – too many students and data scientists have been caught up in the current hype cycle and place undue interest in what is merely one tool among many. Nonetheless, they are a powerful tool and deserve a place in any data scientist’s arsenal.

4.2.3 Distributed computing (KTH)

KTH revised the syllabus for distributed computing based on experiences of teaching courses based on the M6 version. The existing topic list remained the same; however, a third part was added to cover new and further topics:

Part I: Distributed Services and Distributed Algorithms

● Unchanged from D2.1

Part II: Clouds, Peer-to-Peer and Big Data

● Unchanged from D2.1

Part III: Follow-up courses in the Distributed Computing module

A. Distributed Artificial Intelligence and Intelligent Agents

● Introduction to Agents
○ This part introduces the notion of an intelligent agent and considers individual and group perspectives of agent technology.

● Negotiation
○ In this part we will consider negotiation principles, negotiation protocols (voting, monotonic concession, Clarke tax, auctions) and negotiation mechanisms (organizational structures, meta-level information exchange, multi-agent planning).

● Coordination
○ This part introduces coordination principles and considers coordination mechanisms (organizational structures, meta-level information exchange, multi-agent planning, norms and laws, and cooperative work).

● Interoperability
○ Here we consider approaches to software interoperation, speech acts and Agent Communication Languages (ACL), KQML, FIPA.

● Multi-Agent Systems architectures
○ In this part we discuss low-level architecture support, DAI testbeds, middle-agent based architectures and agent marketplaces.

● Software Engineering of Multi-Agent Systems
○ This part considers basic approaches to agent systems engineering, as well as GAIA and agent-UML approaches.

● Agent theory
○ This part introduces agent theory, basics of modal logic and BDI-architecture.

● Agent architectures
○ Here we discuss deliberative, reactive and hybrid agent architectures.

● Agent mobility
○ We consider requirements, implementation and security for mobile agents as well as environments for mobile agents.

B. Programming Web Services

● Introduction
○ This considers the historical perspective of service technology and basic concepts of Web services.

● Markup languages
○ Here basics of markup languages and XML are considered.

● XML messaging
○ This considers approaches to XML messaging as well as SOAP and JSON.

● Service description

○ Here functional and non-functional components of service description are considered together with Web service description language.

● Service discovery
○ This considers approaches to service discovery (registry, index and peer-to-peer) and specifications for service discovery.

● Services coordination
○ Basic components of the service coordination environments are considered together with specifications. Atomic transaction and business activity are discussed.

● Service composition
○ Here basics and approaches to service composition are considered, and a language for service composition is presented.

● Services security
○ Basic elements of Web service security are presented, and the main security specifications are discussed.

● Web services and stateful resources
○ Approaches and specifications for description of stateful resources in Web services are considered.

● Semantic Web Services
○ Here basics of the semantic Web and its specification for semantic Web services are discussed.

5. New modules in version 2

Following the revisions detailed above, we progressed with designing curricula for the 4 modules in M18 (with data visualisation and storytelling comprised of both the original Data Visualisation module and the Finding Stories in Open Data module, as discussed above). As in D2.1, we provide a summary of each module’s content, and a list of the materials that already exist, in preparation for the learning resources to be developed and released for M24.

5.1 Statistical and mathematical foundations – JSI

5.1.1 Learning Objectives

By the end of this module, learners will:

● Be familiar with concepts, definitions and methods from the fields of mathematics and statistics.
● Have knowledge of the use of mathematical and statistical concepts in Data Science.
● Have gained experience of mathematical and statistical methods with the R and QMiner tools.

5.1.2 Syllabus and Topic Descriptions

Introduction to Mathematics

o This topic describes the notion of the Mathematics course, and provides the syllabus, aims for the course and recommendations for the learners.

Foundations of Mathematics

o This topic gives the basic definitions from the field of Mathematics, such as arithmetic and algebra concepts, linear equations, quadratic equations, functions, differentiation and integration.

Linear Algebra

o This topic provides a number of concepts from the field of Linear Algebra, such as vector spaces, orthogonality, dot product, norm, matrices, determinant, matrix decompositions.

Optimization

o This topic gives insights into the field of Optimization and definitions related to optimization, such as minimum, maximum, saddle points, convex functions, primal and dual problems, Lagrange multipliers.

Introduction to Statistics

o This topic describes the notion of the Statistics course, and provides the syllabus, aims for the course and recommendations to the learners.

Foundations of Statistics

o This topic provides the basic definitions from the field of Statistics, such as probability, random experiments, definition of probability, conditional probability, independence, interpretation of probability, Bayes’ theorem.

o The learners will get familiar with random variables, probability density functions, expectation and expectation values.

o The topic also covers probability distribution, discrete/continuous distributions, sampling and sampling distributions.

Exploratory Data Analysis

o This topic provides a view on Exploratory Data Analysis and its methods. In particular, the topic covers descriptive/summary statistics – central tendency (mean, median, mode), variability (standard deviation and variance), skew, kurtosis, histograms, frequency polygons, box-plots, quartiles, scatter plots, heat maps.
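
Several of the summary statistics listed above can be computed with Python's standard `statistics` module. The sketch below uses a small hypothetical sample and the population forms of variance and standard deviation; the module's own practical examples are given in R and QMiner, so Python is used here only for illustration.

```python
import statistics

# Hypothetical sample of eight observations.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Variability (population variance and standard deviation).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Quartiles: the three cut points dividing the data into four groups.
q1, q2, q3 = statistics.quantiles(data, n=4)
```

Skew, kurtosis and the graphical methods are not covered by the standard library and would need additional tooling.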

Regression and Inferential Statistics

o This topic describes the notion of Regression and such important concepts as covariance and correlation.

o The learners will learn about null hypothesis significance tests (NHST) and confidence intervals.

Data analytics with R

o This topic will provide a set of practical examples from the area of Mathematics and Statistics using R, a programming language and software environment for statistical computing and graphics.

Data analytics with QMiner

o This topic provides some practical insights on mathematical implementations and statistical data analysis using QMiner. QMiner implements a comprehensive set of techniques for supervised, unsupervised and active learning on streams of data.

5.1.3 Existing Courses

Table 5: Existing courses that include a module similar to Statistical and Mathematical Foundations

Institution | Course | Module | URL | Target Audience | Languages and Data Types
JSI | Data Analytics with QMiner | Data analytics with QMiner | http://videolectures.net/sikdd2014_rei_large_scale/?q=qminer | Computer scientists, mathematicians, statisticians. | QMiner
Princeton University | Statistics One MOOC (Coursera) | Statistics One | https://www.coursera.org/course/stats1 | Computer scientists, mathematicians, statisticians, other. | R

5.1.4 Existing Materials

A number of relevant materials have been identified on Videolectures:

● Statistical Methods http://videolectures.net/acai05_taylor_sm/?q=statistics
● Introduction to Statistics http://videolectures.net/cernstudentsummerschool2010_cowan_statistics/?q=statistics
● Lecture 15: Statistical Thinking http://videolectures.net/mit600SCs2011_guttag_lec15/?q=statistics
● Introduction to Statistics http://videolectures.net/cernstudentsummerschool09_cowan_is/?q=statistics

● On the Computational and Statistical Interface and "BIG DATA" http://videolectures.net/colt2014_jordan_bigdata/?q=foundations%20of%20statistics
● Probability and Mathematical Needs http://videolectures.net/bootcamp2010_anthoine_pmn/

Additionally, material has been identified in the following resources:

● James Ward and James Abdey. Foundation course: Mathematics and Statistics http://www.londoninternational.ac.uk/sites/default/files/programme_resources/ifp/fp0001_maths_and_stats_taster_2013.pdf
● Master Statistics with R https://www.coursera.org/specializations/statistics
● Basic Statistics https://www.coursera.org/learn/basic-statistics
● Inferential Statistics https://www.coursera.org/learn/inferential-statistics-intro
● Foundations of Data Analysis - Part 1: Statistics Using R https://www.edx.org/course/foundations-data-analysis-part-1-utaustinx-ut-7-10x

5.1.5 Descriptions of Exercises

Exercises will be designed to allow students to carry out data analytics with the QMiner tool: http://qminer.ijs.si/

5.1.6 Further Reading

● G. Cowan, Statistical Data Analysis, Clarendon, Oxford, 1998; see also www.pp.rhul.ac.uk/~cowan/sda
● R.J. Barlow, Statistics, A Guide to the Use of Statistical Methods in the Physical Sciences, Wiley, 1989; see also hepwww.ph.man.ac.uk/~roger/book.html
● L. Lyons, Statistics for Nuclear and Particle Physicists, CUP, 1986
● F. James, Statistical Methods in Experimental Physics, 2nd ed., World Scientific, 2006; (W. Eadie et al., 1971).
● S. Brandt, Statistical and Computational Methods in Data Analysis, Springer, New York, 1998 (with program library on CD)
● W.-M. Yao et al. (Particle Data Group), Review of Particle Physics, J. Physics G 33 (2006) 1; see also pdg.lbl.gov sections on probability, statistics, Monte Carlo

5.2 Data management and curation - TU/E

5.2.1 Learning Objectives

The objective of this module is to introduce and explain concepts and practices for the management of research data assets, including the creation of a plan, organisation, file formats, and the creation of metadata.

It also introduces the knowledge required to manage safe access to data, from storage to access rights and licensing, including notions about privacy issues.

5.2.2 Syllabus and Topic Descriptions

● Data and data management
○ Research data explained
This unit aims to make learners aware of the different types of research data, and to convey general ideas about big data that are useful for the purpose of this course.

○ Data management plans
This unit aims at explaining what data management is, and how a data management plan can be designed.

● Organization and documentation
○ Organization of data
This unit aims at introducing the concepts of data organization, as well as file versioning, naming conventions, and best practices.

○ File formats and transformation
This unit presents data formatting, compression, and other transformations.

○ Documentation, metadata, and citation
This unit explains the use of documenting data, as well as metadata.

● Data and data management
○ Storage and security
This unit introduces concepts of safety of storage and safety guidelines. It also explains data encryption, and destroying sensitive data.

○ Data protection, rights, and access
This unit aims at describing data protection, rights, and access, from privacy to intellectual property rights.

○ Sharing, preservation, and licensing
This unit presents the practical implications of sharing data, as well as licensing, in particular through different kinds of open data licences.

5.2.3 Existing Courses

● Introduction to Research Data Management and Sharing, free course on Coursera, by The University of North Carolina at Chapel Hill and The University of Edinburgh.
○ https://www.coursera.org/learn/data-management

● Data Management, distance learning course by The Open University.
○ http://www.open.ac.uk/postgraduate/modules/m816

● Digital Curation, tutorial with extensive learning materials online, by the Digital Curation Centre.
○ http://www.dcc.ac.uk/training/train-the-trainer/dc-101-training-materials

● New England Collaborative Data Management Curriculum
○ http://library.umassmed.edu/necdmc/modules

5.2.4 Existing Materials

● Research Data Management Training, free course called "MANTRA", by the University of Edinburgh.
○ http://datalib.edina.ac.uk/mantra/

● New England Collaborative Data Management Curriculum
○ http://library.umassmed.edu/necdmc/modules

● Digital Curation, tutorial with extensive learning materials online, by the Digital Curation Centre.
○ http://www.dcc.ac.uk/training/train-the-trainer/dc-101-training-materials

5.2.5 Example Quizzes and Questions

Quizzes can include questions that test the understanding of concepts, terminology, regulations, and methods.

5.2.6 Description of Exercises

● Provide rules that implement a sample policy.
● Provide a description of an infrastructure responding to a sample situation.
● Evaluate the opportunity to preserve some data, in a sample situation.

5.2.7 Further Reading

● ‘Managing Research Data’, Facet Publishing (2012)
● ‘Principles of Data Management’, Keith Gordon (2013)
● ‘Preparing the Workforce for Digital Curation’, National Academies Press (2015)

5.3 Big data analytics - Fraunhofer

This module was developed by Fraunhofer and provides an overview of approaches facilitating data analytics on huge datasets. Different strategies are presented including sampling to make classical analytics tools amenable for big datasets, analytics tools that can be applied in the batch or the speed layer of a lambda architecture, stream analytics, and commercial attempts to make big data manageable in massively distributed or in-memory databases. Learners will be able to realistically assess the application of big data analytics technologies for different usage scenarios and start with their own experiments.

5.3.1 Learning Objectives

Learners with experience in programming, data analytics and the architecture of big data systems get to know methods and tools to analyze big data. They will be able to realistically assess the application of big data analytics technologies for different usage scenarios and start with their own experiments.

5.3.2 Syllabus and Topic Descriptions

● Introduction:
○ Big data analytics solutions ask for skills from two different fields. First, information technology skills are needed to provide horizontally scalable systems for storing and processing data. Second, data analysis skills are needed and, in particular, tools and methods for the mathematical modelling of data. Collaborative filtering provides a running example for all aspects of big data analytics. The use-case is to provide recommendations for products based on user preferences. A concrete example is given by the task of music recommendation based on the last.fm dataset. The latter provides user/artist pairings based on listening events and can be used to generate artist recommendations. The task makes it necessary to define similarity of disparate items. This is based on utility matrices and the Jaccard similarity or, respectively, the Jaccard distance. Once a distance measure is defined, data can be analysed by clustering. On the one hand, this can be achieved using sampling to fit big datasets into classical analytics tools. On the other hand, big data analytics solutions can be integrated in a lambda architecture.

● Collaborative filtering in the lambda architecture:
○ Lingual provides an ANSI SQL interface for Apache Hadoop. This is applied for data understanding in the collaborative filtering use-case. Building the utility matrix for collaborative filtering is a typical batch processing task. This can be modelled on top of Hadoop using Cascading. Classical data analytics software cannot handle huge amounts of data. A possible solution is given by sampling a subset of the dataset which can then be analysed in classical tools using RHadoop. The actual cluster model for the batch view is computed in R whereas the application of this model is again performed using Cascading. The size of the final model is small enough such that validation is possible in R via Lingual.

● Generating real-time recommendations:
○ Stream-processing can be used to update the results from collaborative filtering for new entities which have not yet been part of the batch processing. For large streams, the technique of stream synopses keeps the data size manageable. In particular, count sketches can be used to efficiently approximate the distance computation necessary for collaborative filtering. Such a solution can be integrated in the lambda architecture using Apache Storm.

● Big data analytics in Spark:
○ Spark is one of the fastest developing platforms for big data analytics. It provides means to overcome the disk I/O overhead often seen in MapReduce based processing. Concepts like data frames and Spark SQL make it convenient to work with structured data inside Spark programs. In particular, it provides the Spark MLlib, a library that contains many distributed machine learning approaches suitable for big data analytics. PySpark allows the powerful combination of Spark with Python, and thus the combination of distributed processing with the wide range of libraries for data analytics and visualisation in Python. Linear regression, collaborative filtering with alternating least squares (ALS), K-Means clustering, and power iteration clustering provide the means to implement large scale collaborative filtering in Spark. Spark offers a framework for machine learning pipelines. These make it easier to combine multiple algorithms into a single workflow with a common interface to parameters. Linear regression is available in both Spark and Python. It may be beneficial to use either the one or the other, depending on data size.

● Complex event processing with Proton:
○ The basic principle of complex event processing is to derive complex events on the basis of a possibly large number of simple events using an event processing logic. Proton on Storm allows running an open source complex event processing engine in a distributed manner on multiple machines using the Storm infrastructure. Event processing networks provide a conceptual model describing the event processing flow execution. Such a network comprises a collection of event processing agents, event producers, and event consumers that are linked by channels. Fraud detection on call detail records is an illustrative use-case example showing how these concepts can be implemented.

● SQL operators for MapReduce with Teradata:
○ Database management system providers seek to enhance their traditional databases and make them applicable to big data use-cases. A basic concept to achieve this is given by partitioning of tables, leading to massively parallel databases. Table operators allow making use of the partitioning for distributed algorithms using MapReduce. A selected commercial tool offering these approaches is the Teradata Aster solution.

● In-memory processing:
○ More and more main memory becomes available at a reasonable price. As access speed is reduced significantly once data outside of main memory is accessed, high performance applications focus on keeping as much data as possible in main memory. There is a wide variety of in-memory database systems available. Central performance and applicability measures to be kept in mind when choosing such a system comprise operating system compatibility, hardware requirements, license and support issues, runtime monitoring capabilities, memory utilisation, database interface standards, extensibility, portability, integration of open source big data technologies, local and distributed scaling and elasticity, available analytics functionality, persistence, availability, and security.
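
The Jaccard similarity and distance mentioned in the introduction to this syllabus can be sketched in a few lines. The users and artist sets below are hypothetical stand-ins for the user/artist pairings of a listening dataset; a production solution would compute these quantities in a distributed fashion rather than in memory.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical rows of a utility matrix: user -> set of artists listened to.
listens = {
    "alice": {"Radiohead", "Portishead", "Massive Attack"},
    "bob": {"Radiohead", "Portishead", "Bjork"},
    "carol": {"Metallica"},
}
d_alice_bob = jaccard_distance(listens["alice"], listens["bob"])
d_alice_carol = jaccard_distance(listens["alice"], listens["carol"])
```

Users with a small distance have overlapping tastes, so a clustering based on this distance groups users whose listening histories can inform each other's recommendations.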

5.3.3 Existing Courses

● Big data analytics, a face-to-face training offered by Fraunhofer IAIS http://www.iais.fraunhofer.de/bigdataanalytics.html

5.3.4 Existing Materials

● Data Mining and Applications Graduate Certificate at Stanford Center for Professional Development, http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=1209602
● Machine Learning With Big Data at Coursera https://www.coursera.org/specializations/bigdata
● Mining Massive Data Sets, Stanford University http://web.stanford.edu/class/cs246/
● DATA SCIENCE AND BIG DATA ANALYTICS by EMC https://education.emc.com/guest/campaign/data_science.aspx
● Spark Fundamentals I by BigData University http://bigdatauniversity.com/courses/spark-fundamentals/

5.3.5 Example Quizzes and Questions

Multiple choice questions concerning

● The Jaccard distance to build the utility matrix for collaborative filtering
● Locating collaborative filtering in the lambda architecture
● The role of classical data analytics for big data
● The count-distinct sketch algorithm
● The alternating least squares algorithm
● The parallelization of k-means clustering
● Big data analytics in Spark
● In-memory processing

5.3.6 Further Reading

● http://github.com/ishkin/Proton/tree/master/IBM%20Proactive%20Technology%20Online%20on%20STORM
● http://www.teradata.de/products-and-services/analytics-from-aster-overview
● http://en.wikipedia.org/wiki/List_of_in-memory_databases
● http://s.fhg.de/in-memory-systems

5.4 Data visualisation and storytelling - SOTON/ODI

As discussed above, this module is a combination of the ‘Data visualisation’ and ‘Finding stories in open data’ modules that were listed in the first version of the curriculum. We present here an overall curriculum for the new module that combines the topics of the original modules.

5.4.1 Learning Objectives

By the end of the course, learners will be able to:

● Explain the role of data visualisation in data science
● Describe why particular visualisations are either good or bad
● Design visualisations that tell a coherent story
● Explain how the brain perceives visual information
● Create static and interactive visualisations

5.4.2 Syllabus and Topic Descriptions

The main syllabus is separated into four ‘stages’: fundamentals, planning, storytelling and producing visualisations.

Fundamentals

● Importance of visualisation within data science: this section discusses the current data situation and how the volume of data means there is a growing need for actionable insights that can be revealed through visualisation.

● Dashboards and infographics: we will review the rise of dashboards and infographics and how these are used to tell coherent stories about data.

● Origin of graphs: this covers a brief history of visualisation, looking at why certain graphs or charts were introduced in order to make sense of growing amounts of data.

● European and local laws: this topic will cover how law is used to protect freedom of speech within journalism.

● Ethics in communicating stories: what are the ethical considerations required when content can be shared rapidly to proliferate across the Web?

Planning

● Audience-Story-Action - this section will cover three key considerations that a designer must make before creating a visualisation.

● Headline writing and story production - looking at how to scope a data journalism project, we will cover how to effectively communicate insight based on a thorough understanding of your audience.

● 80:20 rule - how this applies in data journalism and how you can reduce time spent on early stages of the journalism process.

Storytelling

● Visual perception and information design: we will look at the cognitive process behind how the brain interprets visuals, including psychological theories such as gestalt theory, pre-attentive processing and elementary perceptual tasks. We will show how an understanding of these allows us to notice how the brain can be tricked, so that bad design can be avoided.

● Colour and interaction to draw attention to data

● Chartjunk and bad design - we will explore Tufte’s theories of minimalism and chartjunk as guidelines for producing effective graphics.

Producing Visualisations

● Obtaining data: this topic looks at how to obtain the data required to tell a compelling story, and what datasets are required to tell different types of story.

● Cleaning data: this looks at how to prepare data for analysis once it has been gathered, beginning with the essential process of data cleansing
○ Identifying outliers, fixing typography, correcting missing data

○ Enriching data

● Types of chart: what charts can be used for what data, and which stories can these tell?

● Building visualisations with D3: this topic looks at the practical ways to visualise data to tell a story. Technologies covered include HTML5, D3.js and DC.js to produce compelling online graphics.

5.4.3 Existing Courses

● Data Visualization and D3.js MOOC (Udacity) - Zipfian Academy
○ https://www.udacity.com/course/data-visualization-and-d3js--ud507

● Big Data: Data Visualisation MOOC (Futurelearn) - Queensland University of Technology
○ https://www.futurelearn.com/partners/queensland-university-of-technology

● Data visualisation one-day workshop (The Guardian)
○ https://www.theguardian.com/guardian-masterclasses/2015/aug/07/data-visualisation-a-one-day-workshop-tobias-sturt-adam-frost-digital-course

● Data visualisation (module on Master’s degree on Data Science) - Barcelona Graduate School of Economics
○ http://www.barcelonagse.eu/sites/default/files/course-d007-data-visualization.pdf

● Data Analysis, Visualisation and Communication (MSc course) - University of Aberdeen
○ http://www.abdn.ac.uk/study/postgraduate-taught/degree-programmes/46/data-analysis-visualisation-and-communication/

5.4.4 Existing Materials

● Finding Stories in Data - Course Materials - http://training.theodi.org/FindingStories/

Additional resources will be developed by both SOTON and ODI.

5.4.5 Example Quizzes and Questions

Quizzes will be in the form of multiple-choice questions to test knowledge of:

● The importance of data visualisation and storytelling
● How particular data should be encoded into visual dimensions
● The stages of planning a visualisation
● Chartjunk and minimalism in visualisation design

5.4.6 Description of Exercises

● Visualising data - students will be guided through how to create an online visualisation, with participants discovering the challenges involved in the process.

5.4.7 Further Reading

● Data Journalism Handbook - http://datajournalismhandbook.org/
● Cairo, A. The Functional Art
● Tufte, E. The Visual Display of Quantitative Information
● Few, S. Show Me the Numbers

5.5 Linked Data and the Semantic Web - SOTON

This module is a new addition to the curriculum and fills a gap in the previous version regarding the coverage of semantic technologies and data management.

5.5.1 Learning Objectives

The module is intended for anyone wanting to gain a basic understanding of linked data, before seeking out further study opportunities, or as part of their existing training. It will appeal to many computer science, web science, or data science graduates, students and practitioners wishing to expand their knowledge.

Linked data allows structured publication of data on the Web, which facilitates the exposure and interlinking of datasets so that data may be exchanged, reused and integrated. Increasing numbers of organisations and initiatives are starting to appreciate the power of linked data, and it provides significant power and potential to the growing domains of data science and web science.

This module will expose you to the knowledge and skills around using linked data – skills that are in increasing demand as the Web evolves from a Web of Documents to a Web of Data. It will help you understand how you can use linked data technologies and how to write queries in SPARQL, allowing you to exploit these technologies in your own work.

By the end of this primer in Linked Data, learners will be able to:

● Describe various background technologies and standards behind linked data
● Describe the role of linked data in data science applications
● Understand how linked data applications are designed and used
● Use SPARQL to process linked data

5.5.2 Syllabus and Topic Descriptions

● Week One – Linked Data fundamentals.
○ By the end of this week you will be able to:
■ Reflect on why linked data is important for structuring the Web of Data
■ Summarise the background technologies behind linked data
■ Outline the various background standards behind linked data
■ Explain what linked data is
■ Summarise the principles behind linked data
■ Explain what 5-star linked open data means
■ Evaluate different linked data tools
○ This part of the course covers the basic topics required to understand linked data and the semantic web. It begins with reviewing the technologies and standards behind the Web itself and how these are evolving to facilitate linked data. Students will gain an understanding of what linked data is, before we cover a number of principles behind it and the 5-star model of rating data openness.

● Week Two – SPARQL Queries.
○ By the end of this week you will be able to:
■ Describe the basic concepts of SPARQL
■ Recall the definitions of SPARQL terminology
■ Apply knowledge of SPARQL to formulate SPARQL queries
■ Detect different types of SPARQL queries
■ Construct SPARQL queries
○ This section of the course introduces the learners to SPARQL, the language for querying the semantic web. We cover general principles such as how to structure a query, and provide definitions for the required terminology, before introducing basic queries, such as ASK and, more predominantly, SELECT.

● Week Three – More SPARQL Queries.
○ By the end of this week you will be able to:
■ Design more advanced SPARQL queries
■ Detect more types of advanced SPARQL query
■ Describe the practical applications for SPARQL and Linked Data
○ The third part of the course moves on to more advanced SPARQL queries, particularly DESCRIBE and CONSTRUCT. We cover the essential syntax and structure of these queries, building on previously gained knowledge from earlier in the course about querying linked data. Finally we cover the various applications of linked data and how this technology can be used to develop rich tools that provide increased value through the linking of data.
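The four query forms named in the syllabus (ASK, SELECT, DESCRIBE and CONSTRUCT) can be illustrated with a minimal Python sketch that matches a single triple pattern against a small in-memory graph. This is not a SPARQL engine and is not taken from the course materials: the data, the prefixed names (`ex:`, `foaf:`) and all function names are hypothetical, and each function only approximates the behaviour of the query form shown in its comment.

```python
# Toy graph: each triple is (subject, predicate, object).
# All identifiers are hypothetical illustration data.
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", "Alice"),
]


def is_var(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match(pattern, triples=TRIPLES):
    """Yield one variable-binding dict for every triple matching the pattern."""
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if is_var(p):
                # A variable used twice must bind to the same value.
                if binding.setdefault(p, t) != t:
                    break
            elif p != t:
                break
        else:
            yield binding


def select(variables, pattern):
    """Roughly: SELECT ?who WHERE { ex:alice foaf:knows ?who }"""
    return [tuple(b[v] for v in variables) for b in match(pattern)]


def ask(pattern):
    """Roughly: ASK { ex:bob foaf:knows ex:alice }"""
    return any(True for _ in match(pattern))


def construct(template, pattern):
    """Roughly: CONSTRUCT { ?b ex:knownBy ?a } WHERE { ?a foaf:knows ?b }"""
    return [tuple(b.get(term, term) for term in template) for b in match(pattern)]


def describe(resource, triples=TRIPLES):
    """Roughly: DESCRIBE ex:alice. Real services choose what to return;
    this sketch assumes 'every triple that mentions the resource'."""
    return [t for t in triples if resource in (t[0], t[2])]


print(select(["?who"], ("ex:alice", "foaf:knows", "?who")))  # [('ex:bob',)]
print(ask(("ex:bob", "foaf:knows", "ex:alice")))             # False
```

SELECT returns a table of bindings, ASK returns a boolean, and CONSTRUCT and DESCRIBE return triples, which is the main distinction between the basic and the more advanced query forms.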

5.5.3 Existing Courses

This course takes inspiration from the EUCLID project's iBook, which is openly available and contains quizzes and exercises to test knowledge in a self-study format.

5.5.4 Existing Materials

● The EUCLID iBook is available to download from the Apple iBooks store, and from the EU project's website at http://euclid-project.eu/

5.5.5 Example Quizzes and Questions

Quizzes test the students' knowledge of various concepts such as:

● The semantic web technology stack
● RDF graphs
● SPARQL query results

5.5.6 Description of Exercises

Interactive exercises will provide students with hands-on experience of using linked data and SPARQL. Two variations are provided: the first allows the student to write their own RDF statements and then use SPARQL to verify that they are correctly formed, while the second uses a live SPARQL endpoint to provide a more complex dataset on which the queries outlined in the course can be tested and experimented with.
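To give a flavour of the first exercise variation, the sketch below checks whether a single RDF statement is well formed. The course exercise uses SPARQL for this verification; this sketch substitutes a much simpler syntactic test written in Python, covering only a simplified subset of the N-Triples syntax (an assumption made for brevity). The function name and the example IRIs are hypothetical.

```python
import re

# Simplified building blocks. These are NOT the full N-Triples grammar.
IRI = r'<[^<>"\s]+>'            # e.g. <http://example.org/alice>
BNODE = r'_:[A-Za-z0-9]+'       # e.g. _:b0
LITERAL = (
    r'"(?:[^"\\]|\\.)*"'        # quoted string, backslash escapes allowed
    r'(?:@[A-Za-z]+(?:-[A-Za-z0-9]+)*|\^\^' + IRI + r')?'  # language tag or datatype
)

# subject: IRI or blank node; predicate: IRI; object: IRI, blank node or literal.
STATEMENT = re.compile(
    r'^\s*(?:' + IRI + '|' + BNODE + r')\s+'
    + IRI + r'\s+'
    + r'(?:' + IRI + '|' + BNODE + '|' + LITERAL + r')\s*\.\s*$'
)


def is_well_formed(line):
    """Return True if the line looks like one subject-predicate-object statement."""
    return STATEMENT.match(line) is not None


good = '<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .'
bad = '<http://example.org/alice> "name" "Alice" .'  # a literal cannot be a predicate
print(is_well_formed(good), is_well_formed(bad))
```

A check like this only catches syntax errors. Whether the statement says what the student intended is better tested by querying it, which is what the SPARQL-based exercise does.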

5.5.7 Further Reading

● The EUCLID project iBook:
○ Simperl, E., Norton, B., Acosta, M., Maleshkova, M., Domingue, J., Mikroyannidis, A., Mulholland, P. and Power, R., 2013. Using linked data effectively.

● T. Berners-Lee, J. Hendler and O. Lassila (2001) "The Semantic Web". Scientific American vol. 284 number 5, pp 34-43. Available on-line at http://www.scientificamerican.com/article.cfm?id=the-semantic-web.

● P. Hitzler, M. Krötzsch, and S. Rudolph (2010). "Query Languages: Foundations of Semantic Web Technologies". CRC Press


6. Data science skill groups

In Table 3 above, we present insights from the demand analysis carried out in WP1. One of these relates to the recommendation that EDSA should develop a data science skills framework in order to structure the skills requirements of particular jobs. Additionally, this would allow data scientists themselves to assess their skills profiles, compare them with those required for data jobs, and identify areas in which they need to seek further training. Linking this framework with the WP1 demand dashboard would make it possible to search automatically for training courses relevant to a job that a user has found but does not yet have the skills for.

In this section we present an initial matching of modules to skills, based on the curricula released in this and the previous deliverable. For the purposes of this skills assignment, we use a list of skills that has been used throughout the EDSA project. The assignment of skills to modules is shown in Table 6.

In addition, skills that were not in the current vocabulary and which relate to particular tools or technologies were extracted from the curricula in order to enhance the accuracy of this mapping:

● Foundations of Big Data
○ QMiner
○ MapReduce

● Big Data Architecture
○ MapReduce
○ Apache Storm
○ Cassandra
○ Hadoop

● Distributed Computing
○ Hadoop

● Essentials of Data Analytics and Machine Learning
○ R

● Process Mining
○ ProM

● Statistical and Mathematical Foundations
○ R
○ QMiner

● Big Data Analytics
○ Hadoop
○ R
○ Apache Storm
○ Spark
○ Python
○ MapReduce
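The skills comparison described at the start of this section (assess a profile, compare it with a job's requirements, then locate relevant training) can be sketched in Python using the module-to-tool lists above. This is only an illustrative sketch, not the EDSA dashboard's implementation: the learner profile, the job requirements and the function names are hypothetical, and ranking modules by the number of missing skills they cover is an assumed heuristic.

```python
# Tools and technologies per module, copied from the lists above.
MODULE_TOOLS = {
    "Foundations of Big Data": {"QMiner", "MapReduce"},
    "Big Data Architecture": {"MapReduce", "Apache Storm", "Cassandra", "Hadoop"},
    "Distributed Computing": {"Hadoop"},
    "Essentials of Data Analytics and Machine Learning": {"R"},
    "Process Mining": {"ProM"},
    "Statistical and Mathematical Foundations": {"R", "QMiner"},
    "Big Data Analytics": {"Hadoop", "R", "Apache Storm", "Spark", "Python", "MapReduce"},
}


def skills_gap(profile, required):
    """Skills a job requires that the learner does not yet have."""
    return set(required) - set(profile)


def recommend_modules(gap, module_tools=MODULE_TOOLS):
    """Modules covering at least one missing skill, best coverage first.

    Ties are broken alphabetically so the result is deterministic.
    """
    scored = [(len(gap & tools), name) for name, tools in module_tools.items()]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for covered, name in ranked if covered > 0]


# Hypothetical learner and job posting.
profile = {"R", "Python"}
job_requirements = {"Hadoop", "Spark", "R"}

gap = skills_gap(profile, job_requirements)
print(sorted(gap))             # ['Hadoop', 'Spark']
print(recommend_modules(gap))
```

For this hypothetical learner, Big Data Analytics is ranked first because it covers both missing skills, followed by the modules that cover Hadoop only.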


Table 6: Skills that each module curriculum targets

Module columns: Foundations of Data Science | Foundations of Big Data | Big Data Architecture | Distributed Computing | Essentials of Data Analytics and Machine Learning | Process Mining TU/E | Statistical and mathematical foundations | Data management and curation | Big data analytics | Data visualisation and storytelling | Linked Data and the Semantic Web | Total Coverage

Skill | Module marks | Total Coverage

advanced computing
python | | 0
advanced computing | x X X | 3
programming | X X X | 3
computational systems | x X | 2
coding | X X | 2
cloud computing | X X | 2

data skills
databases | X X | 2
data management | X X X X | 4
data engineering | | 0
data mining | X X x | 3
data formats | X | 1
linked data | X | 1
information extraction | X X | 2
stream processing | x X | 2

domain expertise
enterprise process | | 0
business intelligence | X X | 2
data anonymisation | X | 1
semantics | | 0
schema | | 0
data licensing | X | 1
data quality | | 0
data governance | X | 1

general
data science | X x | 2
big data | X X X X | 4
open data | | 0

machine learning
machine learning | x X X | 3
social network analysis | | 0
inference | X | 1
reasoning | X | 1
process mining | X | 1

maths & statistics
linear algebra | X X X | 3
calculus | X | 1
mathematics | X X | 2
statistics | X X X | 3
probability | X | 1
Rstudio | | 0
data analytics | X X X | 3
data analysis | X X X | 3

visualisation
data visualisation | X X | 2
infographics | X | 1
data mapping | X | 1
data stories | X | 1
data journalism | X | 1
d3js | X | 1
tableau | | 0

This mapping represents an initial categorisation of the modules in our curriculum based on the skills that they train. In the coming months, further refinement will be made. In particular, the back-end work of the dashboard is also producing an updated list of skills based on their increasing occurrence in European job posts over time. Using this will provide a more granular and accurate assessment of the exact skills covered in each module.

Additionally, the mapping is currently at a curriculum level, indicating which topics are covered by the syllabus of each module. Further complexity arises when different instances or delivery mechanisms of each course are considered: for example, a partner may produce an online course for their module and in doing so base it around a specific language or tool, whereas for a face-to-face offering they might use a combination of tools instead. This requires a classification not only at the curriculum level, but at the course level, too.


7. Conclusion

Following the release of the first version of the EDSA curriculum in M6, we have since had time to develop and run a number of EDSA modules and receive both student and teacher feedback from these. Additionally, insights from our demand analysis are beginning to shape changes to the modules and curriculum structure itself, ensuring that the EDSA curriculum remains an integral resource for data scientists to discover new opportunities for training and work.

This deliverable has reviewed the changes made to date, as well as presenting the next group of modules from the curriculum to be released. At M30, we will present the final version of the curriculum, which will include fewer brand new modules and will instead focus on bringing all the insights from our training delivery, demand analysis and learning analytics together. Each module and associated course will be tagged with the skills which a student will learn. Courses released at M24 and M36 will therefore relate directly to these skills, joining up the demand analysis work with the course delivery offerings. A prospective data science worker will therefore be able to locate jobs that interest them, identify their own skills shortages, and then locate training materials to help them meet the requirements of the job post.