Project acronym: EDSA
Project full name: European Data Science Academy
Grant agreement no: 643937
D2.2 Data Science Curricula 2
Deliverable Editor: Chris Phethean (SOTON)
Other contributors:
Elena Simperl (SOTON)
Module curricula prepared by partners as noted in document: ODI, FRAUNHOFER, TU/E, PERSONTYLE, JSI, KTH
Deliverable Reviewers: Shatha Jaradat (KTH), Angi Voss (Fraunhofer)
Deliverable due date: 31/07/2016
Submission date: 29/07/2016
Distribution level: P
Version: 1.0
This document is part of a research project funded by the Horizon 2020 Framework Programme of the European Union
Change Log
Version | Date | Amended by | Changes
0.1 | 03/06/2016 | Chris Phethean | Structure of document and initial content.
0.2 | 15/06/2016 | Chris Phethean | Linked Data module added
0.3 | 16/06/2016 | Chris Phethean | Revised v2 curriculum detailed
0.4 | 04/07/2016 | Chris Phethean | New modules for M18 added and detailed using contributions from relevant partners.
0.5 | 05/07/2016 | Chris Phethean | Curricula revisions to M6 modules made.
0.6 | 07/07/2016 | Chris Phethean | Skills assignment added. Conclusions written.
0.7 | 07/07/2016 | Chris Phethean | Changes and updates in advance of review. Final content added.
0.8 | 08/07/2016 | Chris Phethean | Final version of skills classification added.
0.9 | 19/07/2016 | Chris Phethean | Reviewer comments addressed
0.9a | 19/07/2016 | Elena Simperl | Scientific Review
1.0 | 27/07/2016 | Aneta Tumilowicz | Final QA
2016 © Copyright lies with the respective authors and their institutions.
Table of Contents
Change Log
Table of Contents
List of Tables
1. Executive Summary
2. Introduction
2.1 Recap of curricula version 1
2.1.1 The core EDSA curriculum
2.1.2 Modules released in Year 1
2.2 Delivery of learning resources
3. Insights from demand analysis
4. New curricula released for version 2
4.1 Overview of version 2 release
4.2 Revised modules in version 2
4.2.1 Foundations of data science (SOTON)
4.2.2 Essentials of Data Analytics and Machine Learning (Persontyle)
4.2.3 Distributed computing (KTH)
5. New modules in version 2
5.1 Statistical and mathematical foundations - JSI
5.1.1 Learning Objectives
5.1.2 Syllabus and Topic Descriptions
5.1.3 Existing Courses
5.1.4 Existing Materials
5.1.5 Descriptions of Exercises
5.1.6 Further Reading
5.2 Data management and curation - TU/E
5.2.1 Learning Objectives
5.2.2 Syllabus and Topic Descriptions
5.2.3 Existing Courses
5.2.4 Existing Materials
5.2.5 Example Quizzes and Questions
5.2.6 Description of Exercises
5.2.7 Further Reading
5.3 Big data analytics - Fraunhofer
5.3.1 Learning Objectives
5.3.2 Syllabus and Topic Descriptions
5.3.3 Existing Courses
5.3.4 Existing Materials
5.3.5 Example Quizzes and Questions
5.3.6 Further Reading
5.4 Data visualisation and storytelling - SOTON/ODI
5.4.1 Learning Objectives
5.4.2 Syllabus and Topic Descriptions
5.4.3 Existing Courses
5.4.4 Existing Materials
5.4.5 Example Quizzes and Questions
5.4.6 Description of Exercises
5.4.7 Further Reading
5.5 Linked Data and the Semantic Web - SOTON
5.5.1 Learning Objectives
5.5.2 Syllabus and Topic Descriptions
5.5.3 Existing Courses
5.5.4 Existing Materials
5.5.5 Example Quizzes and Questions
5.5.6 Description of Exercises
5.5.7 Further Reading
6. Data science skill groups
7. Conclusion
List of Tables
Table 1: Core EDSA Curriculum, version 1
Table 2: Topic Schedule and Partner Allocation
Table 3: Recommendations for EDSA curriculum development, originally presented in D1.4
Table 4: Core EDSA Curriculum, version 2.0
Table 5: Existing courses that include a module similar to Statistical and Mathematical Foundations
Table 6: Skills that each module curricula targets
1. Executive Summary
This deliverable presents the second version of the EDSA curriculum, which comprises 15 core data science topics that are being delivered by the 3-year project. Previously, we released the first version of the curriculum containing 6 of the 15 modules, each of which had a specific curriculum containing learning objectives, topic descriptions and various resources and materials that had been identified. Modules are separated into four stages: foundations, data storage and processing, data analysis, and data interpretation and use. These categories remain in version 2 and provide structure to our courses. We present new modules in this deliverable, reflecting a number of changes to the curriculum’s list of topics, along with revisions to the previously released modules based on learner feedback, teaching experience and further demand analysis.
2. Introduction
Before presenting the new version of the EDSA curriculum, we first revisit the initial version released in July 2015 based on preliminary demand analysis. This section recaps the main curriculum, comprising fifteen topics, and discusses the progress made to date on implementing the modules released in Year 1 through learning resources now available on the EDSA Courses portal.
2.1 Recap of curricula version 1
In D2.1 (Data Science Curricula 1), we released the first version of the EDSA curriculum. 15 core topics were presented, with the intention to deliver these in groups over the course of the 3-year project. At the time of writing D2.1 (Month 6), the first 6 modules were released, each provided by a different EDSA partner. The following sections give an overview of the curriculum (2.1.1) and then revisit the 6 modules that were released in version 1 (2.1.2).
2.1.1 The core EDSA curriculum
We presented the 15 topics that make up the core data science curriculum for EDSA, which are reproduced below:
Table 1: Core EDSA Curriculum, version 1
Module | Topic | Stage
1 | Foundations of Data Science | Foundations
2 | Foundations of Big Data | Foundations
3 | Statistical / Mathematical Foundations | Foundations
4 | Programming / Computational Thinking (R and Python) | Foundations
5 | Data Management and Curation | Storage and Processing
6 | Big Data Architecture | Storage and Processing
7 | Distributed Computing | Storage and Processing
8 | Stream Processing | Storage and Processing
9 | Machine Learning, Data Mining and Basic Analytics | Analysis
10 | Big Data Analytics | Analysis
11 | Process Mining | Analysis
12 | Data Visualisation | Interpretation and Use
13 | Visual Analytics | Interpretation and Use
14 | Finding Stories in Open Data | Interpretation and Use
15 | Data Exploitation including data markets and licensing | Interpretation and Use
Modules were structured and classified according to the stage or theme in the general data science process that they were most suited to (foundations, storage and processing, analysis, and interpretation and use). By covering these stages equally, we ensure a broad focus that allows different audiences and stakeholders with varying needs to access our materials and find relevance and value in them.
2.1.2 Modules released in Year 1
Table 2: Topic Schedule and Partner Allocation
Topic | Schedule | Allocated Partner
Foundations of Data Science | M6 | Soton
Foundations of Big Data | M6 | JSI
Big Data Architecture | M6 | Fraunhofer
Distributed Computing | M6 | KTH
Machine Learning, Data Mining and Basic Analytics | M6 | Persontyle
Process Mining | M6 | TU/e
Statistical / Mathematical Foundations | M18 | JSI
Data Management and Curation | M18 | TU/e
Big Data Analytics | M18 | Fraunhofer
Data Visualisation | M18 | Soton
Finding Stories in Open Data | M18 | ODI
Programming / Computational Thinking (R and Python) | M30 | Soton
Stream Processing | M30 | KTH
Visual Analytics | M30 | Fraunhofer
Data Exploitation including data markets and licensing | M30 | ODI
Following the schedule outlined in Table 2, curricula for the first six modules were released in M6:
● Foundations of Data Science
● Foundations of Big Data
● Big Data Architecture
● Distributed Computing
● Machine Learning, Data Mining and Basic Analytics
● Process Mining
Each curriculum contains the following details:
● Learning Objectives
● Syllabus and Topic Descriptions
● Existing Courses
● Existing Materials
● Example Quizzes and Questions
● Description of Exercises
● Further Reading
Each module is released as a syllabus which can then be used to deliver various types of courseware, in various delivery formats. This means that partners can target different audiences, sectors or levels of experience by delivering different types of courses, for example a face-to-face training course, an online MOOC, or a set of self-study learning resources.
2.2 Delivery of learning resources
Following the release of the first version of the curriculum in M6, partners produced learning resources for the first six modules, which were released in M12 as outlined in D2.4. We have made these courses available on the EDSA courses portal (http://courses.edsa-project.eu), where a range of delivery channels and formats is available:
● Self-study courses, allowing learners to study open educational resources at their own pace.
● MOOCs, hosted on external platforms such as FutureLearn and Coursera, allowing social learning in large, online courses.
● Blended courses, combining face-to-face elements with online materials, which are offered by EDSA partners and associates.
● Face-to-face courses, which are courses delivered in person by EDSA partners and associates.
Each of the M6 modules has at least a self-study course option available on the site, while courses such as TU/e’s ‘Process Mining’ module also have further options such as MOOCs. The courses portal also includes additional courses from the project partners that are related to the EDSA curriculum.
3. Insights from demand analysis
Following the demand analysis carried out in WP1 and reported in D1.4, a number of insights into the curriculum have been gained, which this deliverable builds upon. As summarised in D1.4, it can be argued that the EDSA curriculum presented in M06 is already generally in line with industry demands across Europe. As such, the current focus on comprehensively training the technical and analytical aspects of data science should be maintained.
The demand analysis revealed scope for innovation in integrating this technical and analytical training into a more holistic approach to skills development, expanding to cover supporting skills that can increase the impact of data science within organisations. The recommendations from the demand analysis, initially presented in D1.4, are reproduced in Table 3. The highlighted rows (recommendations 2-4 inclusive) represent those which influence the curriculum design, some of which are reflected in this deliverable while the others will guide our revisions over the next year before the final release in M30.
Table 3: Recommendations for EDSA curriculum development, originally presented in D1.4
Title | Intervention level | Summary description
Holistic training approach | General training approach | Refine the EDSA’s training approach and curriculum cycle to strengthen skills along the full data exploitation chain.
Open source based training | Existing curriculum design | Continue current technical and analytical training based on open source technologies; apply cross-tool focus to deliver overarching training.
Soft skills training | Expansion of curriculum | Implement soft skill training to increase performance and organisation impact of data scientists / data science teams.
Basic data literacy training | Expansion of curriculum | Develop basic data literacy and data science training for non-data scientists to improve basic skills across organisations and facilitate uptake of data-driven decision making and operations.
Blended training | Course delivery | Develop blended training approaches including sector-specific exercises and examples to increase effectiveness of training delivery.
Data science skills framework | Training approach and delivery | Implement data science skills framework to structure skills requirements, assess skills of data scientists, and identify individual skills needs.
Navigation and guidance | Training market | Develop quality assessment of third party courses; provide navigation support to identify relevant trainings from EDSA and third parties.
4. New curricula released for version 2
The following sections discuss the revisions to the curriculum based on the feedback discussed in Section 3, and any changes that have been made to the individual modules following their initial release.
4.1 Overview of version 2 release
Building on our experience of releasing the first version of the curriculum, and on further study of the data science landscape in Europe, a number of modifications have been made to the curriculum.
Linked Data
After reviewing the curriculum and key technological developments around data science, it was observed that there was no content covering linked data or semantic web technologies. Having assessed the availability of learning resources in this area, it was decided that, as well as improving our coverage of data management skills, this topic could be used as an initial module to test out the FutureLearn platform for delivering MOOCs. Drawing on the previous EU project ‘EUCLID’ (see footnote 1), whose consortium included a number of the EDSA partners, a further module was added to the EDSA data science curriculum, adapted from the syllabus and learning resources created in EUCLID.
The linked data module, titled ‘An Introduction to Linked Data and the Semantic Web’, both fills a gap in our curriculum and allowed us to trial the running of a course on FutureLearn based on existing, high-quality, open learning materials. This meant that we were able to quickly adapt the content from the EUCLID iBook to be released as a MOOC on FutureLearn in April 2016 and provide insights into the platform to the rest of the project before moving forward with using it for the other modules. Statistics regarding the number of learners who registered on the course are presented in D3.2 (Report on delivery of video lectures, webinars and face-to-face trainings 2).
Modifications to curriculum structure
According to the original schedule, the M18 curriculum would include the following new modules:
● Statistical / Mathematical Foundations
● Data Management and Curation
● Big Data Analytics
● Data Visualisation
● Finding Stories in Open Data
After reviewing the intended aims for the ‘Data Visualisation’ and ‘Finding Stories in Open Data’ modules, a significant overlap was identified, which meant that many of the concepts would be repeated across the two modules in order to convey their narrative effectively. As such, the leaders of these modules (Southampton for ‘Data Visualisation’ and ODI for ‘Finding Stories in Open Data’) collaborated on designing a merged curriculum for ‘Data Visualisation and Storytelling’. Southampton and the ODI will release resources in collaboration as a self-study option for this course as planned in M24, and additionally will collaborate to provide either an online course or eBook covering the entire curriculum.
In addition, the ‘Visual Analytics’ module due to be delivered by Fraunhofer in M30 was replaced with ‘Social Media Analytics’, in order to teach crucial expertise on how to extract value from the ever-growing amounts of social data. This broadens our curriculum to ensure that we cover a greater range of essential data science topics, rather than focusing intently on topics where overlap would have been unavoidable.
1 http://euclid-project.eu/
Revised core curriculum
Given the changes discussed above, this deliverable sees the release of version 2.0 of the core EDSA curriculum, which is presented below. As the data visualisation and finding stories modules have been merged into one topic, and because we have added the linked data topic, the curriculum still contains 15 modules.
Table 4: Core EDSA Curriculum, version 2.0
Module | Topic | Stage | Status as of D2.2
1 | Foundations of Data Science | Foundations | Released and revised
2 | Foundations of Big Data | Foundations | Released
3 | Statistical / Mathematical Foundations | Foundations | Newly released
4 | Programming / Computational Thinking (R and Python) | Foundations | To be released M30
5 | Data Management and Curation | Storage and Processing | Newly released
6 | Big Data Architecture | Storage and Processing | Released
7 | Distributed Computing | Storage and Processing | Released and revised
8 | Stream Processing | Storage and Processing | To be released M30
9 | Linked Data and the Semantic Web | Storage and Processing | (Released)*
10 | Machine Learning, Data Mining and Basic Analytics | Analysis | Released and revised
11 | Big Data Analytics | Analysis | Newly released
12 | Process Mining | Analysis | Released
13 | Social Media Analytics | Analysis | To be released M30
14 | Data Visualisation and Storytelling | Interpretation and Use | Newly released
15 | Data Exploitation including data markets and licensing | Interpretation and Use | To be released M30
* Module based on EUCLID curriculum, and released as FutureLearn MOOC in April 2016.
4.2 Revised modules in version 2
This section outlines changes to the topics covered in modules that have been updated since their curricula were released in M6: ‘Foundations of Data Science’, ‘Distributed Computing’, and ‘Machine Learning, Data Mining and Basic Analytics’.
4.2.1 Foundations of data science (SOTON)
The syllabus for foundations of data science was revised to incorporate a small change of topics following the experiences of delivering the course as a face-to-face module within Southampton’s MSc Data Science programme. Topics were also re-organised to fit within four categories that correspond to stages of the typical data science pipeline:
● The Foundations of Data Science
  ○ Introduction
    ■ This section covers the introduction to the course, and details the learning outcomes as originally laid out in D2.1.
  ○ Data Science in a nutshell
  ○ Terminology
  ○ The data science process
  ○ A data science toolkit
  ○ Types of data
  ○ Example applications
● Data Collection and Management
  ○ Data collection as the first stage of data science
  ○ Where data comes from
  ○ Pre-processing
  ○ Cleaning and filling in missing data
    ■ Fixing data
    ■ Noisy data
  ○ Data integration
  ○ Using the cloud and principles of cloud computing
  ○ Document related storage
    ■ MongoDB
● Fundamentals of Statistics and Data Analysis
  ○ Introduction to data mining
    ■ Disciplinary components
    ■ Machine learning and statistics
    ■ Applications of data mining
      ● Real-world examples
  ○ Introduction to statistics
    ■ Nature of statistics and introduction
    ■ Central tendencies and distributions
    ■ Variance
    ■ Distribution properties and arithmetic
    ■ Samples / CLT
    ■ Linear regression
  ○ Using the R statistical package
  ○ Introduction to Machine Learning
    ■ What is machine learning? What can it do?
      ● Example applications
    ■ Supervised v unsupervised learning
    ■ Prerequisites for Machine Learning
    ■ Regression problems
    ■ Classification problems
      ● Linearly separable v nonlinearly separable
      ● Naive Bayes
      ● Support Vector Machine
● Data Visualisation
  ○ Introduction to data visualisation
  ○ Types of data visualisation
    ■ Exploratory
    ■ Explanatory
  ○ Data for visualisation
    ■ Data types
    ■ Data encodings
    ■ Retinal variables
    ■ Mapping variables to encodings
    ■ Visual encodings
  ○ Chart junk and the beauty paradox
  ○ Technologies for visualisation
    ■ Web Technologies and their role in visualisation
      ● Hypertext
      ● CSS
      ● Javascript
    ■ D3.js
      ● D3 chain syntax
      ● D3 Libraries
      ● Alternatives to D3: Python, high-level tools etc.
  ○ Data-driven journalism and Storytelling
4.2.2 Essentials of Data Analytics and Machine Learning (Persontyle)
This module is a renamed and revised version of Persontyle’s ‘Machine Learning, Data Mining and Basic Analytics’ module that launched in M6, based on the experiences of developing the learning materials and delivering the course. The list of topics and their associated descriptions have been modified as follows:
Lesson 1: R REFRESHER
o This module provides an overview of the R programming language. It is intended to be much more than a quick introduction or refresher in R. It is designed to function as a source that you can reference when working with R on the examples and exercises in the remainder of the guide. It contains not only a lot of information about the R programming language, but also general information that is important to know when programming in any language.
Lesson 2: FEATURE INITIALIZATION
o We begin our pre-processing modules by considering the initial steps that must be taken to transform the raw data that has been collected into a set of features which are capable of being examined and transformed in the further pre-processing steps, and eventually of being used in statistical modelling.
Lesson 3: FEATURE SELECTION
o Once quantitative data has been generated we should examine the features that we have available. In machine learning tasks, we often have access to far more features than we can hope to use, and an important sub-problem is selecting an information-rich subset of features to use to build our statistical models. It is important to note that more features are not necessarily better: features may or may not contain information about the target variable, but they will certainly contain noise. Further, working with additional variables tends to entail additional parameters in our models, which can increase the potential for overfitting.
Lesson 4: FEATURE TRANSFORMATION
o Feature transformation is the name given to replacing our original features with functions of these features. The functions can either be of individual features, or of groups of them. The result of these functions is itself a random variable: by definition, a function of any random variable is itself a random variable. Utilizing feature transformations is equivalent to changing the bases of our feature space, and it is done for the same reason that we change bases in calculus: we hope that the new bases will be easier to work with.
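The idea of replacing features with functions of those features can be illustrated with a minimal Python sketch. The feature names (`income`, `household_size`) and the chosen functions (a logarithm and a ratio) are hypothetical and are not taken from the course materials; the course itself works in R.

```python
import math

def transform(row):
    """Replace the original features with functions of those features.

    `row` is a dict with the hypothetical raw features 'income' and
    'household_size'; both names are assumptions for illustration.
    """
    return {
        # Function of a single feature: the log compresses a long right tail.
        "log_income": math.log(row["income"]),
        # Function of a group of features: income per household member.
        "income_per_person": row["income"] / row["household_size"],
    }

raw = [
    {"income": 20000.0, "household_size": 2},
    {"income": 90000.0, "household_size": 3},
]
transformed = [transform(r) for r in raw]
```

Each transformed record contains only the new features, so any later modelling step works in the new feature space rather than the original one.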
Lesson 5: MISSING DATA
o This module contains many references to later modules. This is because imputing missing data generally requires the use of statistical models with which to do the imputation. It is included here as it is part of pre-processing, but readers may wish to return to this module after reading the statistical modeling modules.
Lesson 6: BASIC REGRESSION MODELS
o In this module we will look at some of the simplest regression models: what they are, how they work, how they can be built from data and how they can be evaluated.
Lesson 7: MODEL EVALUATION, SELECTION AND REGULARIZATION
o When creating, selecting and evaluating statistical models, you should think in terms of a three-step process:
  ■ Training: Train models using dedicated training data.
  ■ Evaluation/Validation: Select the best performing model, either using dedicated validation data or statistical methods. We will call this step evaluation and reserve the term validation for the evaluation/testing technique discussed in the following sections.
  ■ Testing: Obtain unbiased estimates of the performance of the selected model using dedicated test data or statistical methods.
o The aim of this module is to clarify why these three steps are required and provide you with the means of performing the second and third steps.
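The three-step process can be sketched in Python as follows. This is an illustrative sketch only: the synthetic data, the 60/20/20 split and the two candidate models (a constant mean predictor and a least-squares line) are assumptions chosen to keep the example short, not part of the Persontyle materials.

```python
import random

def mse(model, data):
    """Mean squared error of `model` (a function x -> prediction) on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def fit_mean(train):
    """Candidate 1: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """Candidate 2: simple least-squares line y = a + b * x."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)
data = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)) for x in range(60)]
rng.shuffle(data)
train_set, validation_set, test_set = data[:36], data[36:48], data[48:]

# Step 1, training: fit every candidate on the training data only.
candidates = {"mean": fit_mean(train_set), "line": fit_line(train_set)}

# Step 2, evaluation: select the candidate with the lowest validation error.
best_name = min(candidates, key=lambda name: mse(candidates[name], validation_set))

# Step 3, testing: estimate the selected model's error on held-out test data.
test_error = mse(candidates[best_name], test_set)
```

The test data is touched only once, after selection, which is what keeps the final error estimate free of selection bias.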
Lesson 8: BASIC CLASSIFICATION METHODS
o In this module we look at a number of basic classification methods. We first look at cases where the independent variables are real, and examine the following algorithms:
  ■ Linear and Quadratic Discriminant Analysis
  ■ The Perceptron Algorithm
  ■ Logistic Regression
o We then look at the case where both the independent and target variables are discrete, which we will term discrete-discrete classification.
Lesson 9: ANALYSIS OF CLASSIFICATION MODELS
o We introduced misclassification error in module 8, where we also discussed the use of mean squared error for probability estimates. These are the most common error scores used for fitting parameters. You should note that misclassification is the simple accuracy statistic discussed in the next section, and hence shares its problem.
Lesson 10: LOCAL / NON-PARAMETRIC METHODS
o Local methods are so named because, when estimating a new case, they work with a subset of the training data found to be near to the new case, using some distance metric or spatial division.
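One familiar local method is k-nearest neighbours, sketched below in Python. The choice of k-nearest neighbours, the Euclidean distance metric and the toy data are assumptions made for illustration; the lesson description does not name a specific algorithm.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    # Only the k training points closest to the query influence the estimate.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled "a" and "b".
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label_near_a = knn_predict(train, (1.1, 1.0))
label_near_b = knn_predict(train, (5.1, 5.0))
```

No global model is fitted here: each prediction is computed from the local neighbourhood of the new case alone.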
Lesson 11: BASIC KERNEL METHODS
o The order of topics is as follows: We will provide a basic overview of the kernel trick. We then list some common kernels, followed by an example of an application of the kernel trick with kernel regression. We then provide background of the theory behind the kernel trick, which delves into the quite complex mathematics and which may be skipped by the intimidated reader. Finally, we examine kernel PCA.
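As a small illustration of the kernel trick, the Python sketch below checks numerically that a degree-2 polynomial kernel evaluated in the original two-dimensional space equals an ordinary inner product in an explicit three-dimensional feature space. The choice of kernel and the example vectors are assumptions for illustration, not taken from the lesson.

```python
import math

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit feature map for which phi(x) . phi(y) equals poly_kernel(x, y)."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, y)       # computed in the original 2-D space
via_features = dot(phi(x), phi(y))   # computed in the 3-D feature space
```

The two values agree up to floating-point rounding, which is why a method that only needs inner products can work in the larger feature space without ever constructing `phi` explicitly.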
Lesson 12: SPLINES
o Splines are a clever development of the more basic regression techniques that can be understood as implementing two ideas:
  ■ Instead of generating a single model for the entire feature space, we can divide it into different sections and model each of these with an individual model.
  ■ To combine these models elegantly and plausibly we will require that the regression lines meet with a degree of smoothness at the section boundaries.
Lesson 13: GAUSSIAN PROCESSES (New lesson to be delivered in August 2016)
o A stochastic process is a collection of random variables, often representing the indeterminate evolution of some system over time.
o Stochastic processes are:
  ■ Infinite-variate extensions of multivariate distributions
  ■ Models of stochastic evolution of systems over time
  ■ Distributions over functions
Lesson 14: TREES & ENSEMBLE METHODS (New lesson to be delivered in August 2016)
o In this module we introduce decision tree based methods and use them to examine a number of ensemble methods. This combination of topics is not accidental: tree models are almost always used as part of ensembles, and the most popular ensemble methods are tree based.
o Tree based methods are sometimes described as radically non-linear. In fact, they perform a form of discretization, where the feature space is divided into regions and a value estimated for the target variable in each region.
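The discretization performed by tree based methods can be seen in the simplest possible tree, a single split (a "stump"). The Python sketch below is an assumption-laden illustration: it uses one feature, a squared-error splitting criterion and hypothetical data, and it is not the procedure used in the course materials.

```python
def fit_stump(data):
    """Fit a one-split regression tree (a 'stump') to (x, y) pairs.

    Tries each midpoint between consecutive distinct x values as a threshold
    and keeps the split with the lowest total squared error. Each of the two
    resulting regions predicts the mean target of its training points.
    """
    def mean(values):
        return sum(values) / len(values)

    def sse(values):
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    xs = sorted({x for x, _ in data})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in data if x <= threshold]
        right = [y for x, y in data if x > threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    _, threshold, left_value, right_value = best
    # The fitted model is piecewise constant: one value per region.
    return lambda x: left_value if x <= threshold else right_value

# Hypothetical data: the target jumps from about 1 to about 5 between x = 3 and x = 4.
data = [(1, 1.0), (2, 1.2), (3, 0.8), (4, 5.1), (5, 4.9), (6, 5.0)]
predict = fit_stump(data)
```

A full decision tree applies the same splitting step recursively inside each region, and an ensemble combines many such trees.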
Lesson 15: SUPPORT VECTOR MACHINES (New lesson to be delivered in August 2016)
o In this module we introduce another well-known and very high performing machine learning technique: support vector machines (SVMs). We will explain SVMs in terms of solving the optimization problem for linear classifiers known as finding the maximally separating hyperplane when classes are inseparable. We examine the basic case, known as support vector classifiers, before looking at how we can obtain a more powerful procedure by transforming the variables we work with using kernels.
Lesson 16: NEURAL NETWORKS AND DEEP LEARNING
o Artificial neural networks (ANNs or just NNs) are perhaps the most famous machine learning technique. They have ridden repeated waves of hype, and disappointment, since first being formulated in the 1940s. With the development of deep neural networks (DNNs) they have proven to perform very well on a large number of tasks and we are currently in another period of strong interest in neural network techniques. This interest can be negative: too many students and data scientists have been caught up in the current hype cycle and place undue interest in what is merely one tool among many. Nonetheless, they are a powerful tool and deserve a place in any data scientist’s arsenal.
4.2.3 Distributed computing (KTH)
KTH revised the syllabus for distributed computing based on experiences of teaching courses based on the M6 version. The existing topic list remained the same; however, a third part was added to cover new and further topics:
Part I: Distributed Services and Distributed Algorithms
● Unchanged from D2.1
Part II: Clouds, Peer-to-Peer and Big Data
● Unchanged from D2.1
Part III: Follow-up courses in the Distributed Computing module
A. Distributed Artificial Intelligence and Intelligent Agents
● Introduction to Agents
○ This part introduces the notion of an intelligent agent and considers individual and group perspectives of agent technology.
● Negotiation
○ In this part we will consider negotiation principles, negotiation protocols (voting, monotonic concession, Clarke tax, auctions) and negotiation mechanisms (organizational structures, meta-level information exchange, multi-agent planning).
● Coordination
○ This part introduces coordination principles and considers coordination mechanisms (organizational structures, meta-level information exchange, multi-agent planning, norms and laws, and cooperative work).
● Interoperability
○ Here we consider approaches to software interoperation, speech acts and Agent Communication Languages (ACL), KQML, FIPA.
● Multi-Agent Systems architectures
○ In this part we discuss low-level architecture support, DAI testbeds, middle-agent based architectures and agent marketplaces.
● Software Engineering of Multi-Agent Systems
○ This part considers basic approaches to agent systems engineering, as well as GAIA and agent-UML approaches.
● Agent theory
○ This part introduces agent theory, basics of modal logic and the BDI architecture.
● Agent architectures
○ Here we discuss deliberative, reactive and hybrid agent architectures.
● Agent mobility
○ We consider requirements, implementation and security for mobile agents as well as environments for mobile agents.
B. Programming Web Services
● Introduction
○ This considers the historical perspective of service technology and basic concepts of Web services.
● Markup languages
○ Here the basics of markup languages and XML are considered.
● XML messaging
○ This considers approaches to XML messaging as well as SOAP and JSON.
● Service description
○ Here functional and non-functional components of service description are considered, together with the Web Services Description Language (WSDL).
● Service discovery
○ This considers approaches to service discovery (registry, index and peer-to-peer) and specifications for service discovery.
● Services coordination
○ Basic components of the service coordination environments are considered together with specifications. Atomic transaction and business activity are discussed.
● Service composition
○ Here basics and approaches to service composition are considered, and a language for service composition is presented.
● Services security
○ Basic elements of Web service security are presented, and the main security specifications are discussed.
● Web services and stateful resources
○ Approaches and specifications for the description of stateful resources in Web services are considered.
● Semantic Web Services
○ Here the basics of the semantic Web and its specifications for semantic Web services are discussed.
5. New modules in version 2
Following the revisions detailed above, we progressed with designing curricula for the 4 modules in M18 (with data visualisation and storytelling comprising both the original Data Visualisation module and the Finding Stories in Open Data module, as discussed above). As in D2.1, we provide a summary of each module's content, and a list of the materials that already exist, in preparation for the learning resources to be developed and released for M24.
5.1 Statistical and mathematical foundations - JSI

5.1.1 Learning Objectives

By the end of this module, learners will:

● Be familiar with concepts, definitions and methods from the field of mathematics and statistics.
● Have knowledge of the use of mathematical and statistical concepts in Data Science.
● Have gained experience of applying mathematical and statistical methods with the R and QMiner tools.
5.1.2 Syllabus and Topic Descriptions

Introduction to Mathematics
o This topic introduces the Mathematics course and provides its syllabus, aims, and recommendations for learners.

Foundations of Mathematics
o This topic gives the basic definitions from the field of Mathematics, such as arithmetic and algebra concepts, linear equations, quadratic equations, functions, differentiation and integration.

Linear Algebra
o This topic provides a number of concepts from the field of Linear Algebra, such as vector spaces, orthogonality, dot product, norm, matrices, determinant, matrix decompositions.

Optimization
o This topic gives insights into the field of Optimization and definitions related to optimization, such as minimum, maximum, saddle points, convex functions, primal and dual problems, Lagrange multipliers.

Introduction to Statistics
o This topic introduces the Statistics course and provides its syllabus, aims, and recommendations for learners.

Foundations of Statistics
o This topic provides the basic definitions from the field of Statistics, such as probability, random experiments, definition of probability, conditional probability, independence, interpretation of probability, Bayes' theorem.
o The learners will get familiar with random variables, probability density functions, expectation and expectation values.
o The topic also covers probability distributions, discrete/continuous distributions, sampling and sampling distributions.
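As a small worked example of conditional probability and Bayes' theorem from the Foundations of Statistics topic, the sketch below computes a posterior probability for a binary hypothesis. The function name and the numbers (a 1% prior, a 95% detection rate and a 5% false-positive rate) are hypothetical values chosen purely for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' theorem for a binary hypothesis H and evidence E.

    prior:               P(H)
    likelihood:          P(E | H)
    false_positive_rate: P(E | not H)
    """
    # Law of total probability: P(E) = P(E|H) P(H) + P(E|not H) P(not H)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: a rare condition and a fairly accurate test.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
```

With these inputs the posterior is roughly 0.16: even after a positive result the hypothesis remains unlikely, because the prior is so small compared with the false-positive rate.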
Exploratory Data Analysis
o This topic provides a view on Exploratory Data Analysis and its methods. In particular, the topic covers descriptive/summary statistics: central tendency (mean, median, mode), variability (standard deviation and variance), skew, kurtosis, histograms, frequency polygons, box-plots, quartiles, scatter plots, heat maps.
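The summary statistics listed for Exploratory Data Analysis can be tried out in a few lines. The module itself uses R and QMiner; the sketch below uses Python's standard `statistics` module instead, and the small data set is invented for illustration.

```python
import statistics

# Hypothetical sample, chosen so that the summary values are easy to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(data),                 # central tendency
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "variance": statistics.pvariance(data),        # population variance
    "std_dev": statistics.pstdev(data),            # population standard deviation
    "quartiles": statistics.quantiles(data, n=4),  # three cut points (Python 3.8+)
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the mean is 5, the median 4.5, the mode 4, the population variance 4 and the population standard deviation 2.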
Regression and Inferential Statistics
o This topic describes the notion of Regression and important related concepts such as covariance and correlation.
o Learners will learn about null hypothesis significance testing (NHST) and confidence intervals.
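Covariance and correlation, the two concepts named for the Regression topic, can be computed directly from their definitions. The following is a minimal sketch with hypothetical data and function names; it uses the sample (n - 1) convention, which is an assumption rather than something the syllabus specifies.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical data with an exact linear relationship (scores = 2 * hours).
hours = [1, 2, 3, 4]
scores = [2, 4, 6, 8]

cov_hs = covariance(hours, scores)
corr_hs = correlation(hours, scores)
```

Because the relationship is exactly linear with a positive slope, the correlation is 1 (up to floating-point rounding), whereas the covariance depends on the units of both variables.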
Data analytics with R
o This topic will provide a set of practical examples from the area of Mathematics and Statistics using R, a programming language and software environment for statistical computing and graphics.

Data analytics with QMiner
o This topic provides some practical insights on mathematical implementations and statistical data analysis using QMiner. QMiner implements a comprehensive set of techniques for supervised, unsupervised and active learning on streams of data.
5.1.3 Existing Courses

Table 5: Existing courses that include a module similar to Statistical and Mathematical Foundations

Institution | Course | Module | URL | Target Audience | Languages and Data Types
JSI | Data Analytics with QMiner | Data analytics with QMiner | http://videolectures.net/sikdd2014_rei_large_scale/?q=qminer | Computer scientists, mathematicians, statisticians. | QMiner
Princeton University | Statistics One MOOC (Coursera) | Statistics One | https://www.coursera.org/course/stats1 | Computer scientists, mathematicians, statisticians, other. | R
5.1.4 Existing Materials

A number of relevant materials have been identified on Videolectures:

● Statistical Methods http://videolectures.net/acai05_taylor_sm/?q=statistics
● Introduction to Statistics http://videolectures.net/cernstudentsummerschool2010_cowan_statistics/?q=statistics
● Lecture 15: Statistical Thinking http://videolectures.net/mit600SCs2011_guttag_lec15/?q=statistics
● Introduction to Statistics http://videolectures.net/cernstudentsummerschool09_cowan_is/?q=statistics
● On the Computational and Statistical Interface and "BIG DATA" http://videolectures.net/colt2014_jordan_bigdata/?q=foundations%20of%20statistics
● Probability and Mathematical Needs http://videolectures.net/bootcamp2010_anthoine_pmn/

Additionally, material has been identified in the following resources:

● James Ward and James Abdey. Foundation course: Mathematics and Statistics http://www.londoninternational.ac.uk/sites/default/files/programme_resources/ifp/fp0001_maths_and_stats_taster_2013.pdf
● Master Statistics with R https://www.coursera.org/specializations/statistics
● Basic Statistics https://www.coursera.org/learn/basic-statistics
● Inferential Statistics https://www.coursera.org/learn/inferential-statistics-intro
● Foundations of Data Analysis - Part 1: Statistics Using R https://www.edx.org/course/foundations-data-analysis-part-1-utaustinx-ut-7-10x
5.1.5 Descriptions of Exercises
Exercises will be designed to allow students to carry out data analytics with the QMiner tool: http://qminer.ijs.si/

5.1.6 Further Reading
● G. Cowan, Statistical Data Analysis, Clarendon, Oxford, 1998; see also www.pp.rhul.ac.uk/~cowan/sda
● R.J. Barlow, Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences, Wiley, 1989; see also hepwww.ph.man.ac.uk/~roger/book.html
● L. Lyons, Statistics for Nuclear and Particle Physicists, CUP, 1986
● F. James, Statistical Methods in Experimental Physics, 2nd ed., World Scientific, 2006; (W. Eadie et al., 1971).
● S. Brandt, Statistical and Computational Methods in Data Analysis, Springer, New York, 1998 (with program library on CD)
● W.-M. Yao et al. (Particle Data Group), Review of Particle Physics, J. Physics G 33 (2006) 1; see also pdg.lbl.gov sections on probability, statistics, Monte Carlo
5.2 Data management and curation - TU/E

5.2.1 Learning Objectives

The objective of this module is to introduce and explain concepts and practices for the management of research data assets, including the creation of a plan, organisation, file formats, and the creation of metadata.

It also introduces knowledge required to manage the safe access to data, from storage to access rights and licensing, including notions about privacy issues.
5.2.2 Syllabus and Topic Descriptions
● Data and data management
○ Research data explained
This unit aims to make learners aware of different types of research data, and to convey general ideas about big data that are useful for the purpose of this course.
○ Data management plans
This unit aims at explaining what data management is, and how a data management plan can be designed.
● Organization and documentation
○ Organization of data
This unit aims at introducing the concepts of data organization, as well as file versioning, naming conventions, and best practices.
○ File formats and transformation
This unit presents data formatting, compression, and other transformations.
○ Documentation, metadata, and citation
This unit explains the use of documenting data, as well as metadata.
● Data and data management
○ Storage and security
This unit introduces concepts of safety of storage and safety guidelines. It also explains data encryption, and destroying sensitive data.
○ Data protection, rights, and access
This unit aims at describing data protection, rights, and access, from privacy to intellectual property rights.
○ Sharing, preservation, and licensing
This unit presents the practical implications of sharing data, as well as licensing, in particular through different kinds of open data licences.
5.2.3 Existing Courses
● Introduction to Research Data Management and Sharing, free course on Coursera, by The University of North Carolina at Chapel Hill and The University of Edinburgh.
○ https://www.coursera.org/learn/data-management
● Data Management, distance learning course by The Open University.
○ http://www.open.ac.uk/postgraduate/modules/m816
● Digital Curation, tutorial with extensive learning materials online, by the Digital Curation Centre.
○ http://www.dcc.ac.uk/training/train-the-trainer/dc-101-training-materials
● New England Collaborative Data Management Curriculum
○ http://library.umassmed.edu/necdmc/modules
5.2.4 Existing Materials

● Research Data Management Training, free course called "MANTRA", by the University of Edinburgh.
○ http://datalib.edina.ac.uk/mantra/
● New England Collaborative Data Management Curriculum
○ http://library.umassmed.edu/necdmc/modules
● Digital Curation, tutorial with extensive learning materials online, by the Digital Curation Centre.
○ http://www.dcc.ac.uk/training/train-the-trainer/dc-101-training-materials

5.2.5 Example Quizzes and Questions
Quizzes can include questions that test the understanding of concepts, terminology, regulations, and methods.

5.2.6 Description of Exercises

● Provide rules that implement a sample policy.
● Provide a description of an infrastructure responding to a sample situation.
● Evaluate the opportunity to preserve some data, in a sample situation.

5.2.7 Further Reading
● 'Managing Research Data', Facet Publishing (2012)
● 'Principles of Data Management', Keith Gordon (2013)
● 'Preparing the Workforce for Digital Curation', National Academies Press (2015)
5.3 Big data analytics - Fraunhofer

This module was developed by Fraunhofer and provides an overview of approaches facilitating data analytics on huge datasets. Different strategies are presented including sampling to make classical analytics tools amenable for big datasets, analytics tools that can be applied in the batch or the speed layer of a lambda architecture, stream analytics, and commercial attempts to make big data manageable in massively distributed or in-memory databases. Learners will be able to realistically assess the application of big data analytics technologies for different usage scenarios and start with their own experiments.

5.3.1 Learning Objectives

Learners with experience in programming, data analytics and the architecture of big data systems get to know methods and tools to analyze big data. They will be able to realistically assess the application of big data analytics technologies for different usage scenarios and start with their own experiments.
5.3.2 Syllabus and Topic Descriptions

● Introduction:
○ Big data analytics solutions ask for skills from two different fields. First, information technology skills are needed to provide horizontally scalable systems for storing and processing data. Second, data analysis skills are needed and, in particular, tools and methods for the mathematical modelling of data. Collaborative filtering provides a running example for all aspects of big data analytics. The use-case is to provide recommendations for products based on user preferences. A concrete example is given by the task of music recommendation based on the last.fm dataset. The latter provides user/artist pairings based on listening events and can be used to generate artist recommendations. The task makes it necessary to define similarity of disparate items. This is based on utility matrices and the Jaccard similarity or, respectively, the Jaccard distance. Once a distance measure is defined, data can be analysed by clustering. On the one hand, this can be achieved using sampling to fit big datasets into classical analytics tools. On the other hand, big data analytics solutions can be integrated in a lambda architecture.
● Collaborative filtering in the lambda architecture:
○ Lingual provides an ANSI SQL interface for Apache Hadoop. This is applied for data understanding in the collaborative filtering use-case. Building the utility matrix for collaborative filtering is a typical batch processing task. This can be modelled on top of Hadoop using Cascading. Classical data analytics software cannot handle huge amounts of data. A possible solution is given by sampling a subset of the dataset which can then be analysed in classical tools using RHadoop. The actual cluster model for the batch view is computed in R whereas the application of this model is again performed using Cascading. The size of the final model is small enough such that validation is possible in R via Lingual.
● Generating real-time recommendations:
○ Stream processing can be used to update the results from collaborative filtering for new entities which have not yet been part of the batch processing. For large streams, the technique of stream synopses makes it possible to keep the data size manageable. In particular, count sketches can be used to efficiently approximate the distance computation necessary for collaborative filtering. Such a solution can be integrated in the lambda architecture using Apache Storm.
● Big data analytics in Spark:
○ Spark is one of the fastest developing platforms for big data analytics. It provides means to overcome the disk I/O overhead often seen in MapReduce-based processing. Concepts like data frames and Spark SQL make it convenient to work with structured data inside Spark programs. In particular, it provides Spark MLlib, a library that contains many distributed machine learning approaches suitable for big data analytics. PySpark allows the powerful combination of Spark with Python and thus combines distributed processing with the wide range of libraries for data analytics and visualisation in Python. Linear regression, collaborative filtering with alternating least squares (ALS), K-Means clustering, and power iteration clustering provide the means to implement large scale collaborative filtering in Spark. Spark offers a framework for machine learning pipelines. These make it easier to combine multiple algorithms into a single workflow with a common interface to parameters. Linear regression is available in both Spark and Python. It may be beneficial to use either the one or the other, depending on data size.
● Complex event processing with Proton:
○ The basic principle of complex event processing is to derive complex events on the basis of a possibly large number of simple events using an event processing logic. Proton on Storm allows running an open source complex event processing engine in a distributed manner on multiple machines using the Storm infrastructure. Event processing networks provide a conceptual model describing the event processing flow execution. Such a network comprises a collection of event processing agents, event producers, and event consumers that are linked by channels. Fraud detection on call detail records is an illustrative use-case example showing how these concepts can be implemented.
● SQL operators for MapReduce with Teradata:
○ Database management system providers seek to enhance their traditional databases and make them applicable to big data use-cases. A basic concept to achieve this is given by partitioning of tables, leading to massively parallel databases. Table operators allow making use of the partitioning for distributed algorithms using MapReduce. A selected commercial tool offering these approaches is the Teradata Aster solution.
● In-memory processing:
○ More and more main memory becomes available at a reasonable price. As access speed is reduced significantly once data outside of main memory is accessed, high performance applications focus on keeping as much data as possible in main memory. There is a wide variety of in-memory database systems available. Central performance and applicability measures to be kept in mind when choosing such a system comprise operating system compatibility, hardware requirements, license and support issues, runtime monitoring capabilities, memory utilisation, database interface standards, extensibility, portability, integration of open source big data technologies, local and distributed scaling and elasticity, available analytics functionality, persistence, availability, and security.
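To make the complex event processing topic in the syllabus above more concrete, the sketch below derives a complex "burst" event from a stream of simple call events, loosely in the spirit of the fraud detection use-case on call detail records. It is plain single-machine Python and does not use Proton or Storm; the rule (more than three calls by one caller within 60 seconds), the event layout and the function name are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def detect_bursts(events, max_calls=3, window_seconds=60):
    """Emit a complex event whenever a caller exceeds max_calls within the window.

    events: iterable of (timestamp_seconds, caller_id) simple events in time order.
    Returns a list of (timestamp, caller_id, calls_in_window) complex events.
    """
    recent = defaultdict(deque)  # caller_id -> timestamps still inside the window
    alerts = []
    for timestamp, caller in events:
        calls = recent[caller]
        calls.append(timestamp)
        # Drop simple events that have fallen out of the sliding window.
        while calls and timestamp - calls[0] > window_seconds:
            calls.popleft()
        if len(calls) > max_calls:
            alerts.append((timestamp, caller, len(calls)))
    return alerts


# Hypothetical call detail records: (time in seconds, caller).
calls = [(0, "A"), (10, "A"), (15, "B"), (20, "A"), (30, "A"), (200, "A")]
alerts = detect_bursts(calls)
```

In event processing network terms, the input list plays the role of the event producer, `detect_bursts` is a single event processing agent, and the returned alerts are what an event consumer would receive.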
5.3.3 Existing Courses

● Big data analytics, a face-to-face training offered by Fraunhofer IAIS http://www.iais.fraunhofer.de/bigdataanalytics.html

5.3.4 Existing Materials
● Data Mining and Applications Graduate Certificate at Stanford Center for Professional Development, http://scpd.stanford.edu/public/category/courseCategoryCertificateProfile.do?method=load&certificateId=1209602
● Machine Learning With Big Data at Coursera https://www.coursera.org/specializations/bigdata
● Mining Massive Data Sets, Stanford University http://web.stanford.edu/class/cs246/
● DATA SCIENCE AND BIG DATA ANALYTICS by EMC https://education.emc.com/guest/campaign/data_science.aspx
● Spark Fundamentals I by BigData University http://bigdatauniversity.com/courses/spark-fundamentals/
5.3.5 Example Quizzes and Questions

Multiple choice questions concerning

● The Jaccard distance to build the utility matrix for collaborative filtering
● Locating collaborative filtering in the lambda architecture
● The role of classical data analytics for big data
● The count-distinct sketch algorithm
● The alternating least squares algorithm
● The parallelization of k-means clustering
● Big data analytics in Spark
● In-memory processing
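As a study aid for the first quiz topic, the Jaccard similarity and distance can be computed on sets of listeners per artist. The sketch below is an assumption-laden illustration: the artist and user identifiers are invented and do not come from the last.fm dataset, and the convention that two empty sets have similarity 1 is a choice made here.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical utility data: for each artist, the set of users who listened to them.
listeners = {
    "artist_1": {"u1", "u2", "u3"},
    "artist_2": {"u2", "u3", "u4"},
    "artist_3": {"u5"},
}

sim_12 = jaccard_similarity(listeners["artist_1"], listeners["artist_2"])
dist_13 = jaccard_distance(listeners["artist_1"], listeners["artist_3"])
```

Here artist_1 and artist_2 share two of four distinct listeners, giving a similarity of 0.5, while artist_1 and artist_3 share none, giving the maximum distance of 1.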
5.3.6 Further Reading

● http://github.com/ishkin/Proton/tree/master/IBM%20Proactive%20Technology%20Online%20on%20STORM
● http://www.teradata.de/products-and-services/analytics-from-aster-overview
● http://en.wikipedia.org/wiki/List_of_in-memory_databases
● http://s.fhg.de/in-memory-systems
5.4 Data visualisation and storytelling - SOTON/ODI

As discussed above, this module is a combination of the 'Data visualisation' and 'Finding stories in open data' modules that were listed in the first version of the curriculum. We present here an overall curriculum for the new module that combines the topics of the original modules.

5.4.1 Learning Objectives

By the end of the course, learners will be able to:

● Explain the role of data visualisation in data science
● Describe why particular visualisations are either good or bad
● Design visualisations that tell a coherent story
● Explain how the brain perceives visual information
● Create static and interactive visualisations

5.4.2 Syllabus and Topic Descriptions

The main syllabus is separated into four 'stages': fundamentals, planning, storytelling and producing visualisations.
Fundamentals

● Importance of visualisation within data science: this section discusses the current data situation and how the volume of data means there is a growing need for actionable insights that can be revealed through visualisation.
● Dashboards and infographics: we will review the rise of dashboards and infographics and how these are used to tell coherent stories about data.
● Origin of graphs: this covers a brief history of visualisation, looking at why certain graphs or charts were introduced in order to make sense of growing amounts of data.
● European and local laws: this topic will cover how law is used to protect freedom of speech within journalism.
● Ethics in communicating stories: what are the ethical considerations required when content can be shared rapidly to proliferate across the Web?

Planning

● Audience-Story-Action: this section will cover three key considerations that a designer must make before creating a visualisation.
● Headline writing and story production: looking at how to scope a data journalism project, we will cover how to effectively communicate insight based on a thorough understanding of your audience.
● 80:20 rule: how this applies in data journalism and how you can reduce time spent on early stages of the journalism process.
Storytelling

● Visual perception and information design: we will look at the cognitive process behind how the brain interprets visuals, including psychological theories such as gestalt theory, pre-attentive processing, and elementary perceptual tasks. We will show how an understanding of these allows us to notice how the brain can be tricked, so that bad design can be avoided.
● Colour and interaction to draw attention to data
● Chartjunk and bad design: we will explore Tufte's theories of minimalism and chartjunk as guidelines for producing effective graphics.

Producing Visualisations

● Obtaining data: this topic looks at how to obtain the data required to tell a compelling story, and what datasets are required to tell different types of story.
● Cleaning data: this looks at how to prepare data for analysis once it has been gathered, beginning with the essential process of data cleansing
○ Identifying outliers, fixing typographical errors, correcting missing data
○ Enriching data
● Types of chart: what charts can be used for what data, and which stories can these tell?
● Building visualisations with D3: this topic looks at the practical ways to visualise data to tell a story. Technologies covered include HTML5, D3.js and DC.js to produce compelling online graphics.
5.4.3 Existing Courses
● Data Visualization and D3.js MOOC (Udacity) - Zipfian Academy
○ https://www.udacity.com/course/data-visualization-and-d3js--ud507
● Big Data: Data Visualisation MOOC (Futurelearn) - Queensland University of Technology
○ https://www.futurelearn.com/partners/queensland-university-of-technology
● Data visualisation one-day workshop (The Guardian)
○ https://www.theguardian.com/guardian-masterclasses/2015/aug/07/data-visualisation-a-one-day-workshop-tobias-sturt-adam-frost-digital-course
● Data visualisation (module on Master's degree on Data Science) - Barcelona Graduate School of Economics
○ http://www.barcelonagse.eu/sites/default/files/course-d007-data-visualization.pdf
● Data Analysis, Visualisation and Communication (MSc course) - University of Aberdeen
○ http://www.abdn.ac.uk/study/postgraduate-taught/degree-programmes/46/data-analysis-visualisation-and-communication/
5.4.4 Existing Materials

● Finding Stories in Data - Course Materials - http://training.theodi.org/FindingStories/

Additional resources will be developed by both SOTON and ODI.

5.4.5 Example Quizzes and Questions

Quizzes will be in the form of multiple-choice questions to test knowledge of:

● The importance of data visualisation and storytelling
● How particular data should be encoded into visual dimensions
● The stages of planning a visualisation
● Chartjunk and minimalism in visualisation design

5.4.6 Description of Exercises

● Visualising data: students will be guided through how to create an online visualisation, with participants discovering the challenges involved in the process.

5.4.7 Further Reading

● Data Journalism Handbook - http://datajournalismhandbook.org/
● Cairo, A. The Functional Art
● Tufte, E. The visual display of quantitative information
● Few, S. Show me the numbers
5.5 Linked Data and the Semantic Web - SOTON

This module is a new addition to the curriculum and fills a gap in the previous version regarding the coverage of semantic technologies and data management.

5.5.1 Learning Objectives
The module is intended for anyone wanting to gain a basic understanding of linked data, before seeking out further study opportunities, or as part of their existing training. It will appeal to many computer science, web science, or data science graduates, students and practitioners wishing to expand their knowledge.

Linked data allows structured publication of data on the Web, which facilitates the exposure and interlinking of datasets so that data may be exchanged, reused and integrated. Increasing numbers of organisations and initiatives are starting to appreciate the power of linked data, and it provides significant power and potential to the growing domain of data science and web science.

This module will expose you to the knowledge and skills around using linked data: skills that are in increasing demand as the Web evolves from a Web of Documents to a Web of Data. It will help you understand how you can use linked data technologies and how to write queries in SPARQL, allowing you to exploit these technologies in your own work.

By the end of this primer in Linked Data, learners will be able to:
● Describe various background technologies and standards behind linked data
● Describe the role of linked data in data science applications
● Understand how linked data applications are designed and used
● Use SPARQL to process linked data
5.5.2 Syllabus and Topic Descriptions

Week One - Linked Data fundamentals.
o By the end of this week you will be able to:
    ○ Reflect on why linked data is important for structuring the Web of Data
    ○ Summarise the background technologies behind linked data
    ○ Outline the various background standards behind linked data
    ○ Explain what linked data is
    ○ Summarise the principles behind linked data
    ○ Explain what 5-star linked open data means
    ○ Evaluate different linked data tools
o This part of the course covers the basic topics required to understand linked data and the semantic web. It begins with reviewing the technologies and standards behind the Web itself and how these are evolving to facilitate linked data. Students will gain an understanding of what linked data is, before we cover a number of principles behind it and the 5-star model of rating data openness.

Week Two - SPARQL Queries.
o By the end of this week you will be able to:
    ○ Describe the basic concepts of SPARQL
    ○ Recall the definitions of SPARQL terminology
    ○ Apply knowledge of SPARQL to formulate SPARQL queries
    ○ Detect different types of SPARQL queries
    ○ Construct SPARQL queries
o This section of the course introduces the learners to SPARQL, the language for querying the semantic web. We cover general principles such as how to structure a query, and provide definitions for the required terminology, before introducing basic query forms such as ASK and, more predominantly, SELECT.

Week Three - More SPARQL Queries.
o By the end of this week you will be able to:
    ○ Design more advanced SPARQL queries
    ○ Detect more types of advanced SPARQL query
    ○ Describe the practical applications for SPARQL and Linked Data
o The third part of the course moves on to more advanced SPARQL queries, particularly DESCRIBE and CONSTRUCT. We cover the essential syntax and structure of these queries, building on previously gained knowledge from earlier in the course about querying linked data. Finally we cover the various applications of linked data and how this technology can be used to develop rich tools that provide increased value through the linking of data.
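The idea behind the SELECT and ASK query forms introduced in Week Two can be illustrated without a SPARQL engine by matching triple patterns against a small in-memory graph. The sketch below is a deliberately simplified assumption: it handles only basic graph patterns, the `ex:` resources are invented, and real SPARQL processors do considerably more (prefix resolution, FILTER, OPTIONAL and so on).

```python
def match(pattern, triple, bindings):
    """Try to extend bindings so that pattern matches triple; return None on failure."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable binds to the value, or must agree with its earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def select(graph, patterns):
    """Return every set of variable bindings satisfying all patterns (like SELECT *)."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


def ask(graph, patterns):
    """Return True if at least one solution exists (like ASK)."""
    return bool(select(graph, patterns))


# Hypothetical RDF-like graph as (subject, predicate, object) triples.
graph = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", '"Alice"'),
]

# Roughly: SELECT * WHERE { ?x foaf:knows ?y . ?y foaf:knows ?z }
friends_of_friends = select(graph, [("?x", "foaf:knows", "?y"), ("?y", "foaf:knows", "?z")])

# Roughly: ASK { ex:carol foaf:knows ?who }
carol_knows_someone = ask(graph, [("ex:carol", "foaf:knows", "?who")])
```

The first query yields a single solution binding ?x, ?y and ?z to alice, bob and carol, while the ASK query is false because the graph contains no triple with ex:carol as subject.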
5.5.3 Existing Courses

This course takes inspiration from the EUCLID project's iBook, which is openly available and contains quizzes and exercises to test knowledge in a self-study format.

5.5.4 Existing Materials

The EUCLID iBook is available to download from the Apple iBooks store, and from the EU project's website at http://euclid-project.eu/

5.5.5 Example Quizzes and Questions

Quizzes test the students' knowledge of various concepts such as:

● The semantic web technology stack
● RDF graphs
● SPARQL query results

5.5.6 Description of Exercises

Interactive exercises will provide students with hands-on experience of using linked data and SPARQL. Two variations are provided: the first allows the student to write their own RDF statements and then use SPARQL to verify that they are correctly formed, while the second uses a live SPARQL endpoint to provide a more complex dataset on which the queries outlined in the course can be tested and experimented with.
5.5.7 Further Reading

● The EUCLID project iBook:
○ Simperl, E., Norton, B., Acosta, M., Maleshkova, M., Domingue, J., Mikroyannidis, A., Mulholland, P. and Power, R., 2013. Using linked data effectively.
● T. Berners-Lee, J. Hendler and O. Lassila (2001) "The Semantic Web". Scientific American vol. 284 number 5, pp 34-43. Available on-line at http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
● P. Hitzler, M. Krötzsch, and S. Rudolph (2010). "Query Languages: Foundations of Semantic Web Technologies". CRC Press
6. Data science skill groups
In Table 3 above, we present insights from the demand analysis carried out in WP1. One of these relates to the recommendation that EDSA should develop a data science skills framework in order to structure the skills requirements of particular jobs. Additionally, this would allow data scientists themselves to assess their skills profiles, compare them with those required for data jobs, and identify areas in which they need to seek further training. Linking with the WP1 demand dashboard would facilitate the automatic search for relevant training courses for a job that a user has found and needs to gain skills for.

In this section we present an initial matching of modules to skills, based on the curricula released in this and the previous deliverable. For the purposes of this skills assignment, we use a list of skills that has been used throughout the EDSA project. The assignment of skills to modules is shown in Table 6.

In addition, further skills that were not in the current vocabulary and which related to particular tools or technologies were extracted from the curricula in order to enhance the accuracy of this mapping:
Foundations of Big Data
o QMiner
o MapReduce

Big Data Architecture
o MapReduce
o Apache Storm
o Cassandra
o Hadoop

Distributed Computing
o Hadoop

Essentials of Data Analytics and Machine Learning
o R

Process Mining
o ProM

Statistical and Mathematical Foundations
o R
o QMiner

Big Data Analytics
o Hadoop
o R
o Apache Storm
o Spark
o Python
o MapReduce
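Table 6: Skills that each module curriculum targets

Module columns: Foundations of Data Science; Foundations of Big Data; Big Data Architecture; Distributed Computing; Essentials of Data Analytics and Machine Learning; Process Mining TU/E; Statistical and mathematical foundations; Data management and curation; Big data analytics; Data visualisation and storytelling; Linked Data and the Semantic Web; Total Coverage

Skill group | Skill | Marks | Total Coverage
advanced computing | python | | 0
advanced computing | advanced computing | x X X | 3
advanced computing | programming | X X X | 3
advanced computing | computational systems | x X | 2
advanced computing | coding | X X | 2
advanced computing | cloud computing | X X | 2
data skills | databases | X X | 2
data skills | data management | X X X X | 4
data skills | data engineering | | 0
data skills | data mining | X X x | 3
data skills | data formats | X | 1
data skills | linked data | X | 1
data skills | information extraction | X X | 2
data skills | stream processing | x X | 2
domain expertise | enterprise process | | 0
domain expertise | business intelligence | X X | 2
domain expertise | data anonymisation | X | 1
domain expertise | semantics | | 0
domain expertise | schema | | 0
domain expertise | data licensing | X | 1
domain expertise | data quality | | 0
domain expertise | data governance | X | 1
general | data science | X x | 2
general | big data | X X X X | 4
general | open data | | 0
machine learning | machine learning | x X X | 3
machine learning | social network analysis | | 0
machine learning | inference | X | 1
machine learning | reasoning | X | 1
machine learning | process mining | X | 1
maths & statistics | linear algebra | X X X | 3
maths & statistics | calculus | X | 1
maths & statistics | mathematics | X X | 2
maths & statistics | statistics | X X X | 3
maths & statistics | probability | X | 1
maths & statistics | Rstudio | | 0
maths & statistics | data analytics | X X X | 3
maths & statistics | data analysis | X X X | 3
visualisation | data visualisation | X X | 2
visualisation | infographics | X | 1
visualisation | data mapping | X | 1
visualisation | data stories | X | 1
visualisation | data journalism | X | 1
visualisation | d3js | X | 1
visualisation | tableau | | 0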
This mapping represents an initial categorisation of the modules in our curriculum based on the skills that they train. In the coming months, further refinement will be made; in particular, the back-end work of the dashboard is also producing an updated list of skills based on their increasing occurrence in European job posts over time. Using this will provide a more granular and accurate assessment of the exact skills covered in each module.

Additionally, the mapping is currently at a curriculum level, indicating which topics are covered by the syllabus of each module. Further complexity arises when different instances or delivery mechanisms of each course are considered: for example a partner may produce an online course for their module and in doing so base it around a specific language or tool, whereas for a face-to-face offering they might use a combination of tools instead. This requires a classification not only at the curriculum level, but at the course level, too.
7. Conclusion
Following the release of the first version of the EDSA curriculum in M6, we have since had time to develop and run a number of EDSA modules and receive both student and teacher feedback from these. Additionally, insights from our demand analysis are beginning to shape changes to the modules and curriculum structure itself, ensuring that the EDSA curriculum remains an integral resource for data scientists to discover new opportunities for training and work.

This deliverable has reviewed those changes made to date, as well as presenting the next group of modules from the curriculum to be released. At M30, we will present the final version of the curriculum, which will include fewer brand new modules and instead will focus on bringing all the insights from our training delivery, demand analysis and learning analytics together. Each module and associated course will be tagged with the skills which a student will learn. Courses released at M24 and M36 will therefore relate directly to these skills to join up the demand analysis work with the course delivery offerings. A prospective data science worker will therefore be able to locate jobs that interest them, identify their own skills shortages, and then locate training materials in order to help them meet the requirements of the job post.