Making Big Data Projects Successful - Data Science Pop-up Seattle

Page 1: Making Big Data Projects Successful - Data Science Pop-up Seattle

#datapopupseattle

AARON CORDOVA, CTO and Co-Founder, Koverse

aaroncordova

Making Big Data Projects Successful

koverse

Page 2: Making Big Data Projects Successful - Data Science Pop-up Seattle

#datapopupseattle

UNSTRUCTURED Data Science POP-UP in Seattle

www.dominodatalab.com

Produced by Domino Data Lab

Domino’s enterprise data science platform is used by leading analytical organizations to increase productivity, enable collaboration, and publish models into production faster.

Page 3: Making Big Data Projects Successful - Data Science Pop-up Seattle

Keys to making successful big data projects repeatable

Page 4: Making Big Data Projects Successful - Data Science Pop-up Seattle

© Koverse | Company Confidential

Intro

Aaron Cordova
CTO, co-founder at Koverse Inc.
Built successful big data systems for DOD, Intelligence, Finance

Page 5: Making Big Data Projects Successful - Data Science Pop-up Seattle

Big Data Projects

How it tends to be

How it should be

Page 6: Making Big Data Projects Successful - Data Science Pop-up Seattle

Big Data Projects

Interesting part

Page 7: Making Big Data Projects Successful - Data Science Pop-up Seattle

Big Data Projects

Interesting part

Page 8: Making Big Data Projects Successful - Data Science Pop-up Seattle

Big Data Projects

Interesting part

More propellant

Support Infrastructure

Propellant

Launch platform

Utilities

Page 9: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 1: Import

Bring the data to the data scientist
From where?

Page 10: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 1: Security

Sensitive data requires access controls
Using more than 1 dataset requires fine-grained access controls
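
To make "fine-grained access controls" concrete, here is a minimal sketch, assuming each record carries a set of required labels and a reader sees a record only when holding all of them. The `Record` class, the label names, and the sample datasets are hypothetical illustrations, not something taken from the slides or from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Record:
    dataset: str        # hypothetical dataset name
    fields: dict        # the record's payload
    labels: frozenset   # labels a reader must hold to see this record


def visible_records(records, user_labels):
    """Return only the records whose required labels the user holds."""
    granted = frozenset(user_labels)
    return [r for r in records if r.labels <= granted]


# Hypothetical records drawn from three co-located datasets.
records = [
    Record("customers", {"CustomerID": 17, "region": "WA"}, frozenset({"sales"})),
    Record("payments", {"CID": 17, "card": "redacted"}, frozenset({"sales", "finance"})),
    Record("web_logs", {"path": "/home"}, frozenset()),
]

analyst_view = visible_records(records, {"sales"})
print([r.dataset for r in analyst_view])
```

Because the check happens per record rather than per dataset, several datasets with different sensitivities can sit in one system and still be queried together.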

Page 11: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 1: Security

Page 12: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 2: Data Assumptions

Need to find out:

1.  Structure of the data (field names, types)
2.  Data semantics (is CustomerID in dataset A equal to CID from dataset B?)

Initial assumptions are almost certainly wrong.
Need to see actual data samples.
Go back, get more datasets; normalize, clean up data.
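
The two questions above can be probed on real samples with very little code. The following is a minimal sketch, assuming the samples are available as lists of Python dictionaries; the function names and the sample rows are hypothetical. It reports which types actually occur in each field, and what fraction of the distinct CustomerID values in dataset A also appear as CID in dataset B.

```python
from collections import defaultdict


def infer_schema(samples):
    """Map each field name to the set of Python type names seen in the samples."""
    schema = defaultdict(set)
    for row in samples:
        for name, value in row.items():
            schema[name].add(type(value).__name__)
    return dict(schema)


def key_overlap(rows_a, field_a, rows_b, field_b):
    """Fraction of distinct values of field_a that also occur in field_b."""
    values_a = {row[field_a] for row in rows_a if field_a in row}
    values_b = {row[field_b] for row in rows_b if field_b in row}
    if not values_a:
        return 0.0
    return len(values_a & values_b) / len(values_a)


# Hypothetical samples; note the string "42" hiding among integer IDs.
dataset_a = [
    {"CustomerID": 17, "name": "Ada"},
    {"CustomerID": 42, "name": "Lin"},
    {"CustomerID": "42", "name": None},
]
dataset_b = [{"CID": 42, "total": 9.5}, {"CID": 99, "total": 3.0}]

print(infer_schema(dataset_a))
print(key_overlap(dataset_a, "CustomerID", dataset_b, "CID"))
```

A high overlap only suggests that two fields mean the same thing; a low overlap or a mixed-type field is a signal to go back and look at more samples before running analytics.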

Page 13: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 2: Data Assumptions

If primary analytical system can’t handle discovery, need another system for sampling, viewing, cleaning up, normalizing data

Page 14: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 3: Interesting Part!

Run analytics!
Need some sort of system for running analytics:

R, Python, Spark, MLLib, MapReduce, SAS

Page 15: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 4: Delivering Results

Reports are relatively easy to deliver: run once a day, small output
Some results are large, need to stay in the system
Indexing makes results searchable for a large number of consumers
Results can be embedded in interactive decision-making apps with an API
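
As an illustration of why indexing makes large results searchable, here is a minimal sketch of an in-memory inverted index over analytic results. It assumes results are small dictionaries keyed by an integer ID; the `ResultIndex` class and the sample results are hypothetical, and a production system would use a real search index instead.

```python
from collections import defaultdict


class ResultIndex:
    """Toy inverted index: token -> IDs of the results containing it."""

    def __init__(self):
        self._results = {}
        self._postings = defaultdict(set)

    def add(self, result_id, result):
        """Store a result and index every whitespace-separated token in its values."""
        self._results[result_id] = result
        for value in result.values():
            for token in str(value).lower().split():
                self._postings[token].add(result_id)

    def search(self, query):
        """Return the results that contain every token of the query."""
        tokens = query.lower().split()
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return [self._results[i] for i in sorted(ids)]


# Hypothetical analytic output.
index = ResultIndex()
index.add(1, {"customer": "Acme Corp", "risk": "high"})
index.add(2, {"customer": "Acme Labs", "risk": "low"})

print(index.search("acme high"))
```

A `search` method like this is the kind of call an application-facing API would wrap, so consumers look up individual results instead of rerunning the analytic.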

Page 16: Making Big Data Projects Successful - Data Science Pop-up Seattle

Step 4: Delivering Results

Find some system for indexing analytical results, possibly copying data; address consistency issues
Apply some solution for making results available via an API so they can be embedded in applications
… Then build applications

Page 17: Making Big Data Projects Successful - Data Science Pop-up Seattle

Scalability

Even if original datasets are small, multiple datasets need to be co-located
Original data is transformed into derivatives
Indexed data requires more space
Scalability becomes a problem eventually

Page 18: Making Big Data Projects Successful - Data Science Pop-up Seattle

Scalability

Migrate original solution to a scalable system.
Rewrite analytics, data flow for the scalable system.

Page 19: Making Big Data Projects Successful - Data Science Pop-up Seattle

Repeatability

System works! Now what?
As new data arrives, the whole process needs to be re-run, or run on all the available data
If any assumptions or structure of the data change, need to be able to re-process data
Live updates need to be scheduled, resource demands need to be balanced
Oh yeah, and go back and address security… if possible
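
One way to picture the re-run versus re-process distinction is a small planning function. This is a minimal sketch, assuming data arrives in named batches and each processed batch records the pipeline version that handled it; the function name, batch names, and version numbers are hypothetical.

```python
def plan_run(batches, processed, pipeline_version):
    """Return the batch IDs that must be processed or re-processed.

    batches: all batch IDs currently available, in arrival order.
    processed: dict mapping batch ID -> pipeline version that last handled it.
    pipeline_version: bumped whenever assumptions or data structure change.
    """
    return [b for b in batches if processed.get(b) != pipeline_version]


# Hypothetical state: two batches already handled by version 1, one new arrival.
batches = ["batch-1", "batch-2", "batch-3"]
processed = {"batch-1": 1, "batch-2": 1}

print(plan_run(batches, processed, 1))  # only the new batch
print(plan_run(batches, processed, 2))  # assumptions changed: everything
```

Keeping this bookkeeping inside the system is what lets new data trigger a cheap incremental run while a changed assumption triggers a full re-process, without anyone rebuilding the flow by hand.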

Page 20: Making Big Data Projects Successful - Data Science Pop-up Seattle

Working backwards

Page 21: Making Big Data Projects Successful - Data Science Pop-up Seattle

Working backwards

Want to provide value from data but first have to:

Address data discovery, security, scalability, repeatability…

Page 22: Making Big Data Projects Successful - Data Science Pop-up Seattle

Avoid Yak Shaving

Page 23: Making Big Data Projects Successful - Data Science Pop-up Seattle

Recommended approach

1.  Start with scalable technologies
2.  Build in security from the start
3.  Admit that data is messy, make it possible to address data quality issues within the system
4.  Integrate with whatever analytical tools data scientists want to use
5.  Integrate indexing and search into the system, avoid copying data
6.  Allow for prototyping new data flows, analytics, apps in production system. Going live is a matter of configuration, not a rewrite

Page 24: Making Big Data Projects Successful - Data Science Pop-up Seattle

Recommended approach

Go from 2-3 successful projects per year to 20-30