@SnowflakeDB #CloudAnalytics17
LONDON
Enabling the Agile Data Warehouse
Steve Herskovitz, VP Sales Engineering, Snowflake Computing
Enabling the Agile Data Warehouse - Agenda
• Agile Warehouse Scaling
  • Separation of Workloads
  • Virtual WH Scaling Techniques
• Agile Data Lifecycle
  • Cloning
• Agile Data Analytics
  • Time Travel
• Real Customer Story
  • GTA - Gulliver's Travel Associates
Agile Warehouse Scaling: Separation of Workloads
Separation of Workloads
• Struggle: multiple workloads sharing a fixed resource
  • Overnight batch ETL
    • ETL must complete before business workloads start
    • Planned or unexpected data surges can cause ETL to run late
    • Worse yet, overnight in US is exactly the UK business day
    • ETL and business workloads impact each other's SLAs
  • Competing business workloads
    • Sales, Marketing, Finance, Data Science
  • Conventional solutions
    • Divide fixed resource into time slots for each workload
    • Complex workload prioritization schemes
    • Periodic or seasonal surges handled by locking out some users
Separation of Workloads
• Struggle: multiple workloads sharing a fixed resource
• Snowflake solution
  • Assign each workload its own Virtual WH
  • At a minimum, two WH: ETL and Business
    • ETL can run continuously if it makes sense for the business
    • ETL/Business workload contention is eliminated
  • Further subdivide business workloads into own clusters as needed
    • Eliminate contention between Sales, Marketing, Finance, Data Science workloads
    • International groups can operate clusters on their own local time schedules
    • Permits departmental chargebacks
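As a minimal sketch of giving each workload its own virtual warehouse, the Python helper below only builds the SQL text and does not connect to Snowflake. The warehouse names, sizes, and the auto-suspend value are illustrative assumptions, not taken from the talk.

```python
# Hypothetical workload-to-size mapping; names and sizes are assumptions.
WORKLOADS = {
    "ETL_WH": "LARGE",
    "FINANCE_WH": "SMALL",
    "MARKETING_WH": "SMALL",
    "SALES_WH": "SMALL",
}


def create_warehouse_sql(name: str, size: str, auto_suspend_s: int = 300) -> str:
    """Build a CREATE WAREHOUSE statement for one workload."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_s} AUTO_RESUME = TRUE;"
    )


# One statement per workload, so ETL and business queries never share compute.
statements = [create_warehouse_sql(n, s) for n, s in WORKLOADS.items()]
for stmt in statements:
    print(stmt)
```

Because each warehouse is a separate object, usage can be reported per warehouse, which is what makes the departmental chargebacks mentioned above practical.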
[Diagram: a shared Databases layer with separate Virtual Warehouses for ETL & Data Loading, Test/Dev, Research, and the business workloads Finance, Marketing, and Sales]
Agile Warehouse Scaling: Techniques
Warehouse Scaling Techniques
• Increase T-shirt size
  • More data being analyzed
  • More complex queries
  • Get some concurrency boost
• Multi-cluster WH is best for concurrency
  • Workload queries have usual weight
  • But more of them, e.g. 20 dashboard users rather than the usual 5
• Combine these two techniques for best effect
MCWH Scaling Algorithms - Scaling Up
• Scheduler checks if it should spin up another cluster
  • Queries must queue for > 30 seconds
  • Spinning up cluster is often immediate (for XXL or smaller)
  • Queries begin to get load-balanced across new clusters
  • One-minute rule: 60 seconds of load balancing before next queuing interval
  • Repeat up to maximum clusters configured for warehouse
• Designed to
  • Balance responsiveness against cost
  • Ensure SLAs
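The scale-up rules on this slide can be read as a simple decision function. The sketch below is an illustration of those rules only, not Snowflake's scheduler: the function name, its parameters, and the way queue time is measured are assumptions.

```python
QUEUE_THRESHOLD_S = 30  # slide: queries must queue for > 30 seconds
BALANCE_WINDOW_S = 60   # slide: one-minute rule before the next queuing interval


def should_scale_up(running_clusters: int, max_clusters: int,
                    longest_queue_wait_s: float,
                    secs_since_last_scale_up: float) -> bool:
    """Return True when another cluster should be started (toy model)."""
    if running_clusters >= max_clusters:
        return False  # already at the configured maximum
    if secs_since_last_scale_up < BALANCE_WINDOW_S:
        return False  # still letting load balancing settle on the new cluster
    return longest_queue_wait_s > QUEUE_THRESHOLD_S
```

Calling this once per scheduling tick reproduces the "repeat up to maximum clusters" behaviour: each new cluster gets a minute of load balancing before queuing is measured again.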
MCWH Scaling Algorithms - Load Balancing
• Scheduler checks for clusters to distribute queries
  • Cluster is active (not quiesced)
  • Cluster has latest version of software (in case system update in progress)
  • Cluster has head-room for more queries
  • Cluster is the least busy
  • Session affinity breaks any ties (for cache)
• Designed to maximize
  • Individual query performance
  • Overall throughput
  • Overall concurrency
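The selection criteria above can be sketched as a filter followed by a tie-break. This is a toy model under stated assumptions: the dictionary keys, the use of running-query count as the measure of "least busy", and the function name are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_cluster(clusters: List[Dict], session_cluster: Optional[str] = None) -> Optional[str]:
    """Pick the cluster for the next query, or None if no cluster qualifies."""
    # Keep clusters that are active, on the latest software, and have head-room.
    eligible = [
        c for c in clusters
        if c["active"] and c["latest_version"] and c["running"] < c["capacity"]
    ]
    if not eligible:
        return None
    # "Least busy" is approximated here by the fewest running queries.
    least = min(c["running"] for c in eligible)
    tied = [c["name"] for c in eligible if c["running"] == least]
    # Session affinity breaks ties so a session can reuse a warm cache.
    if session_cluster in tied:
        return session_cluster
    return tied[0]
```

Affinity is applied only among equally busy clusters, so it never overrides the least-busy rule.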
MCWH Scaling Algorithms - Scaling Down
• Scheduler checks if it can spin down a cluster
  • One-minute rule: 60 seconds of load balancing before check if WH underloaded
  • Check if WH with one less cluster could have handled the load over 15 minutes
  • Quiesce cluster: finish current queries but accept no new queries
  • Wait another 15 minutes before checking if can quiesce another cluster
  • Repeat down to minimum clusters configured for warehouse
• Designed to
  • Maximize the value of the running clusters
  • Minimize the cost of the MCWH
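A matching sketch for the scale-down rules follows. As before, it illustrates the slide rather than Snowflake's implementation; measuring load as peak concurrent queries and capacity as queries per cluster are assumptions made for the example.

```python
from typing import Optional

ONE_MINUTE_S = 60        # slide: one-minute rule
LOOKBACK_S = 15 * 60     # slide: 15-minute window


def can_quiesce_cluster(running: int, minimum: int, secs_balancing: float,
                        secs_since_last_quiesce: Optional[float],
                        peak_load: int, per_cluster_capacity: int) -> bool:
    """Return True when one cluster can stop accepting new queries (toy model)."""
    if running <= minimum:
        return False  # never go below the configured minimum
    if secs_balancing < ONE_MINUTE_S:
        return False  # load balancing has not run for a full minute yet
    if secs_since_last_quiesce is not None and secs_since_last_quiesce < LOOKBACK_S:
        return False  # wait 15 minutes between successive quiesce decisions
    # Could one fewer cluster have carried the peak load of the last 15 minutes?
    return peak_load <= (running - 1) * per_cluster_capacity
```

Scaling down is deliberately slower than scaling up (15 minutes versus 30 seconds), which favours responsiveness over short-term cost savings.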
Warehouse Scaling - Best Practices
• Anticipated surges
  • Explicitly increase WH nodes (T-shirt size) when expecting more data
  • Explicitly increase MCWH minimum clusters when expecting more queries
  • Can do both at once with ALTER WAREHOUSE
  • Use cron or other scheduling/orchestration tool
• Unanticipated surges
  • Rely on MCWH maximum clusters for some extra headroom
• Maximize
  • Responsiveness for users
  • Throughput and value extracted from variable compute power
• Minimize
  • Cost and administrative overhead
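For anticipated surges, a scheduled job could emit a single ALTER WAREHOUSE statement that changes size and cluster counts together. The helper below is a sketch that only builds the statement; the warehouse name BI_WH and the chosen values are hypothetical, and the caller is assumed to keep the minimum at or below the maximum cluster count.

```python
from typing import Optional


def alter_warehouse_sql(name: str, size: Optional[str] = None,
                        min_clusters: Optional[int] = None,
                        max_clusters: Optional[int] = None) -> str:
    """Build one ALTER WAREHOUSE statement from the settings that were given."""
    settings = []
    if size is not None:
        settings.append(f"WAREHOUSE_SIZE = '{size}'")
    if min_clusters is not None:
        settings.append(f"MIN_CLUSTER_COUNT = {min_clusters}")
    if max_clusters is not None:
        settings.append(f"MAX_CLUSTER_COUNT = {max_clusters}")
    if not settings:
        raise ValueError("no settings to change")
    return f"ALTER WAREHOUSE {name} SET " + " ".join(settings) + ";"


# Before a known surge: bigger nodes and a higher cluster floor in one statement.
print(alter_warehouse_sql("BI_WH", size="LARGE", min_clusters=2))
```

A cron entry or an orchestration task can run the generated statement before the surge and a second one afterwards to scale back.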
Agile Data Lifecycle
Agile Data Lifecycle
• Separation of Workloads
  • Individual virtual warehouse for each dev/test/prod functional area
• CLONE for dev/test
  • Full logical copy of the data, but uses no extra storage
  • Test/dev operations against clone have no effect on original data
  • Security
    • RBAC limits dev/test access to clone and not production data
    • Secure Views permit role- or user-based obfuscation/masking/projection
  • Clone to TRANSIENT reduces storage usage by dev/test operations
    • TRANSIENT tables can have retention period set to 0 days if time-travel is not part of your app
• Business Impact: better quality code
  • Dev and test teams are working on data at scale, see true app performance
  • Full range of values means fewer surprises when app encounters live data
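The "clone to TRANSIENT" idea can be sketched as two statements per table: clone the production table as a transient table, then set its retention to zero days. The helper only builds SQL text, and the table names in the example are hypothetical.

```python
from typing import List


def transient_clone_sql(source: str, target: str) -> List[str]:
    """Build statements that clone a table as TRANSIENT with no Time Travel retention."""
    return [
        f"CREATE TRANSIENT TABLE {target} CLONE {source};",
        f"ALTER TABLE {target} SET DATA_RETENTION_TIME_IN_DAYS = 0;",
    ]


# Hypothetical names: a production table and its dev copy.
for stmt in transient_clone_sql("PROD.PUBLIC.ORDERS", "DEV.PUBLIC.ORDERS"):
    print(stmt)
```

With retention at zero days, micro-partitions rewritten by dev/test operations are not kept for Time Travel, which is where the storage saving comes from.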
Demo
Scenario 1
• Create development (DEV) and integration (INT) databases from production (PROD)
[Diagram: PROD, INT, and DEV databases, each with a PUBLIC schema holding Table A and Table B; two CLONE arrows]
Scenario 2: new development
• Create two new tables, C and D, in the development (DEV) database
[Diagram: PROD and INT each hold Table A and Table B in PUBLIC; DEV holds Table A, Table B, Table C, and Table D]
Scenario 2: new development
• Mini-release: promote table C for integration testing
[Diagram: PROD and INT hold Table A and Table B; DEV holds Table A, Table B, Table C, and Table D; an additional Table C is labelled CREATE TABLE LIKE]
Scenario 2: new development
• Deploy to production: promote table C to PROD database
[Diagram: PROD and INT hold Table A and Table B; DEV holds Table A, Table B, Table C, and Table D; two additional Table C boxes, labelled CREATE TABLE LIKE]
Scenario 2: new development
• Refresh dev: get latest PROD data into DEV and INT
[Diagram: PROD, INT, DEV, and DEV2 databases with Table A, Table B, Table C, and Table D labels; two CLONE arrows]
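The demo steps can be written down as the SQL each one might issue. This is a sketch, not the statements used in the demo: the mapping of steps to statements is an assumption, and the refresh step in particular is a guess at what the final diagram shows (INT re-cloned from PROD, plus a fresh DEV2 clone alongside the existing DEV).

```python
from typing import Dict, List


def demo_statements(table: str = "C") -> Dict[str, List[str]]:
    """Map each demo step to the SQL it might run (illustrative only)."""
    return {
        "scenario 1: create DEV and INT": [
            "CREATE DATABASE DEV CLONE PROD;",
            "CREATE DATABASE INT CLONE PROD;",
        ],
        "mini-release: promote to INT": [
            f"CREATE TABLE INT.PUBLIC.{table} LIKE DEV.PUBLIC.{table};",
        ],
        "deploy: promote to PROD": [
            f"CREATE TABLE PROD.PUBLIC.{table} LIKE INT.PUBLIC.{table};",
        ],
        # Assumption: refresh by re-cloning INT and adding a new DEV2 clone.
        "refresh": [
            "CREATE OR REPLACE DATABASE INT CLONE PROD;",
            "CREATE DATABASE DEV2 CLONE PROD;",
        ],
    }


for step, stmts in demo_statements().items():
    print(step)
    for stmt in stmts:
        print("  " + stmt)
```

CREATE TABLE ... LIKE copies the table definition without its rows, which suits promotion of a new structure, whereas CLONE carries the data as well.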
Agile Data Lifecycle
• CLONE for Data Scientists
  • Quick and safe sandbox for discovery and testing
  • Combine with own virtual warehouse for complete isolation
  • Business Impact: better data science
    • More fine-grained data over longer time intervals
    • Deeper insights, better forecasting, more monetizable results
• CLONE for Compliance
  • Monthly, quarterly, annual clones: financial reporting, auditing requirements
  • Business Impact: simpler compliance
    • Your "backups" are live and immediately available
Time Travel
Agile Data Analytics
• CLONE operates on metadata repo
  • Table micro-partitions are tracked by pointers in metadata repo
  • Cloning copies pointers only, not the micro-partitions
• Time Travel leverages metadata pointers
  • Pointer has millisecond-granularity timestamp
  • Snowflake knows which micro-partitions are active in your table at any moment
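A toy model can make the pointer idea concrete. The class below is purely illustrative and says nothing about how Snowflake stores metadata: a table is modelled as a time-ordered list of snapshots, each a timestamp in milliseconds plus the set of micro-partition ids active from that moment. The class and method names are hypothetical.

```python
class ToyTable:
    """Toy metadata model: snapshots of active micro-partition ids over time."""

    def __init__(self, history=None):
        # Each entry is (timestamp_ms, frozenset of partition ids), oldest first.
        self.history = list(history or [])

    def commit(self, ts_ms, partition_ids):
        """Record which partitions are active from ts_ms onward."""
        self.history.append((ts_ms, frozenset(partition_ids)))

    def active_at(self, ts_ms):
        """Time travel: partitions active at ts_ms (empty before the first commit)."""
        current = frozenset()
        for snap_ts, parts in self.history:
            if snap_ts <= ts_ms:
                current = parts
        return current

    def clone(self):
        """Copy the pointers only; the partitions themselves are shared."""
        return ToyTable(self.history)


prod = ToyTable()
prod.commit(1000, {"p1", "p2"})
prod.commit(2000, {"p2", "p3"})   # an update replaced p1 with p3
dev = prod.clone()
dev.commit(3000, {"p4"})          # changes in the clone do not touch prod
```

In this model a clone costs one list copy regardless of table size, and a time-travel read is a lookup in the snapshot history.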
Agile Data Analytics
• Data Retention for Time Travel
  • Default is 24 hours, maximum 90 days
  • Configurable per-table by table owner
  • Uses more storage because keeps the micro-partitions around longer
• Simple SQL syntax
  • SELECT cols… FROM t1 AT(TIMESTAMP => timestamp);
  • CREATE obj2 CLONE obj1 BEFORE(STATEMENT => query-id);
Agile Data Analytics
• SELECT count(*) FROM lineitem AT(TIMESTAMP => '2020-01-01 12:00:00'::timestamp);
• SELECT (SELECT count(*) FROM lineitem AT(OFFSET => -60*2)) before_etl, (SELECT count(*) FROM lineitem) after_etl;
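The two queries above can be parameterised. The sketch below builds the same SQL text from a table name and either a timestamp or a look-back in minutes; the function names are hypothetical, and nothing is executed against Snowflake.

```python
from datetime import datetime


def count_at_sql(table: str, ts: datetime) -> str:
    """Row count as of an absolute point in time."""
    stamp = ts.strftime("%Y-%m-%d %H:%M:%S")
    return f"SELECT count(*) FROM {table} AT(TIMESTAMP => '{stamp}'::timestamp);"


def count_before_after_sql(table: str, minutes_ago: int) -> str:
    """Row counts from minutes_ago in the past and from now, side by side."""
    if minutes_ago <= 0:
        raise ValueError("minutes_ago must be positive")
    offset_s = -60 * minutes_ago  # OFFSET is expressed in seconds relative to now
    return (
        f"SELECT (SELECT count(*) FROM {table} AT(OFFSET => {offset_s})) before_etl, "
        f"(SELECT count(*) FROM {table}) after_etl;"
    )


print(count_at_sql("lineitem", datetime(2020, 1, 1, 12, 0, 0)))
print(count_before_after_sql("lineitem", 2))
```

The look-back must fall inside the table's retention period, otherwise the historical data is no longer available.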
Demo
Agile Data Analytics
• Business Impact of Time Travel
  • UNDROP: table, schema, database
  • Un-TRUNCATE table (CLONE table, then swap names)
  • Recover from ETL/ELT update (CLONE database, then swap names)
  • Temporal queries
    • For example, what was inventory on a given date?
  • Type 2 slowly changing dimensions
    • up to 90-day running window
    • Fast prototype for longer-window Type 2 analytics
  • Test predictive models against historical data
  • Don't need to make and store daily backups
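The "clone, then swap names" recovery can be sketched as a pair of statements. The helpers below only build SQL text; the `_restored` suffix and the function names are hypothetical, and the query id of the offending statement has to come from the user's own query history.

```python
from typing import List


def undrop_sql(kind: str, name: str) -> str:
    """Build an UNDROP statement for a table, schema, or database."""
    kind = kind.upper()
    if kind not in ("TABLE", "SCHEMA", "DATABASE"):
        raise ValueError("kind must be TABLE, SCHEMA, or DATABASE")
    return f"UNDROP {kind} {name};"


def untruncate_sql(table: str, query_id: str) -> List[str]:
    """Clone the table as it was before a statement ran, then swap names."""
    restored = f"{table}_restored"  # hypothetical naming convention
    return [
        f"CREATE TABLE {restored} CLONE {table} BEFORE(STATEMENT => '{query_id}');",
        f"ALTER TABLE {restored} SWAP WITH {table};",
    ]
```

After the swap, the original name points at the recovered data and the `_restored` name holds the truncated version, which can be inspected or dropped.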
Agile Data Warehouse - Summary
• Separation of Workloads
  • Individual virtual warehouse for each dev/test/prod functional area
• Virtual Warehouse scaling
  • T-shirt sizes, number of WHs, and MCWH
• CLONE for dev/test and other use cases
  • Full logical copy of the data, but uses no extra storage
• Time Travel and CDP
  • SELECT "as of" for testing of predictive models, type 2 changing dimensions
  • Easy "undo" of updates: UNDROP, un-TRUNCATE, CLONE "as of"
Customer Story – GTA
Data Warehousing projects…
• How Snowflake helped GTA be more Agile.
1. Focus on Value: Choosing the right tools
2. Our plumbing: Pipeline transparency
3. Agile Development: Iteration, prototyping & testing
4. Scaling: start small, remain flexible
Complex, Long, Change = Risky
1. Choosing your tools
• What would let us focus on adding value quickest?
• What is going to give highest productivity?
• What is lowest risk, but still future proof?
1. Our new stack…
- Don’t re-invent the wheel
- Standard, proven, & skills availability
- Lower dependency on IT
- Python, open source
- SQL
- On-demand: get going quickly
- Zero admin: we lost our DBA
- Managed platform
[Diagram labels: Airflow, Snowflake, EC2 + S3, GoldenGate]
Extract, Load, then Transform
Captain Obvious is here to say something obvious…
We do everything in Snowflake
- Replicate source
- Transparency!
- No ETL tool
2. Plumbing
2. We ingest every 5 mins…
Data age: 1.5 days -> 1.5 hrs
[Diagram labels: Airflow, Bookings, Inventory, Finance, Salesforce, On-premise, AWS S3, BI/Viz, Python Jupyter notebooks; data age axis: 5 min, 60 min, 90 min]
3. Agile Development - Cloning!
[Diagram labels: S3 Pre-prod data feed, S3 Prod data feed, Pre-Production, Production, Dev(n)]
3. Using Cloning - next
[Diagram labels: S3 Prod data feed, Deltas, Production]
• Maintain 1 feed
• (n) Dev & Test pairs, on-demand (cost)
• Live Deltas feed from point of cloning
4. Scaling - flexibility!
Before you start your project, how confident are you on:
– Load & concurrency? Storage? Test & Dev?
– Ad-hoc Analytics / Data Science pulls?
Started small, and scale up or out as needed:
– Direct BI querying on large datasets
– Project workloads
– Re-processing
4. Snowflake reduces opportunity cost to try things
My new pricing analysis code takes 3m to run…
Can I run it 7000 times?
Snowflake scaling = Reducing opportunity costs for experimentation
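The arithmetic behind the question is worth spelling out. The sketch assumes "3m" means 3 minutes per run and that runs are independent and parallelise perfectly; the parallelism of 100 is a hypothetical figure, not one from the talk.

```python
RUNS = 7000
MINUTES_PER_RUN = 3  # assumption: "3m" on the slide means 3 minutes

total_minutes = RUNS * MINUTES_PER_RUN  # total compute time across all runs


def wall_clock_hours(parallel_runs: int) -> float:
    """Elapsed hours if parallel_runs executions proceed at once (ideal case)."""
    return total_minutes / parallel_runs / 60


print(wall_clock_hours(1))    # one run at a time
print(wall_clock_hours(100))  # hypothetical: 100 runs at once
```

Run serially, the experiment needs 350 hours, roughly two weeks; at 100 concurrent runs the same work fits in 3.5 hours, which is the kind of gap the "opportunity cost" point refers to.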
Agile tips
– Agile modelling on whiteboards with the business
– Prototyping: share early in Excel & BI tool
– Iterate: 1st version in use early
– Milestones: no big bang
– Verification: plan time for this and tackle early
Takeaways on Snowflake + agile
– Choose tools that fit your team’s skillset
– Choose tools that move you quickly to delivering business value
– Transform in your target environment
– Create an agile development environment
– Choose tools that are flexible and on-demand to start small
Thank You to Our Partners