@SnowflakeDB #CloudAnalytics17

LONDON

Enabling the Agile Data Warehouse
Steve Herskovitz, VP Sales Engineering, Snowflake Computing

• Agile Warehouse Scaling
  • Separation of Workloads
  • Virtual WH Scaling Techniques
• Agile Data Lifecycle
  • Cloning
• Agile Data Analytics
  • Time Travel
• Real Customer Story
  • GTA – Gulliver's Travel Associates

Enabling the Agile Data Warehouse - Agenda

Agile Warehouse Scaling: Separation of Workloads

• Struggle – multiple workloads sharing a fixed resource
  • Overnight batch ETL
    • ETL must complete before business workloads start
    • Planned or unexpected data surges can cause ETL to run late
    • Worse yet, overnight in US is exactly the UK business day
    • ETL and business workloads impact each other's SLAs
  • Competing business workloads
    • Sales, Marketing, Finance, Data Science
• Conventional solutions
  • Divide fixed resource into time slots for each workload
  • Complex workload prioritization schemes
  • Periodic or seasonal surges handled by locking out some users

Separation of Workloads

• Struggle – multiple workloads sharing a fixed resource
• Snowflake solution
  • Assign each workload its own Virtual WH
  • At a minimum, two WH: ETL and Business
    • ETL can run continuously if it makes sense for the business
    • ETL/Business workload contention is eliminated
  • Further subdivide business workloads into own clusters as needed
    • Eliminate contention between Sales, Marketing, Finance, Data Science workloads
    • International groups can operate clusters on their own local time schedules
    • Permits departmental chargebacks

Separation of Workloads
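As a rough sketch of the one-warehouse-per-workload idea, the Python helper below generates one `CREATE WAREHOUSE` statement per workload. The warehouse names, sizes, and the auto-suspend value are illustrative assumptions and do not come from the talk.

```python
# Sketch: one virtual warehouse per workload.
# Names, sizes, and the AUTO_SUSPEND value are hypothetical examples.
WORKLOADS = {
    "ETL_WH": "LARGE",
    "SALES_WH": "SMALL",
    "MARKETING_WH": "SMALL",
    "FINANCE_WH": "MEDIUM",
    "DATA_SCIENCE_WH": "XLARGE",
}


def create_warehouse_sql(name: str, size: str, auto_suspend_s: int = 300) -> str:
    """Build a CREATE WAREHOUSE statement for a single workload."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_s} "
        f"AUTO_RESUME = TRUE;"
    )


# One statement per workload, in the order the workloads are listed.
statements = [create_warehouse_sql(n, s) for n, s in WORKLOADS.items()]
for stmt in statements:
    print(stmt)
```

Because each workload gets its own warehouse name, usage can later be attributed per department, which is what makes the chargeback point above practical.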

[Diagram: a shared Databases layer served by separate Virtual Warehouses for ETL & Data Loading, Test/Dev, and the business workloads Finance, Marketing, Sales, and Research.]

Agile Warehouse Scaling: Techniques

• Increase T-shirt size
  • More data being analyzed
  • More complex queries
  • Get some concurrency boost
• Multi-cluster WH is best for concurrency
  • Workload queries have usual weight
  • But more of them, e.g. 20 dashboard users rather than the usual 5
• Combine these two techniques for best effect

Warehouse Scaling Techniques

• Scheduler checks if it should spin up another cluster
  • Queries must queue for >30 seconds
  • Spinning up cluster is often immediate (for XXL or smaller)
  • Queries begin to get load-balanced across new clusters
  • One-minute rule: 60 seconds of load balancing before next queuing interval
• Repeat up to maximum clusters configured for warehouse
• Designed to
  • Balance responsiveness against cost
  • Ensure SLAs

MCWH Scaling Algorithms – Scaling Up
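The scale-up rules above can be read as a small decision function. The following Python sketch is a toy model of that reading, not Snowflake's scheduler: the `WarehouseState` fields and the way the one-minute rule is applied are assumptions made for illustration. Only the 30-second and 60-second figures come from the slide.

```python
from dataclasses import dataclass

QUEUE_THRESHOLD_S = 30   # queries must queue longer than this (from the slide)
BALANCE_INTERVAL_S = 60  # the "one-minute rule" (from the slide)


@dataclass
class WarehouseState:
    """Hypothetical snapshot of a multi-cluster warehouse."""
    running_clusters: int
    max_clusters: int
    longest_queue_wait_s: float   # how long the oldest queued query has waited
    since_last_scale_up_s: float  # seconds since the last cluster was added


def should_scale_up(state: WarehouseState) -> bool:
    """Return True if this toy scheduler would start one more cluster."""
    if state.running_clusters >= state.max_clusters:
        return False  # already at the configured maximum
    if state.since_last_scale_up_s < BALANCE_INTERVAL_S:
        return False  # still letting load balancing settle after the last scale-up
    return state.longest_queue_wait_s > QUEUE_THRESHOLD_S


# Queries have waited 45 s, last scale-up was 2 minutes ago, room for more clusters.
print(should_scale_up(WarehouseState(1, 3, 45, 120)))
```

The ordering of the checks reflects the stated design goal: the cap and the settling interval bound cost, while the queue threshold protects responsiveness.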

• Scheduler checks for clusters to distribute queries
  • Cluster is active (not quiesced)
  • Cluster has latest version of software (in case system update in progress)
  • Cluster has head-room for more queries
  • Cluster is the least busy
  • Session affinity breaks any ties (for cache)
• Designed to maximize
  • Individual query performance
  • Overall throughput
  • Overall concurrency

MCWH Scaling Algorithms – Load Balancing
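One way to picture the load-balancing checks is as a filter followed by a tie-break. The sketch below is a simplified illustration under assumed data structures: the `Cluster` fields, the per-cluster `max_queries` limit, and the `session_cluster` argument are all hypothetical, and the real scheduler is not described in this much detail in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    """Hypothetical view of one cluster inside a multi-cluster warehouse."""
    name: str
    quiesced: bool
    on_latest_version: bool
    running_queries: int
    max_queries: int  # assumed per-cluster concurrency limit


def pick_cluster(clusters: List[Cluster],
                 session_cluster: Optional[str] = None) -> Optional[Cluster]:
    """Pick a cluster for the next query, or None if the query must queue."""
    # Keep only active, up-to-date clusters that still have head-room.
    eligible = [
        c for c in clusters
        if not c.quiesced and c.on_latest_version
        and c.running_queries < c.max_queries
    ]
    if not eligible:
        return None
    # Least busy wins; session affinity only breaks ties among the least busy.
    least = min(c.running_queries for c in eligible)
    tied = [c for c in eligible if c.running_queries == least]
    for c in tied:
        if c.name == session_cluster:
            return c
    return tied[0]


clusters = [
    Cluster("c1", False, True, 4, 8),
    Cluster("c2", False, True, 2, 8),
    Cluster("c3", False, True, 2, 8),
    Cluster("c4", True, True, 0, 8),  # quiesced, so never chosen
]
print(pick_cluster(clusters).name)
print(pick_cluster(clusters, session_cluster="c3").name)
```

In this model a session that last ran on `c3` goes back to `c3` only when `c3` is among the least busy clusters, which matches the "breaks any ties" wording.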

• Scheduler checks if it can spin down a cluster
  • One-minute rule: 60 seconds of load balancing before check if WH underloaded
  • Check if WH with one less cluster could have handled the load over 15 minutes
  • Quiesce cluster: finish current queries but accept no new queries
  • Wait another 15 minutes before checking if can quiesce another cluster
• Repeat down to minimum clusters configured for warehouse
• Designed to
  • Maximize the value of the running clusters
  • Minimize the cost of the MCWH

MCWH Scaling Algorithms – Scaling Down
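The scale-down check can be sketched the same way. This is a toy model under stated assumptions: load is represented as sampled concurrent-query counts over the last 15 minutes, and `per_cluster_capacity` is a hypothetical number of queries one cluster can run at once. The 60-second load-balancing rule is left out to keep the example short.

```python
from typing import Sequence

QUIESCE_WINDOW_MIN = 15  # observation and wait window (from the slide)


def fits_on_one_less(load_samples: Sequence[int],
                     running_clusters: int,
                     per_cluster_capacity: int) -> bool:
    """True if every load sample would have fit on one fewer cluster."""
    reduced_capacity = (running_clusters - 1) * per_cluster_capacity
    return all(load <= reduced_capacity for load in load_samples)


def should_quiesce(load_samples: Sequence[int],
                   running_clusters: int,
                   min_clusters: int,
                   per_cluster_capacity: int,
                   since_last_quiesce_min: float) -> bool:
    """Return True if this toy scheduler would quiesce one cluster now."""
    if running_clusters <= min_clusters:
        return False  # never go below the configured minimum
    if since_last_quiesce_min < QUIESCE_WINDOW_MIN:
        return False  # wait a full window between quiesce decisions
    return fits_on_one_less(load_samples, running_clusters, per_cluster_capacity)


# Three clusters of capacity 8; the last 15 minutes never exceeded 5 queries.
print(should_quiesce([3, 5, 4], 3, 1, 8, since_last_quiesce_min=20))
```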

• Anticipated surges
  • Explicitly increase WH nodes (T-shirt size) when expecting more data
  • Explicitly increase MCWH minimum clusters when expecting more queries
  • Can do both at once with ALTER WAREHOUSE
  • Use cron or other scheduling/orchestration tool
• Unanticipated surges
  • Rely on MCWH maximum clusters for some extra headroom
• Maximize
  • Responsiveness for users
  • Throughput and value extracted from variable compute power
• Minimize
  • Cost and administrative overhead

Warehouse Scaling – Best Practices
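For anticipated surges, a scheduled job could issue a single `ALTER WAREHOUSE` that changes size and minimum cluster count together. The helper below is a minimal sketch that only builds the statement text; the warehouse name `BI_WH` and the chosen values are made up for the example, and how the statement is submitted (cron, Airflow, or another tool) is left open.

```python
from typing import Optional


def surge_sql(warehouse: str,
              size: Optional[str] = None,
              min_clusters: Optional[int] = None,
              max_clusters: Optional[int] = None) -> str:
    """Build one ALTER WAREHOUSE statement covering all requested changes."""
    if (min_clusters is not None and max_clusters is not None
            and min_clusters > max_clusters):
        raise ValueError("min_clusters cannot exceed max_clusters")
    settings = []
    if size is not None:
        settings.append(f"WAREHOUSE_SIZE = '{size}'")       # more data per query
    if min_clusters is not None:
        settings.append(f"MIN_CLUSTER_COUNT = {min_clusters}")  # more queries
    if max_clusters is not None:
        settings.append(f"MAX_CLUSTER_COUNT = {max_clusters}")  # extra headroom
    if not settings:
        raise ValueError("nothing to change")
    return f"ALTER WAREHOUSE {warehouse} SET " + " ".join(settings) + ";"


# Before a known month-end surge: bigger nodes and more always-on clusters.
before_surge = surge_sql("BI_WH", size="XLARGE", min_clusters=3)
# Afterwards: scale back to the (hypothetical) normal configuration.
after_surge = surge_sql("BI_WH", size="MEDIUM", min_clusters=1)
print(before_surge)
print(after_surge)
```

Pairing each scale-up statement with a scheduled scale-back keeps the cost side of the trade-off under control.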

Agile Data Lifecycle

• Separation of Workloads
  • Individual virtual warehouse for each dev/test/prod functional area
• CLONE for dev/test
  • Full logical copy of the data, but uses no extra storage
  • Test/dev operations against clone have no effect on original data
  • Security
    • RBAC limits dev/test access to clone and not production data
    • Secure Views permit role- or user-based obfuscation/masking/projection
  • Clone to TRANSIENT reduces storage usage by dev/test operations
    • TRANSIENT tables can have retention period set to 0 days if time-travel is not part of your app
• Business Impact – better quality code
  • Dev and test teams are working on data at scale, see true app performance
  • Full range of values means fewer surprises when app encounters live data

Agile Data Lifecycle
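A minimal sketch of the clone-to-TRANSIENT idea, assuming a table-level clone: the helper returns the two statements that create a transient clone and then set its retention period to 0 days. The database, schema, and table names are placeholders chosen for the example.

```python
from typing import List


def transient_clone_sql(source_table: str, clone_table: str) -> List[str]:
    """Statements for a transient dev/test clone without Time Travel retention."""
    return [
        # Zero-copy clone of the source into a TRANSIENT table.
        f"CREATE TRANSIENT TABLE {clone_table} CLONE {source_table};",
        # Only appropriate when the dev/test work does not rely on Time Travel.
        f"ALTER TABLE {clone_table} SET DATA_RETENTION_TIME_IN_DAYS = 0;",
    ]


# Hypothetical names: a production table cloned into a dev database.
dev_clone = transient_clone_sql("PROD.PUBLIC.ORDERS", "DEV.PUBLIC.ORDERS")
for stmt in dev_clone:
    print(stmt)
```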

Demo

• Create development (DEV) and integration (INT) databases from production (PROD)

Scenario 1

[Diagram: PROD, INT, and DEV databases, each with a PUBLIC schema holding Table A and Table B; INT and DEV are created from PROD with CLONE.]

• Create two new tables, C and D, in the development (DEV) database

Scenario 2: new development

[Diagram: PROD and INT hold Table A and Table B in their PUBLIC schemas; DEV holds Table A, Table B, Table C, and Table D.]

• Mini-release: promote table C for integration testing

Scenario 2: new development

[Diagram: Table C is added to INT with CREATE TABLE LIKE; PROD still holds Table A and Table B; DEV holds Tables A, B, C, and D.]

• Deploy to production: promote table C to PROD database

Scenario 2: new development

[Diagram: Table C is added to PROD with CREATE TABLE LIKE; INT already holds Table C; DEV holds Tables A, B, C, and D.]

• Refresh dev: get latest PROD data into DEV and INT

Scenario 2: new development

[Diagram: PROD, INT, and DEV databases with PUBLIC schemas, with Table C now present in all three. CLONE arrows are shown, including one between DEV (Tables A, B, C, D) and a DEV2 database.]
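The demo steps above map onto a handful of statements. The sketch below lists one plausible statement per step; it is a reconstruction, not the presenter's script. Identifiers are double-quoted so the slide's database names can be used verbatim. Promoting from INT to PROD, and keeping in-progress work by cloning DEV to DEV2 before the refresh, are assumed readings of the diagrams.

```python
def q(*parts: str) -> str:
    """Double-quote each identifier part and join with dots."""
    return ".".join(f'"{p}"' for p in parts)


DEMO_STEPS = [
    ("Scenario 1: create DEV and INT from PROD", [
        f"CREATE DATABASE {q('DEV')} CLONE {q('PROD')};",
        f"CREATE DATABASE {q('INT')} CLONE {q('PROD')};",
    ]),
    ("Mini-release: promote table C for integration testing", [
        # CREATE TABLE ... LIKE copies the table definition, not the rows.
        f"CREATE TABLE {q('INT', 'PUBLIC', 'C')} LIKE {q('DEV', 'PUBLIC', 'C')};",
    ]),
    ("Deploy to production (assumed to promote from INT)", [
        f"CREATE TABLE {q('PROD', 'PUBLIC', 'C')} LIKE {q('INT', 'PUBLIC', 'C')};",
    ]),
    ("Refresh: keep current DEV work as DEV2 (assumption), then re-clone", [
        f"CREATE DATABASE {q('DEV2')} CLONE {q('DEV')};",
        f"CREATE OR REPLACE DATABASE {q('DEV')} CLONE {q('PROD')};",
        f"CREATE OR REPLACE DATABASE {q('INT')} CLONE {q('PROD')};",
    ]),
]

for title, stmts in DEMO_STEPS:
    print(f"-- {title}")
    for stmt in stmts:
        print(stmt)
```

Note the split in mechanisms: environments are refreshed with `CLONE` (structure plus data), while promotion uses `CREATE TABLE ... LIKE` so that only the new table's structure moves forward.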

• CLONE for Data Scientists
  • Quick and safe sandbox for discovery and testing
  • Combine with own virtual warehouse for complete isolation
  • Business Impact – better data science
    • More fine-grained data over longer time intervals
    • Deeper insights, better forecasting, more monetizable results
• CLONE for Compliance
  • Monthly, quarterly, annual clones – financial reporting, auditing requirements
  • Business Impact – simpler compliance
    • Your "backups" are live and immediately available

Agile Data Lifecycle

Time Travel

• CLONE operates on metadata repo
  • Table micro-partitions are tracked by pointers in metadata repo
  • Cloning copies pointers only, not the micro-partitions
• Time Travel leverages metadata pointers
  • Pointer has millisecond-granularity timestamp
  • Snowflake knows which micro-partitions are active in your table at any moment

Agile Data Analytics
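To make the pointer idea concrete, here is a toy in-memory model. It is purely illustrative and says nothing about how Snowflake's metadata store is actually built: the `Pointer` and `Table` classes, their fields, and the partition ids are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class Pointer:
    """Metadata entry referencing one immutable micro-partition."""
    partition_id: str
    added_ms: int                     # when the partition became part of the table
    removed_ms: Optional[int] = None  # None while the partition is still active

    def active_at(self, ts_ms: int) -> bool:
        return self.added_ms <= ts_ms and (
            self.removed_ms is None or ts_ms < self.removed_ms
        )


@dataclass
class Table:
    pointers: List[Pointer] = field(default_factory=list)

    def clone(self) -> "Table":
        # Cloning copies the pointer list only; no partition data is duplicated.
        return Table(list(self.pointers))

    def partitions_at(self, ts_ms: int) -> Set[str]:
        # "Time travel": which partitions made up the table at ts_ms?
        return {p.partition_id for p in self.pointers if p.active_at(ts_ms)}


# mp2 was replaced by mp3 at t=5000 (for example, by an update).
orders = Table([Pointer("mp1", 1000), Pointer("mp2", 1000, 5000), Pointer("mp3", 5000)])
sandbox = orders.clone()
sandbox.pointers.append(Pointer("mp4", 7000))  # a write to the clone only

print(sorted(orders.partitions_at(4000)))
print(sorted(orders.partitions_at(8000)))
print(sorted(sandbox.partitions_at(8000)))
```

In this model the clone and the original share `mp1` and `mp3`, and only the clone's new write adds storage, which is the sense in which a clone "uses no extra storage" until it diverges.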

• Data Retention for Time Travel
  • Default is 24 hours, maximum 90 days
  • Configurable per-table by table owner
  • Uses more storage because keeps the micro-partitions around longer
• Simple SQL syntax
  • SELECT cols … FROM t1 AT(TIMESTAMP => timestamp);
  • CREATE obj2 CLONE obj1 BEFORE(STATEMENT => query-id);

Agile Data Analytics
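Per-table retention could be set with a statement like the one this small helper builds. It is a sketch that validates against the 0 to 90 day range quoted on the slide; the table name is a placeholder.

```python
MAX_RETENTION_DAYS = 90  # maximum quoted on the slide


def set_retention_sql(table: str, days: int) -> str:
    """Build an ALTER TABLE statement that sets the Time Travel retention period."""
    if not 0 <= days <= MAX_RETENTION_DAYS:
        raise ValueError(f"retention must be between 0 and {MAX_RETENTION_DAYS} days")
    return f"ALTER TABLE {table} SET DATA_RETENTION_TIME_IN_DAYS = {days};"


# Hypothetical table that needs a month of history for temporal queries.
retention_stmt = set_retention_sql("PROD.PUBLIC.INVENTORY", 30)
print(retention_stmt)
```

Longer retention trades storage for a longer Time Travel window, so it makes sense to raise it only on tables where history is actually queried.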

• SELECT count(*) FROM lineitem AT(TIMESTAMP => '2020-01-01 12:00:00'::timestamp);

• SELECT (SELECT count(*) FROM lineitem AT(OFFSET => -60*2)) before_etl, (SELECT count(*) FROM lineitem) after_etl;

Agile Data Analytics

Demo

• Business Impact of Time Travel
  • UNDROP: table, schema, database
  • Un-TRUNCATE table (CLONE table, then swap names)
  • Recover from ETL/ELT update (CLONE database, then swap names)
  • Temporal queries
    • For example, what was inventory on a given date?
    • Type 2 slowly changing dimensions
      • Up to 90-day running window
      • Fast prototype for longer-window Type 2 analytics
    • Test predictive models against historical data
  • Don't need to make and store daily backups

Agile Data Analytics
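The "clone, then swap names" recovery pattern can be sketched as two statements, with `UNDROP` as a one-liner. The helpers below only build statement text; the `_RESTORED` suffix, the table name, and the query id placeholder are assumptions made for the example.

```python
from typing import List

UNDROP_KINDS = {"TABLE", "SCHEMA", "DATABASE"}  # object types listed on the slide


def untruncate_sql(table: str, truncate_query_id: str) -> List[str]:
    """Clone the table as it was before the bad statement, then swap names."""
    restored = f"{table}_RESTORED"  # hypothetical name for the recovered copy
    return [
        f"CREATE TABLE {restored} CLONE {table} "
        f"BEFORE(STATEMENT => '{truncate_query_id}');",
        f"ALTER TABLE {table} SWAP WITH {restored};",
    ]


def undrop_sql(kind: str, name: str) -> str:
    """Build an UNDROP statement for a table, schema, or database."""
    kind = kind.upper()
    if kind not in UNDROP_KINDS:
        raise ValueError(f"unsupported object type: {kind}")
    return f"UNDROP {kind} {name};"


recovery = untruncate_sql("LINEITEM", "<query-id-of-truncate>")
for stmt in recovery:
    print(stmt)
print(undrop_sql("table", "LINEITEM"))
```

After the swap, the damaged version remains available under the `_RESTORED` name, so it can be inspected before being dropped.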

• Separation of Workloads
  • Individual virtual warehouse for each dev/test/prod functional area
• Virtual Warehouse scaling
  • T-shirt sizes, number of WHs, and MCWH
• CLONE for dev/test and other use cases
  • Full logical copy of the data, but uses no extra storage
• Time Travel and CDP
  • SELECT "as of" for testing of predictive models, type 2 changing dimensions
  • Easy "undo" of updates – UNDROP, un-TRUNCATE, CLONE "as of"

Agile Data Warehouse – Summary

Customer Story – GTA

Case Study: Enabling the Agile Data Warehouse with Snowflake

Adam Slader
adam.slader@gta-travel.com

Data Warehousing projects…

• How Snowflake helped GTA be more Agile.

1. Focus on Value: Choosing the right tools
2. Our plumbing: Pipeline transparency
3. Agile Development: Iteration, prototyping & testing
4. Scaling: start small, remain flexible

Complex, Long, Change
= Risky

1. Choosing your tools

• What would let us focus on adding value quickest?
• What is going to give highest productivity?
• What is lowest risk – but still future-proof?

1. Our new stack…

- Don’t re-invent the wheel
- Standard, proven, & skills availability
- Lower dependency on IT
- Python, open source
- SQL
- On-demand – get going quickly
- Zero admin – we lost our DBA
- Managed platform

Airflow
Snowflake
EC2 + S3
GoldenGate

Extract, Load, then Transform

Captain Obvious is here to say something obvious…

We do everything in Snowflake
- Replicate source
- Transparency!
- No ETL tool

2. Plumbing

2. We ingest every 5 mins… Data age: 1.5 days -> 1.5 hrs

[Diagram: pipeline orchestrated by Airflow. Labeled elements: Bookings, Inventory, Finance, Salesforce, On-premise, AWS S3, BI/Viz, Python Jupyter notebooks. Data age markers: 5 min, 60 min, 90 min.]

3. Agile Development - Cloning!

[Diagram: labeled elements are S3 Pre-prod data feed, S3 Prod data feed, Pre-Production, Production, and Dev (n).]

3. Using Cloning – next

[Diagram: labeled elements are S3 Prod data feed, Deltas, and Production.]

• Maintain 1 feed
• (n) Dev & Test pairs on-demand (cost)
• Live Deltas feed from point of cloning

4. Scaling – flexibility!

Before you start your project, how confident are you on:
– Load & concurrency? Storage? Test & Dev?
– Ad-hoc Analytics / Data Science pulls?

Started small, and scale up or out as needed:
– Direct BI querying on large data sets
– Project workloads
– Re-processing

4. Snowflake reduces opportunity cost to try things

My new pricing analysis code takes 3m to run…

Can I run it 7000 times?

Snowflake scaling = Reducing opportunity costs for experimentation
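The arithmetic behind the "7000 times" question is worth spelling out. The sketch below assumes "3m" means 3 minutes, that the runs are independent, and that enough compute can be provisioned to run a chosen number of them concurrently. The parallelism levels are illustrative and do not come from the talk.

```python
import math

RUN_MINUTES = 3  # one run of the pricing analysis (read from "3m" on the slide)
RUNS = 7000


def wall_clock_hours(parallel_runs: int) -> float:
    """Elapsed hours if `parallel_runs` executions can proceed at once."""
    batches = math.ceil(RUNS / parallel_runs)
    return batches * RUN_MINUTES / 60


# Total compute time is the same however the runs are spread out.
total_compute_hours = RUNS * RUN_MINUTES / 60

print(total_compute_hours)    # 350.0 hours of compute in total
print(wall_clock_hours(1))    # 350.0 hours elapsed when run one after another
print(wall_clock_hours(100))  # 3.5 hours elapsed with 100 runs at a time
```

Run serially, the experiment takes roughly two weeks, which in practice means nobody attempts it. With elastic compute the same work fits in an afternoon, which is the reduced opportunity cost the slide refers to.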

Agile tips

– Agile modelling on whiteboards with the business
– Prototyping – share early in Excel & BI tool
– Iterate – 1st version in use early
– Milestones – no big bang
– Verification – plan time for this and tackle early

Takeaways on Snowflake + agile

– Choose tools that fit your team’s skill set
– Choose tools that move you quickly to delivering business value
– Transform in your target environment
– Create an agile development environment
– Choose tools that are flexible and on-demand to start small

Q&A

Adam Slader
adam.slader@gta-travel.com
