lars-albertsson
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Technology (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Why functional?
Verbs
... has made ... expanding ...
... flourishes ... merged ... has been unable to escape lingering .. built ...
... are ... placed ... say ... are ... to explode ...
.. are considering ... to reopen … to recall ...
Or object-oriented?
Nouns, pronouns
... bankruptcy ... government bailout ... automaker Chrysler ... comeback ... sales ... Jeep sport utility vehicles.
... Chrysler ... part ... Fiat Chrysler Automobiles, it ... concerns ... the safety ... Jeeps ...
... Jeeps ... gas tanks ... regulators ... safety advocates ... rear-end crash.
... regulators ... an investigation ... those Jeeps ... Fiat Chrysler’s agreement ... models.
Functional benefits? My version.
Matches a few problems:
Data processing
Matches a few computer properties:
Consistency through immutability
Deterministic - replay for resilience
Local vs distributed properties
Local:
Hardware provides strong consistency
Faults -> death
Distributed:
Eventual consistency
Faults must be survived
Architectural functional patterns
Personal anti-pattern experiences
Strive to look for:
Immutability
Re-execution
Data flows
(Diagram: raw datasets - Users, Pageviews, Sales - feed derived datasets: Views with demographics, Sales with demographics, Sales reports, Conversion analytics.)
Dataset artifacts, typically files with date parameter.
Raw -> Derived
Anti-pattern: isolated batch jobs
Get data (more on that later)
Cron an ETL batch job (function)
Output solidifies. Mostly.
Steps in isolation - often different teams
What to do on ETL code changes?
Pattern: data pipeline
End-to-end sequences/DAG of jobs
Not only exist, but treated end-to-end
Input is raw, original data
Separate raw data from generated
(Diagram: pipeline from raw Users, Pageviews, Sales through Views/Sales with demographics to Conversion analytics.)
Lambda architecture, part 1
Save all collected data without preprocessing
But timestamp on generation, registration, arrival
Rerun everything downstream on code change
Human fault tolerance
In conflict with privacy management?
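The "no preprocessing" rule can be sketched as a tiny collection-side wrapper: store the payload verbatim and only attach timestamps on the way in. The `envelope` helper and its field names are illustrative assumptions, not from the talk.

```python
import json
import time

def envelope(raw_payload: bytes, generated_at: float) -> dict:
    """Wrap a raw event without transforming it.

    The payload is stored verbatim so every downstream job can be
    rerun with new code; only timestamps are added at collection.
    """
    return {
        "generated_at": generated_at,   # set by the producer
        "arrived_at": time.time(),      # set at collection time
        "payload": raw_payload.decode("utf-8", errors="replace"),
    }

record = envelope(b'{"user": 42, "action": "click"}', generated_at=1.0)
line = json.dumps(record)  # one line per event in the raw dataset
```

Because the payload is untouched, a bug in parsing or enrichment is recoverable by replay - the human fault tolerance mentioned above.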
Pipeline workflow orchestration
Ideally: good old make + cluster + IDE + xUnit
Test end-to-end
Rebuild on upstream changes (but not all)
State of practice: Luigi, Pinball, Azkaban
Don’t take you all the way :-(
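The make-style "rebuild on upstream changes" idea can be sketched in a few lines; the dependency graph and `needs_rebuild` helper are illustrative, not the API of Luigi, Pinball, or Azkaban.

```python
# Minimal make-style check: a dataset must be rebuilt when any of its
# (transitive) upstreams is newer than it. Dataset names are illustrative.
deps = {
    "conversion_analytics": ["views_demo", "sales_demo"],
    "views_demo": ["pageviews", "users"],
    "sales_demo": ["sales", "users"],
}

def needs_rebuild(target: str, mtime: dict) -> bool:
    """True if any upstream, directly or transitively, is newer than target."""
    for up in deps.get(target, []):
        if mtime[up] > mtime[target] or needs_rebuild(up, mtime):
            return True
    return False

mtime = {"users": 5, "pageviews": 1, "sales": 1,
         "views_demo": 2, "sales_demo": 2, "conversion_analytics": 3}
# users (t=5) changed after views_demo (t=2), so downstream must rerun:
print(needs_rebuild("conversion_analytics", mtime))  # True
```

Real orchestrators add scheduling, retries, and targets in HDFS/S3, but the core decision is this timestamp comparison over a DAG.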
Lambda architecture, part 2
Parallel batch and real-time pipelines
Batch more accurate, overrides
Real-time for window of recent data
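A minimal sketch of the batch-overrides-real-time merge, assuming simple counter views keyed by page; all names are illustrative.

```python
def serve(key, batch_view, realtime_view, batch_horizon, now):
    """Merge batch and speed layers at query time.

    The batch view is authoritative up to batch_horizon; real-time
    results are used only for the window the batch has not yet seen.
    """
    total = batch_view.get(key, 0)
    for ts, count in realtime_view.get(key, []):
        if batch_horizon < ts <= now:   # only events newer than the batch
            total += count
    return total

batch = {"page_a": 100}                      # computed up to t=1000
speed = {"page_a": [(990, 7), (1005, 3)]}    # (timestamp, count) increments
print(serve("page_a", batch, speed, batch_horizon=1000, now=1010))  # 103
```

The increment at t=990 is ignored because the (more accurate) batch result already covers it.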
Obtaining data
Log things. Conceptually stable, but collection is challenging at scale.
Have legacy code and master data in databases? Let us have a look.
Anti-pattern: direct dump
Database dimensioned for online traffic
Hadoop = herd of elephants
Load spike: height = #mapper nodes, area = #users
Direct dumps in the trenches
Company successful - #users increasing
More Sqoop mappers - higher DB load
Daily dump jobs went to 25h
Devops firewalled off Hadoop to recover
Anti-pattern: dump through API
SOA/microservice culture
DB protected by throttling
API not used to elephants
Query area is still large
Herd of elephants through gate - 1-2 weeks
Anti-pattern: slave dump
Protect live service by mirroring to a dump slave
No online service risk, good!
Why anti-pattern?
All dumps are non-deterministic
HDFS down? Dump later. State is gone - dump not accurate
Slave replication down? Dump not accurate
Anti-pattern: deterministic mirror
Replay commit log until full day/hour
Discovered through archaeology :-)
Not scalable, point of failure
Hourly dump took 45 minutes, increasing...
(Anti-)pattern: better dumping
Netflix Aegisthus
Snapshot Cassandra (fast, atomic, reliable)
Transfer SSTables to HDFS
Replicate compaction in MapReduce
Other DBs? Depends on atomic snapshot.
All dumps are anti-patterns?
Typical use: Join activity events with user infoEvent time != dump time
Aggregation discards informationWhich users enabled X, tried, and disabled?
23
Pattern: Event source
All facts are events. Immutable, timestamped
Event stream is source of truth
No explicit “current state”
The functional data architecture?
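A minimal sketch of deriving state from an immutable event log; the event shape and action names are illustrative. Note that the log still answers questions aggregation would discard, such as which users enabled X and later disabled it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)        # immutable, like the events themselves
class Event:
    ts: int
    user: str
    action: str                # e.g. "enable_x" / "disable_x"

def state_at(events, t):
    """Derive 'current state' at time t by folding the event stream.

    There is no stored state; replaying the log is the source of truth.
    """
    enabled = set()
    for e in sorted(events, key=lambda e: e.ts):
        if e.ts > t:
            break
        if e.action == "enable_x":
            enabled.add(e.user)
        elif e.action == "disable_x":
            enabled.discard(e.user)
    return enabled

log = [Event(1, "alice", "enable_x"), Event(2, "bob", "enable_x"),
       Event(3, "alice", "disable_x")]
print(state_at(log, 2))   # {'alice', 'bob'}
print(state_at(log, 3))   # {'bob'} - yet the log still shows alice tried X
```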
Event source incarnated: unified log
Pour events into pub/sub bus, with long history. Kafka de-facto standard.
Tap from bus to HDFS/S3 in time buckets. Camus/Secor
Stream processing pipelines to dest topics
Replay on code changes
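Time-bucketed tapping can be sketched as a path function; the directory layout below is an illustrative assumption, not the actual Camus/Secor convention.

```python
from datetime import datetime, timezone

def bucket_path(topic: str, epoch_seconds: int, root: str = "/data/raw") -> str:
    """Place each event under an hourly directory derived from its timestamp.

    Bucketing by event time makes the HDFS/S3 copy deterministic:
    replaying the log lands events in the same buckets.
    """
    t = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{root}/{topic}/{t:%Y/%m/%d/%H}"

print(bucket_path("pageviews", 0))   # /data/raw/pageviews/1970/01/01/00
```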
Unified log, practical considerations
Long history necessaryMust have time to fix stream process bugsUse 3+ months and use stream as temp
DBUnified log also useful for meta and control
Tweak Kafka for low latency
26
Event source + views
View = snapshot of aggregated state @ time
For ETL, choice of hourly/daily aggregates or exact views
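A view as a snapshot-at-time can be sketched as a cutoff over the log; the tuple shape and names are illustrative.

```python
def view(events, cutoff):
    """Materialize a view: aggregated state of the log at a point in time.

    Each event is a (timestamp, key, amount) tuple. Hourly or daily
    views are just different choices of cutoff.
    """
    totals = {}
    for ts, key, amount in events:
        if ts <= cutoff:
            totals[key] = totals.get(key, 0) + amount
    return totals

log = [(1, "sales", 10), (2, "sales", 5), (3, "sales", 7)]
print(view(log, 2))   # {'sales': 15}
```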
Event source + database
Business logic may demand “current state”Event stream is truth, keep DB in sync
28
Event source, synced database
A. Service interface generates events and DB transactions
B. Generate stream from DB commit log. Postgres, MySQL -> Kafka
C. Build DB with stream processing
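Option C can be sketched with an in-memory SQLite projection; the table and event shapes are illustrative, and a real system would consume a Kafka topic rather than a list.

```python
import sqlite3

# The event stream is the truth; the database is a disposable,
# rebuildable projection of it.
events = [
    {"user": "alice", "email": "a@example.com"},
    {"user": "bob", "email": "b@example.com"},
    {"user": "alice", "email": "alice@example.com"},  # later fact wins
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (user TEXT PRIMARY KEY, email TEXT)")
for e in events:   # in a real system: a consumer reading a topic
    db.execute(
        "INSERT INTO users VALUES (:user, :email) "
        "ON CONFLICT(user) DO UPDATE SET email = excluded.email", e)
print(db.execute("SELECT email FROM users WHERE user = 'alice'").fetchone())
# ('alice@example.com',)
```

If the database is lost or the projection logic changes, drop it and replay the stream from the start.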
Deployment & orchestration
System = many machines
Desired system state = code + config
Actual state = Orchestrator(current, desired)
Anti-pattern: stateful orchestration
Orchestrator = Puppet|Chef|Ansible {
    current.changeSomeProperties(desired)
    return current
    // current.otherProperties unchanged
}
Stateful orchestration in the trenches
Desired = { case roleA: install(x, y); case roleB: install(z) }
Current = x installed on roleB. Old x. Zombie woke up when B load decreased.
Puppet + apt = no simple way to remove undesired state
Pattern: artifacts from source
Orchestrator = Docker|Packer {delete currentreturn Image(desired)
}
No state leak from existing state. Sort of.
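The contrast between the two orchestration styles can be shown with plain dictionaries standing in for machine state (names illustrative):

```python
def stateful_orchestrator(current: dict, desired: dict) -> dict:
    """Anti-pattern: mutate what is there; undesired leftovers survive."""
    current.update(desired)
    return current          # keys absent from `desired` linger

def image_from_source(desired: dict) -> dict:
    """Pattern: ignore current state, build the artifact from source only."""
    return dict(desired)

current = {"x": "old", "z": "zombie"}       # z is no longer wanted
desired = {"x": "new"}
print(stateful_orchestrator(dict(current), desired))  # {'x': 'new', 'z': 'zombie'}
print(image_from_source(desired))                     # {'x': 'new'}
```

The stateful path keeps the zombie package `z`; the image built from source is a pure function of the desired state.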
Deterministic, predictable?
Image building leaky on purpose
E.g. “apt-get update && apt-get install”
Imports external state
Ephemeral databases preserve state
Ability to rebuild from unified log is valuable