Master Dataset -> Batch views
New Data -> Realtime views
Query
Lambda Architecture
Master Dataset -> Stream processor -> R/W databases

Proposed alternative (Problematic)
Notes:
• Not limiting ourselves to current tooling
• Reasonable variations of existing tooling are acceptable
• Interested in what’s fundamentally possible
Traditional Architectures
Synchronous: Application -> Databases
Asynchronous: Application -> Queue -> Stream processor -> Databases
Approach #1
• Queries:
• Get all sets for all hours in range of query
• Union sets together
• Compute count of merged set
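A minimal sketch of this query path, assuming a hypothetical SetDatabase client for a Key -> Set database keyed by [URL, hour bucket]:

import java.util.*;

// Hypothetical Key -> Set database client; key is [URL, hour bucket].
interface SetDatabase {
    Set<String> getSet(String url, int hourBucket);
}

static long uniquesOverRange(SetDatabase db, String url, int startHour, int endHour) {
    Set<String> merged = new HashSet<>();
    for (int hour = startHour; hour <= endHour; hour++) {
        merged.addAll(db.getSet(url, hour));  // one database lookup per hour in the range
    }
    return merged.size();                     // exact count of distinct UserIDs
}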
Approach #1
• Lots of database lookups for large ranges
• Potentially a lot of items in sets, so lots of work to merge/count
• Database will use a lot of space
interface HyperLogLog {
  boolean add(Object o);
  long size();
  HyperLogLog merge(HyperLogLog... otherSets);
}
Approach #2
• Use Key->HyperLogLog database
• Key = [URL, hour bucket]
• Value = HyperLogLog structure
Approach #2
• Queries:
• Get all HyperLogLog structures for all hours in range of query
• Merge structures together
• Retrieve count from merged structure
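The same query against the Key -> HyperLogLog layout; HLLDatabase is an illustrative stand-in, and HyperLogLog is the interface shown above:

// Hypothetical Key -> HyperLogLog database client; key is [URL, hour bucket].
interface HLLDatabase {
    HyperLogLog get(String url, int hourBucket);
}

static long uniquesOverRange(HLLDatabase db, String url, int startHour, int endHour) {
    HyperLogLog merged = db.get(url, startHour);
    for (int hour = startHour + 1; hour <= endHour; hour++) {
        merged = merged.merge(db.get(url, hour));  // sketches are fixed-size, so merging is cheap
    }
    return merged.size();                          // approximate distinct count
}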
Approach #3
• Use Key->HyperLogLog database
• Key = [URL, bucket, granularity]
• Value = HyperLogLog structure
Approach #3
• Queries:
• Compute minimal number of database lookups to satisfy range
• Get all HyperLogLog structures in range
• Merge structures together
• Retrieve count from merged structure
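One way to compute the minimal lookups, sketched under the assumption of hour/day/month granularities (1, 24, and 720 hours, purely illustrative): greedily cover the range with the largest aligned bucket that fits. Because each granularity exactly divides the next coarser one, the greedy choice is minimal.

import java.util.*;

// Decompose [startHour, endHour) into the fewest aligned buckets.
static List<int[]> minimalBuckets(int startHour, int endHour) {
    int[] granularities = {720, 24, 1};       // month, day, hour (in hours)
    List<int[]> buckets = new ArrayList<>();  // each entry: {granularity, bucketIndex}
    int cur = startHour;
    while (cur < endHour) {
        for (int g : granularities) {
            if (cur % g == 0 && cur + g <= endHour) {  // aligned and fully inside the range
                buckets.add(new int[]{g, cur / g});
                cur += g;
                break;  // granularity 1 always matches, so cur always advances
            }
        }
    }
    return buckets;
}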
Approach #3
• All benefits of #2
• Minimal number of lookups for any range, so less variation in latency
• Minimal increase in storage
• Requires more work at write time
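The extra write-time work is one HyperLogLog update per granularity instead of one in total (same illustrative hour/day/month scheme; WritableHLLDatabase is hypothetical):

// Hypothetical writable database keyed by [URL, bucket, granularity].
interface WritableHLLDatabase {
    HyperLogLog get(String url, int bucket, int granularity);
}

static void recordPageview(WritableHLLDatabase db, String url, int hour, Object userId) {
    db.get(url, hour, 1).add(userId);          // hour bucket
    db.get(url, hour / 24, 24).add(userId);    // day bucket
    db.get(url, hour / 720, 720).add(userId);  // month bucket
}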
struct Equiv {
  UserID id1;
  UserID id2;
}

struct PageView {
  UserID id;
  String url;
  Timestamp timestamp;
}
[“foo.com/page1”, 0]    -> {A, B, C}
[“foo.com/page1”, 1]    -> {B}
[“foo.com/page1”, 2]    -> {A, C, D, E}
...
[“foo.com/page1”, 1002] -> {A, B, C, F, Z}
New equiv arrives: A <-> C
Approach #1
• [URL, hour] -> Set of PersonIDs
• UserID -> Set of buckets
• Indexes to incrementally normalize UserIDs into PersonIDs
Approach #2
• [URL, bucket] -> Set of UserIDs
• Like Approach #1, incrementally normalize UserIDs
• UserID -> PersonID
Approach #2
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to produce new set
• Get count of new set
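A sketch of that query, reusing the hypothetical SetDatabase from Approach #1 plus an in-memory stand-in for the UserID -> PersonID index:

import java.util.*;

static long uniquePersons(SetDatabase db, Map<String, String> personOf,
                          String url, int startBucket, int endBucket) {
    Set<String> userIds = new HashSet<>();
    for (int b = startBucket; b <= endBucket; b++) {
        userIds.addAll(db.getSet(url, b));           // merge UserID sets for the range
    }
    Set<String> personIds = new HashSet<>();
    for (String u : userIds) {
        personIds.add(personOf.getOrDefault(u, u));  // unnormalized IDs are their own person
    }
    return personIds.size();
}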
Attempt 1:
• Maintain index from UserID -> PersonID
• When receive A <-> B:
• Find what they’re each normalized to, and transitively normalize all reachable IDs to the “smallest” value
Equiv received   Index updates
1 <-> 4          1 -> 1, 4 -> 1
2 <-> 5          5 -> 2, 2 -> 2
5 <-> 3          3 -> 2
4 <-> 5          5 -> 1, 2 -> 1

3 -> 1 never gets produced!
Attempt 2:
• UserID -> PersonID
• PersonID -> Set of UserIDs
• When receive A <-> B
• Find what they’re each normalized to, and choose one for both to be normalized to
• Update all UserIDs in both normalized sets
Equiv received   UserID -> PersonID updates   PersonID -> Set of UserIDs
1 <-> 4          1 -> 1, 4 -> 1               1 -> {1, 4}
2 <-> 5          5 -> 2, 2 -> 2               2 -> {2, 5}
5 <-> 3          3 -> 2                       2 -> {2, 3, 5}
4 <-> 5          5 -> 1, 2 -> 1, 3 -> 1       1 -> {1, 2, 3, 4, 5}
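A minimal in-memory sketch of Attempt 2's bookkeeping (the real version would run against the two indexes; the map arguments here stand in for them):

import java.util.*;

static void handleEquiv(Map<Integer, Integer> personOf,
                        Map<Integer, Set<Integer>> members, int a, int b) {
    int pa = personOf.getOrDefault(a, a);  // what each side currently normalizes to
    int pb = personOf.getOrDefault(b, b);
    if (pa == pb) return;                  // already the same person
    int keep = Math.min(pa, pb);           // normalize both to the "smallest" id
    int drop = Math.max(pa, pb);
    Set<Integer> keepSet = members.computeIfAbsent(keep, k -> new HashSet<>(Set.of(k)));
    Set<Integer> dropSet = members.remove(drop);
    if (dropSet == null) dropSet = new HashSet<>(Set.of(drop));
    for (int id : dropSet) {               // re-point every UserID in the absorbed set
        personOf.put(id, keep);
        keepSet.add(id);
    }
    personOf.put(keep, keep);
}

Replaying the trace above through handleEquiv reproduces 1 -> {1, 2, 3, 4, 5}.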
General challenges with traditional architectures
• Redundant storage of information (“denormalization”)
• Brittle to human error
• Operational challenges of enormous installations of very complex databases
Master Dataset -> Stream processor -> Indexes for uniques over time
No fully incremental approach will work!
ID | Name   | Location ID
1  | Sally  | 3
2  | George | 1
3  | Bob    | 3

Location ID | City      | State | Population
1           | New York  | NY    | 8.2M
2           | San Diego | CA    | 1.3M
3           | Chicago   | IL    | 2.7M

Normalized schema
Normalization vs Denormalization
ID | Name   | Location ID | City     | State
1  | Sally  | 3           | Chicago  | IL
2  | George | 1           | New York | NY
3  | Bob    | 3           | Chicago  | IL

Location ID | City      | State | Population
1           | New York  | NY    | 8.2M
2           | San Diego | CA    | 1.3M
3           | Chicago   | IL    | 2.7M

Denormalized schema
Master Dataset -> Continuously recompute indexes -> Indexes for uniques over time
Sidenote on tooling
• Batch processing systems are tools to implement function(all data) scalably
• Implementing this is easy
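For instance, the whole uniques index can be rebuilt as one function over the master dataset. This sketch uses plain in-memory iteration just to show the shape of the computation; HLLFactory and hourBucket are illustrative, and the PageView fields from the struct above are treated as String/long:

import java.util.*;

interface HLLFactory { HyperLogLog create(); }  // supplies empty sketches

static long hourBucket(long timestampMillis) {
    return timestampMillis / 3_600_000;         // milliseconds per hour
}

// function(all data): rebuild the entire [URL, bucket] -> HyperLogLog index from scratch.
static Map<String, HyperLogLog> recomputeIndex(Iterable<PageView> allPageviews,
                                               Map<String, String> personOf,
                                               HLLFactory factory) {
    Map<String, HyperLogLog> index = new HashMap<>();
    for (PageView pv : allPageviews) {
        String person = personOf.getOrDefault(pv.id, pv.id);  // normalize the UserID first
        String key = pv.url + "|" + hourBucket(pv.timestamp);
        index.computeIfAbsent(key, k -> factory.create()).add(person);
    }
    return index;
}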
Conclusions
• Easy to understand and implement
• Scalable
• Concurrency / fault-tolerance easily abstracted away from you
• Great query performance
Time ------------------------------------------------> Now
[ Absorbed into batch views ][ Not absorbed ]
Just a small percentage of data!
Master Dataset -> Batch views
New Data -> Realtime views
Query
Implementing realtime layer
• Isn’t this the exact same problem we faced before we went down the path of batch computation?
Approach #1
• Use the exact same approach as we did in fully incremental implementation
• Query performance only degraded for recent buckets
• e.g., “last month” range computes vast majority of query from efficient batch indexes
Approach #1
• Relatively small number of buckets in realtime layer
• So not that much effect on storage costs
Approach #1
• Complexity of realtime layer is softened by existence of batch layer
• Batch layer continuously overrides realtime layer, so mistakes are auto-fixed
Approach #1
• Still going to be a lot of work to implement this realtime layer
• Recent buckets with lots of uniques will still cause bad query performance
• No way to apply recent equivs to batch views without restructuring batch views
Approach #2

Pageview -> Convert UserID to PersonID -> [URL, bucket] -> HyperLogLog
(UserID -> PersonID index from batch)
Approach #2
• Only inaccurate for recent equivs
• Intuitively, shouldn’t be that much inaccuracy
• Should quantify additional error
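A sketch of this realtime write path, assuming the batch-produced UserID -> PersonID index is queryable and unknown UserIDs act as their own person (WritableHLLDatabase as sketched earlier):

import java.util.*;

static void onPageview(WritableHLLDatabase realtime, Map<String, String> batchPersonOf,
                       String userId, String url, int hourBucket) {
    String person = batchPersonOf.getOrDefault(userId, userId);  // recent equivs may be missing
    realtime.get(url, hourBucket, 1).add(person);                // update hour-granularity view
}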
Master Dataset (normalized: robust data model) -> Batch views (denormalized: optimized for queries)
Normalization problem solved
• Maintaining consistency in views is easy because they are defined as function(all data)
• Can recompute if anything ever goes wrong
Incremental compaction
• Databases write to write-ahead log before modifying disk and memory indexes
• Need to occasionally compact the log and indexes
Incremental compaction
• Notorious for causing huge, sudden changes in performance
• Machines can seem locked up
• Necessitated by random writes
• Extremely complex to deal with
More Complexity
• Dealing with CAP / eventual consistency
• “Call Me Maybe” blog posts found data loss problems in many popular databases
• Redis
• Cassandra
• ElasticSearch
Master Dataset -> Batch views
New Data -> Realtime views
Query
No random writes!
Master Dataset
Application -> R/W databases
(Synchronous version)
Master Dataset -> Batch views
New Data -> Realtime views
Query
Lambda Architecture
Lambda Architecture
• This is the most basic form of it
• Many variants incorporate more and/or different kinds of layers