High-Level Data Models on RAMCloud
An early status report
Jonathan Ellithorpe, Mendel Rosenblum
EE & CS Departments, Stanford University

Talk Outline
● The Idea
● Data models today
● Graph databases
● Experience with TorcDB
● Concluding thoughts
Frangipani: A Scalable Distributed File System
SOSP 1997 - Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee (DEC Systems Research Center)
Build a distributed file system in two layers:
1. Petal: Distributed block storage - handles all the hard storage problems
2. Frangipani: A simple layer that implements a distributed file system on top of Petal, inheriting all the goodness of Petal
Claim: Easier way to build a distributed file system
RAMCloud - Data center storage
● Storage challenges addressed
○ Low latency: reads in 5 microseconds, writes in 15 microseconds
○ Scalable: scale-out architecture - more servers ⇒ more storage & ops
○ Durable/Available: replication and fast recovery from server failures
○ Concurrency control: simple transaction system
● With a simple data model: tables of objects
○ Key-value store: objects up to a max size (1 MB), accessed via key
○ Operations: whole-object read & write (optionally under transaction) and table enumeration
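The data model above can be sketched as a toy in-memory stand-in: tables of objects, whole-object read/write by key, and table enumeration. The class and method names below are hypothetical illustrations, not the real RAMCloud client API.

```python
# Toy stand-in for RAMCloud's data model (illustrative names, not the real API).
MAX_OBJECT_SIZE = 1 << 20  # 1 MB object size limit, per the slides


class MiniRamCloud:
    def __init__(self):
        self.tables = {}

    def create_table(self, name):
        self.tables.setdefault(name, {})

    def write(self, table, key, value):
        # Whole-object write: the entire value replaces the stored object.
        if len(value) > MAX_OBJECT_SIZE:
            raise ValueError("object exceeds 1 MB limit")
        self.tables[table][key] = value

    def read(self, table, key):
        # Whole-object read by key.
        return self.tables[table][key]

    def enumerate(self, table):
        # Table enumeration: iterate all (key, value) pairs.
        return iter(self.tables[table].items())


rc = MiniRamCloud()
rc.create_table("people")
rc.write("people", b"alice", b'{"age": 30}')
print(rc.read("people", b"alice"))  # b'{"age": 30}'
```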
Idea: Build high-level data models on RAMCloud
Clients use a higher-level data model with RAMCloud goodness
[Diagram: many clients accessing a thin data model translation layer on top of RAMCloud storage goodness]
What data models are most interesting?
We looked at the most popular databases in use today to learn which data models are most useful for applications.
Data Structure Databases (e.g. Redis)
Document Databases (e.g. Mongo DB)
Relational Databases (e.g. SQL)
Graph Databases (e.g. Neo4j)
Take Redis as an example

RAMCloud Win:
● Fetching of single objects
● Scalability
○ Redis is not distributed
● Persistence performance
○ Redis durable writes → disk latency
○ RAMCloud durable writes → DRAM latency

Redis Win:
● Consistent scans of a key space
○ No consistent table scans in RAMCloud
● Higher-level API
○ Redis handles high-level operations on data structures on the server itself (1 RTT); RAMCloud needs a read-modify-write cycle
● Operations that access the entire data structure
○ E.g., search a list for a matching pattern; RAMCloud requires fetching the data to the client
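The round-trip difference above can be made concrete with a toy sketch: appending to a list stored as one object costs a read plus a write on a plain key-value store, while a Redis-style server-side push is a single round trip. The classes and RTT counters are illustrative assumptions, not real client libraries.

```python
# Toy comparison: client-side read-modify-write vs. server-side list push.
import json


class KVStore:
    """Plain key-value store: only whole-object read/write."""

    def __init__(self):
        self.data = {}
        self.rtts = 0  # illustrative round-trip counter

    def read(self, key):
        self.rtts += 1
        return self.data.get(key, b"[]")

    def write(self, key, value):
        self.rtts += 1
        self.data[key] = value


def kv_list_push(store, key, item):
    # Read-modify-write: fetch the whole list, append, write it all back.
    lst = json.loads(store.read(key))
    lst.append(item)
    store.write(key, json.dumps(lst).encode())


class RedisLike(KVStore):
    def lpush(self, key, item):
        # Server executes the list operation itself: one round trip.
        self.rtts += 1
        lst = json.loads(self.data.get(key, b"[]"))
        lst.insert(0, item)
        self.data[key] = json.dumps(lst).encode()


kv = KVStore()
kv_list_push(kv, "mylist", "a")
r = RedisLike()
r.lpush("mylist", "a")
print(kv.rtts, r.rtts)  # 2 1
```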
Thoughts on data models on RAMCloud
● Data Structure Databases (e.g. Redis)
○ Expect a win on scalability and persistence performance
○ The high-level API provides a significant advantage for Redis
● Document Databases (e.g. MongoDB)
○ Expect a win on scalability and persistence performance as well
○ Not RAMCloud's strength: RAMCloud only supports single-value secondary indexes
● Relational Databases (e.g. SQL)
○ SELECT * FROM employees; → needs consistent table scans…
○ RAMCloud doesn't support consistent table scans
○ Indexes are also not transactional
● Graph Databases (e.g. Neo4j)
○ Expect a win on scalability
○ Not sure that high-level APIs will really help at scale
■ Graph data is very hard to partition → a win?
Graph Databases
● Data:
○ Labeled nodes with properties (name-value pairs)
○ Labeled edges with properties (name-value pairs) that connect nodes
● Operations:
○ CRUD on nodes and edges and their properties
○ Queries over the graph
● Allows capturing the relationships between entities
○ Example entities: People, University, Messages, etc.
○ Example relationships: knows, isLocatedIn, hasCreator, etc.
○ Example queries:
■ All friends within 3 hops of Alice whose name is Bob, sorted by distance
■ All replies to a post and whether or not the reply author knows the post author
■ Add a new post with edges to the creator, the location, and any tags
TorcDB: TinkerPop on RAMCloud
● TinkerPop
○ Apache project
○ Graph API and software stack
● Gremlin
○ Query language
○ Imperative
○ Designed around the idea of a pipeline
○ Gets translated into calls on the TinkerPop graph API
Long pathLength = g.V(person1Id).choose(
    where(out("knows")),
    repeat(out("knows").simplePath()).until(
        hasId(person2Id).or().path().count(local).is(gt(5))
    ).limit(1).choose(
        id().is(eq(person2Id)),
        union(path().count(local), constant(-1)).sum(),
        constant(-1)
    ),
    constant(-1)
).next();
Social Network Benchmark
● 7 short read queries
○ Get a list of my posts
○ Get all the comments on one of my posts
○ ...
● 8 update queries
○ Post a new message
○ Like a post
○ ...
● 14 complex queries
○ All friends within 3 hops of me whose name is "Bob", sorted by distance from me
○ ...
Data Layout: Graph mapping to RAMCloud objects

Vertex Table:
Key | Value
vID:label | Vertex Label
vID:props | Properties

Edge List Table:
Key | Value
vID | Incident Edge Labels
vID:eLabel:dir | Edge list
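The key scheme in the tables above can be sketched with a few helper functions. The separator character and the direction names are assumptions; the slides only show the key shapes.

```python
# Sketch of the vertex/edge key scheme from the slides.
# ":" as separator and "out"/"in" direction names are assumptions.

def vertex_label_key(v_id: str) -> str:
    # Key holding a vertex's label.
    return f"{v_id}:label"

def vertex_props_key(v_id: str) -> str:
    # Key holding a vertex's properties.
    return f"{v_id}:props"

def edge_list_key(v_id: str, e_label: str, direction: str) -> str:
    # Key holding the edge list for one (edge label, direction) pair.
    assert direction in ("out", "in")
    return f"{v_id}:{e_label}:{direction}"

print(vertex_props_key("42"))               # 42:props
print(edge_list_key("42", "knows", "out"))  # 42:knows:out
```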
Questions:
● How do we break up edge lists that don't fit in a single RAMCloud object?
● How do we do this in a way that optimizes the performance of the most common operations?
Edge list design tradeoffs
● The two most common edge list operations:
○ Add an edge to a node
○ Get all the neighbors of a node
● Optimizing edge list storage for adding:
■ Want the updated object to be small so read/write time is short
■ Store the edge list in many small objects?
● Optimizing for getting all the neighbors of a node:
■ Want to read as few objects as possible - few round trips to RAMCloud
■ Store the edge list in as few objects as possible?

Key Idea: Minimize the read/write size for adding an edge, while maximizing the average chunk size when retrieving the whole list.
TorcDB Edge list layout

Logical Edge List: Head Segment | Tail Segment | Tail Segment | Tail Segment
● Head segment: edges are added here; want to keep it as small as possible
● Tail segments: all of these must be read when reading the whole list; want as few as possible → minimize object read overhead
Head Segment → RootKey:0: 0 | edge2 | edge1
(the leading field is the number of tail segments; the head grows toward the size threshold)
Head Segment → RootKey:0: 0 | edge6 | edge5 | edge4 | edge3 | edge2 | edge1
(the head has reached the size threshold → split)
After the split:
Head Segment → RootKey:0: 1 | edge6
Tail Segment 1 → RootKey:1: edge5 | edge4 | edge3 | edge2 | edge1
After more edges and splits:
Head Segment → RootKey:0: 3 | edge16
Tail Segment 3 → RootKey:3: edge15 | edge14 | edge13 | edge12 | edge11
Tail Segment 2 → RootKey:2: edge10 | edge9 | edge8 | edge7 | edge6
Tail Segment 1 → RootKey:1: edge5 | edge4 | edge3 | edge2 | edge1

Results:
● Head segment stays small; tail segments are large
● Tail segments can be read in parallel
● The whole list can be read by predicting the tail segment keys, since the head segment stores the number of tail segments
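The head/tail layout above can be sketched as a short algorithm: new edges are prepended to the head segment, and when the head exceeds the size threshold, the older edges are spilled into a new immutable tail segment. For simplicity this sketch counts sizes in edges rather than bytes, which is an assumption; the threshold value and class names are also illustrative.

```python
# Sketch of the TorcDB edge-list layout (sizes counted in edges, not bytes).
SIZE_THRESHOLD = 5  # max edges kept in the head segment (illustrative)


class EdgeList:
    def __init__(self):
        self.num_tails = 0  # tail-segment count, stored in the head (RootKey:0)
        self.head = []      # newest edges live in the head segment
        self.tails = {}     # tail index -> immutable segment (RootKey:i)

    def add_edge(self, edge):
        self.head.insert(0, edge)  # small write: only the head object changes
        if len(self.head) > SIZE_THRESHOLD:
            # Split: spill everything but the newest edge into a new tail.
            self.num_tails += 1
            self.tails[self.num_tails] = self.head[1:]
            self.head = self.head[:1]

    def read_all(self):
        # The head stores the tail count, so every tail key can be predicted
        # and (in a real system) fetched in parallel.
        edges = list(self.head)
        for i in range(self.num_tails, 0, -1):
            edges.extend(self.tails[i])
        return edges


el = EdgeList()
for i in range(1, 7):
    el.add_edge(f"edge{i}")
print(el.head)       # ['edge6']
print(el.num_tails)  # 1
```

Adding edges 7 through 16 reproduces the final frame above: a head of [edge16] with three tail segments of five edges each.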
Is this graph mapping to RAMCloud objects good?
Data Layout: Revised graph mapping to RAMCloud objects

Revised Edge List Table:
Key | Value
vID | Incident Edge Labels
vID:eLabel:dir | Neighbor Vertex Labels
vID:eLabel:dir:vLabel | List of edges of type eLabel going from vID to nodes of type vLabel

(replacing the original Edge List Table:
Key | Value
vID | Incident Edge Labels
vID:eLabel:dir | Edge list)
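The revised scheme partitions each edge list by the neighbor's vertex label, so a traversal that only wants, say, Person neighbors reads only the matching objects. A minimal sketch, assuming the ":"-separated key encoding shown above (the store and helper names are hypothetical):

```python
# Sketch of the revised edge-list key scheme: keys also carry the
# neighbor's vertex label, enabling label-filtered traversals.

def edge_list_key(v_id, e_label, direction, v_label):
    # Key encoding is an assumption based on the slide's key shapes.
    return f"{v_id}:{e_label}:{direction}:{v_label}"


# Toy key-value store holding two edge lists for vertex 42.
store = {
    edge_list_key("42", "knows", "out", "Person"): ["e1", "e2"],
    edge_list_key("42", "likes", "out", "Post"): ["e3"],
}


def neighbors_of_label(store, v_id, e_label, direction, v_label):
    # Fetch only the edge list for neighbors with the requested label;
    # edges to other labels are never read.
    return store.get(edge_list_key(v_id, e_label, direction, v_label), [])


print(neighbors_of_label(store, "42", "knows", "out", "Person"))  # ['e1', 'e2']
```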
Evaluating TorcDB - Comparison with Neo4j
● Neo4j
○ Single server
○ MySQL-like implementation (queries execute on the server)
○ Disk-based; caches recently used data in memory

Cypher:
MATCH (person1:Person {id:{1}}), (person2:Person {id:{2}})
MATCH path = shortestPath( (person1)-[:KNOWS*..15]-(person2) )
RETURN length(path)
Benchmarking
● Loading a 2.5 TB graph dataset into TorcDB - a non-trivial undertaking
○ Initially took > 1 day
■ Loading through TorcDB
○ Got it down to 4 hours
■ Took a snapshot of the RAMCloud tables and loaded the tables directly
○ Eventually got it down to 20 minutes
■ Sorted the snapshot tables into server-local tablets and loaded the tablets into servers directly from local disk
● Why is load time important? Why not load once and then run benchmarks?
○ Each run of the benchmark performs updates and modifies the graph; for 100% consistent results, the graph must be reloaded for each run.
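The 20-minute load works by pre-sorting snapshot records by the tablet that will own them, so each server bulk-loads its own tablets from local disk. A rough sketch of that partitioning step, assuming tablets cover evenly divided key-hash ranges (the hash function, tablet count, and function names are assumptions):

```python
# Sketch: partition a snapshot's records by owning tablet so each server
# can load its tablets locally. Hash/range details are assumptions.
import hashlib

NUM_TABLETS = 4  # illustrative tablet count


def tablet_of(key: bytes) -> int:
    # Assume tablets evenly partition a 64-bit key-hash space.
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return h * NUM_TABLETS >> 64


def partition_snapshot(records):
    # Bucket every (key, value) record by its owning tablet.
    tablets = {i: [] for i in range(NUM_TABLETS)}
    for key, value in records:
        tablets[tablet_of(key)].append((key, value))
    return tablets


snap = [(f"v{i}".encode(), b"...") for i in range(10)]
parts = partition_snapshot(snap)
# Every record lands in exactly one tablet bucket.
assert sum(len(v) for v in parts.values()) == 10
```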
Conclusion
● The key to performance is minimizing the amount of data that must be read and written to execute a query
● Minimize the number of round trips → minimize data dependencies
○ I.e., linked lists are probably not a good idea
● RAMCloud is not suitable for large-scale analytics tasks
● What we were missing in RAMCloud:
○ Large transactions
■ Not generally possible to do consistent reads of a 3-hop subgraph
○ Consistent table scans
○ In-place object updates (can't update an object in place)