High-Level Data Models on RAMCloud
An early status report
Jonathan Ellithorpe, Mendel Rosenblum
EE & CS Departments, Stanford University

Talk Outline
● The Idea
● Data models today
● Graph databases
● Experience with TorcDB
● Concluding thoughts
Frangipani: A Scalable Distributed File System
SOSP 1997 - Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee (DEC Systems Research Center)
Build a distributed file system in two layers:
1. Petal: Distributed block storage - handles all the hard storage problems
2. Frangipani: A simple layer that implements a distributed file system on top of Petal, inheriting all the goodness of Petal
Claim: Easier way to build a distributed file system
RAMCloud - Data center storage
● Storage challenges addressed
○ Low latency: reads in 5 microseconds, writes in 15 microseconds
○ Scalable: scale-out architecture - more servers ⇒ more storage & ops
○ Durable/Available: replication and fast recovery from server failures
○ Concurrency control: simple transaction system
● With a simple data model: tables of objects
○ Key-value store: objects up to a max size (1 MB), accessed via key
○ Operations: whole-object read & write (optionally under transaction) and table enumeration
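The data model above can be sketched as a toy in-memory stand-in: tables of objects, whole-object read/write by key, and table enumeration. The class and method names below are hypothetical illustrations, not the real RAMCloud client API.

```python
# Toy stand-in for RAMCloud's data model (illustrative names, not the real API).
MAX_OBJECT_SIZE = 1 << 20  # 1 MB object size limit, per the slides


class MiniRamCloud:
    def __init__(self):
        self.tables = {}

    def create_table(self, name):
        self.tables.setdefault(name, {})

    def write(self, table, key, value):
        # Whole-object write: the entire value replaces the stored object.
        if len(value) > MAX_OBJECT_SIZE:
            raise ValueError("object exceeds 1 MB limit")
        self.tables[table][key] = value

    def read(self, table, key):
        # Whole-object read by key.
        return self.tables[table][key]

    def enumerate(self, table):
        # Table enumeration: iterate all (key, value) pairs.
        return iter(self.tables[table].items())


rc = MiniRamCloud()
rc.create_table("people")
rc.write("people", b"alice", b'{"age": 30}')
print(rc.read("people", b"alice"))  # b'{"age": 30}'
```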
Idea: Build high-level data models on RAMCloud
Clients use a higher-level data model with RAMCloud goodness
[Diagram: many clients accessing a thin data model translation layer on top of RAMCloud storage goodness]
What data models are most interesting?
We looked at the most popular databases in use today to learn which data models are most useful for applications.
Data Structure Databases (e.g. Redis)
Document Databases (e.g. Mongo DB)
Relational Databases (e.g. SQL)
Graph Databases (e.g. Neo4j)
Take Redis as an example

RAMCloud Win:
● Fetching of single objects
● Scalability
○ Redis is not distributed
● Persistence performance
○ Redis durable writes → disk latency
○ RAMCloud durable writes → DRAM latency

Redis Win:
● Consistent scans of a key space
○ No consistent table scans in RAMCloud
● Higher-level API
○ Redis handles high-level operations on data structures on the server itself (1 RTT); RAMCloud needs a read-modify-write cycle
● Operations that access the entire data structure
○ E.g., search a list for a matching pattern; RAMCloud requires fetching the data to the client
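The round-trip difference above can be made concrete with a toy sketch: appending to a list stored as one object costs a read plus a write on a plain key-value store, while a Redis-style server-side push is a single round trip. The classes and RTT counters are illustrative assumptions, not real client libraries.

```python
# Toy comparison: client-side read-modify-write vs. server-side list push.
import json


class KVStore:
    """Plain key-value store: only whole-object read/write."""

    def __init__(self):
        self.data = {}
        self.rtts = 0  # illustrative round-trip counter

    def read(self, key):
        self.rtts += 1
        return self.data.get(key, b"[]")

    def write(self, key, value):
        self.rtts += 1
        self.data[key] = value


def kv_list_push(store, key, item):
    # Read-modify-write: fetch the whole list, append, write it all back.
    lst = json.loads(store.read(key))
    lst.append(item)
    store.write(key, json.dumps(lst).encode())


class RedisLike(KVStore):
    def lpush(self, key, item):
        # Server executes the list operation itself: one round trip.
        self.rtts += 1
        lst = json.loads(self.data.get(key, b"[]"))
        lst.insert(0, item)
        self.data[key] = json.dumps(lst).encode()


kv = KVStore()
kv_list_push(kv, "mylist", "a")
r = RedisLike()
r.lpush("mylist", "a")
print(kv.rtts, r.rtts)  # 2 1
```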
Thoughts on data models on RAMCloud
● Data Structure Databases (e.g. Redis)
○ Expect a win on scalability and persistence performance
○ The high-level API provides a significant advantage for Redis
● Document Databases (e.g. MongoDB)
○ Expect a win on scalability and persistence performance as well
○ Not RAMCloud's strength: RAMCloud only supports single-value secondary indexes
● Relational Databases (e.g. SQL)
○ SELECT * FROM employees; → needs consistent table scans…
○ RAMCloud doesn't support consistent table scans
○ Indexes are also not transactional
● Graph Databases (e.g. Neo4j)
○ Expect a win on scalability
○ Not sure that high-level APIs will really help at scale
■ Graph data is very hard to partition → a win?
Graph Databases
● Data:
○ Labeled nodes with properties (name-value pairs)
○ Labeled edges with properties (name-value pairs) that connect nodes
● Operations:
○ CRUD on nodes and edges and their properties
○ Queries over the graph
● Allows capturing the relationships between entities
○ Example entities: People, University, Messages, etc.
○ Example relationships: knows, isLocatedIn, hasCreator, etc.
○ Example queries:
■ All friends within 3 hops of Alice whose name is Bob, sorted by distance
■ All replies to a post and whether or not the reply author knows the post author
■ Add a new post with edges to the creator, the location, and any tags
TorcDB: TinkerPop on RAMCloud
● TinkerPop
○ Apache project
○ Graph API and software stack
● Gremlin
○ Query language
○ Imperative
○ Designed around the idea of a pipeline
○ Gets translated into calls on the TinkerPop graph API
Long pathLength = g.V(person1Id).choose(
    where(out("knows")),
    repeat(out("knows").simplePath()).until(
        hasId(person2Id).or().path().count(local).is(gt(5))
    ).limit(1).choose(
        id().is(eq(person2Id)),
        union(path().count(local), constant(-1)).sum(),
        constant(-1)
    ),
    constant(-1)
).next();
Social Network Benchmark
● 7 short read queries
○ Get a list of my posts
○ Get all the comments on one of my posts
○ ...
● 8 update queries
○ Post a new message
○ Like a post
○ ...
● 14 complex queries
○ All friends within 3 hops of me whose name is "Bob", sorted by distance from me
○ ...
Data Layout: Graph mapping to RAMCloud objects

Vertex Table:
Key | Value
vID:label | Vertex Label
vID:props | Properties

Edge List Table:
Key | Value
vID | Incident Edge Labels
vID:eLabel:dir | Edge list
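The key scheme in the tables above can be sketched with a few helper functions. The separator character and the direction names are assumptions; the slides only show the key shapes.

```python
# Sketch of the vertex/edge key scheme from the slides.
# ":" as separator and "out"/"in" direction names are assumptions.

def vertex_label_key(v_id: str) -> str:
    # Key holding a vertex's label.
    return f"{v_id}:label"

def vertex_props_key(v_id: str) -> str:
    # Key holding a vertex's properties.
    return f"{v_id}:props"

def edge_list_key(v_id: str, e_label: str, direction: str) -> str:
    # Key holding the edge list for one (edge label, direction) pair.
    assert direction in ("out", "in")
    return f"{v_id}:{e_label}:{direction}"

print(vertex_props_key("42"))               # 42:props
print(edge_list_key("42", "knows", "out"))  # 42:knows:out
```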
Questions:
● How do we break up edge lists that don't fit in a single RAMCloud object?
● How do we do this in a way that optimizes the performance of the most common operations?
Edge list design tradeoffs
● The two most common edge list operations:
○ Add an edge to a node
○ Get all the neighbors of a node
● Optimizing edge list storage for adding:
■ Want the updated object to be small so read/write time is short
■ Store the edge list in many small objects?
● Optimizing for getting all the neighbors of a node:
■ Want to read as few objects as possible - few round trips to RAMCloud
■ Store the edge list in as few objects as possible?

Key Idea: Minimize the read/write size for adding an edge, while maximizing the average chunk size when retrieving the whole list.
TorcDB Edge list layout

Logical Edge List: Head Segment | Tail Segment | Tail Segment | Tail Segment
● Head segment: edges are added here; want to keep it as small as possible
● Tail segments: all of these must be read when reading the whole list; want as few as possible → minimize object read overhead
Head Segment → RootKey:0: 0 | edge2 | edge1
(the leading field is the number of tail segments; the head grows toward the size threshold)
Head Segment → RootKey:0: 0 | edge6 | edge5 | edge4 | edge3 | edge2 | edge1
(the head has reached the size threshold → split)
After the split:
Head Segment → RootKey:0: 1 | edge6
Tail Segment 1 → RootKey:1: edge5 | edge4 | edge3 | edge2 | edge1
After more edges and splits:
Head Segment → RootKey:0: 3 | edge16
Tail Segment 3 → RootKey:3: edge15 | edge14 | edge13 | edge12 | edge11
Tail Segment 2 → RootKey:2: edge10 | edge9 | edge8 | edge7 | edge6
Tail Segment 1 → RootKey:1: edge5 | edge4 | edge3 | edge2 | edge1

Results:
● Head segment stays small; tail segments are large
● Tail segments can be read in parallel
● The whole list can be read by predicting the tail segment keys, since the head segment stores the number of tail segments
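The head/tail layout above can be sketched as a short algorithm: new edges are prepended to the head segment, and when the head exceeds the size threshold, the older edges are spilled into a new immutable tail segment. For simplicity this sketch counts sizes in edges rather than bytes, which is an assumption; the threshold value and class names are also illustrative.

```python
# Sketch of the TorcDB edge-list layout (sizes counted in edges, not bytes).
SIZE_THRESHOLD = 5  # max edges kept in the head segment (illustrative)


class EdgeList:
    def __init__(self):
        self.num_tails = 0  # tail-segment count, stored in the head (RootKey:0)
        self.head = []      # newest edges live in the head segment
        self.tails = {}     # tail index -> immutable segment (RootKey:i)

    def add_edge(self, edge):
        self.head.insert(0, edge)  # small write: only the head object changes
        if len(self.head) > SIZE_THRESHOLD:
            # Split: spill everything but the newest edge into a new tail.
            self.num_tails += 1
            self.tails[self.num_tails] = self.head[1:]
            self.head = self.head[:1]

    def read_all(self):
        # The head stores the tail count, so every tail key can be predicted
        # and (in a real system) fetched in parallel.
        edges = list(self.head)
        for i in range(self.num_tails, 0, -1):
            edges.extend(self.tails[i])
        return edges


el = EdgeList()
for i in range(1, 7):
    el.add_edge(f"edge{i}")
print(el.head)       # ['edge6']
print(el.num_tails)  # 1
```

Adding edges 7 through 16 reproduces the final frame above: a head of [edge16] with three tail segments of five edges each.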
Is this graph mapping to RAMCloud objects good?
Data Layout: Revised graph mapping to RAMCloud objects

Revised Edge List Table:
Key | Value
vID | Incident Edge Labels
vID:eLabel:dir | Neighbor Vertex Labels
vID:eLabel:dir:vLabel | List of edges of type eLabel going from vID to nodes of type vLabel

(replacing the original Edge List Table:
Key | Value
vID | Incident Edge Labels
vID:eLabel:dir | Edge list)
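The revised scheme partitions each edge list by the neighbor's vertex label, so a traversal that only wants, say, Person neighbors reads only the matching objects. A minimal sketch, assuming the ":"-separated key encoding shown above (the store and helper names are hypothetical):

```python
# Sketch of the revised edge-list key scheme: keys also carry the
# neighbor's vertex label, enabling label-filtered traversals.

def edge_list_key(v_id, e_label, direction, v_label):
    # Key encoding is an assumption based on the slide's key shapes.
    return f"{v_id}:{e_label}:{direction}:{v_label}"


# Toy key-value store holding two edge lists for vertex 42.
store = {
    edge_list_key("42", "knows", "out", "Person"): ["e1", "e2"],
    edge_list_key("42", "likes", "out", "Post"): ["e3"],
}


def neighbors_of_label(store, v_id, e_label, direction, v_label):
    # Fetch only the edge list for neighbors with the requested label;
    # edges to other labels are never read.
    return store.get(edge_list_key(v_id, e_label, direction, v_label), [])


print(neighbors_of_label(store, "42", "knows", "out", "Person"))  # ['e1', 'e2']
```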
Evaluating TorcDB - Comparison with Neo4j
● Neo4j
○ Single server
○ MySQL-like implementation (queries execute on the server)
○ Disk-based; caches recently used data in memory

Cypher:
MATCH (person1:Person {id:{1}}), (person2:Person {id:{2}})
MATCH path = shortestPath( (person1)-[:KNOWS*..15]-(person2) )
RETURN length(path)
Benchmarking
● Loading a 2.5 TB graph dataset into TorcDB - a non-trivial undertaking
○ Initially took > 1 day
■ Loading through TorcDB
○ Got it down to 4 hours
■ Took a snapshot of the RAMCloud tables and loaded the tables directly
○ Eventually got it down to 20 minutes
■ Sorted the snapshot tables into server-local tablets and loaded the tablets into servers directly from local disk
● Why is load time important? Why not load once and then run benchmarks?
○ Each run of the benchmark performs updates and modifies the graph; for 100% consistent results, the graph must be reloaded for each run.
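The 20-minute load works by pre-sorting snapshot records by the tablet that will own them, so each server bulk-loads its own tablets from local disk. A rough sketch of that partitioning step, assuming tablets cover evenly divided key-hash ranges (the hash function, tablet count, and function names are assumptions):

```python
# Sketch: partition a snapshot's records by owning tablet so each server
# can load its tablets locally. Hash/range details are assumptions.
import hashlib

NUM_TABLETS = 4  # illustrative tablet count


def tablet_of(key: bytes) -> int:
    # Assume tablets evenly partition a 64-bit key-hash space.
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return h * NUM_TABLETS >> 64


def partition_snapshot(records):
    # Bucket every (key, value) record by its owning tablet.
    tablets = {i: [] for i in range(NUM_TABLETS)}
    for key, value in records:
        tablets[tablet_of(key)].append((key, value))
    return tablets


snap = [(f"v{i}".encode(), b"...") for i in range(10)]
parts = partition_snapshot(snap)
# Every record lands in exactly one tablet bucket.
assert sum(len(v) for v in parts.values()) == 10
```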
Conclusion
● The key to performance is minimizing the amount of data that must be read and written to execute a query
● Minimize the number of round trips → minimize data dependencies
○ I.e., linked lists are probably not a good idea
● RAMCloud is not suitable for large-scale analytics tasks
● What we were missing in RAMCloud:
○ Large transactions
■ Not generally possible to do consistent reads of a 3-hop subgraph
○ Consistent table scans
○ In-place object updates (can't update an object in place)