Upload
randyguck
View
372
Download
1
Tags:
Embed Size (px)
Citation preview
One Billion Objects in 2GB: Big Data Analytics on Small Clusters with Doradus OLAP
Randy Guck
Principal Engineer
Dell Software Group
What is Doradus? Storage and query service
Leverages Cassandra NoSQL DB
Pure Java
- Stateless
- Embeddable or standalone
Open source: Apache 2.0 License
30 Doradus: The Tarantula Nebula
Source: Hubble Space Telescope
Why Use Doradus? Easy to use: no client driver
Spider storage manager
- Good for unstructured data
OLAP storage manager
- Near real time data warehousing
Compared to Cassandra alone:
- Data model, searching, analytics
Compared to Hadoop:
- Fast data loads and queries
- Dense storage: less hardware
Cassandra
Data
Applications
REST API
OLAP Spider
Doradus
A Multi-Node Cluster
Doradus
Cassandra
Data
Node 2
Cassandra
Data
Doradus
Cassandra
Data
Node 1 Node 3
Applications
REST API
Secondary Doradus
instances are optional
Why Did We Build Doradus OLAP? Some tough customer requirements:
- Statistical queries most important
- Need to scan millions of objects/second
- User-customizable “insights” = millions of possible queries
Couldn’t use indexes, pre-computed queries, etc.
Disk physics
- ~100's of random reads/second
- ~1000's of serial reads/second
Needed a radically new approach!
Doradus OLAP Combines ideas from:
- Online Analytical Processing: data arranged in static cubes
- Columnar databases: Column-oriented storage and compression
- NoSQL databases: Sharding
Features:
- Fast loading: up to 500K objects/second/node
- Dense storage: 1 billion objects in 2 GB!
- Fast cube merging: typically seconds
- No indexes!
Example: Message Tracking Schema
Message Participant Address
PersonManager
Employees
Person
Address
Attachments
Message
Participants
Message
Address
Participants
Attachment
DQL Object Queries Builds on Lucene syntax
- Full text queries
Adds link paths
- Directed graph searches
- Quantifiers and filters
- Transitive searches
Other features
- Stateless paging
- Sorting
Examples:
- LastName = Smith AND NOT (FirstName : Jo*)
AND BirthDate = [1986 TO 1992]
- ALL(Participants).ANY(Address.WHERE
(Email='*.gmail.com')).Person.Department :
support
- Employees^(4).Office='San Jose’
DQL Aggregate Queries Metric functions
- COUNT, AVERAGE, MIN,
MAX, DISTINCT, ...
Multi-level grouping
Grouping functions
- BATCH, BOTTOM, FIRST,
LAST, LOWER, SETS,
TERMS, TOP, TRUNCATE,
UPPER, WHERE, ...
Examples:
- metric=COUNT(*), AVERAGE(Size),
MIN(Participants.Address.Person.Birthdate)
- metric=DISTINCT(Attachments.Extension);
groups=Tags,
Participants.Address.Person.Department;
query=Attachments.Size > 100000
- metric=AVERAGE(Size);
groups=TOP(10,Participants.Address.Email)
OLAP Data Loading
EventsEventsEvents
EventsEventsPeople
EventsEvents
Computers
EventsEventsDomains
Sources
OLAP Data Loading
Batch 1
EventsEventsEvents
EventsEventsPeople
EventsEvents
Computers
EventsEventsDomains
Batch 2
Batch 3
...
Sources Batches
Batch 4
OLAP Data Loading
Batch 1
EventsEventsEvents
EventsEventsPeople
EventsEvents
Computers
EventsEventsDomains
Batch 2
Batch 3
...
2014-03-01
2014-02-28
2014-02-27
Sources Batches Shards
Batch 4
Merge
OLAP Data Loading
Batch 1
EventsEventsEvents
EventsEventsPeople
EventsEvents
Computers
EventsEventsDomains
Batch 2
Batch 3
...
2014-03-01
2014-02-28
2014-02-27
Sources Batches Shards OLAP Store
Batch 4
Merge
Storing Batches
Field Values
ID 5amhvv7J2otBu48Z6PE5cA 7CgvDf5mOU78jNVc58eu cZpz2q4Jf8Rc2HK9Cg08 ...
Size 48120 5435 24220 ...
SendDate 1280246462000 1279354872112 1279357261413 ...
Priority 0 0 1 ...
Subject.txt ballades encash nautch
colloquy geared
nettlier outdoors culvert
hypothec winder
stolons ungot guiding
rupiahs outgone
...
Subject 1 2 0 ...
...
Data is sorted by object ID and stored as columnar, compressed blobs
Key Columns
Email/Message/2014-03-01/{Batch GUID}/ID [compressed data]
Email/Message/2014-03-01/{Batch GUID}/Size [compressed data]
Email/Message/2014-03-01/{Batch GUID}/SendDate [compressed data]
... ......
OLAP Table
Field Value Arrays
Compressed rows
Merging Batches
Key Columns
Email/Message/2014-03-01/ID [compressed data]
Email/Message/2014/03-01/Size [compressed data]
Email/Message/2014-03-01/SendDate [compressed data]
... ...
Email/Person/2014-03-01/ID [compressed data]
Email/Person/2014-03-01/FirstName [compressed data]
Email/Person/2014-03-01/LastName [compressed data]
... ...
Email/Address/2014-03-01/ID [compressed data]
Email/Address/2014-03-01/Person [compressed data]
Email/Address/2014/-03-01/Message [compressed data]
... ...
Email/Message/2014-02-28/ID [compressed data]
Email/Message/2014-02-28/Size [compressed data]
...
Batch #1: Shard 2014-03-01
Message Table
ID ...
Size ...
SendDate ...
...
Batch #2: Shard 2014-03-01
Message Table
ID ...
Size ...
SendDate ...
...
...
OLAP Store
Person Table
ID ...
FirstName ...
Lastname ...
...
Address Table
ID ...
Person ...
Messages ...
...
Person Table
ID ...
FirstName ...
Lastname ...
...
Address Table
ID ...
Person ...
Messages ...
...
Message table data
Shard 2014-03-01
Person table data
Shard 2014-03-01
Address table data
Shard 2014-03-01
Data for other shards
Does Merging Take Long?
OLAP Query Execution Example query:
- Count messages with Size between 1000-10000 and
HasBeenSent=false in shards 2014-03-01 to 2014-03-31
How many rows are read?
- 2 fields x 31 shards = 62 rows
- Typically represents millions of values
Value arrays are scanned in memory
Physical rows are read on “cold” start only
- Multiple caching levels for “warm” and “hot” data
1 Billion Objects in 2GB?Example Security Event (CSV format):
Fixed Fields Variable Fields
Computer Name MAILSERVER18 1 MAILSERVER18$
Log Name Security 2
Time Stamp Sun, 22 Jan 2013 08:09:50 UTC 3 Workstation
Type Success Audit 4 (0x0,0x142999A)
Source Security 5 3
Category Logon/Logoff 6 Kerberos
Event ID 540 7 Kerberos
User Domain NT AUTHORITY
User Name SYSTEM
User SID S-1-5-18
MAILSERVER18,Security,"Sun, 22 Jan 2013 08:09:50 UTC","Success Audit",Security,"Logon/Logoff", 540,"NT AUTHORITY",SYSTEM,S-1-5-18,7,MAILSERVER18$,,Workstation,"(0x0,0x142999A)",3,Kerberos,Kerberos
Events Schema
EventsInsertion
Strings
Fields:
• ComputerName (text)
• LogName (text)
• Timestamp (timestamp)
• Type (text)
• Source (text)
Fields:
• Index (integer)
• Value (text)
• Event (link)
Count: 115 Million Count: 880 Million
Params
Event (inverse)
• Category (text)
• EventID (integer)
• UserDomain (text)
• UserSID (text)
• Params (link)
Event Schema Load Load stats:
Total shards: 860
Total events: 114,572,247
Total ins strings: 879,529,753
Total objects: 994,102,000
Total load time: 2 hours, 2 minutes, 36 seconds (MacBook Air)
Space usage::nodetool -h localhost status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 127.0.0.1 1.96 GB 100.0% 860887ef-2027-431a-a425-c67a9445d0e6 -9176223118562734495 rack1
Demo 1) Count all Events in all shards
- 860 shards => 115M events
2) Find the top 5 hours-of-the-day when certain privileged events fail:
- Event IDs are any of 577, 681, 529
- Event type is ‘Failure Audit’
- Insertion string 8 is (0x0,0x3E7)
- Event occurred in first half of 2005 (181 shards)
Doradus OLAP Summary Advantages:
Simple REST API
All fields are searchable without indexes
Ad-hoc statistical searches
Support for graph-based queries
Near real time data warehousing
Dense storage = less hardware
Horizontally scalable when needed
Doradus OLAP Summary Good for applications where data:
Is continuous/streaming
Is structured to semi-structured
Can be loaded in batches
Is partitionable, especially by time
Is typically queried in a subset of shards
Emphasizes statistical queries
Thank You! Where to find Doradus
- Source: github.com/dell-oss/Doradus
- Downloads: search.maven.org
Contact me
- @randyguck