Need For Time Series Database
Pramit Choudhary, ML Engineer @eHarmony
Motivation
Speed Matters
We want to know what’s happening NOW
Users access data through different mobile platforms and have no patience
Data is scattered around
MongoDB, Voldemort, Netezza, Hive, Whisper, maybe more
For cross-platform analytical work, data is still moved around (a cause of worry)
Need to simplify the database tech stack
Complexity increases as we start tracking more metrics for mobile devices
Data-Analytics Use-cases:
Most of the time we study data patterns over a period of time, e.g.
1. What are the probable times for a user to get matches? => need to start tracking the amount of time the user spends during the day
2. Feature exploration and extraction: what other features could we possibly use? => probably more t/f/z/p statistical tests
Re-CAP
Consistency: Data remains consistent after the execution of an operation, e.g. after an update all clients see the same state of the data.
Availability: Always on (no downtime).
Partition Tolerance: The system continues to function even when nodes cannot communicate with one another.
Different Combinations
CA: Single-site cluster, all nodes are always in contact, e.g. SQL-type RDBMS
CP: Some data may not be accessible, but the rest is consistent and accurate, e.g. MongoDB, HBase, Redis
AP: Available under partitioning, but no guarantee on consistency, e.g. Cassandra, Riak, DynamoDB
NoSQL World
• Key-Value Store (Redis, Riak)
• Document Store (MongoDB, Couchbase)
• Column Store (Cassandra, HBase, OpenTSDB)
• Graph Store (Neo4j)
Introducing a new DB
OpenTSDB
Author: Benoit Sigoure @ StumbleUpon
What is OpenTSDB?
Open Source Time Series Database
Store trillions of data points
Sucks up all data and keeps going
Never loses precision
Scales using HBase
Note: OpenTSDB is used here as an example; KairosDB or InfluxDB may give better results. They work on similar principles.
Author: Benoit Sigoure and Chris Larsen
Use-Cases
MongoDB and Couchbase: user profiles, product catalogs, geospatial, financial products, social media, digital content, gaming, metadata, events, bills and invoices
HBase and Cassandra: structured, semi-structured and unstructured data, full table scans, read-intensive operations, time series interval data, geospatial data
Other Options
Author: Oliver Hankeln
What are Time Series?
Time Series: Data points for an identity over time
Typical Identity:
Dotted string: web01.sys.cpu.user.0 (no concept of filters)
OpenTSDB Identity:
Metric: sys.cpu.user
Tags (name/value pairs) act as filters: host=web01 cpu=0
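To illustrate why tags act as filters where a dotted string does not, here is a minimal Python sketch. The `Series` class and the `make_series` and `select` helpers are hypothetical names invented for this example; they are not part of OpenTSDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """A time series identity: a metric name plus tag name/value pairs."""
    metric: str
    tags: tuple  # sorted (name, value) pairs, so the identity is hashable


def make_series(metric, **tags):
    """Build a Series with its tags stored in a stable, sorted order."""
    return Series(metric, tuple(sorted(tags.items())))


def select(series_list, metric, **wanted):
    """Return the series for `metric` whose tags match every wanted pair."""
    return [
        s for s in series_list
        if s.metric == metric
        and all(dict(s.tags).get(name) == value for name, value in wanted.items())
    ]


all_series = [
    make_series("sys.cpu.user", host="web01", cpu="0"),
    make_series("sys.cpu.user", host="web01", cpu="1"),
    make_series("sys.cpu.user", host="web02", cpu="0"),
]

# Filter on any tag independently; a dotted string would need pattern matching.
web01 = select(all_series, "sys.cpu.user", host="web01")
cpu0 = select(all_series, "sys.cpu.user", cpu="0")
print(len(web01), len(cpu0))
```

With a dotted string such as web01.sys.cpu.user.0, the same questions would depend on the position of each component in the name.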
Author: Benoit Sigoure and Chris Larsen
What are Time Series?
Data Point:
Metric + Tags
+ Value: 42
+ Timestamp: 123
"sys.cpu.user 1234567890 42 host=web01 cpu=0"
Metric name: sys.cpu.user
Timestamp: 1234567890
Metric value: 42
Filter 1: host=web01
Filter 2: cpu=0
Author: Benoit Sigoure and Chris Larsen
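The data point line above can be split into its parts with a few lines of Python. This is a minimal sketch: the `DataPoint` type and the `parse_data_point` helper are hypothetical names for this example, and treating the value as a float and requiring at least one tag are assumptions of the sketch.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    metric: str
    timestamp: int
    value: float
    tags: dict


def parse_data_point(line):
    """Parse 'metric timestamp value tag=value ...' into a DataPoint."""
    parts = line.split()
    if len(parts) < 4:
        # Assumption for this sketch: every point carries at least one tag.
        raise ValueError("expected metric, timestamp, value and at least one tag")
    metric, timestamp, value = parts[0], int(parts[1]), float(parts[2])
    # Each remaining token is a name=value pair.
    tags = dict(token.split("=", 1) for token in parts[3:])
    return DataPoint(metric, timestamp, value, tags)


point = parse_data_point("sys.cpu.user 1234567890 42 host=web01 cpu=0")
print(point)
```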
Architecture
Author: Benoit Sigoure and Chris Larsen
Another View
Author: slideshare
About TSDs
Write throughput
TSDs are CPU bound
Worst case: can handle 2000 points/sec on an old 2006 dual-core CPU
Read throughput
Depends on the cardinality of a metric
Depends on the timespan and number of data points retrieved
Reliability
No single point of failure: there is no concept of a master daemon
Dependency: needs HBase with ZooKeeper
Has a single point of failure if running over HDFS, but none with respect to the database
More info on the Wiki : http://opentsdb.net/faq.html
Simplistic View of the Table
Without OpenTSDB: HBase Table Representation
Author: Oliver Hankeln
OpenTSDB Magic
“Compact columns by concatenation”
Author: Oliver Hankeln
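The idea of compacting columns by concatenation can be sketched in Python. This is a simplified illustration, not OpenTSDB's actual on-disk encoding: it assumes each cell in a row is keyed by a 2-byte offset in seconds and holds an 8-byte float, and the `compact` and `expand` helpers are hypothetical.

```python
import struct


def compact(cells):
    """Merge many (offset_seconds, value) cells into one qualifier/value pair."""
    ordered = sorted(cells)
    # Concatenate all 2-byte offsets into one qualifier ...
    qualifier = b"".join(struct.pack(">H", offset) for offset, _ in ordered)
    # ... and all 8-byte values, in the same order, into one cell value.
    value = b"".join(struct.pack(">d", v) for _, v in ordered)
    return qualifier, value


def expand(qualifier, value):
    """Recover the individual (offset_seconds, value) cells."""
    offsets = struct.unpack(">%dH" % (len(qualifier) // 2), qualifier)
    values = struct.unpack(">%dd" % (len(value) // 8), value)
    return list(zip(offsets, values))


cells = [(30, 42.0), (0, 41.5), (15, 43.25)]
qualifier, value = compact(cells)
print(len(qualifier), len(value))
print(expand(qualifier, value))
```

Storing one wide cell instead of many small ones avoids repeating the row key for every data point, which is where the space saving comes from.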
• Tags are put at the end of the row key
• Timestamp is normalized on 1-hour boundaries
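A rough Python sketch of how such a row key could be assembled follows. The UID tables, the 3-byte ID width, the 4-byte base timestamp and sorting tags by name are all assumptions made for illustration, and `row_key` is a hypothetical helper rather than an OpenTSDB function.

```python
import struct

# Hypothetical lookup tables mapping names to small integer IDs.
METRIC_UIDS = {"sys.cpu.user": 1}
TAGK_UIDS = {"host": 1, "cpu": 2}
TAGV_UIDS = {"web01": 1, "0": 2}

UID_WIDTH = 3  # assumed fixed width of every ID, in bytes


def uid(number):
    """Encode an ID as a fixed-width big-endian byte string."""
    return number.to_bytes(UID_WIDTH, "big")


def row_key(metric, timestamp, tags):
    """Return (row key bytes, offset in seconds within the hour)."""
    base = timestamp - (timestamp % 3600)  # normalize to the 1-hour boundary
    key = uid(METRIC_UIDS[metric]) + struct.pack(">I", base)
    # Tags go at the end of the key, in a stable order (by name in this sketch).
    for name in sorted(tags):
        key += uid(TAGK_UIDS[name]) + uid(TAGV_UIDS[tags[name]])
    return key, timestamp - base


key, offset = row_key("sys.cpu.user", 1234567890, {"host": "web01", "cpu": "0"})
print(len(key), offset)
```

Every data point of the same series within the same hour maps to the same row key; only the offset differs, so it can serve as the column within that row.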
Row Key Size
Author: Oliver Hankeln
Benchmarks
Load Phase
Heavy Read
Heavy Read
Heavy Range Scan
Heavy Inserts
Is it being extensively used?
OVH: #3 largest cloud/hosting provider. Monitors everything, including network performance, resource utilization, application performance and customer-facing metrics
35 servers, 100k writes/s, 25 TB raw data
5-day moving window of HBase snapshots
Redis cache on top for customer-facing data
Yahoo: Monitoring application performance and statistics (15 servers, 280k writes/s)
Arista Networks: High-performance network monitoring
5k writes/s, uses Varnish for caching
MapR
“OpenTSDB is a widely used database intended to store and analyze time-series data. Originally designed for only data center monitoring, poor ingest performance had limited the expansion of its use. This benchmark demonstrates a viable option for new applications, such as IoT and other real-time data-analysis applications, using OpenTSDB running on MapR.” Ted Dunning, Chief Application Architect
Others
Some References
Book: Time Series Databases, by Ted Dunning and Ellen Friedman ( https://www.dropbox.com/s/c1zj0l0q0qmfvo8/Time_Series_Databases.pdf?dl=0 )
Benchmarks: https://www.dropbox.com/s/g67yoxwabwb5s0g/PerformanceBenchMark.pdf?dl=0
Lessons learned: http://www.slideshare.net/cloudera/4-opentsdb-hbasecon
Some Comparisons: http://prometheus.io/docs/introduction/comparison/
Demo
Questions?