Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Basho Technologies | 1

Scaling Time Series Applications

Basho
Dorothy Pults – Product Evangelist @deepults
Tom Sigler – Solution Architect @tom_sigler

Databricks
Peyman Mohajerian - Solution Architect @mohajeri

BASHO TECHNOLOGIES
Distributed Systems Software for Big Data and IoT applications

2011 - Creators of Riak
• Riak KV: NoSQL Key Value database
• Riak TS: NoSQL Time Series database
• Integrations: Spark, Redis caching, Solr, Mesos, Riak S2

120+ employees

Global Offices • Seattle (HQ), Washington DC, London, Paris

1/3 of the Fortune 50

$1.3 trillion market spend on the Internet of Things in 2019

30 Billion Installed base of IoT endpoints in 2020

*Source IDC

56% have integrated IoT data

IoT is 24% of the average IT budget

20% decrease in downtime

21% increase in revenue

*Source: Vodafone IoT Barometer

CRITICAL SUCCESS FACTORS FOR IOT

• Explore new business models

• Address Key IoT challenges like Edge Analytics

• Provide comprehensive solutions

• Engage with a broader ecosystem

100TB DAILY – IOT AND WEATHER DATA

530M personal weather station reports each day

9M webcam uploads

2M crowd reports

> 20M IoT barometric reports

WEATHER FORECAST PREDICTS SALES

Ideal BERRY purchasing weather turns out to be low wind with temperatures below 80 degrees.

People are more likely to eat STEAK when it's warm out with higher winds but no rain, but not if it gets too hot.

EDGE ANALYTICS

• Edge Analytics

• Fog Computing

• Inverted Web

• Reverse CDN

NEW ECOSYSTEM – DATA PIPELINE

WHAT’S NEEDED TO SCALE FOR IoT

• A database optimized for IoT data

• Review your data life cycle

• Summations and aggregation

• Data expiration

• Data cleansing

• Processing close to devices

• Scale for unstructured metadata

TIME SERIES (TS) DATA

• Consists of successive observations made over a time interval

• Structured
• Time + State/Measurement
• Metadata/Context
• Frequency

TIME SERIES CHALLENGES AT SCALE

• Ingestion Velocity
• Data Volume
• Post Ingestion Workloads
  – Real time
  – Batch
• Lifecycle/Expiry

Riak TS Overview & Architecture

WHAT IS RIAK TS?

Riak TS is a distributed NoSQL key/value store optimized for time series data.

It provides a time series database solution that is extensible and scalable.

Riak TS is derived from Riak KV and adds the ability to co-locate data by composite primary key, including quanta, for efficient sequential read I/O operations.

Why Riak TS?
• Highly available
• Fault Tolerant
• Geo data locality
• Scalability
  – Operations
  – Real-time range query performance

RIAK TS MASTERLESS ARCHITECTURE

Riak has a masterless architecture. Every node is:
• homogeneous
• capable of serving all read and write requests
• responsible for a subset of data

RIAK TS: DISTRIBUTION AND CO-LOCATION

• Variation of Dynamo
• Composite key drives grouping on disk
  – Partition Key
  – Local Key (sort)
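The grouping described above can be pictured with a small Python sketch. It is a simplified model, not Riak's actual hashing code: the ring size of 64, the use of `repr()` to encode the key, and the sample rows are all assumptions made for illustration. It hashes the partition key onto a 160-bit SHA-1 ring to pick a partition, then sorts rows within that partition by local key.

```python
import hashlib

RING_SIZE = 64          # assumed number of partitions, for illustration only
RING_SPACE = 2 ** 160   # size of the SHA-1 hash space


def partition_for(partition_key):
    """Map a partition key tuple to a partition index on the ring."""
    # repr() is a stand-in encoding; the real system encodes keys differently.
    encoded = repr(partition_key).encode("utf-8")
    hashed = int.from_bytes(hashlib.sha1(encoded).digest(), "big")
    return hashed // (RING_SPACE // RING_SIZE)


# Hypothetical rows: (partition key, local key).
# The partition key here is (device, start of the time quantum).
rows = [
    (("abc-xxx-001-001", 1453224600000), ("abc-xxx-001-001", 1453224700000)),
    (("abc-xxx-001-001", 1453224600000), ("abc-xxx-001-001", 1453224610000)),
]

# Rows that share a partition key land in the same partition.
grouped = {}
for partition_key, local_key in rows:
    grouped.setdefault(partition_for(partition_key), []).append(local_key)

# Within a partition, the local key sets the sort order.
for local_keys in grouped.values():
    local_keys.sort()
```

Because both rows share one partition key, they end up next to each other in a single partition, which is what makes a range read sequential.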

RIAK: REPLICATION OF DATA

• Intra-cluster replication
• Multi-cluster replication

put(“bucket/key”)

RIAK: HIGH AVAILABILITY

Hinted handoff allows Riak nodes to temporarily take over storage operations for a failed node and update that node with changes when it comes back online.
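A toy Python model of the hinted handoff idea described above. The `Node` class and the `put` and `handoff` functions are hypothetical names invented for this sketch, and the logic is far simpler than what Riak does internally.

```python
class Node:
    """A toy node: its own data plus writes parked for other nodes."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # writes this node owns
        self.hints = {}   # {owner_name: {key: value}} held for failed owners


def put(owner, fallback, key, value):
    """Write to the owner, or park the write on the fallback if the owner is down."""
    if owner.up:
        owner.data[key] = value
    else:
        fallback.hints.setdefault(owner.name, {})[key] = value


def handoff(fallback, owner):
    """Once the owner is back, replay the parked writes and drop the hints."""
    if owner.up:
        owner.data.update(fallback.hints.pop(owner.name, {}))


a, b = Node("a"), Node("b")
a.up = False
put(a, b, "bucket/key", "v1")   # b accepts the write while a is down
a.up = True
handoff(b, a)                   # a catches up after it recovers
```

The write is never refused while node `a` is down, and `a` holds the value again after the handoff.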

RIAK TS: SCALABILITY

Riak TS scales in a near-linear fashion, so increasing the number of nodes in a cluster increases the number of reads and writes the cluster can handle in a predictable fashion.

Rebalancing of the cluster is a non-blocking operation, which doesn’t require downtime to perform.

If 10 nodes can serve 40,000 writes/second, then 20 nodes should serve 72,000+ writes/second.
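The two figures above imply a scaling efficiency that is easy to check. A minimal Python sketch; the function name is illustrative, and only the node counts and write rates come from the slide.

```python
def scaling_efficiency(base_nodes, base_rate, new_nodes, new_rate):
    """Ratio of the observed throughput to perfectly linear throughput."""
    ideal_rate = base_rate * (new_nodes / base_nodes)
    return new_rate / ideal_rate


# Figures from the slide: 10 nodes -> 40,000 writes/s, 20 nodes -> 72,000 writes/s.
efficiency = scaling_efficiency(10, 40_000, 20, 72_000)
print(f"{efficiency:.0%} of linear")  # prints "90% of linear"
```

Perfectly linear scaling would give 80,000 writes/second at 20 nodes, so 72,000 corresponds to the "near-linear" claim at roughly 90% efficiency.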

Adding a node:

> riak-admin cluster join riak@192.168.2.2
> riak-admin cluster plan
> riak-admin cluster commit

RIAK TS: QUERY

Range
select * from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Aggregate
select MIN(temperature), AVG(temperature), MAX(temperature) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Arithmetic
select (temperature * 2), (pressure - 1) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

• SQL Interface
• Arithmetic Support
• Aggregate
  – Count()
  – Sum()
  – Mean() & Avg()
  – Min() & Max()
  – STDDEV()
• Group By
• Expanded capabilities in future releases
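The queries on this slide all share the same shape: a time range plus an equality match on the device. A small Python helper can assemble that string before it is handed to a client. This is a sketch under assumptions: `build_range_query` is a hypothetical helper, not part of any Riak client, and it simply rejects values containing single quotes rather than guessing at escaping rules.

```python
def build_range_query(table, device_id, start_ms, end_ms, select="*"):
    """Build a range query in the same shape as the slide examples."""
    if not 0 < start_ms < end_ms:
        raise ValueError("expected 0 < start_ms < end_ms (epoch milliseconds)")
    if "'" in device_id:
        # Escaping rules are not covered here, so refuse rather than guess.
        raise ValueError("device_id must not contain single quotes")
    return (
        f"select {select} from {table} "
        f"where time > {start_ms} and time < {end_ms} "
        f"and deviceId = '{device_id}'"
    )


range_query = build_range_query(
    "GeoCheckin", "abc-xxx-001-001", 1453224610000, 1453225490000
)
aggregate_query = build_range_query(
    "GeoCheckin", "abc-xxx-001-001", 1453224610000, 1453225490000,
    select="MIN(temperature), AVG(temperature), MAX(temperature)",
)
```

Both bounds are exclusive, matching the `>` and `<` comparisons used on the slide.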

BATCH PROCESSING

• Real-time vs. Batch
• Spark Connector
• Parallel Extract

DATA LIFECYCLE

• Global expiry
• Per table expiry coming soon
• Spark batch for rollups/aggregation
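To show the shape of a rollup followed by expiry, here is a plain Python sketch. It is not a Spark job: the slide names Spark as the batch engine, and plain Python is swapped in only to keep the example self-contained. The function names and the readings are hypothetical.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def hourly_rollup(readings):
    """Collapse (timestamp_ms, value) readings into per-hour summaries."""
    buckets = defaultdict(list)
    for ts_ms, value in readings:
        buckets[ts_ms - ts_ms % HOUR_MS].append(value)
    return {
        hour: {
            "min": min(values),
            "avg": sum(values) / len(values),
            "max": max(values),
            "count": len(values),
        }
        for hour, values in sorted(buckets.items())
    }


def expire(readings, now_ms, ttl_ms):
    """Keep only the raw readings that are younger than the time-to-live."""
    return [(ts, v) for ts, v in readings if now_ms - ts < ttl_ms]


# Hypothetical raw readings: two in the first hour, one in the second.
raw = [(HOUR_MS, 10.0), (HOUR_MS + 60_000, 20.0), (2 * HOUR_MS, 5.0)]
rollups = hourly_rollup(raw)
recent = expire(raw, now_ms=2 * HOUR_MS + 1, ttl_ms=HOUR_MS)
```

The rollups keep the long-term summary while the raw rows age out, which is the lifecycle the slide describes.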

Time Series Data Modeling

SUPPORTED DDL DATA TYPES
• VARCHAR - Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.
• BOOLEAN - true or false (any case)
• TIMESTAMP - Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
• SINT64 - Signed 64-bit integer
• DOUBLE - This type does not comply with its IEEE specification: NaN (not a number) and INF (infinity) cannot be used.
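The type rules above can be turned into a client-side check before rows are written. A minimal Python sketch; `validate_value` is a hypothetical helper, and mapping each DDL type to a Python type this strictly is an assumption of the sketch, not a documented client behavior.

```python
import math


def validate_value(ddl_type, value):
    """Return True if a Python value satisfies the rule for the given DDL type."""
    kind = ddl_type.lower()
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if kind == "varchar":
        return isinstance(value, str)
    if kind == "boolean":
        return isinstance(value, bool)
    if kind == "timestamp":
        # Epoch milliseconds as an integer; zero is not a valid timestamp.
        return is_int and value != 0
    if kind == "sint64":
        return is_int and -2 ** 63 <= value < 2 ** 63
    if kind == "double":
        # NaN and infinity cannot be stored.
        return isinstance(value, float) and math.isfinite(value)
    raise ValueError(f"unknown DDL type: {ddl_type}")
```

For example, `validate_value("double", float("nan"))` and `validate_value("timestamp", 0)` both return `False`.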

THE KEY

Consists of:
• Partition Key (node/partition)
• Quantum (optional)
• Local Key (sort order)

RIAK TS: CREATE TABLE

CREATE TABLE GeoCheckin (
  deviceID varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (deviceID, quantum(time, 15, 'm')),
    deviceID, time
  )
)

Partition Key: (deviceID, quantum(time, 15, 'm'))
Local Key: deviceID, time
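To make the 15-minute quantum in the partition key concrete, the Python sketch below rounds a timestamp down to the start of its quantum. It is a simplified illustration, not Riak's internal code: the helper names are hypothetical, and the timestamps are the bounds used in the query examples earlier in the deck.

```python
# Milliseconds per quantum unit.
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def quantum_start(ts_ms, size, unit):
    """Round an epoch-millisecond timestamp down to the start of its quantum."""
    width = size * UNIT_MS[unit]
    return ts_ms - ts_ms % width


def partition_key(device_id, ts_ms):
    """Model the partition key (deviceID, quantum(time, 15, 'm'))."""
    return (device_id, quantum_start(ts_ms, 15, "m"))


# Both query bounds fall in the quantum that starts at 1453224600000,
# so rows between them share one partition key.
low = partition_key("abc-xxx-001-001", 1453224610000)
high = partition_key("abc-xxx-001-001", 1453225490000)
```

Rows for one device that fall in the same 15-minute window share a partition key, so a range query inside that window reads a single co-located group.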

MODELING THE KEY

Methodology:
• What questions does your application ask?
• How is the data presented?

USE CASE: PEDOMETER

• Questions
  – How many steps today (distance) for user?
  – How many steps per day this week for user?
  – Daily average?
  – Change in elevation?
• Key
  – Partition: UserID
  – Local: timestamp
  – Optimized for reads: quantum of 1 week
  – Optimized for writes: quantum of 1 day
• Fields
  – timestamp
  – steps
  – device_id
  – elevation
  – geohash
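The read versus write trade-off in the key choice above comes down to how many quanta a query touches. A minimal Python sketch, assuming quanta are aligned to the UNIX epoch; the function name is hypothetical.

```python
DAY_MS = 86_400_000
WEEK_MS = 7 * DAY_MS


def quanta_spanned(start_ms, end_ms, quantum_ms):
    """Count the quanta touched by the half-open range [start_ms, end_ms)."""
    first = start_ms // quantum_ms
    last = (end_ms - 1) // quantum_ms
    return last - first + 1


# "Steps per day this week": a seven-day window.
week_start = 3 * DAY_MS                                              # not aligned to a week quantum
daily = quanta_spanned(week_start, week_start + WEEK_MS, DAY_MS)     # 7
weekly = quanta_spanned(week_start, week_start + WEEK_MS, WEEK_MS)   # 2
```

With a one-week quantum the weekly question touches at most two groups of rows, while a one-day quantum spreads the same week over seven groups, which favors spreading writes over time instead.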

• Riak TS
• Python client
• Jupyter Notebook
• Pandas
• Matplotlib

THE DATA

Description              Field               Type
Sensor Status            status              varchar
Exit ID                  exitid              varchar
Timestamp                ts                  timestamp
Average Measured Time    avgMeasuredTime     sint64
Average Speed            avgSpeed            sint64
Median Measured Time     medianMeasuredTime  sint64
Number of Vehicles       vehicleCount        sint64
Sensor ID                id                  sint64
Report ID                report_id           sint64

• Vehicle traffic data
• City of Aarhus, Denmark
• Two sensors placed at each exit
• 5 min intervals

Spark and Riak: In-situ analytics beyond Hadoop

Who is Databricks

Why Us
• Creators of Apache Spark. Contribute 75% of the code - 10x more than others
• Trained 20K Spark users
• Largest number of customers deploying Spark (200+)

Our Product
• Just-in-Time Data Platform, powered by Apache Spark.
• Empower your organization to swiftly build and deploy advanced analytics with Spark.

• Open source data processing engine built around speed, ease of use, and sophisticated analytics
• Largest open source data project with 1000+ contributors

UNIFIED ENGINE ACROSS DIVERSE WORKLOADS & ENVIRONMENTS

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

APACHE SPARK ENGINE

ANALOGY: EVOLUTION OF CONSUMER ELECTRONICS

First Cellular Phones

Specialized Devices

Unified Device

HISTORY REPEATS: FASTER, EASIER TO USE, UNIFIED

First Distributed Processing Engine

Specialized Data Processing Engines

Unified Data Processing Engine

Google Trends: Hadoop vs. Spark

Analytics in-situ
• SQL: Enable SQL analytics over Riak
• Streaming: Use Riak to store streaming data
• ML: Use Riak to serve results generated by Spark

Riak Spark Connector

The user application contacts the coordinating node, which returns the locations of the data using cluster replication and availability information. Then “N” Spark workers open “N” parallel connections to different nodes, which allows the application to retrieve the desired dataset “N” times faster without generating “hot spots”.
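The parallel extract pattern described above can be imitated in plain Python. This sketch uses a thread pool in place of Spark workers and an in-memory list in place of Riak nodes, so it shows only the split-and-fetch shape, not the connector itself; every name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stored timestamps standing in for rows held by the cluster.
FAKE_ROWS = list(range(0, 60_000, 500))


def split_range(start_ms, end_ms, n):
    """Split [start_ms, end_ms) into n contiguous sub-ranges."""
    span = end_ms - start_ms
    bounds = [start_ms + span * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def fetch_slice(bounds):
    """Stand-in for one worker reading its own slice from one node."""
    low, high = bounds
    return [ts for ts in FAKE_ROWS if low <= ts < high]


def parallel_extract(start_ms, end_ms, n):
    """Fetch n slices concurrently and stitch them back together in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = list(pool.map(fetch_slice, split_range(start_ms, end_ms, n)))
    return [row for part in parts for row in part]


rows = parallel_extract(0, 60_000, 4)
```

Because each worker reads a disjoint slice, no single node serves the whole request, which is the "hot spot" avoidance the slide refers to.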

Build a PoC on Databricks today. Professional services and training also available.

Contact sales@databricks.com

Sign up for a trial at https://databricks.com/try-databricks

Thank You!

If you have any questions please reach out to us at basho.com/contact
