Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Basho Technologies | 1

Scaling Time Series Applications

Basho
Dorothy Pults – Product Evangelist @deepults
Tom Sigler – Solution Architect @tom_sigler

Databricks
Peyman Mohajerian - Solution Architect @mohajeri

BASHO TECHNOLOGIES
Distributed Systems Software for Big Data and IoT applications

2011 - Creators of Riak
• Riak KV: NoSQL Key Value database
• Riak TS: NoSQL Time Series database
• Integrations: Spark, Redis caching, Solr, Mesos, Riak S2

120+ employees

Global Offices • Seattle (HQ), Washington DC, London, Paris

1/3 of the Fortune 50

$1.3 trillion Internet of Things market spend in 2019

30 billion installed base of IoT endpoints in 2020

*Source: IDC

56% have integrated IoT data

IoT is 24% of the average IT budget

20% decrease in downtime

21% increase in revenue

*Source: Vodafone IoT Barometer

CRITICAL SUCCESS FACTORS FOR IOT

• Explore new business models

• Address Key IoT challenges like Edge Analytics

• Provide comprehensive solutions

• Engage with a broader ecosystem

100TB DAILY – IOT AND WEATHER DATA

530M personal weather station reports each day

9M webcam uploads

2M crowd reports

> 20M IoT barometric reports

WEATHER FORECAST PREDICTS SALES

Ideal BERRY purchasing weather turns out to be low wind with temperatures below 80 degrees.

People are more likely to eat STEAK when it's warm out with higher winds but no rain, but not if it gets too hot.

EDGE ANALYTICS

• Edge Analytics

• Fog Computing

• Inverted Web

• Reverse CDN

NEW ECOSYSTEM – DATA PIPELINE

WHAT’S NEEDED TO SCALE FOR IoT

• A database optimized for IoT data

• Review your data life cycle

• Summations and aggregation

• Data expiration

• Data cleansing

• Processing close to devices

• Scale for unstructured metadata

TIME SERIES (TS) DATA

• Consists of successive observations made over a time interval

• Structured
• Time + State/Measurement
• Metadata/Context
• Frequency

TIME SERIES CHALLENGES AT SCALE

• Ingestion Velocity
• Data Volume
• Post Ingestion Workloads
  – Real time
  – Batch
• Lifecycle/Expiry

Riak TS Overview & Architecture

WHAT IS RIAK TS?

Riak TS is a distributed NoSQL key/value store optimized for time series data.

It provides a time series database solution that is extensible and scalable.

Riak TS is derived from Riak KV and adds the ability to co-locate data by composite primary key, including quanta, for efficient sequential read I/O operations.

Why Riak TS?
• Highly available
• Fault Tolerant
• Geo data locality
• Scalability
  – Operations
  – Real-time range query performance

RIAK TS MASTERLESS ARCHITECTURE

Riak has a masterless architecture. Every node is:
• homogeneous
• capable of serving all read and write requests
• responsible for a subset of data

RIAK TS: DISTRIBUTION AND CO-LOCATION

• Variation of Dynamo
• Composite key drives grouping on disk
  – Partition Key
  – Local Key (sort)
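One way to picture how a partition key lands on a node in a Dynamo-style ring is the toy sketch below. It assumes a 64-partition ring, SHA-1 hashing of the key's text form, and round-robin partition ownership; the function names and node names are hypothetical, and Riak's real hashing and ownership logic differ in detail.

```python
import hashlib

RING_SIZE = 2 ** 160      # size of the SHA-1 output space
NUM_PARTITIONS = 64       # assumed number of ring partitions for this sketch


def partition_for(partition_key: tuple) -> int:
    """Hash a partition key onto the ring and return its partition index."""
    digest = hashlib.sha1(repr(partition_key).encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position * NUM_PARTITIONS // RING_SIZE


def node_for(partition: int, nodes: list) -> str:
    """Toy ownership rule: partitions are dealt to nodes round-robin."""
    return nodes[partition % len(nodes)]


nodes = ["riak1", "riak2", "riak3", "riak4", "riak5"]   # hypothetical node names
key = ("abc-xxx-001-001", 1453224600000)                # (device, quantum start in ms)
p = partition_for(key)
print(f"partition {p} is owned by {node_for(p, nodes)}")
```

Because every row with the same partition key hashes to the same partition, those rows end up together, and the local key then decides their sort order within that group.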

RIAK: REPLICATION OF DATA

• Intra-cluster replication
• Multi-cluster replication

put(“bucket/key”)

RIAK: HIGH AVAILABILITY

Hinted handoff allows Riak nodes to temporarily take over storage operations for a failed node and update that node with changes when it comes back online.

RIAK TS: SCALABILITY

Riak TS scales in a near-linear fashion, so increasing the number of nodes in a cluster increases the number of reads and writes the cluster can handle in a predictable fashion.

Rebalancing the cluster is a non-blocking operation and requires no downtime.

If 10 nodes can serve 40,000 writes/second, then 20 nodes should serve 72,000+ writes/second.
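The figures above imply a scaling efficiency that can be checked with a few lines of arithmetic. This is only a sketch: the helper name `scaling_efficiency` is hypothetical, and "efficiency" here simply means observed throughput divided by perfectly linear throughput.

```python
def scaling_efficiency(base_nodes, base_rate, scaled_nodes, scaled_rate):
    """Ratio of observed throughput to perfectly linear throughput."""
    ideal_rate = base_rate * scaled_nodes / base_nodes
    return scaled_rate / ideal_rate


# Numbers from the slide: 10 nodes at 40,000 writes/s, 20 nodes at 72,000 writes/s.
efficiency = scaling_efficiency(10, 40_000, 20, 72_000)
print(f"{efficiency:.0%} of linear")   # prints "90% of linear"
```

Perfectly linear scaling would give 80,000 writes/second at 20 nodes, so 72,000+ corresponds to roughly 90% or better.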

Adding a node:

> riak-admin cluster join riak@192.168.2.2

> riak-admin cluster plan

> riak-admin cluster commit

RIAK TS: QUERY

• SQL Interface
• Arithmetic Support
• Aggregate
  – Count()
  – Sum()
  – Mean() & Avg()
  – Min() & Max()
  – STDDEV()
• Group By
• Expanded capabilities in future releases

Range

select * from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Aggregate

select MIN(temperature), AVG(temperature), MAX(temperature) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Arithmetic

select (temperature * 2), (pressure - 1) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'
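The time bounds in these queries are UNIX epoch milliseconds, which are awkward to write by hand. The sketch below shows one way an application might build the range query from timezone-aware datetimes. The helper names `to_epoch_ms` and `range_query` are hypothetical, and the step of sending the string to Riak TS through a client is deliberately left out.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def range_query(table: str, device_id: str, start: datetime, end: datetime,
                columns: str = "*") -> str:
    """Build a range query shaped like the ones on the slide.

    Plain string formatting is used for illustration only; real code should
    validate or escape device_id before interpolating it.
    """
    return (
        f"select {columns} from {table} "
        f"where time > {to_epoch_ms(start)} and time < {to_epoch_ms(end)} "
        f"and deviceId = '{device_id}'"
    )


start = datetime(2016, 1, 19, 17, 30, 10, tzinfo=timezone.utc)
end = datetime(2016, 1, 19, 17, 44, 50, tzinfo=timezone.utc)
print(range_query("GeoCheckin", "abc-xxx-001-001", start, end))
```

With these two datetimes the output matches the Range query shown on the slide; passing `columns="MIN(temperature), AVG(temperature), MAX(temperature)"` gives the Aggregate form.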

BATCH PROCESSING

• Real-time vs. Batch
• Spark Connector
• Parallel Extract

DATA LIFECYCLE

• Global expiry
• Per table expiry coming soon
• Spark batch for rollups/aggregation

Time Series Data Modeling

SUPPORTED DDL DATA TYPES

• VARCHAR - Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.
• BOOLEAN - true or false (any case)
• TIMESTAMP - Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
• SINT64 - Signed 64-bit integer
• DOUBLE - This type does not comply with its IEEE specification: NaN (not a number) and INF (infinity) cannot be used.
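The type rules above can be checked on the client side before a write is attempted. The sketch below validates a row for the GeoCheckin table used elsewhere in this deck. The function name `validate_geocheckin_row` is hypothetical, and treating negative timestamps as invalid is an assumption of this sketch; the slide only rules out zero.

```python
import math


def validate_geocheckin_row(row: dict) -> list:
    """Return a list of problems found in a row; an empty list means it looks OK."""
    problems = []

    # VARCHAR NOT NULL columns must be present and be strings.
    for column in ("deviceID", "weather"):
        if not isinstance(row.get(column), str):
            problems.append(f"{column}: expected a non-null string")

    # TIMESTAMP: integer epoch milliseconds, and zero is not valid.
    ts = row.get("time")
    if isinstance(ts, bool) or not isinstance(ts, int) or ts <= 0:
        problems.append("time: expected a positive integer (epoch ms)")

    # DOUBLE (nullable): NaN and INF cannot be stored.
    temp = row.get("temperature")
    if temp is not None:
        is_number = isinstance(temp, (int, float)) and not isinstance(temp, bool)
        if not is_number or not math.isfinite(temp):
            problems.append("temperature: expected a finite number")

    return problems


good = {"deviceID": "abc-xxx-001-001", "time": 1453224610000,
        "weather": "sunny", "temperature": 71.5}
bad = {"deviceID": "abc-xxx-001-001", "time": 0,
       "weather": "sunny", "temperature": float("nan")}
print(validate_geocheckin_row(good))
print(validate_geocheckin_row(bad))
```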

THE KEY

Consists of:
• Partition Key (node/partition)
• Quantum (optional)
• Local Key (sort order)

RIAK TS: CREATE TABLE

CREATE TABLE GeoCheckin (
  deviceID varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (deviceID, quantum(time, 15, 'm')),
    deviceID, time
  )
)

Partition Key: (deviceID, quantum(time, 15, 'm'))

Local Key: deviceID, time
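To see what `quantum(time, 15, 'm')` does to a timestamp, the sketch below floors an epoch-millisecond value to the start of its 15-minute window. This illustrates the grouping idea only; the helper names are hypothetical and the code is not Riak's internal implementation.

```python
QUANTUM_MS = 15 * 60 * 1000   # 15 minutes, matching quantum(time, 15, 'm')


def quantum_start(ts_ms: int, quantum_ms: int = QUANTUM_MS) -> int:
    """Floor an epoch-ms timestamp to the start of its quantum."""
    return ts_ms - (ts_ms % quantum_ms)


def partition_key(device_id: str, ts_ms: int) -> tuple:
    """Rows sharing this tuple are grouped together."""
    return (device_id, quantum_start(ts_ms))


# Both bounds of the example query in this deck fall into the same window.
print(partition_key("abc-xxx-001-001", 1453224610000))
print(partition_key("abc-xxx-001-001", 1453225490000))
# Ten seconds later the next 15-minute window begins.
print(partition_key("abc-xxx-001-001", 1453225500000))
```

Rows for one device within one 15-minute window share a partition key, so a range query inside that window can be served by sequential reads from one place.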

MODELING THE KEY

Methodology:
• What questions does your application ask?
• How is the data presented?

USE CASE: PEDOMETER

• Questions
  – How many steps today (distance) for user?
  – How many steps per day this week for user?
  – Daily average?
  – Change in elevation?

• Key
  – Partition: UserID
  – Local: timestamp
  – Optimized for reads: quantum of 1 week
  – Optimized for writes: quantum of 1 day

• Fields
  – timestamp
  – steps
  – device_id
  – elevation
  – geohash
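The read/write trade-off between a 1-week and a 1-day quantum can be made concrete by counting how many quanta a "this week" query touches. The sketch below assumes quanta are aligned to the UNIX epoch and that the query uses exclusive upper bounds; the helper name `quanta_spanned` is hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000


def quanta_spanned(start_ms: int, end_ms: int, quantum_ms: int) -> int:
    """Count the quanta touched by the range start_ms <= t < end_ms."""
    first = start_ms // quantum_ms
    last = (end_ms - 1) // quantum_ms
    return last - first + 1


# One calendar week: Monday 2016-01-18 00:00 UTC up to the following Monday.
start = 16818 * DAY_MS          # 16818 days after 1970-01-01
end = start + 7 * DAY_MS

print(quanta_spanned(start, end, DAY_MS))       # quantum of 1 day
print(quanta_spanned(start, end, 7 * DAY_MS))   # quantum of 1 week
```

Under these assumptions the weekly query touches 7 one-day quanta but only 2 one-week quanta (2 rather than 1 because epoch-aligned weeks start on a Thursday). Fewer quanta means fewer places to read from, while smaller quanta spread a user's writes across more partitions.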

DEMO

• Riak TS
• Python client
• Jupyter Notebook
• Pandas
• Matplotlib

THE DATA

Description             Field               Type
Sensor Status           status              varchar
Exit ID                 exitid              varchar
Timestamp               ts                  timestamp
Average Measured Time   avgMeasuredTime     sint64
Average Speed           avgSpeed            sint64
Median Measured Time    medianMeasuredTime  sint64
Number of Vehicles      vehicleCount        sint64
Sensor ID               id                  sint64
Report ID               report_id           sint64

• Vehicle traffic data
• City of Aarhus, Denmark
• Two sensors placed at each exit
• 5 min intervals

Spark and Riak: In-situ analytics beyond Hadoop

Who is Databricks

Why Us

• Creators of Apache Spark. Contribute 75% of the code - 10x more than others

• Trained 20K Spark users

• Largest number of customers deploying Spark (200+)

Our Product

• Just-in-Time Data Platform – powered by Apache Spark.

• Empower your organization to swiftly build and deploy advanced analytics with Spark.

Apache Spark: an open source data processing engine built around speed, ease of use, and sophisticated analytics

The largest open source data project, with 1000+ contributors

UNIFIED ENGINE ACROSS DIVERSE WORKLOADS & ENVIRONMENTS

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

APACHE SPARK ENGINE

ANALOGY: EVOLUTION OF CONSUMER ELECTRONICS

First Cellular Phones → Specialized Devices → Unified Device

HISTORY REPEATS: FASTER, EASIER TO USE, UNIFIED

First Distributed Processing Engine → Specialized Data Processing Engines → Unified Data Processing Engine

Google Trends: Hadoop vs. Spark

Analytics in-situ

SQL, Streaming, ML

• Enable SQL analytics over Riak
• Use Riak to store streaming data
• Use Riak to serve results generated by Spark

Riak Spark Connector

The user application contacts the coordinating node, which returns the locations of the data using cluster replication and availability information. Then N Spark workers open N parallel connections to different nodes, which allows the application to retrieve the desired dataset up to N times faster without generating "hot spots".
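The parallel-extract idea can be imitated in plain Python, without Spark, to show its shape. In the sketch below a time range is cut into N contiguous sub-ranges and each is fetched by its own worker thread. Splitting by time range is an assumption of this sketch, and `fetch` is a hypothetical stand-in that fabricates one row per second instead of talking to a Riak node.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start_ms: int, end_ms: int, n: int) -> list:
    """Split [start_ms, end_ms) into n contiguous, near-equal sub-ranges."""
    bounds = [start_ms + (end_ms - start_ms) * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def fetch(sub_range: tuple) -> list:
    """Hypothetical worker: pretend to read one sub-range from one node."""
    lo, hi = sub_range
    return list(range(lo, hi, 1000))   # fake rows: one timestamp per second


def parallel_extract(start_ms: int, end_ms: int, n: int) -> list:
    """Fetch all sub-ranges concurrently and stitch the results back in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = list(pool.map(fetch, split_range(start_ms, end_ms, n)))
    return [row for part in parts for row in part]


rows = parallel_extract(0, 880_000, 4)
print(len(rows), rows[:3])
```

Because the sub-ranges do not overlap and `pool.map` preserves input order, the stitched result equals what a single sequential read would return.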

Demo

Build a PoC on Databricks today. Professional services and training also available.

Contact sales@databricks.com

or

Sign up for a trial at https://databricks.com/try-databricks

Thank You!

If you have any questions please reach out to us at basho.com/contact
