GPU-Accelerated Analytics on your Data Lake.

Data Lake

@blazingdb

Data Swamp

ETL Hell

Common Data Layer

Simplify Data Storage

SCHEMA

METADATA

DATA

SQL Warehouse on Data Lake

BlazingDB – How it works

• Compression/Decompression

• Filtering (Predicate Pushdown)

• Aggregations

• Transformations

• Joins

• Sorting/Ordering

• RAM Cache (Hot)

• Disk Cache (Medium)

• HDD

• SSD

Data Lake

Local Disk

HDFS

AWS S3
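
The cache tiers listed above (RAM cache as hot, disk cache as medium, the data lake itself as cold) can be pictured as a read path that tries the fastest tier first and fills the faster tiers on a miss. The sketch below is only an illustration of that idea, not BlazingDB's implementation: the `TieredReader` class, its method names, and the dictionary-backed tiers are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Tuple


class TieredReader:
    """Hypothetical read path: RAM cache (hot) -> disk cache (medium) -> data lake (cold)."""

    def __init__(self, fetch_from_lake: Callable[[str], bytes]) -> None:
        self.ram: Dict[str, bytes] = {}    # stand-in for the hot RAM cache
        self.disk: Dict[str, bytes] = {}   # stand-in for the medium disk cache (HDD/SSD)
        self.fetch_from_lake = fetch_from_lake  # stand-in for local disk, HDFS, or S3

    def read(self, key: str) -> Tuple[bytes, str]:
        """Return (data, tier), where tier names the level that served the read."""
        if key in self.ram:
            return self.ram[key], "hot"
        if key in self.disk:
            data = self.disk[key]
            self.ram[key] = data           # promote to RAM for the next read
            return data, "medium"
        data = self.fetch_from_lake(key)   # cold: go all the way to the data lake
        self.disk[key] = data              # populate both cache tiers
        self.ram[key] = data
        return data, "cold"


# Toy data lake with a single (made-up) object key.
lake = {"lineitem/part-0": b"columnar bytes"}
reader = TieredReader(lake.__getitem__)

first = reader.read("lineitem/part-0")     # nothing cached yet
second = reader.read("lineitem/part-0")    # served from RAM
reader.ram.clear()                         # simulate RAM eviction or a restart
third = reader.read("lineitem/part-0")     # served from the disk cache
print(first[1], second[1], third[1])
```

The three reads correspond to the Cold, Hot, and Medium (disk cache only) states that the demo timings distinguish.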

BlazingDB Multi-nodal Cluster

Shared Data Architecture

Data Lake

The Nays

No Vendor Lock-in

No Consistency Management

No BlazingDB-Specific ETL

No Duplication

No Ingest

The Yays

High Concurrency

Data Sharing (Across Clusters and Other Tools)

Multi-Terabyte Queries

Scalable, On-Demand Data Warehouse

Incredibly Fast SQL

DEMO

Demo - Architecture

HDFS on Azure

Azure GPU Servers

• NC24 V1

• 4 Servers

Queries: BlazingDB 4-Node Query Times in Seconds (Lower is Better)

Cache state                Query 1   Query 2   Query 3   Query 4   Query 5
Cold                         142.1     281.1     380.5     135.5        46
Medium (disk cache only)      73.6     154.1     251.8      73.8      46.3
Hot                             72      63.1        14      12.2      14.9

Query 1

Chart: query time in seconds for Cold, Medium (disk cache only), and Hot runs

select l_returnflag, l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
    sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(l_quantity) as count_order
from lineitem
where l_shipdate <= '1995-06-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Data Points

• 6 billion row table

• Many aggregations/transformations
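
To make the shape of Query 1 concrete, here is a small sketch that runs a trimmed version of it (filter on `l_shipdate`, group by the two flags, several aggregates) against a four-row toy table. It uses Python's built-in `sqlite3` purely as a stand-in SQL engine; the rows are invented for illustration and nothing here reflects how BlazingDB executes the query on 6 billion rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table lineitem ("
    " l_returnflag text, l_linestatus text, l_quantity real,"
    " l_extendedprice real, l_discount real, l_tax real, l_shipdate text)"
)
# Invented rows: two groups, plus one row that falls after the date cutoff.
rows = [
    ("A", "F", 10.0, 100.0, 0.5, 0.0, "1994-01-01"),
    ("A", "F", 30.0, 300.0, 0.5, 0.0, "1995-01-01"),
    ("N", "O", 5.0, 50.0, 0.0, 0.0, "1995-05-01"),
    ("N", "O", 7.0, 70.0, 0.0, 0.0, "1996-01-01"),  # filtered out by the where clause
]
conn.executemany("insert into lineitem values (?, ?, ?, ?, ?, ?, ?)", rows)

# Trimmed version of Query 1: same filter, grouping, and ordering, fewer aggregates.
query1 = """
select l_returnflag, l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       count(l_quantity) as count_order
from lineitem
where l_shipdate <= '1995-06-01'
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus
"""
result = conn.execute(query1).fetchall()
for row in result:
    print(row)
```

Each output row is one (`l_returnflag`, `l_linestatus`) group, so the result stays tiny no matter how large `lineitem` is; the cost is in scanning and aggregating the input.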

Query 2

Chart: query time in seconds for Cold, Medium (disk cache only), and Hot runs

select lineitem.l_orderkey,
    sum(lineitem.l_extendedprice*(1-lineitem.l_discount)) as revenue,
    orders.o_orderdate, orders.o_shippriority
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
where customer.c_mktsegment = 'BUILDING'
    and orders.o_orderdate < '1995-03-15'
    and lineitem.l_shipdate > '1995-03-15'
group by lineitem.l_orderkey, orders.o_orderdate, orders.o_shippriority
order by revenue desc, orders.o_orderdate;

Data Points

• Join 6B rows to 1.5B rows to 150M rows

• Many aggregations/transformations

• Order (sorting)
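
The three-way join in Query 2 can be tried end to end on a handful of rows. The sketch below again uses Python's `sqlite3` only as a convenient stand-in engine, and every row in it is made up; the point is to show which rows survive each filter before revenue is summed and sorted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer (c_custkey integer, c_mktsegment text);
create table orders (o_orderkey integer, o_custkey integer,
                     o_orderdate text, o_shippriority integer);
create table lineitem (l_orderkey integer, l_extendedprice real,
                       l_discount real, l_shipdate text);

insert into customer values (1, 'BUILDING'), (2, 'MACHINERY');
insert into orders values
    (10, 1, '1995-03-01', 0),   -- qualifies
    (11, 1, '1995-03-20', 0),   -- ordered after the cutoff
    (12, 2, '1995-03-01', 0),   -- customer in the wrong market segment
    (13, 1, '1995-02-01', 0);   -- qualifies
insert into lineitem values
    (10, 100.0, 0.5, '1995-03-20'),
    (10, 200.0, 0.0, '1995-03-25'),
    (10, 999.0, 0.0, '1995-03-10'),  -- shipped before the cutoff
    (11,  50.0, 0.0, '1995-03-25'),
    (12,  70.0, 0.0, '1995-03-25'),
    (13, 400.0, 0.0, '1995-04-01');
""")

query2 = """
select lineitem.l_orderkey,
       sum(lineitem.l_extendedprice*(1-lineitem.l_discount)) as revenue,
       orders.o_orderdate, orders.o_shippriority
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
where customer.c_mktsegment = 'BUILDING'
  and orders.o_orderdate < '1995-03-15'
  and lineitem.l_shipdate > '1995-03-15'
group by lineitem.l_orderkey, orders.o_orderdate, orders.o_shippriority
order by revenue desc, orders.o_orderdate
"""
result = conn.execute(query2).fetchall()
for row in result:
    print(row)
```

Only orders 10 and 13 pass all three predicates, and the highest-revenue order comes first because of the `order by revenue desc`.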

Query 3

Chart: query time in seconds for Cold, Medium (disk cache only), and Hot runs

select nation.name,
    sum(lineitem.l_extendedprice * (1 - lineitem.l_discount)) as revenue
from customer
inner join orders on customer.c_custkey = orders.o_custkey
inner join lineitem on lineitem.l_orderkey = orders.o_orderkey
inner join supplier on lineitem.l_suppkey = supplier.s_suppkey
inner join nation on supplier.s_nationkey = nation.nation_key
inner join region on nation.region_key = region.r_regionkey
where supplier.s_nationkey = nation.nation_key
    and region.r_name = 'ASIA'
    and orders.o_orderdate >= '19940101'
    and orders.o_orderdate < '19950101'
group by nation.name
order by revenue desc

Data Points

• Join 6B rows to 1.5B rows to 150M rows (and many small joins)

• Multiple aggregations/transformations

• Order (sorting)

Query 4

Chart: query time in seconds for Cold, Medium (disk cache only), and Hot runs

select sum(l_extendedprice) as sum_exprice,
    sum(l_discount) as sum_discount
from lineitem
where l_shipdate >= '19940101'
    and l_shipdate < '19950101'
    and l_discount >= 0.05 and l_discount <= 0.07
    and l_quantity < 24

Data Points

• 6B row table

• Multiple aggregations/transformations
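
Query 4 is dominated by its `where` clause, which makes it a natural place to illustrate the "Filtering (Predicate Pushdown)" item from the operations list. One common way to push a filter down is to keep min/max statistics per storage chunk and skip chunks that cannot match. The sketch below shows that idea under stated assumptions: the `Chunk` structure and its statistics are hypothetical, and the deck does not say how BlazingDB actually implements pushdown.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Each row is (l_shipdate, l_discount, l_quantity, l_extendedprice).
Row = Tuple[str, float, float, float]


@dataclass
class Chunk:
    """Hypothetical storage chunk carrying min/max statistics for l_shipdate."""
    min_shipdate: str
    max_shipdate: str
    rows: List[Row]


def query4(chunks: List[Chunk]) -> Tuple[float, float, int]:
    """Return (sum_exprice, sum_discount, chunks_scanned) for the Query 4 filter."""
    lo, hi = "19940101", "19950101"
    sum_exprice = 0.0
    sum_discount = 0.0
    scanned = 0
    for chunk in chunks:
        # Pushdown: skip chunks whose date range cannot overlap [lo, hi).
        if chunk.max_shipdate < lo or chunk.min_shipdate >= hi:
            continue
        scanned += 1
        for shipdate, discount, quantity, price in chunk.rows:
            if lo <= shipdate < hi and 0.05 <= discount <= 0.07 and quantity < 24:
                sum_exprice += price
                sum_discount += discount
    return sum_exprice, sum_discount, scanned


# Invented data: three yearly chunks, only the 1994 chunk can contain matches.
chunks = [
    Chunk("19930101", "19931231", [("19930615", 0.06, 10.0, 100.0)]),
    Chunk("19940101", "19941231", [
        ("19940301", 0.06, 10.0, 200.0),   # matches every predicate
        ("19940401", 0.10, 10.0, 300.0),   # discount out of range
        ("19940501", 0.05, 30.0, 400.0),   # quantity too large
    ]),
    Chunk("19950101", "19951231", [("19950201", 0.06, 10.0, 500.0)]),
]
totals = query4(chunks)
print(totals)
```

The fixed-width `YYYYMMDD` strings used in the query compare correctly as plain strings, which is what lets the chunk statistics be checked without parsing dates.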

Query 5

Chart: query time in seconds for Cold, Medium (disk cache only), and Hot runs

select supplier.s_acctbal, supplier.s_suppkey, nation.name,
    part.p_partkey, part.p_mfgr, supplier.s_address, supplier.s_phone,
    supplier.s_comment
from supplier
inner join partsupp on supplier.s_suppkey = partsupp.ps_suppkey
inner join nation on supplier.s_nationkey = nation.nation_key
inner join region on nation.region_key = region.r_regionkey
inner join part on part.p_partkey = partsupp.ps_partkey
where part.p_size = 15
    and part.p_type in (
        'ECONOMY ANODIZED BRASS', 'ECONOMY BRUSHED BRASS',
        'ECONOMY BURNISHED BRASS', 'ECONOMY PLATED BRASS',
        'ECONOMY POLISHED BRASS',
        'LARGE ANODIZED BRASS', 'LARGE BRUSHED BRASS',
        'LARGE BURNISHED BRASS', 'LARGE PLATED BRASS',
        'LARGE POLISHED BRASS',
        'SMALL ANODIZED BRASS', 'SMALL BRUSHED BRASS',
        'SMALL BURNISHED BRASS', 'SMALL PLATED BRASS',
        'SMALL POLISHED BRASS',
        'STANDARD ANODIZED BRASS', 'STANDARD BRUSHED BRASS',
        'STANDARD BURNISHED BRASS', 'STANDARD PLATED BRASS',
        'STANDARD POLISHED BRASS')
    and region.r_name = 'EUROPE'
order by supplier.s_acctbal desc, supplier.s_suppkey, nation.name,
    part.p_partkey

Data Points

• Join multiple tables

• Many aggregations/transformations

• String comparisons
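
The 20 string literals in Query 5's `p_type in (...)` list are every combination of four size words and five finish words followed by `BRASS`. As a convenience when writing or checking such a query by hand, the list can be generated rather than typed. This is a small helper sketch, not something the deck itself does, and the variable names are arbitrary.

```python
from itertools import product

sizes = ["ECONOMY", "LARGE", "SMALL", "STANDARD"]
finishes = ["ANODIZED", "BRUSHED", "BURNISHED", "PLATED", "POLISHED"]

# Every "<size> <finish> BRASS" combination, in the same order as in the query.
p_types = [f"{size} {finish} BRASS" for size, finish in product(sizes, finishes)]

# Render the combinations as a quoted SQL in-list.
in_list = ", ".join(f"'{p_type}'" for p_type in p_types)
predicate = f"part.p_type in ({in_list})"
print(len(p_types))
print(predicate)
```

Generating the list also guards against the quoting slips that are easy to make when 20 literals are written out manually.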

Data Pipeline

GPU Data Frame

Apache Arrow

Common Data Layer

INGEST

STORAGE (Data Lake)

Coming Soon

Questions?