Apache Tajo on Swift: Bringing SQL to the OpenStack World


Jihoon Son, Apache Tajo PMC member

Who am I

● Jihoon Son
  ○ Ph.D. candidate (Computer Science & Engineering, 2010.3 ~)
  ○ Tajo project co-founder (2010)
  ○ Apache Tajo PMC and Committer (2014.5.1 ~)

● Contacts
  ○ Email: jihoonson AT apache.org
  ○ LinkedIn: https://www.linkedin.com/in/jihoonson

Outline

● OpenStack Swift
● Apache Tajo
● Tajo on Swift
● Location-aware Computing of Tajo
● Brief Evaluation Results
● Roadmap

OpenStack Swift

● Popular object storage
  ○ Images, videos, logs, ...

● Enterprises store objects on Swift to provide their services
  ○ Usually private clusters

SQL on Swift

● Data analysis is important for improving the quality of their services
  ○ SQL is one of the most powerful and popular query languages
● Many enterprise data analysis tools rely on SQL
  ○ OLAP, visualization, data mining, …

● Hence the need to use SQL on Swift

Apache Tajo

● Scalable, efficient, and fault-tolerant data warehouse system
  ○ Supports the SQL standards
  ○ Efficient batch execution and interactive ad-hoc analysis
    ■ Low latency and high throughput
    ■ No use of MapReduce
  ○ No single point of failure

Apache Tajo

[Figure: Tajo cluster architecture, showing TajoMaster and TajoWorker nodes on top of a pluggable storage layer]

Apache Tajo

● Activity summary

Apache Tajo

● Active open source project
  ○ Apache incubating project (2013)
  ○ Apache Top-Level Project (2014)
  ○ 18 committers + 16 contributors

Tajo on Swift

[Figure: a Tajo cluster connected to Swift]

Tajo on Swift

● No need to modify the code of Tajo or Swift
  ○ Tajo can access Swift through the hadoop-openstack library
    ■ There is no need to install or run Hadoop, however
  ○ Just use it
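As a rough illustration of how Swift data is addressed through the hadoop-openstack library, objects are referred to with URIs of the form swift://<container>.<service>/<path>. The sketch below only builds and parses such URIs. The function names and the example container and service names are hypothetical, and the rule that the first dot separates container from service is an assumption of this sketch.

```python
from typing import Tuple
from urllib.parse import urlparse


def swift_uri(container: str, service: str, path: str = "") -> str:
    """Build a swift://<container>.<service>/<path> URI (names are illustrative)."""
    # Assumption: the first dot in the host separates container and service,
    # so the container name itself must not contain a dot.
    if "." in container:
        raise ValueError("container name must not contain '.'")
    return f"swift://{container}.{service}/{path.lstrip('/')}"


def split_swift_uri(uri: str) -> Tuple[str, str, str]:
    """Return (container, service, object path) from a swift:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "swift":
        raise ValueError(f"not a swift URI: {uri}")
    container, _, service = parsed.netloc.partition(".")
    return container, service, parsed.path.lstrip("/")
```

For example, `swift_uri("logs", "tajo", "twitter/part-0001")` yields `swift://logs.tajo/twitter/part-0001`, where `tajo` stands for whatever service name the cluster configuration defines.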


Tajo on Swift

● Focusing on private clusters
  ○ Tajo and Swift may share the same cluster
  ○ Able to optimize the cluster configuration and data deployment
● More efficient data analysis!
  ○ By reducing the overhead of bottlenecks

Data Locality Problem

● The network is usually the bottleneck

[Figure: Node A and Node B, each with a Worker and a StorageNode, connected by an interconnection network; reads across nodes incur significant network overhead]

● It is important to reduce data transmission over the network

Location-aware Computing

● Moving the processing close to the data
  ○ Avoids the performance degradation caused by data transfer over the network

[Figure: two panels, each with a Master and two Workers. < Moving data to tasks >: the workers read B and A. < Moving tasks to data >: the workers read A and B.]

● Three cases of user data deployment
  ○ Case 1: small objects without segmentation
  ○ Case 2: large objects with segmentation
  ○ Case 3: large objects without segmentation

Case 1: Small Objects W/O Segmentation

● Each task processes an object
  ○ The object size should be sufficiently small to be processed by a task

[Figure: workers and storage nodes]

1) Getting object locations from the ring

[Figure: the QueryMaster gets object locations from the ProxyServer]

2) Assigning tasks based on object locations

[Figure: the QueryMaster assigns tasks close to objects; the Workers directly read object data from the StorageNodes]
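The two steps above can be sketched as follows. This is a simplified illustration, not Tajo's scheduler or Swift's ring implementation: `ToyRing`, `assign_tasks`, the node names, and the consecutive-node replica placement are all assumptions made for the example. It only mirrors the general idea that an object path hashes to a partition, a partition maps to storage nodes, and tasks go to workers that sit on one of those nodes.

```python
import hashlib
from collections import defaultdict


class ToyRing:
    """Toy stand-in for an object ring: hash -> partition -> replica nodes."""

    def __init__(self, nodes, part_power=4, replicas=2):
        self.nodes = list(nodes)
        self.part_count = 2 ** part_power
        self.replicas = replicas

    def get_nodes(self, object_path):
        # Hash the object path and map the hash to a partition.
        digest = hashlib.md5(object_path.encode("utf-8")).digest()
        part = int.from_bytes(digest[:4], "big") % self.part_count
        # Assumed placement: consecutive nodes hold the replicas.
        return [self.nodes[(part + r) % len(self.nodes)]
                for r in range(self.replicas)]


def assign_tasks(object_paths, ring, workers):
    """Assign each object to the least-loaded worker co-located with a replica."""
    load = defaultdict(int)
    assignment = {}
    for path in object_paths:
        local = [n for n in ring.get_nodes(path) if n in workers]
        # Fall back to any worker (a remote read) when no replica is local.
        candidates = local or list(workers)
        chosen = min(candidates, key=lambda w: load[w])
        load[chosen] += 1
        assignment[path] = chosen
    return assignment
```

When workers and storage nodes share the same machines, every task in this sketch lands on a node that holds a replica of its object, so the read stays local.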

Case 2: Large Objects with Segmentation

● Each task processes a segment
  ○ The segment size should be sufficiently small to be processed by a task

[Figure: workers and storage nodes]

1) Getting segments of the given objects

[Figure: the QueryMaster gets the segments of the input objects]

2) Getting segment locations from the ring

[Figure: the QueryMaster gets segment locations]

3) Assigning tasks based on segment locations

[Figure: the QueryMaster assigns tasks close to segments; the Workers directly read segments from the StorageNodes]
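A minimal sketch of the one-task-per-segment idea, assuming fixed-size segments. The function name, the 64 MB segment size, and the (object, index, length) tuple layout are hypothetical; in a real deployment the segment list would be obtained from Swift rather than computed.

```python
def list_segments(object_sizes, segment_size):
    """Yield (object_name, segment_index, length) for fixed-size segments."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    for name, size in object_sizes.items():
        index = 0
        offset = 0
        while offset < size:
            # The last segment may be shorter than segment_size.
            length = min(segment_size, size - offset)
            yield (name, index, length)
            offset += length
            index += 1


MB = 1024 * 1024
# A 150 MB object cut into 64 MB segments gives three tasks.
tasks = list(list_segments({"big.log": 150 * MB}, 64 * MB))
```

Each resulting tuple would then be located through the ring and assigned to a nearby worker, in the same way as a whole object in Case 1.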

Case 3: Large Objects W/O Segmentation

● Inevitable performance degradation due to the suboptimal data deployment
  ○ Each Tajo worker processes large objects, which causes coarse-grained load balancing

[Figure: three Workers]
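To see why large unsegmented objects lead to coarse-grained load balancing, the following sketch compares a greedy assignment of a few large objects against the same amount of data cut into smaller pieces. The sizes, the worker count, and the greedy policy are assumptions for illustration, not measurements or Tajo's actual scheduler.

```python
import heapq


def makespan(sizes, num_workers):
    """Greedy: give each item (largest first) to the least-loaded worker."""
    loads = [0] * num_workers
    heapq.heapify(loads)
    for size in sorted(sizes, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size)
    # The busiest worker determines when the whole scan finishes.
    return max(loads)


large = [900, 500, 100]      # three unsegmented objects (MB)
pieces = [100] * 15          # the same 1500 MB in 100 MB pieces
print(makespan(large, 3))    # prints 900
print(makespan(pieces, 3))   # prints 500
```

With whole objects, one worker is stuck with the 900 MB object while the others sit idle; with small pieces, the 1500 MB spreads evenly at 500 MB per worker.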

Case 3: Large Objects W/O Segmentation

● Alternatively, Tajo will provide a data load feature
  ○ Load objects into Tajo's optimized internal storage

● Current status
  ○ Supports location-aware computing without segmentation
● Future support
  ○ Location-aware computing for segmented objects
  ○ Data load into Tajo's storage

Brief Evaluations

● Performance comparison with another distributed storage system
  ○ Swift vs. Hadoop Distributed File System (HDFS)

● Scalability test of Swift

Brief Evaluations

● Cluster configuration
  ○ 1 master + 8 slaves
    ■ Each worker is equipped with 1 disk
  ○ Swift: 1 proxy server + 8 storage nodes
  ○ HDFS: 1 namenode + 8 datanodes
  ○ Tajo: 1 master + 8 workers
    ■ Each worker can process 2 tasks simultaneously

Brief Evaluations

● Data set
  ○ Crawled Twitter log
    ■ # of objects: 1014
    ■ Avg size of objects: 70MB
    ■ Total size: 69.5GB

● Query
  ○ Full scan query
    ■ select * from twitter_log
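A back-of-the-envelope check on these numbers, which is not a result from the talk. It assumes one task per object, as in Case 1, and uses the 8 workers with 2 simultaneous tasks each from the cluster configuration; the notion of "waves" is introduced here only for illustration.

```python
import math

num_objects = 1014
total_gb = 69.5
workers, slots_per_worker = 8, 2

avg_mb = total_gb * 1024 / num_objects      # average object size in MB
slots = workers * slots_per_worker          # tasks that can run at once
waves = math.ceil(num_objects / slots)      # rounds of tasks for a full scan

print(f"{avg_mb:.1f} MB per object, {slots} slots, {waves} waves")
```

The computed average of about 70 MB agrees with the stated object size, and a full scan needs roughly 64 rounds of 16 concurrent tasks under these assumptions.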

Comparison with HDFS

[Figure: comparison of Swift and HDFS, with charts for scan time and schedule time; legend: Result, Data]

Scalability Test

Our Roadmap

● Storage layer specialized for Swift
  ○ Location-aware computing for segmented objects
  ○ Data load into Tajo's storage

● Block storage support
  ○ Cinder and Ceph

● Provisioning Tajo clusters
  ○ Sahara
  ○ Heat, TOSCA

Thanks!
http://tajo.apache.org/

