Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal

Interactive Queries on Compressed RDDs

Succinct Spark

Rachit Agarwal, AMPLab

ragarwal@berkeley.edu

Twitter: @_ragarwal_

No secondary indexes, no data scans, no data decompression

Succinct: a distributed compressed data store

Point queries

• search

• random access

• range queries

• regular expressions

Unified Interface

• Unstructured data

• Key-value store

• Document store

• Tables

Interactive point queries

Random access

Search

Range queries

Regular expressions

Aggregate queries

Updates

Graph queries

Existing systems, e.g., Search( )

Data scans: low storage, high latency

Indexes: high storage, low latency

[Figure: an index mapping search terms to offsets in the input; the searched term maps to offsets 0, 10, 14, 16, 19, 26, 29.]

Existing systems "at scale" (qualitatively)

[Figure: query latency vs. input size for data scans and indexes. Annotated regions: scans in faster storage, scans in slower storage, indexes in faster storage, indexes in slower storage (executing queries off slower storage).]

Succinct

Low storage, low latency

Queries executed directly on the compressed representation

What makes Succinct unique

No additional indexes: query responses embedded within the compressed representation

No data scans: functionality of indexes

No decompression: queries directly on the compressed representation (except for data access queries)

Succinct tradeoffs

[Figure: query latency vs. input size for data scans, indexes, and Succinct. Annotations for Succinct: avoiding data scans, avoiding queries off slower storage.]

Succinct Data Model and Functionality

Input: flat (unstructured) files

[Figure: the original input and its Succinct representation.]

Search: returns offsets of arbitrary strings in uncompressed file

Extract: returns data at arbitrary offsets in uncompressed file

Count: returns count of arbitrary strings in uncompressed file

Search( ) = {0, 10, 14, 16, 19, 26, 29}

Extract(0, 5) = {,,,,}

Count( ) = 7

Append(,,,,)

Range queries
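The three flat-file operations above can be pinned down with a small reference sketch. This is plain Scala that scans an uncompressed string, so it only illustrates what Search, Count, and Extract return, not how Succinct computes them on the compressed representation. The object name `FlatFileSemantics` and the sample input are hypothetical.

```scala
// Reference semantics only: a naive scan over an uncompressed string.
// Succinct answers the same queries without scanning or decompressing.
object FlatFileSemantics {
  /** Offsets of every (possibly overlapping) occurrence of `query` in `input`. */
  def search(input: String, query: String): List[Int] = {
    require(query.nonEmpty, "query must be non-empty")
    Iterator
      .iterate(input.indexOf(query))(prev => input.indexOf(query, prev + 1))
      .takeWhile(_ >= 0)
      .toList
  }

  /** Number of occurrences of `query` in `input`. */
  def count(input: String, query: String): Int = search(input, query).size

  /** `length` characters starting at `offset`, clipped to the end of the input. */
  def extract(input: String, offset: Int, length: Int): String =
    input.slice(offset, offset + length)
}
```

For example, on the input "banana", `search(input, "ana")` yields the offsets 1 and 3, `count(input, "ana")` yields 2, and `extract(input, 2, 3)` yields "nan".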

Succinct tradeoffs

What do we lose?

• Preprocessing time

• CPU (data access)

• Sequential scan throughput

• "In-place" updates

Supported, but traded off in favor of point queries on compressed data

No secondary indexes, no data scans, no data decompression

Point queries

• search

• random access

• range queries

• regular expressions

Unified Interface

• Unstructured data

• Key-value store

• Document store

• Tables

Unified Interface

• Unstructured data

• Key-value stores (Voldemort, Dynamo)

• Document stores (Elasticsearch, MongoDB)

• Tables (Cassandra, BigTable)

• And many more...

With all the powerful queries on values, documents, columns

Succinct Data Model: Flat File Interface

Search(Column1, )

Search( )

Succinct Flat File Interface: Unification
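One way a column-level Search could be rewritten as a flat-file Search is sketched below. The encoding is an assumption made for illustration, not something the slides specify: each value is wrapped in a delimiter character unique to its column, so a column search becomes an ordinary substring search. The names `FlatFileUnification`, `serialize`, and `searchColumn` are hypothetical, and the sketch assumes values never contain newlines or the delimiter characters.

```scala
// Hypothetical encoding: column c is wrapped in its own private-use character.
object FlatFileUnification {
  private def delim(column: Int): Char = (0xE000 + column).toChar

  /** One line per row; every value is wrapped in its column's delimiter. */
  def serialize(rows: Seq[Seq[String]]): String =
    rows.map { row =>
      row.zipWithIndex.map { case (v, c) => s"${delim(c)}$v${delim(c)}" }.mkString
    }.mkString("\n")

  /** Search(column, value) expressed as a substring search on the flat file.
    * Returns the row numbers whose value in `column` equals `value`. */
  def searchColumn(flat: String, column: Int, value: String): List[Int] = {
    val query = s"${delim(column)}$value${delim(column)}"
    Iterator
      .iterate(flat.indexOf(query))(prev => flat.indexOf(query, prev + 1))
      .takeWhile(_ >= 0)
      .map(offset => flat.take(offset).count(_ == '\n')) // offset -> row number
      .toList
  }
}
```

Because the delimiter differs per column, the same value stored in two columns produces two distinct flat-file patterns, which is what lets a single flat-file interface serve table-style queries in this sketch.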

Where are we?

• Succinct

• Succinct Spark

Where are we going?

• Industry collaboration

• Succinct++

Succinct: Where are we?

Succinct

• System (prototyped & tested)

• As a library

• C++, Java, Scala

• for ease of integration

• All functionalities supported

Succinct: Where are we?

Succinct Spark

Queries on compressed RDDs

• A Spark package

• Enables new functionalities

• Document stores

• Point queries

• Faster filters

• Compressed RDDs: more in-memory

• Dataframes API not so mature

Succinct Spark

If you are already using Spark

New functionalities: document store, key-value store; search on documents, values

Faster operations into RDDs: random access, filters; avoid scans

More in-memory: compressed RDDs; no decompression overheads

Succinct Spark: SuccinctRDD (unstructured data)

// Import classes
import edu.berkeley.cs.succinct._
// Create an RDD
val rdd = ctx.textFile(...).map(_.getBytes)
// Compress using Succinct
val succinctRDD = rdd.succinct
// Extract 100 bytes from offset 50
val bytes = succinctRDD.extract(50, 100)
// Count # occurrences of “Berkeley”
val count = succinctRDD.count("Berkeley")
// Find all occurrences of “Berkeley”
val offsets = succinctRDD.search("Berkeley")

Succinct Spark: SuccinctKVRDD (document store)

// Import classes
import edu.berkeley.cs.succinct.kv._
// Load data
val kvRDD = rdd.zipWithIndex.map(t => (t._2, t._1.getBytes))
// Compress using Succinct
val succinctKVRDD = kvRDD.succinctKV
// Get value for key 0
val value = succinctKVRDD.get(0)
// Extract 100 bytes at offset 50 in the value for key 0
val valueData = succinctKVRDD.extract(0, 50, 100)
// Find all keys for values that contain “Berkeley”
val keys = succinctKVRDD.search("Berkeley")
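The meaning of the three key-value operations can be illustrated with an ordinary in-memory map. This is a sketch of the semantics described on the slide only; it does not use Spark or Succinct, and the names `KVSemantics` and `fromLines` are hypothetical. Values are kept as strings rather than byte arrays to keep the example short.

```scala
// Reference semantics only: an uncompressed Map standing in for the store.
object KVSemantics {
  type Store = Map[Long, String]

  /** Mirrors the slide's zipWithIndex step: the line number becomes the key. */
  def fromLines(lines: Seq[String]): Store =
    lines.zipWithIndex.map { case (line, i) => i.toLong -> line }.toMap

  /** Value stored under `key`, if any. */
  def get(store: Store, key: Long): Option[String] = store.get(key)

  /** `length` characters at `offset` inside the value for `key`. */
  def extract(store: Store, key: Long, offset: Int, length: Int): Option[String] =
    store.get(key).map(_.slice(offset, offset + length))

  /** Keys of all values that contain `query`. */
  def search(store: Store, query: String): Set[Long] =
    store.collect { case (k, v) if v.contains(query) => k }.toSet
}
```

Note that search returns keys rather than offsets, which is the difference between the document-store view and the flat-file view of the same data.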

Succinct Evaluation

• 5x Amazon EC2 servers, 30 GB RAM each

• Wikipedia dataset, 40 GB

• Spark, Elasticsearch

• search queries

• # occurrences 1-10k

Succinct Spark Evaluation (search latency)

Take-away: Succinct Spark is 2.75x faster than Elasticsearch while being 2.5x more space efficient (data fits in memory for all systems)

Succinct Spark now supports Regular Expressions!

SuccinctRDD

// Find all matches for the RegEx “William.*Clinton”
val matches = succinctRDD.regexSearch("William.*Clinton")

SuccinctKVRDD

// Find all keys for values that contain matches for the RegEx “William.*Clinton”
val matchKeys = succinctKVRDD.regexSearch("William.*Clinton")
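To make the two regex variants concrete, the sketch below reproduces their described behavior with the standard `scala.util.matching.Regex` class on uncompressed data. The result shapes are assumptions: the slides say only "all matches" and "all keys", so this sketch represents a match as an (offset, length) pair. The object name `RegexSemantics` is hypothetical.

```scala
import scala.util.matching.Regex

// Reference semantics only, using the JVM regex engine on plain strings.
object RegexSemantics {
  /** (offset, length) of every match of `pattern` in `input` (assumed result shape). */
  def regexSearch(input: String, pattern: String): List[(Int, Int)] =
    new Regex(pattern).findAllMatchIn(input).map(m => (m.start, m.end - m.start)).toList

  /** Keys of all values that contain at least one match of `pattern`. */
  def regexSearchKeys(store: Map[Long, String], pattern: String): Set[Long] = {
    val re = new Regex(pattern)
    store.collect { case (k, v) if re.findFirstIn(v).isDefined => k }.toSet
  }
}
```

With the JVM engine, `.*` is greedy and does not cross line breaks by default, so a pattern such as "William.*Clinton" matches at most once per line, spanning from the first "William" to the last "Clinton" on that line.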

Succinct Spark Evaluation (RegEx latency)

Take-away: Succinct significantly speeds up RegEx queries even when all the data fits in memory for all systems

Succinct Spark now supports JSON documents!

// Get JSON document with id 0
val jsonDoc = succinctJsonRDD.get(0)
// Filter JSON documents where “city = Berkeley”
val ids1 = succinctJsonRDD.filter("city", "Berkeley")
// Search for JSON documents containing “AMPLab”
val ids2 = succinctJsonRDD.search("AMPLab")

• More testing, benchmarking

• Succinct Spark Dataframes

• New functionalities

Where are we going?

New functionalities

• BlowFish

• Succinct Encryption

• Succinct Graphs

Queries on compressed and encrypted data

Queries on compressed graphs

[Figure: storage vs. query latency, showing Succinct, BlowFish, and indexes.]

AND MANY MORE!

succinct.cs.berkeley.edu
