Veracity think bigdata #2, 6.7.2015


DWH OVER HADOOP

THE BASICS

COLUMNAR FORMATS (ORC/PARQUET)

- Projection push down
- Predicate push down
- Excellent compression ratios
- Column indices
- Max/avg/min values
- Rows must be batched to benefit from these optimizations
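A minimal HiveQL sketch of where these optimizations kick in; the table and column names are illustrative:

  -- Hypothetical events table stored in a columnar format.
  CREATE TABLE events (
    event_time TIMESTAMP,
    user_id    BIGINT,
    payload    STRING
  )
  STORED AS ORC;

  -- Projection push down: only the user_id column stream is read.
  -- Predicate push down: stripes whose min/max statistics rule out
  -- user_id = 42 are skipped without being decompressed.
  SELECT COUNT(*)
  FROM events
  WHERE user_id = 42;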

PARQUET

- Strongly endorsed by Cloudera
- One of the few formats Impala supports (and the most optimal for it)
- Also supported by Hive, Spark, Tajo, Drill & Presto
- Speaking from my own personal experience, a bit more expensive to generate

ORC

- Endorsed by Hortonworks
- Most optimal for Presto
- Spark support was recently introduced

QUERYING ENGINES

HIVE

Hive provides a SQL-like interface for accessing the data (files), called HiveQL. The HQL is translated into M/R code and executed immediately.

- Batch oriented
- Fault tolerant and thus reliable
- Not a DB! Does not support updates & deletes and has no transactions (or does it?)
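A small HiveQL sketch of that interface; the path and names are hypothetical:

  -- Expose raw files on HDFS as a table, without moving them.
  CREATE EXTERNAL TABLE raw_logs (line STRING)
  LOCATION '/data/raw/logs';

  -- This statement is compiled into an M/R job and executed.
  SELECT COUNT(*) FROM raw_logs WHERE line LIKE '%ERROR%';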

LOW LATENCY SQL

Map-Reduce can be compared to a tractor: it's very strong and can plow a field better than any other vehicle, but it's also very slow. As prices of memory dropped, a demand emerged to better utilize it for faster response times.

CLOUDERA IMPALA

- Written in C++
- Utilizes Hive's metadata
- Very fast
- Not fault tolerant
- Doesn't support custom data formats
- Doesn't support complex data types (maps/arrays/structs)
- A bit complicated to set up on non-CDH distributions
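Because Impala reuses Hive's metastore, tables defined in Hive are queryable from impala-shell once the catalog is refreshed; a minimal sketch (table name illustrative):

  -- Pick up tables created or changed outside Impala.
  INVALIDATE METADATA;

  -- Query the same table Hive defined, at interactive speed.
  SELECT COUNT(*) FROM events;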

FACEBOOK PRESTO

Can connect to:

- Cassandra
- Hive
- JMX sources
- Postgres & MySQL

- Allows cross-engine joins
- Used at Facebook to serve online dashboards
- Easy to set up
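A hypothetical cross-engine join: "hive" and "mysql" here are configured Presto catalogs, and the schema/table names are illustrative:

  SELECT u.name, COUNT(*) AS events
  FROM hive.web.events e
  JOIN mysql.crm.users u ON e.user_id = u.id
  GROUP BY u.name;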

SPARK SQL

- Not affiliated with any Hadoop vendor
- Supports all of the optimized file formats (ORC/Parquet/Avro)
- Can auto-discover schema
- Aims to provide second/sub-second latency
- Still not very mature
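A sketch of schema auto-discovery in Spark SQL (1.x syntax; the path is illustrative): the schema is inferred from the Parquet file footers rather than declared up front.

  CREATE TEMPORARY TABLE events
  USING org.apache.spark.sql.parquet
  OPTIONS (path '/data/warehouse/events');

  SELECT COUNT(*) FROM events;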

THE USUAL DATA FLOW

Collect -> Store -> Convert -> Select

- The data latency conflict: lots of fragmented small files, or big optimized files with high latency
- Processing efforts involved in the conversion process should be minimized
- Example (sketched below)
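One common shape of the conversion step, as a hedged HiveQL sketch (names are illustrative): small text files land in a staging table and are periodically rewritten into an optimized table.

  -- Rewrite fragmented text files into one set of ORC files.
  CREATE TABLE events_orc STORED AS ORC
  AS SELECT * FROM events_staging_text;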

A BETTER DATA FLOW

Collec-tor-vert -> Select

- Convert the data as it is being collected, where possible
- Or convert the data as it is being stored (streaming), but without losing optimizations
- How can this be achieved?

SQOOP

- Imports data from RDBMS into Hadoop
- Creates Java classes and Hive tables on import
- Exports data back to RDBMS
- Runs a "Map Only" job to perform the task
- Supports incremental imports
- Now supports importing directly as Parquet
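A hypothetical Sqoop invocation combining these features; the connection string, credentials and table/column names are all illustrative:

  sqoop import \
    --connect jdbc:mysql://db.example.com/shop \
    --username etl -P \
    --table orders \
    --hive-import \
    --incremental append --check-column id \
    --as-parquetfile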

HIVE & ACID

Recently a conceptual change has been introduced into Hive: CRUD with ACID transactions. It is not meant to replace your OLTP database, but rather to supply a better data modification mechanism for a subset of the data.

- Explanation of how it works
- Demo: a simple insert (sketched below)
- Still requires M/R :(
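A minimal HiveQL sketch of how this looks (Hive 0.14+; names illustrative). ACID tables must be bucketed, stored as ORC and flagged as transactional, and the session needs the transactional settings (e.g. hive.txn.manager set to DbTxnManager) enabled:

  CREATE TABLE users_acid (id INT, name STRING)
  CLUSTERED BY (id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  -- The simple insert, plus the new UPDATE/DELETE verbs.
  INSERT INTO TABLE users_acid VALUES (1, 'alice');
  UPDATE users_acid SET name = 'bob' WHERE id = 1;
  DELETE FROM users_acid WHERE id = 1;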

HIVE & STREAMING INGEST

With the new ACID capabilities it is now possible to continuously insert data into Hive:

- Data appears almost immediately
- Data is optimized in a columnar format
- Data is compacted by different triggers
- Code snippet (below)
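The snippet in question is the Hive streaming API (org.apache.hive.hcatalog.streaming); a hedged Java sketch, assuming a transactional partitioned table events_acid(id, name) and a metastore at thrift://metastore:9083, both of which are illustrative:

  import java.util.Arrays;
  import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
  import org.apache.hive.hcatalog.streaming.HiveEndPoint;
  import org.apache.hive.hcatalog.streaming.StreamingConnection;
  import org.apache.hive.hcatalog.streaming.TransactionBatch;

  public class StreamDemo {
    public static void main(String[] args) throws Exception {
      // Target: database "demo", table "events_acid", partition dt=2015-07-06.
      HiveEndPoint endPoint = new HiveEndPoint("thrift://metastore:9083",
          "demo", "events_acid", Arrays.asList("2015-07-06"));
      StreamingConnection conn = endPoint.newConnection(true); // create partition if missing

      DelimitedInputWriter writer =
          new DelimitedInputWriter(new String[] {"id", "name"}, ",", endPoint);

      // Each batch groups several transactions for efficiency.
      TransactionBatch batch = conn.fetchTransactionBatch(10, writer);
      batch.beginNextTransaction();
      batch.write("1,alice".getBytes());
      batch.write("2,bob".getBytes());
      batch.commit(); // rows become visible to readers almost immediately
      batch.close();
      conn.close();
    }
  }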

FLUME

- Distributed
- Durable
- Scalable
- Fault tolerant
- Serves for ingestion and basic pre-processing of the data
- Composed of source -> channel -> sink (draw architecture)
- Utilizes Hive's ACID capabilities to instantly stream data into Hive (demo)
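A hypothetical agent configuration wiring these pieces together with Flume 1.6's Hive sink; all names, paths and partition values are illustrative:

  # Source -> channel -> sink, as in the architecture above.
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # Tail an application log.
  a1.sources.r1.type = exec
  a1.sources.r1.command = tail -F /var/log/app.log
  a1.sources.r1.channels = c1

  # Durable, disk-backed channel.
  a1.channels.c1.type = file

  # Hive sink: streams events straight into the ACID table.
  a1.sinks.k1.type = hive
  a1.sinks.k1.channel = c1
  a1.sinks.k1.hive.metastore = thrift://metastore:9083
  a1.sinks.k1.hive.database = demo
  a1.sinks.k1.hive.table = events_acid
  a1.sinks.k1.hive.partition = 2015-07-06
  a1.sinks.k1.serializer = DELIMITED
  a1.sinks.k1.serializer.delimiter = ","
  a1.sinks.k1.serializer.fieldnames = id,name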