Faster Data Flows with Hive, Spring and Hadoop
Alex Silva, Principal Data Engineer
DATA ADVENTURES AT RACKSPACE
• Datasets
• Data pipeline: flows and systems
• Creating a generic Hadoop ETL framework
• Integrating Hadoop with Spring
• Spring Hadoop, Spring Batch and Spring Boot
• Hive
• File formats
• Queries and performance
MAAS Dataset
• System and platform monitoring
• Pings, SSH, HTTP, HTTPS checks
• Remote monitoring
• CPU, file system, load average, disk, memory
• MySQL, Apache
THE BUSINESS DOMAIN | 3
The Dataset
• Processing around 1.5B records/day
• Stored in Cassandra
• Exported to HDFS in batches
• TBs of uncompressed JSON (“raw data”) daily
• First dataset piped through ETL platform
DATA ENGINEERING STATS | 4
DATA PIPELINE
• Data flow
• Stages
• ETL
• Input formats
• Generic Transformation Layer
• Outputs
Data Flow Diagram
DATA FLOW | 6
[Flow diagram: Flume exports monitoring data as JSON into HDFS. If the export is available and well-formed, the ETL step extracts and transforms the JSON data; bad rows and errors are logged to CSV, while good rows go to a staging file. The load step then moves data from the staging table into the production Hive table, applying partitioning, bucketing, and indexing.]
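The staging-to-production load at the end of this flow can be sketched in HiveQL. This is a minimal sketch with hypothetical table and column names (the real DDL is not shown in the slides); it assumes dynamic partitioning is enabled:

```sql
-- Hypothetical names; illustrative only.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Move cleaned rows from the staging table into the partitioned,
-- bucketed production table.
INSERT OVERWRITE TABLE metrics_prod PARTITION (dt)
SELECT entity_id, check_type, metric_value, dt
FROM metrics_staging;
```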
Systems Diagram
SYSTEMS | 7
[Systems diagram: monitoring events reach Flume 1.5.0 through the Flume Log4J appender and are exported as JSON into HDFS. MapReduce 1.2.0.1.3.2.0 extracts the data, sending bad records to a separate sink, and the results are loaded into Hive 0.12.0 for end-user access.]
ETL Summary
• Extract
• JSON files in HDFS
• Transform
• Generic Java based ETL framework
• MapReduce jobs extract features
• Quality checks
• Load
• Load data into partitioned ORC Hive tables
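As a rough sketch of what such a table could look like (names, columns, and bucket count are hypothetical, not taken from the actual pipeline):

```sql
-- Illustrative only: a partitioned, bucketed ORC table
CREATE TABLE metrics_prod (
    entity_id    STRING,
    check_type   STRING,
    metric_value DOUBLE
)
PARTITIONED BY (dt STRING)               -- e.g. one partition per day
CLUSTERED BY (entity_id) INTO 32 BUCKETS
STORED AS ORC;
```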
HADOOP
Hadoop: Pros
• Dataset volume
• Data volume grows rapidly
• Integrates with existing ecosystem
• HiveQL
• Experimentation and exploration
• No expensive software or hardware to buy
TOOLS AND TECHNOLOGIES | 10
Hadoop: Cons
• Job monitoring and scheduling
• Data quality
• Error handling and notification
• Programming model
• Generic framework mitigates some of that
CAN WE OVERCOME SOME OF THOSE?
Keeping the Elephant “Lean”
• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
HEY! WHAT ABOUT SPRING?
SPRING DATA HADOOP
What is it about?
• Part of the Spring Framework
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs.
• Script HDFS operations using any JVM based languages.
• Supports both classic MR and YARN
The Apache Hadoop Namespace
Also supports annotation-based configuration via the @EnableHadoop annotation.
Job Configuration: Standard Hadoop APIs
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Configuring Hadoop with Spring
SPRING HADOOP | 19
<context:property-placeholder location="hadoop-dev.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="word-count-job"
         input-path="${input.path}"
         output-path="${output.path}"
         jar="hadoop-examples.jar"
         mapper="examples.WordCount.WordMapper"
         reducer="examples.WordCount.IntSumReducer"/>

<hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000
Configuration Attributes
Creating a Job
Injecting Jobs
• Use DI to obtain a reference to a Spring-managed Hadoop job
• Perform additional validation and configuration before submitting
public class WordService {

    @Autowired
    private Job mapReduceJob;

    public void processWords() {
        mapReduceJob.submit();
    }
}
Running a Job
Distributed Cache
Using Scripts
Scripting Implicit Variables
Scripting Support in HDFS
• FsShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output directories, set flags, etc.
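A housekeeping step of this kind might be declared as follows (a sketch; the script body and paths are illustrative, and `fsh` is Spring Hadoop's implicit FsShell variable):

```xml
<!-- Illustrative housekeeping script; paths are hypothetical -->
<hdp:script id="setup-script" language="groovy" run-at-startup="true">
    // fsh is the implicit FsShell variable
    if (fsh.test("/wc/output")) {
        fsh.rmr("/wc/output")    // clean the output directory
    }
    fsh.mkdir("/wc/input")       // make sure the input directory exists
</hdp:script>
```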
SPRING BATCH
What is it about?
• Born out of collaboration with Accenture in 2007
• Fully automated processing of large volumes of data.
• Logging, transaction management, listeners, job statistics, restart, skipping, and resource management.
• Automatic retries after failure
• Synchronous, asynchronous and parallel processing
• Data partitioning
Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows.
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled.
• Steps can be sequential, conditional, split, concurrent, or programmatically determined.
• Works with flat files, XML, or databases.
Spring Batch Configuration
• Jobs are composed of steps
<job id="job1">
    <step id="import" next="wc">
        <tasklet ref="import-tasklet"/>
    </step>
    <step id="wc" next="pig">
        <tasklet ref="wordcount-tasklet"/>
    </step>
    <step id="pig" next="parallel">
        <tasklet ref="pig-tasklet"/>
    </step>
    <split id="parallel" next="hdfs">
        <flow>
            <step id="mrStep">
                <tasklet ref="mr-tasklet"/>
            </step>
        </flow>
        <flow>
            <step id="hive">
                <tasklet ref="hive-tasklet"/>
            </step>
        </flow>
    </split>
    <step id="hdfs">
        <tasklet ref="hdfs-tasklet"/>
    </step>
</job>
Spring Data Hadoop Integration
SPRING BOOT
What is it about?
• Builds production-ready Spring applications.
• Creates a “runnable” jar with dependencies and classpath settings.
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out of the box features:
• statistics, metrics, health checks and externalized configuration
• No code generation and no requirement for XML configuration.
PUTTING IT ALL TOGETHER
Spring Data Flow Components
[Component diagram: a Spring Boot runnable jar drives the flow; Spring Batch 2.0 orchestrates the extract step (MapReduce on HDP 1.3) and the load step (Hive 0.12.0) through Spring Hadoop; all data lives in HDFS.]
Hierarchical View
[Hierarchical view: Spring Boot sits on top of Spring Batch, which provides job control (notifications, validation, scheduling); Spring Batch in turn sits on Spring Hadoop, which provides the data flow and callbacks.]
HADOOP DATA FLOWS, SPRINGIFIED
Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• External properties file.
• At runtime via system properties: -Dproperty.name=property.value
<configuration>
    fs.default.name=${hd.fs}
    io.sort.mb=${io.sort.mb:640mb}
    mapred.reduce.tasks=${mapred.reduce.tasks:1}
    mapred.job.tracker=${hd.jt:local}
    mapred.child.java.opts=${mapred.child.java.opts}
</configuration>
MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
<job id="metricsMR"
     input-path="${mapred.input.path}"
     output-path="${mapred.output.path}"
     mapper="GenericETLMapper"
     reducer="GenericETLReducer"
     input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
     output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
     key="TextArrayWritable"
     value="org.apache.hadoop.io.NullWritable"
     map-key="org.apache.hadoop.io.Text"
     map-value="org.apache.hadoop.io.Text"
     jar-by-class="GenericETLMapper">
    volga.etl.dto.class=Metric
</job>
MapReduce Jobs
• Jobs are wrapped into Tasklet definitions
<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from MapReduce phase to Hive phase
<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>

<hive-tasklet id="load-notifications">
    <script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>

<hive-tasklet id="load-metrics">
    <script location="classpath:hive/ddl/metrics-load.hql">
        <arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
    </script>
</hive-tasklet>
Spring Batch Configuration
• One Spring Batch job per entity.
<job id="metrics" restartable="false" parent="VolgaETLJob">
    <step id="cleanMetricsOutputDirectory" next="metricsMapReduce">
        <tasklet ref="setUpJobTasklet"/>
    </step>
    <step id="metricsMapReduce">
        <tasklet ref="metricsJobTasklet">
            <listeners>
                <listener ref="mapReduceErrorThresholdListener"/>
            </listeners>
        </tasklet>
        <fail on="FAILED" exit-code="Map Reduce Step Failed"/>
        <end on="COMPLETED"/>
        <!--<next on="*" to="loadMetricsIntoHive"/>-->
    </step>
    <step id="loadMetricsIntoHive">
        <tasklet ref="load-metrics"/>
    </step>
</job>
Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical audit data (based on thresholds)
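A threshold listener like the one referenced above could be wired up as a plain Spring bean (a sketch; the class and property names are hypothetical, only the bean id comes from the job definition):

```xml
<!-- Hypothetical bean definition for the error-threshold listener -->
<bean id="mapReduceErrorThresholdListener"
      class="com.example.etl.MapReduceErrorThresholdListener">
    <!-- e.g. fail the step if more than 1% of rows are bad -->
    <property name="errorThreshold" value="0.01"/>
</bean>
```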
Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by Maven plugin
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <configuration>
        <finalName>maas-etl-${project.version}</finalName>
        <classifier>spring</classifier>
        <mainClass>com.rackspace....JobRunner</mainClass>
        <excludeGroupIds>org.slf4j</excludeGroupIds>
    </configuration>
</plugin>
HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
Overview
• Translates SQL-like queries (HiveQL) into MapReduce jobs.
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc.
HIVE | 47
Hive vs. RDBMS
Hive                                                             | Traditional Databases
SQL interface                                                    | SQL interface
Focus on batch analytics                                         | Mostly online, interactive analytics
No transactions                                                  | Transactions are their way of life
No random inserts; updates not natively supported (but possible) | Random inserts and updates
Distributed processing via MapReduce                             | Distributed processing capabilities vary
Scales to hundreds of nodes                                      | Seldom scales beyond 20 nodes
Built for commodity hardware                                     | Expensive, proprietary hardware
Low cost per petabyte                                            | "What's a petabyte?"
Abstraction Layers in Hive
[Diagram: a database contains tables; each table is divided into partitions, by skewed or unskewed keys; partitions can optionally be subdivided into buckets.]
Schemas and File Formats
• We used the ORCFile format: built-in, easy to use and efficient.
• Efficient lightweight and generic compression
• Run-length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (the default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries.
• Flexible Data Model
• All Hive types are supported, including maps, structs and unions.
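For example, compression and stripe size can be chosen per table via table properties (a sketch; the table and columns are hypothetical, the property names are standard ORC ones):

```sql
-- Illustrative: picking Snappy instead of the ZLib default
CREATE TABLE metrics_orc (
    entity_id    STRING,
    metric_value DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
    "orc.compress"    = "SNAPPY",
    "orc.stripe.size" = "268435456"  -- 256 MB stripes
);
```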
The ORC File Format
• An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer.
• Default size is 256 MB (orc.stripe.size).
• Large stripes allow efficient reads from HDFS and can be configured independently of the HDFS block size.
The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
ORC File Index Skipping
Skipping works for number types and for string types.
Done by recording a min and max value inside the inline index and determining if the lookup value falls outside that range.
The ORC File Format: File Footer
• Contains the list of stripes in the file, the number of rows per stripe, and each column's data type.
• Column-level aggregates: count, min, max, and sum.
• ORC uses the file footer to locate each column's data streams.
Predicate Pushdowns
• “Push down” parts of the query to where the data is.
• filter/skip as much data as possible, and
• greatly reduce input size.
• Sorting a table on its secondary keys also reduces execution time.
• Sorted columns are grouped together in one area on disk and the other pieces will be skipped very quickly.
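In Hive this typically means enabling predicate pushdown and the ORC row-level index filter (the settings are standard Hive properties; the query and table names are hypothetical):

```sql
SET hive.optimize.ppd = true;           -- push predicates down to the storage layer
SET hive.optimize.index.filter = true;  -- use ORC's inline min/max indexes

-- With data sorted on entity_id, ORC can skip stripes and row groups
-- whose [min, max] range excludes the lookup value.
SELECT metric_value
FROM   metrics_prod
WHERE  entity_id = 'en1234';
```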
[Diagram: ORC file internal layout.]
Query Performance
• Lower-latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes.
• ORCFile format.
• Sort data ahead of time.
• Simplifies joins and makes ORCFile skipping more effective.
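Sorting ahead of time can be done at load time (a sketch with hypothetical table and column names):

```sql
-- Illustrative: sort rows within each reducer while loading,
-- so the ORC min/max indexes become selective on entity_id
INSERT OVERWRITE TABLE metrics_prod PARTITION (dt = '2014-06-01')
SELECT entity_id, check_type, metric_value
FROM   metrics_staging
SORT BY entity_id;
```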
The Big Picture
[Big-picture diagram: data preprocessing starts with JSON in HDFS and runs through MapReduce to produce Hive files; the data load step dynamically loads the staging table into the production table, applying partitioning, bucketing, and indexing; data access is via an API, the Hive CLI, and Apache Thrift.]