Faster Data Flows with Hive, Spring and Hadoop
Alex Silva, Principal Data Engineer
DATA ADVENTURES AT RACKSPACE
• Datasets
• Data pipeline: flows and systems
• Creating a generic Hadoop ETL framework
• Integrating Hadoop with Spring
• Spring Hadoop, Spring Batch and Spring Boot
• Hive
• File formats
• Queries and performance
MAAS Dataset
• System and platform monitoring
• Pings, SSH, HTTP, HTTPS checks
• Remote monitoring
• CPU, file system, load average, disk, memory
• MySQL, Apache
THE BUSINESS DOMAIN | 3
The Dataset
• Processing around 1.5B records/day
• Stored in Cassandra
• Exported to HDFS in batches
• TBs of uncompressed JSON (“raw data”) daily
• First dataset piped through ETL platform
DATA ENGINEERING STATS | 4
DATA PIPELINE
• Data flow
• Stages
• ETL
• Input formats
• Generic Transformation Layer
• Outputs
Data Flow Diagram
DATA FLOW | 6
[Flow diagram: Flume exports monitoring data as JSON into HDFS. If the export is available and well-formed, the ETL step extracts and transforms the JSON data; bad rows and errors are logged to CSV, while good rows go to a staging file. The load step then moves data from the staging table into the production Hive table, applying partitioning, bucketing, and indexing.]
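The staging-to-production load at the end of this flow can be sketched in HiveQL. This is a minimal sketch with hypothetical table and column names (the real DDL is not shown in the slides); it assumes dynamic partitioning is enabled:

```sql
-- Hypothetical names; illustrative only.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Move cleaned rows from the staging table into the partitioned,
-- bucketed production table.
INSERT OVERWRITE TABLE metrics_prod PARTITION (dt)
SELECT entity_id, check_type, metric_value, dt
FROM metrics_staging;
```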
Systems Diagram
SYSTEMS | 7
[Systems diagram: monitoring events reach Flume 1.5.0 through the Flume Log4J appender and are exported as JSON into HDFS. MapReduce 1.2.0.1.3.2.0 extracts the data, sending bad records to a separate sink, and the results are loaded into Hive 0.12.0 for end-user access.]
ETL Summary
• Extract
• JSON files in HDFS
• Transform
• Generic Java based ETL framework
• MapReduce jobs extract features
• Quality checks
• Load
• Load data into partitioned ORC Hive tables
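As a rough sketch of what such a table could look like (names, columns, and bucket count are hypothetical, not taken from the actual pipeline):

```sql
-- Illustrative only: a partitioned, bucketed ORC table
CREATE TABLE metrics_prod (
    entity_id    STRING,
    check_type   STRING,
    metric_value DOUBLE
)
PARTITIONED BY (dt STRING)               -- e.g. one partition per day
CLUSTERED BY (entity_id) INTO 32 BUCKETS
STORED AS ORC;
```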
HADOOP
Hadoop: Pros
• Dataset volume
• Data volume grows rapidly
• Integrates with existing ecosystem
• HiveQL
• Experimentation and exploration
• No expensive software or hardware to buy
TOOLS AND TECHNOLOGIES | 10
Hadoop: Cons
• Job monitoring and scheduling
• Data quality
• Error handling and notification
• Programming model
• Generic framework mitigates some of that
CAN WE OVERCOME SOME OF THOSE?
Keeping the Elephant “Lean”
• Job control without the complexity of external tools
• Checks and validations
• Unified configuration model
• Integration with scripts
• Automation
• Job restartability
DATA ENGINEERING | 13
HEY! WHAT ABOUT SPRING?
SPRING DATA HADOOP
What is it about?
• Part of the Spring Framework
• Run Hadoop apps as standard Java apps using DI
• Unified declarative configuration model
• APIs to run MapReduce, Hive, and Pig jobs.
• Script HDFS operations using any JVM based languages.
• Supports both classic MR and YARN
The Apache Hadoop Namespace
Also supports annotation-based configuration via the @EnableHadoop annotation.
Job Configuration: Standard Hadoop APIs
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Configuring Hadoop with Spring
SPRING HADOOP | 19
<context:property-placeholder location="hadoop-dev.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="word-count-job"
         input-path="${input.path}"
         output-path="${output.path}"
         jar="hadoop-examples.jar"
         mapper="examples.WordCount.WordMapper"
         reducer="examples.WordCount.IntSumReducer"/>

<hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000
Configuration Attributes
Creating a Job
Injecting Jobs
• Use DI to obtain a reference to a Spring-managed Hadoop job
• Perform additional validation and configuration before submitting
public class WordService {

    @Autowired
    private Job mapReduceJob;

    public void processWords() {
        mapReduceJob.submit();
    }
}
Running a Job
Distributed Cache
Using Scripts
Scripting Implicit Variables
Scripting Support in HDFS
• FsShell is designed to support scripting languages
• Use these for housekeeping tasks:
• Check for files, prepare input data, clean output directories, set flags, etc.
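A housekeeping step of this kind might be declared as follows (a sketch; the script body and paths are illustrative, and `fsh` is Spring Hadoop's implicit FsShell variable):

```xml
<!-- Illustrative housekeeping script; paths are hypothetical -->
<hdp:script id="setup-script" language="groovy" run-at-startup="true">
    // fsh is the implicit FsShell variable
    if (fsh.test("/wc/output")) {
        fsh.rmr("/wc/output")    // clean the output directory
    }
    fsh.mkdir("/wc/input")       // make sure the input directory exists
</hdp:script>
```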
SPRING BATCH
What is it about?
• Born out of collaboration with Accenture in 2007
• Fully automated processing of large volumes of data.
• Logging, transaction management, listeners, job statistics, restart, skipping, and resource management.
• Automatic retries after failure
• Synchronous, asynchronous and parallel processing
• Data partitioning
Hadoop Workflow Orchestration
• Complex data flows
• Reuses batch infrastructure to manage Hadoop workflows.
• Steps can be any Hadoop job type or HDFS script
• Jobs can be invoked by events or scheduled.
• Steps can be sequential, conditional, split, concurrent, or programmatically determined.
• Works with flat files, XML, or databases.
Spring Batch Configuration
• Jobs are composed of steps
<job id="job1">
    <step id="import" next="wc">
        <tasklet ref="import-tasklet"/>
    </step>
    <step id="wc" next="pig">
        <tasklet ref="wordcount-tasklet"/>
    </step>
    <step id="pig" next="parallel">
        <tasklet ref="pig-tasklet"/>
    </step>
    <split id="parallel" next="hdfs">
        <flow>
            <step id="mrStep">
                <tasklet ref="mr-tasklet"/>
            </step>
        </flow>
        <flow>
            <step id="hive">
                <tasklet ref="hive-tasklet"/>
            </step>
        </flow>
    </split>
    <step id="hdfs">
        <tasklet ref="hdfs-tasklet"/>
    </step>
</job>
Spring Data Hadoop Integration
SPRING BOOT
What is it about?
• Builds production-ready Spring applications.
• Creates a “runnable” jar with dependencies and classpath settings.
• Can embed Tomcat or Jetty within the JAR
• Automatic configuration
• Out of the box features:
• statistics, metrics, health checks and externalized configuration
• No code generation and no requirement for XML configuration.
PUTTING IT ALL TOGETHER
Spring Data Flow Components
[Component diagram: a Spring Boot runnable jar drives the flow; Spring Batch 2.0 orchestrates the extract step (MapReduce on HDP 1.3) and the load step (Hive 0.12.0) through Spring Hadoop; all data lives in HDFS.]
Hierarchical View
[Hierarchical view: Spring Boot sits on top of Spring Batch, which provides job control (notifications, validation, scheduling); Spring Batch in turn sits on Spring Hadoop, which provides the data flow and callbacks.]
HADOOP DATA FLOWS, SPRINGIFIED
Spring Hadoop Configuration
• Job parameters configured by Spring
• Sensible defaults used
• Parameters can be overridden:
• External properties file.
• At runtime via system properties: -Dproperty.name=property.value
<configuration>
    fs.default.name=${hd.fs}
    io.sort.mb=${io.sort.mb:640mb}
    mapred.reduce.tasks=${mapred.reduce.tasks:1}
    mapred.job.tracker=${hd.jt:local}
    mapred.child.java.opts=${mapred.child.java.opts}
</configuration>
MapReduce Jobs
• Configured via Spring Hadoop
• One job per entity
<job id="metricsMR"
     input-path="${mapred.input.path}"
     output-path="${mapred.output.path}"
     mapper="GenericETLMapper"
     reducer="GenericETLReducer"
     input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
     output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
     key="TextArrayWritable"
     value="org.apache.hadoop.io.NullWritable"
     map-key="org.apache.hadoop.io.Text"
     map-value="org.apache.hadoop.io.Text"
     jar-by-class="GenericETLMapper">
    volga.etl.dto.class=Metric
</job>
MapReduce Jobs
• Jobs are wrapped into Tasklet definitions
<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>
Hive Configuration
• Hive steps also defined as tasklets
• Parameters are passed from MapReduce phase to Hive phase
<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>

<hive-tasklet id="load-notifications">
    <script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>

<hive-tasklet id="load-metrics">
    <script location="classpath:hive/ddl/metrics-load.hql">
        <arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
    </script>
</hive-tasklet>
Spring Batch Configuration
• One Spring Batch job per entity.
<job id="metrics" restartable="false" parent="VolgaETLJob">
    <step id="cleanMetricsOutputDirectory" next="metricsMapReduce">
        <tasklet ref="setUpJobTasklet"/>
    </step>
    <step id="metricsMapReduce">
        <tasklet ref="metricsJobTasklet">
            <listeners>
                <listener ref="mapReduceErrorThresholdListener"/>
            </listeners>
        </tasklet>
        <fail on="FAILED" exit-code="Map Reduce Step Failed"/>
        <end on="COMPLETED"/>
        <!--<next on="*" to="loadMetricsIntoHive"/>-->
    </step>
    <step id="loadMetricsIntoHive">
        <tasklet ref="load-metrics"/>
    </step>
</job>
Spring Batch Listeners
• Monitor job flow
• Take action on job failure
• PagerDuty notifications
• Save job counters to the audit database
• Notify team if counters are not consistent with historical audit data (based on thresholds)
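A threshold listener like the one referenced above could be wired up as a plain Spring bean (a sketch; the class and property names are hypothetical, only the bean id comes from the job definition):

```xml
<!-- Hypothetical bean definition for the error-threshold listener -->
<bean id="mapReduceErrorThresholdListener"
      class="com.example.etl.MapReduceErrorThresholdListener">
    <!-- e.g. fail the step if more than 1% of rows are bad -->
    <property name="errorThreshold" value="0.01"/>
</bean>
```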
Spring Boot: Pulling Everything Together
• Runnable jar created during build process
• Controlled by Maven plugin
<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <configuration>
        <finalName>maas-etl-${project.version}</finalName>
        <classifier>spring</classifier>
        <mainClass>com.rackspace....JobRunner</mainClass>
        <excludeGroupIds>org.slf4j</excludeGroupIds>
    </configuration>
</plugin>
HIVE
• Typical Use Cases
• File formats
• ORC
• Abstractions
• Hive in the monitoring pipeline
• Query performance
Overview
• Translates SQL-like queries (HiveQL) into MapReduce jobs.
• Structured and unstructured data in multiple formats
• Standard access protocols, including JDBC and Thrift
• Provides several serialization mechanisms
• Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc.
HIVE | 47
Hive vs. RDBMS
Hive                                                             | Traditional Databases
SQL interface                                                    | SQL interface
Focus on batch analytics                                         | Mostly online, interactive analytics
No transactions                                                  | Transactions are their way of life
No random inserts; updates not natively supported (but possible) | Random inserts and updates
Distributed processing via MapReduce                             | Distributed processing capabilities vary
Scales to hundreds of nodes                                      | Seldom scales beyond 20 nodes
Built for commodity hardware                                     | Expensive, proprietary hardware
Low cost per petabyte                                            | "What's a petabyte?"
Abstraction Layers in Hive
[Diagram: a database contains tables; each table is divided into partitions, by skewed or unskewed keys; partitions can optionally be subdivided into buckets.]
Schemas and File Formats
• We used the ORCFile format: built-in, easy to use and efficient.
• Efficient lightweight and generic compression
• Run-length encoding for integers and strings, dictionary encoding, etc.
• Generic compression: Snappy, LZO, and ZLib (the default)
• High performance
• Indexes value ranges within blocks of ORCFile data
• Predicate filter pushdown allows efficient scanning during queries.
• Flexible Data Model
• All Hive types are supported, including maps, structs and unions.
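For example, compression and stripe size can be chosen per table via table properties (a sketch; the table and columns are hypothetical, the property names are standard ORC ones):

```sql
-- Illustrative: picking Snappy instead of the ZLib default
CREATE TABLE metrics_orc (
    entity_id    STRING,
    metric_value DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
    "orc.compress"    = "SNAPPY",
    "orc.stripe.size" = "268435456"  -- 256 MB stripes
);
```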
The ORC File Format
• An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer.
• Default size is 256 MB (orc.stripe.size).
• Large stripes allow efficient reads from HDFS and can be configured independently of the HDFS block size.
The ORC File Format: Index
• Doesn’t answer queries
• Required for skipping rows:
• Row index entries provide offsets that enable seeking
• Min and max values for each column
ORC File Index Skipping
Skipping works for number types and for string types.
Done by recording a min and max value inside the inline index and determining if the lookup value falls outside that range.
The ORC File Format: File Footer
• Contains the list of stripes in the file, the number of rows per stripe, and each column's data type.
• Column-level aggregates: count, min, max, and sum.
• ORC uses the file footer to locate each column's data streams.
Predicate Pushdowns
• “Push down” parts of the query to where the data is.
• filter/skip as much data as possible, and
• greatly reduce input size.
• Sorting a table on its secondary keys also reduces execution time.
• Sorted columns are grouped together in one area on disk and the other pieces will be skipped very quickly.
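In Hive this typically means enabling predicate pushdown and the ORC row-level index filter (the settings are standard Hive properties; the query and table names are hypothetical):

```sql
SET hive.optimize.ppd = true;           -- push predicates down to the storage layer
SET hive.optimize.index.filter = true;  -- use ORC's inline min/max indexes

-- With data sorted on entity_id, ORC can skip stripes and row groups
-- whose [min, max] range excludes the lookup value.
SELECT metric_value
FROM   metrics_prod
WHERE  entity_id = 'en1234';
```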
[Diagram: ORC file internal layout.]
Query Performance
• Lower-latency Hive queries rely on two major factors:
• Sorting and skipping data as much as possible
• Minimizing data shuffle from mappers to reducers
Improving Query Performance
• Divide data among different files/directories
• Partitions, buckets, etc.
• Skip records using small embedded indexes.
• ORCFile format.
• Sort data ahead of time.
• Simplifies joins and makes ORCFile skipping more effective.
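Sorting ahead of time can be done at load time (a sketch with hypothetical table and column names):

```sql
-- Illustrative: sort rows within each reducer while loading,
-- so the ORC min/max indexes become selective on entity_id
INSERT OVERWRITE TABLE metrics_prod PARTITION (dt = '2014-06-01')
SELECT entity_id, check_type, metric_value
FROM   metrics_staging
SORT BY entity_id;
```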
The Big Picture
[Big-picture diagram: data preprocessing starts with JSON in HDFS and runs through MapReduce to produce Hive files; the data load step dynamically loads the staging table into the production table, applying partitioning, bucketing, and indexing; data access is via an API, the Hive CLI, and Apache Thrift.]