© 2017 IBM Corporation
Ingesting Data at Blazing Speed with Apache ORC
Gustavo Arocena
IBM Toronto Lab
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2017. All rights reserved.
U.S. Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, BigInsights, and Big SQL are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or TM), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
▪TPC Benchmark, TPC-DS, and QphDS are trademarks of Transaction Processing Performance Council
▪Cloudera, the Cloudera logo, Cloudera Impala are trademarks of Cloudera.
▪Hortonworks, the Hortonworks logo and other Hortonworks trademarks are trademarks of Hortonworks Inc. in the United States and other countries.
▪Other company, product, or service names may be trademarks or service marks of others.
Agenda
What is Big SQL?
From Parquet to ORC
Reading ORC Files Fast
Scaling Up
Tuning Big SQL for ORC
What is Big SQL?
[Diagram: Big SQL alongside Data Security, Metastore, Cluster Mgmt., and Administration components; runs on Data Platform]
The Big SQL Advantages
Scale
• Only engine to run TPC-DS at 100TB scale
Complex SQL
• Capable of running all 99 TPC-DS queries since 2014
• Complex queries optimized with IBM Cost-based optimizer
Concurrency
• Handles highly concurrent workloads gracefully
• 12 stream TPC-DS
Efficient Resource Utilization
• Memory
• CPU
• IO
What’s the Big Deal?
Metrics for Big SQL 4.2.5 vs. Spark SQL 2.1
▪ Hadoop DS @ 100TB, 4 streams
[Charts: Big SQL vs. Spark SQL]
▪ Elapsed Time (hours): Big SQL 13.7, Spark SQL 43.2 (about 1/3)
▪ CPU Utilization (%): Big SQL 76.4, Spark SQL 88.2 (about 15% lower)
▪ Disk Reads (MB/sec): Big SQL 107, Spark SQL 388 (about 1/3)
▪ Disk Writes (MB/sec): Big SQL 25, Spark SQL 237 (about 1/9)
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
Big SQL Architecture (as of 2016)
[Diagram: a Head Node and three Worker Nodes, with Parquet IO and Hive Compat. IO readers, the Hive Metastore, HDFS, and the HDFS NameNode]
2017: Big SQL on HDP
Most popular data format on HDP is ORC
ORC performance becomes top priority
ORC vs Parquet in Big SQL 4.2
[Chart: Parquet vs ORC v0, 1TB TPC-DS, elapsed time (sec)]
▪ 1 Stream: 70% slower with ORC
▪ 4 Streams: 315% slower with ORC (ORC elapsed time > 50000 sec)
Limitations of Hive Compatibility IO Engine
Slow Ingestion
• Single row at a time
• JIT unfriendly
• Data values as Java objects
Low Scalability
• Large memory footprint per scan
• Excessive CPU use
• Overloaded disks
The Roadmap Towards Fast ORC ingestion
1st Phase
• Big SQL 4.2.5, Dec ‘16
• Fast ORC Ingestion
2nd Phase
• Big SQL 5.0.1, Aug ‘17
• ORC at Scale
1st Phase – Fast Ingestion using Apache ORC
[Chart: Parquet vs ORC v1, 1TB TPC-DS, elapsed time (sec)]
▪ 1 Stream: 2% faster with ORC
▪ 4 Streams: 65% slower with ORC
Apache ORC libs key benefits
▪ Many-row-at-a-time API
▪ Enable JIT-friendly code
▪ Represent data using primitive Java types
▪ Make projection and selection pushdown very easy
2nd Phase - Managing Resources
[Chart: Parquet vs ORC v2, 1TB TPC-DS, elapsed time (sec)]
▪ 1 Stream: 15% faster with ORC
▪ 4 Streams: 3.4% faster with ORC
Resource Manager has global oversight over
▪ Total number of threads
▪ Overall JVM heap consumption
▪ Degree of parallelism per scan
ORC as a First Class Citizen in 5.0.1
[Diagram: a Head Node and three Worker Nodes, now with an ORC IO reader next to the Parquet IO and Hive Compat. IO readers, plus the Hive Metastore, HDFS, and the HDFS NameNode]
What is Apache ORC?
ORC = efficient storage + fast ingestion
Compression
• Type-specific encodings (RLE for numbers, dictionary for strings, etc.)
• Generic compression (Zlib, Snappy)
Data skipping
• Column skipping based on data layout
• Row skipping based on MIN/MAX stats and bloom filters
JIT friendly
• Vectorized APIs (retrieve data as arrays of primitive values)
▪ Engines leverage all these features
▪ Apache ORC libs allow applications to leverage them too
ORC Physical Data Layout
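CREATE HADOOP TABLE SALES (id INTEGER,
                           quantity INTEGER,
                           amount DOUBLE)
[Diagram: the file is a sequence of stripes (one per HDFS block). Each stripe is divided into row groups of 10K rows, and each row group stores its 10K id values, 10K quantity values, and 10K amount values column by column, together with row group stats. The file also carries stripe stats (one set per stripe) and file stats.]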
Dependencies and Classes
▪ Java Dependencies (org.apache.orc group id in Maven)
orc-core-1.4.0-nohive.jar
aircompressor-0.3.jar
▪ Java Classes for “vectorized” processing
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.apache.orc.storage.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.storage.ql.exec.vector.LongColumnVector;
import org.apache.orc.storage.ql.exec.vector.DoubleColumnVector;
import org.apache.orc.storage.ql.io.sarg.PredicateLeaf;
import org.apache.orc.storage.ql.io.sarg.SearchArgument;
import org.apache.orc.storage.ql.io.sarg.SearchArgumentFactory;
Using Vectorized ORC APIs
Reader r = OrcFile.createReader(path, OrcFile.readerOptions(conf));
RecordReader rr = r.rows();
VectorizedRowBatch batch = r.getSchema().createRowBatch(1000);
[Diagram: the batch holds one array of 1000 values per column: long[] id, long[] quantity, double[] amount]
▪ A vectorized row batch is a Java object that contains 1000 decoded rows
JIT friendly code
// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) {
    long[] qty = ((LongColumnVector) batch.cols[1]).vector;
    double[] amt = ((DoubleColumnVector) batch.cols[2]).vector;
    for (int i = 0; i < batch.size; i++) {
        if (qty[i] < 500) {
            sum += amt[i];
        }
    }
}
▪ Get the total for sales involving fewer than 500 items
SELECT sum(amount)
FROM sales
WHERE quantity < 500
• No objects
• No method calls
• Tight loop compiles to machine code
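The loop above reads the `vector` arrays directly, which is only safe when a column has no nulls and its values are not collapsed into a single repeated entry. ORC column vectors also expose `noNulls`, `isNull`, and `isRepeating` flags. The sketch below shows one way a loop might honor them while still working on primitive arrays. `Col` is a hypothetical stand-in written for this example, not an ORC class, so the snippet runs without the ORC jars.

```java
/** Simplified model of a two-column batch; Col is a stand-in, NOT an ORC class. */
class BatchSumSketch {
    /** Mimics the public fields of an ORC column vector (hypothetical model). */
    static final class Col {
        long[] longs;            // values of an integer column
        double[] doubles;        // values of a floating-point column
        boolean[] isNull;        // per-row null flags, only meaningful when noNulls is false
        boolean noNulls = true;  // true when the column holds no nulls in this batch
        boolean isRepeating;     // true when entry 0 stands for every row in the batch
    }

    /** Sums amt over the rows where qty < limit, skipping rows where either value is null. */
    static double sumAmountWhereQtyBelow(Col qty, Col amt, int size, long limit) {
        double sum = 0;
        for (int i = 0; i < size; i++) {
            int q = qty.isRepeating ? 0 : i;   // a repeating column keeps its value at index 0
            int a = amt.isRepeating ? 0 : i;
            if (!qty.noNulls && qty.isNull[q]) continue;
            if (!amt.noNulls && amt.isNull[a]) continue;
            if (qty.longs[q] < limit) {
                sum += amt.doubles[a];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Col qty = new Col();
        qty.longs = new long[] {100, 600, 200, 300};
        Col amt = new Col();
        amt.doubles = new double[] {1.5, 2.0, 3.0, 4.0};
        amt.noNulls = false;
        amt.isNull = new boolean[] {false, false, false, true};
        // Rows 0 and 2 qualify; row 1 fails the predicate; row 3 has a null amount.
        System.out.println(sumAmountWhereQtyBelow(qty, amt, 4, 500)); // prints 4.5
    }
}
```

The extra branches stay on primitive fields, so the loop should remain a reasonable candidate for JIT compilation. A production reader would more likely hoist the flag checks out of the loop and keep separate fast paths.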
Column Skipping/Pruning (a.k.a. Projection Pushdown)
▪ If we don’t say otherwise, ORC will read all the columns
▪ But our query is using only two columns
[Diagram: columns ID, QUANTITY, AMOUNT]
SELECT sum(amount)
FROM sales
WHERE quantity < 500
Column Skipping/Pruning (a.k.a. Projection Pushdown)
// Projection, indexed by ORC column id: 0 = root struct, 1 = id, 2 = quantity, 3 = amount
boolean[] projection = new boolean[] {true, false, true, true};
// ORC RecordReader with projection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection));
// Compute sum(amount)
double sum = 0;
while (rr.nextBatch(batch)) { … }
SELECT sum(amount)
FROM sales
WHERE quantity < 500
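Writing the projection mask by hand gets error-prone on wide tables. The sketch below builds it from column names. It assumes a flat schema (no nested types) and that the mask is indexed by ORC column id with the root struct at position 0; both assumptions are worth checking against the ORC version in use. `IncludeMaskSketch` and `includeMask` are hypothetical names and use only the JDK.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Builds a projection mask for a flat schema; a sketch, not part of the ORC API. */
class IncludeMaskSketch {
    /**
     * Returns a mask with one slot for the root struct (index 0, assumed always included)
     * followed by one slot per top-level field, in schema order.
     */
    static boolean[] includeMask(List<String> fieldNames, Set<String> wanted) {
        boolean[] include = new boolean[fieldNames.size() + 1];
        include[0] = true; // root struct
        for (int i = 0; i < fieldNames.size(); i++) {
            include[i + 1] = wanted.contains(fieldNames.get(i));
        }
        return include;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "quantity", "amount");
        Set<String> wanted = new HashSet<>(Arrays.asList("quantity", "amount"));
        System.out.println(Arrays.toString(includeMask(schema, wanted)));
        // prints [true, false, true, true]
    }
}
```

With the real library, the field names could come from the file schema, and the resulting array would be passed to `Reader.Options.include`.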
Column Skipping/Pruning (a.k.a. Projection Pushdown)
create external hadoop table web_sales (
ws_sold_date_sk int, ws_sold_time_sk int, ws_ship_date_sk int, ws_item_sk int not null, ws_bill_customer_sk int, ws_bill_cdemo_sk int, ws_bill_hdemo_sk int, ws_bill_addr_sk int, ws_ship_customer_sk int, ws_ship_cdemo_sk int, ws_ship_hdemo_sk int, ws_ship_addr_sk int, ws_web_page_sk int, ws_web_site_sk int, ws_ship_mode_sk int, ws_warehouse_sk int, ws_promo_sk int, ws_order_number bigint not null, ws_quantity bigint, ws_wholesale_cost double, ws_list_price double, ws_sales_price double, ws_ext_discount_amt double, ws_ext_sales_price double, ws_ext_wholesale_cost double, ws_ext_list_price double, ws_ext_tax double, ws_coupon_amt double, ws_ext_ship_cost double, ws_net_paid double, ws_net_paid_inc_tax double, ws_net_paid_inc_ship double, ws_net_paid_inc_ship_tax double, ws_net_profit double
)
STORED AS ORC;
SELECT max(ws_sold_date_sk)
FROM tpcds_1tb.web_sales
[Chart: elapsed time (seconds)]
▪ With Projection Pushdown: 3.5
▪ Without Projection Pushdown: 24.9
Row Skipping/Pruning (a.k.a. Predicate Pushdown)
▪ Row skipping leverages the MIN/MAX stats
Stripes
• quantity MIN: 274, MAX: 590
• quantity MIN: 603, MAX: 3000
Row groups
• quantity MIN: 510, MAX: 540
• quantity MIN: 330, MAX: 420
SELECT sum(amount)
FROM sales
WHERE quantity < 500
• Data must be sorted for pruning to be effective!
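To illustrate why sorting matters, the sketch below models MIN-based skipping for the predicate quantity < 500 on a scaled-down data set (1000 values in row groups of 100 instead of 10K). It is a plain-JDK model of the idea, not ORC code; `MinMaxPruningSketch` and `skippableGroups` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Toy model of row-group skipping with MIN stats for a "value < limit" predicate. */
class MinMaxPruningSketch {
    /** Counts row groups whose MIN is already >= limit, so no row in them can match. */
    static int skippableGroups(List<Long> values, int groupSize, long limit) {
        int skipped = 0;
        for (int start = 0; start < values.size(); start += groupSize) {
            int end = Math.min(start + groupSize, values.size());
            long min = Long.MAX_VALUE;
            for (int i = start; i < end; i++) {
                min = Math.min(min, values.get(i));
            }
            if (min >= limit) {
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Long> sorted = new ArrayList<>();
        for (long q = 0; q < 1000; q++) {
            sorted.add(q);
        }
        List<Long> shuffled = new ArrayList<>(sorted);
        Collections.shuffle(shuffled, new Random(42));

        // Sorted data: the five groups covering 500..999 can be skipped outright.
        System.out.println("sorted:   " + skippableGroups(sorted, 100, 500) + " of 10 groups skipped");
        // Shuffled data: small values are likely to land in every group, so few or none qualify.
        System.out.println("shuffled: " + skippableGroups(shuffled, 100, 500) + " of 10 groups skipped");
    }
}
```

Applied to the row-group stats on the slide, the same check skips the group with MIN 510 and reads the group with MIN 330.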
Row Skipping/Pruning (a.k.a. Predicate Pushdown)
// Predicates (the leaf is wrapped in an AND group, which the builder expects)
SearchArgument selection = SearchArgumentFactory.newBuilder()
    .startAnd()
    .lessThan("quantity", PredicateLeaf.Type.LONG, 500L)
    .end()
    .build();
// ORC RecordReader with projection and selection pushdown
RecordReader rr = r.rows(
    new Reader.Options()
        .include(projection)
        .searchArgument(selection, new String[] {}));
// Compute sum(amount); the loop still tests quantity < 500,
// because pushdown only skips whole stripes and row groups
double sum = 0;
while (rr.nextBatch(batch)) { ... }
SELECT sum(amount)
FROM sales
WHERE quantity < 500
Scaling Up
▪ Reading too many files at once hurts performance and can cause OOM
▪ How many? It depends on
• Number of disks
• Java heap size
▪ Must have multiple disks AND the files must be evenly distributed across the disks
▪ But the number of threads must be limited by the Java heap size!
• Need 30 to 80 MB of Java heap per open ORC file/split
• Biggest consumers: RecordReader, VectorizedRowBatch
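The sizing rule above can be turned into simple arithmetic. The sketch below uses the 80 MB worst case per open file from the slide. The share of the heap reserved for scans (50%) and the number of reader threads per disk (2) are illustrative assumptions, not Big SQL defaults, and the class and method names are hypothetical.

```java
/** Back-of-the-envelope sizing for concurrent ORC readers; all knobs are illustrative. */
class OpenFileBudgetSketch {
    /** Number of ORC files that fit in the part of the heap set aside for scans. */
    static long maxOpenFiles(long heapMiB, int percentOfHeapForScans, long miBPerOpenFile) {
        long budgetMiB = heapMiB * percentOfHeapForScans / 100;
        return Math.max(1, budgetMiB / miBPerOpenFile);
    }

    /** Reader threads: enough to keep the disks busy, but never more than the heap allows. */
    static long scanThreads(int disks, int threadsPerDisk, long maxOpenFiles) {
        return Math.max(1, Math.min((long) disks * threadsPerDisk, maxOpenFiles));
    }

    public static void main(String[] args) {
        // 8 GiB heap, half of it for scans, 80 MiB per open file: 4096 / 80 = 51 files.
        long roomy = maxOpenFiles(8192, 50, 80);
        System.out.println(scanThreads(12, 2, roomy)); // disk-bound: prints 24

        // 2 GiB heap: 1024 / 80 = 12 files, so the heap becomes the limit.
        long tight = maxOpenFiles(2048, 50, 80);
        System.out.println(scanThreads(12, 2, tight)); // heap-bound: prints 12
    }
}
```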
Scaling Up In an Engine
▪ An engine must handle concurrency (multiple queries/users)
▪ Fixed # of open files per scan leads to OOMs
▪ Must gracefully degrade performance instead of OOM
▪ Need to limit total open files
▪ Multiple issues to deal with
• Starvation (all queries must make reasonable progress)
• Stragglers (work for a scan must be evenly balanced across nodes)
• Adapting parallelism to concurrency
    • Single query must take full advantage of the available resources
    • As concurrency increases, parallelism decreases for ongoing scans
    • When concurrency decreases, parallelism per scan increases
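One simple way to picture adaptive parallelism is a fixed pool of open-file slots shared evenly among the scans that are currently running. The sketch below is a minimal model under that assumption. It is not how Big SQL's resource manager is implemented, and `ScanParallelismSketch` is a hypothetical class.

```java
/** Toy model: a global budget of open-file slots divided evenly among active scans. */
class ScanParallelismSketch {
    private final int totalSlots; // global cap on concurrently open ORC files
    private int activeScans;

    ScanParallelismSketch(int totalSlots) {
        this.totalSlots = totalSlots;
    }

    synchronized void scanStarted() {
        activeScans++;
    }

    synchronized void scanFinished() {
        if (activeScans > 0) {
            activeScans--;
        }
    }

    /**
     * Degree of parallelism each running scan may use right now.
     * A lone scan gets the whole budget; every scan gets at least one slot so none starves.
     * (With more scans than slots this exceeds the budget; a real engine would queue instead.)
     */
    synchronized int slotsPerScan() {
        if (activeScans == 0) {
            return totalSlots;
        }
        return Math.max(1, totalSlots / activeScans);
    }

    public static void main(String[] args) {
        ScanParallelismSketch rm = new ScanParallelismSketch(48);
        rm.scanStarted();
        System.out.println(rm.slotsPerScan()); // single scan: prints 48
        rm.scanStarted();
        rm.scanStarted();
        rm.scanStarted();
        System.out.println(rm.slotsPerScan()); // four scans: prints 12
        rm.scanFinished();
        rm.scanFinished();
        System.out.println(rm.slotsPerScan()); // two scans left: prints 24
    }
}
```

Ongoing scans would need to re-read their share periodically (for example between files or splits) so that parallelism shrinks and grows as other queries come and go.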
Big SQL Tuning for ORC
▪ The “fundamentals” (regardless of the data format)
• Ensure Big SQL has enough resources: % memory, % CPU, temp storage spread across multiple disks (e.g. same disks as HDFS)
• Use partitioning
• Run ANALYZE on all your tables (or ensure AUTO ANALYZE is enabled)
▪ ORC-specific tuning (Big SQL 5.0.1 & up)
• Property bigsql.java.orc.preftp.size controls the max number of open ORC files
▪ ORC file creation
• Data sorted by filtering columns
• Stripe and row group size
• Bloom filters
▪ For more Big SQL tuning tips, see
https://developer.ibm.com/hadoop/2016/11/16/top-6-big-sql-v4-2-performance-tips/
Summary
▪ ORC = Storage Efficiency + Fast Ingestion
▪ Fast ingestion using Vectorized APIs in Apache ORC
▪ Big SQL performs best with data in ORC format!