Performance and Insights on File Formats – 2.0
Luca Menichetti, Vag Motesnitsalis
Design and Expectations

2 use cases:
- Exhaustive (operation using all values of a record)
- Selective (operation using limited values of a record)

5 data formats: CSV, Parquet, serialized RDD objects, JSON, Apache Avro

The tests gave insights into the specific advantages and disadvantages of each format, as well as their time and space performance.
Experiment Descriptions

For the "exhaustive" use case (UC1) we used "processed" EOS logs data. The current default data format is CSV.
For the "selective" use case (UC2) we used experiment Job Monitoring data from the Dashboard. The current default data format is JSON.
For each use case, all formats were generated a priori (from the default format) and then the tests were executed.
Technology: Spark (Scala) with the SparkSQL library. No tests were performed with compression.
Formats

CSV – text files, comma-separated values, one record per line
JSON – text files, JavaScript objects, one per line
Serialized RDD Objects (SRO) – Spark datasets serialized to text files
Avro – serialization format with binary encoding
Parquet – columnar format with binary encoding
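The size gap between the two text formats above comes largely from JSON repeating every field name in every record, while CSV states the schema once in a header. A minimal standalone sketch (with made-up records standing in for EOS log entries, not the actual data):

```python
import csv
import io
import json

# Hypothetical sample records (assumed field names, for illustration only).
records = [
    {"path": "/eos/user/a/file1", "size": 4096, "op": "read"},
    {"path": "/eos/user/b/file2", "size": 8192, "op": "write"},
]

# CSV: values only, one record per line; the header adds the schema once.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["path", "size", "op"])
writer.writeheader()
writer.writerows(records)
csv_bytes = len(buf.getvalue().encode())

# Line-delimited JSON: the field names are repeated in every record.
json_bytes = sum(len((json.dumps(r) + "\n").encode()) for r in records)

print(csv_bytes, json_bytes)  # JSON is larger: keys repeat per record
```

The per-record key repetition is what the space measurements on the next slide show at scale (JSON roughly twice the size of CSV for both datasets).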
Space Requirements (in GB)

      CSV    JSON   SRO    Avro   Parquet
UC1   12.7   23.4   19.9   11.7   6.3
UC2   6.2    13.4   8.1    6.6    2.1

(UC1: EOS logs, default format CSV; UC2: Job Monitoring, default format JSON)
Spark Executions

for i in {1..50}; do
  for format in CSV JSON SRO Avro Parquet; do
    for UC in UC1 UC2; do
      spark-submit --num-executors 2 --executor-cores 2 --executor-memory 2G \
        --class ch.cern.awg.Test$UC$format \
        formats-analyses.jar input-$UC-$format > output-$UC-$format-$i
    done
  done
done
We took the times from all (UC, format) jobs and calculated an average for each type of execution (excluding outliers). Times include reading and computation (the test jobs don't write any files; they just print the result to stdout).
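The averaging step can be sketched as follows. The deck does not say how outliers were deleted, so the symmetric trimming used here is an assumption, not the authors' method:

```python
def trimmed_average(times, trim=0.2):
    """Average the run times after dropping the top/bottom `trim` fraction."""
    s = sorted(times)
    k = int(len(s) * trim)  # number of runs to drop at each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

# Hypothetical run times in seconds, with one slow outlier run.
runs = [80.1, 79.8, 81.0, 80.5, 140.2]
print(round(trimmed_average(runs), 2))  # → 80.53
```

With 50 runs per (UC, format) pair, trimming like this keeps a single slow or warm-up run from skewing the reported average.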
Times: UC1 "Exhaustive" (in seconds)

           CSV    JSON   SRO    Avro   Parquet
AVG        80.3   245.9  155.8  80.7   108.4
MIN        65.41  205    131.6  64.5   78.5
Size (GB)  12.7   23.4   19.9   11.7   6.3
Times: UC2 "Selective" (in seconds)

           CSV    JSON   SRO    Avro   Parquet
AVG        42     109.5  74.6   52.6   23.7
MIN        35.1   83.3   63.6   40.4   18.4
Size (GB)  6.2    13.4   8.1    6.6    2.1
Time Comparison between UC1 and UC2

[bar chart: average execution times in seconds (0 to 300) for CSV, JSON, SRO, Avro and Parquet, AVG UC1 vs AVG UC2; values as on the two previous slides]
Space and Time Performance Gain/Loss
(compared to the current default format)

                   CSV    JSON    SRO    Avro   Parquet
Space UC1 (CSV)     =     +84%    +56%   -8%    -51%
Time  UC1           =     +215%   +93%    =     +35%
Space UC2 (JSON)   -54%    =      -40%   -51%   -84%
Time  UC2          -64%    =      -35%   -54%   -79%

(UC1 = EOS logs, default CSV; UC2 = Job Monitoring, default JSON; "=" marks the default format itself)
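The space percentages follow directly from the sizes on the "Space Requirements" slide. A quick check against the UC1 row (the ±1-point disagreements with the slide for SRO and Parquet are just rounding differences):

```python
# UC1 sizes in GB from the "Space Requirements" slide.
sizes_uc1 = {"CSV": 12.7, "JSON": 23.4, "SRO": 19.9, "Avro": 11.7, "Parquet": 6.3}

def gain_vs_default(sizes, default):
    """Percentage size change of each format relative to the default one."""
    base = sizes[default]
    return {fmt: round(100 * (size - base) / base) for fmt, size in sizes.items()}

print(gain_vs_default(sizes_uc1, "CSV"))
# → {'CSV': 0, 'JSON': 84, 'SRO': 57, 'Avro': -8, 'Parquet': -50}
```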
Pros and Cons

CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema changes allowed. No type definitions. No declaration control.

JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: Inefficient and consumes a lot of space. No declaration control.

Serialized RDD Objects (SRO)
  Pros: Declaration control. A middle ground "between" CSV and JSON (for space and time). Good for storing aggregated results.
  Cons: Spark only. No compression. Schema changes allowed, but they have to be implemented manually.

Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Consumes about as much space as CSV (not really a negative). Needs a plugin (we found an incompatibility between our Spark version and the Avro library; we had to fix and recompile it).

Parquet
  Pros: Low space consumption (RLE). Extremely efficient for "selective" use cases, with good performance in other cases too.
  Cons: Needs a plugin. Slow to generate.
Data Formats - Overview

                             CSV     JSON                        SRO          Avro                   Parquet
Supports schema changes      NO      YES                         YES          YES                    YES
Primitive/complex types      -       YES (general numeric only)  YES          YES                    YES
Declaration control          -       NO                          YES          YES                    YES
Supports compression         YES     YES                         NO           YES                    YES
Storage consumption          Medium  High                        Medium/High  Medium                 Low (RLE)
Supported by                 All     All (parsed from text)      Spark only   All (needs plugin)     All (needs plugin)
Can print a sample snippet   YES     YES                         NO           YES (with avro-tools)  NO (yes with unofficial tools)
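The "Supports schema changes" row is easy to see with a standalone sketch: in line-delimited JSON a new field can appear mid-stream and each record still parses on its own, while CSV rows are bound to the single header declared up front (toy data, not the actual Job Monitoring records):

```python
import csv
import io
import json

# JSON: a field added in a later record parses without touching older records.
json_lines = [
    '{"site": "CERN", "jobs": 10}',
    '{"site": "FNAL", "jobs": 7, "status": "ok"}',  # new field added later
]
parsed = [json.loads(line) for line in json_lines]
print(parsed[1].get("status"))  # → ok

# CSV: the schema is frozen in the header; there is no per-row way to add a field.
csv_text = "site,jobs\nCERN,10\nFNAL,7"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print("status" in rows[1])  # → False
```

This is the flexibility that makes JSON convenient for web-like services whose output may evolve, at the cost of the space and time overhead measured earlier.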
Conclusions

There is no "ultimate" file format, but...
- Avro shows promising results for exhaustive use cases, with performance comparable to CSV.
- Parquet shows extremely good results for selective use cases and very low space consumption.
- JSON is good for directly storing, without any additional effort, data coming from web-like services that might change their format in the future, but it is too inefficient and consumes too much space.
- CSV is still quite efficient in time and space, but the schema is frozen and validation is left up to the user.
- Serialized Spark RDDs are a good solution for storing Scala objects that need to be reused soon (like aggregated results to plot, or intermediate results saved for future computation), but they are not advisable as a final format, since they are not a general-purpose format.
Thank You
Spark UC1 executions

[bar chart: UC1 execution times in seconds (0 to 500) for csv, json, avro, sro, parquet under three (EM, NE, EC) configurations: 2G 4 2; 2G 2 2; 2G 2 1]
Spark UC2 executions

[bar chart: UC2 execution times in seconds (0 to 250) for csv, json, avro, sro, parquet under three (EM, NE, EC) configurations: 2G 4 2; 2G 2 2; 2G 2 1]