Performance and Insights on File Formats – 2.0
Luca Menichetti, Vag Motesnitsalis
Design and Expectations

2 use cases:
- Exhaustive (operation using all values of a record)
- Selective (operation using limited values of a record)

5 data formats: CSV, Parquet, serialized RDD objects, JSON, Apache Avro

The tests gave insights into the specific advantages and disadvantages of each format, as well as their time and space performance.
Experiment Descriptions

For the "exhaustive" use case (UC1) we used "processed" EOS logs data. The current default data format is CSV.
For the "selective" use case (UC2) we used experiment Job Monitoring data from the Dashboard. The current default data format is JSON.
For each use case, all formats were generated a priori (from the default format) and then the tests were executed.
Technology: Spark (Scala) with the SparkSQL library. No tests were performed with compression.
Formats

CSV – text files, comma-separated values, one record per line
JSON – text files, JavaScript objects, one per line
Serialized RDD Objects (SRO) – Spark datasets serialized to text files
Avro – serialization format with binary encoding
Parquet – columnar format with binary encoding
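The size gap between the two text formats above comes largely from JSON repeating every field name in every record, while CSV states the schema once in a header. A minimal standalone sketch (with made-up records standing in for EOS log entries, not the actual data):

```python
import csv
import io
import json

# Hypothetical sample records (assumed field names, for illustration only).
records = [
    {"path": "/eos/user/a/file1", "size": 4096, "op": "read"},
    {"path": "/eos/user/b/file2", "size": 8192, "op": "write"},
]

# CSV: values only, one record per line; the header adds the schema once.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["path", "size", "op"])
writer.writeheader()
writer.writerows(records)
csv_bytes = len(buf.getvalue().encode())

# Line-delimited JSON: the field names are repeated in every record.
json_bytes = sum(len((json.dumps(r) + "\n").encode()) for r in records)

print(csv_bytes, json_bytes)  # JSON is larger: keys repeat per record
```

The per-record key repetition is what the space measurements on the next slide show at scale (JSON roughly twice the size of CSV for both datasets).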
Space Requirements (in GB)

      CSV    JSON   SRO    Avro   Parquet
UC1   12.7   23.4   19.9   11.7   6.3
UC2   6.2    13.4   8.1    6.6    2.1

(UC1: EOS logs, default format CSV; UC2: Job Monitoring, default format JSON)
Spark Executions

for i in {1..50}; do
  for format in CSV JSON SRO Avro Parquet; do
    for UC in UC1 UC2; do
      spark-submit --num-executors 2 --executor-cores 2 --executor-memory 2G \
        --class ch.cern.awg.Test$UC$format \
        formats-analyses.jar input-$UC-$format > output-$UC-$format-$i
    done
  done
done
We took the times from all (UC, format) jobs and calculated an average for each type of execution (excluding outliers). Times include reading and computation (the test jobs don't write any files; they just print the result to stdout).
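The averaging step can be sketched as follows. The deck does not say how outliers were deleted, so the symmetric trimming used here is an assumption, not the authors' method:

```python
def trimmed_average(times, trim=0.2):
    """Average the run times after dropping the top/bottom `trim` fraction."""
    s = sorted(times)
    k = int(len(s) * trim)  # number of runs to drop at each end
    kept = s[k:len(s) - k] if k else s
    return sum(kept) / len(kept)

# Hypothetical run times in seconds, with one slow outlier run.
runs = [80.1, 79.8, 81.0, 80.5, 140.2]
print(round(trimmed_average(runs), 2))  # → 80.53
```

With 50 runs per (UC, format) pair, trimming like this keeps a single slow or warm-up run from skewing the reported average.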
Times: UC1 "Exhaustive" (in seconds)

           CSV    JSON   SRO    Avro   Parquet
AVG        80.3   245.9  155.8  80.7   108.4
MIN        65.41  205    131.6  64.5   78.5
Size (GB)  12.7   23.4   19.9   11.7   6.3
Times: UC2 "Selective" (in seconds)

           CSV    JSON   SRO    Avro   Parquet
AVG        42     109.5  74.6   52.6   23.7
MIN        35.1   83.3   63.6   40.4   18.4
Size (GB)  6.2    13.4   8.1    6.6    2.1
Time Comparison between UC1 and UC2

[bar chart: average execution times in seconds (0 to 300) for CSV, JSON, SRO, Avro and Parquet, AVG UC1 vs AVG UC2; values as on the two previous slides]
Space and Time Performance Gain/Loss
(compared to the current default format)

                   CSV    JSON    SRO    Avro   Parquet
Space UC1 (CSV)     =     +84%    +56%   -8%    -51%
Time  UC1           =     +215%   +93%    =     +35%
Space UC2 (JSON)   -54%    =      -40%   -51%   -84%
Time  UC2          -64%    =      -35%   -54%   -79%

(UC1 = EOS logs, default CSV; UC2 = Job Monitoring, default JSON; "=" marks the default format itself)
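The space percentages follow directly from the sizes on the "Space Requirements" slide. A quick check against the UC1 row (the ±1-point disagreements with the slide for SRO and Parquet are just rounding differences):

```python
# UC1 sizes in GB from the "Space Requirements" slide.
sizes_uc1 = {"CSV": 12.7, "JSON": 23.4, "SRO": 19.9, "Avro": 11.7, "Parquet": 6.3}

def gain_vs_default(sizes, default):
    """Percentage size change of each format relative to the default one."""
    base = sizes[default]
    return {fmt: round(100 * (size - base) / base) for fmt, size in sizes.items()}

print(gain_vs_default(sizes_uc1, "CSV"))
# → {'CSV': 0, 'JSON': 84, 'SRO': 57, 'Avro': -8, 'Parquet': -50}
```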
Pros and Cons

CSV
  Pros: Always supported and easy to use. Efficient.
  Cons: No schema changes allowed. No type definitions. No declaration control.

JSON
  Pros: Encoded in plain text (easy to use). Schema changes allowed.
  Cons: Inefficient and consumes a lot of space. No declaration control.

Serialized RDD Objects (SRO)
  Pros: Declaration control. A middle ground "between" CSV and JSON (for space and time). Good for storing aggregated results.
  Cons: Spark only. No compression. Schema changes allowed, but they have to be implemented manually.

Avro
  Pros: Schema changes allowed. Efficiency comparable to CSV. Compression definition included in the schema.
  Cons: Consumes about as much space as CSV (not really a negative). Needs a plugin (we found an incompatibility between our Spark version and the Avro library; we had to fix and recompile it).

Parquet
  Pros: Low space consumption (RLE). Extremely efficient for "selective" use cases, with good performance in other cases too.
  Cons: Needs a plugin. Slow to generate.
Data Formats - Overview

                             CSV     JSON                        SRO          Avro                   Parquet
Supports schema changes      NO      YES                         YES          YES                    YES
Primitive/complex types      -       YES (general numeric only)  YES          YES                    YES
Declaration control          -       NO                          YES          YES                    YES
Supports compression         YES     YES                         NO           YES                    YES
Storage consumption          Medium  High                        Medium/High  Medium                 Low (RLE)
Supported by                 All     All (parsed from text)      Spark only   All (needs plugin)     All (needs plugin)
Can print a sample snippet   YES     YES                         NO           YES (with avro-tools)  NO (yes with unofficial tools)
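The "Supports schema changes" row is easy to see with a standalone sketch: in line-delimited JSON a new field can appear mid-stream and each record still parses on its own, while CSV rows are bound to the single header declared up front (toy data, not the actual Job Monitoring records):

```python
import csv
import io
import json

# JSON: a field added in a later record parses without touching older records.
json_lines = [
    '{"site": "CERN", "jobs": 10}',
    '{"site": "FNAL", "jobs": 7, "status": "ok"}',  # new field added later
]
parsed = [json.loads(line) for line in json_lines]
print(parsed[1].get("status"))  # → ok

# CSV: the schema is frozen in the header; there is no per-row way to add a field.
csv_text = "site,jobs\nCERN,10\nFNAL,7"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print("status" in rows[1])  # → False
```

This is the flexibility that makes JSON convenient for web-like services whose output may evolve, at the cost of the space and time overhead measured earlier.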
Conclusions

There is no "ultimate" file format, but...
- Avro shows promising results for exhaustive use cases, with performance comparable to CSV.
- Parquet shows extremely good results for selective use cases and very low space consumption.
- JSON is good for directly storing, without any additional effort, data coming from web-like services that might change their format in the future, but it is too inefficient and consumes too much space.
- CSV is still quite efficient in time and space, but the schema is frozen and validation is left up to the user.
- Serialized Spark RDDs are a good solution for storing Scala objects that need to be reused soon (like aggregated results to plot, or intermediate results saved for future computation), but they are not advisable as a final format, since they are not a general-purpose format.
Thank You
Spark UC1 executions

[bar chart: UC1 execution times in seconds (0 to 500) for csv, json, avro, sro, parquet under three (EM, NE, EC) configurations: 2G 4 2; 2G 2 2; 2G 2 1]
Spark UC2 executions

[bar chart: UC2 execution times in seconds (0 to 250) for csv, json, avro, sro, parquet under three (EM, NE, EC) configurations: 2G 4 2; 2G 2 2; 2G 2 1]