S8443: Feeding the Big Data Engine
How to Import Data in Parallel
Presented by: Brian Kennedy, CTO
Providence – Atlanta
Email: [email protected]
Introduction to Simantex, Inc.
• Simantex Leadership
– Experts in diverse public gaming, artificial intelligence applications, e-commerce, and software development
– Gaming industry experience in lottery, casino, horse racing, sports betting, and eSports
– Large business/enterprise pedigree complemented by start-up experience and the ability to scale up
– Track record of creating partnerships, ecosystems, and collaboration
– Global B2B and B2G experience
• Helios General Purpose AI/Simulation Platform
– Helios is a revolutionary new approach to enterprise software, forming a marriage of Wisdom and Artificial Intelligence to provide real-world solutions
– Leveraging a proprietary simulation approach, Helios incorporates human learning, reasoning, and perceptual processes into an AI platform
– Simantex is looking to apply it to the emerging eSports industry to combat fraud, detect software weaknesses, and improve player performance
Motivation for High Speed Data Importing
This module is part of the Helios Platform's High Performance Data Querying & Access Layer. When we began work on this module, the intent was to achieve these objectives:
• Efficient utilization of server resources (multitenant / cost savings)
• Scalability to handle clients with massive data needs
• Develop a complete enterprise solution that was 100% GPU based
Proving that just about any problem, no matter how serial in nature it appears, can be mapped to the GPU to achieve significant performance gains
Complexities of the CSV format
• The first line of data could be a column name header record
First Name,Last Name,Notes,Age,Applying From State,GPA
John,Smith,Honnor Student,18,Nevada,3.77
Marybeth,Parkins,Always "early" to class,17,Colorado,3.42
Sam,Peterson,"Hobbies include: tennis, football, swimming",19,Kansas,2.85
"Sarah",Baxter,17,New Jersey,2.90
Fred,Roberts,"Focuses on ""Extreme"" Sports like: sky diving, base jumping, etc",19,Texas,3.05
白,常,专注于研究和运动: hockey,17,California,3.65

Sample College Applicant CSV File
Complexities of the CSV format
• Column widths are inconsistent from one record to the next
Complexities of the CSV format
• Columns may be quoted (meaning they start and end with a quote)
• This means that the delimiter could be part of the data
• The quotes surrounding a column should not be treated as part of the column data
Complexities of the CSV format
• Quotes may exist in column data where the column is not quoted
• Quoted columns may have quotes in the data which are then double quoted
Complexities of the CSV format
• Columns may exceed the target data size
• Let's say in this example the Notes column is an nvarchar(50)
Notice that we counted each escaped double quote ("" in the data) as one character, and we made sure not to count the outer quotes.
Even so, this column exceeds our size constraint, so this record is an error.
Complexities of the CSV format
• Number of columns may differ from one record to the next
• Possible error situation
In the sample file, the Sarah record has only five fields:
"Sarah",Baxter,17,New Jersey,2.90   <- missing Notes column!
Complexities of the CSV format
• UTF-8 text support for multi-language support means:
• A character may be 1–3 bytes long, affecting how we "count" characters to determine max size constraints
• Columns can have a mixture of 1-, 2-, and 3-byte characters
Complexities of the CSV format
• Not all columns may need to be retrieved from the text
• Maybe in this run we only want to import:
• Last Name, Age, and Applying From State
• So the Importer needs to be able to skip columns without writing out the data
In the sample file, the Sarah record is an error row (missing columns) and is not imported:
"Sarah",Baxter,17,New Jersey,2.90
Thinking Differently, Adapting to Massively Parallel Approaches
This type of problem is traditionally handled by reading data sequentially and managing a variety of “states”.
Our approach will compute the “states” for each byte in the CSV file in parallel and store them in a series of arrays.
Let’s take a look at the general algorithm flow…
CSV Reader Program Flow (figure):
1. Read the CSV file from disk into CPU memory in chunks.
2. cudaMalloc and cudaMemcpy the CSV file chunk into GPU memory.
3. The CSV Reader processes the CSV file chunk in GPU memory and outputs results to arrays for each column/field (Col1, Col2, ...). The output arrays remain in GPU memory.
4. GPU processing and calculations (queries, data consolidation, math operations, etc.) run directly on the output arrays; only final results return to the CPU via cudaMemcpy.
A Simplified Example
To simplify the problem for now, let’s assume:
1. Field delimiters only appear in field boundaries. No commas within quotes or double quotes to escape a quote.
2. All data fit within their defined output array widths. There are no overruns.
3. All data are ASCII text characters, so we are always dealing with 1 byte per character.
4. All records or rows have the correct number of fields or columns. No column count errors.
Finding the Individual Record Boundaries
Objectives:
1. Locate the record boundaries.
2. Assign that record number to all the characters in that record.
Tasks:
• Allocate two 32-bit integer arrays of the same dimension as the CSV byte array. One array is the Header, the other is the Prefix Sum (or Scan) array.
Performance Tip: All arrays we use that hold state information are made of 32-bit integers (4 bytes) to ensure optimal alignment when all 32 threads in a warp write or read data to/from the array.
• Run a kernel where each thread maps to a byte in the CSV byte array. If the byte is a Line Feed, write a 1 to the Header array; otherwise write a 0.
• Run the Prefix Sum (in this case an Exclusive Scan) on the Headers. Now the Scan array holds the 0-based record number that corresponds to every byte in the CSV array.
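As a reference for this header-plus-scan step, here is a minimal CPU sketch in Python; the real implementation runs the header kernel and the exclusive scan in parallel on the GPU, and the function name below is illustrative, not part of the library:

```python
from itertools import accumulate

def record_scan(csv_bytes: bytes):
    """CPU sketch of the GPU step: mark linefeeds in a Header array,
    then exclusive-scan it so every byte knows its 0-based record number."""
    # Header array: 1 where the byte is a linefeed (0x0A), else 0.
    headers = [1 if b == 0x0A else 0 for b in csv_bytes]
    # Exclusive prefix sum: the inclusive scan shifted right by one.
    scan = [0] + list(accumulate(headers))[:-1]
    return headers, scan

data = b"a,b\nc,d\ne,f\n"
headers, scan = record_scan(data)
print(scan)  # bytes 0-3 -> record 0, bytes 4-7 -> record 1, ...
```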
Finding the Individual Record Boundaries
• The following figure illustrates the start of a simple 3-column CSV file.
• The first row is the array index for the next three rows.
• The second row is the array of bytes in the start of the CSV.
• The third row is the Linefeed Header array created by the GPU kernel.
The Linefeeds separate the records in the CSV.
• The fourth row is the Exclusive Scan. The value is the 0-based record number that the CSV byte is in.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3
Finding Column Boundaries
Objectives:
1. Locate the delimiter characters.
2. Assign the relative column number to all the characters in that column.
Tasks:
• Run a kernel that populates a Header array for the field delimiters.
• Run a Segmented Scan.
The Segmented Scan works like a regular Scan except that the count is reset at various points, creating separate segments.
In this case our delimiters are commas, and the Segment Boundaries are the Record Boundaries (Linefeeds).
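The semantics of the segmented exclusive scan can be sketched sequentially on the CPU as follows; this illustrative Python is not the GPU kernel, which computes the same values in parallel using segment flags:

```python
def segmented_column_scan(csv_bytes: bytes):
    """CPU sketch: count comma delimiters with an exclusive scan that
    resets at every linefeed, yielding a per-byte 0-based column number."""
    col = 0
    out = []
    for b in csv_bytes:
        out.append(col)      # exclusive: emit the count *before* this byte
        if b == 0x2C:        # comma: the next byte starts a new column
            col += 1
        elif b == 0x0A:      # linefeed: reset the segment for a new record
            col = 0
    return out

print(segmented_column_scan(b"ab,cd\nx,y\n"))
```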
Finding Column Boundaries
• The figure below shows the Segmented Exclusive Scan.
• The fifth row is the Columns (Delimiters) Header array.
• The sixth row is the Segmented Exclusive Scan, which resets to 0 after every Linefeed. The value is the 0-based column number within the record (row).
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2 2 2 2 2 2 2 0 0 0 0 1 1 1 2 2 2 2 2 0 0
The Records Table
Objectives:
1. Build an array mapping the end of each Record in the CSV.
Tasks:
• Run Exclusive Scan and Stream Compact kernels on the Header array.
The Index of the array is the 0-based Record Number, and the Value in the array is the Index of the Linefeed at the end of the record in the CSV array.
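The scan-then-compact pattern can be sketched on the CPU like this; the exclusive scan of the headers gives each flagged byte its slot in the compacted table, and each "thread" whose header is 1 writes its own index into that slot (illustrative Python, hypothetical function name):

```python
from itertools import accumulate

def build_records_table(headers):
    """CPU sketch of stream compaction: scan the linefeed headers,
    then have each flagged position write its index into its slot."""
    scan = [0] + list(accumulate(headers))[:-1]   # exclusive scan
    table = [0] * sum(headers)
    for i, h in enumerate(headers):
        if h:
            table[scan[i]] = i   # value = index of the record-ending LF
    return table

# Linefeed headers matching the three complete records in the figure.
headers = [0] * 17 + [1] + [0] * 20 + [1] + [0] * 11 + [1]
print(build_records_table(headers))  # [17, 38, 50]
```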
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
17 38 50 66 . . . . . . . . . . . .
The Columns Table
Objectives:
1. Build an array mapping the end of each Column in the CSV.
Tasks:
• Run Exclusive Scan and Stream Compact kernels on the Columns Header array.
The third row identifies the column delimiters (blue) and linefeeds (pink).
The fourth row is the Exclusive Scan, mapping index offsets into the Columns Table.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 5 5 5 5 5 5 5 5 6 6 6 6 7 7 8 8 8 8 8 8 9 9
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
5 10 17 27 30 38 42 44 50 . . . . . . .
Record to Columns Mapping Table
Objectives:
1. Build an array mapping each record to the Columns Table index of its last column.
Tasks:
• Run a specialized Stream Compact kernel.
• With this we can catch and filter out records with column-count errors, and, as an optimization, threads can compute where in the Columns Table their record's information starts.
Columns Table:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
5 10 17 27 30 38 42 44 50 . . . . . . .
Records Table:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
17 38 50 66 . . . . . . . . . . . .
Records-To-Columns Table:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 5 8 11 . . . . . . . . . . . .
Printable Bytes Array
Objectives:
1. Build an array flagging the Printable Bytes in the CSV.
Tasks:
• Run a set of kernels that identify Printable Bytes.
In our simple example, this means all bytes except commas, carriage returns, and linefeeds.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2 2 2 2 2 2 2 0 0 0 0 1 1 2 2 2 2 2 2 0 0
1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 0 0 1 1
Character Position Array
Objectives:
1. Build an array indicating the byte position of each character in a column.
Tasks:
• Run a Segmented Exclusive Scan of the Printable Bytes, with the segment resetting on both Column Headers and Record Headers.
Row 8 shows the resulting byte (character) position of each input character within its column.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2 2 2 2 2 2 2 0 0 0 0 1 1 2 2 2 2 2 2 0 0
1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 0 0 1 1
0 1 2 3 4 5 0 1 2 3 4 0 1 2 3 4 5 5 0 1 2 3 4 5 6 7 8 9 0 1 2 0 1 2 3 4 5 6 6 0 1 2 3 0 1 0 1 2 3 4 4 0 1
H e l l o 1 2 . 2 W o r l d M a r g a r i n e 1 5 B u t t e r A m y 3 A b l e N a
(Row labels for the figure above: Record, Column, Printable, Position.)
Putting it all together
• A final kernel creates the Output Arrays, with each thread checking the Printing Headers to see if its byte should be written.
• If so, it checks the Scans and Segmented Scans to identify exactly where this byte goes in the Output:
• Which Record.
• Which Column (or Output Array in this case).
• Which byte or character position within the Record within the Output Array.
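For reference, the output of the simplified example can be reproduced with a small sequential sketch. Note that this CPU loop carries the Record / Column / Position states explicitly, whereas the GPU derives the same states independently per byte from the scan arrays; this is illustrative Python, not the production Core:

```python
def parse_csv_simple(csv_bytes: bytes, n_cols: int, col_width: int):
    """CPU sketch of the final gather under the simplified assumptions:
    no quoting, ASCII text, correct column counts, no overruns."""
    n_records = csv_bytes.count(0x0A)
    # One fixed-width output buffer per column per record, as in the slides.
    out = [[bytearray(col_width) for _ in range(n_records)]
           for _ in range(n_cols)]
    record = column = position = 0
    for b in csv_bytes:
        if b == 0x0A:                    # linefeed: next record
            record, column, position = record + 1, 0, 0
        elif b == 0x2C:                  # comma: next column
            column, position = column + 1, 0
        elif b != 0x0D:                  # printable byte: write it out
            out[column][record][position] = b
            position += 1
    return out

out = parse_csv_simple(b"Hello,12.2,World\r\nAmy,3,Able\r\n", 3, 8)
print(bytes(out[2][0]).rstrip(b"\x00"))
```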
• Here are the first few records in our sample CSV in the Output Arrays.
0 H e l l o 1 2 . 2 W o r l d
1 M a r g a r i n e 1 5 B u t t e r
2 A m y 3 A b l e
3 N a t i o n a l i t y 1 7 9 . 1 A r m e n i a n
4 A l a b a s t e r 2 3 4 p a l e o
5 M o p e d 7 7 . 3 3 3 3 a w a r d
6 P a r l i a m e n t 4 5 P a s c a l
7 M o v i n g 6 7 8 v a n
8 p a v l o v 3 4 5 8 d o g
Let’s Walk Through a Single Thread
0 H e l l o 1 2 . 2 W o r l d
1 M a r g a r i n e 1 5 B u t t e r
2 A m y 3 A b l e
3 N a t i o n a l i t y 1 7 9 . 1 A r m e n i a n
4 A l a b a s t e r 2 3 4 p a l e o
5 M o p e d 7 7 . 3 3 3 3 a w a r d
6 P a r l i a m e n t 4 5 P a s c a l
7 M o v i n g 6 7 8 v a n
8 p a v l o v 3 4 5 8 d o g
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
H e l l o , 1 2 . 2 , W o r l d CR LF M a r g a r i n e , 1 5 , B u t t e r CR LF A m y , 3 , A b l e CR LF N a
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3
0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 1 1 1 2 2 2 2 2 2 2 2 0 0 0 0 1 1 2 2 2 2 2 2 0 0
1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1 0 1 1 1 1 0 0 1 1
0 1 2 3 4 5 0 1 2 3 4 0 1 2 3 4 5 5 0 1 2 3 4 5 6 7 8 9 0 1 2 0 1 2 3 4 5 6 6 0 1 2 3 0 1 0 1 2 3 4 4 0 1
H e l l o 1 2 . 2 W o r l d M a r g a r i n e 1 5 B u t t e r A m y 3 A b l e N a
(Row labels for the figure above: Record, Column, Printable, Position.)
Thread 8 maps to byte index 8 ('.'): Record 0, Column 1, Printable 1, Position 2. It therefore writes '.' into character position 2 of the Column 1 Output Array for record 0.
Handling the More Advanced CSV Features
In the above example, we greatly simplified the features of the CSV format that we dealt with.
However, the remaining features can be supported in much the same way, building custom arrays to indicate the state of each feature.
The following is a list of additional kernels that we developed to handle these features:
More Buffers, Scans, and Compacts – Oh My
• DoubleQuotes: Identify double quotes in column content
• Merge2ndQuotesAndNonPrinting: Remove 2nd quotes as printable bytes
• QuoteBoundaryHeaders: Identify if character is inside quotations
• FixColumnHeaderCommas: Include delimiters in quotes as printable bytes
• PrintingCharacters: Filters out quotes around columns and 2nd quotes from the printable bytes
• BufferPrinting: Stream compacts to remove non-printable bytes
• IdentifyColumnCountErrors: Identifies records with incorrect number of columns
• BuildCharsHeadersOnly: Creates a scan mapping for multi-byte characters so they can be identified as a single character when counting.
• CharacterCountErrors: Counts the number of characters (not bytes) a column contains.
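The character-versus-byte distinction behind the last two kernels can be sketched like this: in UTF-8, continuation bytes always match the bit pattern 10xxxxxx, so flagging only lead bytes gives a header array where the sum is the character count, not the byte count (illustrative Python, hypothetical function name):

```python
def utf8_char_headers(data: bytes):
    """Header array with 1 at each character's lead byte.
    UTF-8 continuation bytes have their top two bits set to 10."""
    return [0 if (b & 0xC0) == 0x80 else 1 for b in data]

col = "白常学".encode("utf-8")   # three characters, nine bytes
headers = utf8_char_headers(col)
print(len(col), sum(headers))    # 9 bytes, but only 3 characters
```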
Optimizations of the Core Parsing Engine
Unlike our simplified example, where each thread writes one byte to the output array, the Core is based on 4 bytes per thread.
Memory alignment is critical; reads and writes should always fall on 4-byte boundaries relative to their allocations. Never access a 32-bit integer at byte offsets 1, 2, or 3 from a 4-byte boundary, only at offsets that are multiples of 4.
Take advantage of casting byte arrays to 32-bit or 64-bit integer arrays to move 4 or 8 bytes in single instructions.
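A Python analog of this casting trick, as an illustrative sketch only: `memoryview.cast` plays the role of casting a `char*` to an `int*`, so the copy below moves 4 word-sized values instead of 16 individual bytes:

```python
# View a byte buffer as 32-bit words and copy word-at-a-time.
src = bytearray(b"HelloWorldPad!!!")   # 16 bytes, a multiple of 4
dst = bytearray(len(src))
src32 = memoryview(src).cast("I")      # 4 unsigned 32-bit words
dst32 = memoryview(dst).cast("I")
for i in range(len(src32)):            # 4 word copies, not 16 byte copies
    dst32[i] = src32[i]
print(bytes(dst))
```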
When dealing with objects that can be many bytes long (such as strings), sometimes it's better to think of Warps, not Threads, as your logical worker unit.
We will look into that in detail next...
A Warp Centric Approach
• Each Warp handles one full record or row, regardless of the length of that row.
• The CSV byte array starts on an even 128-byte boundary, so the first record will start on an even 128-byte boundary. However, subsequent records most likely will not.
• Each Warp will calculate its Warp Index, which is the Index of Thread 0 in the Warp divided by 32.
• The Warp Index will correspond to the 0-based Record Number which it will process.
• Special additional code determines if the target record was marked invalid (having an error), and if so, the warp is re-assigned to the next record.
Core CSV Buffer Read Alignment
1. The Warp computes the 128-byte aligned starting address for its Record from information in the Records Table based on its Warp Index.
2. It then grabs the next value in the Records Table to find where its record ends.
3. The Warp will load one or more 128-byte chunks of memory until it encompasses one entire record.
The figure below shows how 128-byte chunks may map unevenly to records, whose lengths can vary. In this situation, we read the first chunk for Record 0, the first and second for Record 1, the second through fourth for Record 2, and so on.
128 Bytes 128 Bytes 128 Bytes 128 Bytes 128 Bytes 128 Bytes 128 BytesRecord 0 Record 1 Record 2 Record 3 Record 4 Record 5
This may seem inefficient, as more bytes will often be read than needed; however, we gain several advantages:
1) All reads are properly aligned on a multiple-of-4 boundary.
2) Several warps may need the same 128-byte chunk, increasing the chance that it is available in cache and eliminating the slow fetch from Global memory.
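The per-warp chunk arithmetic described above can be sketched as follows (illustrative Python with a hypothetical Records Table; the real kernel performs this arithmetic per warp on the GPU):

```python
CHUNK = 128

def chunks_for_record(records_table, warp_index):
    """Which 128-byte chunks a warp must load to cover its record.
    Record N spans from just after the previous record's linefeed
    through its own linefeed, records_table[N]."""
    start = 0 if warp_index == 0 else records_table[warp_index - 1] + 1
    end = records_table[warp_index]            # index of the closing LF
    first = (start // CHUNK) * CHUNK           # round down to alignment
    last = (end // CHUNK) * CHUNK
    return first, (last - first) // CHUNK + 1  # aligned base, chunk count

table = [100, 300, 550]                        # hypothetical LF positions
print(chunks_for_record(table, 1))             # record 1 spans bytes 101..300
```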
Memory Types and Usage
• Shared Memory
• When a record is longer than one 128-byte chunk, the next chunk is pre-fetched into Shared Memory.
• This is required to support look-ahead logic within the Core algorithm for shuffling and other calculations.
• As the algorithm moves on to chunks 2 through N, each chunk is loaded from Shared Memory and, if needed, the next chunk is pre-fetched into Shared Memory, continuing the cycle.
• Constant Memory
• Constant memory is used when all threads within a Warp need the same value at once ("broadcasting").
• The following values used by all threads are copied into Constant Memory:• Field Character Widths
• Field Byte Widths
• The Pointers to each of the Output Arrays
• These values are used by all warps for all records.
IMPORTANT: There is one Output Array for each column in the CSV to be retrieved. To ensure memory alignment, we require that the widths of all output arrays be defined in multiples of 4 or even (preferably) 8. If you used array widths such as 5 or 6, this would be problematic for the Core’s logic.
Shuffling for Aligned Writes
• At this point we assume that all non-printing characters, except column delimiters and record terminators, have been removed from the CSV buffer by the previous kernels.
• The printing bytes of each column are “shuffled down” so that the column aligns with the start of a 4-byte boundary.
• As the process completes, threads that have 4 bytes ready write out to an Output Array.
• If the column ends before the end of a 4-byte boundary, the unused bytes are masked off.
The figure below shows the first part of a sample chunk. The tall vertical bars demarcate the 4-byte boundaries representing individual threads.
You see the printing characters of each column in colors in the middle row.
The bottom row represents the positions of the printing characters after the shuffling is complete.
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
x y z CR LF a b c , H e l l o " M y " W o r l d , P l e n u m s , E x p o n e n t , e 2 3 . 1 , A B C
a b c H e l l o " M y " W o r l d P l e n u m s E x p o n e n t , e 2 3 . 1 A B C D E
Shuffling for Aligned Writes
• Each column now starts at the beginning of a 4-byte boundary.
The dark gray bytes represent the masking that is done to allow full 4-byte writes each time, but eliminate extraneous bytes from the shuffle operations.
0 1 2 3 4 5 6 7 8 9 10 11 12
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
x y z CR LF a b c , H e l l o " M y " W o r l d , P l e n u m s , E x p o n e n t , e 2 3 . 1 , A B C
a b c H e l l o " M y " W o r l d P l e n u m s E x p o n e n t , e 2 3 . 1 A B C D E
a b c H e l l o " M y " W o r l d P l e n u m s E x p o n e n t , e 2 3 . 1
*The actual shuffle algorithm is fairly complex; this has been an oversimplification for this presentation.
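The net effect of the shuffle and mask can be sketched as follows; this illustrative Python only shows the result (the real implementation uses warp shuffles and is far more involved, as the note above says), and the function name is hypothetical:

```python
def pack_column(col_bytes: bytes, width: int) -> bytearray:
    """Sketch of the shuffle's effect: the column's printable bytes are
    packed at the start of an aligned slot, and the unused tail is
    zero-masked so the slot can be written out as full 32-bit words.
    `width` must be a multiple of 4, as the Core requires."""
    assert width % 4 == 0, "output widths must be multiples of 4"
    slot = bytearray(width)            # the zeroed tail is the mask
    slot[:len(col_bytes)] = col_bytes
    return slot

slot = pack_column(b"Amy", 8)
words = memoryview(slot).cast("I")     # two aligned 32-bit writes
print(bytes(slot), len(words))
```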
Performance Results
Resulting Metric    CPU*        GPU
Total Time          02:57.917   00:04.460
Rows/Second         249,148     9,938,983
Speed Increase      1x          39.9x
Test Platform
Intel Core i7 X990 @ 3.47GHz
NVIDIA GeForce GTX 1080 (Pascal Architecture)
Test File
Size: 1 Gigabyte
Rows: 44,327,867
* None of the CPU-based CSV importers tested supported all the complexity our engine does; had they added the missing functionality, their rows/second would have dropped even lower.
** We used CsvHelper for our CPU benchmark. It has a large community with over 4 million downloads and came closest to our functionality among the importers we tested.
Future Enhancements
We will be looking to add the following future enhancements to our library:
• Multi-GPU support – sending each file “chunk” to a separate GPU for processing.
• Apache Arrow support – adapting our already columnar approach to be aligned with Apache Arrow’s format
• Further performance optimizations like potential kernel fusion
• Possible experimentation with Unified Memory performance
Simantex is pleased to announce that it has joined GOAi and contributed this CSV Importer as open source to the project.
You can find the code and a whitepaper explaining this algorithm in detail here:
https://github.com/gpuopenanalytics
We are also available for consulting and implementation services. Please contact me at: