
Load & Unload Data TO and FROM Snowflake (By Faysal Shaarani)




Taming The Data Load/Unload in Snowflake: Sample Code and Best Practice

(Faysal Shaarani)

Loading Data Into Your Snowflake Database(s) From Raw Data Files

[1. CREATE YOUR APPLICABLE FILE FORMAT]: The syntax below creates a CSV file format for loading data from CSV files. Please note that the backslash escapes are coded for use from the sfsql command line, not from the UI. If this FILE FORMAT is to be used from the Snowflake UI, the \\\\ occurrences should be changed to \\.

[CSV FILE FORMAT]

CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.CSV
  -- Comma field delimiter and \n record terminator
  TYPE = 'CSV'
  COMPRESSION = 'AUTO'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\\\\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
  TRIM_SPACE = false
  ERROR_ON_COLUMN_COUNT_MISMATCH = true
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = '\\\\134'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('');

The syntax below creates a JSON file format for loading JSON data into Snowflake.

[JSON FILE FORMAT]

CREATE OR REPLACE FILE FORMAT DEMO_DB.PUBLIC.JSON
  TYPE = 'JSON'
  COMPRESSION = 'AUTO'
  ENABLE_OCTAL = false
  ALLOW_DUPLICATE = false
  STRIP_OUTER_ARRAY = false;
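As an optional sanity check, the file formats just created can be listed before any load is attempted:

SHOW FILE FORMATS IN SCHEMA DEMO_DB.PUBLIC;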

[2. CREATE YOUR DESTINATION TABLE]: Pre-create your destination table before loading the CSV data into it.


CREATE OR REPLACE TABLE exhibit (
  Id STRING,
  Title STRING,
  Year NUMBER,
  Collection_Id STRING,
  Timeline_Id STRING,
  Depth INT
);

CREATE OR REPLACE TABLE timelines (
  Id STRING,
  Title VARCHAR,
  Regime STRING,
  FromYear NUMBER,
  ToYear NUMBER,
  Height NUMBER,
  Timeline_Id STRING,
  Collection_Id STRING,
  ForkNode NUMBER,
  Depth NUMBER,
  SubtreeSize NUMBER
);

If you are using data files that have been staged in the Snowflake customer account S3 bucket assigned to your company:

Run COPY Command To Load Data From Raw CSV Files

Load the data from your CSV file into the pre-created EXHIBIT table. If a data error is encountered on any of the records, continue loading what you can. If you do not specify ON_ERROR, the default is to skip the file on the first error encountered on any of the records in that file. The example below loads whatever it can, skipping any bad records in the file.

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz FILE_FORMAT = CSV ON_ERROR='continue';

OR

Load the data into the pre-created EXHIBIT table from several CSV files matching the file-name regular expression shown in the sample code below:

COPY INTO exhibit FROM @~/errorsExhibit PATTERN='.*0[1-2].txt.gz' FILE_FORMAT = CSV ON_ERROR='continue';


To list the files under the CleanData subdirectory of the @~ staging area for your Snowflake Beta Customer account, use the following command from the sfsql command line:

ls @~/CleanData;

To list all files whose names match the regular expression specified in the PATTERN parameter, use the command below:

ls @~/CleanData PATTERN='.*0[1-2].txt.gz';

Verify that the data was loaded successfully into the EXHIBIT table.

Select * from EXHIBIT;
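Beyond a simple SELECT, per-file load results can also be inspected. A minimal sketch, assuming your account exposes the INFORMATION_SCHEMA.LOAD_HISTORY view:

-- One row per file loaded into EXHIBIT, with status and error counts
-- (assumes INFORMATION_SCHEMA.LOAD_HISTORY is available)
SELECT file_name, status, row_count, error_count
FROM information_schema.load_history
WHERE table_name = 'EXHIBIT';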

Run COPY Command to Load/Parse JSON Data From Raw Staged Files

[1. Upload JSON File Into The Customer Account's S3 Staging Area]

PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @~/json/

[2. Create an External Table with a VARIANT Column To Contain The JSON Data]

CREATE OR REPLACE TABLE public.json_data_table_ext (json_data variant)
  STAGE_LOCATION = '@~'
  STAGE_FILE_FORMAT = demo_db.public.json
  COMMENT = 'json Data preview table';

[3. COPY the JSON Raw Data Into the Table]

COPY INTO json_data_table_ext FROM @~/json/json_sample_data FILE_FORMAT = 'JSON' ON_ERROR='continue';


JSON FILE CONTENT:

Validate that the data in the raw JSON file was loaded into the table:

select * from public.json_data_table_ext;

Output:

{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...

select json_data:root[0].kind, json_data:root[0].fullName, json_data:root[0].age from public.json_data_table_ext;

Output:
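The path notation above reads single elements. To expand a nested array such as children into one row per element, Snowflake's LATERAL FLATTEN table function can be used; a minimal sketch against the sample document shown above:

-- One output row per child in the first root element
select f.value:name::string   as child_name,
       f.value:age::string    as child_age,
       f.value:gender::string as child_gender
from public.json_data_table_ext,
     lateral flatten(input => json_data:root[0].children) f;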


If you are using data files that have been staged on your own company’s Amazon S3 bucket:

Run COPY Command To Load Data From Raw CSV Files

The syntax below is needed to create a stage ONLY if you are using your own company's Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to create a stage object. Create the staging database object:

CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.SAMPLE_STAGE
  URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
  CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE' AWS_SECRET_KEY = 'SECRET KEY VALUE')
  COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';
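Once the stage exists, the same ls command used earlier for the @~ area also works against the named stage, which is a quick way to confirm the URL and credentials are correct:

ls @DEMO_DB.PUBLIC.SAMPLE_STAGE;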

Load the data from your CSV file into the pre-created EXHIBIT table.

COPY INTO exhibit FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit/exhibit_03.txt.gz FILE_FORMAT = CSV ON_ERROR='continue';

OR

Load the data into the pre-created EXHIBIT table from several CSV files matching the file-name regular expression shown in the sample code below:

COPY INTO exhibit FROM @DEMO_DB.PUBLIC.SAMPLE_STAGE/errorsExhibit PATTERN='.*0[1-2].txt.gz' FILE_FORMAT = CSV ON_ERROR='continue';

Verify that the data was loaded successfully into the EXHIBIT table.

Select * from EXHIBIT;

Run COPY Command to Load/Parse JSON Data Raw Staged Files

[1. Create a Stage Database Object Pointing to The Location of The JSON File]

The syntax below is needed to create a stage ONLY if you are using your own company's Amazon S3 bucket. If you are using your Snowflake-assigned bucket, you do not need to create a stage object.

CREATE OR REPLACE STAGE DEMO_DB.PUBLIC.STAGE_JSON
  URL = 'S3://<YOUR_COMPANY_NAME>/<SUBFOLDER NAME>/'
  CREDENTIALS = (AWS_KEY_ID = 'YOUR KEY VALUE' AWS_SECRET_KEY = 'SECRET KEY VALUE')
  COMMENT = 'Stage object pointing to the customer''s own AWS S3 bucket. Independent of Snowflake';

Place your file in the staging location defined by the above command:

PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/json/json_sample_data @stage_JSON/json/

[2. Create an External Table with a VARIANT Column To Contain The JSON Data]

CREATE OR REPLACE TABLE public.json_data_table_ext (json_data variant)
  STAGE_LOCATION = demo_db.public.stage_json
  STAGE_FILE_FORMAT = demo_db.public.json
  COMMENT = 'json Data preview table';

[3. COPY the JSON Raw Data Into the Table]

COPY INTO json_data_table_ext FROM @stage_json/json/json_sample_data FILE_FORMAT = 'JSON' ON_ERROR='continue';

JSON FILE CONTENT:


Validate that the data in the raw JSON file was loaded into the table:

select * from public.json_data_table_ext;

Output:

{ "root": [ { "age": 22, "children": [ { "age": "6", "gender": "Female", "name": "Jane" }, { "age": "15", ...

select json_data:root[0].kind, json_data:root[0].fullName, json_data:root[0].age from public.json_data_table_ext;

Output:

Using Snowflake to Validate Your Data Files

In this section, we will go over validating raw data files before performing the actual data load. To illustrate this, we will attempt to load raw data files containing errors, intentionally causing the load to fail. The VALIDATION_MODE option on the COPY command processes the data without loading it into the destination table in Snowflake.

In the following example, the PUT command stages the exhibit*.txt and timelines.txt files at the default S3 staging location set up for the Beta Customer Account in Snowflake, and also illustrates how we can load files from a subdirectory below the root staging directory of a Snowflake Beta Customer Account.

PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit*.txt @~/errorsExhibit/;

----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
 source         | target            | source_size | target_size | source_compression | target_compression | status   | details |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
 exhibit_03.txt | exhibit_03.txt.gz |        8353 |        3734 | NONE               | GZIP               | UPLOADED |         |
 exhibit_01.txt | exhibit_01.txt.gz |       14733 |        6207 | NONE               | GZIP               | UPLOADED |         |
 exhibit_02.txt | exhibit_02.txt.gz |       14730 |        6106 | NONE               | GZIP               | UPLOADED |         |
----------------+-------------------+-------------+-------------+--------------------+--------------------+----------+---------+
3 rows in result (first row: 1.501 sec; total: 1.504 sec)

Below are three possible raw data validation scenarios and sample code:

1. The following example allows previewing 10 records from the first raw data file, exhibit_01.txt. This file does not have any errors.

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_01.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) VALIDATION_MODE='return_10_rows';

2. The following example simulates the scenario of having an extra delimiter in a record, and shows the errors that would be displayed.

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) VALIDATION_MODE='return_errors';

3. The following example simulates the scenario of having a column value of the wrong data type, and shows what the error output would look like after running the COPY command below:

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_03.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) VALIDATION_MODE='return_errors';

Using Snowflake to Unload Your Snowflake Data to Files

To create a data file from a table in the Snowflake database (COPY INTO S3 FROM the EXHIBIT table), use the command below:

COPY INTO @~/giant_file/ from exhibit;

OR to overwrite the existing files in the same directory, use the OVERWRITE option as in the command below:

COPY INTO @~/giant_file/ from exhibit overwrite=true;

Please note that by default, Snowflake unloads the data from the table into multiple files of 16 MB each. If you want your data to be unloaded to a single file, use the SINGLE option on the COPY command as in the example below:

COPY INTO @~/giant_file/ from exhibit Single=true overwrite=true;

Please note that AWS S3 has a limit of 5 GB on the size of a file you can stage on S3. You can use the optional MAX_FILE_SIZE parameter (in bytes) to change the Snowflake default file size. Use the command below to specify bigger or smaller files than the Snowflake default, as long as you do not exceed the AWS S3 maximum file size. For example, the command below unloads the data in the EXHIBIT table into files of 50 MB each:

COPY INTO @~/giant_file/ from exhibit max_file_size= 50000000 overwrite=true;
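A subset of a table can also be unloaded by selecting from a query instead of naming the table directly. A minimal sketch reusing the EXHIBIT columns defined earlier (the filter value is illustrative only):

-- Unload only selected columns and rows from EXHIBIT
COPY INTO @~/giant_file/
FROM (SELECT Id, Title, Year FROM exhibit WHERE Depth > 0)
FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',')
OVERWRITE = TRUE;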

Using Snowflake to Split Your Data Files Into Smaller Files

If you are using data files that have been staged in the Snowflake customer account S3 bucket assigned to your company: when loading data into Snowflake, it is recommended that the raw data be split into as many files as possible to maximize the parallelization of the data-loading process, and thus complete the data load in the shortest time possible. If your raw data is in one file, you can use Snowflake to split your large data file into multiple files before loading the data into Snowflake. Below are the steps for achieving this:

• Place the Snowflake sample giant_file from your local machine's directory into the @~/giant_file/ S3 bucket using the following command:

PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/giant_file_sample.csv.gz @~/giant_file/;

• Create a single-column file format for examining the data in the data file:

CREATE OR REPLACE FILE FORMAT single_column_rows
  TYPE = 'CSV'
  SKIP_HEADER = 0
  RECORD_DELIMITER = '\\\\n'
  TRIM_SPACE = false
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  FIELD_DELIMITER = 'NONE'
  FIELD_OPTIONALLY_ENCLOSED_BY = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = '\\\\134'
  NULL_IF = ('')
  COMMENT = 'copy each line into single-column row';

• Create an external table in the Snowflake database specifying the staging area and file format to be used:

CREATE OR REPLACE TABLE GiantFile_ext (fullrow varchar(4096))
  STAGE_LOCATION = @~/giant_file/
  STAGE_FILE_FORMAT = single_column_rows
  COMMENT = 'GiantFile preview table';

• Run the COPY command below to create small files while limiting the file size to 2 MB. This splits the data from a single original data file across multiple small files of 2 MB each:

COPY INTO @~/giant_file_parts/ FROM (SELECT * FROM table(stage(GiantFile_ext, pattern => '.*giant_file_sample.csv.gz'))) max_file_size= 2000000;


• Verify the files of the data you unloaded:

ls @~/giant_file_parts;

To place a copy of the S3 giant file parts onto your local machine after they have been split into several files of 2 MB each, use the below command:

get @~/giant_file_parts/ file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/CleanData/

To remove all the files at the staging bucket location you want to clean up, use the following command:

remove @~/giant_file_parts;

To remove a specific set of files from the giant file directory whose names match a regular expression (e.g., remove all the files whose names end with .csv.gz), use the following command:

remove @~/giant_file pattern='.*.csv.gz';

Recommended Approach to Debug and Resolve Data Load Issues Related to Data Problems

[WHAT IF YOU HAVE DATA FILES THAT HAVE PROBLEMS]: This section suggests a recommended flow for iterating through fixes to a data file and loading the data into Snowflake via the COPY command. Snowflake's COPY command syntax supports several parameters that are helpful in debugging or bypassing bad data files that cannot be loaded due to various data problems, which may need to be fixed before the data file can be read and loaded.

[FIRST PASS] LOAD DATA WITH ONE OF THE THREE OPTIONS BELOW:


SKIPPING BAD DATA FILES:

1. Attempt to load with the ON_ERROR = 'SKIP_FILE' error-handling parameter. With this setting, files with errors will be skipped and will not be loaded.

[ON_ERROR='SKIP_FILE']

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) ON_ERROR='skip_file';

OR SKIPPING BAD DATA FILES IF ERRORS EXCEED A SPECIFIED LIMIT:

1. Attempt to load with the more tolerant error handling ON_ERROR='SKIP_FILE_[error_limit]'. With this option, files with errors can be partially loaded as long as the number of errors does not exceed the stated limit. The file is skipped when the number of errors exceeds the stated error limit.

[ON_ERROR='SKIP_FILE_[error_limit]']

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) ON_ERROR='skip_file_10';

OR PERFORM PARTIAL LOAD FROM THE BAD DATA FILES:

1. Attempt to load with the more tolerant error handling ON_ERROR='CONTINUE'. With this option, files with errors can be partially loaded.

[ON_ERROR='CONTINUE']

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) ON_ERROR='continue';

[SECOND PASS] RETURN THE DATA ERRORS: Validate the files that were skipped or failed to load in the first pass. This time, attempt to load the bad data files with VALIDATION_MODE='RETURN_ERRORS'. This makes the COPY command return the list of errors within each data file and the position of those errors.


COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1) VALIDATION_MODE='return_errors';
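If a load has already run (rather than a validation-only pass), the rejected records from that job can also be retrieved afterwards. A sketch, assuming the VALIDATE table function is available in your account:

-- Return the records rejected by the most recent COPY into EXHIBIT
SELECT * FROM table(validate(exhibit, job_id => '_last'));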

[FIX THE BAD RECORDS]

A. Download the bad data files containing the bad records from the staging area to your local drive:

get @~/errorsExhibit/exhibit_02.txt.gz file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData

[PREVIEW RECORDS FROM YOUR BAD DATA FILE(S)] To get visibility into the data file, use an external table and read a few records from the file to see what a sample record looks like.

CREATE OR REPLACE TABLE PreviewFile_ext (fullrow varchar(4096))
  STAGE_LOCATION = @~/errorsExhibit/
  STAGE_FILE_FORMAT = single_column_rows
  COMMENT = 'Bad data file preview table';

SELECT * FROM table(stage(PreviewFile_ext, pattern => '.*exhibit_02.txt.gz')) LIMIT 10;

Fix the bad records manually and write them to a new data file, or regenerate a new data file from the data source containing only the bad records that did not load (as applicable).

B. Upload the fixed bad data file(s) into the staging area and attempt reloading from the fixed file:

PUT file:///Users/fshaarani/SVN/CUSTOMERREPO/examples/dataloading/csv_samples/ErrorData/exhibit_02.txt.gz @~/errorsExhibit/
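After the fixed file is staged, re-run the COPY against it, reusing the file-format options from the earlier passes. With no ON_ERROR override, the default behavior noted earlier applies (the file is skipped on the first error), so a clean completion confirms the fix:

COPY INTO exhibit FROM @~/errorsExhibit/exhibit_02.txt.gz FILE_FORMAT = (FIELD_DELIMITER = '|' SKIP_HEADER = 1);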