Friday, August 16, 2019

Snowflake: PUT command and issues

Related links: 

Working with Snowflake PUT command I've uncovered a few undocumented facts worth posting.  This intended to help you to make the most effective design decision transferring data to Snowflake database effectively having the process scale to the load.


PUT uses /tmp by default

Snowflake PUT command copies/replicates your local host data source files on itself. The default directory is /tmp. You can change the temp directory to point to a larger disk space by setting local environment variables:

    export TEMP=/mnt/nfs/temp/largedisk
    export TMPDIR=$TEMP
    export TMP=$TEMP
In this case, PUT command will use indicated  directory for local temporary staging the files before transfer. 


Why this is important? 

If you transfer large files that don't fit into /tmp or your process runs many PUT commands concurrently you will get: 

    [Errno 28] No space left on device, file=/mnt/nfs/temp/largefile_1of999.txt, real file=/mnt/nfs/temp/largefile_1of999.txt
The message is misleading, because I run out of space on /tmp that PUT implicitly uses , not an indicated file, whereas source directly /mnt/nfs/temp/ still has plenty of space. 


Compress or not compress ? 

By default PUT command will compress files locally before transfer: option AUTO_COMPRESS = TRUE.  However,  running a few practical tests revealed that default PUT compression is slow and CPU hungry.  Moreover, it consumes twice the amount of local disk space on the local server apparently making uncompressed temporary copy of a file first (see above), then compressing it.  It creates a lot of disk and CPU contention. To overcome this problem, I recommend pre-compressing files locally first by gzip -1c then transfer with PUT option  AUTO_COMPRESS = False.  Like this:

    gzip -1q /tmp/largefile_1of999.txt   ## 1 - compress faster
    snowsql -q  "PUT   file:///tmp/largefile_1of999.txt.gz   @~/ AUTO_COMPRESS = False;"
    snowsql -q  "COPY INTO target_database..target_table from @~/largefile_1of999.txt.gz FILE_FORMAT = ${DBS}.public.fmt_tsv_gzip ON_ERROR = CONTINUE;"


Set snowsql option PROGRESS_BAR = False 

The default PROGRESS_BAR = TRUE for snowsql client.  This creates megabytes log files when redirecting the output to text log file  
    <LF>
    largefile_1of999.txt.gz(2618.64MB): [----------] 0.01% (0.458s, 5713.03MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [----------] 0.02% (0.461s, 5684.18MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [----------] 0.03% (0.581s, 4507.99MB/s)<CR>
    .... 12,000 records ...                                                     <CR>
    largefile_1of999.txt.gz(2618.64MB): [##########] 99.98% (37.905s, 69.08MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [##########] 99.99% (37.906s, 69.08MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [##########] 100.00% Done (37.954s, 68.99MB/s).<CR><LF>

You can remove these repeatable records after the file was generated:

    sed -i -e 's/^\d013.*$//g' ${LOGFDATA}  ### Remove progress status of file transfer

But better solution is set to false the PROGRESS_BAR option and stop producing these: 

    snowsql -q "!set progress_bar = FALSE; PUT  file:///tmp/largefile_1of999.txt.gz ..."
or
    snowsql -o progress_bar=false -q "PUT  file:///tmp/largefile_1of999.txt.gz ... "



.

No comments:

Post a Comment