Netezza: SQL and Admin: Snowflake: PUT command and issues

PUT uses /tmp by default

Snowflake PUT command copies/replicates your local host data source files on itself. The default directory is /tmp. You can change the temp directory to point to a larger disk space by setting local environment variables:

    export TEMP=/mnt/nfs/temp/largedisk
    export TMPDIR=$TEMP
    export TMP=$TEMP

In this case, PUT command will use indicated directory for local temporary staging the files before transfer.

Why this is important?

If you transfer large files that don't fit into /tmp or your process runs many PUT commands concurrently you will get:

    [Errno 28] No space left on device, file=/mnt/nfs/temp/largefile_1of999.txt, real file=/mnt/nfs/temp/largefile_1of999.txt

The message is misleading, because I run out of space on /tmp that PUT implicitly uses , not an indicated file, whereas source directly /mnt/nfs/temp/ still has plenty of space.

Compress or not compress ?

By default PUT command will compress files locally before transfer: option AUTO_COMPRESS = TRUE. However, running a few practical tests revealed that default PUT compression is slow and CPU hungry. Moreover, it consumes twice the amount of local disk space on the local server apparently making uncompressed temporary copy of a file first (see above), then compressing it. It creates a lot of disk and CPU contention. To overcome this problem, I recommend pre-compressing files locally first by gzip -1c then transfer with PUT option AUTO_COMPRESS = False. Like this:

    gzip -1q /tmp/largefile_1of999.txt   ## 1 - compress faster
    snowsql -q  "PUT   file:///tmp/largefile_1of999.txt.gz   @~/ AUTO_COMPRESS = False;"
    snowsql -q  "COPY INTO target_database..target_table from @~/largefile_1of999.txt.gz FILE_FORMAT = ${DBS}.public.fmt_tsv_gzip ON_ERROR = CONTINUE;"

Set snowsql option PROGRESS_BAR = False

The default PROGRESS_BAR = TRUE for snowsql client. This creates megabytes log files when redirecting the output to text log file

<LF>

largefile_1of999.txt.gz(2618.64MB): [----------] 0.01% (0.458s, 5713.03MB/s)<CR>

largefile_1of999.txt.gz(2618.64MB): [----------] 0.02% (0.461s, 5684.18MB/s)<CR>

largefile_1of999.txt.gz(2618.64MB): [----------] 0.03% (0.581s, 4507.99MB/s)<CR>

.... 12,000 records ... <CR>

largefile_1of999.txt.gz(2618.64MB): [##########] 99.98% (37.905s, 69.08MB/s)<CR>

largefile_1of999.txt.gz(2618.64MB): [##########] 99.99% (37.906s, 69.08MB/s)<CR>

largefile_1of999.txt.gz(2618.64MB): [##########] 100.00% Done (37.954s, 68.99MB/s).<CR><LF>

You can remove these repeatable records after the file was generated:

    sed -i -e 's/^\d013.*$//g' ${LOGFDATA}  ### Remove progress status of file transfer

But better solution is set to false the PROGRESS_BAR option and stop producing these:

    snowsql -q "!set progress_bar = FALSE; PUT  file:///tmp/largefile_1of999.txt.gz ..."

or

    snowsql -o progress_bar=false -q "PUT  file:///tmp/largefile_1of999.txt.gz ... "

Netezza: SQL and Admin

Friday, August 16, 2019

Snowflake: PUT command and issues

PUT uses /tmp by default

Why this is important?

Compress or not compress ?

Set snowsql option PROGRESS_BAR = False

No comments:

Post a Comment