Friday, August 23, 2019

AWS: parse S3 files "in place", the data transformation one-liner

PROBLEM:

Parse a text file on S3: transform characters in the lines of text while keeping the same file and the same compression format on S3.

SOLUTION: 

Stream data from S3 through sed and back to S3 via pipes.

This trick is useful for cleansing the data inside files stored on S3. The typical use case, preceding a data load, is to eliminate special characters or otherwise manipulate characters in the lines of text that loading tools can't handle effectively. "In place" means the file is overwritten under the same S3 key; it still has to be transferred to and from the server that performs the actual transformation. Note that data transfer between S3 and an EC2 instance in the same region is free; only the EC2 usage is charged. Since the process is pipe-based, no intermediate storage needs to be provisioned: the text streams from pipe to pipe, is transformed on the fly, and goes straight back to S3. Main form:
 
   ### Logic: aws s3 cp to stdout -> uncompress -> sed transform -> recompress -> aws s3 cp back to S3
   aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz
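Because the upload overwrites the source key, a pipeline failure mid-stream could leave you without a good copy. A more cautious variant, sketched below with an assumed .tmp scratch key, writes to a temporary key first and renames it once the transformation succeeds:

   ### Safer variant: write to a scratch key first, then rename (the .tmp key is an assumption)
   aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz.tmp
   aws s3 mv s3://somebucket/sourcefile.csv.gz.tmp s3://somebucket/sourcefile.csv.gz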
For multiple transformations that require many steps, consider keeping the long list of substitutions in a script file and running it with the -f option: sed -f scriptname.
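For instance, a hypothetical transforms.sed script could hold the whole cleansing list, one substitution per line (the substitutions here are only illustrative):

   ### Keep the transformation list in a sed script file (contents are hypothetical)
   printf '%s\n'  's/"/\\"/g'  's/\r$//'  > transforms.sed   ## escape double quotes; strip trailing carriage returns
   aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed -f transforms.sed | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz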

The following example puts it all together, step by step:

### VIEW ORIGINAL UNTRANSFORMED LINES ###
ubuntu@srv:~$ aws s3 cp s3://somebucket/sourcefile.csv.gz  - | gzip -dc | grep '"' 
390     2019-08-20 03:49:44     48041202-e88a-3bc7-87cd-66dcb0ef566c    "Star" print porcelain box with hand and ring
492     2019-08-20 03:50:15     53b30df4-9d77-36f2-93b9-22fa35043722    Velvet cushion with "tiger" embroidery
667     2019-08-20 03:51:10     85aa916f-f4d1-3d9b-98a3-63665c54c145    Star Eye XL demitasse cup and saucer, double set
811     2019-08-20 03:52:28     811bcc15-3075-3a5a-b68a-cd5d301e7c7c    "Urtica Ferox" hand and ring mug
3161    2019-08-20 04:13:02     c7bb8a7a-10ea-3f21-9825-ddb9fc6cea24    Freesia, XL "Maison De L'Amour" candle

### CHECK SIZE AND TIMESTAMP before THE TRANSFORMATION: ###
ubuntu@srv:~$ aws s3 ls s3://somebucket/
2019-08-23 15:42:13          0
2019-08-23 16:07:03      21091 sourcefile.csv.gz

### !!! TRANSFORM HERE !!! ###
aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz

### CHECK SIZE AND TIMESTAMP after THE TRANSFORMATION: ###
ubuntu@srv:~$ aws s3 ls s3://somebucket/
2019-08-23 15:42:13          0
2019-08-23 16:10:02      21086 sourcefile.csv.gz  ### Time and size changed

### VIEW TRANSFORMED LINES: NOTICE ALL DOUBLE QUOTES ARE ESCAPED ###
ubuntu@srv:~$ aws s3 cp s3://somebucket/sourcefile.csv.gz  - | gzip -dc | grep '"' 
390     2019-08-20 03:49:44     48041202-e88a-3bc7-87cd-66dcb0ef566c    \"Star\" print porcelain box with hand and ring
492     2019-08-20 03:50:15     53b30df4-9d77-36f2-93b9-22fa35043722    Velvet cushion with \"tiger\" embroidery
667     2019-08-20 03:51:10     85aa916f-f4d1-3d9b-98a3-63665c54c145    Star Eye XL demitasse cup and saucer, double set
811     2019-08-20 03:52:28     811bcc15-3075-3a5a-b68a-cd5d301e7c7c    \"Urtica Ferox\" hand and ring mug
3161    2019-08-20 04:13:02     c7bb8a7a-10ea-3f21-9825-ddb9fc6cea24    Freesia, XL \"Maison De L'Amour\" candle

Friday, August 16, 2019

Snowflake: PUT command and issues

Working with the Snowflake PUT command I've uncovered a few undocumented facts worth posting. They should help you make the most effective design decisions when transferring data to a Snowflake database and keep the process scaling with the load.


PUT uses /tmp by default

The Snowflake PUT command makes a local staging copy of your source data files before transferring them. The default staging directory is /tmp. You can point the temporary directory to a larger disk by setting local environment variables:

    export TEMP=/mnt/nfs/temp/largedisk
    export TMPDIR=$TEMP
    export TMP=$TEMP
With these set, the PUT command will use the indicated directory for temporary local staging of the files before the transfer.
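A quick sanity check before a large transfer can save a failed PUT; a minimal sketch, assuming the staging path exported above:

    df -h /mnt/nfs/temp/largedisk    ### confirm there is room for a full staging copy of the source file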


Why is this important?

If you transfer large files that don't fit into /tmp, or your process runs many PUT commands concurrently, you will get:

    [Errno 28] No space left on device, file=/mnt/nfs/temp/largefile_1of999.txt, real file=/mnt/nfs/temp/largefile_1of999.txt
The message is misleading: the process ran out of space on /tmp, which PUT uses implicitly, not at the indicated file path, while the source directory /mnt/nfs/temp/ still had plenty of space.


To compress or not to compress?

By default the PUT command compresses files locally before the transfer (option AUTO_COMPRESS = TRUE). However, a few practical tests revealed that the default PUT compression is slow and CPU hungry. Moreover, it consumes twice the amount of disk space on the local server, apparently making an uncompressed temporary copy of the file first (see above) and then compressing it, which creates a lot of disk and CPU contention. To overcome this, I recommend pre-compressing the files locally with gzip -1 and then transferring them with the PUT option AUTO_COMPRESS = FALSE. Like this:

    gzip -1q /tmp/largefile_1of999.txt   ## -1 = fastest compression
    snowsql -q  "PUT   file:///tmp/largefile_1of999.txt.gz   @~/ AUTO_COMPRESS = False;"
    snowsql -q  "COPY INTO target_database..target_table from @~/largefile_1of999.txt.gz FILE_FORMAT = ${DBS}.public.fmt_tsv_gzip ON_ERROR = CONTINUE;"


Set snowsql option PROGRESS_BAR = False 

The snowsql client defaults to PROGRESS_BAR = TRUE. This produces multi-megabyte log files when the output is redirected to a text log file:
    <LF>
    largefile_1of999.txt.gz(2618.64MB): [----------] 0.01% (0.458s, 5713.03MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [----------] 0.02% (0.461s, 5684.18MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [----------] 0.03% (0.581s, 4507.99MB/s)<CR>
    .... 12,000 records ...                                                     <CR>
    largefile_1of999.txt.gz(2618.64MB): [##########] 99.98% (37.905s, 69.08MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [##########] 99.99% (37.906s, 69.08MB/s)<CR>
    largefile_1of999.txt.gz(2618.64MB): [##########] 100.00% Done (37.954s, 68.99MB/s).<CR><LF>

You can remove these repeated records after the log file has been generated:

    sed -i -e 's/^\d013.*$//g' ${LOGFDATA}  ### Remove progress status of file transfer

But a better solution is to set the PROGRESS_BAR option to FALSE and stop producing these lines in the first place:

    snowsql -q "!set progress_bar = FALSE; PUT  file:///tmp/largefile_1of999.txt.gz ..."
or
    snowsql -o progress_bar=false -q "PUT  file:///tmp/largefile_1of999.txt.gz ... "
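If most of your scripts should run without the progress bar, the option can also be made persistent in the snowsql configuration file; the file location and section below are assumptions to verify against your snowsql version:

    ### ~/.snowsql/config (assumed location)
    [options]
    progress_bar = False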


