Friday, August 23, 2019

AWS: parse S3 files "at place", the data transformation one-liner

Related links:



PROBLEM:

Parse text file at S3: transform characters in lines of text. Keep the same file, same compression format at S3.

SOLUTION: 

Stream data from S3 via sed and back to S3 via pipes.

This trick is useful to cleanse the data inside the files stored at S3. The typical use case, preceding the data load, is to eliminate a special characters and manipulate with characters in the lines of text, which can't be effectively handled by loading tools. "At place" means that a file will be changed at S3, however it still has to be transferred to and form the server to conduct the actual transformations. Notice, to transfer files between S3 and local AWS EC2 is free. Only EC2 usage incurred.  Since the process is pipe-based no intermediate storage needs to be provisioned. The process streams text from pipe to pipe making the transformation on-the-fly and back to S3. Main form:
 
   ### Logic: AWS copy into stout -> uncompress -> sed transform -> compress -> back to AWS S3
   aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz
For multiple transformation that require many steps consider using  sed -f option keeping the long list of transformation in a saved file: sed -f scriptname.

The example put this all together illustrating step-by-step:

### VIEW ORIGINAL UNTRANSFORMED LINES ###
ubuntu@srv:~$ aws s3 cp s3://somebucket/sourcefile.csv.gz  - | gzip -dc | grep '"' 
390     2019-08-20 03:49:44     48041202-e88a-3bc7-87cd-66dcb0ef566c    "Star" print porcelain box with hand and ring
492     2019-08-20 03:50:15     53b30df4-9d77-36f2-93b9-22fa35043722    Velvet cushion with "tiger" embroidery
667     2019-08-20 03:51:10     85aa916f-f4d1-3d9b-98a3-63665c54c145    Star Eye XL demitasse cup and saucer, double set
811     2019-08-20 03:52:28     811bcc15-3075-3a5a-b68a-cd5d301e7c7c    "Urtica Ferox" hand and ring mug
3161    2019-08-20 04:13:02     c7bb8a7a-10ea-3f21-9825-ddb9fc6cea24    Freesia, XL "Maison De L'Amour" candle

### CHECK SIZE AND TIMESTAMP before THE TRANSFORMATION: ###
ubuntu@srv:~$ aws s3 ls s3://somebucket/
2019-08-23 15:42:13          0
2019-08-23 16:07:03      21091 sourcefile.csv.gz

### !!! TRANSFORM HERE !!! ###
aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz

### CHECK SIZE AND TIMESTAMP after THE TRANSFORMATION: ###
ubuntu@srv:~$ aws s3 ls s3://somebucket/
2019-08-23 15:42:13          0
2019-08-23 16:10:02      21086 sourcefile.csv.gz  ### Time and size changed

### Verify: view transformed records:  ###
aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | grep '"'

### VIEW TRANSFORMED LINES: NOTICE ALL DOUBLE QUOTES ARE ESCAPED ###
ubuntu@srv:~$ aws s3 cp s3://somebucket/sourcefile.csv.gz  - | gzip -dc | grep '"' 
390     2019-08-20 03:49:44     48041202-e88a-3bc7-87cd-66dcb0ef566c    \"Star\" print porcelain box with hand and ring
492     2019-08-20 03:50:15     53b30df4-9d77-36f2-93b9-22fa35043722    Velvet cushion with \"tiger\" embroidery
667     2019-08-20 03:51:10     85aa916f-f4d1-3d9b-98a3-63665c54c145    Star Eye XL demitasse cup and saucer, double set
811     2019-08-20 03:52:28     811bcc15-3075-3a5a-b68a-cd5d301e7c7c    \"Urtica Ferox\" hand and ring mug
3161    2019-08-20 04:13:02     c7bb8a7a-10ea-3f21-9825-ddb9fc6cea24    Freesia, XL \"Maison De L'Amour\" candle

No comments:

Post a Comment