Related links:
PROBLEM:
Parse text file at S3: transform characters in lines of text. Keep the same file, same compression format at S3.SOLUTION:
Stream data from S3 via sed and back to S3 via pipes.This trick is useful to cleanse the data inside the files stored at S3. The typical use case, preceding the data load, is to eliminate a special characters and manipulate with characters in the lines of text, which can't be effectively handled by loading tools. "At place" means that a file will be changed at S3, however it still has to be transferred to and form the server to conduct the actual transformations. Notice, to transfer files between S3 and local AWS EC2 is free. Only EC2 usage incurred. Since the process is pipe-based no intermediate storage needs to be provisioned. The process streams text from pipe to pipe making the transformation on-the-fly and back to S3. Main form:
### Logic: AWS copy into stout -> uncompress -> sed transform -> compress -> back to AWS S3 aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gzFor multiple transformation that require many steps consider using sed -f option keeping the long list of transformation in a saved file: sed -f scriptname.
The example put this all together illustrating step-by-step:
### VIEW ORIGINAL UNTRANSFORMED LINES ### ubuntu@srv:~$ aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | grep '"' 390 2019-08-20 03:49:44 48041202-e88a-3bc7-87cd-66dcb0ef566c "Star" print porcelain box with hand and ring 492 2019-08-20 03:50:15 53b30df4-9d77-36f2-93b9-22fa35043722 Velvet cushion with "tiger" embroidery 667 2019-08-20 03:51:10 85aa916f-f4d1-3d9b-98a3-63665c54c145 Star Eye XL demitasse cup and saucer, double set 811 2019-08-20 03:52:28 811bcc15-3075-3a5a-b68a-cd5d301e7c7c "Urtica Ferox" hand and ring mug 3161 2019-08-20 04:13:02 c7bb8a7a-10ea-3f21-9825-ddb9fc6cea24 Freesia, XL "Maison De L'Amour" candle ### CHECK SIZE AND TIMESTAMP before THE TRANSFORMATION: ### ubuntu@srv:~$ aws s3 ls s3://somebucket/ 2019-08-23 15:42:13 0 2019-08-23 16:07:03 21091 sourcefile.csv.gz ### !!! TRANSFORM HERE !!! ### aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | sed 's/"/\\"/g' | gzip -cq | aws s3 cp - s3://somebucket/sourcefile.csv.gz ### CHECK SIZE AND TIMESTAMP after THE TRANSFORMATION: ### ubuntu@srv:~$ aws s3 ls s3://somebucket/ 2019-08-23 15:42:13 0 2019-08-23 16:10:02 21086 sourcefile.csv.gz ### Time and size changed ### Verify: view transformed records: ### aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | grep '"' ### VIEW TRANSFORMED LINES: NOTICE ALL DOUBLE QUOTES ARE ESCAPED ### ubuntu@srv:~$ aws s3 cp s3://somebucket/sourcefile.csv.gz - | gzip -dc | grep '"' 390 2019-08-20 03:49:44 48041202-e88a-3bc7-87cd-66dcb0ef566c \"Star\" print porcelain box with hand and ring 492 2019-08-20 03:50:15 53b30df4-9d77-36f2-93b9-22fa35043722 Velvet cushion with \"tiger\" embroidery 667 2019-08-20 03:51:10 85aa916f-f4d1-3d9b-98a3-63665c54c145 Star Eye XL demitasse cup and saucer, double set 811 2019-08-20 03:52:28 811bcc15-3075-3a5a-b68a-cd5d301e7c7c \"Urtica Ferox\" hand and ring mug 3161 2019-08-20 04:13:02 c7bb8a7a-10ea-3f21-9825-ddb9fc6cea24 Freesia, XL \"Maison De L'Amour\" candle