Thursday, April 28, 2016

Netezza: datatype storage CHARACTER v.s. NATIONAL CHARACTER

SUMMARY

There is NO significant storage saving (just 3.7%) when choosing between  NATIONAL CHARACTER (Unicode) and CHARACTER (ASCII) for database table design at Netezza.

I recommend choosing a datatype that pertains to design and scalability of the application.

DETAILS

Compare storage saving using CHARACTER VARYING  instead of NATIONAL CHARACTER VARYING  datatype at Netezza NPS 7.2.1.

NATIONAL CHARACTER
CHARACTER
CREATE TABLE ARCHIVE_TABLE_201405
(
       CMTS_NM     NATIONAL CHARACTER VARYING (30 ),
       INTF_NM     NATIONAL CHARACTER VARYING (30 ),
       MAC_ADDR    NATIONAL CHARACTER VARYING (20 ),
       CM_STA      NATIONAL CHARACTER VARYING (20 ),
       TM_OFST     CHARACTER VARYING (20 ),
       IP_ADDR     NATIONAL CHARACTER VARYING (20 ),
       NODE_KEY           NATIONAL CHARACTER VARYING (50 ),
       BATCH_TS    TIMESTAMP,
       AS_OF_DT    TIMESTAMP,
       DATA_ACQ_ID INTEGER
)
DISTRIBUTE ON (AS_OF_DT, MAC_ADDR);
CREATE TABLE ARCHIVE_TABLE_201405_VARCHAR
(
       CMTS_NM     CHARACTER VARYING (30 ),
       INTF_NM     CHARACTER VARYING (30 ),
       MAC_ADDR    CHARACTER VARYING (20 ),
       CM_STA      CHARACTER VARYING (20 ),
       TM_OFST     CHARACTER VARYING (20 ),
       IP_ADDR     CHARACTER VARYING (20 ),
       NODE_KEY           CHARACTER VARYING (50 ),
       BATCH_TS    TIMESTAMP,
       AS_OF_DT    TIMESTAMP,
       DATA_ACQ_ID INTEGER
)
DISTRIBUTE ON (AS_OF_DT, MAC_ADDR);
ALLOCATED_GB: 38 Gb
USED_GB     : 38 Gb
RECORDS     : 1,369,391,278
ALLOCATED_GB: 36.7 Gb
USED_GB     : 36.7 Gb
RECORDS     : 1,369,391,278


From description below, once could assume that each NVARCHAR (Unicode) character takes 4x larger storage than VARCHAR (ASCII).  The result of the experiment indicates that this is not the case.

VARCHARCHARACTER VARYING, VARCHAR(n), CHARACTER VARYING(n), CHAR VARYING(n)Variable length character string to a maximum length of n. If you do not specify n, the default is an unsized VARCHAR value. There is no blank padding, and the value is stored as entered. The maximum character string size is 64,000.
NVARCHARNATIONAL CHARACTER VARYING, NATIONAL CHAR VARYING, NATIONAL CHAR VARYING(n), NATIONAL CHARACTER VARYING(n), and NVARCHAR(n)Variable length character string to a maximum length of n. If you do not specify n, the default is an unsized NVARCHAR value. The maximum length is 16,000 characters.


Friday, April 1, 2016

Netezza: nzload how to handle NUL 0x00 and non-printable characters

PROBLEM:
nzload  responds with errors:

bad #: input row #(byte offset to last char examined) [field #, declaration] diagnostic, "text consumed"[last char examined]
----------------------------------------------------------------------------------------------------------------------------
1: 58351(52) [4, NVARCHAR(1000)] binary zero byte - ignoreZero not enabled, "OOL-US-TDMA"[0x00<NUL>]
2: 59487(104) [4, NVARCHAR(1000)] found control characters and ctrlChars option is off, "SW-BU-BOOST-US:5500000|SW-BU-BOOST-DS:37500000|#K"[0x19]
3: 146957(60) [4, NVARCHAR(1000)] binary zero byte - ignoreZero not enabled, "OOL-US"[0x00<NUL>]
4: 858805(53) [4, NVARCHAR(1000)] binary zero byte - ignoreZero not enabled, "BOOST-US"[0x00<NUL>]



SOLUTION: 

Adding nzload option -ignoreZero YES ingnores loading NUL charactes. They will be stripped out.
Adding nzload option -ctrlChars loads control characters into a table  

nzload
-host 10.10.10.100
-u  db_user
-pw db_password
-db MY_DATABASE
-t  MY_TABLE
-delim ";"
-quotedValue NO
-encoding INTERNAL
-df /usr/local/data/MY_TABLE.data
-lf /usr/local/data/$(date +'%Y%m%d%H%M%S')_$(hostname)_MY_TABLE.nzload.log
-bf /usr/local/data/$(date +'%Y%m%d%H%M%S')_$(hostname)_ MY_TABLE.nzload.bad
-maxErrors 2
-DateStyle 'DMY'
-DateDelim ' '
-TimeStyle '24HOUR'
-TimeDelim ':'
-fillRecord
-ignoreZero YES     <-- This takes care of "binary zero byte - ignoreZero not enabled"
-ctrlChars          <-- This takes care of "found control characters and ctrlChars option is off"



INVESTIGATION AND DETAILS


Initial lookup using cat command does not show NUL characters.


vi shows null characters but it's not conclusive: 


Notepad++ comes at help not only displaying non-printable ASCII characters but also identifying their type.