LaurieF
Barite | Level 11
  • The Oracle source schema is defined as utf-8
  • The SAS Oracle libname to the schema has encoding="utf-8" on it (see the sketch after this list)
  • SAS is running with utf-8 encoding
  • Hadoop naturally runs with utf-8
  • Because Hadoop is weird, I've got user-written transformations to do SCD-2 processing
  • Because Hadoop is blindingly fast, rebuilding the target table from scratch appears so far to be an extremely efficient way of moving data around. But Hadoop is weird.
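
A minimal sketch of that setup, purely for concreteness - server names, schemas and credentials below are placeholders, not the real ones:

/* SAS session started with -encoding utf-8 */

/* Oracle source schema (UTF-8), with encoding= on the libname as above */
libname ora_src oracle
   path=ORAUTF8 schema=src_schema
   user=ora_user password="xxxxxxxx"
   encoding="utf-8";

/* Hive target - Hadoop stores strings as UTF-8 natively */
libname hdp_tgt hadoop
   server="hive.example.com" port=10000
   schema=tgt_schema;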

When I register the metadata for an Oracle table, it shows the column length for character variables as four times the varchar2 length - to allow for the maximum number of bytes a UTF-8 value can occupy. The Hadoop table registration shows the (utf-8) character length.

 

Assume a varchar2(512) in Oracle going to varchar(512) in Hadoop (initially in Hive, but later processed in Impala/Parquet). The registration for that column shows it as having a length of 2048. Shrug - I don't care in this case, because I know there are no special characters. It becomes an issue later with other source data that I don't have any control over. I need to see the data to check what's going on in that case. But not today.

 

When I do a bog-standard SAS/SQL insert into the Hive table, I get this for the status_message column:

WARNING: Character expression will be truncated when assigned to character column status_message.
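
Roughly what that insert looks like (library, table and column names here are placeholders, not the real ones):

proc sql;
   insert into hdp_tgt.status_table
          (status_id, status_message)
   select status_id,
          status_message            /* registered in SAS as length 2048 (512 x 4 bytes) */
   from   ora_src.status_table;     /* Hive column is varchar(512), hence the warning */
quit;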

 

This is a message I'd expect to get with catx in SQL, where the result length defaults to 200 (BTW, SAS, can you fix this?!). But I appreciate that in this case it sees a length of 2048 which it's trying to squeeze into 512, and SAS semi-helpfully wants me to know about it.
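
(For the catx-in-SQL case, the usual workaround is to declare the result length explicitly so it no longer defaults to 200 - a trivial sketch with made-up names:)

proc sql;
   create table work.messages as
   select catx(' ', msg_prefix, msg_text) as status_message length=512
   from   work.message_parts;
quit;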

 

What to do? What to do? I can make the target table have the exploded lengths, but that's cheating. I don't want to have to cheat. I want to be able to avoid all warning messages if I can, and obviously preserve data integrity. What happens when I do have UTF-8 characters, like Te Reo Māori macrons, or emoji characters (apparently, there will be some!)?

 

Laurie

3 REPLIES
Patrick
Opal | Level 21

@LaurieF 

You're raising a few different things here, so let me just throw in a few thoughts and answers.

 

Multiplied Lengths

SAS defines character lengths in number of bytes, not number of characters. Most databases define (or allow you to define) lengths in number of characters.

UTF-8 can use up to 4 bytes of storage for a single character.

To avoid string truncation when moving data from a database to a SAS table, SAS creates SAS character variables with a multiplied length, BUT there are libname options that let you change the default behavior (e.g. multiplier=...) for access to Hadoop.

If you know that there won't be any multibyte characters in the source, then set the multiplier option to 1.
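
For example, on the Oracle side the length multiplication can be switched off on the libname; the exact option name differs by engine and release, so check the SAS/ACCESS to Hadoop documentation for the equivalent on the Hive side. A hedged sketch (names and connection details are placeholders):

libname ora_src oracle
   path=ORAUTF8 schema=src_schema
   user=ora_user password="xxxxxxxx"
   adjust_byte_semantic_column_lengths=no;   /* keep varchar2(512) as SAS length 512 -
                                                only safe if the data really is single-byte */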

 

Because Hadoop is weird, I've got user-written transformations to do SCD-2 processing

HDFS is an immutable file system, meaning you can only write once to a file. So: insert and read work as anywhere else; update and delete require a full recreation of the file on HDFS.

I haven't got my hands on it yet, but I feel that SCD2 using Hadoop should use a somewhat different table load and data organisation than you would normally.

So... what if you insert only, for any change? You never update a record; you just add a new row, even if it's only an expiration (e.g. the record is in target but no longer in source). An expiration is then simply taking the most current record from target and inserting it again with a modified change end date.

You then define a Hive view over this SCD2 table in Hadoop with a window function which returns, per key, only the row with the latest change begin date.

And then maybe also implement a housekeeping process which, from time to time, adds history records to a history table and then recreates the active SCD2 table with current records only (or with history records back to a certain time only).
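
As a rough sketch of that insert-only idea (table and column names are made up, and the date handling is simplified) - every change, including an expiration, is just an append:

proc sql;
   /* new and changed source records become new open-ended version rows */
   insert into hdp_tgt.customer_scd2
          (cust_key, cust_name, change_begin_dt, change_end_dt)
   select cust_key, cust_name, change_begin_dt, '31DEC9999:0:0:0'dt
   from   work.changed_rows;

   /* an expiration: re-insert the most current target row with a closed end date */
   insert into hdp_tgt.customer_scd2
          (cust_key, cust_name, change_begin_dt, change_end_dt)
   select cust_key, cust_name, change_begin_dt, expiry_dt
   from   work.expired_current_rows;
quit;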

 

LaurieF
Barite | Level 11

I'll play around with the multiplier option. Thank you.

 

As for the second point - I think you missed what I said. Hadoop is weird, and it's weird because you can't update data. I know that. And since the design, which I have no say over, already calls for SCD2 tables in Hadoop, I had no choice but to write transformations to handle updating existing tables with changing source data. What you're suggesting would make an already complicated process even worse.

Patrick
Opal | Level 21

@LaurieF 

According to the documentation, Update, Delete and Merge are implemented in Hive, and you can use them if ACID is turned on. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML 
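
A hedged sketch of what that can look like from SAS via explicit pass-through - this assumes the target is a transactional table (ORC, transactional=true) on a cluster with ACID enabled, and the connection details and names are placeholders:

proc sql;
   connect to hadoop (server="hive.example.com" port=10000 schema=tgt_schema);
   execute (
      update customer_scd2
         set change_end_dt = cast('2019-01-31 00:00:00' as timestamp)
       where cust_key = 42
         and change_end_dt is null
   ) by hadoop;
   disconnect from hadoop;
quit;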

 

Because HDFS is an immutable file system, insert operations will perform much better than updates or deletes.

 

All I'm saying is that you've got two choices when going for an insert-only approach:

Either a full table replace, or inserts only for any change with some sort of record versioning (which can also use date columns). And to ease queries on this versioned table, also define a Hive view over it with a partition clause that returns only the most current version of each record.
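
Something like this for the view, again via explicit pass-through (the key and versioning columns are illustrative; adjust them to your table):

proc sql;
   connect to hadoop (server="hive.example.com" port=10000 schema=tgt_schema);
   execute (
      create view customer_scd2_current as
      select cust_key, cust_name, change_begin_dt, change_end_dt
      from (
         select s.*,
                row_number() over (partition by cust_key
                                   order by change_begin_dt desc) as rn
         from customer_scd2 s
      ) v
      where rn = 1
   ) by hadoop;
   disconnect from hadoop;
quit;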

