JJ_Murrin
Fluorite | Level 6

I am looking for some ways (and possibly the best way) to assemble DI Studio job(s) to load a 4NF (fourth normal form) table structure in a database.

 

For example, let's say there is an ID table called CUSTOMER - within it is a sequential numeric ID as the primary key and the Current_Name. As its 'children' tables, there are tables called INDIVIDUAL and ORGANIZATION - within each there is the CUSTOMER_ID (FK) and SSN or TAX_ID respectively (with a few other columns). And finally, there are tables below ORGANIZATION that categorize the IDs into groups: BUSINESS, GOVERNMENT_AGENCY, NON-PROFIT - each of these contains only the CUSTOMER_ID (FK). [JPEG attached for clarity.]

 

Without really questioning the 4NF structure and assuming that the data needed to populate these tables was available in a SAS dataset, what job or series of DI Studio jobs - and the transformations within - would be needed? 

 

Hopefully, you can see the challenge of where to even begin. Let's say, for even a simple example, you have an input SAS dataset of Government Agencies you wish to load. You may first need to check the ORGANIZATION table to see if it has already been loaded (sort of an SCD2 loader). If there are rows that are not already within the table, then you must load the Current_Name into the CUSTOMER table to get the latest Customer_IDs; then, you can load the ORGANIZATION table with the other content and finally load the Customer_IDs to the GOVERNMENT_AGENCY table. If this sounds right, which exact transformations would best accomplish this?
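To make the intended flow concrete, here is a rough sketch of that load order in plain SQL terms (Python/sqlite3) rather than DIS transformations - the schema, names, and sample data are only guesses based on the model described above:

```python
import sqlite3

# Minimal guessed schema from the model described in the post.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CUSTOMER (
    customer_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    current_name TEXT UNIQUE
);
CREATE TABLE ORGANIZATION (
    customer_id INTEGER REFERENCES CUSTOMER(customer_id),
    tax_id      TEXT
);
CREATE TABLE GOVERNMENT_AGENCY (
    customer_id INTEGER REFERENCES CUSTOMER(customer_id)
);
""")

# Incoming extract of government agencies: (name, tax_id).
extract = [("Dept of Roads", "11-1111111"), ("City Water Board", "22-2222222")]

for name, tax_id in extract:
    # 1. Upsert the parent row, matching on Current_Name.
    con.execute("INSERT OR IGNORE INTO CUSTOMER (current_name) VALUES (?)", (name,))
    # 2. Read the surrogate key back via the business key.
    (cid,) = con.execute(
        "SELECT customer_id FROM CUSTOMER WHERE current_name = ?", (name,)).fetchone()
    # 3. Load the child tables with the resolved key.
    con.execute("INSERT INTO ORGANIZATION (customer_id, tax_id) VALUES (?, ?)", (cid, tax_id))
    con.execute("INSERT INTO GOVERNMENT_AGENCY (customer_id) VALUES (?)", (cid,))
con.commit()
```

In DIS terms, step 1 maps to a Table Loader, step 2 to a Lookup or Join back to the freshly loaded CUSTOMER table, and step 3 to further Table Loaders - but the key point is the ordering, not the specific transformation.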

 

DI Studio - 4.903

 

 

Thanks in advance!

Patrick
Opal | Level 21

Welcome to the SAS Communities.

 

I'd probably go for a top-down approach here and start with an interface spec and a high-level design. At a high level, don't worry about which DIS transformations you'll need to implement the design; instead work out things like the data architecture/data flow and the load patterns to be implemented - and also define what's in scope and out of scope (like "real-time" transactional loads).

- What type of extract files will you get/are you designing for

  - get some type of interface spec created and agreed on as early as possible

- How to load these extract files (logically).

- Load frequency - intra-day, daily overnight, ....?

 

On a technical level, for a detailed design and implementation, answers to questions like the ones below are important; they will influence how you actually need to implement things technically.

- Are the target tables in SAS or in a database?

- What are the expected data volumes and growth?

- What's your batch window? How important is "tweaking for performance"?

 

Looking into the picture you've posted:

- I assume customer_id is a surrogate key that you need to create; it's not something you already get in the source file.

- It looks like CUSTOMER will need an UPDATE/INSERT. 

- I guess not but.... Do you also need to DELETE customers?

- INDIVIDUAL/ORGANIZATION: This looks like a SCD2 table so you will need to create a key here as well (like: organisation_sk)

- BUSINESS/GOVERNMENT ....   They will need the key from the parent (organisation_sk) as FK in order to link to the correct version of the record.

JJ_Murrin
Fluorite | Level 6

Patrick,

Thanks for the reply. I certainly get your high-level points, but what I am looking for is specifics within the DI jobs themselves; this is what you started to delve into in your last paragraph.

 

Yes, it would seem that CUSTOMER would need to be loaded first. It can be assumed that an IDENTITY setting on the table would take care of assigning the next incremental Customer_ID value, and Current_Name could be the match field within the Update/Insert Table Loader.

 

JOB1_Load_CUSTOMER:

    Input: SAS dataset

      Step 1: Extract from SAS dataset

      Step 2: Table Loader (Update/Insert) - Ignore matching rows on Current_Name / Insert new rows on Current_Name

   Output: CUSTOMER table 

 

Now... let's say that logically we know that Customer_IDs 100 through 111 were loaded. When going into our next job (JOB2_Load_ORGANIZATION), how will it know that 100 through 111 were the newest inserts, and that only those records should be added to ORGANIZATION? And likewise when GOVERNMENT_AGENCY is loaded in JOB3_Load_GOVERNMENT_AGENCY?
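One way this could work, sketched logically in Python/sqlite3 (names and sample IDs are hypothetical): rather than remembering that an ID range was new, the next job re-joins its own input extract back to CUSTOMER on the business key (Current_Name here) and picks up whatever surrogate keys JOB1 assigned:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CUSTOMER (customer_id INTEGER PRIMARY KEY, current_name TEXT UNIQUE)")
# Pretend JOB1 already upserted these rows and assigned IDs 100 and 101.
con.executemany("INSERT INTO CUSTOMER VALUES (?, ?)",
                [(100, "Dept of Roads"), (101, "City Water Board")])

# JOB2 stages the same input extract it was given ...
con.execute("CREATE TEMP TABLE extract (current_name TEXT)")
con.executemany("INSERT INTO extract VALUES (?)",
                [("Dept of Roads",), ("City Water Board",)])

# ... and resolves the surrogate keys by joining on the business key.
ids_for_job2 = con.execute("""
    SELECT c.customer_id
    FROM extract e
    JOIN CUSTOMER c ON c.current_name = e.current_name
    ORDER BY c.customer_id
""").fetchall()
```

In DIS this join would typically be a Lookup or SQL Join transformation at the start of JOB2, so no state needs to be passed between jobs.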

 

Or...  should all three tables (CUSTOMER, ORGANIZATION, GOVERNMENT_AGENCY) be loaded in a single job? How would that work with rollback on any error? 

 

Patrick
Opal | Level 21

@JJ_Murrin 

When deciding how to implement things technically, details really matter.

For example: "Customer_ID can be assumed that an IDENTITY setting on the table"

1. I take this to mean your source table is in SAS but your target table is in a database (which one?)

2. The ID is maintained within the database (a column of type Identity or something functionally similar)

    -> the ETL does not load it. It must be excluded from the SAS table metadata, as otherwise DIS will generate code for this column.

 

On a logical level - and you really need to spell this out logically before diving into the technical translation:

1. CUSTOMER

Business Key: ? - It can't be Name, as that's the column you update, and it can't be customer_id, as that's the generated key

Load technique: Upsert

2. INDIVIDUAL & ORGANIZATION:

Business Key: ? - the one from Customer plus Tax_ID?

Load technique: SCD2 - Insert for new business keys, Update and Insert for existing business keys but changed values for columns under type 2 change tracking

3. Business, Government_Agency, ....

Not enough information provided yet. The only column shown is Customer_ID as an FK. This column can be in the table for query convenience, but logically it doesn't need to be; the column that needs to exist as an FK is the PK from ORGANIZATION.

Also: Aren't there any type 2 columns in this table? The model as posted doesn't tell us enough.
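The SCD2 load technique described under point 2 can be sketched logically like this (Python/sqlite3; treating customer_id alone as the business key and tax_id as the type-2 tracked column is a simplifying assumption for the example - it is not what the model prescribes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE ORGANIZATION (
    organisation_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- generated key
    customer_id     INTEGER,   -- treated as the business key (assumption)
    tax_id          TEXT,      -- treated as a type-2 tracked column (assumption)
    current_flag    INTEGER
)""")

def scd2_load(rows):
    """Insert new business keys; expire-and-insert when a tracked column changed."""
    for customer_id, tax_id in rows:
        cur = con.execute(
            "SELECT organisation_sk, tax_id FROM ORGANIZATION "
            "WHERE customer_id = ? AND current_flag = 1", (customer_id,)).fetchone()
        if cur is None:
            # New business key -> plain insert of the first version.
            con.execute("INSERT INTO ORGANIZATION (customer_id, tax_id, current_flag) "
                        "VALUES (?, ?, 1)", (customer_id, tax_id))
        elif cur[1] != tax_id:
            # Changed tracked value -> close the old version, insert the new one.
            con.execute("UPDATE ORGANIZATION SET current_flag = 0 "
                        "WHERE organisation_sk = ?", (cur[0],))
            con.execute("INSERT INTO ORGANIZATION (customer_id, tax_id, current_flag) "
                        "VALUES (?, ?, 1)", (customer_id, tax_id))
        # Unchanged rows: do nothing.

scd2_load([(100, "11-1111111")])   # first load: insert
scd2_load([(100, "11-1111111")])   # re-load unchanged: no new version
scd2_load([(100, "99-9999999")])   # changed: old row expired, new version inserted
```

In DIS the SCD Type 2 Loader transformation implements this pattern (usually with valid-from/valid-to dates rather than just a current flag) and generates organisation_sk for you.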

 

So... please "fix" the model and provide sufficient information at this level before trying to dive deeper.

 

"How would that work with rollback on any error?" That very much depends on what you mean. Rollback on a table level, or for a full "transaction" - i.e., rollback of all the data loaded into all tables when the load fails for BUSINESS for a new customer? What needs to be done, and how, also depends on how you get the data from upstream (= interface spec).
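The full-"transaction" variant can be sketched in plain database terms (Python/sqlite3 here; a real DBMS would use its own transaction control, and the table names are just from the example): all table loads run inside one transaction, so a failure in any child load also undoes the parent insert.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE CUSTOMER (customer_id INTEGER PRIMARY KEY, current_name TEXT);
CREATE TABLE ORGANIZATION (customer_id INTEGER, tax_id TEXT NOT NULL);
""")

try:
    with con:  # sqlite3: commits on success, rolls back on exception
        con.execute("INSERT INTO CUSTOMER VALUES (100, 'Dept of Roads')")
        # Simulated failure in the child load (NOT NULL violation) ...
        con.execute("INSERT INTO ORGANIZATION VALUES (100, NULL)")
except sqlite3.IntegrityError:
    pass  # ... leaves CUSTOMER empty too: the whole unit was rolled back
```

Whether DIS can do this depends on the target database and on keeping all the loads inside one job (and one connection); across separate jobs you would need a compensating "undo" step instead.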


Discussion stats
  • 3 replies
  • 633 views
  • 1 like
  • 2 in conversation