Solved: Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader hand...

Quentin · Posted 10-21-2016 03:57 PM

Hi All,

I haven't really played with DI studio transformations. In reading the docs, it's not clear to me how the Table Loader, and related SCD Type 1 and 2 Loaders handle the deletion of records from the source data. Should I expect any/all of them to handle deleted records (perhaps only if certain sub-options are selected)? By "handle", I mean if a record was deleted from the source data, I would expect it to be deleted from the target data (I suppose for Type 2 loader, it would be a logical delete using one of the variables that flags a record as archived).

More generally, I'd appreciate advice on the following situation. I have large table in a SQL Server database (say 50M records, 20 variables) that is far away from my SAS server. Every day there are a relatively small number of new records inserted into the database, say 10,000 records. On rare occasions there are records that are updated or deleted. My goal is to make a SAS dataset which mirrors the SQL server table, and update the SAS dataset nightly so that it stays in synch. That way during the day my analytic work etc can run off the SAS dataset, and I only hit the database once at night.

Which transformation in DI studio would you use to support this sort of incremental updates (lots of inserts, rarely a few updates or deletions).

In the past with smaller data, I have often just pulled the entire table every night, even though <1% of the data is new. It often didn't feel worth it to deal with identifying changes and loading just them.

But in reading a bit about the SCD stuff, it seems potentially useful. It sounds like in theory, it could do something like:

Connect to the source SQL server table, run MD5() on the non-key portion of every record, and return just the key and the hashed value to SAS. I'm assuming this works as a pass-through query, so only the key and the hashed value gets returned to SAS, otherwise I'm still pulling 50M records and 20 variables into SAS.
Compare the hashed values from the source data to the hashed values that were previously generated from my target SAS dataset to identify differences (inserts, updates, and deletions).
Query the source SQL table to pull the full records (i.e. all variables) for the reords that need to be inserted or updated.
Apply the inserts, updates, and deletions.

I know I could try to figure this out by clicking through lots and lots of DI studio options and reviewing the generated code, but would appreciate any thoughts to help guide me.

I'm open to non-DI, coding suggestions as well. I suppose it should be feasible to code the first step above as an explicit pass-through query, and the rest seems is pretty straight forward.

LinusH · Posted 10-21-2016 05:28 PM

Thansk for the kind words @SASKiwi.

@Quentin, not sure what your total requirement is.

What the SCD loader gives you in addition to the other "standard" table loader is mainly surrogate key management and record validation intervals (Type 2).

SCD Type 1 Loader should usually be avoided, since it has very poor performance.

What about deletions, do you want them deleted, marked as deleted, or to have an end date?

The Type 2 Loader can find missing data in the source to find deletion candidates, but that requires all records to be processed every time - perhaps not ideal for you. If used, the existing records will be end dated by using the system time (not the date/time you have defined for the regular date validation interval....)

The Table Loader don't have built in functionality for this. So I would recommend that you identify/track deletions in the source, and handle them separately.

If you just simply want to update a table without historizing and stuff, the Table Loader might be sufficient for you.

Your guess about the update process is somewhat correct, but not totally.

SAS wants to build a hash on all current records (in the Type 2 scenario), therefore, all current records will be processed by the SAS session. An option to the Type 2 Loader lets you keep a permanent version of the current record hashes, which sounds like an option you would be interested in to explore (my guess that in your situation you want to keep that "local" in SAS).

There are some portions of the logic that could be executed by SQL pass through, but I haven't used that much so I can't give you any real good explanation how it could apply to your situation.

If you get access to DI Studio , build a simple job, explore the different options, and look at the code that is being generated.

Data never sleeps

View solution in original post

SASKiwi · Posted 10-21-2016 04:55 PM

Have you read Linus's excellent library series on this topic? If you haven't then its well worth it:

https://communities.sas.com/t5/SAS-Communities-Library/SAS-Data-Integration-Studio-Capability-Test-S...

Quentin · Posted 10-21-2016 05:28 PM

Thanks @SASKiwi that was my introduction to the concepts (came up as one of the first hits when I googled it just a few hours ago) and agree, it's excellent. But it's still not obvious to me how it's intended to handle deletions.

I suppose since SCD is designed for to dimension tables rather than fact tables, there might not be a need to delete records from a dimension table, so may not be built in.

In my case, what I really have is a slowly changing fact table.

LinusH · Posted 10-21-2016 05:39 PM

Correct, I didn't pay much regard to deletions in that series. It was walk through of the different SCD types on as a high level aplication.

But deletions is an interesting subject as well, perhaps a matter for a future post...?

Data never sleeps

LinusH · Posted 10-21-2016 05:28 PM