SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Accepted Solution Solved
Reply
PROC Star
Posts: 1,233
Accepted Solution

How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Hi All,

 

I haven't really played with DI studio transformations.  In reading the docs, it's not clear to me how the Table Loader, and related SCD Type 1 and 2 Loaders handle the deletion of records from the source data.  Should I expect any/all of them to handle deleted records (perhaps only if certain sub-options are selected)?  By "handle", I mean if a record was deleted from the source data, I would expect it to be deleted from the target data (I suppose for Type 2 loader, it would be a logical delete using one of the variables that flags a record as archived).

 

More generally, I'd appreciate advice on the following situation.  I have large table in a SQL Server database (say 50M records, 20 variables) that is far away from my SAS server.  Every day there are a relatively small number of new records inserted into the database, say 10,000 records.  On rare occasions there are records that are updated or deleted.  My goal is to make a SAS dataset which mirrors the SQL server table, and update the SAS dataset nightly so that it stays in synch.  That way during the day my analytic work etc can run off the SAS dataset, and I only hit the database once at night.

 

Which transformation in DI studio would you use to support this sort of incremental updates (lots of inserts, rarely a few updates or deletions).

 

In the past with smaller data, I have often just pulled the entire table every night, even though <1% of the data is new.  It often didn't feel worth it to deal with identifying changes and loading just them.

 

But in reading a bit about the SCD stuff, it seems potentially useful.  It sounds like in theory, it could do something like:

  1. Connect to the source SQL server table, run MD5() on the non-key portion of every record, and return just the key and the hashed value to SAS.  I'm assuming this works as a pass-through query, so only the key and the hashed value gets returned to SAS, otherwise I'm still pulling 50M records and 20 variables into SAS.
  2. Compare the hashed values from the source data to the hashed values that were previously generated from my target SAS dataset to identify differences (inserts, updates, and deletions).
  3. Query the source SQL table to pull the full records (i.e. all variables) for the reords that need to be inserted or updated.
  4. Apply the inserts, updates, and deletions.

 

I know I could try to figure this out by clicking through lots and lots of DI studio options and reviewing the generated code, but would appreciate any thoughts to help guide me. 

 

I'm open to non-DI, coding suggestions as well.    I suppose it should be feasible to code the first step above as an explicit pass-through query, and the rest seems is pretty straight forward.

 


Accepted Solutions
Solution
‎10-26-2016 09:11 AM
Super User
Posts: 5,257

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Thansk for the kind words @SASKiwi.

 

@Quentin, not sure what your total requirement is.

What the SCD loader gives you in addition to the other "standard" table loader is mainly surrogate key management and record validation intervals (Type 2).

SCD Type 1 Loader should usually be avoided, since it has very poor performance.

 

What about deletions, do you want them deleted, marked as deleted, or to have an end date?

The Type 2 Loader can find missing data in the source to find deletion candidates, but that requires all records to be processed every time - perhaps not ideal for you. If used, the existing records will be end dated by using the system time (not the date/time you have defined for the regular date validation interval....)

The Table Loader don't have built in functionality for this. So I would recommend that you identify/track deletions in the source, and handle them separately.

 

If you just simply want to update a table without historizing and stuff, the Table Loader might be sufficient for you.

 

Your guess about the update process is somewhat correct, but not totally.

SAS wants to build a hash on all current records (in the Type 2 scenario), therefore, all current records will be processed by the SAS session. An option to the Type 2 Loader lets you keep a permanent version of the current record hashes, which sounds like an option you would be interested in to explore (my guess that in your situation you want to keep that "local" in SAS).

There are some portions of the logic that could be executed by SQL pass through, but I haven't used that much so I can't give you any real good explanation how it could apply to your situation.

 

If you get access to DI Studio , build a simple job, explore the different options, and look at the code that is being generated.

Data never sleeps

View solution in original post


All Replies
Super User
Posts: 3,112

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Have you read Linus's excellent library series on this topic? If you haven't then its well worth it:

 

https://communities.sas.com/t5/SAS-Communities-Library/SAS-Data-Integration-Studio-Capability-Test-S...

 

 

PROC Star
Posts: 1,233

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Thanks @SASKiwi that was my introduction to the concepts (came up as one of the first hits when I googled it just a few hours ago)  and agree, it's excellent.  But it's still not obvious to me how it's intended to handle deletions. 

 

I suppose since SCD is designed for to dimension tables rather than fact tables, there might not be a need to delete records from a dimension table, so may not be built in.

 

In my case, what I really have is a slowly changing fact table.

Super User
Posts: 5,257

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Correct, I didn't pay much regard to deletions in that series. It was walk through of the different SCD types on as a high level aplication.

But deletions is an interesting subject as well, perhaps a matter for a future post...?

Data never sleeps
Solution
‎10-26-2016 09:11 AM
Super User
Posts: 5,257

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Thansk for the kind words @SASKiwi.

 

@Quentin, not sure what your total requirement is.

What the SCD loader gives you in addition to the other "standard" table loader is mainly surrogate key management and record validation intervals (Type 2).

SCD Type 1 Loader should usually be avoided, since it has very poor performance.

 

What about deletions, do you want them deleted, marked as deleted, or to have an end date?

The Type 2 Loader can find missing data in the source to find deletion candidates, but that requires all records to be processed every time - perhaps not ideal for you. If used, the existing records will be end dated by using the system time (not the date/time you have defined for the regular date validation interval....)

The Table Loader don't have built in functionality for this. So I would recommend that you identify/track deletions in the source, and handle them separately.

 

If you just simply want to update a table without historizing and stuff, the Table Loader might be sufficient for you.

 

Your guess about the update process is somewhat correct, but not totally.

SAS wants to build a hash on all current records (in the Type 2 scenario), therefore, all current records will be processed by the SAS session. An option to the Type 2 Loader lets you keep a permanent version of the current record hashes, which sounds like an option you would be interested in to explore (my guess that in your situation you want to keep that "local" in SAS).

There are some portions of the logic that could be executed by SQL pass through, but I haven't used that much so I can't give you any real good explanation how it could apply to your situation.

 

If you get access to DI Studio , build a simple job, explore the different options, and look at the code that is being generated.

Data never sleeps
PROC Star
Posts: 1,233

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

Thanks much @LinusH.

 

I'm not really concerned about historizing stuff.  What I really want is an efficient way to update my SAS dataset nightly, without pulling all 50M records and 20 variables from SQL server across a [long, slow] wire into SAS.

 

Do you mean that in the Type 2 scenario, SAS would pull all 50M records and 20 variables from SQL server and then hash them on the SAS side?  If that's the case, then you're right I wouldn't gain exeuction time efficiency.

 

The challenge here is I don't have control over the source database.  So I'm hoping to find a way to identify new records, updated records, and deleted records, that doesn't start with me pulling all the data from SQL server into SAS as a first step.

 

If I could get the MD5() working as explicit pass through, that would at least let me cut down on the size of the records passed to SAS, even if I still passed 50M records.  I would guess if I dug around in SQL server enough there is probably some sort of hidden row that would have insert dates and maybe update dates, but I don't think I would be able to find records of the deletions.

 

Thanks again,

--Q.

Respected Advisor
Posts: 3,896

Re: How do SCD Type 1 Loader, SCD Type 2 Loader, and Table Loader handle deletions in source data?

[ Edited ]

Is there any way how you can work out the Inserts, Updates and Deletes on the database side?

 

There is often a create and or change date column which allows to work out the increment for Inserts and Updates BUT you would need to talk to the people loading the database table in order to get your hands on the deletes (or the DBA; may be there are some audit records you could access).

 

If you can't determine the deletes on the database side then there is no way other than to download at least the full set of primary key columns and then delete all the rows in your SAS table where there is no match to the key columns coming from the DB.

You must also be aware that with SAS tables using SQL Delete, the deletion is only logical and the deleted record remains in the table (taking up table space). The only way to get rid of such logically deleted records is to fully recreate the table.

 

Taking above into consideration I'd investigate if full download of the table isn't an option and then would spend some time in trying to get this done as efficiently as possible (readbuff, multi-treading, ...).

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 922 views
  • 6 likes
  • 4 in conversation