Solved: Re: Problem with scd2 loader in DIS 4.3 - Page 2

JohnHeaton · Posted 11-12-2013 09:14 PM

The SCDII loader does support multiple business keys. The business key is designed to match records which are the same. The detect changes is designed to determine if the attributes have changed and if so create a new record for the same business key. When reading data from a database unless you use the order by syntax for SQL you are not guaranteed to retrieve the data in any particular order. It will depend where the database starts to read the data from the disk. So if you have multiple changes to data over a day, you need to specific an order by or sort to ensure the information is retrieve in the order the changes were applied.

Scenario

I have a customer C1, they have an address. We build a dimension to track customer address changes.

In a day they may be multiple updated to the customer information. The data you send to the data warehouse may have all the change records. In this scenario you business key which is the customer key in the source will be duplicated. The change columns of address will have different values. When extracting the data if you order the data on change date then you will get the chronological order records have changed.

Setting the Option Allow Multiple updates per day will allow the SCDII to process all the change records. In this way you will receive multiple records within a load which have been end dated and are not the current record. If this is not set then then the SCDII will only load the initial update and ignore the rest of the updates.

A lot of our data sets have multiple changes within a load period and thr SCDII supports it. We even use the SCDII transform for SCDI dimensions as if we ever wish to change we can reconfigure the transform without rewriting the Job.

Hope this helps

Cheers

John

Patrick · Posted 11-13-2013 01:00 AM

What you describe works BUT this is undocumented behaviour and you have therefore zero guarantee that it won't change in the future.

The option "Allow Multiple updates per day" is meant for multiple intraday loads using DATE instead of DATETIME values. With date values you can't really expire more than one record with the same business key more than once a day. This option is in my understanding mainly there to deal nicely with cases where there are multiple rows with the same business key in source. If you examine the code generated you will see that only the first row of these multiples will get used.

I also believe that there is an issue if you would run several loads with the same business key and valid_from with date values (not datetime).

I understand what you're doing and I can see in the generated code why this is working. My point is: It is nowhere mentioned in the documentation that the SCD 2 loader is designed for loading multiple rows with the same business key in one load. The loader is designed for dealing nicely with such cases but it doesn't give you any option to add additional sort columns to the business key.

What you will find in the generated code are things like "set work.etls_source; by <business key>;" What you would need to find in the code to be certain is "by <business key> <additional sort variables>;" - and for this you would need to be able to define these additional sort variables somewhere.

I've used DIS4.6 to generate the code and found even a Proc Sort - even though I selected "no" in the option for sorting the source data set.

For what you're doing: Make sure that option SORTEQUALS is set. SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition

Question is: Should you use an undocumented behaviour in a production implementation?

JohnHeaton · Posted 11-13-2013 01:31 AM

Patrick

Thanks for the information. That is interesting. I never really look at the code as I do not has a SAS programming background. It seems we are getting around this behaviour using the extract transform and extracting the data sorted by the business key and the datetime column. We are also not using a source column as an input to the start and end date of the record in the dimension. These WHO columns which are date are maintained and managed by DIS 4.5 in our case.

Then selecting create a view on the Extract physical properties. A PROC SQL statement is created which is ordered correctly ensuring the data into the SCD II transform is in the correct order.

create view work.W217J6G as

select

Agency_Resign_Date,

Business_Key,

Chnl_CD,

Chnl_LDesc,

from &SYSLAST

order by

Business_Key,

Agency_Resign_Date

;

The code seems stable. With regards to using undocumented features from release to release, I guess it depends on the feature and how it affects the process. We have isolated our SCD II Jobs in our ETL process to very simple jobs, i.e data is staged before the SCDII Job and then a wholesale replace of our dimension after the job. This way if the functionality changes it is very isolated and once fixed all functionality down stream will work.

In situations like this I do not see much different between undocumented (not specifically stated features) capabilities and potential new features you would want to adopt in a new release. Both will cause some rework and refinement of your ETL process.

Cheers John

Patrick · Posted 11-13-2013 04:25 AM

Hi John

I wouldn't "jump" that much if this would just be an undocumented feature in the sense of a parameter or switch which is not documented. In your case it's more about what code the SCD 2 transformation generates which allows for your approach to work.

I actually don't believe that the code generation will change in the near future in a way that your approach won't work anymore. So it's more of a principal question about which I'm having a very strong opinion:

If something is not documented you can't use it for a production implementation. If you're using it anyway then you can't expect the vendor to support you should it fail in the future, so you're adding risk to what you're implementing. You need to clearly identify this risk and to communicate it to the application owner. If you're not doing this and something fails then I would blame the application designer.

I totally understand the business reason for what you're doing and it's also much better from a performance perspective than splitting the source data set and loading it in several batches. Looking at the code generated I also believe it wouldn't be too hard to give us an option where we can define additional sort variables for loading duplicate business keys in one go. I'll create an idea for this. Vote for it if you agree.

Cheers

Patrick

Re: Problem with scd2 loader in DIS 4.3

Re: Problem with scd2 loader in DIS 4.3

Re: Problem with scd2 loader in DIS 4.3

Re: Problem with scd2 loader in DIS 4.3

The 2025 SAS Hackathon has begun!