12-08-2013 06:49 PM
There's some fairly well-known code "out there" for processing Slowly Changing Dimension Type 2 (SCD2) updates. Anyone who has used SAS DI Studio would be familiar with this transformation.
In skimming this code (note: I haven't done a full in-depth analysis yet), it appears to me that the same processing could be accomplished using hash objects, resulting in faster performance, especially since it would not require a sort of the target table.
Assume that the transaction data will fit in memory.
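Roughly what I have in mind (just a sketch, and all dataset/variable names like work.trans, work.target, cust_id, and valid_to are made up for illustration):

```sas
/* Sketch: hash-based SCD2 close-out without sorting the target.     */
/* work.trans, work.target, cust_id, valid_to are placeholder names. */
data work.target_new;
   if _n_ = 1 then do;
      /* load the in-memory transactions into a hash keyed on the
         business key; no data portion needed for a key lookup */
      declare hash h(dataset: 'work.trans');
      h.defineKey('cust_id');
      h.defineDone();
   end;
   set work.target;                     /* one sequential pass, no sort */
   if h.check() = 0 and valid_to = '31DEC9999'd then
      valid_to = today() - 1;           /* close out the current version */
   /* appending the new versions from work.trans would be a second step */
run;
```

Using h.check() rather than h.find() avoids overwriting any host variables; it only tests whether the key exists.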
If my supposition is correct, then before I reinvent a perfectly round wheel (i.e. the wheel you're willing to share ;-) ), has anyone done this already and, if so, can you share your code?
12-09-2013 06:48 AM
In the past I have twice had to write my own SCD 2 loader: once when the DIS 3.4 loader wasn't fit for purpose, and once when the client didn't have DIS licensed.
In the second case I based my code on DIS 4.2 generated code. The code I implemented was for a full load from source, expiring all data in the target not present in the source. By stripping the DIS 4.2 generated code down to my specific needs I could get rid of a full pass through the data.
I'm not sure where a sort of the target table is required. You will basically need an XREF table containing the keys and the digest values to determine the records for insert and the ones for update.
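Something along these lines is what I mean by an XREF table (a sketch only; the key and attribute names, and the use of MD5 as the digest, are just for illustration):

```sas
/* Sketch: XREF of business key + digest of the type-2 columns.       */
/* cust_id, name, address, segment, current_flag are made-up names.   */
data work.xref;
   set work.target(where=(current_flag = 1));
   length digest $32;
   digest = put(md5(catx('|', name, address, segment)), $hex32.);
   keep cust_id digest;
run;

/* Classify incoming rows against the XREF via a hash lookup */
data work.inserts work.updates;
   if _n_ = 1 then do;
      if 0 then set work.xref;               /* define host variables */
      declare hash x(dataset: 'work.xref');
      x.defineKey('cust_id');
      x.defineData('digest');
      x.defineDone();
   end;
   set work.source;
   length new_digest $32;
   new_digest = put(md5(catx('|', name, address, segment)), $hex32.);
   if x.find() ne 0 then output work.inserts;            /* new key     */
   else if new_digest ne digest then output work.updates; /* changed row */
   /* equal digests: unchanged, row is not output */
run;
```

The digest comparison means you only touch the type-2 columns once per row, instead of comparing each attribute individually.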
I still haven't understood why a permanent XREF table gets stored in the target environment rather than the source environment. From what I can tell, if it is stored in a database target environment, it's fast for a limited number of changed records but sub-optimal when a lot of records change (because of moving data between the DB and SAS).
So yes: if this is about performance gains, then coding your own SCD 2 loader for exactly your case will likely give you better performance. BUT: it comes at the price of increased implementation effort and decreased flexibility and maintainability.
I'm currently facing a challenge where the DIS SCD 2 loader can't be used. You'll probably see my "request for ideas" early next year.
12-10-2013 08:03 PM
I thought about this a bit more: using a hash object would work when closing out an old record (i.e. h.find() gets a hit), but it wouldn't work for brand-new records; AFAIK I can't retrieve the data from the hash object when there isn't a hit on the key.
Another way of saying this is a hash object join is a "left join", when what I need is a "full outer join" (or a data step merge).
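For what it's worth, one pattern that may get closer to a full outer join is to remove matched keys from the hash as you go and then walk the leftovers with a hash iterator (hiter). A sketch, with made-up dataset and variable names:

```sas
/* Sketch: emulating a full outer join with a hash + hash iterator.   */
/* work.source, work.target, cust_id are illustrative names.          */
data work.matched work.target_only;
   if _n_ = 1 then do;
      if 0 then set work.target;        /* define target host variables */
      declare hash h(dataset: 'work.target');
      declare hiter hi('h');
      h.defineKey('cust_id');
      h.defineData(all: 'yes');
      h.defineDone();
   end;
   set work.source end = last;
   if h.find() = 0 then do;             /* find() copies target values
                                           into the PDV for shared vars */
      output work.matched;
      h.remove();                       /* drop matched keys             */
   end;
   if last then do;                     /* whatever remains is target-only */
      rc = hi.first();
      do while (rc = 0);
         output work.target_only;
         rc = hi.next();
      end;
   end;
   drop rc;
run;
```

Source-only rows (the "new records") fall out naturally as the find() misses, so the left-join limitation only really bites for target-only rows, which the iterator pass covers.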
I'll have to look at other ways of optimizing the performance; perhaps SPDE and its automatic support of BY variables, or using an index instead of a full sort of the entire target table.
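For the index route, something like this is what I'd try first (a sketch; cust_id is a made-up key name):

```sas
/* Sketch: index the target's business key so BY-group processing     */
/* can use the index instead of a full PROC SORT.                     */
proc datasets lib=work nolist;
   modify target;
   index create cust_id;
quit;

/* Subsequent steps can then do BY processing against the target
   without sorting it; work.trans would still need to be sorted
   or indexed on cust_id as well. */
data work.combined;
   merge work.target(in=t) work.trans(in=s);
   by cust_id;
run;
```

Whether the index beats a sort would depend on the change volume, so this would need benchmarking.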
For my purposes the source and target tables are SAS datasets, so I don't have to consider SAS vs. DBMS performance.
From the paucity of replies I assume I'll have to code this myself ;-). I'll look for your post early next year - perhaps we can share ideas.