01-21-2016 03:28 PM
I've got what's likely a basic question.
I've inherited a warehouse that has multiple records on a fact table that represent the same individual. For example, they were originally loaded with an Id number of 12345 and assigned a key that we shall call 333. A year or two later the same individual was loaded again but a different Id number was given (12354) so when loaded they were thought to be a new individual thus assigned a new foreign key (444).
Now a few months ago we were sent a new dataset telling us that ID 12345 and 12354 should be the same, so the two Ids were merged and now the fact table has the following two rows for this individual:
That part was easy enough. Trick is the keys are used on various tables. Now, some tables will link to ID 12354 via key 333 and other tables from 444. Obviously, this isn't good, especially since several Ids were incorrectly loaded originally..
Is there a better answer than deciding on one key to use, combing through every fact table we have, checking to see what key is used, then assigning it the correct key of the so that all tables use the same key for an ID, then close out either 333 or 444?
Does that make any sense? Seems to be the way to go, but am curious if anyone has had to deal with anything similar.
01-21-2016 03:54 PM
When faced with similar issues my first step has been to make sure I don't have:
The problematic key associated with yet another Id
Either of Id associated with yet another Key value.
Since about 1/3 of the time one of the two above conditions fails much fun ensues, but boils down to the following:
Backup everything so you can restart if needed.
Document what changes need to be made to make all of the data consistent
Get approval for the changes
Test and verify the changes worked everywhere
If there is a source file(s) that if re-read would reintroduce the issue, modify the corresponding records there as well.
But I've never had to deal with data with very many of these types of records, only a few dozen or so.
01-21-2016 03:55 PM
01-21-2016 04:29 PM
@LinusH, ha! Well, I wish they weren't as interesting as they are, but I'm happy to give the experts here something to think about.
@ballardw, I'm _always_ backing up...
I think I'm liking the idea of the link table. As it is, the way this data is used two tables are pretty much always joined anyway, so it would be easier for our reporting team to just use the one table, instead of having to consistently join two tables first.
In doing so, this link table would have both keys in it as two separate columns? Would our report staff need to check each key column when they join or lookup data on this table?
01-23-2016 06:15 PM
I agree it's an interesting problem, as what you've described is the very common duplicate customer issue. In my experience most IT systems that deal with customer data are prone to having duplicate customers. Often they come about because of inadequate checking of existing customers when adding a new one. Where I am working now though they are also created deliberately because of system restrictions!
I also agree that a link table or what I would call a duplicate customer table is the way to go. As a general rule it is not good practice to change historical fact table data with corrections found later. The reason for this is if you repeat reports for that same period you will get different answers. A data warehouse should give you repeatable results for the same point in time no matter when you run them.There are exceptions to this of course perhaps where the errors are so large they need to be restated.
01-21-2016 04:37 PM