Re: General Key Cleanup After Merging Records

jwhite · Posted 01-21-2016 03:28 PM

I've got what's likely a basic question.

I've inherited a warehouse that has multiple records on a fact table that represent the same individual. For example, they were originally loaded with an Id number of 12345 and assigned a key that we shall call 333. A year or two later the same individual was loaded again but a different Id number was given (12354) so when loaded they were thought to be a new individual thus assigned a new foreign key (444).

Now a few months ago we were sent a new dataset telling us that ID 12345 and 12354 should be the same, so the two Ids were merged and now the fact table has the following two rows for this individual:

Key ID

333 12354

444 12354

That part was easy enough. Trick is the keys are used on various tables. Now, some tables will link to ID 12354 via key 333 and other tables from 444. Obviously, this isn't good, especially since several Ids were incorrectly loaded originally..

Is there a better answer than deciding on one key to use, combing through every fact table we have, checking to see what key is used, then assigning it the correct key of the so that all tables use the same key for an ID, then close out either 333 or 444?

Does that make any sense? Seems to be the way to go, but am curious if anyone has had to deal with anything similar.

Thanks, all!

ballardw · Posted 01-21-2016 03:54 PM

When faced with similar issues my first step has been to make sure I don't have:

The problematic key associated with yet another Id

Either of Id associated with yet another Key value.

Since about 1/3 of the time one of the two above conditions fails much fun ensues, but boils down to the following:

Backup everything so you can restart if needed.

Document what changes need to be made to make all of the data consistent

Get approval for the changes

Change it

Test and verify the changes worked everywhere

If there is a source file(s) that if re-read would reintroduce the issue, modify the corresponding records there as well.

But I've never had to deal with data with very many of these types of records, only a few dozen or so.

LinusH · Posted 01-21-2016 03:55 PM

@jwhite, you are usually putting forward interesting questions, I have to give you that.
Since this is kinda architectural issue and I don't know the full nature of your environment I don't want to tell what you should do. I'll try to describe what would do and let us see if that can be applied to your situation.

In the detail layer of the DW (non dimensional is my preference) I would accept both keys. And create separate surrogate keys.

The information about matching/merging of keys will be stored in a separate relational table which point from the new key to the original key. If you are using fats vault terminology this is called a same-as-link table.

The actual merge is done when loading the star schemas were you need a logic which key to use. If you are using surrogate keys I would prefer to use the original key. Which business key to use is a business requirement issue. In the case of a type 2 dimension you could show the change of key over time (given that the key are being replaced).

Hopes this makes an sense 🙂

Data never sleeps

jwhite · Posted 01-21-2016 04:29 PM

@LinusH, ha! Well, I wish they weren't as interesting as they are, but I'm happy to give the experts here something to think about. 🙂

@ballardw, I'm _always_ backing up...

I think I'm liking the idea of the link table. As it is, the way this data is used two tables are pretty much always joined anyway, so it would be easier for our reporting team to just use the one table, instead of having to consistently join two tables first.

In doing so, this link table would have both keys in it as two separate columns? Would our report staff need to check each key column when they join or lookup data on this table?

Thanks!

SASKiwi · Posted 01-23-2016 06:15 PM

I agree it's an interesting problem, as what you've described is the very common duplicate customer issue. In my experience most IT systems that deal with customer data are prone to having duplicate customers. Often they come about because of inadequate checking of existing customers when adding a new one. Where I am working now though they are also created deliberately because of system restrictions!

I also agree that a link table or what I would call a duplicate customer table is the way to go. As a general rule it is not good practice to change historical fact table data with corrections found later. The reason for this is if you repeat reports for that same period you will get different answers. A data warehouse should give you repeatable results for the same point in time no matter when you run them.There are exceptions to this of course perhaps where the errors are so large they need to be restated.

LinusH · Posted 01-21-2016 04:37 PM

Yes as separate columns.
But this construct might feel a bit complex so the idea is to keep this as a reference table behind the scene and use it only for the data mart load. Report users shouldn't be forced to consider this kind of logic. But of course, they could if you chose to let them.

Data never sleeps

General Key Cleanup After Merging Records