Validate Dataset Linkage by Comparing 2 Variables

sasuser22 · Posted 04-24-2019 03:47 PM

So I'm working with two datasets of patient encounters and I've unduplicated and linked them together. I'd like to now validate my linkage based on social security number. Each observation is structured to have a unique id, a duplicate id, a social security number, and then a bunch of other variables. The unique id is an id for each observation, so a dataset of 100 observations has 100 unique ids, even if a patient has multiple observations. The duplicate id came from using de-duplication software and identifies observations that are likely the same individual, so a dataset of 100 observations might have 50 duplicate ids if each patient had two encounters. And finally social security number is just social security number. Theoretically, if the de-duplication was done correctly every duplicate id would have the same ssn.

Sample data structure:

Uniqueid Duplicateid SSN

1 1 123

2 2 456

3 2 456

4 3 789

5 3 432

6 4 432

So there are 6 encounters here, my de-duplication software identified 4 different patients, and each observation has a SSN. So I'd like to get a percentage of: does every duplicate id with value of "2" have the same value for SSN. This would be as opposed to duplicate id "3" which has two different SSN, indicating that the linkage software made an incorrect link, or duplicate id "3" and "4" which the de-duplication software identified as two different patients but they actually are the same person with same SSN.

Appreciate the help!

Astounding · Posted 04-24-2019 07:42 PM

Here's a simple answer to get things started:

PROC freq data=have;
tables ssn * duplicateid/ noprint out=counts;
run;

PROC freq data=counts noprint:
tables ssn / out=ssnlist;
tables duplicateid/ out=idlist;
run;

proc print data = ssnlist;
where count > 1;
var ssn count;
title 'SSNs having more than 1 ID';
run;
proc print data=idlist;
where count > 1;
var duplicateid count;
title 'IDs having more than 1 SSN';
run;

There's lots of ways to slice and dice the numbers. This is just a reasonable start.

Validate Dataset Linkage by Comparing 2 Variables

Re: Validate Dataset Linkage by Comparing 2 Variables

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!