BookmarkSubscribeRSS Feed
sasuser22
Calcite | Level 5

So I'm working with two datasets of patient encounters and I've unduplicated and linked them together. I'd like to now validate my linkage based on social security number. Each observation is structured to have a unique id, a duplicate id, a social security number, and then a bunch of other variables. The unique id is an id for each observation, so a dataset of 100 observations has 100 unique ids, even if a patient has multiple observations. The duplicate id came from using de-duplication software and identifies observations that are likely the same individual, so a dataset of 100 observations might have 50 duplicate ids if each patient had two encounters. And finally social security number is just social security number. Theoretically, if the de-duplication was done correctly every duplicate id would have the same ssn.

 

Sample data structure:

Uniqueid  Duplicateid   SSN

1                1                    123  

2                2                    456

3                2                    456

4                3                    789

5                3                    432

6                4                    432

 

So there are 6 encounters here, my de-duplication software identified 4 different patients, and each observation has a SSN. So I'd like to get a percentage of: does every duplicate id with value of "2" have the same value for SSN. This would be as opposed to duplicate id "3" which has two different SSN, indicating that the linkage software made an incorrect link, or duplicate id "3" and "4" which the de-duplication software identified as two different patients but they actually are the same person with same SSN.

 

Appreciate the help!

1 REPLY 1
Astounding
PROC Star
Here's a simple answer to get things started:

PROC freq data=have;
tables ssn * duplicateid/ noprint out=counts;
run;

PROC freq data=counts noprint:
tables ssn / out=ssnlist;
tables duplicateid/ out=idlist;
run;

proc print data = ssnlist;
where count > 1;
var ssn count;
title 'SSNs having more than 1 ID';
run;
proc print data=idlist;
where count > 1;
var duplicateid count;
title 'IDs having more than 1 SSN';
run;

There's lots of ways to slice and dice the numbers. This is just a reasonable start.

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 1 reply
  • 562 views
  • 0 likes
  • 2 in conversation