BookmarkSubscribeRSS Feed
sasuser22
Calcite | Level 5

So I'm working with two datasets of patient encounters and I've unduplicated and linked them together. I'd like to now validate my linkage based on social security number. Each observation is structured to have a unique id, a duplicate id, a social security number, and then a bunch of other variables. The unique id is an id for each observation, so a dataset of 100 observations has 100 unique ids, even if a patient has multiple observations. The duplicate id came from using de-duplication software and identifies observations that are likely the same individual, so a dataset of 100 observations might have 50 duplicate ids if each patient had two encounters. And finally social security number is just social security number. Theoretically, if the de-duplication was done correctly every duplicate id would have the same ssn.

 

Sample data structure:

Uniqueid  Duplicateid   SSN

1                1                    123  

2                2                    456

3                2                    456

4                3                    789

5                3                    432

6                4                    432

 

So there are 6 encounters here, my de-duplication software identified 4 different patients, and each observation has a SSN. So I'd like to get a percentage of: does every duplicate id with value of "2" have the same value for SSN. This would be as opposed to duplicate id "3" which has two different SSN, indicating that the linkage software made an incorrect link, or duplicate id "3" and "4" which the de-duplication software identified as two different patients but they actually are the same person with same SSN.

 

Appreciate the help!

1 REPLY 1
Astounding
PROC Star
Here's a simple answer to get things started:

PROC freq data=have;
tables ssn * duplicateid/ noprint out=counts;
run;

PROC freq data=counts noprint:
tables ssn / out=ssnlist;
tables duplicateid/ out=idlist;
run;

proc print data = ssnlist;
where count > 1;
var ssn count;
title 'SSNs having more than 1 ID';
run;
proc print data=idlist;
where count > 1;
var duplicateid count;
title 'IDs having more than 1 SSN';
run;

There's lots of ways to slice and dice the numbers. This is just a reasonable start.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 1 reply
  • 402 views
  • 0 likes
  • 2 in conversation