So I'm working with two datasets of patient encounters and I've unduplicated and linked them together. I'd like to now validate my linkage based on social security number. Each observation is structured to have a unique id, a duplicate id, a social security number, and then a bunch of other variables. The unique id is an id for each observation, so a dataset of 100 observations has 100 unique ids, even if a patient has multiple observations. The duplicate id came from using de-duplication software and identifies observations that are likely the same individual, so a dataset of 100 observations might have 50 duplicate ids if each patient had two encounters. And finally social security number is just social security number. Theoretically, if the de-duplication was done correctly every duplicate id would have the same ssn.
Sample data structure:
Uniqueid Duplicateid SSN
1 1 123
2 2 456
3 2 456
4 3 789
5 3 432
6 4 432
So there are 6 encounters here, my de-duplication software identified 4 different patients, and each observation has a SSN. So I'd like to get a percentage of: does every duplicate id with value of "2" have the same value for SSN. This would be as opposed to duplicate id "3" which has two different SSN, indicating that the linkage software made an incorrect link, or duplicate id "3" and "4" which the de-duplication software identified as two different patients but they actually are the same person with same SSN.
Appreciate the help!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.