Hi -
I am matching cases in my dataset based on the approximate similarity in participants' first and last names. I did the matching, but I am not sure how to add a grouping variable that would indicate a set of matched cases (i.e., cases from the same participants).
Copy_1 and Copy_2 are the same data sets with participants' ID, first names and last names. In Copy_2, variables have a "c" suffix.
proc sql;
create table dataset as
select copy_1.* , copy_2.*
from copy_1, copy_2
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c ;
quit;
Thank you for your help!
Oh, OK, I think I get what you want.
Results that look something like this, yes? Code is below. See comments in code.
I think 30 is a little on the tight side. You're not going to get much in the way of matching with a score of 30. I see good, relatively conservative matching working well at 105. You get more matches at 115 although it's questionable whether or not Marie Smith is really the same person as Mary Smith. I set things at 115 for the screen shot above, but play with it using your real data and use your best judgment. The macro variable Minimum_GED controls the minimum acceptable GED score.
Jim
%LET Minimum_GED = 115;
**------------------------------------------------------------------------------**;
** Read in copy 1 of the data. **;
DATA COPY_1;
LENGTH ID $3
Student_First $25
Student_Last $25
;
INPUT ID $
Student_First $
Student_Last $
;
DATALINES;
100 Mary Smith
101 Mary Smit
102 Peter Gray
103 Marie Smith
104 Pete Grey
;
RUN;
**------------------------------------------------------------------------------**;
** Read in copy 2 of the data. **;
DATA COPY_2;
LENGTH ID_c $3
Student_First_c $25
Student_Last_c $25
;
INPUT ID_c $
Student_First_c $
Student_Last_c $
;
DATALINES;
100 Mary Smith
101 Mary Smit
102 Peter Gray
103 Marie Smith
104 Pete Grey
;
RUN;
**------------------------------------------------------------------------------**;
** Identify all GED scores within a minimum distance of each other. **;
proc sql NOPRINT;
create table dataset as
select copy_1.*
,copy_2.*
,(compged(student_first , student_first_c)) AS First_Score
,(compged(student_last , student_last_c)) AS Last_Score
from copy_1, copy_2
where (ID ^= ID_c)
and (compged(student_last , student_last_c) < &Minimum_GED)
and (compged(student_first , student_first_c) < &Minimum_GED)
;
quit;
**------------------------------------------------------------------------------**;
** Remove duplicates and group associated names together. **;
proc sql NOPRINT;
Create Table DeDup_Dataset AS
select Student_Last
,Student_First
,ID
from Dataset
UNION
select Student_Last_c AS Student_Last
,Student_First_c AS Student_First
,ID_c AS ID
from Dataset
ORDER BY Student_Last
,Student_First
,ID
;
quit;
**------------------------------------------------------------------------------**;
** Assign person ID numbers. **;
DATA Dataset_With_Person_ID;
DROP _:;
SET DeDup_Dataset;
LENGTH _Prev_Last $25;
LENGTH _Prev_First $25;
_Prev_Last = LAG(Student_Last);
_Prev_First = LAG(Student_First);
IF COMPGED(_Prev_Last, Student_Last) > &Minimum_GED OR
COMPGED(_Prev_First, Student_First) > &Minimum_GED THEN
Person_ID + 1;
RUN;
**------------------------------------------------------------------------------**;
I would think you would want to change the continuous variable that would be created by COMPGED into a categorical variable by using Proc Format and then do a GROUP BY that new categorical variable.
Actually, I take that back. A little. 🙂 CompGED isn't continuous if I'm understanding correctly. I see discrete values 0, 10, and 20 that would fit your criteria. Therefore, that's something you can group by right there without modification. If you add your two CompGED results, I believe you'd have 0, 10, 20, 30, and 40 as possible values. With that you could try some SQL along these lines (not tested since I don't have any data):
proc sql NOPRINT;
create table dataset as
select copy_1.*
,copy_2.*
,(compged(student_first , student_first_c) + (compged(student_last , student_last_c)) AS combined_score
from copy_1, copy_2
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c
group by CALCULATED combined_score
;
quit;
Jim
Sample CompGED values:
Seems like all you need is sorting
proc sql;
create table dataset as
select copy_1.* , copy_2.*
from copy_1, copy_2
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c
order by ID, ID_c;
quit;
Thank you, both, for your replies, but I think I need something a little bit different.
So I have data like:
ID Student_first Student_last
100 Mary Smith
101 Mary Smit
102 Peter Gray
103 Marie Smith
104 Pete Grey
With the matching code, I am able to identify that cases 100, 101, and 103 belong to one person, and cases 102 and 104 belong to another person.
The matching table looks something like:
ID Student_first Student_last ID_c Student_first_c Student_last_c
100 Mary Smith 101 Mary Smit
100 Mary Smith 103 Marie Smith
101 Mary Smit 103 Marie Smith
102 Peter Gray 104 Pete Grey
104 Pete Grey 102 Peter Gray
I want to be able to include unique person IDs, something like:
ID Student_first Student_last ID_c Student_first_c Student_last_c Person_ID
100 Mary Smith 101 Mary Smit 1
100 Mary Smith 103 Marie Smith 1
101 Mary Smit 103 Marie Smith 1
102 Peter Gray 104 Pete Grey 2
104 Pete Grey 102 Peter Gray 2
The end goal is to find cases from the same people in the original dataset and assign person-specific IDs. Like:
ID Student_first Student_last Person_ID
100 Mary Smith 1
101 Mary Smit 1
102 Peter Gray 2
103 Marie Smith 1
104 Pete Grey 2
When names are exactly the same from the same person across entries, I just used the following code:
proc sort data = dataset;
by student_first student_last;
run;
data dataset;
set dataset;
by student_first student_last;
if first.student_last then Person_ID + 1;
run;
But with this dataset, the complexity is that the names of the same people were not entered in exactly the same way.
Thanks again for your help!
Oh, OK, I think I get what you want.
Results that look something like this, yes? Code is below. See comments in code.
I think 30 is a little on the tight side. You're not going to get much in the way of matching with a score of 30. I see good, relatively conservative matching working well at 105. You get more matches at 115 although it's questionable whether or not Marie Smith is really the same person as Mary Smith. I set things at 115 for the screen shot above, but play with it using your real data and use your best judgment. The macro variable Minimum_GED controls the minimum acceptable GED score.
Jim
%LET Minimum_GED = 115;
**------------------------------------------------------------------------------**;
** Read in copy 1 of the data. **;
DATA COPY_1;
LENGTH ID $3
Student_First $25
Student_Last $25
;
INPUT ID $
Student_First $
Student_Last $
;
DATALINES;
100 Mary Smith
101 Mary Smit
102 Peter Gray
103 Marie Smith
104 Pete Grey
;
RUN;
**------------------------------------------------------------------------------**;
** Read in copy 2 of the data. **;
DATA COPY_2;
LENGTH ID_c $3
Student_First_c $25
Student_Last_c $25
;
INPUT ID_c $
Student_First_c $
Student_Last_c $
;
DATALINES;
100 Mary Smith
101 Mary Smit
102 Peter Gray
103 Marie Smith
104 Pete Grey
;
RUN;
**------------------------------------------------------------------------------**;
** Identify all GED scores within a minimum distance of each other. **;
proc sql NOPRINT;
create table dataset as
select copy_1.*
,copy_2.*
,(compged(student_first , student_first_c)) AS First_Score
,(compged(student_last , student_last_c)) AS Last_Score
from copy_1, copy_2
where (ID ^= ID_c)
and (compged(student_last , student_last_c) < &Minimum_GED)
and (compged(student_first , student_first_c) < &Minimum_GED)
;
quit;
**------------------------------------------------------------------------------**;
** Remove duplicates and group associated names together. **;
proc sql NOPRINT;
Create Table DeDup_Dataset AS
select Student_Last
,Student_First
,ID
from Dataset
UNION
select Student_Last_c AS Student_Last
,Student_First_c AS Student_First
,ID_c AS ID
from Dataset
ORDER BY Student_Last
,Student_First
,ID
;
quit;
**------------------------------------------------------------------------------**;
** Assign person ID numbers. **;
DATA Dataset_With_Person_ID;
DROP _:;
SET DeDup_Dataset;
LENGTH _Prev_Last $25;
LENGTH _Prev_First $25;
_Prev_Last = LAG(Student_Last);
_Prev_First = LAG(Student_First);
IF COMPGED(_Prev_Last, Student_Last) > &Minimum_GED OR
COMPGED(_Prev_First, Student_First) > &Minimum_GED THEN
Person_ID + 1;
RUN;
**------------------------------------------------------------------------------**;
Hi, @Amanda_Lemon,
Oh, good. I was hoping that it was. You probably could tune it significantly more in terms of both performance and match quality, but I don't know how big your data sets are or how strict your matching needs to be.
I thought this SAS paper on using CompGED in conjunction with Spedis was interesting:
https://support.sas.com/resources/papers/proceedings14/1674-2014.pdf
@ballardw's suggestion sounds like it has a great deal of merit as well.
Good luck!
Jim
Try https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm for free software that will give a probability of match. You can ignore the cancer related documentation but this matches on names, addresses, dates and other fields that might contain matching data and will give a probability of match for not-exact matched cases.
I know it isn't SAS but the price is right, none, and has features that will take a lot of SAS programming duplicate.
You will need to provide text files for input but you should be able to make them easily from SAS.
You don't even have to have the same variable names/ column headers as the application lets you pick what to compare between files.
I used this to do a data quality report when a data provider "moved" data to a new data system without retaining unique identifiers so I could use the data for reporting across the conversion. And to identify the records where people's names, birth dates, gender, race and ethnicity changed in the process.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.