BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Amanda_Lemon
Quartz | Level 8

Hi - 

 

I am matching cases in my dataset based on the approximate similarity in participants' first and last names. I did the matching, but I am not sure how to add a grouping variable that would indicate a set of matched cases (i.e., cases from the same participants). 

 

Copy_1 and Copy_2 are the same data sets with participants' ID, first names and last names. In Copy_2, variables have a "c" suffix.   

 

proc sql;
create table dataset as
select copy_1.* , copy_2.* 
from copy_1, copy_2 
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c ; 
quit; 

Thank you for your help!

 

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

@Amanda_Lemon,

 

Oh, OK, I think I get what you want.  

 

Results that look something like this, yes?  Code is below.  See comments in code.

jimbarbour_0-1602217401324.png

 

I think 30 is a little on the tight side.  You're not going to get much in the way of matching with a score of 30.  I see good, relatively conservative matching working well at 105.  You get more matches at 115 although it's questionable whether or not Marie Smith is really the same person as Mary Smith.  I set things at 115 for the screen shot above, but play with it using your real data and use your best judgment.  The macro variable Minimum_GED controls the minimum acceptable GED score.

 

Jim

 

%LET	Minimum_GED	=	115;

**------------------------------------------------------------------------------**;

**	Read in copy 1 of the data.	**;
DATA	COPY_1;
	LENGTH	ID				$3
			Student_First	$25
			Student_Last	$25
			;

	INPUT	ID				$
			Student_First	$
			Student_Last	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Read in copy 2 of the data.	**;
DATA	COPY_2;
	LENGTH	ID_c			$3
			Student_First_c	$25
			Student_Last_c	$25
			;

	INPUT	ID_c			$
			Student_First_c	$
			Student_Last_c	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Identify all GED scores within a minimum distance of each other.	**;
proc sql	NOPRINT;
	create table dataset as
		select 	copy_1.* 
				,copy_2.* 
				,(compged(student_first , student_first_c)) AS	First_Score
				,(compged(student_last , student_last_c))	AS	Last_Score
			from 	copy_1, copy_2 
			where 	(ID ^= ID_c)
				and (compged(student_last , student_last_c) 	<	&Minimum_GED)
				and (compged(student_first , student_first_c)	<	&Minimum_GED)
			; 
quit; 

**------------------------------------------------------------------------------**;

**	Remove duplicates and group associated names together.	**;
proc sql	NOPRINT;
	Create	Table	DeDup_Dataset	AS
		select 	Student_Last
				,Student_First
				,ID
			from 	Dataset
		UNION
		select 	Student_Last_c		AS	Student_Last
				,Student_First_c	AS	Student_First
				,ID_c				AS	ID
			from 	Dataset
			ORDER BY	Student_Last
						,Student_First
						,ID
						;
quit; 

**------------------------------------------------------------------------------**;

**	Assign person ID numbers.	**;
DATA	Dataset_With_Person_ID;
	DROP	_:;

	SET	DeDup_Dataset;

	LENGTH	_Prev_Last	$25;
	LENGTH	_Prev_First	$25;

	_Prev_Last		=	LAG(Student_Last);
	_Prev_First		=	LAG(Student_First);

	IF	COMPGED(_Prev_Last, Student_Last)	>	&Minimum_GED	OR
		COMPGED(_Prev_First, Student_First)	>	&Minimum_GED	THEN
		Person_ID	+	1;
RUN;

**------------------------------------------------------------------------------**;

View solution in original post

8 REPLIES 8
jimbarbour
Meteorite | Level 14

I would think you would want to change the continuous variable that would be created by COMPGED into a categorical variable by using Proc Format and then do a GROUP BY that new categorical variable.

 

Actually, I take that back.  A little.  🙂  CompGED isn't continuous if I'm understanding correctly.  I see discrete values 0, 10, and 20 that would fit your criteria.  Therefore, that's something you can group by right there without modification.  If you add your two CompGED results, I believe you'd have 0, 10, 20, 30, and 40 as possible values.  With that you could try some SQL along these lines (not tested since I don't have any data):

proc sql	NOPRINT;
	create table dataset as
		select 	copy_1.* 
				,copy_2.* 
				,(compged(student_first , student_first_c) + (compged(student_last , student_last_c))	AS combined_score
			from copy_1, copy_2 
			where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c
			group	by	CALCULATED	combined_score
			; 
quit; 

Jim

 

Sample CompGED values:

 

jimbarbour_0-1602203622784.png

 

 

 

PGStats
Opal | Level 21

Seems like all you need is sorting

 

proc sql;
create table dataset as
select copy_1.* , copy_2.* 
from copy_1, copy_2 
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c 
order by ID, ID_c; 
quit; 
PG
Amanda_Lemon
Quartz | Level 8

Thank you, both, for your replies, but I think I need something a little bit different. 

 

So I have data like: 

 

ID Student_first Student_last

100 Mary Smith 

101 Mary Smit 

102 Peter Gray

103 Marie Smith 

104 Pete Grey

 

With the matching code, I am able to identify that cases 100, 101, and 103 belong to one person, and cases 102 and 104 belong to another person. 

 

The matching table looks something like: 

ID Student_first Student_last ID_c Student_first_c Student_last_c

100 Mary Smith 101 Mary Smit

100 Mary Smith 103 Marie Smith 

101 Mary Smit 103 Marie Smith 

102 Peter Gray 104 Pete Grey 

104 Pete Grey 102 Peter Gray  

 

I want to be able to include unique person IDs, something like: 

ID Student_first Student_last ID_c Student_first_c Student_last_c Person_ID

100 Mary Smith 101 Mary Smit 1

100 Mary Smith 103 Marie Smith 1 

101 Mary Smit 103 Marie Smith 1

102 Peter Gray 104 Pete Grey 2

104 Pete Grey 102 Peter Gray 2

 

The end goal is to find cases from the same people in the original dataset and assign person-specific IDs. Like: 

ID Student_first Student_last Person_ID

100 Mary Smith 1

101 Mary Smit 1

102 Peter Gray 2

103 Marie Smith 1

104 Pete Grey 2

 

When names are exactly the same from the same person across entries, I just used the following code: 

proc sort data = dataset;
by student_first student_last; 
run; 
data dataset;
set dataset;
by student_first student_last;
if first.student_last then Person_ID + 1;  
run; 

But with this dataset, the complexity is that the names of the same people were not entered in exactly the same way. 

 

Thanks again for your help! 

jimbarbour
Meteorite | Level 14

@Amanda_Lemon,

 

Oh, OK, I think I get what you want.  

 

Results that look something like this, yes?  Code is below.  See comments in code.

jimbarbour_0-1602217401324.png

 

I think 30 is a little on the tight side.  You're not going to get much in the way of matching with a score of 30.  I see good, relatively conservative matching working well at 105.  You get more matches at 115 although it's questionable whether or not Marie Smith is really the same person as Mary Smith.  I set things at 115 for the screen shot above, but play with it using your real data and use your best judgment.  The macro variable Minimum_GED controls the minimum acceptable GED score.

 

Jim

 

%LET	Minimum_GED	=	115;

**------------------------------------------------------------------------------**;

**	Read in copy 1 of the data.	**;
DATA	COPY_1;
	LENGTH	ID				$3
			Student_First	$25
			Student_Last	$25
			;

	INPUT	ID				$
			Student_First	$
			Student_Last	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Read in copy 2 of the data.	**;
DATA	COPY_2;
	LENGTH	ID_c			$3
			Student_First_c	$25
			Student_Last_c	$25
			;

	INPUT	ID_c			$
			Student_First_c	$
			Student_Last_c	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Identify all GED scores within a minimum distance of each other.	**;
proc sql	NOPRINT;
	create table dataset as
		select 	copy_1.* 
				,copy_2.* 
				,(compged(student_first , student_first_c)) AS	First_Score
				,(compged(student_last , student_last_c))	AS	Last_Score
			from 	copy_1, copy_2 
			where 	(ID ^= ID_c)
				and (compged(student_last , student_last_c) 	<	&Minimum_GED)
				and (compged(student_first , student_first_c)	<	&Minimum_GED)
			; 
quit; 

**------------------------------------------------------------------------------**;

**	Remove duplicates and group associated names together.	**;
proc sql	NOPRINT;
	Create	Table	DeDup_Dataset	AS
		select 	Student_Last
				,Student_First
				,ID
			from 	Dataset
		UNION
		select 	Student_Last_c		AS	Student_Last
				,Student_First_c	AS	Student_First
				,ID_c				AS	ID
			from 	Dataset
			ORDER BY	Student_Last
						,Student_First
						,ID
						;
quit; 

**------------------------------------------------------------------------------**;

**	Assign person ID numbers.	**;
DATA	Dataset_With_Person_ID;
	DROP	_:;

	SET	DeDup_Dataset;

	LENGTH	_Prev_Last	$25;
	LENGTH	_Prev_First	$25;

	_Prev_Last		=	LAG(Student_Last);
	_Prev_First		=	LAG(Student_First);

	IF	COMPGED(_Prev_Last, Student_Last)	>	&Minimum_GED	OR
		COMPGED(_Prev_First, Student_First)	>	&Minimum_GED	THEN
		Person_ID	+	1;
RUN;

**------------------------------------------------------------------------------**;
Amanda_Lemon
Quartz | Level 8
@jimbarbour, THANK YOU SO MUCH!! This is perfect! Exactly what I needed.
jimbarbour
Meteorite | Level 14

Hi, @Amanda_Lemon,

 

Oh, good.  I was hoping that it was.  You probably could tune it significantly more in terms of both performance and match quality, but I don't know how big your data sets are or how strict your matching needs to be.

 

I thought this SAS paper on using CompGED in conjunction with Spedis was interesting:

https://support.sas.com/resources/papers/proceedings14/1674-2014.pdf

 

@ballardw's suggestion sounds like it has a great deal of merit as well.  

 

Good luck!

 

Jim

ballardw
Super User

 

 

 

Try https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm for free software that will give a probability of match. You can ignore the cancer related documentation but this matches on names, addresses, dates and other fields that might contain matching data and will give a probability of match for not-exact matched cases.

 

I know it isn't SAS but the price is right, none, and has features that will take a lot of SAS programming duplicate. 

 

You will need to provide text files for input but you should be able to make them easily from SAS.

You don't even have to have the same variable names/ column headers as the application lets you pick what to compare between files.

 

I used this to do a data quality report when a data provider "moved" data to a new data system without retaining unique identifiers so I could use the data for reporting across the conversion. And to identify the records where people's names, birth dates, gender, race and ethnicity changed in the process.

Amanda_Lemon
Quartz | Level 8
@ballardw, thank you for the information! I had never heard of this tool before -- certainly worth exploring.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 2639 views
  • 5 likes
  • 4 in conversation