Solved: Computing a grouping variable in PROC SQL

Amanda_Lemon · Posted 10-08-2020 08:11 PM

Hi -

I am matching cases in my dataset based on the approximate similarity in participants' first and last names. I did the matching, but I am not sure how to add a grouping variable that would indicate a set of matched cases (i.e., cases from the same participants).

Copy_1 and Copy_2 are the same data sets with participants' ID, first names and last names. In Copy_2, variables have a "c" suffix.

proc sql;
create table dataset as
select copy_1.* , copy_2.* 
from copy_1, copy_2 
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c ; 
quit;

Thank you for your help!

jimbarbour · Posted 10-09-2020 12:38 AM

@Amanda_Lemon,

Oh, OK, I think I get what you want.

Results that look something like this, yes? Code is below. See comments in code.

I think 30 is a little on the tight side. You're not going to get much in the way of matching with a score of 30. I see good, relatively conservative matching working well at 105. You get more matches at 115 although it's questionable whether or not Marie Smith is really the same person as Mary Smith. I set things at 115 for the screen shot above, but play with it using your real data and use your best judgment. The macro variable Minimum_GED controls the minimum acceptable GED score.

Jim

%LET	Minimum_GED	=	115;

**------------------------------------------------------------------------------**;

**	Read in copy 1 of the data.	**;
DATA	COPY_1;
	LENGTH	ID				$3
			Student_First	$25
			Student_Last	$25
			;

	INPUT	ID				$
			Student_First	$
			Student_Last	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Read in copy 2 of the data.	**;
DATA	COPY_2;
	LENGTH	ID_c			$3
			Student_First_c	$25
			Student_Last_c	$25
			;

	INPUT	ID_c			$
			Student_First_c	$
			Student_Last_c	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Identify all GED scores within a minimum distance of each other.	**;
proc sql	NOPRINT;
	create table dataset as
		select 	copy_1.* 
				,copy_2.* 
				,(compged(student_first , student_first_c)) AS	First_Score
				,(compged(student_last , student_last_c))	AS	Last_Score
			from 	copy_1, copy_2 
			where 	(ID ^= ID_c)
				and (compged(student_last , student_last_c) 	<	&Minimum_GED)
				and (compged(student_first , student_first_c)	<	&Minimum_GED)
			; 
quit; 

**------------------------------------------------------------------------------**;

**	Remove duplicates and group associated names together.	**;
proc sql	NOPRINT;
	Create	Table	DeDup_Dataset	AS
		select 	Student_Last
				,Student_First
				,ID
			from 	Dataset
		UNION
		select 	Student_Last_c		AS	Student_Last
				,Student_First_c	AS	Student_First
				,ID_c				AS	ID
			from 	Dataset
			ORDER BY	Student_Last
						,Student_First
						,ID
						;
quit; 

**------------------------------------------------------------------------------**;

**	Assign person ID numbers.	**;
DATA	Dataset_With_Person_ID;
	DROP	_:;

	SET	DeDup_Dataset;

	LENGTH	_Prev_Last	$25;
	LENGTH	_Prev_First	$25;

	_Prev_Last		=	LAG(Student_Last);
	_Prev_First		=	LAG(Student_First);

	IF	COMPGED(_Prev_Last, Student_Last)	>	&Minimum_GED	OR
		COMPGED(_Prev_First, Student_First)	>	&Minimum_GED	THEN
		Person_ID	+	1;
RUN;

**------------------------------------------------------------------------------**;

View solution in original post

jimbarbour · Posted 10-08-2020 08:18 PM

I would think you would want to change the continuous variable that would be created by COMPGED into a categorical variable by using Proc Format and then do a GROUP BY that new categorical variable.

Actually, I take that back. A little. 🙂 CompGED isn't continuous if I'm understanding correctly. I see discrete values 0, 10, and 20 that would fit your criteria. Therefore, that's something you can group by right there without modification. If you add your two CompGED results, I believe you'd have 0, 10, 20, 30, and 40 as possible values. With that you could try some SQL along these lines (not tested since I don't have any data):

proc sql	NOPRINT;
	create table dataset as
		select 	copy_1.* 
				,copy_2.* 
				,(compged(student_first , student_first_c) + (compged(student_last , student_last_c))	AS combined_score
			from copy_1, copy_2 
			where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c
			group	by	CALCULATED	combined_score
			; 
quit;

Jim

Sample CompGED values:

PGStats · Posted 10-08-2020 10:57 PM

Seems like all you need is sorting

proc sql;
create table dataset as
select copy_1.* , copy_2.* 
from copy_1, copy_2 
where compged(student_first , student_first_c) < 30 and compged(student_last , student_last_c) < 30 and ID ^= ID_c 
order by ID, ID_c; 
quit;

PG

Amanda_Lemon · Posted 10-08-2020 11:17 PM

Thank you, both, for your replies, but I think I need something a little bit different.

So I have data like:

ID Student_first Student_last

100 Mary Smith

101 Mary Smit

102 Peter Gray

103 Marie Smith

104 Pete Grey

With the matching code, I am able to identify that cases 100, 101, and 103 belong to one person, and cases 102 and 104 belong to another person.

The matching table looks something like:

ID Student_first Student_last ID_c Student_first_c Student_last_c

100 Mary Smith 101 Mary Smit

100 Mary Smith 103 Marie Smith

101 Mary Smit 103 Marie Smith

102 Peter Gray 104 Pete Grey

104 Pete Grey 102 Peter Gray

I want to be able to include unique person IDs, something like:

ID Student_first Student_last ID_c Student_first_c Student_last_c Person_ID

100 Mary Smith 101 Mary Smit 1

100 Mary Smith 103 Marie Smith 1

101 Mary Smit 103 Marie Smith 1

102 Peter Gray 104 Pete Grey 2

104 Pete Grey 102 Peter Gray 2

The end goal is to find cases from the same people in the original dataset and assign person-specific IDs. Like:

ID Student_first Student_last Person_ID

100 Mary Smith 1

101 Mary Smit 1

102 Peter Gray 2

103 Marie Smith 1

104 Pete Grey 2

When names are exactly the same from the same person across entries, I just used the following code:

proc sort data = dataset;
by student_first student_last; 
run; 
data dataset;
set dataset;
by student_first student_last;
if first.student_last then Person_ID + 1;  
run;

But with this dataset, the complexity is that the names of the same people were not entered in exactly the same way.

Thanks again for your help!

jimbarbour · Posted 10-09-2020 12:38 AM

@Amanda_Lemon,

Oh, OK, I think I get what you want.

Results that look something like this, yes? Code is below. See comments in code.

I think 30 is a little on the tight side. You're not going to get much in the way of matching with a score of 30. I see good, relatively conservative matching working well at 105. You get more matches at 115 although it's questionable whether or not Marie Smith is really the same person as Mary Smith. I set things at 115 for the screen shot above, but play with it using your real data and use your best judgment. The macro variable Minimum_GED controls the minimum acceptable GED score.

Jim

%LET	Minimum_GED	=	115;

**------------------------------------------------------------------------------**;

**	Read in copy 1 of the data.	**;
DATA	COPY_1;
	LENGTH	ID				$3
			Student_First	$25
			Student_Last	$25
			;

	INPUT	ID				$
			Student_First	$
			Student_Last	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Read in copy 2 of the data.	**;
DATA	COPY_2;
	LENGTH	ID_c			$3
			Student_First_c	$25
			Student_Last_c	$25
			;

	INPUT	ID_c			$
			Student_First_c	$
			Student_Last_c	$
			;

DATALINES;
100 Mary Smith 
101 Mary Smit 
102 Peter Gray
103 Marie Smith 
104 Pete Grey
;
RUN;

**------------------------------------------------------------------------------**;

**	Identify all GED scores within a minimum distance of each other.	**;
proc sql	NOPRINT;
	create table dataset as
		select 	copy_1.* 
				,copy_2.* 
				,(compged(student_first , student_first_c)) AS	First_Score
				,(compged(student_last , student_last_c))	AS	Last_Score
			from 	copy_1, copy_2 
			where 	(ID ^= ID_c)
				and (compged(student_last , student_last_c) 	<	&Minimum_GED)
				and (compged(student_first , student_first_c)	<	&Minimum_GED)
			; 
quit; 

**------------------------------------------------------------------------------**;

**	Remove duplicates and group associated names together.	**;
proc sql	NOPRINT;
	Create	Table	DeDup_Dataset	AS
		select 	Student_Last
				,Student_First
				,ID
			from 	Dataset
		UNION
		select 	Student_Last_c		AS	Student_Last
				,Student_First_c	AS	Student_First
				,ID_c				AS	ID
			from 	Dataset
			ORDER BY	Student_Last
						,Student_First
						,ID
						;
quit; 

**------------------------------------------------------------------------------**;

**	Assign person ID numbers.	**;
DATA	Dataset_With_Person_ID;
	DROP	_:;

	SET	DeDup_Dataset;

	LENGTH	_Prev_Last	$25;
	LENGTH	_Prev_First	$25;

	_Prev_Last		=	LAG(Student_Last);
	_Prev_First		=	LAG(Student_First);

	IF	COMPGED(_Prev_Last, Student_Last)	>	&Minimum_GED	OR
		COMPGED(_Prev_First, Student_First)	>	&Minimum_GED	THEN
		Person_ID	+	1;
RUN;

**------------------------------------------------------------------------------**;

Amanda_Lemon · Posted 10-09-2020 08:52 PM

@jimbarbour, THANK YOU SO MUCH!! This is perfect! Exactly what I needed.

jimbarbour · Posted 10-09-2020 10:19 PM

Hi, @Amanda_Lemon,

Oh, good. I was hoping that it was. You probably could tune it significantly more in terms of both performance and match quality, but I don't know how big your data sets are or how strict your matching needs to be.

I thought this SAS paper on using CompGED in conjunction with Spedis was interesting:

https://support.sas.com/resources/papers/proceedings14/1674-2014.pdf

@ballardw's suggestion sounds like it has a great deal of merit as well.

Good luck!

Jim

ballardw · Posted 10-09-2020 10:43 AM

Try https://www.cdc.gov/cancer/npcr/tools/registryplus/lp.htm for free software that will give a probability of match. You can ignore the cancer related documentation but this matches on names, addresses, dates and other fields that might contain matching data and will give a probability of match for not-exact matched cases.

I know it isn't SAS but the price is right, none, and has features that will take a lot of SAS programming duplicate.

You will need to provide text files for input but you should be able to make them easily from SAS.

You don't even have to have the same variable names/ column headers as the application lets you pick what to compare between files.

I used this to do a data quality report when a data provider "moved" data to a new data system without retaining unique identifiers so I could use the data for reporting across the conversion. And to identify the records where people's names, birth dates, gender, race and ethnicity changed in the process.

Amanda_Lemon · Posted 10-09-2020 08:54 PM

@ballardw, thank you for the information! I had never heard of this tool before -- certainly worth exploring.

Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Re: Computing a grouping variable in PROC SQL

Classroom Training Available!