DATA Step, Macro, Functions and more

Matching rows in two columns using a loop

Reply
Occasional Contributor
Posts: 8

Matching rows in two columns using a loop

Hi,

 

I’m trying to determine the race of subjects in my dataset based on their last name. I have a list of last names that indicate a given race. My dataset looks something like this:

 

White_name               Subject_name             Race

Johnson                      Appleby                      White

MacDonald                 Brians                         White

Burns                          MacDonald                 [blank]

Manning                     Murphy                        White

 

White_name represents a databank of all potential white last names.

Subject_name represents the last name of the subjects in my sample.

Race is a partially complete variable with information on their race.

 

I would like to fill in the blanks in the Race variable based on the Subject_name being found in the White_name list.

 

My logic would flow something like this:

if subject_name=any of the White_name then race2=’White’; else race2=race;

 

However, I believe I need to loop through all the names in the White_name variable in order to find a match. Note that I have a large cohort with many repetitions of last names.

 

Would appreciate any help! Please note that I am a very new SAS user!

 

Thank you!

Super User
Posts: 19,850

Re: Matching rows in two columns using a loop

Posted in reply to Shinypants

I assume you have a dataset that maps the names to race?

I would recommend proc format, using the dataset to create your format and then apply it.

 

Here's a paper that's worth reading regarding proc format. 

 

FYI, this method of assigning race by last names is flawed beyond belief and has been thoroughly shown to be incorrect and unreliable.  

 

data lookup;
informat name race $20.;
input name $ race $;
cards;
Johnson White
McDonald White
Burns White
Ortez Mexican
;
run;


data have;
input name $20.;
cards;
Johnson
McDonald
Burns 
Ortez
Wrong
;
run;

data create_format;
set lookup end=last;
fmtname='NameToRace';
type='C';
start=name;
label=race;

if last then do;
	output;
	hlo='O';
	label='Other';
	output;
end;
else output;
run;

proc format cntlin=create_format;
run;

data want;
set have;

race=put(name, $NameToRace.);
run;
Super Contributor
Posts: 298

Re: Matching rows in two columns using a loop

Posted in reply to Shinypants

The use of Proc Format or Hash Objects will be suitable for your problem. I advise you to read on creating Fornats from a Data Set as Hash objects may be much for you now.

PROC Star
Posts: 1,760

Re: Matching rows in two columns using a loop

[ Edited ]
Posted in reply to Shinypants

I can't believe I am writing this. This would be totally illegal in some countries! Race? Really?

 

Anyway, this may do:

 


data T;
infile cards missover;
input White_name $ Subject_name $ Race $;
cards;
Johnson       Appleby      White
MacDonald     Brians       White
Burns         MacDonald  
Manning       Murphy       White
xx            zz           
run;
proc sql;
  select a.*
        ,coalescec(b.RACE,a.RACE) as RACE2
   from T    a
   left join
        T b on a.Subject_name=b.White_name;
quit;

 

 

White_name Subject_name Race RACE2
Johnson Appleby White White
MacDonal Brians White White
Burns MacDonal   White
Manning Murphy White White
xx zz    
Occasional Contributor
Posts: 8

Re: Matching rows in two columns using a loop

Thank you for your response!

 

My data is quite large (8000+) names. Would this method require me to write all the names in SAS? 

PROC Star
Posts: 1,760

Re: Matching rows in two columns using a loop

Posted in reply to Shinypants

Where would the names be if not in SAS?

This is a dodgy process. I think I'll be more comfortable staying out of it actually.

Occasional Contributor
Posts: 8

Re: Matching rows in two columns using a loop

This project is designed to prove how unreliable it is to base race off last names.

 

My data are in an excel file which I have imported into SAS. It looks like in your code that you manually type in all those names. Is there any way to select the variable as a whole without typing out all the names in the code?

Super User
Posts: 11,343

Re: Matching rows in two columns using a loop

Posted in reply to Shinypants

@ChrisNZ typed in lines of data to match your example data as you didn't provide that to us. The set that has the names you have already read into SAS can be used.

 

There may be some other issues on how to use your exact data as we do not have that but general approach is what was demonstrated.

Super User
Posts: 19,850

Re: Matching rows in two columns using a loop

I typed in the names to generate a sample data set. Replace your dataset with the names of the sample datasets.

Ask a Question
Discussion stats
  • 8 replies
  • 533 views
  • 5 likes
  • 5 in conversation