07-28-2016 05:40 PM
I’m trying to determine the race of subjects in my dataset based on their last name. I have a list of last names that indicate a given race. My dataset looks something like this:
White_name Subject_name Race
Johnson Appleby White
MacDonald Brians White
Burns MacDonald [blank]
Manning Murphy White
White_name represents a databank of all potential white last names.
Subject_name represents the last name of the subjects in my sample.
Race is a partially complete variable with information on their race.
I would like to fill in the blanks in the Race variable based on the Subject_name being found in the White_name list.
My logic would flow something like this:
if subject_name=any of the White_name then race2=’White’; else race2=race;
However, I believe I need to loop through all the names in the White_name variable in order to find a match. Note that I have a large cohort with many repetitions of last names.
Would appreciate any help! Please note that I am a very new SAS user!
07-28-2016 06:25 PM
I assume you have a dataset that maps the names to race?
I would recommend proc format, using the dataset to create your format and then apply it.
Here's a paper that's worth reading regarding proc format.
FYI, this method of assigning race by last names is flawed beyond belief and has been thoroughly shown to be incorrect and unreliable.
data lookup; informat name race $20.; input name $ race $; cards; Johnson White McDonald White Burns White Ortez Mexican ; run; data have; input name $20.; cards; Johnson McDonald Burns Ortez Wrong ; run; data create_format; set lookup end=last; fmtname='NameToRace'; type='C'; start=name; label=race; if last then do; output; hlo='O'; label='Other'; output; end; else output; run; proc format cntlin=create_format; run; data want; set have; race=put(name, $NameToRace.); run;
07-28-2016 06:25 PM
The use of Proc Format or Hash Objects will be suitable for your problem. I advise you to read on creating Fornats from a Data Set as Hash objects may be much for you now.
07-28-2016 06:32 PM - edited 07-28-2016 06:34 PM
I can't believe I am writing this. This would be totally illegal in some countries! Race? Really?
Anyway, this may do:
data T; infile cards missover; input White_name $ Subject_name $ Race $; cards; Johnson Appleby White MacDonald Brians White Burns MacDonald Manning Murphy White xx zz run; proc sql; select a.* ,coalescec(b.RACE,a.RACE) as RACE2 from T a left join T b on a.Subject_name=b.White_name; quit;
07-28-2016 06:52 PM
Where would the names be if not in SAS?
This is a dodgy process. I think I'll be more comfortable staying out of it actually.
07-28-2016 07:24 PM
This project is designed to prove how unreliable it is to base race off last names.
My data are in an excel file which I have imported into SAS. It looks like in your code that you manually type in all those names. Is there any way to select the variable as a whole without typing out all the names in the code?
07-28-2016 07:34 PM
@ChrisNZ typed in lines of data to match your example data as you didn't provide that to us. The set that has the names you have already read into SAS can be used.
There may be some other issues on how to use your exact data as we do not have that but general approach is what was demonstrated.