BookmarkSubscribeRSS Feed
Shinypants
Fluorite | Level 6

Hi,

 

I’m trying to determine the race of subjects in my dataset based on their last name. I have a list of last names that indicate a given race. My dataset looks something like this:

 

White_name               Subject_name             Race

Johnson                      Appleby                      White

MacDonald                 Brians                         White

Burns                          MacDonald                 [blank]

Manning                     Murphy                        White

 

White_name represents a databank of all potential white last names.

Subject_name represents the last name of the subjects in my sample.

Race is a partially complete variable with information on their race.

 

I would like to fill in the blanks in the Race variable based on the Subject_name being found in the White_name list.

 

My logic would flow something like this:

if subject_name=any of the White_name then race2=’White’; else race2=race;

 

However, I believe I need to loop through all the names in the White_name variable in order to find a match. Note that I have a large cohort with many repetitions of last names.

 

Would appreciate any help! Please note that I am a very new SAS user!

 

Thank you!

8 REPLIES 8
Reeza
Super User

I assume you have a dataset that maps the names to race?

I would recommend proc format, using the dataset to create your format and then apply it.

 

Here's a paper that's worth reading regarding proc format. 

 

FYI, this method of assigning race by last names is flawed beyond belief and has been thoroughly shown to be incorrect and unreliable.  

 

data lookup;
informat name race $20.;
input name $ race $;
cards;
Johnson White
McDonald White
Burns White
Ortez Mexican
;
run;


data have;
input name $20.;
cards;
Johnson
McDonald
Burns 
Ortez
Wrong
;
run;

data create_format;
set lookup end=last;
fmtname='NameToRace';
type='C';
start=name;
label=race;

if last then do;
	output;
	hlo='O';
	label='Other';
	output;
end;
else output;
run;

proc format cntlin=create_format;
run;

data want;
set have;

race=put(name, $NameToRace.);
run;
KachiM
Rhodochrosite | Level 12

The use of Proc Format or Hash Objects will be suitable for your problem. I advise you to read on creating Fornats from a Data Set as Hash objects may be much for you now.

ChrisNZ
Tourmaline | Level 20

I can't believe I am writing this. This would be totally illegal in some countries! Race? Really?

 

Anyway, this may do:

 


data T;
infile cards missover;
input White_name $ Subject_name $ Race $;
cards;
Johnson       Appleby      White
MacDonald     Brians       White
Burns         MacDonald  
Manning       Murphy       White
xx            zz           
run;
proc sql;
  select a.*
        ,coalescec(b.RACE,a.RACE) as RACE2
   from T    a
   left join
        T b on a.Subject_name=b.White_name;
quit;

 

 

White_name Subject_name Race RACE2
Johnson Appleby White White
MacDonal Brians White White
Burns MacDonal   White
Manning Murphy White White
xx zz    
Shinypants
Fluorite | Level 6

Thank you for your response!

 

My data is quite large (8000+) names. Would this method require me to write all the names in SAS? 

ChrisNZ
Tourmaline | Level 20

Where would the names be if not in SAS?

This is a dodgy process. I think I'll be more comfortable staying out of it actually.

Shinypants
Fluorite | Level 6

This project is designed to prove how unreliable it is to base race off last names.

 

My data are in an excel file which I have imported into SAS. It looks like in your code that you manually type in all those names. Is there any way to select the variable as a whole without typing out all the names in the code?

ballardw
Super User

@ChrisNZ typed in lines of data to match your example data as you didn't provide that to us. The set that has the names you have already read into SAS can be used.

 

There may be some other issues on how to use your exact data as we do not have that but general approach is what was demonstrated.

Reeza
Super User

I typed in the names to generate a sample data set. Replace your dataset with the names of the sample datasets.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1529 views
  • 5 likes
  • 5 in conversation