When code involves setting values based on a single variable then a Format or Informat may be the more flexible approach. For a great many purposes you don't even need to change the value in the data set at all for a format to be flexible.
Note: the Macro language basically does not see the values of the variables so any "%if &variable" will not in generally work if &variable is supposed to reference the name of a variable in a data set.
If you have never looked at the SASHELP.CLASS data set please do so. It is a small, 19 observations and 5 variables.
Consider the variable SEX which has values of F and M. Perhaps I want to display values of 'Female' and 'Male' as those would be nicer to read.
Then run this code:
Proc format library=work;
value $class_sex
'F'='Female'
'M'='Male'
other='Unknown'
;
run;
proc freq data=sashelp.class;
tables sex*age ;
format sex $class_sex.;
run;
However formats are limited to mapping single variable values.
Your example for race needs a complete description of values and possibly meanings. You appear to have as many as 8 race variables since your code shows a Race8. The implication you have is that your variables are numeric so there are some interesting approaches
For example you can use the RANGE function on the list of all the variables: Range(race1, race2,race3,...,race8);
If the result from range is something other than 0 then at least two of the Race variables have different values which I suspect will take care of your "2 or more races" logic.
If the RANGE result is missing then all of the variables have missing values and Race_ethnicity='Unknown'.
If the Range is 0 that means only one value appeared in the list. I am going to guess that means that only one of Race1, Race2, etc will have a value (expecting Race1=1 , Race2=2, Race3=3 etc when present). Which means that in the Range=0 case that Race_ethnicity could be :
catx(' ', 'Race',put(max(race1,race2,race3...,race8),1.) )
In my personal code I would perhaps have Race_ethnicity as numeric and have format display "Race 1" whatever that text should be and have missing assigned to Race_ethnicity with the format .
Also personally I wouldn't do this in Proc SQL because you lose some of the tools that allow writing shorter code such as ARRAY. In a data step you can write something like this:
data want;
set have;
array r (*) race1-race8;
if range(of r(*)) >0 then Race_ethnicity=9; /* where 9 is going to mean the 2 or more races in a format*/
else if range(of r(*)) = then Race_ethnicity= max(of r(*));
run;
If you have many of those invalid values that get mapped to different valid values then an informat is extremely flexible when only one variable is involved.
Going back to the SASHELP.CLASS set.
Try this example:
Proc format library=work;
invalue $Newname
'Alfred' = 'Fred'
'Henry' = 'Hank'
'Carol','Louise' ='Dorothy'
other=_same_;
run;
data want;
set sashelp.class;
newname = input(name,$newname.);
run;
The above example in Want creates a new variable so you can see the resulting behavior side by side the old Name and Newname variables. Rerun the code with Name= instead of Newname= to replace values in place.
The Input statement would work in SQL as well:
Proc sql;
create table example as
select input(name,$newname.) as name,sex, age
from sashelp.class
;
run;
You can see with this sort of example that the code in the SQL would not have to change if I want to use the same rules in the Format or Informat created in Proc Format. I provided an example of mapping two incoming values to the same out come name for Dorothy as an example. The Case when would either require adding WHENs for new values or possibly changing a When var=value to When var in (<value list>). Since the list can be provided in the format or informat definition you move logic into a place that is easier to update.
If the results of the Proc Format code are assigned to permanent library, the LIbrary=work example, you can typically add that library to the system options looking for formats to search there when SAS is running. Or just execute the Proc Format code at start up of SAS and would always be available.
The formats and informats can be applied to multiple variables if the ranges of values are the same or a similar subset. I have an informat for reading "Yes" "No" "True" "False" "Y" "N" "T" "F" as numeric 1/0 coded values as that is very handy for some things.
I have another place where I have a list of "reasons" entered into an open text field that I use to create a new variable that has 4 code values. The list has about 120 text values up to 60 or so characters long.
You can use this code to IDENTIFY invalid values as well.
Suppose my "rules" say that the values for Sex are supposed to be "G" instead of "F" and expect "B" instead of "M"
Proc format library=work;
invalue $newsex
'F' = 'G'
'B' = _same_
other=_error_
;
run;
data example;
set sashelp.class;
sex= input(sex,$newsex.);
run;
The keyword _SAME_ in and INVALUE statement means basically use the existing value as the result.
The OTHER means this rule is applied to any value not explicitly stated prior. _ERROR_ means that those values other than 'F' and 'B' are errors and the LOG will show invalid data messages and usually enough information to determine what needs to be added to the invalue. In this case we would need to add a "M"="B".
I have a long running project, as in 20+years, that has a site location code in the data. I use an informat similar to that for $newsex whenI read the data files that has the expected values (_same_) and an other=_error_. That error triggers me to ask the program providing data for additional details we need about the site so I can update other code (some is formats) as needed. Then add the "new" site codes to the data validation invalue and rerun the code to build the proper data sets.
... View more