I have a patient dataset where variables dx_code1-dx_code41 represent 41 diagnosis code variables. My objective is to output only observations in which at least one of the 41 diagnosis codes matches a particular code. I have 20 codes in mind, and I am trying to find an efficient way to program this in SAS.
I am trying to set up a macro variable that is a list of constants, but the following code does not work:
%let codes = '78060', '7784', '4659', '07999', '46611', '77989', '5990', '53081', '27651', '77189', '44619', '77982', '77931', '7746', 'V290', '77182', '7755', 'V070', '6910', '78791';
data patients;
array a_dx dx_code1-dx_code41;
do j = 1 to 20;
flag = 0;
do i = 1 to dim(a_dx);
if a_dx(i) in: codes(j) then flag = 1;
end;
if flag > 0 then output;
drop i flag;
end;
run;
Any suggestions? Thanks!
Hi chuakp;
I've attached some untested code based on my memory of something I desgned years ago. The code is rudimentary and has many opportunities for typo errors on my part and improvement and enhancement on your part. I would make this concept code work before you tried to finesse it out. This concept code should allow you room to expand it if you need more than 20 output datasets or if you need to read more than 41 diagnosis fields. I see a good opportunity to convert the code to a macro should the code work.
I think you're looking for a line like:
if a_dx in (&codes) then flag= 1.
or flag = (a_dx in (&codes);
However the values you have in the macrovariable codes are not all numeric so the dx_code variables better be character AND have the cases match, ie V070 not v070.
Hi,
I solved the problem using a proc format style. I found it was easier to update and change the formats than the macros. Also the format style provided me with the "OTHER" option.
Thanks for the replies. I apologize, but I just realized that my initial question was not clear. Ideally, my code would produce twenty datasets. The first dataset would be any patients where one of the 41 diagnosis codes was '78060', the second dataset would be any patients where one of the 41 diagnosis codes was '7784', etc.
How would I do this?
proc format is generally considered the better approach from the last time I posted a question on here.
Also, be careful with the in: as the matches may not be exactly what you want.
How would I use PROC FORMAT to solve this problem?
Hi chuakp,
Below is one style of coding. Coding the various possible values of diagnosis codes saves the need for using the upper, lower or proper case functions. If you have to add codes or change the output locations of codes, it is easier to manipulate the proc format than the data step code. The notfound and error statements are redundant. I added them just to show you what is possible. If a new 'type' is ever added to SASHELP.CARS, the notfound dataset would contain any rows with the new type.
proc format;
  value $diagx
     'SEDAN','Sedan','sedan'        = 1
     'SUV','Suv','suv'                      = 2
     'SPORTS','Sports','sports'    = 3
     'WAGON','Wagon','wagon'    = 4
     'TRUCK','Truck','truck'           = 5
     'HYBRID','Hybrid','hybrid'      = 6
      OTHER                                  = 7
  run;;
data want1 want2 want3 want4 want5 want6 notfound error;
set sashelp.cars;
if put(type,$diagx.) = 1 then output want1;
else if put(type,$diagx.) = 2 then output want2;
else if put(type,$diagx.) = 3 then output want3;
else if put(type,$diagx.) = 4 then output want4;
else if put(type,$diagx.) = 5 then output want5;
else if put(type,$diagx.) = 6 then output want6;
else if put(type,$diagx.) = 7 then output notfound;
else output error;
run;
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
354:5 355:10 356:10 357:10 358:10 359:10 360:10
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.WANT1 has 262 observations and 15 variables.
NOTE: The data set WORK.WANT2 has 60 observations and 15 variables.
NOTE: The data set WORK.WANT3 has 49 observations and 15 variables.
NOTE: The data set WORK.WANT4 has 30 observations and 15 variables.
NOTE: The data set WORK.WANT5 has 24 observations and 15 variables.
NOTE: The data set WORK.WANT6 has 3 observations and 15 variables.
NOTE: The data set WORK.NOTFOUND has 0 observations and 15 variables.
NOTE: The data set WORK.ERROR has 0 observations and 15 variables.
NOTE: DATA statement used (Total process time):
real time 1.21 seconds
cpu time 0.20 seconds
Everytime you have a succession of else if ..., you might consider using select:
select (put(type,$diagx.));
when (1) output want1;
...
when (7) output notfound;
otherwise output error;
end;
It's easier to read, and more efficient, because it has to do the put function only once.
Hi Kurt,
Thank you for providing your suggestion! My style suggestion is based on the "Comparisons" paragraph in the definition of the SELECT statement of the Dictionary of SAS statements in the SAS 9.3 Statements Reference book. Page 343. My style has become habit and I traditionally work with smaller datasets.
".. and more efficient, because it has to do the put function only once."
Care to elaborate more? Me thinks efficiency wise both approach are the same.
Haikuo
When you have a chain of "else if"s where you evaluate the same function in every "if", the function is evaluated once for every "false" you get. With a select(), the function is evaluated exactly once.
You could avoid that by assigning the result of the function to a variable and comparing against that.
OTOH, a "select" enables a compiler (and SAS turns into a compiler when encountering a data step) to do certain optimizations which usually speed up at runtime.
And, it is that much easier to read and conveys the intention of the programmer better. Readable code is a must for proper maintenance.
Hi jwillis, I tried this approach, but the problem is that some of the values in the 41 diagnosis code variables are character variables like 'E19', and you can't use a PUT statement for them. Most of the values are 5-digit numbers like 93819 but SAS trips up on the character variables.
Hi chuakp,
I've encountered your issue in the past. I'll see if I can dig up the code I wrote. The problem I had was that SAS was interpreting the 'read from' input diagnosis code variable as numeric. I was reading from EXCEL and the first 10 diagnosis positions were numeric so SAS said all the values in the EXCEL column were numeric. The format statement should handle the E19 code if you code all the values in quotes. If you can post the error portion of your log or the proc contents output of your input dataset, I will be better able to advise you.
Hi jwillis, this is the code that I used based on your suggestion. The proc format contains the 20 codes of interest - I want SAS to search the 41 diagnosis code variables for these 20 codes.
In the log, I essentially get a deluge of the following error statements:
NOTE: Invalid numeric data, 'E9' , at line 3613 column 8.
NOTE: Invalid numeric data, 'V1' , at line 3613 column 8.
etc.
proc format;
value $ dxcodes
'78060' = 1
'7784' = 2
'4659' = 3
'07999' = 4
'46611' = 5
'77989' = 6
'5990' = 7
'53081' = 8
'27651' = 9
'77189' = 10
'46619' = 11
'79982' = 12
'77931' = 13
'7746' = 14
'V290' = 15
'77182' = 16
'7755' = 17
'V070' = 18
'6910' = 19
'78791' = 20;
run;
data want1 want2 want3 want4 want5 want6 want7 want8 want9 want10 want11 want12 want13 want14 want15 want16 want17 want18 want19 want20;
set main;
array dx dx_code1-dx_code41;
flag=0;
do i=1 to dim(dx);
if put(dx(i), $dxcodes.)= 1 then flag=1;
end;
if flag>0 then output;
drop i flag;
run;
Hi chuakp;
The issue with the 'Enn' dx code appears to be caused by the Array variable being defined as numeric variables. (SAS Dictionary; Array statement syntax; page 24; SAS 9.3 Statements Reference). Try the array statement change I made and then tell me what your log says.
proc format;
value $ dxcodes
'78060' = 1
'7784'  = 2
'4659'  = 3
'07999' = 4
'46611' = 5
'77989' = 6
'5990' = 7
'53081' = 8
'27651' = 9
'77189' = 10
'46619' = 11
'79982' = 12
'77931' = 13
'7746' = 14
'V290' = 15
'77182' = 16
'7755' = 17
'V070' = 18
'6910' = 19
'78791' = 20
other = 99;
run;
data want1 want2 want3 want4 want5 
     want6 want7 want8 want9 want10 
     want11 want12 want13 want14 want15 
     want16 want17 want18 want19 want20;
set main;
array dx $ dx_code1-dx_code41;        /***  arrays default to numeric variable values if not defined previously or not defined as character ***/
flag=0;
do i=1 to dim(dx);
if put(dx(i), $dxcodes.)= 1 then flag=1;
end;
if flag>0 then output;      /*** What stops every matching row from being written to every output dataset?
                                                Where are non-matching rows written to? ***/
                            /*** I understood that you want: 
                                 if put(dx(i), $dxcodes.)= 1 then output want1;
                                 if put(dx(i), $dxcodes.)= ... then output want...;
                                 if put(dx(i), $dxcodes.)= 20 then output want20;
                             ***/
drop i flag;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
