BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SAS_student2
Calcite | Level 5

I have a table with 141 columns. I need to look through 25 specific columns that list a patients diagnosis (Diag1-Diag25). Each column has a single diagnosis code in it which is in character form. I have a list of specific diagnosis codes that I'm interested in and I want to keep any row that has these diagnosis codes in any of the Diag1-Diag25 columns in my filtered.I have done this successfully; however, my approach is very inefficient and does not scale well.  Right now my best approach consists of using a PROC SQL statement with a CREATE TABLE and WHERE statement to individually specify each column I want to look at. Then I have to use an IN statement to specify the range of diagnostic codes I'm interested in and join each individual column analysis with an OR statement. Here is an example of my code:

 

proc sql noprint;
create table SUBGROUP as select * from FILE.UNFILTERED_TABLE where(Diag1 
IN("4275") OR Diag2 IN("4275") OR Diag3 IN("4275") OR Diag4 
IN("4275") OR Diag5 IN("4275") OR Diag6 IN("4275") OR
Diag7 IN("4275") OR Diag8 IN("4275") OR Diag9 IN("4275") 
OR Diag10 IN("4275") OR Diag11 IN("4275") OR Diag12 IN("4275")
OR Diag13 IN("4275") OR Diag14 IN("4275") OR Diag15 IN("4275")
OR Diag16 IN("4275") OR Diag17 IN("4275") OR Diag18 IN("4275")
OR Diag19 IN("4275") OR Diag20 IN("4275") OR Diag21 IN("4275")
OR Diag22 IN("4275") OR Diag23 IN("4275") OR Diag24 IN("4275")
OR Diag25 IN("4275")); 

quit;

 

This is the most simple example where I only care about 1 diagnostic code but sometimes I need to look at 30 codes which makes this approach kind of cumbersome. I have tried to find a way to reference multiple columns in SAS in a DATA step or using looping and such but nothing has worked for me so any advice/help would be greatly appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

You did not provide any example data so here is one approach with a small dummy data set.

data have;
   informat diag1 - diag4 $4.;
   input diag1 - diag4;
datalines;
1245 4567 2333 8345
6666 1246 4444 4356
5432 5543 4275 6666
8325 4275 1245 9999
run;

data want;
   set have ;
   array d diag1-diag4;
   array c{2} $ 4 _temporary_ ('1245' '4275');
   do i= 1 to dim(d);
      if whichc(d[i],of c(*))>0 then do; 
         output;
         leave;
      end;
   end;
run;

The first data set just creates some dummy data.

 

You would place the variables like your diag variables on the array d.

For the second array you have to 1) specify the number of values you want to compare in the c{x}, the $ 4 says the values with be character of length 4 (or less) _temporary_ says that the variables created by the array are temporary and not added to the data set, the values in the () at the end are the ones you want to search for. The number really should match the first number in c{x} though there are fewer this won't fail.

The WHICHC function searches for the first parameter in the remaining values. The of c(*) says to use all of the elements in the array c to search. The result is a number if found or 0 if not found. The test >0 says one of the values is found so a)output the record and b) leave the test loop when the first match is found.

You may or may not want to keep the variable I. In this instance it would tell you which of the diag variables matched in the list of search values.

 

This code may have an issue with your data set if you have existing variables named c1, c2, etc. in which case you can change the C to something else you aren't using such as _c or what ever you prefer.

Sql is crappy for this type of comparison as you have discovered.

View solution in original post

6 REPLIES 6
Haikuo
Onyx | Level 15

Look up 'Array' in SAS docs.

Astounding
PROC Star

As was suggested, arrays will come in handy here (so long as you are using a DATA step).

 

Another piece of the solution:  the IN operator lets you specify a list of values.  For example:

 

where Diag1 in ("4275" "4280" "4285" "4290");

 

Arrays will let you specify that longer list only once instead of 25 times.

ballardw
Super User

You did not provide any example data so here is one approach with a small dummy data set.

data have;
   informat diag1 - diag4 $4.;
   input diag1 - diag4;
datalines;
1245 4567 2333 8345
6666 1246 4444 4356
5432 5543 4275 6666
8325 4275 1245 9999
run;

data want;
   set have ;
   array d diag1-diag4;
   array c{2} $ 4 _temporary_ ('1245' '4275');
   do i= 1 to dim(d);
      if whichc(d[i],of c(*))>0 then do; 
         output;
         leave;
      end;
   end;
run;

The first data set just creates some dummy data.

 

You would place the variables like your diag variables on the array d.

For the second array you have to 1) specify the number of values you want to compare in the c{x}, the $ 4 says the values with be character of length 4 (or less) _temporary_ says that the variables created by the array are temporary and not added to the data set, the values in the () at the end are the ones you want to search for. The number really should match the first number in c{x} though there are fewer this won't fail.

The WHICHC function searches for the first parameter in the remaining values. The of c(*) says to use all of the elements in the array c to search. The result is a number if found or 0 if not found. The test >0 says one of the values is found so a)output the record and b) leave the test loop when the first match is found.

You may or may not want to keep the variable I. In this instance it would tell you which of the diag variables matched in the list of search values.

 

This code may have an issue with your data set if you have existing variables named c1, c2, etc. in which case you can change the C to something else you aren't using such as _c or what ever you prefer.

Sql is crappy for this type of comparison as you have discovered.

SAS_student2
Calcite | Level 5

Thanks so much this makes a lot of sense and helps a lot!

miheret
Calcite | Level 5

what if I want to run the loop until all matches are found in an observation? DO I just not use the leave function?

will the variable i then list all the variables that match the search list instead of stopping at the first match?

I am trying to identify an observation with same/redundant diagnostic codes present under different variable columns. 

 

Thanks!

Reeza
Super User

Do you have your diagnosis codes in your head, or in a separate file? If they're in a file, you can 'automate' this so that next time it works the same way if you have a different list. You would create custom format with your list and then loop through and check each diagnosis code. There are several ways to work with clinical data and you can find lots of examples on LexJansen.com

 


@SAS_student2 wrote:

I have a table with 141 columns. I need to look through 25 specific columns that list a patients diagnosis (Diag1-Diag25). Each column has a single diagnosis code in it which is in character form. I have a list of specific diagnosis codes that I'm interested in and I want to keep any row that has these diagnosis codes in any of the Diag1-Diag25 columns in my filtered.I have done this successfully; however, my approach is very inefficient and does not scale well.  Right now my best approach consists of using a PROC SQL statement with a CREATE TABLE and WHERE statement to individually specify each column I want to look at. Then I have to use an IN statement to specify the range of diagnostic codes I'm interested in and join each individual column analysis with an OR statement. Here is an example of my code:

 

proc sql noprint;
create table SUBGROUP as select * from FILE.UNFILTERED_TABLE where(Diag1 
IN("4275") OR Diag2 IN("4275") OR Diag3 IN("4275") OR Diag4 
IN("4275") OR Diag5 IN("4275") OR Diag6 IN("4275") OR
Diag7 IN("4275") OR Diag8 IN("4275") OR Diag9 IN("4275") 
OR Diag10 IN("4275") OR Diag11 IN("4275") OR Diag12 IN("4275")
OR Diag13 IN("4275") OR Diag14 IN("4275") OR Diag15 IN("4275")
OR Diag16 IN("4275") OR Diag17 IN("4275") OR Diag18 IN("4275")
OR Diag19 IN("4275") OR Diag20 IN("4275") OR Diag21 IN("4275")
OR Diag22 IN("4275") OR Diag23 IN("4275") OR Diag24 IN("4275")
OR Diag25 IN("4275")); 

quit;

 

This is the most simple example where I only care about 1 diagnostic code but sometimes I need to look at 30 codes which makes this approach kind of cumbersome. I have tried to find a way to reference multiple columns in SAS in a DATA step or using looping and such but nothing has worked for me so any advice/help would be greatly appreciated.


 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 12514 views
  • 4 likes
  • 6 in conversation