DATA Step, Macro, Functions and more

Searching multiple variables for multiple character strings

Reply
New Contributor
Posts: 3

Searching multiple variables for multiple character strings

Hello.

 

I have a data set with 57 character variables that need to be searched for over 100 distinct ICD 9 codes that are character strings.  I need to create a new variable=1 if person has at least 1 of the desired codes in any of the 57 variables or variable=0 if none of the codes are in any of the variables.  I could use some advice. All I know is that I need to use an array(s) to create the new variable.  I have tried the code below, but it gives me a 0 for every observation which I know is wrong.  I'm not sure where I am going wrong. Any advice would be appreciated.

 

 

data test;
set qft_int;
length diagnosis9 $1379 ;
diagnosis9=catx(" ", "ICD9:042", "ICD9:136.3", "ICD9:199.2", "ICD9:238.73", "ICD9:238.76", "ICD9:238.77", "ICD9:238.79", "ICD9:260", "ICD9:261", "ICD9:262", "ICD9:279.00", "ICD9:279.01", "ICD9:279.02",
"ICD9:279.03", "ICD9:279.04", "ICD9:279.05", "ICD9:279.06", "ICD9:279.09", "ICD9:279.10", "ICD9:279.11", "ICD9:279.12", "ICD9:279.13", "ICD9:279.19", "ICD9:279.2", "ICD9:279.3",
"ICD9:279.4", "ICD9:279.41", "ICD9:279.49", "ICD9:279.50", "ICD9:279.51", "ICD9:279.52", "ICD9:279.53", "ICD9:279.8", "ICD9:279.9", "ICD9:284.09", "ICD9:284.1", "ICD9:284.11",
"ICD9:284.12", "ICD9:284.19", "ICD9:288.0", "ICD9:288.00", "ICD9:288.01", "ICD9:288.02", "ICD9:288.03", "ICD9:288.09", "ICD9:288.1", "ICD9:288.2", "ICD9:288.4", "ICD9:288.50",
"ICD9:288.51", "ICD9:288.59", "ICD9:289.53", "ICD9:289.83", "ICD9:403.01", "ICD9:403.11", "ICD9:403.91", "ICD9:404.02", "ICD9:404.03", "ICD9:404.12", "ICD9:404.13", "ICD9:404.92",
"ICD9:404.93", "ICD9:579.3", "ICD9:585", "ICD9:585.5", "ICD9:585.6", "ICD9:996.8", "ICD9:996.80", "ICD9:996.81", "ICD9:996.81", "ICD9:996.82", "ICD9:996.83", "ICD9:996.84", "ICD9:996.85",
"ICD9:996.86", "ICD9:996.87", "ICD9:996.88", "ICD9:996.89", "ICD9:V42.0", "ICD9:V42.1", "ICD9:V42.6", "ICD9:V42.7", "ICD9:V42.8", "ICD9:V42.81", "ICD9:V42.82", "ICD9:V42.83", "ICD9:V42.84",
"ICD9:V42.89", "ICD9:V45.1", "ICD9:V45.11", "ICD9:V56.0", "ICD9:V56.1", "ICD9:V56.2", "ICD9:00.18", "ICD9:33.5", "ICD9:33.50", "ICD9:33.51", "ICD9:33.52", "ICD9:33.6", "ICD9:37.5",
"ICD9:37.51", "ICD9:41.0", "ICD9:41.00", "ICD9:41.01", "ICD9:41.02", "ICD9:41.03", "ICD9:41.04", "ICD9:41.05", "ICD9:41.06", "ICD9:41.07", "ICD9:41.08", "ICD9:41.09", "ICD9:50.51",
"ICD9:50.59", "ICD9:52.80", "ICD9:52.81", "ICD9:52.82", "ICD9:52.83", "ICD9:52.85", "ICD9:52.86", "ICD9:55.69");


immune9=0;
retain immune9;
array diagnosisimmune{57} PrimaryDiag SecondaryDiag_1-SecondaryDiag_56 ;

do i= 1 to dim(diagnosisimmune);
if index(diagnosisimmune(i), diagnosis9) then immune9 =1;
end;

drop i;
run;

Super User
Posts: 5,085

Re: Searching multiple variables for multiple character strings

The biggest thing to correct is your use of INDEX.  You have the parameters reversed.  The string to be searched comes first, and the string you are searching for comes second.

 

Secondarily, INDEX does not search for words.  It searches for strings.  So if one of your variables had a value of "28" or "CD9" it would be found.  That sounds like the wrong result.  There are functions that search for words instead of strings ... I think INDEXW and FINDW would be on the list but you might have to doublecheck that part.

New Contributor
Posts: 3

Re: Searching multiple variables for multiple character strings

Thank you for your suggestion. I don't believe that the parameters are reversed in the index statement.  I want to search the 57 variables that I have entered in the array diagnosisimmune for the strings entered in the diagnosis9 variable. Each string is seperated by a space. I tried INDEXW but I still get all observations of immune9=0.

Super User
Posts: 10,516

Re: Searching multiple variables for multiple character strings

Please look at this somewhat reduced version of what you are attempting with index and why it isn't going to work the way your code is structured. You are using the INDEXRESULTW form in the code below:

 

 data junk;
   x='one code';
   y='one code, two code';
   indexresultw= index(x,y);
   indexresultc= index(y,x);
   put indexresultw= indexresultc=;
run;

the LONG string goes first with Index and the short string second (if you expect to find the short string in the long string ever)

 

Super User
Posts: 5,085

Re: Searching multiple variables for multiple character strings

You'll still need to switch the parameters.  Here's the documentation:

 

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000212235.htm

 

After switching, note that your 57 elements must contain the full word.  This will match:

 

ICD9:136.3

 

Neither of these will match:

 

ICD9:136

136.3

New Contributor
Posts: 3

Re: Searching multiple variables for multiple character strings

Thank you.  You were right.  Switching the parameters and a few other small changes to my code were all it took to get this to work.

Occasional Contributor
Posts: 6

Re: Searching multiple variables for multiple character strings

Hello,

 

I have a similar task that I am trying to perform that I was hoping you could also provide some insight on. I want to state up front that I am new to SAS (and this community) so I apologize if my question is not clear or if I am not asking it in the correct forum.

 

I have a data set with ~10 million records/rows. I want to check for the presence of any one of multiple strings across 9 of the variables in the data set. I then want to create a new variable (dis2) in my data set that will take the value of either 0 or 1. Dis2 should have a value of 1 if at least one of the 9 variables has any one of the strings and 0 if none of the 9 variables have any of the strings.

 

I created a macro statement to hold the strings I want to seach for:

%let dc2 = %str('28730', '28739', '2875','280', '281', '282', '283');

 

I then want to check each of these 9 varibales to see if any one of the strings are in any one of the variables:

c_dx1, c_dx2, c_dx3, c_dx4, c_dx5, c_dx6, c_dx7, c_dx8, c_hcpc

 

I was thinking that I could put the 9 variables in an array but I am not sure how to put everything together and create dis2. Any help is greatly appreciated. If you can provide example code of how to solve this problem that would be ideal. Something like:

 

%let dc2 = %str('28730', '28739', '2875','280', '281', '282', '283');

 

data example_w_dis2;

     set example;

     .......;

     .......;

 

Thanks!

 

Super User
Posts: 5,085

Re: Searching multiple variables for multiple character strings

It's important to clarify what you are looking for.  If "280" is in c_dx1 does that mean that c_dx1 equals "280" (with maybe a few blanks at the end), or could either of these also be a match:

 

28055

12802

Occasional Contributor
Posts: 6

Re: Searching multiple variables for multiple character strings

Astounding,

 

Thanks for pointing this out. "280" should match strings that start with "280" that may then also have additional numbers in the fourth and fifth position. For example, "280", "2801", 2802" and "28041" should all be recognized as matches. Conversely, "12802" should not be a match.

Super User
Posts: 5,085

Re: Searching multiple variables for multiple character strings

Thanks for clarifying.  Here's a straightforward way to approach the problem:

 

data want;

set have;

array diags {9} c_dx1-c_dx8 c_hcpc;

found = 0;

do _n_=1 to 9 until (found=1);

   if diags{_n_} in : (&dc2) then found=1;

end;

run;

 

Obviously, this depends on having set up a macro variable &DC2.  But you could just as easily hard-code the list of values within the program if necessary.

 

The key is the colon.  The comparison operator in : will find values that begin with the list of values within &DC2.

Occasional Contributor
Posts: 6

Re: Searching multiple variables for multiple character strings

Astounding,

 

Thanks! I created a macro with the strings in it I want to search for and then tried to include it in the data step. Can you please tell me if this looks right to you:

 

%macro check_gastro_hemg(dx);
(substr(&dx,1,4) in ('5310', '5312', '5314',
'5316', '5320', '5322', '5324', '5326', '5330',
'5332', '5334', '5336', '5340', '5342', '5344', '5346', '5693')) or
(&dx in ('53082', '53783', '53784', '56985', '56986')) or
(substr(&dx,1,3)='578') or
(substr(&dx,1,3)='535' and substr(&dx,5,1)='1')

%mend;

 

 

data want;

set have;

array diags {9} c_dx1-c_dx8 c_hcpc;

found = 0;

do _n_=1 to 9 until (found=1);

  if %check_gastro_hemg(diags) then then found=1;

end;

run;

Super User
Posts: 5,085

Re: Searching multiple variables for multiple character strings

Just a couple of comments ... you'll have to do the testing because I can't (at least not this week).

 

You're trying to use "diags" to refer to one of the elements in the array ... rather than using, for example, diags{1} to refer to the first element.  That's the older style of arrays and can be done, but might require some changes to the program.  Things you might have to change ...

 

The array statement might have to remove {9} and just list the elements.  

 

The index for the array in the DO loop might have to use _i_ instead of _n_.

 

That should take care of the syntax.  However, if your data set is large, note that you are using SUBSTR a number of times.  Since you are examining the beginning of a string, it would be faster to use in : as in my original examples.  Here's what your macro could look like if speed is an issue:

 

%macro check_gastro_hemg(dx);

 

&dx in : ('5310', '5312', '5314',

'5316', '5320', '5322', '5324', '5326', '5330',
'5332', '5334', '5336', '5340', '5342', '5344', '5346', '5693',
'53082', '53783', '53784', '56985', '56986',
'578') or
(&dx =: '535' and substr(&dx,5,1)='1')

 

%mend;

 

The colon modifies character comparisons, by basing the comparison on whichever string is shorter.

Occasional Contributor
Posts: 6

Re: Searching multiple variables for multiple character strings

Thanks Astounding!

 

Your last post clarified the last couple points I was getting hung up on and I was able to make the few changes I needed to get the program to run. Thanks for also explaining the different styles or arrays and the advantages of using the style you suggested.

Super User
Posts: 17,868

Re: Searching multiple variables for multiple character strings

This is much, much easier if you have your data in a long format rather than a wide format. As mentioned, only exact matches will be found. 

 

If you transpose your data and place the codes you're looking for into a dataset, list_chosen then this would work:

 

 

proc sql;
create table want as
select *
from have 
where idx in (select idx from list_chosen);
quit;

 

Ask a Question
Discussion stats
  • 13 replies
  • 713 views
  • 2 likes
  • 5 in conversation