I'm trying to subset some data with the following code:
data want;
set have;
array fx(12) fx1-fx12;
do i=1 to 12;
if substr(dx(i),1,4) in ('1115')
or substr(dx(i),1,5) in ('1146%')
then output;
end;
run;
I cross reference the data output using proc freq to the original dataset. The frequency counts for '1115' matches as they should. They don't for '1146%'). I thought '%' is a wildcard that I can use?
You've gotten plenty of good advice so far. Let me add to that.
Once you fix the problem, you could easily be outputting the same observation multiple times. It's at least theoretically possible that multiple diagnosis codes from the same observation will produce a match. Here's a way to save the essential data so you can sort it out later. This will give you the information needed to verify that the counts are coming out correctly.
data want;
DROP i;
set have;
array fx(12) fx1-fx12;
length diagnosis_category $ 4;
do i=1 to 12;
if Fx(i) in: ('1115', '1146') then do;
diagnosis_code = Fx(i);
diagnosis_category = Fx(i);
output;
end;
end;
run;
I would change your Data step as follows:
data want;
DROP i;
set have;
array fx(12) $5 fx1-fx12;
do i=1 to 12;
if substr(Fx(i),1,4) in ('1115') OR
Fx(i) =: ('1146') THEN
output;
end;
run;
Notice that I changed your use of substr(Fx(i),1,5) in ('1146%') to Fx(i) =: ('1146'). The use of the colon instructs SAS to evaluate the condition as true for any characters following the specified characters. I think this should give you the wildcard functionality that you're looking for. The "%" you were using before would be taken as a literal if I'm not mistaken.
Jim
Word of caution, if the array contains more than one match you will be outputting replicate rows
The %
wildcard is recognized by the WHERE
statement LIKE
operator. For the IF
statement you will want to use the string prefix equality (i.e. starts with) operator =:
or the prefix in set operator IN:
Also, since you are prefix checking only 4 of 5 characters substringed out you could `substring` 4 characters and check with ='1146'. Furthermore, since you are `substr` from position 1 (1st character) you won't need to do substr
at all (see 3rd example).
In order to use Perl regular expression pattern matching use the PRXMATCH
function. The pattern /^1146\d*/
does not need \d*
(0 or more digits).
/^1146/'
will match anything that /^1146\d*/
does.
Example(s) all the same outcome:
if substr(dx(i),1,4) in ('1115') or fx(i) =: '1146' then output;
if substr(dx(i),1,4) in ('1115') or substr(fx(i),1,4) = '1146' then output;
/* expanded example for case of checking two prefix possibilities */
if dx(i) in: ('1115') or fx(i) in: ('1146', '124') then output;
if dx(i) =: '1115' or prxmatch('/^1146/', fx(i)) then output;
You've gotten plenty of good advice so far. Let me add to that.
Once you fix the problem, you could easily be outputting the same observation multiple times. It's at least theoretically possible that multiple diagnosis codes from the same observation will produce a match. Here's a way to save the essential data so you can sort it out later. This will give you the information needed to verify that the counts are coming out correctly.
data want;
DROP i;
set have;
array fx(12) fx1-fx12;
length diagnosis_category $ 4;
do i=1 to 12;
if Fx(i) in: ('1115', '1146') then do;
diagnosis_code = Fx(i);
diagnosis_category = Fx(i);
output;
end;
end;
run;
That's a good point that you could have multiple rows in your output if you get more than one match. There's a reasonably simple way to make sure that you only get one row in you output per one row in your input: Set a flag when you get a match and only perform an OUTPUT after all values have been checked.
Putting it all together:
data want;
DROP _:;
set have;
array fx(12) $5 fx1 - fx12;
Match = 0;
do _i = 1 to 12;
if Fx(_i) IN: ('1115' '1146') THEN
Match = 1;
end;
IF Match THEN
output;
run;
The above should give you a reasonably compact notation, the correct results including wildcard functionality, and one and only one output row for each input row.
The cool thing here is the colon operator. The equals sign followed by the colon acts as a wildcard. Notice also that I've changed my DROP statement such that it now contains a colon. Coded this way, all variables starting with an underscore will be dropped without having to name them variable by variable. In any program that has this DROP statement, all I have to do to make a variable temporary (i.e. not written out as part of the Data statement) is to prefix it with an underscore. I find this bit of code very handy in a program with a lot of intermediate work variables. A word of caution: Some SAS procedure (e.g. Proc Compare) produce variables prefixed with an underscore. If you want to keep those variables, you have to change to two underscores (DROP __:;) or something along those lines.
Jim
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.