Solved: Re: using % wildcard with substr

daufoi · Posted 09-10-2020 01:17 AM

I'm trying to subset some data with the following code:

data want;
set have;
array fx(12) fx1-fx12;
    do i=1 to 12;
    if substr(dx(i),1,4) in ('1115')
       or substr(dx(i),1,5) in ('1146%')
    then output;
end;
run;

I cross reference the data output using proc freq to the original dataset. The frequency counts for '1115' matches as they should. They don't for '1146%'). I thought '%' is a wildcard that I can use?

Astounding · Posted 09-10-2020 07:46 AM

You've gotten plenty of good advice so far. Let me add to that.

Once you fix the problem, you could easily be outputting the same observation multiple times. It's at least theoretically possible that multiple diagnosis codes from the same observation will produce a match. Here's a way to save the essential data so you can sort it out later. This will give you the information needed to verify that the counts are coming out correctly.

data want;
	DROP  i;
	set have;
	array fx(12)  fx1-fx12;
	length diagnosis_category $ 4;
	do i=1 to 12;
		if  Fx(i) in: ('1115', '1146') then do;
		   diagnosis_code = Fx(i);
		   diagnosis_category = Fx(i);
 		   output;
        end;
	end;
run;

View solution in original post

daufoi · Posted 09-10-2020 01:19 AM

That should read substr(fx(i),1,4) and substr(fx(i),1,5)

jimbarbour · Posted 09-10-2020 02:34 AM

I would change your Data step as follows:

data want;
	DROP  i;
	set have;
	array fx(12) $5 fx1-fx12;

	do i=1 to 12;
		if 	substr(Fx(i),1,4) in ('1115')	OR
			Fx(i) =: ('1146')				THEN
			output;
	end;
run;

Notice that I changed your use of substr(Fx(i),1,5) in ('1146%') to Fx(i) =: ('1146'). The use of the colon instructs SAS to evaluate the condition as true for any characters following the specified characters. I think this should give you the wildcard functionality that you're looking for. The "%" you were using before would be taken as a literal if I'm not mistaken.

Jim

RichardDeVen · Posted 09-10-2020 07:27 AM

Word of caution, if the array contains more than one match you will be outputting replicate rows

The % wildcard is recognized by the WHERE statement LIKE operator. For the IF statement you will want to use the string prefix equality (i.e. starts with) operator =: or the prefix in set operator IN:

Also, since you are prefix checking only 4 of 5 characters substringed out you could `substring` 4 characters and check with ='1146'. Furthermore, since you are `substr` from position 1 (1st character) you won't need to do substr at all (see 3rd example).

In order to use Perl regular expression pattern matching use the PRXMATCH function. The pattern /^1146\d*/ does not need \d* (0 or more digits).

/^1146/' will match anything that /^1146\d*/ does.

Example(s) all the same outcome:

if substr(dx(i),1,4) in ('1115') or fx(i) =: '1146' then output;

if substr(dx(i),1,4) in ('1115') or substr(fx(i),1,4) = '1146' then output;

/* expanded example for case of checking two prefix possibilities */
if dx(i) in: ('1115') or fx(i) in: ('1146', '124') then output;

if dx(i) =: '1115' or prxmatch('/^1146/', fx(i)) then output;

Astounding · Posted 09-10-2020 07:46 AM

You've gotten plenty of good advice so far. Let me add to that.

Once you fix the problem, you could easily be outputting the same observation multiple times. It's at least theoretically possible that multiple diagnosis codes from the same observation will produce a match. Here's a way to save the essential data so you can sort it out later. This will give you the information needed to verify that the counts are coming out correctly.

data want;
	DROP  i;
	set have;
	array fx(12)  fx1-fx12;
	length diagnosis_category $ 4;
	do i=1 to 12;
		if  Fx(i) in: ('1115', '1146') then do;
		   diagnosis_code = Fx(i);
		   diagnosis_category = Fx(i);
 		   output;
        end;
	end;
run;

jimbarbour · Posted 09-10-2020 10:02 AM

That's a good point that you could have multiple rows in your output if you get more than one match. There's a reasonably simple way to make sure that you only get one row in you output per one row in your input: Set a flag when you get a match and only perform an OUTPUT after all values have been checked.

Putting it all together:

data	want;
	DROP	_:;
	set		have;
	array	fx(12)	$5	fx1 - fx12;
	Match	=	0;
	do	_i	=	1	to	12;
		if	Fx(_i) IN: ('1115' '1146')	THEN
			Match	=	1;
	end;
	IF	Match	THEN
		output;
run;

The above should give you a reasonably compact notation, the correct results including wildcard functionality, and one and only one output row for each input row.

The cool thing here is the colon operator. The equals sign followed by the colon acts as a wildcard. Notice also that I've changed my DROP statement such that it now contains a colon. Coded this way, all variables starting with an underscore will be dropped without having to name them variable by variable. In any program that has this DROP statement, all I have to do to make a variable temporary (i.e. not written out as part of the Data statement) is to prefix it with an underscore. I find this bit of code very handy in a program with a lot of intermediate work variables. A word of caution: Some SAS procedure (e.g. Proc Compare) produce variables prefixed with an underscore. If you want to keep those variables, you have to change to two underscores (DROP __:;) or something along those lines.

Jim

using % wildcard with substr

Re: using % wildcard with substr

Re: using % wildcard with substr

Re: using % wildcard with substr

Re: using % wildcard with substr

Re: using % wildcard with substr

Re: using % wildcard with substr

Registration is open

SAS Training: Just a Click Away