BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
daufoi
Fluorite | Level 6

I'm trying to subset some data with the following code:

 

data want;
set have;
array fx(12) fx1-fx12;
    do i=1 to 12;
    if substr(dx(i),1,4) in ('1115')
       or substr(dx(i),1,5) in ('1146%')
    then output;
end;
run;

I cross reference the data output using proc freq to the original dataset. The frequency counts for '1115' matches as they should. They don't for '1146%'). I thought '%' is a wildcard that I can use?

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

You've gotten plenty of good advice so far.  Let me add to that.

 

Once you fix the problem, you could easily be outputting the same observation multiple times.  It's at least theoretically possible that multiple diagnosis codes from the same observation will produce a match.  Here's a way to save the essential data so you can sort it out later.  This will give you the information needed to verify that the counts are coming out correctly.

data want;
	DROP  i;
	set have;
	array fx(12)  fx1-fx12;
	length diagnosis_category $ 4;
	do i=1 to 12;
		if  Fx(i) in: ('1115', '1146') then do;
		   diagnosis_code = Fx(i);
		   diagnosis_category = Fx(i);
 		   output;
        end;
	end;
run;

View solution in original post

5 REPLIES 5
daufoi
Fluorite | Level 6
That should read substr(fx(i),1,4) and substr(fx(i),1,5)
jimbarbour
Meteorite | Level 14

I would change your Data step as follows:

data want;
	DROP  i;
	set have;
	array fx(12) $5 fx1-fx12;

	do i=1 to 12;
		if 	substr(Fx(i),1,4) in ('1115')	OR
			Fx(i) =: ('1146')				THEN
			output;
	end;
run;

Notice that I changed your use of substr(Fx(i),1,5) in ('1146%') to Fx(i) =: ('1146').  The use of the colon instructs SAS to evaluate the condition as true for any characters following the specified characters.  I think this should give you the wildcard functionality that you're looking for.  The "%" you were using before would be taken as a literal if I'm not mistaken.

 

Jim

RichardDeVen
Barite | Level 11

Word of caution, if the array contains more than one match you will be outputting replicate rows

 

The % wildcard is recognized by the WHERE statement LIKE operator. For the IF statement you will want to use the string prefix equality (i.e. starts with) operator =: or the prefix in set operator IN:

Also, since you are prefix checking only 4 of 5 characters substringed out you could `substring` 4 characters and check with ='1146'. Furthermore, since you are `substr` from position 1 (1st character) you won't need to do substr at all (see 3rd example).

In order to use Perl regular expression pattern matching use the PRXMATCH function. The pattern /^1146\d*/ does not need \d* (0 or more digits).

/^1146/' will match anything that /^1146\d*/ does.

 

Example(s) all the same outcome:

 

if substr(dx(i),1,4) in ('1115') or fx(i) =: '1146' then output;
if substr(dx(i),1,4) in ('1115') or substr(fx(i),1,4) = '1146' then output;
/* expanded example for case of checking two prefix possibilities */
if dx(i) in: ('1115') or fx(i) in: ('1146', '124') then output;
if dx(i) =: '1115' or prxmatch('/^1146/', fx(i)) then output;

 

 

 

 

Astounding
PROC Star

You've gotten plenty of good advice so far.  Let me add to that.

 

Once you fix the problem, you could easily be outputting the same observation multiple times.  It's at least theoretically possible that multiple diagnosis codes from the same observation will produce a match.  Here's a way to save the essential data so you can sort it out later.  This will give you the information needed to verify that the counts are coming out correctly.

data want;
	DROP  i;
	set have;
	array fx(12)  fx1-fx12;
	length diagnosis_category $ 4;
	do i=1 to 12;
		if  Fx(i) in: ('1115', '1146') then do;
		   diagnosis_code = Fx(i);
		   diagnosis_category = Fx(i);
 		   output;
        end;
	end;
run;
jimbarbour
Meteorite | Level 14

That's a good point that you could have multiple rows in your output if you get more than one match.  There's a reasonably simple way to make sure that you only get one row in you output per one row in your input:  Set a flag when you get a match and only perform an OUTPUT after all values have been checked.  

 

Putting it all together:

data	want;
	DROP	_:;
	set		have;
	array	fx(12)	$5	fx1 - fx12;
	Match	=	0;
	do	_i	=	1	to	12;
		if	Fx(_i) IN: ('1115' '1146')	THEN
			Match	=	1;
	end;
	IF	Match	THEN
		output;
run;

The above should give you a reasonably compact notation, the correct results including wildcard functionality, and one and only one output row for each input row.

 

The cool thing here is the colon operator.  The equals sign followed by the colon acts as a wildcard.  Notice also that I've changed my DROP statement such that it now contains a colon.  Coded this way, all variables starting with an underscore will be dropped without having to name them variable by variable.  In any program that has this DROP statement, all I have to do to make a variable temporary (i.e. not written out as part of the Data statement) is to prefix it with an underscore.  I find this bit of code very handy in a program with a lot of intermediate work variables.  A word of caution:  Some SAS procedure (e.g. Proc Compare) produce variables prefixed with an underscore.  If you want to keep those variables, you have to change to two underscores (DROP __:;) or something along those lines.

 

Jim

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 3079 views
  • 1 like
  • 4 in conversation