DATA Step, Macro, Functions and more

Single Character Wildcards in find function

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 8
Accepted Solution

Single Character Wildcards in find function

In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.

Examples:

Good data: H2009, V7WK48

Bad data: HV2030, V5WK24

 

Code I thought would work:

data test2 wonkydata;
set test;

if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/

run;

However, both good and bad data are being output to wonkydata.

 

This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!


Accepted Solutions
Solution
‎01-29-2018 03:40 PM
Regular Contributor
Posts: 213

Re: Single Character Wildcards in find function

Posted in reply to jeanderson

Regular expressions aren't that difficult, if you can bend your mind in some or another way ;-)

 

data want;
	set have;
	
	isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;
  • Expressions start and end with slash.
  • .* matches any char, 0 to n-times
  • \d matches a digit, {4,4} matches four digits
  • , the comma after the digit

View solution in original post


All Replies
PROC Star
Posts: 1,845

Re: Single Character Wildcards in find function

Posted in reply to jeanderson

best approach is use regular expressions (PRX) pattern functions than find for your case. PRX syntax hurts my eyes/sinuses but I am sure somebody else will help you shortly with the syntax. Hang in there!

Super User
Posts: 23,778

Re: Single Character Wildcards in find function

Posted in reply to jeanderson

% isn't a wildcard in the FIND statement, that works in SQL or WHERE statements. 

 

Your best bet is PRX or a regular expression but I find those difficult. 

I usually use CHAR() instead with a loop and like in this question:

https://stackoverflow.com/questions/48469465/check-specific-sequence-of-alphanumeric-string-in-sas/4...

 


jeanderson wrote:

In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.

Examples:

Good data: H2009, V7WK48

Bad data: HV2030, V5WK24

 

Code I thought would work:

data test2 wonkydata;
set test;

if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/

run;

However, both good and bad data are being output to wonkydata.

 

This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!


 

Super User
Posts: 6,785

Re: Single Character Wildcards in find function

[ Edited ]
Posted in reply to jeanderson

PRX functions are great ... but there is a learning curve.  Here's a way that uses simple enough tools that  you probably already have used before:

 

data test2 wonkydata;

set test;

if var4 =: 'H' and substr(var4, 6, 1) = ',' and (1000 <= input(substr(var4, 2, 4), ?? 4.) <= 9999) then output test2;

else output wonkydata;

run;

 

But what does it mean when you say that VAR4 begins with these character, yet there can be other text before or after this string?

Occasional Contributor
Posts: 8

Re: Single Character Wildcards in find function

Posted in reply to Astounding

I meant that the condition I need to parse on begins with H and is followed by 4 integers and a comma, but the data is so inconsistent that I can't rely on it being the first characters in the field.

 

I may be able to use your solution but will have to add something to allow the H____, to start at any point in the string.

Super User
Posts: 6,785

Re: Single Character Wildcards in find function

[ Edited ]
Posted in reply to jeanderson

While my code could be modified to look for the first "H" and perform the same inspection, that might not be enough.  You might have multiple sections of your string that begin with "H".  Since you have a sample PRX solution, I would go that route instead ... assuming you test it and find it is working for all the cases:

 

H at the beginning

H in the middle

multiple words that begin with H (especially when the first one is wonky but the second one is good)

 

********** EDITED:

 

One more thing to consider:  If "H" begins the last word within the text, will there be a comma after it?  Will the proposed solution work when there is no comma at the end of the line?

Occasional Contributor
Posts: 8

Re: Single Character Wildcards in find function

Posted in reply to Astounding
Without the comma, it would be missing other data I need to extract from the field and would be wonky anyway.
Solution
‎01-29-2018 03:40 PM
Regular Contributor
Posts: 213

Re: Single Character Wildcards in find function

Posted in reply to jeanderson

Regular expressions aren't that difficult, if you can bend your mind in some or another way ;-)

 

data want;
	set have;
	
	isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;
  • Expressions start and end with slash.
  • .* matches any char, 0 to n-times
  • \d matches a digit, {4,4} matches four digits
  • , the comma after the digit
Occasional Contributor
Posts: 8

Re: Single Character Wildcards in find function

Posted in reply to error_prone

This sounds perfect. Is there any way I can use it in an if statement? If it fails, what does it return?

Regular Contributor
Posts: 213

Re: Single Character Wildcards in find function

Posted in reply to jeanderson

From the documentation: "Searches for a pattern match and returns the position at which the pattern is found." If the string is not found 0 is returned.

Occasional Contributor
Posts: 8

Re: Single Character Wildcards in find function

Posted in reply to error_prone
This is absolutely perfect. Thank you so much!!
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 352 views
  • 0 likes
  • 5 in conversation