SAS Programming

jeanderson · Posted 01-29-2018 02:42 PM

In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.

Examples:

Good data: H2009, V7WK48

Bad data: HV2030, V5WK24

Code I thought would work:

data test2 wonkydata;
set test;

if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/

run;

However, both good and bad data are being output to wonkydata.

This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!

error_prone · Posted 01-29-2018 03:20 PM

Regular expressions aren't that difficult, if you can bend your mind in some or another way 😉

data want;
	set have;
	
	isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;

Expressions start and end with slash.
.* matches any char, 0 to n-times
\d matches a digit, {4,4} matches four digits
, the comma after the digit

View solution in original post

novinosrin · Posted 01-29-2018 02:53 PM

best approach is use regular expressions (PRX) pattern functions than find for your case. PRX syntax hurts my eyes/sinuses but I am sure somebody else will help you shortly with the syntax. Hang in there!

Reeza · Posted 01-29-2018 02:54 PM

% isn't a wildcard in the FIND statement, that works in SQL or WHERE statements.

Your best bet is PRX or a regular expression but I find those difficult.

I usually use CHAR() instead with a loop and like in this question:

https://stackoverflow.com/questions/48469465/check-specific-sequence-of-alphanumeric-string-in-sas/4...

@jeanderson wrote:

In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.

Examples:

Good data: H2009, V7WK48

Bad data: HV2030, V5WK24

Code I thought would work:

data test2 wonkydata;
set test;

if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/

run;

However, both good and bad data are being output to wonkydata.

This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!

Astounding · Posted 01-29-2018 03:04 PM

PRX functions are great ... but there is a learning curve. Here's a way that uses simple enough tools that you probably already have used before:

data test2 wonkydata;

set test;

if var4 =: 'H' and substr(var4, 6, 1) = ',' and (1000 <= input(substr(var4, 2, 4), ?? 4.) <= 9999) then output test2;

else output wonkydata;

run;

But what does it mean when you say that VAR4 begins with these character, yet there can be other text before or after this string?

jeanderson · Posted 01-29-2018 03:19 PM

I meant that the condition I need to parse on begins with H and is followed by 4 integers and a comma, but the data is so inconsistent that I can't rely on it being the first characters in the field.

I may be able to use your solution but will have to add something to allow the H____, to start at any point in the string.

Astounding · Posted 01-29-2018 03:25 PM

While my code could be modified to look for the first "H" and perform the same inspection, that might not be enough. You might have multiple sections of your string that begin with "H". Since you have a sample PRX solution, I would go that route instead ... assuming you test it and find it is working for all the cases:

H at the beginning

H in the middle

multiple words that begin with H (especially when the first one is wonky but the second one is good)

********** EDITED:

One more thing to consider: If "H" begins the last word within the text, will there be a comma after it? Will the proposed solution work when there is no comma at the end of the line?

jeanderson · Posted 01-29-2018 03:41 PM

Without the comma, it would be missing other data I need to extract from the field and would be wonky anyway.

error_prone · Posted 01-29-2018 03:20 PM

Regular expressions aren't that difficult, if you can bend your mind in some or another way 😉

data want;
	set have;
	
	isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;

Expressions start and end with slash.
.* matches any char, 0 to n-times
\d matches a digit, {4,4} matches four digits
, the comma after the digit

jeanderson · Posted 01-29-2018 03:25 PM

This sounds perfect. Is there any way I can use it in an if statement? If it fails, what does it return?

error_prone · Posted 01-29-2018 03:28 PM

From the documentation: "Searches for a pattern match and returns the position at which the pattern is found." If the string is not found 0 is returned.

jeanderson · Posted 01-29-2018 03:42 PM

This is absolutely perfect. Thank you so much!!

SAS Programming

Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Re: Single Character Wildcards in find function

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...