BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jeanderson
Fluorite | Level 6

In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.

Examples:

Good data: H2009, V7WK48

Bad data: HV2030, V5WK24

 

Code I thought would work:

data test2 wonkydata;
set test;

if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/

run;

However, both good and bad data are being output to wonkydata.

 

This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
error_prone
Barite | Level 11

Regular expressions aren't that difficult, if you can bend your mind in some or another way 😉

 

data want;
	set have;
	
	isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;
  • Expressions start and end with slash.
  • .* matches any char, 0 to n-times
  • \d matches a digit, {4,4} matches four digits
  • , the comma after the digit

View solution in original post

10 REPLIES 10
novinosrin
Tourmaline | Level 20

best approach is use regular expressions (PRX) pattern functions than find for your case. PRX syntax hurts my eyes/sinuses but I am sure somebody else will help you shortly with the syntax. Hang in there!

Reeza
Super User

% isn't a wildcard in the FIND statement, that works in SQL or WHERE statements. 

 

Your best bet is PRX or a regular expression but I find those difficult. 

I usually use CHAR() instead with a loop and like in this question:

https://stackoverflow.com/questions/48469465/check-specific-sequence-of-alphanumeric-string-in-sas/4...

 


@jeanderson wrote:

In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.

Examples:

Good data: H2009, V7WK48

Bad data: HV2030, V5WK24

 

Code I thought would work:

data test2 wonkydata;
set test;

if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/

run;

However, both good and bad data are being output to wonkydata.

 

This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!


 

Astounding
PROC Star

PRX functions are great ... but there is a learning curve.  Here's a way that uses simple enough tools that  you probably already have used before:

 

data test2 wonkydata;

set test;

if var4 =: 'H' and substr(var4, 6, 1) = ',' and (1000 <= input(substr(var4, 2, 4), ?? 4.) <= 9999) then output test2;

else output wonkydata;

run;

 

But what does it mean when you say that VAR4 begins with these character, yet there can be other text before or after this string?

jeanderson
Fluorite | Level 6

I meant that the condition I need to parse on begins with H and is followed by 4 integers and a comma, but the data is so inconsistent that I can't rely on it being the first characters in the field.

 

I may be able to use your solution but will have to add something to allow the H____, to start at any point in the string.

Astounding
PROC Star

While my code could be modified to look for the first "H" and perform the same inspection, that might not be enough.  You might have multiple sections of your string that begin with "H".  Since you have a sample PRX solution, I would go that route instead ... assuming you test it and find it is working for all the cases:

 

H at the beginning

H in the middle

multiple words that begin with H (especially when the first one is wonky but the second one is good)

 

********** EDITED:

 

One more thing to consider:  If "H" begins the last word within the text, will there be a comma after it?  Will the proposed solution work when there is no comma at the end of the line?

jeanderson
Fluorite | Level 6
Without the comma, it would be missing other data I need to extract from the field and would be wonky anyway.
error_prone
Barite | Level 11

Regular expressions aren't that difficult, if you can bend your mind in some or another way 😉

 

data want;
	set have;
	
	isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;
  • Expressions start and end with slash.
  • .* matches any char, 0 to n-times
  • \d matches a digit, {4,4} matches four digits
  • , the comma after the digit
jeanderson
Fluorite | Level 6

This sounds perfect. Is there any way I can use it in an if statement? If it fails, what does it return?

error_prone
Barite | Level 11

From the documentation: "Searches for a pattern match and returns the position at which the pattern is found." If the string is not found 0 is returned.

jeanderson
Fluorite | Level 6
This is absolutely perfect. Thank you so much!!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 5663 views
  • 0 likes
  • 5 in conversation