- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.
Examples:
Good data: H2009, V7WK48
Bad data: HV2030, V5WK24
Code I thought would work:
data test2 wonkydata;
set test;
if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/
run;
However, both good and bad data are being output to wonkydata.
This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Regular expressions aren't that difficult, if you can bend your mind in some or another way 😉
data want;
set have;
isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;
- Expressions start and end with slash.
- .* matches any char, 0 to n-times
- \d matches a digit, {4,4} matches four digits
- , the comma after the digit
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
best approach is use regular expressions (PRX) pattern functions than find for your case. PRX syntax hurts my eyes/sinuses but I am sure somebody else will help you shortly with the syntax. Hang in there!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
% isn't a wildcard in the FIND statement, that works in SQL or WHERE statements.
Your best bet is PRX or a regular expression but I find those difficult.
I usually use CHAR() instead with a loop and like in this question:
@jeanderson wrote:
In a data step (Base SAS 9.4), I am trying to parse my good data from my bad data. One condition of good data is that var4 contains a string that starts with H then is followed by 4 integers and a comma. Other text may be before or after this 6-character string.
Examples:
Good data: H2009, V7WK48
Bad data: HV2030, V5WK24
Code I thought would work:
data test2 wonkydata;
set test;if find(var4,'%H____,%')=0 then output wonkydata; /*Remove data in wrong format for extracting ID*/
run;
However, both good and bad data are being output to wonkydata.
This is my first time posting a question here, so if I need to provide more info please let me know. Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
PRX functions are great ... but there is a learning curve. Here's a way that uses simple enough tools that you probably already have used before:
data test2 wonkydata;
set test;
if var4 =: 'H' and substr(var4, 6, 1) = ',' and (1000 <= input(substr(var4, 2, 4), ?? 4.) <= 9999) then output test2;
else output wonkydata;
run;
But what does it mean when you say that VAR4 begins with these character, yet there can be other text before or after this string?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I meant that the condition I need to parse on begins with H and is followed by 4 integers and a comma, but the data is so inconsistent that I can't rely on it being the first characters in the field.
I may be able to use your solution but will have to add something to allow the H____, to start at any point in the string.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
While my code could be modified to look for the first "H" and perform the same inspection, that might not be enough. You might have multiple sections of your string that begin with "H". Since you have a sample PRX solution, I would go that route instead ... assuming you test it and find it is working for all the cases:
H at the beginning
H in the middle
multiple words that begin with H (especially when the first one is wonky but the second one is good)
********** EDITED:
One more thing to consider: If "H" begins the last word within the text, will there be a comma after it? Will the proposed solution work when there is no comma at the end of the line?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Regular expressions aren't that difficult, if you can bend your mind in some or another way 😉
data want;
set have;
isGood = prxmatch('/.*H\d{4,4},.*/', trim(var4));
run;
- Expressions start and end with slash.
- .* matches any char, 0 to n-times
- \d matches a digit, {4,4} matches four digits
- , the comma after the digit
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This sounds perfect. Is there any way I can use it in an if statement? If it fails, what does it return?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
From the documentation: "Searches for a pattern match and returns the position at which the pattern is found." If the string is not found 0 is returned.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content