I am a newbie to Perl discussions and am hoping that someone can help me with this question. I am trying to identify and remove patient IDs in a free-form text string using a Perl regular expression. The conditions are as follows:
1. The ids are always 8 bytes and can contain letters (upper or lowercase) and numbers.
2. The ids will always contain at least 1 digit.
Here's the expression I tried to build but I end up picking all 8 letter words. The negated class does not seem to work either.
Here's my code:
id_re=prxparse('s/\b[a-zA-Z0-9]{8}\b/ &id removed& /');
Any ideas on how I can find only the 8-byte ids?
I crossposted your request on a similar forum (i.e., SAS-L), and a friend and SAS/Perl expert (i.e., Toby Dunn) offered the following solution to the problem you raised:
data have;
length stuff $ 80;
input Stuff & ;
cards;
Now is the time for all good men and women
to come to the gh5567AA aid of their party
Or, was it 4567890 or 45678901 that caused
the problems problems.
4567890
45678901
1234_678
1234/678
ABC4EFGH
;
data want;
set Have ;
stuff2 = PrxChange( 's/(?=\b[A-Z0-9]{8}\b)\b[A-Z0-9]*\d[A-Z0-9]*\b//oi' , -1 , Stuff ) ;
run;
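For anyone who wants to check this pattern outside SAS, here is a minimal sketch in Python rather than SAS. It assumes Python's `re` module handles the lookahead and `\b` the same way for these inputs. The function name `remove_ids` is made up for illustration, and the SAS/Perl `o` flag is dropped because Python has no equivalent.

```python
import re

# Pattern from the reply above. The "i" flag becomes re.IGNORECASE;
# the "o" (compile once) flag has no Python counterpart and is omitted.
ID_RE = re.compile(r"(?=\b[A-Z0-9]{8}\b)\b[A-Z0-9]*\d[A-Z0-9]*\b", re.IGNORECASE)

def remove_ids(text):
    """Delete every 8-character letter/digit token that contains a digit."""
    return ID_RE.sub("", text)

# Sample lines taken from the data step above.
samples = [
    "to come to the gh5567AA aid of their party",
    "Or, was it 4567890 or 45678901 that caused",
    "the problems problems.",
    "1234_678",
    "1234/678",
    "ABC4EFGH",
]

for line in samples:
    print(repr(line), "->", repr(remove_ids(line)))
```

The lookahead checks that the token is exactly eight letters or digits between word boundaries, and the second half insists on at least one digit, so a plain 8-letter word such as "problems" is left alone. Note that the spaces on either side of a removed ID remain in the text.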
I'm just starting to learn regular expressions, thus can't be of much help.
I think the expression you want is: ^(?=[a-zA-Z0-9]*\d)[a-zA-Z0-9]{8}$
That would match an eight-character string that contains only letters and/or numbers, including at least one number.
Unfortunately, I don't know how to implement it.
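One possible way to put an anchored pattern like this to work is to test one whitespace-separated token at a time, since `^` and `$` only make sense against a whole token. The sketch below is Python rather than SAS, with the body of the pattern limited to letters and digits. The names `is_patient_id` and `redact` and the `*REDACTED*` marker are chosen for illustration only.

```python
import re

# Anchored pattern: the lookahead requires a digit somewhere in the token,
# and the body allows exactly eight ASCII letters or digits.
TOKEN_RE = re.compile(r"^(?=[a-zA-Z0-9]*\d)[a-zA-Z0-9]{8}$")

def is_patient_id(token):
    """True if the whole token looks like an 8-character ID with a digit."""
    return TOKEN_RE.match(token) is not None

def redact(text, marker="*REDACTED*"):
    """Replace ID-like tokens with a marker, splitting on whitespace."""
    return " ".join(marker if is_patient_id(tok) else tok for tok in text.split())

print(redact("This patent ID abcdef01 should be removed."))
print(redact("This is NOT a patent ID abcdefgh and should remain."))
```

Splitting on whitespace collapses runs of spaces, and an ID with punctuation attached (for example `abcdef01.`) would not be matched by this sketch.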
Try this:
data ID_REDACTED;
infile datalines truncover;
input text1 $100.;
text2 = prxchange('s/\s\w1\d+\s/ *REDACTED* /', -1, text1);
datalines;
This patent ID abcdef01 should be removed.
This is NOT a patent ID abcdefgh and should remain.
This is NOT a patent ID 123455 and should remain.
;
run;
Message was edited by: Mark Jordan to improve the Regular Expression
I don't know if the following is quite what you had in mind, but it does appear to eliminate the unwanted IDs:
data have;
length stuff $20;
input;
i=1;
do until (scan(_infile_,i," ") eq "");
stuff=scan(_infile_,i," ");
i+1;
output;
end;
cards;
Now is the time for all good men and women
to come to the GG5567AA aid of their party
Or, was it 4567890 or 45678901 that caused
the problems.
4567890
45678901
ABC4EFGH
;
data want notwant;
set have;
want = PrxChange( 's/^\b(?=[A-Z0-9]*\d).{8,8}\b//io' ,
-1 , Stuff );
if want ne "" then output want;
else output notwant;
run;
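One thing to watch with the token-wise pattern above: `.{8,8}` accepts any eight characters, so a token such as `1234/678` also qualifies once the lookahead has found a digit. The Python sketch below (not SAS) compares that pattern with a stricter variant. The stricter pattern, the helper name `flagged`, and the two extra test tokens containing `/` and `_` are assumptions added for illustration.

```python
import re

# Pattern as used in the data step above (the "o" flag has no Python equivalent).
LOOSE_RE = re.compile(r"^\b(?=[A-Z0-9]*\d).{8,8}\b", re.IGNORECASE)

# Hypothetical stricter variant: exactly eight letters or digits, nothing else.
STRICT_RE = re.compile(r"^(?=[A-Z0-9]*\d)[A-Z0-9]{8}$", re.IGNORECASE)

def flagged(pattern, tokens):
    """Return the tokens that the given pattern treats as IDs."""
    return [tok for tok in tokens if pattern.search(tok)]

tokens = ["GG5567AA", "4567890", "45678901", "problems.",
          "ABC4EFGH", "1234/678", "1234_678"]

print("loose: ", flagged(LOOSE_RE, tokens))
print("strict:", flagged(STRICT_RE, tokens))
```

Whether the looser behavior matters depends on what else can appear in the free text.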
I find this page to be quite helpful for building and testing regexes: http://gskinner.com/RegExr/ (uses Flash).
Sometimes SAS needs a few adjustments to work, either for non-standard stuff or for its own quirks, but it's a good site nonetheless. There are a lot of examples, and the tool will help show you the meanings of the different codes.
Thank you so much!! This seems to do it!
Thanks to everyone who replied - I have learnt a lot from all of your suggestions!