Hi SAS Experts, I have the string below and I am trying to extract the number shown in the desired output. The format is not consistent, but I would like to extract whatever follows "reference # ", up to the end of the number, as shown below.
String:
"Send email to reference, Mailbox..* Email Subject Line - reference Follow-Up....Body of email (complete template below).... ....student Name.... James GENEVIEVE...Student Phone #.... 12345712457 best, 5458712458....reference #.... 012000222.1.... .... ONFIDENTIALITY NOTICE..This communication may contain privileged or confidential information. If you are not the intended recipient or received this communication by error, please notify the sender and delete the message without copying or disclosing it. Thank you..... .... .."
Desired output:
"012000222.1"
Thank you in advance for your help.
A regular expression could help. What exactly will work depends on the details/variations in your actual data.
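data have;
  infile datalines truncover;
  length string $2000;
  retain string;
  input;
  /* append each input line to the growing string */
  string=catx(' ',string,_infile_);
  /* the sample has 10 data lines; output once all are read */
  if _n_=10 then output;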
datalines;
Send email to reference, Mailbox..*
Email Subject Line - reference Follow-Up....
Body of email (complete template below).... ....
student Name.... James GENEVIEVE...Student Phone #....
12345712457 best, 5458712458....reference #.... 012000222.1.... ....
ONFIDENTIALITY NOTICE..This communication may contain privileged or
confidential information. If you are not the intended recipient or
received this communication by error, please notify the sender
and delete the message without copying or disclosing it.
Thank you..... .... ..
;
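data want;
  set have;
  length want_str $32;
  /* o = compile the pattern once, i = ignore case */
  _prxid=prxparse('/reference #[^#\d]*(\d+\.?\d*)/oi');
  /* prxposn returns capture group 1 of the match just found */
  if prxmatch(_prxid,trim(string))>0 then
    want_str=prxposn(_prxid,1,trim(string));
run;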
proc print data=want;
var want_str;
run;
Depending on the exact requirement, another option:
data WANT;
LINE="Send email to reference, Mailbox..* Email Subject Line - reference Follow-Up....Body of email (complete template below).... ....student Name.... James GENEVIEVE...Student Phone #.... 12345712457 best, 5458712458....reference #.... 012000222.1.... .... ONFIDENTIALITY NOTICE..This communication may contain privileged or confidential information. If you are not the intended recipient or received this communication by error, please notify the sender and delete the message without copying or disclosing it. Thank you..... .... ..";
STR=prxchange('s/' %* substitution requested;
||'.*' %* match anything;
||'reference #' %* then reference space hash;
||'[^\d]*' %* then more optional text except digits ;
||'(\d+\.?\d*)' %* then digits including an optional embedded dot <= capture this;
||'.*' %* then the rest of the string;
||'/\1' %* replace all with captured group ;
||'/',1,LINE);
putlog STR=;
run;
STR=012000222.1
Using prxchange() is what I tried first as well, but then I realized that you end up with the source string if there is no match.
@Patrick True
if STR=LINE then STR=' ';
Not saying this is better. Just another option. 🙂
Regular expressions are expensive, so I prefer to use them just once if possible.
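For anyone following along, here is a minimal sketch of the no-match guard discussed above. The two test values of LINE are made up for illustration, and the pattern is a condensed form of the one in the prxchange() example, so adjust both to your data.

```sas
data _null_;
  length LINE STR $200;
  /* first value contains a reference number, second one does not */
  do LINE = 'reference #.... 012000222.1.... ....', 'no ref number in this line';
    /* prxchange returns the source unchanged when the pattern does not match */
    STR = prxchange('s/.*reference #\D*(\d+\.?\d*).*/\1/', 1, LINE);
    /* guard: an unchanged result means there was no match, so blank it */
    if STR = LINE then STR = ' ';
    putlog LINE= STR=;
  end;
run;
```

The second value has no "reference #", so prxchange() hands back the source string and the guard blanks STR.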
> Doesn’t prxposn() just retrieve the capture buffer without parsing the source string again?
You are right. Your solution is actually more efficient. I had never used this function before, but I'll keep it in mind now.
Thanks for the heads up. 🙂
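A minimal sketch of that single-parse approach, assuming the same kind of pattern as above: prxparse() compiles the expression once, prxmatch() runs it, and prxposn() then only reads capture buffer 1 from that match. The names PRXID and REF and the sample values of LINE are made up for illustration.

```sas
data _null_;
  length LINE $200 REF $32;
  /* compile the pattern once; i = ignore case */
  PRXID = prxparse('/reference #\D*(\d+\.?\d*)/i');
  do LINE = 'Reference #.... 012000222.1.... ....', 'no ref number in this line';
    REF = ' ';
    /* prxmatch runs the regex; prxposn reads capture group 1 of that match */
    if prxmatch(PRXID, LINE) > 0 then REF = prxposn(PRXID, 1, LINE);
    putlog LINE= REF=;
  end;
run;
```

With this form REF simply stays blank when nothing matches, so no comparison against the source string is needed.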
> a reference page
This syntax is called a regular expression, and there are tons of didactic resources online. That's how I learnt. 🙂