Good Morning,
I have a column with text data (comments) and I would like to extract information from it. e.g. "Several documents for shipments of UN1791 missing date of shipment". I would like to extract UN1791. Also, the comments can have a space eg. UN 1791 and can have more than one UN numbers mentioned in a single comment eg. "please correct the following: UN 1202, UN 1992, UN 2794"
I need to extract only UN numbers from the comments.
my sample code is basic (I am a new to SAS less than 6 months):
data UNmatches;
set work.un;
UN_matches=index(violation_comments_txt, 'UN');
run;
I understand that this code gives me the position where UN appears in the comments.
But I need to get the numbers.
Any help is greatly appreciated.
Maitri
One possibility is to use Pearl regular expressions (PRX):
data have;
infile cards truncover;
input str $100.;
cards;
giffen gaffen guffen UN 3042 pilaffen dyp UN4444 tititi UN 1111
Several documents for shipments of UN1791 missing date of shipment
;run;
data want;
set have;
prxid=prxparse('/\bUN\s*\d{4}\b/');
start=1;
stop=length(str);
length UN_no $6;
do while(1);
call prxnext(prxid,start,stop,str,pos,len);
if pos=0 then leave;
UN_no=compress(substr(str,pos,len));
output;
end;
drop start stop pos len prxid;
run;
An explanation of the PRX expression (the parameter to PRXPARSE):
I have 280 comments in column. Is there a way to not include them in the data step?
Are you just talking about the data step in the answer(s) that is being used to create an example source dataset? If so then skip that step and pull in your existing dataset with a SET statement instead.
If instead you mean that you want to exclude observations being read in from your existing dataset please explain the selection logic.
Based on @s_lassen's code, using PRX function, you may prefer next code with its output result:
data have;
input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;
data want1;
set have;
length UN_match $50 txt1 $6;
regexid = prxparse('/\bUN\s*\d{4}\b/');
pos1=1;
posend = length(comment);
UN_match = ' ';
do while (1);
call prxnext(regexid,pos1,posend,comment,pos,lent);
if pos=0 then leave;
txt1 = compress(substr(comment,pos,lent));
*put pos= txt1=;
UN_match = catx(' ',UN_match, txt1);
end;
*putlog UN_match= comment=;
keep UN_match comment;
run;
If you don't need the comment in the output remove it from the keep statement.
Try next tested code
data have;
input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;
data want(keep=comment UN_match);
set have;
length UN_match $7;
text = upcase(compbl(comment));
pos = index(text,'UN');
if pos = 0 then delete; else
if pos > 1 then do;
text = substr(text,pos);
pos=1;
end;
do until (pos=0);
if input(substr(text,pos+2,5),?? best5.) ne . then do;
if substr(text,pos+2,1)=' '
then unlen=7;
else unlen=6;
UN_match = substr(text,pos,unlen);
output;
text = substr(text,pos+unlen);
end;
else text = substr(text,pos+3); * skip UNx *;
if length(strip(text)) < 6 then leave;
pos = index(text,'UN');
end;
run;
It will be useful if you provide more sample data in the first step.
The code need be adapted to maximum length of the comment.
Do you need the comment in the output?
You may prefer sort the result by UN_match with NODUPKEY option.
The code will skip any word starting with 'UN' and not followed by numbers (like UNxyzsd / UN abcd )
@shahm wrote:
Hello,
Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.
Sincerely,
Maitri
Have you tried the solution posted by @s_lassen ?
You just need to add the COMMENT to the drop statement at the bottom of the program
I think you can take advantage of the fact that, say, compress("UN 1234") =compress("UN1234 "), so you can always take 7 characters at a time when you find a "UN". Using the sample data set provided by @Shmuel, consider this code:
data have;
input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;
data want (drop=_:);
set have;
length un_match $50;
_c=findw(comment,'UN',' 0123456789');
if f^=0 then do until(_c_increment=0);
un_match=catx(',',un_match,compress(substr(comment,_c,7)));
_c_increment=findw(substr(comment,_c+1),'UN',' 0123456789');
_c=_c+_c_increment;
end;
run;
The other useful tool here is telling the FINDW function that not only blanks are word separators, but so is any digit. So the findw function above will locate a UN followed by a blank or a digit (and preceded by any of them too). This would allow you to skip a string like "UNITED", while capturing both variation of UN codes that you identify.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.