BookmarkSubscribeRSS Feed
shahm
Fluorite | Level 6

Good Morning,

I have a column with text data (comments) and I would like to extract information from it. e.g. "Several documents for shipments of UN1791 missing date of shipment". I would like to extract UN1791. Also, the comments can have a space eg. UN 1791 and can have more than one UN numbers mentioned in a single comment eg. "please correct the following: UN 1202, UN 1992, UN 2794"

I need to extract only UN numbers from the comments.

 

my sample code is basic (I am a new to  SAS less than 6 months):

data UNmatches;
set work.un;
UN_matches=index(violation_comments_txt, 'UN');
run;

 

I understand that this code gives me the position where UN appears in the comments.

But I need to get the numbers.

 

Any help is greatly appreciated. 

Maitri

 

10 REPLIES 10
s_lassen
Meteorite | Level 14

One possibility is to use Pearl regular expressions (PRX):

data have;
  infile cards truncover;
  input str $100.;
cards;
giffen gaffen guffen UN 3042 pilaffen dyp UN4444 tititi UN 1111
Several documents for shipments of UN1791 missing date of shipment
;run;

data want;
  set have;
  prxid=prxparse('/\bUN\s*\d{4}\b/');
  start=1;
  stop=length(str);
  length UN_no $6;
  do while(1);
    call prxnext(prxid,start,stop,str,pos,len);
    if pos=0 then leave;
    UN_no=compress(substr(str,pos,len));
    output;
    end;
  drop start stop pos len prxid;
run;

An explanation of the PRX expression (the parameter to PRXPARSE):

  • \b is a word boundary
  • \s* is zero or more whitespace characters (e.g. blanks)
  • \d{4} is 4 digits
shahm
Fluorite | Level 6
Thank you so much for your reply! I will read up on PRX, apply the provided solution and inform you how it goes.
shahm
Fluorite | Level 6
Hi Again,

I have 280 comments in column. Is there a way to not include them in the data step?
Is it possible to use the data, set, run commands in place of data, infile, input cards?
Also, the length of the comments column is 2522. Lastly, Is giffen gaffen guffen your comment:) Pretty innovative, I must say!
Tom
Super User Tom
Super User

I have 280 comments in column. Is there a way to not include them in the data step?

Are you just talking about the data step in the answer(s) that is being used to create an example source dataset?  If so then skip that step and pull in your existing dataset with a SET statement instead.

 

If instead you mean that you want to exclude observations being read in from your existing dataset please explain the selection logic.

shahm
Fluorite | Level 6
Hello,
What I meant is I don't want copy paste the 280 comments in the example source dataset. So it seems I can put a few comments in the example source dataset and continue. Is that correct?

Sincerely,
Maitri
Shmuel
Garnet | Level 18

Based on @s_lassen's code, using PRX function, you may prefer next code with its output result:

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want1;
 set have;
     length UN_match $50 txt1 $6;
     regexid = prxparse('/\bUN\s*\d{4}\b/');
     pos1=1; 
     posend = length(comment);
     UN_match = ' ';
     do while (1);
        call prxnext(regexid,pos1,posend,comment,pos,lent);
        if pos=0 then leave;
        txt1 = compress(substr(comment,pos,lent));
        *put pos= txt1=;
	    UN_match = catx(' ',UN_match, txt1);
	 end;
	 *putlog UN_match= comment=;
	 keep UN_match comment;
run;	 

If you don't need the comment in the output remove it from the keep statement.

 

 

Shmuel
Garnet | Level 18

Try next tested code

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want(keep=comment UN_match);
 set have;
     length UN_match $7;
     text = upcase(compbl(comment));
	 pos = index(text,'UN');
	 if pos = 0 then delete; else
	 if pos > 1 then do; 
	    text = substr(text,pos); 
		pos=1; 
	 end;
	 do until (pos=0);
	    if input(substr(text,pos+2,5),?? best5.) ne . then do;
		    if substr(text,pos+2,1)=' ' 
			   then unlen=7;
			   else unlen=6;
			UN_match = substr(text,pos,unlen);
			output;
			text = substr(text,pos+unlen);
		 end;
		 else text = substr(text,pos+3); * skip UNx *;
		 if length(strip(text)) < 6 then leave;
	     pos = index(text,'UN');
	 end;
run;
		

It will be useful if you provide more sample data in the first step.

The code need be adapted to maximum length of the comment.

 

Do you need the comment in the output?

You may prefer sort the result by UN_match with NODUPKEY  option.

The code will skip any word starting with 'UN' and not followed by numbers (like UNxyzsd / UN abcd )

shahm
Fluorite | Level 6
Hello,

Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.

Sincerely,
Maitri
Shmuel
Garnet | Level 18

@shahm wrote:
Hello,

Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.

Sincerely,
Maitri

Have you tried the solution posted by @s_lassen ?

You just need to add the COMMENT to the drop statement at the bottom of the program

mkeintz
PROC Star

 

I think you can take advantage of the fact that, say,  compress("UN 1234") =compress("UN1234 "), so you can always take 7 characters at a time when you find a "UN".  Using the sample data set provided by @Shmuel, consider this code:

 

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want (drop=_:);
  set have;
  length un_match $50;

  _c=findw(comment,'UN',' 0123456789');
  if f^=0 then do until(_c_increment=0);  
	un_match=catx(',',un_match,compress(substr(comment,_c,7)));
	_c_increment=findw(substr(comment,_c+1),'UN',' 0123456789');
	_c=_c+_c_increment;
  end;
run;

The other useful tool here is telling the FINDW function that not only blanks are word separators, but so is any digit.  So the findw function above will locate a UN followed by a blank or a digit (and preceded by any of them too).  This would allow you to skip a string like "UNITED", while capturing both variation of UN codes that you identify.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1292 views
  • 2 likes
  • 5 in conversation