Re: data extraction from text data

shahm · Posted 09-04-2020 09:53 AM

Good Morning,

I have a column with text data (comments) and I would like to extract information from it. e.g. "Several documents for shipments of UN1791 missing date of shipment". I would like to extract UN1791. Also, the comments can have a space eg. UN 1791 and can have more than one UN numbers mentioned in a single comment eg. "please correct the following: UN 1202, UN 1992, UN 2794"

I need to extract only UN numbers from the comments.

my sample code is basic (I am a new to SAS less than 6 months):

data UNmatches;
set work.un;
UN_matches=index(violation_comments_txt, 'UN');
run;

I understand that this code gives me the position where UN appears in the comments.

But I need to get the numbers.

Any help is greatly appreciated.

Maitri

s_lassen · Posted 09-04-2020 11:13 AM

One possibility is to use Pearl regular expressions (PRX):

data have;
  infile cards truncover;
  input str $100.;
cards;
giffen gaffen guffen UN 3042 pilaffen dyp UN4444 tititi UN 1111
Several documents for shipments of UN1791 missing date of shipment
;run;

data want;
  set have;
  prxid=prxparse('/\bUN\s*\d{4}\b/');
  start=1;
  stop=length(str);
  length UN_no $6;
  do while(1);
    call prxnext(prxid,start,stop,str,pos,len);
    if pos=0 then leave;
    UN_no=compress(substr(str,pos,len));
    output;
    end;
  drop start stop pos len prxid;
run;

An explanation of the PRX expression (the parameter to PRXPARSE):

\b is a word boundary
\s* is zero or more whitespace characters (e.g. blanks)
\d{4} is 4 digits

shahm · Posted 09-04-2020 11:18 AM

Thank you so much for your reply! I will read up on PRX, apply the provided solution and inform you how it goes.

shahm · Posted 09-04-2020 01:52 PM

Hi Again,

I have 280 comments in column. Is there a way to not include them in the data step?
Is it possible to use the data, set, run commands in place of data, infile, input cards?
Also, the length of the comments column is 2522. Lastly, Is giffen gaffen guffen your comment:) Pretty innovative, I must say!

Tom · Posted 09-05-2020 04:40 PM

I have 280 comments in column. Is there a way to not include them in the data step?

Are you just talking about the data step in the answer(s) that is being used to create an example source dataset? If so then skip that step and pull in your existing dataset with a SET statement instead.

If instead you mean that you want to exclude observations being read in from your existing dataset please explain the selection logic.

shahm · Posted 09-08-2020 09:33 AM

Hello,
What I meant is I don't want copy paste the 280 comments in the example source dataset. So it seems I can put a few comments in the example source dataset and continue. Is that correct?

Sincerely,
Maitri

Shmuel · Posted 09-05-2020 03:23 PM

Based on @s_lassen's code, using PRX function, you may prefer next code with its output result:

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want1;
 set have;
     length UN_match $50 txt1 $6;
     regexid = prxparse('/\bUN\s*\d{4}\b/');
     pos1=1; 
     posend = length(comment);
     UN_match = ' ';
     do while (1);
        call prxnext(regexid,pos1,posend,comment,pos,lent);
        if pos=0 then leave;
        txt1 = compress(substr(comment,pos,lent));
        *put pos= txt1=;
	    UN_match = catx(' ',UN_match, txt1);
	 end;
	 *putlog UN_match= comment=;
	 keep UN_match comment;
run;

If you don't need the comment in the output remove it from the keep statement.

Shmuel · Posted 09-04-2020 06:23 PM

Try next tested code

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want(keep=comment UN_match);
 set have;
     length UN_match $7;
     text = upcase(compbl(comment));
	 pos = index(text,'UN');
	 if pos = 0 then delete; else
	 if pos > 1 then do; 
	    text = substr(text,pos); 
		pos=1; 
	 end;
	 do until (pos=0);
	    if input(substr(text,pos+2,5),?? best5.) ne . then do;
		    if substr(text,pos+2,1)=' ' 
			   then unlen=7;
			   else unlen=6;
			UN_match = substr(text,pos,unlen);
			output;
			text = substr(text,pos+unlen);
		 end;
		 else text = substr(text,pos+3); * skip UNx *;
		 if length(strip(text)) < 6 then leave;
	     pos = index(text,'UN');
	 end;
run;

It will be useful if you provide more sample data in the first step.

The code need be adapted to maximum length of the comment.

Do you need the comment in the output?

You may prefer sort the result by UN_match with NODUPKEY option.

The code will skip any word starting with 'UN' and not followed by numbers (like UNxyzsd / UN abcd )

shahm · Posted 09-08-2020 09:29 AM

Hello,

Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.

Sincerely,
Maitri

Shmuel · Posted 09-08-2020 10:29 AM

@shahm wrote:
Hello,

Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.

Sincerely,
Maitri

Have you tried the solution posted by @s_lassen ?

You just need to add the COMMENT to the drop statement at the bottom of the program

mkeintz · Posted 09-05-2020 06:27 PM

I think you can take advantage of the fact that, say, compress("UN 1234") =compress("UN1234 "), so you can always take 7 characters at a time when you find a "UN". Using the sample data set provided by @Shmuel, consider this code:

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want (drop=_:);
  set have;
  length un_match $50;

  _c=findw(comment,'UN',' 0123456789');
  if f^=0 then do until(_c_increment=0);  
	un_match=catx(',',un_match,compress(substr(comment,_c,7)));
	_c_increment=findw(substr(comment,_c+1),'UN',' 0123456789');
	_c=_c+_c_increment;
  end;
run;

The other useful tool here is telling the FINDW function that not only blanks are word separators, but so is any digit. So the findw function above will locate a UN followed by a blank or a digit (and preceded by any of them too). This would allow you to skip a string like "UNITED", while capturing both variation of UN codes that you identify.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away