topic Re: data extraction from text data in SAS Programming

data extraction from text data

shahm — Fri, 04 Sep 2020 13:53:19 GMT

Good Morning,

I have a column with text data (comments) and I would like to extract information from it. e.g. "Several documents for shipments of UN1791 missing date of shipment". I would like to extract UN1791. Also, the comments can have a space eg. UN 1791 and can have more than one UN numbers mentioned in a single comment eg. "please correct the following: UN 1202, UN 1992, UN 2794"

I need to extract only UN numbers from the comments.

my sample code is basic (I am a new to SAS less than 6 months):

data UNmatches;
set work.un;
UN_matches=index(violation_comments_txt, 'UN');
run;

I understand that this code gives me the position where UN appears in the comments.

But I need to get the numbers.

Any help is greatly appreciated.

Maitri

Re: data extraction from text data

s_lassen — Fri, 04 Sep 2020 15:13:07 GMT

One possibility is to use Pearl regular expressions (PRX):

data have;
  infile cards truncover;
  input str $100.;
cards;
giffen gaffen guffen UN 3042 pilaffen dyp UN4444 tititi UN 1111
Several documents for shipments of UN1791 missing date of shipment
;run;

data want;
  set have;
  prxid=prxparse('/\bUN\s*\d{4}\b/');
  start=1;
  stop=length(str);
  length UN_no $6;
  do while(1);
    call prxnext(prxid,start,stop,str,pos,len);
    if pos=0 then leave;
    UN_no=compress(substr(str,pos,len));
    output;
    end;
  drop start stop pos len prxid;
run;

An explanation of the PRX expression (the parameter to PRXPARSE):

\b is a word boundary
\s* is zero or more whitespace characters (e.g. blanks)
\d{4} is 4 digits

Re: data extraction from text data

shahm — Fri, 04 Sep 2020 15:18:22 GMT

Thank you so much for your reply! I will read up on PRX, apply the provided solution and inform you how it goes.

Re: data extraction from text data

shahm — Fri, 04 Sep 2020 17:52:57 GMT

Hi Again,

I have 280 comments in column. Is there a way to not include them in the data step?
Is it possible to use the data, set, run commands in place of data, infile, input cards?
Also, the length of the comments column is 2522. Lastly, Is giffen gaffen guffen your comment:) Pretty innovative, I must say!

Re: data extraction from text data

Shmuel — Fri, 04 Sep 2020 22:23:35 GMT

Try next tested code

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want(keep=comment UN_match);
 set have;
     length UN_match $7;
     text = upcase(compbl(comment));
	 pos = index(text,'UN');
	 if pos = 0 then delete; else
	 if pos > 1 then do; 
	    text = substr(text,pos); 
		pos=1; 
	 end;
	 do until (pos=0);
	    if input(substr(text,pos+2,5),?? best5.) ne . then do;
		    if substr(text,pos+2,1)=' ' 
			   then unlen=7;
			   else unlen=6;
			UN_match = substr(text,pos,unlen);
			output;
			text = substr(text,pos+unlen);
		 end;
		 else text = substr(text,pos+3); * skip UNx *;
		 if length(strip(text)) < 6 then leave;
	     pos = index(text,'UN');
	 end;
run;

It will be useful if you provide more sample data in the first step.

The code need be adapted to maximum length of the comment.

Do you need the comment in the output?

You may prefer sort the result by UN_match with NODUPKEY option.

The code will skip any word starting with 'UN' and not followed by numbers (like UNxyzsd / UN abcd )

Re: data extraction from text data

Shmuel — Sat, 05 Sep 2020 19:23:33 GMT

Based on @s_lassen's code, using PRX function, you may prefer next code with its output result:

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want1;
 set have;
     length UN_match $50 txt1 $6;
     regexid = prxparse('/\bUN\s*\d{4}\b/');
     pos1=1; 
     posend = length(comment);
     UN_match = ' ';
     do while (1);
        call prxnext(regexid,pos1,posend,comment,pos,lent);
        if pos=0 then leave;
        txt1 = compress(substr(comment,pos,lent));
        *put pos= txt1=;
	    UN_match = catx(' ',UN_match, txt1);
	 end;
	 *putlog UN_match= comment=;
	 keep UN_match comment;
run;

If you don't need the comment in the output remove it from the keep statement.

Re: data extraction from text data

Tom — Sat, 05 Sep 2020 20:40:47 GMT

I have 280 comments in column. Is there a way to not include them in the data step?

Are you just talking about the data step in the answer(s) that is being used to create an example source dataset? If so then skip that step and pull in your existing dataset with a SET statement instead.

If instead you mean that you want to exclude observations being read in from your existing dataset please explain the selection logic.

Re: data extraction from text data

mkeintz — Sat, 05 Sep 2020 22:27:14 GMT

I think you can take advantage of the fact that, say, compress("UN 1234") =compress("UN1234 "), so you can always take 7 characters at a time when you find a "UN". Using the sample data set provided by @Shmuel, consider this code:

data have;
  input comment $80.;
cards4;
Several documents for shipments of UN1791 missing date of shipment
please correct the following: UN 1202, UN 1992, UN 2794
;;;;
run;

data want (drop=_:);
  set have;
  length un_match $50;

  _c=findw(comment,'UN',' 0123456789');
  if f^=0 then do until(_c_increment=0);  
	un_match=catx(',',un_match,compress(substr(comment,_c,7)));
	_c_increment=findw(substr(comment,_c+1),'UN',' 0123456789');
	_c=_c+_c_increment;
  end;
run;

The other useful tool here is telling the FINDW function that not only blanks are word separators, but so is any digit. So the findw function above will locate a UN followed by a blank or a digit (and preceded by any of them too). This would allow you to skip a string like "UNITED", while capturing both variation of UN codes that you identify.

Re: data extraction from text data

shahm — Tue, 08 Sep 2020 13:29:46 GMT

Hello,

Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.

Sincerely,
Maitri

Re: data extraction from text data

shahm — Tue, 08 Sep 2020 13:33:53 GMT

Hello,
What I meant is I don't want copy paste the 280 comments in the example source dataset. So it seems I can put a few comments in the example source dataset and continue. Is that correct?

Sincerely,
Maitri

Re: data extraction from text data

Shmuel — Tue, 08 Sep 2020 14:29:01 GMT

@shahm wrote:
Hello,

Thank you for your response! I don't need the comment in the output. I just need a column which has extracted the UN numbers from the comments. Also, it works if the code excludes comments with 'UN' without the numbers.

Sincerely,
Maitri

Have you tried the solution posted by @s_lassen ?

You just need to add the COMMENT to the drop statement at the bottom of the program