
txt files I need to pattern match contents on with regular expressions RegEx

Frequent Contributor
Posts: 90


Hello,

I need to loop through a set of text files, pattern match each one against a regular expression, and produce a simple list of the files that contain at least one match. I have tested my expression on a few RegEx websites and it passes my quality tests. What I don't know is how to run a RegEx search against a file in SAS. Can anyone give me some starter pointers? I am an experienced programmer, but I only started using SAS last spring. I am willing to read, but so far I am not sure which function, or combination of functions, to look at. I am also a little lost on which "proc import" to use, since the files may or may not be fixed-width, tab-delimited, stanza-style, and so on. I just want to read each file one line at a time and check for a match; once a match is found, I can skip the rest of the file and mark it as matched. TIA. -Keith

Super User
Posts: 11,343

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to kjohnsonm

There are a number of SAS functions for this, but look at PRXMATCH first.

 

You may also want to look at some reading tricks, such as using the automatic variable _INFILE_ instead of trying to read into variables, possibly wildcard file names, and the FILENAME= option on the INFILE statement.
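The tips above might be combined roughly as follows. This is only a sketch: the folder path, the data set name HITS, and the variables THISFILE and SOURCE_FILE are placeholders, and the pattern is the stand-in used elsewhere in this thread. Note that it reads every line of every file rather than stopping at the first hit, so the list is de-duplicated afterwards.

```sas
/* Sketch: wildcard INFILE + FILENAME= + _INFILE_ + PRXMATCH.          */
/* The path and all data set / variable names are assumptions.         */
data hits(keep=source_file);
   length thisfile source_file $260;
   /* The wildcard makes SAS read all matching files one after another; */
   /* FILENAME= stores the name of the file currently being read.       */
   infile 'D:\The_Directory_path\*.txt' filename=thisfile lrecl=32767;
   input;                               /* load the record into _INFILE_ only */
   if prxmatch('/your RegEx expression/', _infile_) then do;
      source_file = thisfile;           /* FILENAME= variables are not written out, so copy */
      output;
   end;
run;

/* Keep one row per file that had at least one match. */
proc sort data=hits nodupkey;
   by source_file;
run;
```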

Super User
Posts: 19,792

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to kjohnsonm
You may want to look at system commands that you can pass to the OS from SAS instead; that may be more efficient. This would be via the X statement or the %SYSEXEC macro statement.
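As one illustration of the OS-command route on Windows, the FINDSTR command can list the files that contain a match, and its output can be read through a pipe. This is a sketch under assumptions: the path is a placeholder, the session must allow OS commands (the XCMD system option), and FINDSTR's regular expressions are much more limited than Perl's (no lookaheads, no \d), so the pattern below is only a rough digit-based stand-in.

```sas
/* Sketch: let the OS do the searching and read back the file names.   */
/* /s = include subfolders, /m = print only names of matching files,   */
/* /r = treat the search string as a regular expression.               */
filename hits pipe
   'findstr /s /m /r /c:"[0-9][0-9][0-9][-.][0-9][0-9]" "D:\The_Directory_path\*.txt"';

data matched_files;
   infile hits truncover;
   input filename $260.;                /* one matching file name per line */
run;

filename hits clear;
```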
Respected Advisor
Posts: 3,156

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to kjohnsonm

Here is something that may get you started (Windows version); tweak it to fit your environment.

/* Pipe the DIR listing (bare format, including subfolders) into fileref X */
filename x pipe 'dir \\yourfolder\*.txt /s /b';

data want;
  infile x truncover;
  input fname $100.;                  /* get the next text file name */
  /* TRUNCOVER keeps short lines from flowing over into the next record */
  infile dummy filevar=fname end=last truncover;
  do while(not last);
    input content $100.;              /* get one line of the current text file */
    texfilename=fname;                /* FILEVAR= variables are not written out, so copy the name */
    /* here is where to do your RegEx match */
    output;
  end;
run;

filename x clear;

Frequent Contributor
Posts: 90

Re: txt files I need to pattern match contents on with regular expressions RegEx

I was finally able to get the pipe syntax to at least run. I am not 100% sure, but it seems to open the file up for viewing. In my case I eventually want to process every *.txt file on my network file system, so this does not work for my needs, but it is interesting. I was able to get something like this to work for me to a degree:

filename myFile 'D:\The_Directory_path\That_file.txt';
data want;
      infile myFile truncover scanover;
      input line   $4096.;
run;
proc contents data=want;
run;

In my logs I get data like this:

747  filename myFile 'D:\The_Directory_path\That_file.txt';
748  data want;
749        infile myFile truncover scanover;
750        input line   $4096.;
751  run;

NOTE: The infile MYFILE is:
      Filename=D:\The_Directory_path\That_file.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=431,
      Last Modified=18Dec2015:09:52:12,
      Create Time=17Nov2015:10:13:37

NOTE: 18 records were read from the infile MYFILE.
      The minimum record length was 0.
      The maximum record length was 50.
NOTE: The data set WORK.WANT has 18 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.03 seconds

752  proc contents data=want;
753  run;

NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

 

 

 

Does anyone know how to capture the max/min record length and the file's created or modified date and time in macro variables? I could use the metadata :) I of course already have the file name and path in the program that this eventually fits into. TIA. -KJ
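One possible way to capture these values, offered as a sketch rather than a tested solution: the FOPEN and FINFO functions can return file attributes, and the LENGTH= option on INFILE exposes each record's length. The information item names ('Last Modified', 'Create Time') are taken from the Windows log above and differ by operating system; the macro variable names are made up for illustration. An empty file would leave the record-length macro variables undefined.

```sas
filename myFile 'D:\The_Directory_path\That_file.txt';

/* File-level attributes via FOPEN/FINFO (item names are OS-dependent). */
data _null_;
   fid = fopen('myFile');
   if fid > 0 then do;
      call symputx('file_modified', finfo(fid, 'Last Modified'));
      call symputx('file_created',  finfo(fid, 'Create Time'));
      rc = fclose(fid);
   end;
run;

/* Minimum and maximum record length via the LENGTH= option. */
data _null_;
   infile myFile length=reclen end=last;
   retain minlen maxlen;
   input;                               /* read the record; RECLEN is set here */
   minlen = min(minlen, reclen);        /* MIN/MAX ignore the initial missing value */
   maxlen = max(maxlen, reclen);
   if last then do;
      call symputx('min_reclen', minlen);
      call symputx('max_reclen', maxlen);
   end;
run;

%put modified=&file_modified created=&file_created min=&min_reclen max=&max_reclen;
```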

Trusted Advisor
Posts: 1,117

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to kjohnsonm

Excellent advice from both, @ballardw and @Reeza! And a nice, practical example from @Haikuo.

 

If you go the data step route, you could combine @ballardw's tips with @Haikuo's code and, e.g., use the automatic variable _INFILE_ rather than a newly created variable CONTENT.

 

Example 10 of the INFILE statement documentation could also be interesting. They read the external file names from a text file, which would be preferable to the pipe approach if the .txt files were located in various different folders.


I was just playing around with the code presented there and verified that it is possible to skip the rest of a (potentially huge) file as soon as a match is found, and then continue with the next file:

 

/* The "or eof" guard stops the loop at the end of a file that has no match; */
/* without it, INPUT would try to read past the end and stop the DATA step.  */
do until(prxmatch('/your RegEx expression/', _infile_) or eof);
  input;
end;

The above DO loop would replace the do while(^eof); ... end; there.
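Putting the pieces from this thread together, a complete step might look roughly like the sketch below. It is untested and rests on assumptions: the folder path, the fileref FLIST, the data set MATCHED_FILES, and the variable names are all placeholders, and the pattern is the stand-in used above. The FOUND flag records whether the loop ended because of a match or because the file ran out.

```sas
/* Hypothetical folder; replace with your own path. */
filename flist pipe 'dir "D:\The_Directory_path\*.txt" /s /b';

data matched_files(keep=filename);
   length fname filename $260;
   retain re;
   if _n_ = 1 then re = prxparse('/your RegEx expression/');  /* compile once */

   infile flist truncover;
   input fname $260.;                   /* next file name from the DIR listing */

   /* TXT is only a placeholder; FILEVAR= decides which file is opened. */
   infile txt filevar=fname end=eof lrecl=32767;
   found = 0;
   do while (not eof and not found);
      input;                            /* load the line into _INFILE_ only */
      if prxmatch(re, _infile_) then found = 1;
   end;

   if found then do;
      filename = fname;                 /* FILEVAR= variables are not written out */
      output;                           /* one row per file with at least one match */
   end;
run;

filename flist clear;
```

Because the loop exits on the first match, the rest of a large file is never read; the file is closed when the next value of FNAME is assigned.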

 

Frequent Contributor
Posts: 90

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to FreelanceReinhard

Thanks for the replies. I am not on my game today: I cannot seem to get any of these ideas to work except the use of prxparse. I will try again later.

I was able to get a simple DATA step to work with this code:

data _null_;
	if _N_=1 then
	do;
		retain PerlExpression;
		pattern="/(?!0)(?!9[0-9][0-9])\d{3}[-.]{1}(?!00)\d{2}/";
		PerlExpression=prxparse(pattern);   /* compile the pattern once */
/*		put PerlExpression;*/
	end;
	input data_line $100.;
	/* CALL PRXSUBSTR returns both the position and the length of the match */
	call prxsubstr(PerlExpression, data_line, position, length);
	mA='Matched Position: ';
	mS='String: ';
	mL='Whole line:';
	if position ^= 0 then
	do;
		current_match = substr(data_line, position, length);
		put mA position mS current_match mL data_line;
	end;
datalines;
123456789012345678901234567890123456789012345678901234567890123 123-45                             0
023456789012345678901234567890123456789012345678901234567890123 123.45                             0
12345678901234567890123456789-01
123456789012345678901234567 89-01
000000000000000000000000000000000000000000000000000000000000003 123 45 0 14:56.456 45:32
;
run;

 

Frequent Contributor
Posts: 90

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to kjohnsonm

PS

I tried this:

 

proc print data=dictionary.tables;
run;

 

This errors because I am not using a Libref.

Respected Advisor
Posts: 3,156

Re: txt files I need to pattern match contents on with regular expressions RegEx

Posted in reply to kjohnsonm
/*
You could try this.
I have limited it to 10 obs, as it can take a long time to run if
you have many tables.
*/

proc print data=sashelp.vtable(obs=10);
run;

/*
The reason yours does not work is that, with one exception,
a SAS library name is limited to 8 characters, while 'dictionary'
is 10. The one exception is PROC SQL:
*/
proc sql inobs=10;
	select * from dictionary.tables
	;
quit;

Frequent Contributor
Posts: 90

Re: txt files I need to pattern match contents on with regular expressions RegEx

...and in my example this WHERE clause limits me to my data set WANT, plus anything I might have in my WORK library. Thanks for the knowledge, that helps a lot! -KJ

 

proc sql inobs=10;
    select * from dictionary.tables
    where libname='WORK' or memname='WANT';
quit;


proc print data=sashelp.vtable;
where libname='WORK' or memname='WANT';
run;

 
