Re: regex prxnext ungreedy

acordes · Posted 04-13-2023 11:42 AM

On the regex101.com page my regular expression does what I want it to do.

But I cannot use the ungreedy option in the prxparse function.

I want to output any text between data|proc and run|quit. Where I get stuck is the casuser.want step at the end of the code.

Calling the expert, @Ksharp any idea?

The first example code is taken from https://communities.sas.com/t5/SAS-Programming/Scan-a-string-to-find-word-after-a-specific-word/m-p/...


data casuser.codi1;
       infile datalines4 dsd  ;
       length text $2000  ;
       input text$ ;
datalines4;
data want;
 string = "my aim is to find every word after bank_beg.upa for every bank_beg.xx in this line bank_beg.ff";
 length want $ 80;
pid=prxparse('/(?<=bank_beg\.)\w+/i'); 
 s=1;e=length(string);
 call prxnext(pid,s,e,string,p,l);
 do while(p>0);
   want=catx(' ',want,substr(string,p,l));
    call prxnext(pid,s,e,string,p,l);
 end;
 drop s e p l pid;
run;

data ReversedNames;
   input name & $100.;
   datalines;
Activa contractual
Anulada antes de firmada
Cancelacion Long Drive. CPC-NEXT
Cancelacion Long Drive. Decision Cliente
Cancelacion Long Drive. Desconocimiento producto
Cancelacion Long Drive. Descontento exceso KMS
Cancelacion Long Drive. Impagados
Cancelacion Long Drive. Importe cuota elevada
Cancelacion Long Drive. No se ajusta a sus necesidades
Cancelacion Long Drive. Perdida total (robo/siniestro)
Cancelacion Long Drive. Uso intensivo vehiculo MAX.Permitido
Cancelacion Long Drive. Uso publico excesivo (TAXI)
Cancelacion Long Drive. Venta del vehiculo
Finiquitada LONG DRIVE por excedido tranquilidad SEAT
Finiquitada LONG DRIVE por kilometraje
Finiquitada LONG DRIVE por tiempo
;
run;

data FirstLastNames;
   length first last $ 16;
   keep first last name situation;
   retain re;
   if _N_ = 1 then
      re = prxparse('/(canc|anul*)?(fini*)?(activ*)*/i');
   set ReversedNames;
   if prxmatch(re, name) then 
      do;
         last = prxposn(re, 1, name);
         first = prxposn(re, 2, name);
          situation=choosec(max(
          ^missing(prxposn(re, 1, name))*1, ^missing(prxposn(re, 2, name))*2, ^missing(prxposn(re, 3, name))*3), 
          'cancelada/anulada', 'finiquitada', 'activa' );
      end;
run;

proc sql;
select * from oks;
quit;
;;;;
run;


data _null_;
do i = 1 to 1;
call execute ("data casuser.codi_desc" || strip(put(i,$2.)) ||
"; set casuser.codi" || strip(put(i,$2.))  || " end=eof;
length textus varchar(10000);
did=" || i || " ;
retain textus;
textus=cats(textus, text);
if eof then do; 
textus =tranwrd(textus,';',cat(';', '0A'x));
output;
end;
run;");
end;
run;

data casuser.want;
set CASUSER.CODI_DESC1;
 length want $ 32000;
pid=prxparse('/^\s?(data|proc)(.*\n)*(?=(quit|run))(run|quit);/i'); 
 s=1;e=length(textus);
 call prxnext(pid,s,e,textus,p,l);
 do while(p>0);
   want=substr(textus,p,l);
  output;
    call prxnext(pid,s,e,textus,p,l);
 end;
/*  drop s e p l pid; */
run;

Ksharp · Posted 04-14-2023 07:47 AM


/*
I think it is a tough question.
There are too many scenarios you need to take care of
*/
data casuser.want;
set CASUSER.CODI_DESC1;
 length want $ 32000;
pid=prxparse('/(;)?(data|proc)\s+\w+\s*;|;\s*(run|quit)\s*;/i'); 
 s=1;e=length(textus);
 call prxnext(pid,s,e,textus,p,l);
 do while(p>0);
   want=substr(textus,p,l);
  output;
    call prxnext(pid,s,e,textus,p,l);
 end;
/*  drop s e p l pid; */
run;
data casuser.want2;
set casuser.want;
lag_p=lag(p);lag_l=lag(l);
if mod(_n_,2)=0 then do;
 want_code=substr(textus,lag_p+lag_l,p-lag_p-lag_l+1);
 output;
end;
keep want_code;
run;

Tom · Posted 04-14-2023 11:22 AM

I cannot figure out what you are asking.

For example, what is the purpose of this gibberish in the middle of your question?

data _null_;
do i = 1 to 1;
  call execute 
   ("data casuser.codi_desc"
  || strip(put(i,$2.))
  || "; set casuser.codi" 
  || strip(put(i,$2.))  
  || " end=eof; length textus varchar(10000); did=" 
  || i 
  || " ; retain textus; textus=cats(textus, text); if eof then do;"  
  || "textus =tranwrd(textus,';',cat(';', '0A'x));output;end;run;"
  );
end;
run;

And why is it trying to use a character format $2. with a numeric variable I?
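
For comparison, here is a minimal sketch of two ways to build the numeric suffix without a character format: a numeric format inside put(), or cats(), which converts the number to text and trims it in one step. The variable names name_a and name_b are only for illustration and are not part of the original code.

```sas
data _null_;
   length name_a name_b $ 41;
   do i = 1 to 1;
      /* numeric format 2. instead of the character format $2. */
      name_a = 'casuser.codi_desc' || strip(put(i, 2.));
      /* cats() converts the numeric I to text and removes blanks */
      name_b = cats('casuser.codi_desc', i);
      /* write both names to the log so they can be compared */
      put name_a= name_b=;
   end;
run;
```

Either string could then be passed to call execute in place of the put(i,$2.) expressions.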

acordes · Posted 04-17-2023 04:51 AM

Hi Tom,

You're right, I was wrong to use a character format with a numeric variable. I have amended this.

Meanwhile I've found a workaround with satisfying results.

But because you asked for it I try to make it clearer.

I have a txt file (here I use datalines for the example) with 180.000 lines of code.

Using varchar in my sas viya setup results in only 16 rows holding the string of these 180.000 lines.

For each of these 16 observations I'd like to split every 'proc...run;' sequence into a new observation and discard the rest.

AhmedAl_Attar · Posted 04-17-2023 06:01 AM

Hi @acordes

When you say "I have a txt file with 180.000 lines of code."

I wonder how wide each line is? Basically what's the Logical Record Length (lrecl)?

According to the documentation (Data Types Supported in the CAS DATA Step):

A CHAR variable can hold 1–32,767 characters.
A VARCHAR variable can hold 1–536,870,911 characters (UTF-8 encoding).

So I wonder why you are splitting the file into 16 variables "Using varchar in my sas viya setup results in only 16 rows holding the string of these 180.000 lines." ?

In my opinion, you need to make sure to have a single continuous string in order to correctly parse it and find the text in-between your desired starting key words (Proc*;|Data*;) and ending key words (run;|quit;)

Splitting across multiple variables will not guarantee correct parsing! I would recommend revising how you read and store your txt file in a SAS data set first, before trying to parse it. Otherwise you'll need to parse every single line as you read it in, while keeping track of flags for the keywords already encountered. Just a thought,
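
A rough sketch of that line-by-line idea, assuming a hypothetical table work.codelines with one source line per observation in a variable named text. The names blocks, block and in_block are made up for the example. It treats any line that starts with data or proc as the start of a step and any line containing run; or quit; as its end, so datalines content or comments that happen to look like code would need extra handling.

```sas
data work.blocks;
   set work.codelines;            /* hypothetical input, one line per row */
   length block $ 32767;
   retain block '' in_block 0;    /* carry the state across observations  */

   /* a step starts on a line that begins with DATA or PROC */
   if not in_block and prxmatch('/^\s*(data|proc)\b/i', text) then do;
      in_block = 1;
      block = '';
   end;

   if in_block then do;
      /* append the current line, separated by a line feed */
      block = catx('0A'x, block, text);
      /* a step ends on a line that holds RUN; or QUIT; */
      if prxmatch('/\b(run|quit)\s*;/i', text) then do;
         output;
         in_block = 0;
      end;
   end;

   keep block;
run;
```

Lines outside any step are never appended to block, which is how the rest of the text gets discarded.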

Hope this helps

acordes · Posted 04-17-2023 06:51 AM

Yes, I know the 16 observations will harm the result because I do not control how the total string gets broken into these 16 obs.

I tried the singlethread option but it still distributes the task over the available threads. I suppose that's the reason why I end up with 16 observations.

But my main challenge is to split, I think regex is best suited for that, the total string into chunks of "proc ... run;" sequences.

The screenshots I provided show how the regex101.com page handles this when the ungreedy option is activated. Translated to SAS code, that option gives me an error.
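
One thing that may be worth trying: the ungreedy switch on regex101.com is a PCRE flag rather than a Perl one, so prxparse has no matching option, but a lazy quantifier written inside the pattern (*? instead of *) should give the same effect. The sketch below is only an illustration under that assumption. The sample string and the names demo, block, rx, start_pos and stop_pos are made up, and [\s\S] is used so that the match can run across line feeds.

```sas
data demo;
   length textus $ 200 block $ 200;
   textus = 'data a; x=1; run; proc sql; select * from oks; quit; data b; y=2; run;';

   /* [\s\S]*? takes as few characters as possible up to the first RUN; or QUIT; */
   rx = prxparse('/\b(data|proc)\b[\s\S]*?\b(run|quit)\s*;/i');

   start_pos = 1;
   stop_pos  = length(textus);
   call prxnext(rx, start_pos, stop_pos, textus, pos, len);
   do while (pos > 0);
      block = substr(textus, pos, len);   /* one step per output row */
      output;
      call prxnext(rx, start_pos, stop_pos, textus, pos, len);
   end;
   keep block;
run;
```

On this sample the loop should write three rows, one per step. Real code needs more care, because the word data also appears in places such as data= options, and run; can appear inside comments or datalines.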

AhmedAl_Attar · Posted 04-17-2023 07:26 AM

Can you share the code you are using to read in the file and not the datalines, because they are not the same!?

Splitting into 16 variables doesn't happen by default; there must be some coding issue that causes it!

ErikLund_Jensen · Posted 04-17-2023 10:41 AM

Hi @AhmedAl_Attar , @acordes

If there are 180.000 code lines in the input file, one would expect them to be read into SAS as 180.000 lines. When this doesn't happen, it might be because SAS doesn't recognize the record terminator, but reads the file in chunks of 32767 bytes, the max. allowed in an infile statement, though I cannot see why it would read more than one observation in this case.

But it could be an interesting experiment to see if a termstr= option on the infile statement specified to match the file (CR, LF or CRLF) could force SAS to read the file as 180.000 code lines. It might be simpler to parse the file this way.
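
A minimal sketch of that experiment, assuming the file uses LF line endings. The path is hypothetical, and termstr= should be changed to CRLF or CR if that is what the file actually contains.

```sas
data work.codelines;
   /* termstr= tells SAS which record terminator to expect */
   infile '/path/to/code.txt' termstr=lf lrecl=32767 truncover;
   length text $ 2000;
   /* $char keeps leading blanks, so indentation survives */
   input text $char2000.;
run;
```

If the terminator matches, the log should report roughly as many records read as there are lines in the file.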

An alternative could be to read the file in 16 chunks and concatenate the chunks into one variable/observation before output as a SAS varchar, so the parsing for proc/run wouldn't be compromised by broken lines.

acordes · Posted 04-17-2023 11:25 AM

My code that produces the described behaviour mirrors the code pasted in my first post.

The only difference is that I paste 180000 lines of code. Ok, that's not very clean but it creates 180000 observations each one representing one line even if it is blank.

The following code produces the 16 observations. I ran a length check on these observations and the length is in line with what I expect from 180000 lines with 20 to 40 characters on average divided into 16 observations.

You can easily try on your own plugging-in your sas code or whatever text.

I have taken out the do loop in the data _null_ because for this setting it only distracts.

data casuser.cars1;
       infile datalines4 dsd  ;
       length text $2000  ;
       input text$ ;
datalines4;

/* find matches retail */
proc casutil;
droptable CASDATA= 'cart' incaslib='mkt' quiet  ;

/* 180000 more lines */

;;;;
run;

data CASUSER.CARS1;
set CASUSER.CARS1;
where text ne '';
run;

data _null_;
call execute ("data casuser.codi_desc1;
set CASUSER.CARS1 end=eof;
length textus varchar(10000000);
retain textus;
textus=cats(textus, text);
if eof then do; 
textus =tranwrd(textus,';',cat(';', '0A'x));
output;
end;
run;");
run;


data CASUSER.CODI_DESC1;
set CASUSER.CODI_DESC1;
uniqueID = compress(put(_threadid_,8.) || '_' || Put(_n_,8.));
run;


proc cas;
source MPG_toyota;
			select uniqueID, length(textus) as len 
               from CASUSER.CODI_DESC1
			group by uniqueID, len  ;
endsource;
fedSQL.execDirect / query=MPG_toyota;
quit;

AhmedAl_Attar · Posted 04-17-2023 04:46 PM

Hi @acordes

Are you trying to develop a code parser?

If yes, then I would highly recommend reading this 2017 paper, "Automatically create diagrams showing the structure and performance of your SAS code", especially when you have 180,000 lines of code to analyze.

Another alternative would be the SAS 9 Content Assessment. One of its components is "SAS Code Check", which scans a directory of code files.

Hope this helps
