On regex101.com my regular expression does what I want it to do.
But I cannot use the ungreedy option in the prxparse function.
I want to output any text between data|proc and run|quit. Where I get stuck is the casuser.want step at the end of the code.
Calling the expert: @Ksharp, any idea?
The first example code is taken from https://communities.sas.com/t5/SAS-Programming/Scan-a-string-to-find-word-after-a-specific-word/m-p/...
data casuser.codi1;
   infile datalines4 dsd;
   length text $2000;
   input text $;
datalines4;
data want;
string = "my aim is to find every word after bank_beg.upa for every bank_beg.xx in this line bank_beg.ff";
length want $ 80;
pid=prxparse('/(?<=bank_beg\.)\w+/i');
s=1;e=length(string);
call prxnext(pid,s,e,string,p,l);
do while(p>0);
want=catx(' ',want,substr(string,p,l));
call prxnext(pid,s,e,string,p,l);
end;
drop s e p l pid;
run;
data ReversedNames;
input name & $100.;
datalines;
Activa contractual
Anulada antes de firmada
Cancelacion Long Drive. CPC-NEXT
Cancelacion Long Drive. Decision Cliente
Cancelacion Long Drive. Desconocimiento producto
Cancelacion Long Drive. Descontento exceso KMS
Cancelacion Long Drive. Impagados
Cancelacion Long Drive. Importe cuota elevada
Cancelacion Long Drive. No se ajusta a sus necesidades
Cancelacion Long Drive. Perdida total (robo/siniestro)
Cancelacion Long Drive. Uso intensivo vehiculo MAX.Permitido
Cancelacion Long Drive. Uso publico excesivo (TAXI)
Cancelacion Long Drive. Venta del vehiculo
Finiquitada LONG DRIVE por excedido tranquilidad SEAT
Finiquitada LONG DRIVE por kilometraje
Finiquitada LONG DRIVE por tiempo
;
run;
data FirstLastNames;
length first last $ 16;
keep first last name situation;
retain re;
if _N_ = 1 then re = prxparse('/(canc|anul*)?(fini*)?(activ*)*/i');
set ReversedNames;
if prxmatch(re, name) then do;
last = prxposn(re, 1, name);
first = prxposn(re, 2, name);
situation=choosec(max( ^missing(prxposn(re, 1, name))*1, ^missing(prxposn(re, 2, name))*2, ^missing(prxposn(re, 3, name))*3), 'cancelada/anulada', 'finiquitada', 'activa' );
end;
run;
proc sql;
select * from oks;
quit;
;;;;
run;

data _null_;
   do i = 1 to 1;
      call execute ("data casuser.codi_desc" || strip(put(i,$2.))
         || "; set casuser.codi" || strip(put(i,$2.))
         || " end=eof; length textus varchar(10000); did=" || i
         || " ; retain textus; textus=cats(textus, text); if eof then do; textus =tranwrd(textus,';',cat(';', '0A'x)); output; end; run;");
   end;
run;

data casuser.want;
   set CASUSER.CODI_DESC1;
   length want $ 32000;
   pid=prxparse('/^\s?(data|proc)(.*\n)*(?=(quit|run))(run|quit);/i');
   s=1; e=length(textus);
   call prxnext(pid,s,e,textus,p,l);
   do while(p>0);
      want=substr(textus,p,l);
      output;
      call prxnext(pid,s,e,textus,p,l);
   end;
   /* drop s e p l pid; */
run;
/* I think it is a tough question. There are too many scenarios you need to take care of */
data casuser.want;
   set CASUSER.CODI_DESC1;
   length want $ 32000;
   pid=prxparse('/(;)?(data|proc)\s+\w+\s*;|;\s*(run|quit)\s*;/i');
   s=1; e=length(textus);
   call prxnext(pid,s,e,textus,p,l);
   do while(p>0);
      want=substr(textus,p,l);
      output;
      call prxnext(pid,s,e,textus,p,l);
   end;
   /* drop s e p l pid; */
run;

data casuser.want2;
   set casuser.want;
   lag_p=lag(p); lag_l=lag(l);
   if mod(_n_,2)=0 then do;
      want_code=substr(textus,lag_p+lag_l,p-lag_p-lag_l+1);
      output;
   end;
   keep want_code;
run;
I cannot figure out what you are asking.
For example, what is the purpose of this gibberish in the middle of your question?
data _null_;
   do i = 1 to 1;
      call execute ("data casuser.codi_desc" || strip(put(i,$2.))
         || "; set casuser.codi" || strip(put(i,$2.))
         || " end=eof; length textus varchar(10000); did=" || i
         || " ; retain textus; textus=cats(textus, text); if eof then do;"
         || "textus =tranwrd(textus,';',cat(';', '0A'x));output;end;run;" );
   end;
run;
And why is it trying to use a character format $2. with a numeric variable I?
You're right, I was wrong to apply a character format to a numeric variable. I have amended this.
Meanwhile I've found a workaround with satisfactory results.
But since you asked, I will try to make it clearer.
I have a txt file (here I use datalines for the example) with 180.000 lines of code.
Using varchar in my sas viya setup results in only 16 rows holding the string of these 180.000 lines.
For each of these 16 observations, I'd like to split each 'proc...run;' sequence into a new observation and discard the rest.
When you say "I have a txt file with 180.000 lines of code.", I wonder how wide each line is. Basically, what is the logical record length (LRECL)?
According to the Docs Data Types Supported in the CAS DATA Step, The
So I wonder why you are splitting the file into 16 variables "Using varchar in my sas viya setup results in only 16 rows holding the string of these 180.000 lines." ?
In my opinion, you need to make sure to have a single continuous string in order to correctly parse it and find the text in-between your desired starting key words (Proc*;|Data*;) and ending key words (run;|quit;)
Splitting across multiple variables will not guarantee correct parsing! I would recommend revising how you read and store your txt file in a SAS data set first, before trying to parse it. Otherwise you'll need to parse every single line as you read it in, while keeping track of flags for the keywords already encountered. Just a thought.
Hope this helps
Yes, I know the 16 observations will harm the result because I do not control how the total string gets broken into these 16 obs.
I tried the singlethread option, but it still distributes the task over the available threads. I suppose that's the reason why I end up with 16 observations.
But my main challenge is to split the total string into chunks of "proc ... run;" sequences, and I think regex is best suited for that.
The screenshots I provide show how regex101.com handles this when the ungreedy option is activated. Translated to SAS code, that option gives me an error.
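A minimal sketch of one way around the ungreedy flag, assuming the error comes from the flag itself. As far as I know, PRXPARSE follows Perl syntax, where greediness is switched off per quantifier with a trailing ? rather than with a global option. The sample string in textus is made up for illustration, and the \b anchors and [\s\S]*? are my own choices, not part of the original pattern.

```sas
/* Sketch only: lazy quantifier (*?) in place of regex101's ungreedy option. */
/* The text in textus is a made-up sample, not the real 180.000-line input.  */
data _null_;
   length textus $ 200 want $ 200;
   textus = 'proc sql; select * from a; quit; x=1; data b; set a; run;';

   /* [\s\S]*? consumes as little as possible, line breaks included, */
   /* so each match should stop at the first run; or quit; it meets. */
   pid = prxparse('/\b(data|proc)\b[\s\S]*?\b(run|quit)\s*;/i');

   s = 1;
   e = length(textus);
   call prxnext(pid, s, e, textus, p, l);
   do while (p > 0);
      want = substr(textus, p, l);
      put want=;                      /* write each extracted step to the log */
      call prxnext(pid, s, e, textus, p, l);
   end;
run;
```

The intent is that every data or proc step comes back as its own match instead of one match spanning from the first proc to the last run;. It will still be fooled by the words run or quit inside comments or string literals, so treat it as a starting point.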
If there are 180.000 code lines in the input file, one would expect them to be read into SAS as 180.000 lines. When this doesn't happen, it might be because SAS doesn't recognize the record terminator, but reads the file in chunks of 32767 bytes, the max. allowed in an infile statement, though I cannot see why it would read more than one observation in this case.
But it could be an interesting experiment to see if a termstr= option on the infile statement specified to match the file (CR, LF or CRLF) could force SAS to read the file as 180.000 code lines. It might be simpler to parse the file this way.
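A minimal sketch of that experiment, assuming the file sits on disk at a hypothetical path and uses Windows line endings. The path, the output data set name work.code_lines and the 2000-character width are placeholders, not taken from the original post.

```sas
/* Sketch only: read the file one code line per observation.              */
/* '/path/to/code.txt' is a hypothetical path. Change termstr= to LF or   */
/* CR if the file does not use CRLF line endings.                         */
data work.code_lines;
   infile '/path/to/code.txt' termstr=crlf lrecl=32767 truncover;
   input text $char2000.;   /* $char keeps leading blanks (indentation) */
run;
```

If the observation count then matches the number of lines in the file, the record terminator was the problem. If not, the split is happening somewhere later in the process.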
An alternative could be to read the file in 16 chunks and concatenate the chunks into one variable/observation before output as a SAS varchar, so the parsing for proc/run wouldn't be compromised by broken lines.
The code that produces the described behaviour mirrors the code pasted in my first post.
The only difference is that I paste 180000 lines of code. OK, that's not very clean, but it creates 180000 observations, each one representing one line even if it is blank.
The following code produces the 16 observations. I checked the length of these observations, and it is in line with what I expect from 180000 lines of 20 to 40 characters on average divided into 16 observations.
You can easily try it yourself by plugging in your own SAS code or any other text.
I have taken out the do loop in the data _null_ because for this setting it only distracts.
data casuser.cars1;
   infile datalines4 dsd;
   length text $2000;
   input text $;
datalines4;
/* find matches retail */
proc casutil;
droptable CASDATA= 'cart' incaslib='mkt' quiet ;
/* 180000 more lines */
;;;;
run;

data CASUSER.CARS1;
   set CASUSER.CARS1;
   where text ne '';
run;

data _null_;
   call execute ("data casuser.codi_desc1; set CASUSER.CARS1 end=eof; length textus varchar(10000000); retain textus; textus=cats(textus, text); if eof then do; textus =tranwrd(textus,';',cat(';', '0A'x)); output; end; run;");
run;

data CASUSER.CODI_DESC1;
   set CASUSER.CODI_DESC1;
   uniqueID = compress(put(_threadid_,8.) || '_' || Put(_n_,8.));
run;

proc cas;
   source MPG_toyota;
      select uniqueID, length(textus) as len
      from CASUSER.CODI_DESC1
      group by uniqueID, len ;
   endsource;
   fedSQL.execDirect / query=MPG_toyota;
quit;
Are you trying to develop a code parser?
If yes, then I would highly recommend reading the 2017 paper "Automatically create diagrams showing the structure and performance of your SAS code", especially when you have 180,000 lines of code to analyze.
Another alternative would be the SAS 9 Content Assessment. One of its components is "SAS Code Check", which scans a directory of code files.
Hope this helps