Hi Tom,

If it's not too late, I'd appreciate it if you could try a true PRX-based approach, where PRX isn't used only as a word reader but also does the work of finding the good/bad words. I doubt it will compete with the other methods on large strings, because the PRX engine, like the .NET Framework one, is an NFA, and that's fairly bad for the efficiency of alternation constructs. Anyway, here's how it goes:

data good;
input good$;
cards;
wow
great
ok
good
better
best
;

data bad;
input bad$;
cards;
bad
meh
boring
never
;

data have;
input comment $50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;

/* Build one alternation per word list, e.g. wow|great|ok|... */
proc sql noprint;
  select good into :good separated by '|' from good;
  select bad into :bad separated by '|' from bad;
quit;

data temp;
  /* Compile each regex once, not on every data step iteration */
  if _N_=1 then do;
    prxidgood=prxparse("/\b(?:&good.)\b/i");
    prxidbad=prxparse("/\b(?:&bad.)\b/i");
  end;
  set have;
  goodcount=0;
  badcount=0;
  start=1;
  do until (pos1=0);
    call prxnext(prxidgood, start, -1, comment, pos1, length);
    if pos1>0 then goodcount=goodcount+1;
  end;
  start=1; /* restart the scan from the beginning for the bad words */
  do until (pos2=0);
    call prxnext(prxidbad, start, -1, comment, pos2, length);
    if pos2>0 then badcount=badcount+1;
  end;
  retain prxidgood prxidbad;
  drop start pos1 pos2 prxidgood prxidbad;
run;

If it's not lagging too far behind, I could try to optimize this into a single loop, using the $1 and $2 regex constructs with a single &good|&bad regex and counting according to which one matched. I'd have to read further about PRXNEXT and what can be done, however.

Thanks! Very interesting thread, by the way.

Vincent

*edit: updated according to data _null_'s comment below. The regex should definitely not be compiled on each data step iteration.

*edit: added the i option to the regex to ignore case, as mentioned by Haikuo below.

*edit: added the \b...\b to fully delimit words, as PG pointed out. Thanks. That said, one of the strengths of regexes over scans is that you can find embedded words, and in other scenarios that may achieve more of the OP's goal.
The o option did not appear to work in my testing and, sadly, the \b...\b forces me to add the parentheses, which means adding a capturing group to the regex and thus significantly decreasing efficiency. To circumvent that effect, I added ?: at the start of the group to define it as a non-capturing group. Small-scale tests show it's working as intended.

I did not know about the o option before, as I did most of my regex self-learning on MSDN, and the only five .NET Framework options discussed there are i, m, n, s and x. I can't seem to find what to search SAS help for to get the list of Perl options available. If anyone could point it out, that would be much appreciated. I'm still stuck with SAS 9.2 at Statscan; I have not requested 9.3 yet, hoping to jump on 9.4 testing as soon as we get some licenses.
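
For reference, here is a minimal, untested sketch of the single-loop idea mentioned in the post. It assumes the &good and &bad macro variables and the HAVE data set built in the post's code. The names temp2, prxid, pos, len and gpos are only illustrative, and this is one possible way to do it rather than a benchmarked solution. It wraps each word list in its own capturing group and uses CALL PRXPOSN after each CALL PRXNEXT match to check which group took part in the match.

```sas
data temp2;
  /* Compile the combined regex once.                      */
  /* Group 1 captures a good word, group 2 a bad word.     */
  if _N_=1 then prxid=prxparse("/\b(?:(&good.)|(&bad.))\b/i");
  retain prxid;
  set have;
  goodcount=0;
  badcount=0;
  start=1;
  call prxnext(prxid, start, -1, comment, pos, len);
  do while (pos>0);
    /* gpos is nonzero only when capture buffer 1 (good) matched */
    call prxposn(prxid, 1, gpos);
    if gpos>0 then goodcount=goodcount+1;
    else badcount=badcount+1;
    call prxnext(prxid, start, -1, comment, pos, len);
  end;
  drop prxid start pos len gpos;
run;
```

This brings back two capturing groups, so it trades the non-capturing optimization for a single pass over each comment. Whether that is a net gain would need to be timed on real data.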