And the results are in!
First series of tests, run against a large number of comment records with around 300 total good and bad keywords:
Hai.kuo1
real time 4:20.14
user cpu time 4:12.83
system cpu time 3.52 seconds
memory 11310.40k
OS Memory 16932.00k
Tom Kari1
As I suspected, SQL wasn't the best approach to this problem. When the decomposition from comments into words plus the SQL join took over an hour (and pretty much froze my machine), I gave up and cancelled it.
PGStats1
real time 3:22.55
user cpu time 3:02.28
system cpu time 2.80 seconds
memory 11310.40k
OS Memory 16932.00k
Hai.kuo2
real time 4:18.11
user cpu time 4:03.04
system cpu time 3.15 seconds
memory 11312.81k
OS Memory 16932.00k
Hai.kuo3
real time 10:30.18
user cpu time 10:21.19
system cpu time 3.26 seconds
memory 11312.81k
OS Memory 16932.00k
PGStats2
real time 5:41.48
user cpu time 5:04.57
system cpu time 6.80 seconds
memory 11310.40k
OS Memory 16932.00k
Ksharp1
real time 5:20.34
user cpu time 5:14.99
system cpu time 3.44 seconds
memory 11312.81k
OS Memory 17516.00k
Then I massively ramped up the number of keywords, to over 17,000 total good and bad keywords, to see how sensitive the in-memory approaches are to significantly upping the volume.
The results:
Hai.kuo1
real time 5:34.88
user cpu time 5:14.20
system cpu time 3.40 seconds
memory 11312.81k
OS Memory 16932.00k
PGStats1
real time 3:59.54
user cpu time 3:40.99
system cpu time 2.99 seconds
memory 11310.40k
OS Memory 16932.00k
Hai.kuo2
real time 5:20.00
user cpu time 4:56.76
system cpu time 3.47 seconds
memory 11312.81k
OS Memory 16932.00k
Hai.kuo3
This ran for an hour and a half, so I gave up and cancelled it. I then tried it on a small (total 3,000 good and bad keywords) file, and it ran to completion in 1:14.
PGStats2
real time 6:13.38
user cpu time 5:22.26
system cpu time 6.11 seconds
memory 11313.37k
OS Memory 16932.00k
Ksharp1
real time 6:16.54
user cpu time 6:05.57
system cpu time 3.97 seconds
memory 11259.78k
OS Memory 16676.00k
Conclusions:
1. Don't use SQL for a problem like this.
2. In terms of finding the words in the sentences, two approaches were used: scan and prx. Durations seem very similar, but prx seems to consume twice the "system CPU time". Is this important? Of course, with a real-world problem, prx might be the only way to accomplish the subtle destringing that would be necessary.
3. To manage the keywords, we had three approaches: hash, informat, and array. The array approach seems to have broken down at high volumes, while the hash and informat options came through with flying colours. Informat might be a little faster, but I think we'd need to go to yet higher volumes to distinguish them. (A minimal sketch of the hash-plus-scan combination appears below.)
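For reference, here is a minimal sketch of the hash-plus-scan combination described in points 2 and 3. The dataset and variable names (good, bad, have, comment) follow the sample data posted later in this thread, and the word length and delimiter list are assumptions sized for that sample; it illustrates the general technique rather than reproducing any contributor's exact entry.

data counts;
   if _n_ = 1 then do;
      length word $32;
      /* load the good and bad keyword lists into two in-memory hash tables */
      declare hash g(dataset:'good(rename=(good=word))');
      g.defineKey('word');
      g.defineDone();
      declare hash b(dataset:'bad(rename=(bad=word))');
      b.defineKey('word');
      b.defineDone();
      call missing(word);
   end;
   set have;
   goodcount = 0;
   badcount  = 0;
   /* walk the comment word by word and look each word up in the hash tables */
   do i = 1 to countw(comment, ' .,;:!?''"');
      word = lowcase(scan(comment, i, ' .,;:!?''"'));
      if g.check() = 0 then goodcount = goodcount + 1;
      else if b.check() = 0 then badcount = badcount + 1;
   end;
   drop i word;
run;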
Other notes:
These tests were run on my Toshiba laptop: an i5 @ 2.5 GHz, 6 GB installed memory, Windows 7 64-bit.
The data used was a set of Toronto restaurant reviews, pulled off the web. On the large tests, the raw data was reproduced 100 times, ending up with 263,360,800 words and 1,405,426,300 characters.
On the small keyword tests I used 320 keywords (187 good, 133 bad). On the big keyword tests I used 17,346 keywords (10,136 good, 7,210 bad).
I'll keep the data for a week, in case anybody wants anything tested.
Final comments:
Thank you all for providing such wonderful illustrations of the different ways that SAS can solve a real-world problem. I've learned a tremendous amount from this (in addition to never saying "I'll do that" again), and I know that this knowledge will be of great value in the future.
I'm considering writing a paper about this for SGF or some other venue. If any of you would not like me to use your software, please reply in the comments. On the other hand, if any of you would like to be listed as co-authors, also reply to that effect, and we'll figure it out offline.
Tom
Thank you very much, Tom, for the tremendous efforts you have committed. Feel free to use my code in your paper.
Haikuo
Wow Tom, That's what I call serious testing! Thanks a lot. Feel free to use my code.
PG
Maybe I should make my code better.
Feel free to use my code in your paper, Tom. You know my e-mail.
Ksharp
Hi Tom,
If it's not too late, I'd appreciate it if you could try a true PRX-based approach, where PRX isn't used only as a word reader but actually does the work of finding the good/bad words too. I doubt it will compete with the other methods on large strings, because the PRX engine is an NFA engine and that's fairly bad for the efficiency of alternation constructs, but anyway, here's how it goes:
data good;
input good$;
cards;
wow
great
ok
good
better
best
;
data bad;
input bad$;
cards;
bad
meh
boring
never
;
data have;
input comment $50.;
cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;
proc sql noprint;
   select good
      into :good separated by '|'
      from good;
   select bad
      into :bad separated by '|'
      from bad;
quit;

data temp;
   retain prxidgood prxidbad;
   if _N_=1 then do;
      prxidgood=prxparse("/\b(?:&good.)\b/i");
      prxidbad=prxparse("/\b(?:&bad.)\b/i");
   end;
   set have;
   start=1;
   goodcount=0;
   badcount=0;
   do until (pos1=0);
      call prxnext(prxidgood, start, -1, comment, pos1, length);
      if pos1>0 then goodcount=goodcount+1;
   end;
   /* restart the scan from the beginning of the comment for the bad words */
   start=1;
   do until (pos2=0);
      call prxnext(prxidbad, start, -1, comment, pos2, length);
      if pos2>0 then badcount=badcount+1;
   end;
   drop start pos1 pos2 length prxidgood prxidbad;
run;
If it's not lagging too far behind, I could try to optimize it into a single loop, using $1 and $2 regex constructs with a single &good|&bad regex and counting according to which group matched. I'd have to read further about PRXNEXT and what can be done, however.
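For what it's worth, here is one hedged sketch of that single-regex idea, using the PRXPOSN function to see which capture buffer matched rather than $1/$2 replacement constructs. It reuses the &good. and &bad. macro variables built above; the $200 lengths are an assumption sized for the sample data, and this variant has not been benchmarked.

data temp2;
   retain rx;
   if _n_ = 1 then
      /* one combined pattern: capture group 1 = good words, group 2 = bad words */
      rx = prxparse("/\b(&good.)\b|\b(&bad.)\b/i");
   set have;
   length frag hit $200;
   goodcount = 0;
   badcount  = 0;
   start = 1;
   do while (start <= length(comment));
      frag = substr(comment, start);
      pos = prxmatch(rx, frag);
      if pos = 0 then leave;
      /* a non-blank capture buffer 1 means a good word matched; otherwise it was a bad word */
      hit = prxposn(rx, 1, frag);
      if hit ne '' then goodcount = goodcount + 1;
      else do;
         hit = prxposn(rx, 2, frag);
         badcount = badcount + 1;
      end;
      /* advance past the end of the matched word */
      start = start + pos + length(hit) - 1;
   end;
   drop rx start pos frag hit;
run;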
Thanks!
Very interesting thread by the way
Vincent
*edit updated according to data _null_'s comment below. It should definitely not be computed on each data step iteration.
*edit added the i option to the regex to ignore case, as mentioned by Haikuo below.
*edit added the \b...\b to fully delimit words, as PG pointed out. Thanks. However, that is actually one of the strengths of regexes over scans: you can find embedded words, and in different scenarios that may achieve more of the OP's goal. The o option did not appear to work in my testing, and sadly the \b...\b forces me to add parentheses, which means adding a capturing group to the regex and thus significantly decreasing efficiency. To circumvent that effect, I added the ?: at the start of the group to define it as a non-capturing group. Small-scale tests show it works as intended.
I did not know about the o option before, as I had done most of my regex self-learning on MSDN, and the only five .NET framework options discussed there are imnsx. I can't seem to find what to search SAS Help for to get the list of available Perl options; if anyone could point it out, that would be much appreciated. I'm still stuck with SAS 9.2 at Statscan; I haven't requested 9.3 yet, hoping to jump onto 9.4 testing as soon as we get some licenses.
I'll give it a try in the next few days.
Tom
Should these statements be executed for every iteration of the data step loop?
prxidgood=prxparse("/&good/");
prxidbad=prxparse("/&bad/");
LOL, DN. Can't you be less sharp sometimes?
Haikuo
Either you use the 'o' suffix :
prxidgood=prxparse("/&good/io");
prxidbad=prxparse("/&bad/io");
or RETAIN and _n_=1 :
retain prxidgood prxidbad;
if _n_=1 then do;
prxidgood=prxparse("/&good/i");
prxidbad=prxparse("/&bad/i");
end;
Note that the patterns will match word parts: /ordinary/ will match "extraordinary". It might be better to use pattern "/\b(&good.)\b/i"
PG
Here are the results for Vince281's code.
I suspected from the earlier tests that using PRX in this way would be problematic, and that was the case.
In the "full volume" run, with 17,000 keywords in total, SAS failed with
ERROR: The length of the value of the macro variable GOOD (65540) exceeds the maximum length (65534). The value has been
truncated to 65534 characters.
so I reduced the volumes of both comments and keywords, and ran it with different combinations. Here are the results:
| comment records         | 27,155    | 27,155    | 27,155    | 271,550   | 271,550   | 2,715,500 | 3       |
| keywords                | 33        | 320       | 3,000     | 33        | 320       | 33        | 17,346  |
| real time (mm:ss)       | 00:12.98  | 01:44.87  | 15:02.65  | 02:14.30  | 17:10.08  | 23:10.12  | failed  |
| user cpu time (mm:ss)   | 00:12.51  | 01:41.97  | 14:50.71  | 02:09.87  | 16:58.71  | 22:07.69  |         |
| system cpu time (mm:ss) | 00:00.31  | 00:00.37  | 00:00.57  | 00:03.26  | 00:04.04  | 00:26.67  |         |
| memory (KB)             | 11,306.65 | 11,306.65 | 11,306.65 | 11,306.65 | 11,362.89 | 11,306.65 |         |
| OS Memory (KB)          | 16,676.00 | 16,676.00 | 17,516.00 | 17,516.00 | 16,932.00 | 16,676.00 |         |
With the full volume of comment records, using only 33 keywords consumed 23 minutes, compared to just under 4 minutes for the corresponding PGStats1 run.
Tom
Tyvm for the effort Tom,
I did expect full volume to have issues. Alternation constructs are brutal on NFA PRX efficiency. Nonetheless, I am glad to see the results. I wish SAS supported a "DFA" option for perl.
Vincent
The "full volume" test would understandably fail considering the SAS limit on macro variable length. There is a macro sample at http://support.sas.com/kb/39/605.html?utm_source=dlvr.it&utm_medium=twitter that illustrates how to create a string that goes beyond the 64K limit for macro variables.
Also, results that are heavily dependent on macro variables would be affected by settings such as MSYMTABMAX and MVARSIZE.
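For anyone checking their own session, a quick way to inspect those two settings (just an inspection sketch; it does not get around the 64K limit itself) is:

/* display the current macro memory settings for this session */
proc options option=(msymtabmax mvarsize);
run;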
Yes, if this option had shown comparable performance, it would have been worth investigating.
However, in this case the other options are simply much more appropriate, so I decided to leave it at that.
Tom