TomKari
Onyx | Level 15

And the results are in!

First series of tests, a large number of comment records with around 300 total good and bad keywords:

Hai.kuo1

      real time           4:20.14
      user cpu time       4:12.83
      system cpu time     3.52 seconds
      memory              11310.40k
      OS Memory           16932.00k

Tom Kari1

As I suspected, SQL wasn't the best approach to this problem. When the decomposition from comments into words plus the SQL join took over an hour (and pretty much froze my machine), I gave up and cancelled it.
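For the record, the shape of what I tried was something like this (a hedged reconstruction, not the exact code, using keyword tables like the good/bad ones shown later in the thread): explode every comment into one row per word, then join to the keyword tables and re-aggregate per comment.

data words;
   set have;
   length word $ 40;
   /* One output row per word in the comment */
   do i = 1 to countw(comment, ' ,.!?;:"');
      word = lowcase(scan(comment, i, ' ,.!?;:"'));
      output;
   end;
   keep comment word;
run;

proc sql;
   create table scored as
   select w.comment,
          sum(not missing(g.good)) as goodcount,
          sum(not missing(b.bad))  as badcount
   from words w
        left join good g on w.word = lowcase(g.good)
        left join bad  b on w.word = lowcase(b.bad)
   group by w.comment;
quit;

The word-level intermediate table and the GROUP BY on the long comment string are what made it so expensive.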

PGStats1

      real time           3:22.55
      user cpu time       3:02.28
      system cpu time     2.80 seconds
      memory              11310.40k
      OS Memory           16932.00k

Hai.kuo2

      real time           4:18.11
      user cpu time       4:03.04
      system cpu time     3.15 seconds
      memory              11312.81k
      OS Memory           16932.00k

Hai.kuo3

      real time           10:30.18
      user cpu time       10:21.19
      system cpu time     3.26 seconds
      memory              11312.81k
      OS Memory           16932.00k

PGStats2

      real time           5:41.48
      user cpu time       5:04.57
      system cpu time     6.80 seconds
      memory              11310.40k
      OS Memory           16932.00k

Ksharp1

      real time           5:20.34
      user cpu time       5:14.99
      system cpu time     3.44 seconds
      memory              11312.81k
      OS Memory           17516.00k

Then I massively ramped up the number of keywords, to over 17,000 total good and bad keywords, to see how sensitive the in-memory approaches are to significantly upping the volume.

The results:

Hai.kuo1

      real time           5:34.88
      user cpu time       5:14.20
      system cpu time     3.40 seconds
      memory              11312.81k
      OS Memory           16932.00k

PGStats1

      real time           3:59.54
      user cpu time       3:40.99
      system cpu time     2.99 seconds
      memory              11310.40k
      OS Memory           16932.00k

Hai.kuo2

      real time           5:20.00
      user cpu time       4:56.76
      system cpu time     3.47 seconds
      memory              11312.81k
      OS Memory           16932.00k

Hai.kuo3

This ran for an hour and a half, so I gave up and cancelled it. I then tried it on a small (total 3,000 good and bad keywords) file, and it ran to completion in 1:14.

PGStats2

      real time           6:13.38
      user cpu time       5:22.26
      system cpu time     6.11 seconds
      memory              11313.37k
      OS Memory           16932.00k

Ksharp1

      real time           6:16.54
      user cpu time       6:05.57
      system cpu time     3.97 seconds
      memory              11259.78k
      OS Memory           16676.00k

Conclusions:

1. Don't use SQL for a problem like this.

2. In terms of finding the words in the sentences, two approaches were used: scan and prx. Duration seems very similar, but prx seems to consume twice the "system CPU time". Is this important? Of course, with a real world problem, prx might be the only way to accomplish the subtle destringing that would be necessary.

3. To manage the keywords, we had three approaches: hash, informat, and array. The array approach seems to have broken down at high volumes, while the hash and informat options came through with flying colours. Informat might be a little faster, but I think we'd need to go to yet higher volumes to distinguish them. (A minimal sketch of the winning scan-plus-hash combination appears just after this list.)
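For anyone skimming, here is roughly what that scan-plus-hash combination looks like. This is a minimal sketch, not any one poster's exact code; it assumes keyword tables good and bad that each carry a lowercase character variable word.

data scored;
   if _n_ = 1 then do;
      /* Load both keyword lists into hash tables once */
      length word $ 40;
      declare hash g(dataset: "good");
      g.defineKey("word");
      g.defineDone();
      declare hash b(dataset: "bad");
      b.defineKey("word");
      b.defineDone();
      call missing(word);
   end;
   set have;
   goodcount = 0;
   badcount = 0;
   /* Scan each word out of the comment and look it up */
   do i = 1 to countw(comment, ' ,.!?;:"');
      word = lowcase(scan(comment, i, ' ,.!?;:"'));
      if g.find() = 0 then goodcount = goodcount + 1;
      else if b.find() = 0 then badcount = badcount + 1;
   end;
   drop i word;
run;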

Other notes:

These tests were run on my Toshiba laptop: an Intel i5 @ 2.5 GHz, 6 GB of installed memory, Windows 7 64-bit.

The data used was a set of Toronto restaurant reviews pulled off the web. For the large tests, the raw data was reproduced 100 times, ending up with 263,360,800 words and 1,405,426,300 characters.

On the small keyword tests I used 320 keywords (187 good, 133 bad). On the big keyword tests I used 17,346 keywords (10,136 good, 7,210 bad).

I'll keep the data for a week, in case anybody wants anything tested.

Final comments:

Thank you all for providing such wonderful illustrations of the different ways that SAS can solve a real-world problem. I've learned a tremendous amount from this (in addition to never saying "I'll do that" again), and I know that this knowledge will be of great value in the future.

I'm considering writing a paper about this for SGF or some other venue. If any of you would not like me to use your software, please reply in the comments. On the other hand, if any of you would like to be listed as co-authors, also reply to that effect, and we'll figure it out offline.

Tom

TomKari
Onyx | Level 15

To @sucksatsas

The results are in from the performance testing. Please see the above post.

Thanks,

  Tom

Haikuo
Onyx | Level 15

Thank you very much, Tom, for the tremendous efforts you have committed. Feel free to use my code in your paper.

Haikuo

PGStats
Opal | Level 21

Wow Tom, that's what I call serious testing! Thanks a lot. Feel free to use my code.

PG

Ksharp
Super User

Maybe I should make my code better.

Feel free to use my code in your paper, Tom. You know my e-mail.

Ksharp

Vince28_Statcan
Quartz | Level 8

Hi Tom,

If it's not too late, I'd appreciate it if you could try a true PRX-based approach, where PRX isn't used only as a word reader but actually does the work of finding the good/bad words too. I doubt it will compete with the other methods on large strings, because the .NET framework PRX engine is NFA and that's fairly bad for alternation-construct efficiency, but anyway, here's how it goes:

data good;
   input good $;
   cards;
wow
great
ok
good
better
best
;

data bad;
   input bad $;
   cards;
bad
meh
boring
never
;

data have;
   input comment $50.;
   cards;
"Wow so great"
"It's OK"
"Good but boring"
"Meh"
"Good Good Good Better Best, Never let it rest"
;

/* Build '|'-delimited alternation lists for the two patterns */
proc sql noprint;
     select good into :good separated by '|' from good;
     select bad  into :bad  separated by '|' from bad;
quit;

data temp;
     /* Compile the two patterns once, on the first iteration only */
     if _N_ = 1 then do;
          retain prxidgood prxidbad;
          prxidgood = prxparse("/\b(?:&good.)\b/i");
          prxidbad  = prxparse("/\b(?:&bad.)\b/i");
     end;
     set have;

     /* Count the good words */
     start = 1;
     goodcount = 0;
     badcount = 0;
     do until (pos1 = 0);
          call prxnext(prxidgood, start, -1, comment, pos1, len);
          if pos1 > 0 then goodcount = goodcount + 1;
     end;

     /* Restart from the beginning of the string for the bad words */
     start = 1;
     do until (pos2 = 0);
          call prxnext(prxidbad, start, -1, comment, pos2, len);
          if pos2 > 0 then badcount = badcount + 1;
     end;

     drop start pos1 pos2 len prxidgood prxidbad;
run;

If it's not lagging too far behind, I could try to optimize it into a single loop, using the $1 and $2 capture-group constructs with a single &good.|&bad. regex and counting according to which alternative matched. I'd have to read further about PRXNEXT and what can be done, however.
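Here is an untested sketch of what I have in mind, assuming CALL PRXPOSN can report capture-buffer positions after a CALL PRXNEXT (my reading of the documentation; the variable names are mine):

data temp2;
     if _N_ = 1 then do;
          retain prxid;
          /* Buffer 1 captures good words, buffer 2 captures bad words */
          prxid = prxparse("/\b(?:(&good.)|(&bad.))\b/i");
     end;
     set have;
     start = 1;
     goodcount = 0;
     badcount = 0;
     do until (pos = 0);
          call prxnext(prxid, start, -1, comment, pos, len);
          if pos > 0 then do;
               call prxposn(prxid, 1, gpos, glen);  /* did the 'good' buffer fire? */
               if gpos > 0 then goodcount = goodcount + 1;
               else badcount = badcount + 1;
          end;
     end;
     drop prxid start pos len gpos glen;
run;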

Thanks!

Very interesting thread by the way

Vincent

*edit: updated according to data _null_'s comment below. It should definitely not be computed on each data step iteration.

*edit: added the i option to the regex to ignore case, as mentioned by Haikuo below.

*edit: added the \b...\b to fully delimit words, as PG pointed out. Thanks. However, one of the strengths of regexes over SCAN is that you can find embedded words, and in different scenarios that may achieve more of the OP's goal. The o option did not appear to work in my testing, and sadly the \b...\b forces me to add parentheses, which means adding a capturing group to the regex and thus significantly decreasing efficiency. To circumvent that, I added ?: at the start of the group to make it non-capturing. Small-scale tests show it is working as intended.

I did not know about the o option before, as I had done most of my regex self-learning on MSDN, and the only five .NET framework options discussed there are imnsx. I can't seem to find what to search SAS Help for to get the list of available Perl options; if anyone could point it out, that would be much appreciated. I'm still stuck with SAS 9.2 at Statcan, as I haven't requested 9.3 yet, hoping to jump on 9.4 testing as soon as we get some licenses.

TomKari
Onyx | Level 15

I'll give it a try in the next few days.

Tom

data_null__
Jade | Level 19

Should these statements be executed for every iteration of the data step loop?

     prxidgood=prxparse("/&good/");

     prxidbad=prxparse("/&bad/");

Haikuo
Onyx | Level 15

LOL, DN. Can't you be less sharp sometimes?

Haikuo

PGStats
Opal | Level 21

Either you use the 'o' suffix:

prxidgood=prxparse("/&good/io");

prxidbad=prxparse("/&bad/io");

or RETAIN and _n_=1:

retain prxidgood prxidbad;

if _n_=1 then do;

     prxidgood=prxparse("/&good/i");

     prxidbad=prxparse("/&bad/i");

     end;

 

Note that the patterns will match word parts: /ordinary/ will match "extraordinary". It might be better to use the pattern "/\b(&good.)\b/i".
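For example:

data _null_;
   s = "extraordinary";
   part  = prxmatch("/ordinary/i", s);      /* returns 6: matches inside the word */
   whole = prxmatch("/\bordinary\b/i", s);  /* returns 0: the \b boundaries reject it */
   put part= whole=;
run;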

PG


PG
Haikuo
Onyx | Level 15


It is a very nice addition to the pool of code we have. However, to get the expected results, you may want to add the "i" option so that PRX ignores case:

prxidgood=prxparse("/&good/i");

prxidbad=prxparse("/&bad/i");


Haikuo

TomKari
Onyx | Level 15

Here are the results for Vince28's code.

I suspected from the earlier tests that using PRX in this way would be problematic, and that was the case.

In the "full volume" run, with 17,000 keywords in total, SAS failed with

ERROR: The length of the value of the macro variable GOOD (65540) exceeds the maximum length (65534). The value has been
       truncated to 65534 characters.

so I reduced the volumes of both comments and keywords, and ran it with different combinations. Here are the results:

                           Run 1      Run 2      Run 3      Run 4      Run 5      Run 6      Run 7
comment records           27,155     27,155     27,155    271,550    271,550  2,715,500  2,715,500
keywords                      33        320      3,000         33        320         33     17,346
real time (mm:ss)       00:12.98   01:44.87   15:02.65   02:14.30   17:10.08   23:10.12     failed
user cpu time (mm:ss)   00:12.51   01:41.97   14:50.71   02:09.87   16:58.71   22:07.69
system cpu time (mm:ss) 00:00.31   00:00.37   00:00.57   00:03.26   00:04.04   00:26.67
memory (KB)            11,306.65  11,306.65  11,306.65  11,306.65  11,362.89  11,306.65
OS Memory (KB)         16,676.00  16,676.00  17,516.00  17,516.00  16,932.00  16,676.00


With the full volume of comment records, using only 33 keywords consumed 23 minutes, compared to the same run for PGStats1, which took just under 4 minutes.

Tom

Vince28_Statcan
Quartz | Level 8

Tyvm for the effort, Tom.

I did expect full volume to have issues. Alternation constructs are brutal on NFA PRX efficiency. Nonetheless, I am glad to see the results. I wish SAS supported a "DFA" option for perl.

Vincent

Fugue
Quartz | Level 8

The "full volume" test would understandably fail considering the SAS limit on macro variable length. There is a macro sample at http://support.sas.com/kb/39/605.html?utm_source=dlvr.it&utm_medium=twitter that illustrates how to create a string that goes beyond the 64K limit for macro variables.

Also, results that are heavily dependent on macro variables would be affected by settings such as MSYMTABMAX and MVARSIZE.
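For instance, one way to stay under the limit without that macro (just a sketch; the names goodpat1-goodpatN and the 32,000-byte chunk size are illustrative) is to emit the keyword list as several macro variables, each safely below the limit, and compile one PRXPARSE pattern per chunk:

data _null_;
   set good end=last;
   length buf $ 32000;
   retain buf;
   /* Flush the buffer to a macro variable before it would overflow */
   if lengthn(buf) + lengthn(good) + 1 > 32000 then do;
      nchunk + 1;
      call symputx(cats('goodpat', nchunk), buf);
      buf = strip(good);
   end;
   else buf = catx('|', buf, good);
   if last then do;
      nchunk + 1;
      call symputx(cats('goodpat', nchunk), buf);
      call symputx('ngoodpat', nchunk);  /* chunk count, for a %DO loop later */
   end;
run;

Each goodpatN chunk then compiles into its own pattern, the per-pattern counts are summed, and the bad list is handled the same way.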

TomKari
Onyx | Level 15

Yes, if this option had shown comparable performance, it would have been worth investigating.

However, on this one the other options are simply much more appropriate, so I decided to leave it at that.

Tom
