@FreelanceReinh Thank you for making this discussion very interactive and interesting. One more favor, as I'm afraid we (students at DePaul) are not allowed to play with large datasets in our college lab: may I request a performance test of this, similar to the DOW, a self-imposed interleave?
data file;
  input pat_id drug $ ; /* each line carries a third value (the expected per-ID count), which this INPUT statement does not read */
  cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

data want;
  do until(last);
    /* self-interleave: read the dataset against itself, BY pat_id */
    set file(in=a) file(in=b) end=last;
    by pat_id ;
    if first.pat_id then call missing(count);
    if a then count + 1; /* first pass through the BY group: count observations */
    if b then output;    /* second pass: write each observation with the final count */
  end;
run;
Very nice! I don't think I've seen this trick before. (Will add it to my SAS notes.)
Run time was 13 s (+/- 0.5 s) in a few repeated runs, i.e. rank 2 in the "competition".
@FreelanceReinh Thank you, sir. As always, the "sir" goes to the special few 🙂 mentioned before lol 🙂
@Cruise Fun to find different ways to solve 🙂 Take care!
Nice. It worked out fine for my actual dataset with 8 million observations, even though I had forgotten to sort my data on the ID variable first.
NOTE: There were 8827728 observations read from the data set L.P02_16.
NOTE: There were 8827728 observations read from the data set L.P02_16.
NOTE: The data set L.P02_16A has 8827728 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 45.35 seconds
cpu time 2.37 seconds
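If the input is not already grouped by the ID variable, the self-interleave step needs a sort first. A minimal sketch, using the dataset name from the log above and assuming the ID variable is pat_id as in the sample data (the OUT= name SORTED is a hypothetical placeholder):

```sas
/* Sort by the BY variable before running the self-interleave step.   */
/* L.P02_16 is the dataset from the log above; SORTED is hypothetical. */
proc sort data=L.P02_16 out=sorted;
  by pat_id;
run;
```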
Great. Just to get back to your initial post: The "PROC FREQ + MERGE" technique you mentioned took 35 s on my machine (using the "competition" dataset, of course). Replacing PROC FREQ with PROC SUMMARY (I had thought the BY statement might be beneficial) was even worse: 45 s.
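For reference, one common form of the "PROC FREQ + MERGE" technique mentioned above, sketched against the small sample dataset FILE from earlier in the thread (the names COUNTS and WANT are placeholders; PROC FREQ's OUT= dataset supplies the COUNT variable):

```sas
/* Count observations per pat_id ... */
proc freq data=file noprint;
  tables pat_id / out=counts(keep=pat_id count);
run;

/* ... then merge the count back onto the detail rows */
data want;
  merge file counts;
  by pat_id;
run;
```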
@novinosrin wrote:
@FreelanceReinh Thank you for making this discussion very interactive and interesting. One more favor, as I'm afraid we (students at DePaul) are not allowed to play with large datasets in our college lab: may I request a performance test of this, similar to the DOW, a self-imposed interleave?
data file;
  input pat_id drug $ ;
  cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

data want;
  do until(last);
    set file(in=a) file(in=b) end=last;
    by pat_id ;
    if first.pat_id then call missing(count);
    if a then count + 1;
    if b then output;
  end;
run;
I'm actually surprised that it took slightly longer than the DOW loop. I would have expected almost identical times, as the mechanism is the same (read a block of observations through the first handle, then read the same block through the second handle and output). What happens if CALL MISSING is replaced with count = 0;? (A simple assignment may be faster than a subroutine call.)
Good point. I've just run the variant with "count=0" three times, but the run times were always slightly longer (!) than the maximum run time of three runs of the "call missing(count)" version. But it's very close.
Another six runs of the double DOW loop (three with your "newcount+1", three with "count=sum(count,1)", thus saving the reset to zero at the beginning of the data step) resulted in five shorter run times (no clear preference for one of the variants). On average it seems that the double DOW loop leads by approx. 0.5 seconds or approx. 4 percent. (@novinosrin's approach ran a bit faster than it did yesterday.)
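For comparison, the double DOW loop variant being timed here might look like the sketch below (my reconstruction, not necessarily the exact code used). The "count=sum(count,1)" form needs no explicit reset because COUNT is an assigned (non-retained) variable, so it is set to missing at the top of each implicit iteration, i.e. once per BY group:

```sas
data want;
  /* first DOW loop: count the observations in the BY group */
  do until(last.pat_id);
    set file;
    by pat_id;
    count = sum(count, 1); /* SUM() treats the missing initial value as zero */
  end;
  /* second DOW loop: reread the same group and output with the final count */
  do until(last.pat_id);
    set file;
    by pat_id;
    output;
  end;
run;
```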
BTW, congrats on your 10,000th post! Remarkable achievement!
@FreelanceReinh wrote:
BTW, congrats on your 10,000th post! Remarkable achievement!
Completely overlooked that. For the record, it happened around 2018-06-29T09:30:00+02:00!