@FreelanceReinh Thank you for making this discussion very interactive and interesting. One more favor, as I'm afraid we (students at DePaul) are not allowed to play with large datasets in our college lab: may I request a performance test of this, similar to the DOW, a self-imposed interleave?
data file;
  input pat_id drug $ ; /* each line carries a third value (the expected per-ID count), which this INPUT statement does not read */
  cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

data want;
  do until(last);
    /* self-interleave: read the dataset against itself, BY pat_id */
    set file(in=a) file(in=b) end=last;
    by pat_id ;
    if first.pat_id then call missing(count);
    if a then count + 1; /* first pass through the BY group: count observations */
    if b then output;    /* second pass: write each observation with the final count */
  end;
run;
Very nice! I don't think I've seen this trick before. (Will add it to my SAS notes.)
Run time was 13 s (+/- 0.5 s) in a few repeated runs, i.e. rank 2 in the "competition".
@FreelanceReinh Thank you, sir. As always, the "sir" goes to the special few 🙂 mentioned before lol 🙂
@Cruise Fun to find different ways to solve 🙂 Take care!
Nice. It worked out fine for my actual dataset with 8 million observations, even though I had forgotten to sort my data on the ID variable first.
NOTE: There were 8827728 observations read from the data set L.P02_16.
NOTE: There were 8827728 observations read from the data set L.P02_16.
NOTE: The data set L.P02_16A has 8827728 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 45.35 seconds
cpu time 2.37 seconds
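If the input is not already grouped by the ID variable, the self-interleave step needs a sort first. A minimal sketch, using the dataset name from the log above and assuming the ID variable is pat_id as in the sample data (the OUT= name SORTED is a hypothetical placeholder):

```sas
/* Sort by the BY variable before running the self-interleave step.   */
/* L.P02_16 is the dataset from the log above; SORTED is hypothetical. */
proc sort data=L.P02_16 out=sorted;
  by pat_id;
run;
```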
Great. Just to get back to your initial post: The "PROC FREQ + MERGE" technique you mentioned took 35 s on my machine (using the "competition" dataset, of course). Replacing PROC FREQ with PROC SUMMARY (I had thought the BY statement might be beneficial) was even worse: 45 s.
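For reference, one common form of the "PROC FREQ + MERGE" technique mentioned above, sketched against the small sample dataset FILE from earlier in the thread (the names COUNTS and WANT are placeholders; PROC FREQ's OUT= dataset supplies the COUNT variable):

```sas
/* Count observations per pat_id ... */
proc freq data=file noprint;
  tables pat_id / out=counts(keep=pat_id count);
run;

/* ... then merge the count back onto the detail rows */
data want;
  merge file counts;
  by pat_id;
run;
```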
@novinosrin wrote:
@FreelanceReinh Thank you for making this discussion very interactive and interesting. One more favor, as I'm afraid we (students at DePaul) are not allowed to play with large datasets in our college lab: may I request a performance test of this, similar to the DOW, a self-imposed interleave?
data file;
  input pat_id drug $ ;
  cards;
3 648 2
3 009 2
4 504 2
4 569 2
5 003 1
;

data want;
  do until(last);
    set file(in=a) file(in=b) end=last;
    by pat_id ;
    if first.pat_id then call missing(count);
    if a then count + 1;
    if b then output;
  end;
run;
I'm actually surprised that it took slightly longer than the DOW loop. I would have expected almost identical times, as the mechanism is the same (read a block of observations through the first handle, then read the same block through the second handle and output). What happens if CALL MISSING is replaced with count = 0;? (A simple assignment may be faster than a subroutine call.)
Good point. I've just run the variant with "count=0" three times, but the run times were always slightly longer (!) than the maximum run time of three runs of the "call missing(count)" version. But it's very close.
Another six runs of the double DOW loop (three with your "newcount+1", three with "count=sum(count,1)", thus saving the reset to zero at the beginning of the data step) resulted in five shorter run times (no clear preference for one of the variants). On average it seems that the double DOW loop leads by approx. 0.5 seconds or approx. 4 percent. (@novinosrin's approach ran a bit faster than it did yesterday.)
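For comparison, the double DOW loop variant being timed here might look like the sketch below (my reconstruction, not necessarily the exact code used). The "count=sum(count,1)" form needs no explicit reset because COUNT is an assigned (non-retained) variable, so it is set to missing at the top of each implicit iteration, i.e. once per BY group:

```sas
data want;
  /* first DOW loop: count the observations in the BY group */
  do until(last.pat_id);
    set file;
    by pat_id;
    count = sum(count, 1); /* SUM() treats the missing initial value as zero */
  end;
  /* second DOW loop: reread the same group and output with the final count */
  do until(last.pat_id);
    set file;
    by pat_id;
    output;
  end;
run;
```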
BTW, congrats on your 10,000th post! Remarkable achievement!
@FreelanceReinh wrote:
BTW, congrats on your 10,000th post! Remarkable achievement!
Completely overlooked that. For the record, it happened around 2018-06-29T09:30:00+02:00!