ScottBass
Rhodochrosite | Level 12

Hi,

Luke's original post said "...there must be a simpler way of doing this", but many of the replies are focused on performance.

If you discard simplicity and go for performance, and if your data will fit in memory, then IIRC you can use a hash object for this, and do this in one pass of the data.

I don't have time to look it up now, but hit the doc on the hash object and search for "suminc" (I think).  Anyway, there's an example in there of deriving a sum based on the keys (your by variable), and you then use the output statement to write out the results after you've hit EOF.  (I think it's kinda cool that you can write a permanent dataset from a data _null_ step!)
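To make the idea concrete, here is a minimal sketch of the SUMINC approach, with hypothetical dataset names HAVE (input, sorted by UID) and WANT (output), and written as a regular data step rather than data _null_ for simplicity.  Untested, so treat it as a starting point:

```sas
/* One-pass frequency count per UID using a hash object with SUMINC.
   Each REF() call adds the current value of ONE to the key's summary. */
data want (keep=uid freq);
   if _n_ = 1 then do;
      declare hash h (suminc: 'one', ordered: 'a');
      h.defineKey('uid');
      h.defineDone();
   end;
   set have end=eof;
   one = 1;
   rc = h.ref();                  /* add the key if new, bump its summary */
   if eof then do;
      declare hiter hi ('h');
      do while (hi.next() = 0);   /* NEXT() sets the UID host variable */
         rc = h.sum(sum: freq);   /* retrieve the key summary for UID */
         output;
      end;
   end;
run;
```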

HTH,

Scott


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ScottBass
Rhodochrosite | Level 12

Hi All,

I had a further chance to "play" with this, and thought I'd share my code.

Luke's original post only produced a frequency count per UID, due to his subsetting if statement on last.uid.  However, his text said:

What I need is for the value of the supplied variable for last.UID to be applied to all observations of that UID so that when I later class by each supplier I get an accurate count of the total time that record has been supplied.


This is a bit ambiguous; does Luke just want a frequency count per UID, as his code generates, or does he want the frequency count to be "applied to all observations"?  Most of the suggested approaches assumed the latter.


The double set statement "set supplied1 (in=a) supplied1; by uid" is a pretty cool construct I'd not seen before.  I stepped through it in the data step debugger to see that it reads all observations in the first by group from the first dataset, then all the observations in the first by group from the second dataset, and so on.  If I recall my terminology correctly, it is interleaving the data from the two datasets by UID "chunks".  This then allows you to derive the frequency counts from the first dataset (in=a), then "merge" them back into the observations from the second dataset.
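A minimal sketch of that construct (dataset and variable names are illustrative, and assume SUPPLIED1 is sorted by UID):

```sas
/* First pass through each by group (in=a) tallies the count;
   the second pass re-reads the same group and outputs every
   observation with the completed count attached. */
data want;
   set supplied1 (in=a) supplied1;
   by uid;
   if first.uid then count = 0;
   if a then count + 1;     /* first copy: tally obs in this UID group */
   else output;             /* second copy: COUNT now holds the total  */
run;
```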


This is analogous to the message you often see in SQL: "The query requires remerging summary statistics back with the original data."


I "played" with the different approaches, trying to see how I could optimize the performance.  I've attached my .sas file and .log file when I ran it on my server.


Comments:

1) I've been using a "SPDE Work" library a lot lately to get better I/O performance in my large jobs.  It's a really easy change to make, with some caveats (potential file locking, no support for views, doesn't support sasfile, and so on).  See the attached file for more details.
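For anyone who hasn't tried it, the change is essentially a one-line libname; the path and options below are illustrative, not a recommendation:

```sas
/* Point a library at the Work directory using the SPD Engine.
   TEMP=YES deletes the library's files at the end of the session. */
libname spdework spde "%sysfunc(pathname(work))" temp=yes;
```

You then write your intermediate datasets to SPDEWORK instead of WORK.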

2) Even though we're doing a double read of the data on disk, I didn't get an appreciable performance gain by using sasfile, and then doing the double read from memory.
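For reference, the sasfile variant I mean looks like this (hypothetical dataset name; the step between LOAD and CLOSE reads from the in-memory copy):

```sas
sasfile work.supplied1 load;    /* load the dataset into memory */

data want;                      /* both reads now come from memory */
   set supplied1 (in=a) supplied1;
   by uid;
   if first.uid then count = 0;
   if a then count + 1;
   else output;
run;

sasfile work.supplied1 close;   /* release the memory */
```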

3) I didn't get the expected performance gains using a hash object, and given those results, I wouldn't approach the problem this way in practice.  It was just a chance for me to "play" and learn more about hash object key summaries.  See SAS(R) 9.3 Language Reference: Concepts, Second Edition, and scroll to "Maintaining Key Summaries", for more details.  I'm also not sure my code is optimized; while I do first read the data into memory and then iterate on the copy in memory, I suspect having to iterate over the hash introduces overhead.  Using a DOW loop might be more efficient.


Anyway, I hope some of this is useful :-)


Scott


Peter_C
Rhodochrosite | Level 12

Hi Scott

in case no one else has said it, thank you for the depth of your investigation, the results and the detail of your review.

I'm also responding to discuss your comment:

"This is analogous to the message you often see in SQL: 'The query requires remerging summary statistics back with the original data.'"

imho the data step is not "remerging" as SQL does, because this "remerge" occurs only for the rows of each by group, whereas SQL stores the whole table in spool, cache, or memory before performing a whole-table merge (at least in my experience of Teradata SQL).

Granted, when there is no by statement, or there is only one value of the by variable, a data step of this style would indeed be equivalent to an "SQL remerge".

regards

peterC

ScottBass
Rhodochrosite | Level 12

Hi Peter,

Thanks for your comments.  I just meant that SQL is first deriving the summary statistics (under the covers), then remerging with the original data.  I didn't mean to imply that the two processes (data step vs. sql) were identical.

Regards,

Scott


