ScottBass
Rhodochrosite | Level 12

Hi,

Luke's original post said "...there must be a simpler way of doing this", but many of the replies are focused on performance.

If you discard simplicity and go for performance, and if your data will fit in memory, then IIRC you can use a hash object for this, and do this in one pass of the data.

I don't have time to look it up now, but hit the doc on the hash object and search for "suminc" (I think).  Anyway, there's an example in there of deriving a sum based on the keys (your by variable), and you then use the output statement to write out the results after you've hit EOF.  (I think it's kinda cool that you can write a permanent dataset from a data _null_ step!)
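To make the idea concrete, here is a minimal sketch of the SUMINC approach, with hypothetical dataset names HAVE (input, sorted by UID) and WANT (output), and written as a regular data step rather than data _null_ for simplicity.  Untested, so treat it as a starting point:

```sas
/* One-pass frequency count per UID using a hash object with SUMINC.
   Each REF() call adds the current value of ONE to the key's summary. */
data want (keep=uid freq);
   if _n_ = 1 then do;
      declare hash h (suminc: 'one', ordered: 'a');
      h.defineKey('uid');
      h.defineDone();
   end;
   set have end=eof;
   one = 1;
   rc = h.ref();                  /* add the key if new, bump its summary */
   if eof then do;
      declare hiter hi ('h');
      do while (hi.next() = 0);   /* NEXT() sets the UID host variable */
         rc = h.sum(sum: freq);   /* retrieve the key summary for UID */
         output;
      end;
   end;
run;
```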

HTH,

Scott


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
ScottBass
Rhodochrosite | Level 12

Hi All,

I had a further chance to "play" with this, and thought I'd share my code.

Luke's original post only produced a frequency count per UID, due to his subsetting if statement on last.uid.  However, his text said:

What I need is for the value of the supplied variable for last.UID to be applied to all observations of that UID so that when I later class by each supplier I get an accurate count of the total time that record has been supplied.


This is a bit ambiguous; does Luke just want a frequency count per UID, as his code generates, or does he want the frequency count to be "applied to all observations"?  Most of the suggested approaches assumed the latter.


The double set statement "set supplied1 (in=a) supplied1; by uid" is a pretty cool construct I'd not seen before.  I stepped through it in the data step debugger to see that it reads all observations in the first by group from the first dataset, then all the observations in the first by group from the second dataset, and so on.  If I recall my terminology correctly, it is interleaving the data from the two datasets by UID "chunks".  This then allows you to derive the frequency counts from the first dataset (in=a), then "merge" them back into the observations from the second dataset.
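A minimal sketch of that construct (dataset and variable names are illustrative, and assume SUPPLIED1 is sorted by UID):

```sas
/* First pass through each by group (in=a) tallies the count;
   the second pass re-reads the same group and outputs every
   observation with the completed count attached. */
data want;
   set supplied1 (in=a) supplied1;
   by uid;
   if first.uid then count = 0;
   if a then count + 1;     /* first copy: tally obs in this UID group */
   else output;             /* second copy: COUNT now holds the total  */
run;
```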


This is analogous to the message you often see in SQL: "The query requires remerging summary statistics back with the original data."


I "played" with the different approaches, trying to see how I could optimize the performance.  I've attached my .sas file and .log file when I ran it on my server.


Comments:

1) I've been using a "SPDE Work" library a lot lately to get better I/O performance in my large jobs.  It's a really easy change to make, with some caveats (potential file locking, no support for views, doesn't support sasfile, and so on).  See the attached file for more details.
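For anyone who hasn't tried it, the change is essentially a one-line libname; the path and options below are illustrative, not a recommendation:

```sas
/* Point a library at the Work directory using the SPD Engine.
   TEMP=YES deletes the library's files at the end of the session. */
libname spdework spde "%sysfunc(pathname(work))" temp=yes;
```

You then write your intermediate datasets to SPDEWORK instead of WORK.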

2) Even though we're doing a double read of the data on disk, I didn't get an appreciable performance gain by using sasfile, and then doing the double read from memory.
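For reference, the sasfile variant I mean looks like this (hypothetical dataset name; the step between LOAD and CLOSE reads from the in-memory copy):

```sas
sasfile work.supplied1 load;    /* load the dataset into memory */

data want;                      /* both reads now come from memory */
   set supplied1 (in=a) supplied1;
   by uid;
   if first.uid then count = 0;
   if a then count + 1;
   else output;
run;

sasfile work.supplied1 close;   /* release the memory */
```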

3) I didn't get the expected performance gains using a hash object, and given those results, I wouldn't approach the problem this way in practice.  It was just a chance for me to "play" and learn more about hash object key summaries.  See SAS(R) 9.3 Language Reference: Concepts, Second Edition, and scroll to "Maintaining Key Summaries", for more details.  I'm also not sure my code is optimized; while I do first read the data into memory and then iterate on the copy in memory, I suspect having to iterate over the hash introduces overhead.  Using a DOW loop might be more efficient.


Anyway, I hope some of this is useful :-)


Scott


Peter_C
Rhodochrosite | Level 12

Hi Scott

in case no one else has said it, thank you for the depth of your investigation, the results and the detail of your review.

I'm also responding to discuss your comment:

"This is analogous to the message you often see in SQL: 'The query requires remerging summary statistics back with the original data.'"

imho the data step is not "remerging" as SQL does, because this "remerge" occurs only for the rows of each by group, whereas SQL stores the whole table in spool, cache, or memory before performing a whole-table merge (at least in my experience of Teradata SQL).

Granted, when there is no by statement, or there is only one value of the by variable, a data step of this style would indeed be equivalent to an "SQL remerge".

regards

peterC

ScottBass
Rhodochrosite | Level 12

Hi Peter,

Thanks for your comments.  I just meant that SQL is first deriving the summary statistics (under the covers), then remerging with the original data.  I didn't mean to imply that the two processes (data step vs. sql) were identical.

Regards,

Scott


