Help using Base SAS procedures

Best practice on returning a data set

Reply
Occasional Contributor
Posts: 7

Best practice on returning a data set

I'm fairly new to SAS and I'd like to really get into returning data sets in the most efficient and economical fashion.  My lack of on the job experience leads me to the following.  If a client requested the have the following metrics for mean, count and sum by dpt (department), what would be the best approach to take?

data sample1;

infile datalines dlm=",";

length fn $ 12;

input fn $ dpt $ salary;

datalines;

john doe, fin, 50000

jane doe, fin, 69393

kim smith, it, 93893

ken smith, it, 80493

;

run;

quit;

something as simple as:

proc means data=sample1 sum n mean;

class dpt;

run;

quit;

or would we write PROC SQL statement or even a data set process where we're grouping to get our totals.  The Proc Means approach is really quick but it doesn't push the boundaries of being a good programmer.  With an approach like

data mt_samp (drop=fn);

  set sample1 ;

by dpt;

if First.dpt then total = 0;

if First.dpt then cnt = 0;

if First.dpt then ave = total /cnt;  /** this line doesn't work, sorry

  total+salary;

  cnt+1;

if Last.dpt then output;

run;

quit;

Thanks,

SJ         

Contributor
Posts: 41

Re: Best practice on returning a data set

Ghm... As I understood from your post, you don't really need help on calculating metrics:

Proc means does it perfectly.

If you need to put the result of means procedure in a dataset, you need just to add a output statement into procedure:

proc means data=sample1 noprint ;

class dpt;

output out=sample_metrics sum=SumByDept n=NumOfEntries mean=MeanByDept;

run;

quit;

you can add other metrics in output statement.

p.s.

If you want to find the other way: use proc sql. It's also possible with data step, but it's not really convenient for generic task, since you need to perform sorting operations outside of the step and have to control auxilliary variables for aggregating your stats.

Valued Guide
Posts: 632

Re: Best practice on returning a data set

'push the boundaries of being a good programmer'....  Being a good programmer does NOT mean that the most complicated solution should be used.  A good programmer understands the range of possible programming solutions and picks the best one for the task.  The better the programmer the broader the range of choices and the better the understanding of all the ramifications of that choice.

In these forums you often see the range of solutions to problems that reflect these levels of understanding.  At SAS Global Forum next month I will be presenting a paper that addresses some of the issues of those choices and the topics came from these forums.

PROC Star
Posts: 1,089

Re: Best practice on returning a data set

Using PROC MEANS does in fact push the boundaries of being a good programmer. After 30 plus years in this business, there's no better definition of the job of a programmer than "to cost-effectively meet the requirements of the client".

In your example, the following advantages are derived from using PROC MEANS:

  • It's the shortest time to code, which means lower cost for your client (on an hourly rate, meaning they'll be happy to hire you again), or more money in your pocket on a fixed-rate basis.
  • You can assume that SAS does the statistical calculations correctly. If you code the statistics yourself, you risk minor coding errors, therefore you must devote MUCH more time to testing...back to the lower time-to-delivery factor.
  • You can assume that SAS (a $2.7B company) expends considerable R&D resources making its software as efficient and easy to use as possible. By using their PROC, you piggyback on all of these advances.

In addition, at some point you depart, and I have to come in to maintain or modify your code. If you were in that situation, which program would you rather have to investigate? Again, your reputation matters. Would you rather your successor report to management..."Boy, that SJ. Did he ever do a nice job on the summarization code" or "Jeez, that SJ. He could have just used PROC MEANS, but instead we're saddled with a bunch of DATA step code. He doesn't know ANYTHING about SAS". Or even worse..."That cretin SJ: he could have used PROC MEANS, but instead he wrote some idiotic DATA step code that doesn't work right. Oh, by the way, you're going to have to restate your financials for the last five years".

So, as a "manager/professional" response, you should absolutely, definitely use PROC MEANS. However, your other point is valid. To add value as a developer (NOT a programmer), you will need a very large stock of methods and techniques to succeed. Situations will arise where you'll need to recommend an approach, and you'll have a big blank sheet of paper in front of you. How will you know how to proceed? First, you'll build up a repertoire of knowledge from just doing your job in the most efficient way possible. Second, talk to people (like your post on this forum...way to go!). Third, you'll need to invest in both formal and self-directed training. I would frequently stay a couple of hours after work (on my own time), when it's quiet, and investigate different alternatives, or try some new features of SAS that I didn't understand fully. That way, when it comes up in a meeting, I'm prepared to offer a recommendation backed by experience, where the other folks simply don't know.

Remember, a career is a very long time. I started in this business before IBM released the Personal Computer, and have had to re-invent myself a half dozen time. But when things go well, nothing is more satisfying!

Best,

  Tom

Occasional Contributor
Posts: 7

Re: Best practice on returning a data set

Tom,

Thanks for the response.  I was looking for some insight on the life of a SAS Analyst/Developer and you definitely brought the sun to this thread!

Thank you all,

SJ

Super User
Posts: 5,074

Re: Best practice on returning a data set

Agree with all the other posters.  As you think about the problem, also consider that you are not that far away from requiring more of a strategy.  For example, let's make two assumptions to complicate the task just a little.  First, assume you have a lot of data so that efficiency is important.  And second, you don't have just a single variable SALARY, but a second variable BONUS.  And you have to get the average of SALARY + BONUS.  

The inefficient method would use a DATA step to add SALARY + BONUS, then use PROC MEANS on that total.  That's inefficient because it processes a large amount of data twice.  The efficient method would run PROC MEANS on the individual variables:

VAR SALARY BONUS;

It would save an output data set holding a separate observation for each DPT (rather than printing) and save the SUM statistic.  Then a DATA step could combine the sum of SALARY with the sum of BONUS.  The same DATA step could divide by _FREQ_ (automatically part of the output data set), to get averages.  Finally, print the results.  Remember, it is very fast to process a summary data set multiple times.  All in all, if you're considering what to learn next, it wouldn't hurt to learn the structure of a PROC MEANS / PROC SUMMARY output data set.

Good luck.


Ask a Question
Discussion stats
  • 5 replies
  • 300 views
  • 4 likes
  • 5 in conversation