11-11-2015 12:07 PM
I need to create a new dataset that is like an existing one, but with some additional new variables. I understand that this can be done in the DATA step. To calculate some of the new variables, I need a PROC SUMMARY step. But that obviously ends the DATA step, even though it has not yet computed everything I need to compute. The process is as follows:
Part 1. DATA (newdata set olddata)
creates newdata from olddata, calculating some new variables in it along the way:
Part 2. PROC SUMMARY
continues the calcualtion begun in Part 1, summing up some column and storing the sum as a variable
Part 3. This is where I wish the DATA step in Part 1 could "pick up where it left off" and finish computing the new variables in NEWDATA, using the variable from Part 2 when needed. What is the easiest way of doing this?
Could I perhaps somehow open the dataset created in Part 1 and do the rest of the calculations on it with the variable from Part 2?
What I am doing now is having Part 1 create an intermediate dataset (Int_Data), from which I then compute the variable in Part 2, and then in Part 3 I am thinking of having another DATA step do the remaining calculations, with Int_Data now in the role of "old data" and the final dataset I want as the "new data". Is there some more elegant way of doing this, without intermediate datasets?
11-11-2015 12:28 PM
Sort of, but you will still need to read the data twice in the same DATA step. For example:
do until (done1);
set have end=done1;
** compute variables and sums;
do until (done2);
set have end=done2;
** repeat computations for same variables as above, and use sums in additional calculations;
Other strategies might exist where you may be able to run PROC SUMMARY first, before creating new variables. For example, if you have this as one of your computations:
liquids = 2* pints + 4*quarts;
If you are intending to run PROC SUMMARY to get the total for LIQUIDS, you can work around that. You can run an initial PROC SUMMARY to get the total of PINTS and the total of QUARTS. Then use that in a single DATA step. That sort of a strategy really depends on what the calculations are.
11-11-2015 02:10 PM
One elegant technique to do this kind of operation is automatic remerging within proc SQL. Summaries can be used directly in computations involving the original variables. For example, using sashelp.class, calculate the age difference relative to the median age for boys and girls, and keep only those within 1.5 years of the median. All this in a single query :
proc sql; select *, age - median(age) as ageDiff from sashelp.class group by sex having abs(age - median(age)) <= 1.5 order by sex, ageDiff, name; quit;
11-11-2015 03:04 PM
It probably is just a matter of thinking of the problem in a different way. Unless your data is huge it is probably easier to create a simple solution that might process the same data multiple times, but be clearer to understand.
PROC SQL can also help since you can express your calculations in terms of set operators and let the PROC optimize the way that it combines the sets for you.