I want to group by one column, by "id" in this example, to compute sum for some columns and report the last entry for some other columns (those columns would have to be specified).
Say I have this data:
data have;
input id time sales bonus commission toalstock totalcash totalvalue;
datalines;
1 1 10 1 1 100 50 200
1 2 10 1 1 100 50 200
1 3 15 2 1 80 50 200
1 4 10 1 1 90 50 200
1 5 20 2 2 100 50 200
1 6 23 2 2 133 52 222
;
My desired output would be:
data want;
input id time sales bonus commission toalstock totalcash totalvalue;
1 6 88 9 8 133 52 222
;
My main concern is computational efficiency as I'll be running this on tables with many more columns (about 200) and data entry (about 30000 rows) with different ids.
Proc SQL or native SAS are both welcome.
Rows might be shuffled if necessary.
Thanks,
Krish
I have found a way to utilize proc summary in taking the last value of a group:
data have;
input id time sales bonus commission toalstock totalcash totalvalue;
datalines;
1 1 10 1 1 100 50 200
1 2 10 1 1 100 50 200
1 3 15 2 1 80 50 200
1 4 10 1 1 90 50 200
1 5 20 2 2 100 50 200
1 6 23 2 2 133 52 222
;
proc summary data=have;
by id;
output
out=want (drop=_type_ _freq_)
sum(sales bonus commission)=
idgroup (last out (time toalstock totalcash totalvalue)=)
;
run;
proc print data=want noobs;
run;
Result:
id sales bonus commission time toalstock totalcash totalvalue 1 88 9 8 6 133 52 222
As you can see, you now "only" need to populate the sum()= and idgroup() options with the respective variable names.
If you want to keep the original horizontal order of variables, you will need another data step:
data want;
set
have (obs=0) /* retrieves the order of variables */
want /* retrieves the data */
;
run;
I want to group by one column, by "id" in this example, to compute sum for some columns and report the last entry for some other columns (those columns would have to be specified).
Say I have this data:
data have;
input id time sales bonus commission toalstock totalcash totalvalue;
datalines;
1 1 10 1 1 100 50 200
1 2 10 1 1 100 50 200
1 3 15 2 1 80 50 200
1 4 10 1 1 90 50 200
1 5 20 2 2 100 50 200
1 6 23 2 2 133 52 222
2...
...
3..
;
My desired output would be:
data want;
input id time sales bonus commission toalstock totalcash totalvalue;
1 6 88 9 8 133 52 222
2...
3...
;
My main concern is computational efficiency as I'll be running this on tables with many more columns (about 200) and data entry (about 30000 rows).
Proc SQL or native SAS are both welcome.
Rows might be shuffled if necessary.
Thanks,
Krish
data want(drop=_:);
do until (last.id);
set have;
by id;
_sales+sales; _bonus+bonus; _commission+commission;
end;
sales=_sales; bonus=_bonus; commission=_commission;
run;
My preferred way of doing this is a data step:
data have;
input id time sales bonus commission toalstock totalcash totalvalue;
datalines;
1 1 10 1 1 100 50 200
1 2 10 1 1 100 50 200
1 3 15 2 1 80 50 200
1 4 10 1 1 90 50 200
1 5 20 2 2 100 50 200
1 6 23 2 2 133 52 222
;
data want;
retain id time sales bonus commission toalstock totalcash totalvalue;
set have (
rename=(sales=_sales bonus=_bonus commission=_commission)
);
by id;
if first.id
then do;
sales = _sales;
bonus = _bonus;
commission = _commission;
end;
else do;
sales + _sales;
bonus + _bonus;
commission + _commission;
end;
if last.id;
drop _:;
run;
As you can see, my code does not use a hash object for 3 variables, so you won't need it for 200 either.
You have to play the rename-initialize-sum-drop game for all the variables that need summing.
Note:
The retain statement in my code is not necessary; it is mostly there to keep the horizontal order of columns. The retain for tzhe newly created variables is implicitly declared by using the increment statements.
Thanks for the quick replies.
Please note, in this example the computation is done on 6 columns and the suggested solutions so far work fine in this case.
What if I have to do the same on 200+ columns of which some need to be summed and for others the last entry will be reported?
Do I need to define the columns which used last.id in a separate table or perhaps using hash object?
How many of those +200 columns do you want to sum?
Ok. Well, the logic posted by both me and @Kurt_Bremser holds even for several hundred variables. Obviously, you would have to type out the variables that you want to sum. So unless there is some naming convention that is similar among the sum variables, I don't think you will gain much from using an array or a hash object.
@krish_101 wrote:
about 100 columns to be summed
and 100 other columns where last id is reported.
You are the coder, so it's up to you to tell SAS what to do. If you have some resource that can make the code data-driven (list(s) of variable names in a dataset), then we can show you how to utilize that to automatically create the code.
The following step is a small modification of what @Kurt_Bremser posted: it uses two arrays for the variables the need to be summed. Wrapping the step in a macro to further reduce listing the variables should be the next step.
data want;
retain id time sales bonus commission toalstock totalcash totalvalue;
set have (
rename=(sales=_sales bonus=_bonus commission=_commission)
);
by id;
array vars _sales _bonus _commission;
array sums sales bonus commission;
if first.id then do;
do _i = 1 to dim(vars);
sums[_i] = vars[_i];
end;
end;
else do;
do _i = 1 to dim(vars);
sums[_i] + vars[_i];
end;
end;
if last.id;
drop _:;
run;
I have found a way to utilize proc summary in taking the last value of a group:
data have;
input id time sales bonus commission toalstock totalcash totalvalue;
datalines;
1 1 10 1 1 100 50 200
1 2 10 1 1 100 50 200
1 3 15 2 1 80 50 200
1 4 10 1 1 90 50 200
1 5 20 2 2 100 50 200
1 6 23 2 2 133 52 222
;
proc summary data=have;
by id;
output
out=want (drop=_type_ _freq_)
sum(sales bonus commission)=
idgroup (last out (time toalstock totalcash totalvalue)=)
;
run;
proc print data=want noobs;
run;
Result:
id sales bonus commission time toalstock totalcash totalvalue 1 88 9 8 6 133 52 222
As you can see, you now "only" need to populate the sum()= and idgroup() options with the respective variable names.
If you want to keep the original horizontal order of variables, you will need another data step:
data want;
set
have (obs=0) /* retrieves the order of variables */
want /* retrieves the data */
;
run;
looks great
I created a list(s) of variable names in a dataset called valist which groups the columns to be summed under sumvars
data varlist;
input sumvars lastvars;
name1 name2
name5 name3
... ...
name70 name50
name100 name102
;
proc sql noprint;
select sumvars into : sumvarlist separated by " "
from varlist
;
quit;
proc sql noprint;
select lastvars into : lastvarlist separated by " "
from varlist
;
quit;
modify this part of the code
sum(&sumvarlist
)= idgroup (last out (&lastvarlist
)=)
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.