Suppose I have a dataset:
--Table 1--
ID Amount
A 0.1
A 0.2
A 0.3
B 0.4
B 0.5
I used retain to create a new variable of total amount of each ID
:
data sample; set sample; by ID amout; if first.ID then Total_amount = 0; Total_amount = Total_amount +amount; run;
And this will be the output dataset:
--Table 2--
ID Amount Total_amount
A 0.1 0.1
A 0.2 0.3
A 0.3 0.6
B 0.4 0.4
B 0.5 0.9
But now I want to assign the true total_amout to each obs. The output I want is:
--Table 2--
ID Amount Total_amount
A 0.1 0.6
A 0.2 0.6
A 0.3 0.6
B 0.4 0.9
B 0.5 0.9
Is there anyway to achieve it? Now I try to sort the data in descending order and than retain the total_amount.
data sample; set sample; by ID descending amout; if first.ID then Total_amount_1 = Total_amount; run;
--Table 2--
ID Amount Total_amount Total_amount_1
A 0.3 0.6 0.6
A 0.2 0.3 0.6
A 0.1 0.1 0.6
B 0.5 0.9 0.9
B 0.4 0.4 0.9
Although I can get what I want by this way, I think this method is kind of stupid. I hope there are some better methods.
If you don't really want/ need the cumulative total here is one way.
data have; input ID $ Amount ; datalines; A 0.1 A 0.2 A 0.3 B 0.4 B 0.5 ; proc sql; create table want as select a.*, b.total from have as a left join (select id, sum(amount) as total from have group by id) as b on a.id = b.id ; quit;
basic approach get the sum then merge back to the original data.
Lots of posts on the forum about this topic.
BTW, be careful of using the
Data setname;
set setname;
construct. This replaces your existing data set and a logic error could change values that you don't want and will require going back to recover the original values.
If you don't really want/ need the cumulative total here is one way.
data have; input ID $ Amount ; datalines; A 0.1 A 0.2 A 0.3 B 0.4 B 0.5 ; proc sql; create table want as select a.*, b.total from have as a left join (select id, sum(amount) as total from have group by id) as b on a.id = b.id ; quit;
basic approach get the sum then merge back to the original data.
Lots of posts on the forum about this topic.
BTW, be careful of using the
Data setname;
set setname;
construct. This replaces your existing data set and a logic error could change values that you don't want and will require going back to recover the original values.
Thank you so much!
This "automatic remerging" can easily be achieved in SQL:
proc sql;
create table want as
select
id,
amount,
sum(amount) as total_amount
from have
group by id;
quit;
Hi @Ada77 Try this
data have;
input ID $ Amount ;
cards;
A 0.1
A 0.2
A 0.3
B 0.4
B 0.5
;
data want;
if _n_=1 then do;
dcl hash H (ordered: "d") ;
h.definekey ("id","Total_amount") ;
h.definedata ("amount","Total_amount") ;
h.definedone () ;
dcl hiter hi('h');
end;
do until(last.id);
set have;
by id;
Total_amount =sum(Total_amount ,amount);
rc=h.add();
end;
Total_amount1=Total_amount;
do while(hi.next()=0);
output;
end;
h.clear();
drop rc;
run;
@hashman Guru, I plagiarized your solution that you taught me yesterday. I admit it before you notice it. Rather, I consider that as I am a good student who pays attention in the affirmative. I am not apologizing unless you demand one. 🙂
@novinosrin : "Imitation is the sincerest form of flattery" ;).
The simplest is to use the automatic re-merge feature of PROC SQL.
create table want as select *,sum(amount) as total from sample group by id;
But if you really need to use a data step for some other reason then consider using a double-DOW loop. The first to calculate the total for the group and the second to output the detail rows.
data want;
do until (last.id);
set sample;
by id;
total=sum(total,amount);
end;
do until (last.id);
set sample;
by id;
output;
end;
run;
@Tom: Replying to your post, as you've included both the double DoW and SQL, but also to @Kurt_Bremser, @ballardw, and @novinosrin (and of course to the OP and/or anyone else watching the thread).
1. If the input data set is not sorted, it's more efficient to use the hash object to get the group totals and attach them back to the original records in the same step - since this approach avoids sorting altogether (keeping the records in the original order):
data want ;
if _n_ = 1 then do ;
dcl hash h () ;
h.definekey ("id") ;
h.definedata ("sum") ;
h.definedone () ;
do until (z) ;
set have (keep = id amt) end = z ;
if h.find() ne 0 then sum = amt ;
else sum + amt ;
h.replace() ;
end ;
end ;
set have ;
h.find() ;
run ;
2. If the input data set is sorted by ID, the double DoW is most efficient - provided that certain performance-promoting considerations (in bold italics below) are observed;
data want (drop = amt) ;
do _n_ = 1 by 1 until (last.id) ;
set have (keep = id amt) ;
by id ;
sum = sum (sum, amt) ;
end ;
do _n_ = 1 to _n_ ;
set have ;
/* note: no BY statement here */
output ;
end ;
run ;
The KEEP option on the first SET is obvious - there's no need to read any variables but ID and AMT on the first pass. The counting instead of the second BY is less obvious since most folks tend to think that BY doesn't exact any performance penalty. Yet it does since it makes the DATA step do extra work comparing the values of the BY variables in the PDV with those in the buffer to set FIRST.x and LAST.x as many times as there are BY groups in the input times the number of BY variables.
Because the program above explicitly relies on the order by ID, it works the same way irrespective of whether the input data set contains the SORTEDBY=ID data set flag set by proc SORT or is merely ordered intrinsically and doesn't have the flag set. But not so with SQL! This is because if the optimizer doesn't see SORTEDBY=ID in the header of the input data set, it will choose the path requiring sorting (indicated by the SQXSRT access method if the system option MSGLEVEL=I and SQL statement option _METHOD are in effect). Which is why, as a precaution, it's always a good idea to tell SQL that input is sorted (if it is indeed ordered, of course): If the flag is already set by SORT, it won't hurt; but if it is not, the unnecessary sorting will be avoided:
proc sql _method stimer ;
create table want as select *, sum (amt) as sum from have (sortedby=id) group id ;
quit ;
In this case, the underlying algorithm is essentially the same as with the double DoW above, though for whatever reason the latter still runs ~40% faster.
Kind regards
Paul D.
Here's a slightly different version, how I would do it:
data sample;
input ID $1 Amount;
datalines;
A 0.1
A 0.2
A 0.3
B 0.4
B 0.5
;
run;
proc sql noprint;
create table sample as
select unique *, sum(amount) as total_amount
from sample
group by id;
quit; run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.