I want to create a new variable containing the sum of one variable within each category of a second variable, in a single step.
This is the data I have; I want to create the sum of var by id.
id | var |
1 | 5 |
1 | 6 |
2 | 2 |
2 | 3 |
2 | 9 |
Data I want
id | var | sum |
1 | 5 | 11 |
1 | 6 | 11 |
2 | 2 | 14 |
2 | 3 | 14 |
2 | 9 | 14 |
I know it can be done in two steps: first create the sum variable and save it in a new dataset, then merge that dataset back onto the original one.
I want to avoid this two-step process because my dataset is huge (0.5 billion observations).
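For reference, a minimal sketch of that two-step approach (the variable and dataset names follow the sample above, and it assumes the data are already sorted by id):

proc summary data=have nway;
   class id;
   var var;
   output out=sums(drop=_type_ _freq_) sum=sum;   /* one row per id with the group total */
run;

data want;
   merge have sums;   /* attach the total back to every detail row */
   by id;
run;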
Is your have dataset already sorted by id, as your sample indicates?
Any solution will require reading the data twice, but it should be possible to write it only once. The main performance impact will be the sorting, so it is really important to know whether the source is already sorted by id.
Also of some relevance: how many observations belong to a single ID (median and max)?
Sorting would not be a big deal, but it would be preferable to avoid it. Can we not use the NOTSORTED option?
There will be a few hundred observations per unique ID.
@BayzidurRahman wrote:
Sorting would not be a big deal, but it would be preferable to avoid it. Can we not use the NOTSORTED option?
There will be a few hundred observations per unique ID.
You need a sum by ID, so the data must be grouped somehow; with BY-group processing that means sorting, and sorting 0.5 billion rows will cause a major performance impact.
If your result table needs to keep all the source rows, then SQL with GROUP BY is likely not the right approach, as it will "collapse" rows - and it will also do an implicit sort.
To avoid sorting, one could first create the aggregated data in a hash table and then look up the aggregates from there. If you've got around 100 rows per group, the summary data should fit into a hash unless your id variable(s) are very long.
Update: Below is how this could work using a hash table. If you have more than one key variable, just add them to the hash: defineKey('id1','id2'). Things will work as long as the summary hash table fits into memory (around 5 million groups, based on the numbers you shared, should fit). If the key variables are very long, one could instead create a hash key over them and use that as the key. Using MD5 with 5 million groups, the hash table should consume less than 250MB (32 bytes for the hash key plus 8 bytes for sum_var per group, plus the per-item overhead).
data have;
input id var;
datalines;
1 5
1 6
2 2
2 3
2 9
;
data want(drop=_:);
   if _n_=1 then
      do;
         if 0 then set have;   /* make the HAVE variables known to the PDV at compile time */
         length sum_var 8;
         dcl hash h1();
         h1.defineKey('id');
         h1.defineData('sum_var');
         h1.defineDone();
         /* Pass 1: read HAVE once and accumulate the per-id totals in the hash */
         do _i=1 to _nobs;
            set have point=_i nobs=_nobs;
            if h1.find() ne 0 then sum_var=var;   /* first row seen for this id */
            else sum_var=sum(sum_var,var);        /* add to the running total   */
            _rc=h1.replace();
         end;
      end;
   /* Pass 2: read HAVE again and look up the group total for each row */
   set have;
   _rc=h1.find();
run;
proc print data=want;
run;
Below is how an approach using a hash key could look. I'd only recommend taking this approach if the combined length of the key variables is significantly above 32 characters, because creating the hash key is resource-intensive.
data want(drop=_:);
   if _n_=1 then
      do;
         if 0 then set have;
         length sum_var 8 _hash_key $32;
         dcl hash h1();
         h1.defineKey('_hash_key');   /* single MD5 key built from all key variables */
         h1.defineData('sum_var');
         h1.defineDone();
         /* Pass 1: accumulate the per-group totals, keyed by the MD5 of id1|id2 */
         do _i=1 to _nobs;
            set have point=_i nobs=_nobs;
            _hash_key=hashing('md5',catx('|',id1,id2));
            if h1.find() ne 0 then sum_var=var;
            else sum_var=sum(sum_var,var);
            _rc=h1.replace();
         end;
      end;
   /* Pass 2: re-create the key for each row and look up the group total */
   set have;
   _hash_key=hashing('md5',catx('|',id1,id2));
   _rc=h1.find();
run;
Try this
data have;
input id var;
datalines;
1 5
1 6
2 2
2 3
2 9
;
proc sql;
   /* PROC SQL re-merges the group sum back onto every detail row            */
   /* (the log will note that summary statistics are remerged with the data) */
   create table want as
      select id, var, sum(var) as sum
      from have
      group by id
   ;
quit;
Hi,
You could try the following:
data have;
input id var;
datalines;
1 5
1 6
2 2
2 3
2 9
;
data want;
   /* Requires HAVE to be sorted (or grouped) by id.                     */
   /* First DOW loop: read one whole BY group and accumulate its total. */
   do until(last.id);
      set have;
      by id;
      sum = sum(0, sum, var);
   end;
   /* Second DOW loop: re-read the same BY group and output every row,  */
   /* now carrying the group total in SUM.                              */
   do until(last.id);
      set have;
      by id;
      output;
   end;
run;
Thanks & kind regards,
Amir.
Thanks Amir.
How do I modify the code if I want to do the sum by two id variables, say id1 and id2?
Hi,
I would suggest replacing last.id with last.id2 and by id with by id2, as follows:
data want;
   do until(last.id2);
      set have;
      by id2;
      sum = sum(0, sum, var);
   end;
   do until(last.id2);
      set have;
      by id2;
      output;
   end;
run;
However, without knowing the relationship between id1 and id2, and what result you want accordingly, it is difficult to say. I would advise you to supply the data you have and the data you want in the form of data steps, as some of us have done to create your input data; then we can try to advise further, as required.
Thanks & kind regards,
Amir.
id1 and id2 are independent; their combination defines the unique groups.
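Given that, a minimal sketch of the same DO-until approach adapted for the composite key would list both variables on the BY statement and test last.id2 (this assumes HAVE is sorted by id1 and id2):

data want;
   do until(last.id2);   /* last.id2 marks the end of each id1*id2 group */
      set have;
      by id1 id2;
      sum = sum(0, sum, var);
   end;
   do until(last.id2);
      set have;
      by id1 id2;
      output;
   end;
run;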
You say this is a moderately large data set, 0.5 billion observations. What will you do with the data after you have created an even larger data set by adding a variable to the data set (you now have two data sets of 0.5 billion observations)?
There must be some reason to add such a variable. Perhaps if you discuss it now we can save steps or make the code more efficient.
I want to run a regression model on the newly generated variable. I will create multiple variables at different resolutions (e.g., weekly total by participant, monthly total by participant) and run separate models on them. I have daily data per participant.
@BayzidR wrote:
I want to run a regression model on the newly generated variable. I will create multiple variables at different resolutions (e.g., weekly total by participant, monthly total by participant) and run separate models on them. I have daily data per participant.
So, if I am understanding you properly, you don't just need to find these sums on the 0.5 billion records once; you will need to do this many times, once for each regression you plan to run.
Do you need an output data set with the original 0.5 billion records, or just one record per ID?
May I suggest an alternative approach? Regression on 0.5 billion records seems like overkill; you could randomly sample the data such that you have (let's say) 50,000 IDs (and of course far fewer than 0.5 billion records), reducing the need for this computationally intensive summing without really losing much quality in the regression.
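A minimal sketch of that sampling idea, assuming a simple random sample of IDs is acceptable (the sample size and seed are illustrative):

proc sort data=have(keep=id) out=id_list nodupkey;   /* distinct IDs; cheap if HAVE is already sorted by id */
   by id;
run;

proc surveyselect data=id_list out=sample_ids method=srs sampsize=50000 seed=12345;
run;

data have_sample;
   merge have(in=a) sample_ids(in=b keep=id);   /* keep only the sampled IDs; both inputs sorted by id */
   by id;
   if a and b;
run;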
But even here I am skeptical. Suppose you do all this different aggregating of data to get all these different sums, and then you fit many different regressions. Then what? What will you do after you have many different regressions, based upon many different time intervals? How will these regressions be used?
I don't need to run the regression on 0.5 billion observations, only on the aggregated datasets. The 0.5 billion observations will be collapsed into 500 clusters and 60 to 120 time points (monthly or fortnightly).
@BayzidR wrote:
I don't need to run the regression on 0.5 billion observations, only on the aggregated datasets. The 0.5 billion observations will be collapsed into 500 clusters and 60 to 120 time points (monthly or fortnightly).
So, if I understand you properly, you don't need the data set you posted originally; you need just one record for each ID containing the sum. Is that correct?
I am thinking that collapsing 0.5 billion records into 500 clusters is also an extremely computationally intensive task, something that makes me shudder; it sounds to me like getting eaten by bears would be more pleasant (and faster). Anyway, again I think randomly sampling the data first would be a much better approach.
If you have time data, I would recommend using PROC TIMESERIES to accumulate your data. It can take time periods into account better than PROC SUMMARY or PROC MEANS.
https://documentation.sas.com/doc/en/etscdc/14.2/etsug/etsug_timeseries_examples01.htm
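A minimal sketch of such an accumulation, assuming a daily SAS date variable named date and a participant identifier named participant_id (both names are illustrative), with the input sorted by participant_id and date:

proc timeseries data=have out=want_weekly;
   by participant_id;
   id date interval=week accumulate=total;   /* weekly totals; interval=month etc. for other resolutions */
   var var;
run;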