Hello,
I am using below code to split my attached data into 20 equal groups each month based on excess_vwretd. I need equal number of observations in each group each month. However, below code only groups my data based on the distribution of my ranking variable. I want to rank the data based on excess_vwretd and create groups of equal size each month. Any help is appreciated.
proc rank data=a5 out=a6 groups=20;
by date2; var excess_vwretd; ranks betarank; run;
Just checking that I understand the problem first. Please confirm:
To achieve that, sort the data if necessary:
proc sort data=have;
by date2 descending excess_vwretd;
run;
Then process each DATE2 separately:
data want;
date_counter = 0;
do until (last.date2);
set have;
by date2;
date_counter + 1;
end;
group_counter = 0;
do until (last.date2);
set have;
by date2;
group_counter + 1;
group = ceil(group_counter / date_counter * 20);
**end;
output;
end;
drop group_counter date_counter;
run;
It's untested code, and might need some correction. But first things first. Is it attempting to solve the right problem?
********** EDITED to comment out an extraneous END statement.
Something like this (three groups here) ?
data WANT;
set SASHELP.CLASS nobs=NOBS;
GROUP=ceil(_N_*3/NOBS);
run;
GRP NAME
1 Alfred
1 Alice
1 Barbara
1 Carol
1 Henry
1 James
2 Jane
2 Janet
2 Jeffrey
2 John
2 Joyce
2 Judy
3 Louise
3 Mary
3 Philip
3 Robert
3 Ronald
3 Thomas
3 William
no, I need to divide monthly observations into 20 groups with equal number of observations. I also have more than 700.000 observations and your code seems to take a lot of time. I appreciate your help.
I can't think of a faster way than my code. It should run much faster than proc rank. Drop unneeded variables
This takes 0.15s on my laptop for 20 groups with one million records.
Don't just say no without looking at the solution please.
data HAVE;
do I=1 to 1e6;
output;
end;
run;
data WANT;
set HAVE nobs=NOBS;
GROUP=ceil(_N_*20/NOBS);
run;
I tried your code. It took time to process and I had to terminate. It should do it for each date. Your code doesn't take this into account. I attached a small sample, it might give a better idea.
Assuming data is sorted by date, try next code:
data want;
set have;
by date;
if _N_ = 1 then group=1;
else group+1;
if group > 20 then group=1;
run;
/* check distribution */
proc freq taya=want;
table date * group;
run;
Like this (2.8s for 10 million rows)?
data HAVE;
do DATE2=1 to 10;
do VAR=1 to 1e6;
output;
end;
end;
run;
data WANT;
do NOBS=1 to 1e9 until (last.DATE2);
set HAVE;
by DATE2;
end;
do I=1 to NOBS;
set HAVE;
GROUP=ceil(I*20/NOBS);
output;
end;
run;
If you just want to split up the date into groups with equal numbers of observations per date and there is no need to also have some sort of random selection, then I don't see how anything could be faster than a simple data step with a function as @ChrisNZ suggests.
Here another coding variant for this approach
data have(drop=_:);
format date date9.;
do date='01jan2018'd, '02jan2018'd;
do _i=1 to 150;
output;
end;
end;
do date='03jan2018'd, '04jan2018'd;
do _i=1 to 10;
output;
end;
end;
stop;
run;
%let n_groups=5;
data want;
set have;
by date;
group=mod(_n_-1,&n_groups)+1;
run;
Just checking that I understand the problem first. Please confirm:
To achieve that, sort the data if necessary:
proc sort data=have;
by date2 descending excess_vwretd;
run;
Then process each DATE2 separately:
data want;
date_counter = 0;
do until (last.date2);
set have;
by date2;
date_counter + 1;
end;
group_counter = 0;
do until (last.date2);
set have;
by date2;
group_counter + 1;
group = ceil(group_counter / date_counter * 20);
**end;
output;
end;
drop group_counter date_counter;
run;
It's untested code, and might need some correction. But first things first. Is it attempting to solve the right problem?
********** EDITED to comment out an extraneous END statement.
hello astounding, that is exactly what I need. I get this error when I run it
5775 /*then process each DATE2 separately:*/
5776 data yedek;
5777 date_counter = 0;
5778 do until (last.date2);
5779 set a5;
5780 by date2;
5781 date_counter + 1;
5782 end;
5783 group_counter = 0;
5784 do until (last.date2);
5785 set a5;
5786 by date2;
5787 group_counter + 1;
5788 group = ceil(group_counter / date_counter * 20);
5789 end;
5790 output;
5791 end;
---
161
ERROR 161-185: No matching DO/SELECT statement.
5792
5793 drop group_counter record_counter;
5794
5795 run;
WARNING: The variable record_counter in the DROP, KEEP, or RENAME list has never been referenced.
chris, thanks for taking your time to do this, but I don't think you read my previous posts. you are ignoring the ranking variable. I want to rank each month into 20 equal sized groups.
astounding, thank you but it doesn't produce groups with equal no of obs.
It worked for me. Of course, your groups can't be exactly of equal size unless the number of observations is a multiple of 20. So some group sizes can differ by 1. But other than that it should be perfect.
I can look at the log if you post it. (Wouldn't hurt to post a small piece of the output ... an example of what goes wrong.) Otherwise there's not much I can do on this end.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.