Solved: How do I split data into equal-sized groups by ranking variable?

lezgin · Posted 04-11-2018 10:07 PM

Hello,

I am using below code to split my attached data into 20 equal groups each month based on excess_vwretd. I need equal number of observations in each group each month. However, below code only groups my data based on the distribution of my ranking variable. I want to rank the data based on excess_vwretd and create groups of equal size each month. Any help is appreciated.

proc rank data=a5 out=a6 groups=20;
by date2; var excess_vwretd; ranks betarank; run;

Astounding · Posted 04-12-2018 07:19 AM

Just checking that I understand the problem first. Please confirm:

For each DATE2, the highest values for EXCESS_VWRETD belong in group 1. The next highest in group 2.
If two observations have the same EXCESS_VWRETD, they may need to be placed in different groups, in order to keep the group sizes equal.

To achieve that, sort the data if necessary:

proc sort data=have;

by date2 descending excess_vwretd;

run;

Then process each DATE2 separately:

data want;

date_counter = 0;

do until (last.date2);

set have;

by date2;

date_counter + 1;

end;

group_counter = 0;

do until (last.date2);

set have;

by date2;

group_counter + 1;

group = ceil(group_counter / date_counter * 20);

**end;

output;

end;

drop group_counter date_counter;

run;

It's untested code, and might need some correction. But first things first. Is it attempting to solve the right problem?

********** EDITED to comment out an extraneous END statement.

View solution in original post

ChrisNZ · Posted 04-12-2018 01:27 AM

Something like this (three groups here) ?

data WANT;    
  set SASHELP.CLASS nobs=NOBS;
  GROUP=ceil(_N_*3/NOBS);
run;

GRP NAME
1 Alfred
1 Alice
1 Barbara
1 Carol
1 Henry
1 James
2 Jane
2 Janet
2 Jeffrey
2 John
2 Joyce
2 Judy
3 Louise
3 Mary
3 Philip
3 Robert
3 Ronald
3 Thomas
3 William

High-Performance SAS Coding - Third Edition

lezgin · Posted 04-12-2018 01:33 AM

no, I need to divide monthly observations into 20 groups with equal number of observations. I also have more than 700.000 observations and your code seems to take a lot of time. I appreciate your help.

ChrisNZ · Posted 04-12-2018 01:57 AM

I can't think of a faster way than my code. It should run much faster than proc rank. Drop unneeded variables

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-12-2018 02:00 AM

This takes 0.15s on my laptop for 20 groups with one million records.

Don't just say no without looking at the solution please.

data HAVE;
 do I=1 to 1e6;
   output; 
  end; 
run;
data WANT;    
  set HAVE nobs=NOBS;
  GROUP=ceil(_N_*20/NOBS);
run;

High-Performance SAS Coding - Third Edition

lezgin · Posted 04-12-2018 02:04 AM

I tried your code. It took time to process and I had to terminate. It should do it for each date. Your code doesn't take this into account. I attached a small sample, it might give a better idea.

Shmuel · Posted 04-12-2018 04:38 AM

Assuming data is sorted by date, try next code:

data want;
   set have;
    by date;
         if _N_ = 1 then group=1;
         else group+1;
         if group > 20 then group=1;
run;

/* check distribution */
proc freq taya=want;
    table date * group;
run;

ChrisNZ · Posted 04-12-2018 07:17 AM

Like this (2.8s for 10 million rows)?

data HAVE;
 do DATE2=1 to 10;
  do VAR=1 to 1e6;
   output; 
  end; 
 end; 
run;
data WANT; 
  do NOBS=1 to 1e9 until (last.DATE2);
    set HAVE;
    by DATE2;
  end;
  do I=1 to NOBS;
    set HAVE;
    GROUP=ceil(I*20/NOBS);
    output;
  end;
run;

High-Performance SAS Coding - Third Edition

Patrick · Posted 04-12-2018 08:21 AM

@lezgin

If you just want to split up the date into groups with equal numbers of observations per date and there is no need to also have some sort of random selection, then I don't see how anything could be faster than a simple data step with a function as @ChrisNZ suggests.

Here another coding variant for this approach

data have(drop=_:);
  format date date9.;

  do date='01jan2018'd, '02jan2018'd;
    do _i=1 to 150;
      output;
    end;
  end;

  do date='03jan2018'd, '04jan2018'd;
    do _i=1 to 10;
      output;
    end;
  end;

  stop;
run;

%let n_groups=5;

data want;
  set have;
  by date;
  group=mod(_n_-1,&n_groups)+1;
run;

Astounding · Posted 04-12-2018 07:19 AM

Just checking that I understand the problem first. Please confirm:

For each DATE2, the highest values for EXCESS_VWRETD belong in group 1. The next highest in group 2.
If two observations have the same EXCESS_VWRETD, they may need to be placed in different groups, in order to keep the group sizes equal.

To achieve that, sort the data if necessary:

proc sort data=have;

by date2 descending excess_vwretd;

run;

Then process each DATE2 separately:

data want;

date_counter = 0;

do until (last.date2);

set have;

by date2;

date_counter + 1;

end;

group_counter = 0;

do until (last.date2);

set have;

by date2;

group_counter + 1;

group = ceil(group_counter / date_counter * 20);

**end;

output;

end;

drop group_counter date_counter;

run;

It's untested code, and might need some correction. But first things first. Is it attempting to solve the right problem?

********** EDITED to comment out an extraneous END statement.

lezgin · Posted 04-12-2018 11:54 AM

hello astounding, that is exactly what I need. I get this error when I run it

5775 /*then process each DATE2 separately:*/
5776 data yedek;
5777 date_counter = 0;
5778 do until (last.date2);
5779 set a5;
5780 by date2;
5781 date_counter + 1;
5782 end;
5783 group_counter = 0;
5784 do until (last.date2);
5785 set a5;
5786 by date2;
5787 group_counter + 1;
5788 group = ceil(group_counter / date_counter * 20);
5789 end;
5790 output;
5791 end;
---
161
ERROR 161-185: No matching DO/SELECT statement.

5792
5793 drop group_counter record_counter;
5794
5795 run;

WARNING: The variable record_counter in the DROP, KEEP, or RENAME list has never been referenced.

chris, thanks for taking your time to do this, but I don't think you read my previous posts. you are ignoring the ranking variable. I want to rank each month into 20 equal sized groups.

lezgin · Posted 04-12-2018 12:23 PM

astounding, thank you but it doesn't produce groups with equal no of obs.

Astounding · Posted 04-12-2018 02:21 PM

It worked for me. Of course, your groups can't be exactly of equal size unless the number of observations is a multiple of 20. So some group sizes can differ by 1. But other than that it should be perfect.

lezgin · Posted 04-12-2018 02:28 PM

yes I know ideally group sizes should differ by one but when I check the difference is 4 or 5 in some cases for groups in same month.

Astounding · Posted 04-12-2018 02:48 PM

I can look at the log if you post it. (Wouldn't hurt to post a small piece of the output ... an example of what goes wrong.) Otherwise there's not much I can do on this end.

How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

Re: How do I split data into equal-sized groups by ranking variable?

SAS Innovate 2026 Registration is Open

SAS Training: Just a Click Away