How do I split data into equal-sized groups by ranking variable?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 15
Accepted Solution

How do I split data into equal-sized groups by ranking variable?

[ Edited ]

Hello, 

 

I am using below code to split my attached data into 20 equal groups each month based on excess_vwretd. I need equal number of observations in each group each month. However, below code only groups my data based on the distribution of my ranking variable. I want to rank the data based on excess_vwretd and create groups of equal size each month. Any help is appreciated.

 

proc rank data=a5 out=a6 groups=20;
by date2; var excess_vwretd; ranks betarank; run;


Accepted Solutions
Solution
2 weeks ago
Super User
Posts: 6,359

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]

Just checking that I understand the problem first.  Please confirm:

 

  • For each DATE2, the highest values for EXCESS_VWRETD belong in group 1.  The next highest in group 2.
  • If two observations have the same EXCESS_VWRETD, they may need to be placed in different groups, in order to keep the group sizes equal.

To achieve that, sort the data if necessary:

 

proc sort data=have;

by date2 descending excess_vwretd;

run;

 

Then process each DATE2 separately:

 

data want;

date_counter = 0;

do until (last.date2);

   set have;

   by date2;

   date_counter + 1;

end;

group_counter = 0;

do until (last.date2);

   set have;

   by date2;

   group_counter + 1;

   group = ceil(group_counter / date_counter * 20);

**end;

   output;

end;

drop group_counter date_counter;

run;

 

It's untested code, and might need some correction.  But first things first.  Is it attempting to solve the right problem?

 

********** EDITED to comment out an extraneous END statement.

 

View solution in original post


All Replies
PROC Star
Posts: 2,129

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]

Something like this (three groups here) ?

data WANT;    
  set SASHELP.CLASS nobs=NOBS;
  GROUP=ceil(_N_*3/NOBS);
run;

GRP NAME
1 Alfred
1 Alice
1 Barbara
1 Carol
1 Henry
1 James
2 Jane
2 Janet
2 Jeffrey
2 John
2 Joyce
2 Judy
3 Louise
3 Mary
3 Philip
3 Robert
3 Ronald
3 Thomas
3 William

Occasional Contributor
Posts: 15

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]

no, I need to divide monthly observations into 20 groups with equal number of observations. I also have more than 700.000 observations and your code seems to take a lot of time. I appreciate your help.

PROC Star
Posts: 2,129

Re: How do I split data into equal-sized groups by ranking variable?

I can't think of a faster way than my code. It should run much faster than proc rank. Drop unneeded variables

PROC Star
Posts: 2,129

Re: How do I split data into equal-sized groups by ranking variable?

This takes 0.15s on my laptop for 20 groups with one million records.

Don't just say no without looking at the solution please.

data HAVE;
 do I=1 to 1e6;
   output; 
  end; 
run;
data WANT;    
  set HAVE nobs=NOBS;
  GROUP=ceil(_N_*20/NOBS);
run;

 

Occasional Contributor
Posts: 15

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]

I tried your code. It took time to process and I had to terminate. It should do it for each date. Your code doesn't take this into account. I attached a small sample, it might give a better idea.

Trusted Advisor
Posts: 1,779

Re: How do I split data into equal-sized groups by ranking variable?

Assuming data is sorted by date, try next code:

data want;
   set have;
    by date;
         if _N_ = 1 then group=1;
         else group+1;
         if group > 20 then group=1;
run;

/* check distribution */
proc freq taya=want;
    table date * group;
run;
PROC Star
Posts: 2,129

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]

Like this (2.8s for 10 million rows)?

data HAVE;
 do DATE2=1 to 10;
  do VAR=1 to 1e6;
   output; 
  end; 
 end; 
run;
data WANT; 
  do NOBS=1 to 1e9 until (last.DATE2);
    set HAVE;
    by DATE2;
  end;
  do I=1 to NOBS;
    set HAVE;
    GROUP=ceil(I*20/NOBS);
    output;
  end;
run;

 

Respected Advisor
Posts: 4,466

Re: How do I split data into equal-sized groups by ranking variable?

@lezgin

If you just want to split up the date into groups with equal numbers of observations per date and there is no need to also have some sort of random selection, then I don't see how anything could be faster than a simple data step with a function as @ChrisNZ suggests.

 

Here another coding variant for this approach

data have(drop=_:);
  format date date9.;

  do date='01jan2018'd, '02jan2018'd;
    do _i=1 to 150;
      output;
    end;
  end;

  do date='03jan2018'd, '04jan2018'd;
    do _i=1 to 10;
      output;
    end;
  end;

  stop;
run;

%let n_groups=5;

data want;
  set have;
  by date;
  group=mod(_n_-1,&n_groups)+1;
run;
Solution
2 weeks ago
Super User
Posts: 6,359

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]

Just checking that I understand the problem first.  Please confirm:

 

  • For each DATE2, the highest values for EXCESS_VWRETD belong in group 1.  The next highest in group 2.
  • If two observations have the same EXCESS_VWRETD, they may need to be placed in different groups, in order to keep the group sizes equal.

To achieve that, sort the data if necessary:

 

proc sort data=have;

by date2 descending excess_vwretd;

run;

 

Then process each DATE2 separately:

 

data want;

date_counter = 0;

do until (last.date2);

   set have;

   by date2;

   date_counter + 1;

end;

group_counter = 0;

do until (last.date2);

   set have;

   by date2;

   group_counter + 1;

   group = ceil(group_counter / date_counter * 20);

**end;

   output;

end;

drop group_counter date_counter;

run;

 

It's untested code, and might need some correction.  But first things first.  Is it attempting to solve the right problem?

 

********** EDITED to comment out an extraneous END statement.

 

Occasional Contributor
Posts: 15

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]
Posted in reply to Astounding

hello astounding, that is exactly what I need. I get this error when I run it


5775 /*then process each DATE2 separately:*/
5776 data yedek;
5777 date_counter = 0;
5778 do until (last.date2);
5779 set a5;
5780 by date2;
5781 date_counter + 1;
5782 end;
5783 group_counter = 0;
5784 do until (last.date2);
5785 set a5;
5786 by date2;
5787 group_counter + 1;
5788 group = ceil(group_counter / date_counter * 20);
5789 end;
5790 output;
5791 end;
---
161
ERROR 161-185: No matching DO/SELECT statement.

5792
5793 drop group_counter record_counter;
5794
5795 run;

WARNING: The variable record_counter in the DROP, KEEP, or RENAME list has never been referenced.

 

chris, thanks for taking your time to do this, but I don't think you read my previous posts. you are ignoring the ranking variable. I want to rank each month into 20 equal sized groups.

Occasional Contributor
Posts: 15

Re: How do I split data into equal-sized groups by ranking variable?

[ Edited ]
Posted in reply to Astounding

astounding, thank you but it doesn't produce groups with equal no of obs.

Super User
Posts: 6,359

Re: How do I split data into equal-sized groups by ranking variable?

It worked for me.  Of course, your groups can't be exactly of equal size unless the number of observations is a multiple of 20.  So some group sizes can differ by 1.  But other than that it should be perfect.

Occasional Contributor
Posts: 15

Re: How do I split data into equal-sized groups by ranking variable?

Posted in reply to Astounding
yes I know ideally group sizes should differ by one but when I check the difference is 4 or 5 in some cases for groups in same month.
Super User
Posts: 6,359

Re: How do I split data into equal-sized groups by ranking variable?

I can look at the log if you post it.  (Wouldn't hurt to post a small piece of the output ... an example of what goes wrong.)  Otherwise there's not much I can do on this end.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 17 replies
  • 212 views
  • 7 likes
  • 6 in conversation