BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lezgin
Obsidian | Level 7

Hello, 

 

I am using below code to split my attached data into 20 equal groups each month based on excess_vwretd. I need equal number of observations in each group each month. However, below code only groups my data based on the distribution of my ranking variable. I want to rank the data based on excess_vwretd and create groups of equal size each month. Any help is appreciated.

 

proc rank data=a5 out=a6 groups=20;
by date2; var excess_vwretd; ranks betarank; run;

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Just checking that I understand the problem first.  Please confirm:

 

  • For each DATE2, the highest values for EXCESS_VWRETD belong in group 1.  The next highest in group 2.
  • If two observations have the same EXCESS_VWRETD, they may need to be placed in different groups, in order to keep the group sizes equal.

To achieve that, sort the data if necessary:

 

proc sort data=have;

by date2 descending excess_vwretd;

run;

 

Then process each DATE2 separately:

 

data want;

date_counter = 0;

do until (last.date2);

   set have;

   by date2;

   date_counter + 1;

end;

group_counter = 0;

do until (last.date2);

   set have;

   by date2;

   group_counter + 1;

   group = ceil(group_counter / date_counter * 20);

**end;

   output;

end;

drop group_counter date_counter;

run;

 

It's untested code, and might need some correction.  But first things first.  Is it attempting to solve the right problem?

 

********** EDITED to comment out an extraneous END statement.

 

View solution in original post

16 REPLIES 16
ChrisNZ
Tourmaline | Level 20

Something like this (three groups here) ?

data WANT;    
  set SASHELP.CLASS nobs=NOBS;
  GROUP=ceil(_N_*3/NOBS);
run;

GRP NAME
1 Alfred
1 Alice
1 Barbara
1 Carol
1 Henry
1 James
2 Jane
2 Janet
2 Jeffrey
2 John
2 Joyce
2 Judy
3 Louise
3 Mary
3 Philip
3 Robert
3 Ronald
3 Thomas
3 William

lezgin
Obsidian | Level 7

no, I need to divide monthly observations into 20 groups with equal number of observations. I also have more than 700.000 observations and your code seems to take a lot of time. I appreciate your help.

ChrisNZ
Tourmaline | Level 20

I can't think of a faster way than my code. It should run much faster than proc rank. Drop unneeded variables

ChrisNZ
Tourmaline | Level 20

This takes 0.15s on my laptop for 20 groups with one million records.

Don't just say no without looking at the solution please.

data HAVE;
 do I=1 to 1e6;
   output; 
  end; 
run;
data WANT;    
  set HAVE nobs=NOBS;
  GROUP=ceil(_N_*20/NOBS);
run;

 

lezgin
Obsidian | Level 7

I tried your code. It took time to process and I had to terminate. It should do it for each date. Your code doesn't take this into account. I attached a small sample, it might give a better idea.

Shmuel
Garnet | Level 18

Assuming data is sorted by date, try next code:

data want;
   set have;
    by date;
         if _N_ = 1 then group=1;
         else group+1;
         if group > 20 then group=1;
run;

/* check distribution */
proc freq taya=want;
    table date * group;
run;
ChrisNZ
Tourmaline | Level 20

Like this (2.8s for 10 million rows)?

data HAVE;
 do DATE2=1 to 10;
  do VAR=1 to 1e6;
   output; 
  end; 
 end; 
run;
data WANT; 
  do NOBS=1 to 1e9 until (last.DATE2);
    set HAVE;
    by DATE2;
  end;
  do I=1 to NOBS;
    set HAVE;
    GROUP=ceil(I*20/NOBS);
    output;
  end;
run;

 

Patrick
Opal | Level 21

@lezgin

If you just want to split up the date into groups with equal numbers of observations per date and there is no need to also have some sort of random selection, then I don't see how anything could be faster than a simple data step with a function as @ChrisNZ suggests.

 

Here another coding variant for this approach

data have(drop=_:);
  format date date9.;

  do date='01jan2018'd, '02jan2018'd;
    do _i=1 to 150;
      output;
    end;
  end;

  do date='03jan2018'd, '04jan2018'd;
    do _i=1 to 10;
      output;
    end;
  end;

  stop;
run;

%let n_groups=5;

data want;
  set have;
  by date;
  group=mod(_n_-1,&n_groups)+1;
run;
Astounding
PROC Star

Just checking that I understand the problem first.  Please confirm:

 

  • For each DATE2, the highest values for EXCESS_VWRETD belong in group 1.  The next highest in group 2.
  • If two observations have the same EXCESS_VWRETD, they may need to be placed in different groups, in order to keep the group sizes equal.

To achieve that, sort the data if necessary:

 

proc sort data=have;

by date2 descending excess_vwretd;

run;

 

Then process each DATE2 separately:

 

data want;

date_counter = 0;

do until (last.date2);

   set have;

   by date2;

   date_counter + 1;

end;

group_counter = 0;

do until (last.date2);

   set have;

   by date2;

   group_counter + 1;

   group = ceil(group_counter / date_counter * 20);

**end;

   output;

end;

drop group_counter date_counter;

run;

 

It's untested code, and might need some correction.  But first things first.  Is it attempting to solve the right problem?

 

********** EDITED to comment out an extraneous END statement.

 

lezgin
Obsidian | Level 7

hello astounding, that is exactly what I need. I get this error when I run it


5775 /*then process each DATE2 separately:*/
5776 data yedek;
5777 date_counter = 0;
5778 do until (last.date2);
5779 set a5;
5780 by date2;
5781 date_counter + 1;
5782 end;
5783 group_counter = 0;
5784 do until (last.date2);
5785 set a5;
5786 by date2;
5787 group_counter + 1;
5788 group = ceil(group_counter / date_counter * 20);
5789 end;
5790 output;
5791 end;
---
161
ERROR 161-185: No matching DO/SELECT statement.

5792
5793 drop group_counter record_counter;
5794
5795 run;

WARNING: The variable record_counter in the DROP, KEEP, or RENAME list has never been referenced.

 

chris, thanks for taking your time to do this, but I don't think you read my previous posts. you are ignoring the ranking variable. I want to rank each month into 20 equal sized groups.

lezgin
Obsidian | Level 7

astounding, thank you but it doesn't produce groups with equal no of obs.

Astounding
PROC Star

It worked for me.  Of course, your groups can't be exactly of equal size unless the number of observations is a multiple of 20.  So some group sizes can differ by 1.  But other than that it should be perfect.

lezgin
Obsidian | Level 7
yes I know ideally group sizes should differ by one but when I check the difference is 4 or 5 in some cases for groups in same month.
Astounding
PROC Star

I can look at the log if you post it.  (Wouldn't hurt to post a small piece of the output ... an example of what goes wrong.)  Otherwise there's not much I can do on this end.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 16303 views
  • 7 likes
  • 6 in conversation