BookmarkSubscribeRSS Feed
sunny86
Calcite | Level 5
data have;
input Balance;
datalines;
5000
6000
5500
4500
5200
4800
5100
5600
5200
5300
;
run;
 
proc sort data=have;
by descending Balance;
run;
 
data result;
do i= 1 to 5, 5 to 1 until(done);
set have end=done;
group = i;
output;
 
end;
drop i;
run;

My output
 
160001
256002
355003
453004
552005
652001
751002
850003
948004
1045005

 

Expected output

160001
256002
355003
453004
552005
652005
751004
850003
948002
1045001

 

12 REPLIES 12
mkeintz
PROC Star

This is a good case for taking advantage of the automatic variable _N_, and the modulo function:

 

proc sort data=have out=need;
  by descending balance;
run;
data want;
  set need;
  group=mod(_n_-1,5)+1;
run;

In this case (in most cases) _N_ is equivalent to the observation sequence in dataset NEED.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sunny86
Calcite | Level 5

@mkeintz the output I ma expecting is to reverse the sequence alternatively

 

Form your suggestion i got the below output but what i want is to reverse it

 
 
 
160001
256002
355003
453004
552005
652001
751002
850003
948004
1045005

 

Expected 

 

 
160001
256002
355003
453004
552005
652005
751004
850003
948002
1045001

 

mkeintz
PROC Star

So, for every ten observations, you want to have groups in sequence 1,2,3,4,5,5,4,3,2,1, correct?  Then you can just use a temporary array of 10 group values:

 

proc sort data=have out=need;
  by descending balance;
run;
data want_array;
  array grps {0:9} _temporary_ (1,1,2,3,4,5,5,4,3,2);
  set need;
  group=grps{mod(_n_,10)};
run;

The array grps has a lower bound of zero, not 1.  So the first array element corresponds to _n_=10,20,30, etc, all of which have mod(_n_,10)=0.   Second element corresponds to _n_=1,11,21, etc.

 

If the array approach seems a little too abstruse, then you can have a group increment of +1 for five observations, followed by -1 for five observations:

 

proc sort data=have out=need;
  by descending balance;
run;
data want (drop=_:);
  set need;
  retain _increment 1;
  group + _increment;
  if group in (0,6) then do;
    _increment=-1*_increment;
    group+_increment;
  end;  
run;

Question:  what if you have, say, 36 observations.  What group values will the last 6 obs have?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ballardw
Super User

@sunny86 wrote:
data have;
input Balance;
datalines;
5000
6000
5500
4500
5200
4800
5100
5600
5200
5300
;
run;
 
proc sort data=have;
by descending Balance;
run;
 
data result;
do i= 1 to 5, 5 to 1 until(done);
set have end=done;
group = i;
output;
 
end;
drop i;
run;

My output
 
1 6000 1
2 5600 2
3 5500 3
4 5300 4
5 5200 5
6 5200 1
7 5100 2
8 5000 3
9 4800 4
10 4500 5

 

Expected output

1 6000 1
2 5600 2
3 5500 3
4 5300 4
5 5200 5
6 5200 5
7 5100 4
8 5000 3
9 4800 2
10 4500 1

 


Your example code explicitly states 5 as a group. Would it better to say that you want HALF of the values in descending order numbered in increasing order until half are numbered and then start decreasing the group variable? Or is your data always exactly 10 records in the data set? If the number of observations may ever be odd you need describe how the group gets assigned for the observation that can't have a group partner.

 

This seems to work for EVEN numbers of observations.

data want;
  set have nobs=numobs;
  retain group;
  if _n_ le numobs/2 then do;
     group=mod(_n_-1,numobs/2)+1;
     output;
     if _n_ = numobs/2 then group+1;
  end;
  else do ;
     group=group-1;
     output;
  end;
run;
sunny86
Calcite | Level 5

@ballardw  I have a datset of 649 observations i want to alternate the sequence in descending order.
I am sorting the dataset in descending order and i want to assing the balances to the 5 groups such that the average balance across the groups will be close to similar
I am achieving it by assigning the balances to groups
So in the example dataset I want to alternate the Groups
In the first iteration the sequence went from 1 to 5 and then next iteration it should assign from 5 to 1

 

 

ballardw
Super User

@sunny86 wrote:

@ballardw  I have a datset of 649 observations i want to alternate the sequence in descending order.
I am sorting the dataset in descending order and i want to assing the balances to the 5 groups such that the average balance across the groups will be close to similar
I am achieving it by assigning the balances to groups
So in the example dataset I want to alternate the Groups
In the first iteration the sequence went from 1 to 5 and then next iteration it should assign from 5 to 1

 

 


Still not a particularly clear objective. You seem to spend as much time concerned with an approach as describing the result. Is the result to have N, where N is some number you pick, such as 5 groups each having a similar (that definition may need some clarification) mean value of Balance?

 

BTW, I still see no mention of what your approach would do with the odd number of observations.

 

An example of 10 observations may be obscuring other options that may be more robust with a larger data set. Group sizes of 2 from 10 observations would require lots of control. Group sizes of over 100, such as 5 groups from 649 observations are likely amenable to other approaches such as random selection.

Consider running the following code against your larger data set and see what the means are for the groups of balance values.

proc surveyselect data=have groups=5 out=test
;
run;

proc means data=test mean;
   class groupid;
   var balance;
run;
sunny86
Calcite | Level 5

@ballardw 

 

What i am trying to achieve is

I have 5 departments and i have to assign the 649 observations to the groups in such a way that the variance between the groups is minimal for both balance and score.

I tried with balance of reverse sequence

Data want;

do i = 1 to &Dep;
set Sample (keep=Department Score Balance Uniq_no) end=done;
group = i;
output;
end;
 
do i = &Dep to 1 by -1 until(done);
set Sample (keep=Department Score Balance Uniq_no) end=done;
group = i;
output;
end;
drop i;



 

ballardw
Super User

@sunny86 wrote:

@ballardw 

 

What i am trying to achieve is

I have 5 departments and i have to assign the 649 observations to the groups in such a way that the variance between the groups is minimal for both balance and score.

I tried with balance of reverse sequence

Data want;

do i = 1 to &Dep;
set Sample (keep=Department Score Balance Uniq_no) end=done;
group = i;
output;
end;
 
do i = &Dep to 1 by -1 until(done);
set Sample (keep=Department Score Balance Uniq_no) end=done;
group = i;
output;
end;
drop i;

There is no guarantee that what you are doing will do what you want. In fact by having the group=1 , at least with a sample size of two by having the largest and smallest value of Balance in the same group you increase the variance while the group=5 is more likely to have a much smaller variance than the other groups.

I ran proc means on the data set that  you want to create. Note that the variance is larger for group 1 than for group 5.

Analysis Variable : Balance
group N Obs Mean Variance Std Dev
1 2 5250.00 1125000.00 1060.66
2 2 5200.00 320000.00 565.6854249
3 2 5250.00 125000.00 353.5533906
4 2 5200.00 20000.00 141.4213562
5 2 5200.00 0 0

 

You also have described how this algorithm is to applied to a data set of 649 observations. Or how many actual groups you want.

 

Did you even test the example code on your whole data set? You get see the variance or standard deviation between the groups by adding VAR and STD to the Means statement. Then you can see what the means and variance are. If you want to you could create hundreds of output datasets from Surveyselect, test each one for the "variances" , which I suspect that you might just be using incorrectly, and pick a set that has the lowest mean "variance".

 

 

Astounding
PROC Star

Logically, I think this is the right way to get what you are describing.  After sorting:

 

data want;
   do until (5=4);
      do i=1 to 5, 5 to 1 by -1;
         set have;
         output;
      end;
   end;
run;

It might seem strange to use 5=4 as the UNTIL condition.  That just means the DATA step keeps on going until the SET statement ends it by running out of observations to read.

 

There are other legitimate questions to ask.  Is this the best statistical way to divide up your data?  (Probably not.)  Why didn't your earlier attempt with two SET statements work?  (Each SET statement operates independently of the other.  Both start reading the first observation from the incoming data, so I expect you have 649 + 645 observations from that attempt.)  Those questions can be addressed in more detail.  But for now, the above program should get you what you asked for.

Tom
Super User Tom
Super User

Exactly.

But you don't need that extra DO loop.  The data step itself will do that for you.

Astounding
PROC Star

@Tom , you're absolutely right.  Good point.

 

After sorting, the "decluttered" version of my solution would be:

data want;
   do i=1 to 5, 5 to 1 by -1;
      set have;
      output;
   end;
run;

If HAVE runs out of observations before a loop completes, that would be a normal end to the DATA step.

Ksharp
Super User
data want;
 set have;
 id=mod(_n_,5);
 if id=0 then id=5;
 if id=1 then group+1;
 if mod(group,2)=0 then id=6-id;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1076 views
  • 4 likes
  • 6 in conversation