Solved: Bucketing data according to proportion of a variable

benvar · Posted 06-25-2019 09:48 AM

I have data of purchases with different payment amounts. I want to bucket the data so that each bucket has an equal proportion of total dollars. I am experimenting with PROC HPBIN but I don't know which bucketing method to use or if I need to use a different PROC.

Example of data:

Purchase # Dollar Amount

1 $10

2 $20

3 $10

4 $10

5 $50

6 $30

7 $10

8 $20

9 $10

10 $20

11 $10

Code should result in buckets like this:

Bucket 1: $0-$10: total of $50 (Purchases #1, 3, 4, 7, 9)

Bucket 2: $10-$20: total of $50 (Purchases #11, 2, 😎

Bucket 3: $20-$30: total of $50 (Purchases #10, 6)

Bucket 4: More than $30: total of $50 (Purchase #5)

Rick_SAS · Posted 06-26-2019 08:18 AM

As Paul says, there are assumptions that may or may not hold. Here's a program that enables the cumulative sums to exceed the target values (50).

proc sort data=Have out=HaveSort;
by Amount;
run;

/* manually inspect the sum */
proc means data=Have Sum;
var Amount;
run;

%let nGroups = 4;
%let SUM = 200;    /* <== input sum from PROC MEANS */

data Cumul;
retain Bin 1;
set HaveSort;
d = &SUM / &nGroups;
cusum + Amount;
diff = cusum - d;
if diff > 0 then do;
   cusum = cusum - d;
   Bin + 1;
end;
run;

proc print; run;

The code won't work perfectly if there are very large amounts like $200.

View solution in original post

Rick_SAS · Posted 06-25-2019 01:59 PM

I think that most people would not consider this "binning." Binning usually means assigning similar units to groups: For example, you could bin by the dollar amount and get one bin that has small amounts, another that has medium amounts, and one that has large amounts.

From your description, it sounds like you want to split the data into groups that have (approximately) sum (or proportion) of dollars. If that is correct, you might want to read the article "Split data into groups that have the same mean "

See also the comments at the end of the article for an alternative solution.

benvar · Posted 06-25-2019 02:38 PM

Yes, I need my data to be in order of dollar amount (from lowest to highest) and then divided into groups that each have the same sum of total dollars. The article you linked splits data into groups with the same mean which is not what I want. Can you please help?

hashman · Posted 06-25-2019 05:46 PM

@benvar :

Your presumption that what you need to do is possible is based on the assumption that the data are actually structured in such a way that as you go through the amounts in ascending order, first N1 items will tally up to 50, next N2 items will also tally up to 50, and so on. Your sample data shown here are in fact structured in this manner. If your real data also are, then try this:

data have ;                         
  input purchase amount ;           
  cards ;                           
 1  10                              
 2  20                              
 3  10                              
 4  10                              
 5  50                              
 6  30                              
 7  10                              
 8  20                              
 9  10                              
10  20                              
11  10                              
run ;                               
                                    
proc sort ;                         
  by amount ;                       
run ;                               
                                    
data want (drop = _:) ;             
  do _q = 1 by 1 until (_sum = 50) ;
    set have ;                      
    _sum = sum (_sum, amount) ;     
  end ;                             
  bucket = _n_ ;                    
  do _q = 1 to _q ;                 
    set have ;                      
    output ;                        
  end ;                             
run ;

Kind regards

Paul D.

Rick_SAS · Posted 06-26-2019 08:18 AM

As Paul says, there are assumptions that may or may not hold. Here's a program that enables the cumulative sums to exceed the target values (50).

proc sort data=Have out=HaveSort;
by Amount;
run;

/* manually inspect the sum */
proc means data=Have Sum;
var Amount;
run;

%let nGroups = 4;
%let SUM = 200;    /* <== input sum from PROC MEANS */

data Cumul;
retain Bin 1;
set HaveSort;
d = &SUM / &nGroups;
cusum + Amount;
diff = cusum - d;
if diff > 0 then do;
   cusum = cusum - d;
   Bin + 1;
end;
run;

proc print; run;

The code won't work perfectly if there are very large amounts like $200.

Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Re: Bucketing data according to proportion of a variable

Click image to register for webinar

Classroom Training Available!