I have data of purchases with different payment amounts. I want to bucket the data so that each bucket has an equal proportion of total dollars. I am experimenting with PROC HPBIN but I don't know which bucketing method to use or if I need to use a different PROC.
Example of data:
Purchase # Dollar Amount
1 $10
2 $20
3 $10
4 $10
5 $50
6 $30
7 $10
8 $20
9 $10
10 $20
11 $10
Code should result in buckets like this:
Bucket 1: $0-$10: total of $50 (Purchases #1, 3, 4, 7, 9)
Bucket 2: $10-$20: total of $50 (Purchases #11, 2, 😎
Bucket 3: $20-$30: total of $50 (Purchases #10, 6)
Bucket 4: More than $30: total of $50 (Purchase #5)
As Paul says, there are assumptions that may or may not hold. Here's a program that enables the cumulative sums to exceed the target values (50).
proc sort data=Have out=HaveSort;
by Amount;
run;
/* manually inspect the sum */
proc means data=Have Sum;
var Amount;
run;
%let nGroups = 4;
%let SUM = 200; /* <== input sum from PROC MEANS */
data Cumul;
retain Bin 1;
set HaveSort;
d = &SUM / &nGroups;
cusum + Amount;
diff = cusum - d;
if diff > 0 then do;
cusum = cusum - d;
Bin + 1;
end;
run;
proc print; run;
The code won't work perfectly if there are very large amounts like $200.
I think that most people would not consider this "binning." Binning usually means assigning similar units to groups: For example, you could bin by the dollar amount and get one bin that has small amounts, another that has medium amounts, and one that has large amounts.
From your description, it sounds like you want to split the data into groups that have (approximately) sum (or proportion) of dollars. If that is correct, you might want to read the article "Split data into groups that have the same mean "
See also the comments at the end of the article for an alternative solution.
@benvar :
Your presumption that what you need to do is possible is based on the assumption that the data are actually structured in such a way that as you go through the amounts in ascending order, first N1 items will tally up to 50, next N2 items will also tally up to 50, and so on. Your sample data shown here are in fact structured in this manner. If your real data also are, then try this:
data have ;
input purchase amount ;
cards ;
1 10
2 20
3 10
4 10
5 50
6 30
7 10
8 20
9 10
10 20
11 10
run ;
proc sort ;
by amount ;
run ;
data want (drop = _:) ;
do _q = 1 by 1 until (_sum = 50) ;
set have ;
_sum = sum (_sum, amount) ;
end ;
bucket = _n_ ;
do _q = 1 to _q ;
set have ;
output ;
end ;
run ;
Kind regards
Paul D.
As Paul says, there are assumptions that may or may not hold. Here's a program that enables the cumulative sums to exceed the target values (50).
proc sort data=Have out=HaveSort;
by Amount;
run;
/* manually inspect the sum */
proc means data=Have Sum;
var Amount;
run;
%let nGroups = 4;
%let SUM = 200; /* <== input sum from PROC MEANS */
data Cumul;
retain Bin 1;
set HaveSort;
d = &SUM / &nGroups;
cusum + Amount;
diff = cusum - d;
if diff > 0 then do;
cusum = cusum - d;
Bin + 1;
end;
run;
proc print; run;
The code won't work perfectly if there are very large amounts like $200.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.