DATA Step, Macro, Functions and more

Assign bins to variable

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 5
Accepted Solution

Assign bins to variable

Hi I have a variable that has value from for example -3000 to 20,000, i want to assign them into 10,000 buckets evenly spaced, and calculate the percentage of observations that fall between the values. How is this done in sas steps?

 

Thank you!


Accepted Solutions
Solution
‎06-14-2017 02:37 PM
Super User
Posts: 10,500

Re: Assign bins to variable

Without knowing how this is going to be used and which assignment makes sense for some I would start with this.

data have;
   do x = -200 to 3000;
      output;
   end;
run;

proc sql;
select max(x) into : max from have;
select min(x) into : min from have;
quit;
 
data buckets;
   set have;
   bin = floor((x-&min)/((&max-&min)/10000)) + 1;
run;

Note that with a range of -200 to 3000 if your values are integers you are not going to use all 10,000 bins. Ever. You only have 3200 values if integer valued so roughly two thirds of the bins will receive no assignment.

 

View solution in original post


All Replies
PROC Star
Posts: 551

Re: Assign bins to variable

[ Edited ]

Use PROC RANK with the GROUPS option

Occasional Contributor
Posts: 5

Re: Assign bins to variable

Hi Draycut, I tried out proc rank but it gives me equal number of observations in each assigned group. However, my want the group to threshold to be evenly spaced between maxinum and minimum with 10,000 buckets. I should assign the variable based on the value of the variable I looked at, so maybe group 1 will only have 1 observation, and group 30 will have 360 observations, do you know a solution for this? thank you!

Super User
Super User
Posts: 7,401

Re: Assign bins to variable

You would use do loops or a calculation.  To get further answers to you question provide the information suggested in the box next to post a question - provide test data in the form of a datastep, what the output should look like and describe the logic between them.

At a rough guess:

data want;
  set have;
bin=ceil(value / 10000);
run;
proc means data=want;
...
run;
Occasional Contributor
Posts: 5

Re: Assign bins to variable

my dataset would be something like:

I have about 2330000 observations, I want to assign basically evenly spaced 10,000 buckets. The bucket criteira would the max(var)-min(var). So for example, my maximum value is 3000 and my minimum value is -200, So my bucket spread will be (3000+200)/10,000=0.32. And bucket 1 will be value between -200 to (-200+0.32), bucket 2 will be value between (-200+0.32) to (-200+0.32*2) and so on. I want to have a new variable that see if what bucket is falls into.

 

Var            New_var

-200              1

-53                ?

-5                  ?

-46                ?

5

8

4

56

7542

242

....

Super User
Posts: 5,082

Re: Assign bins to variable

I'd set up a format to define the buckets.  For example:

 

proc sql;

select max(var) into : max from have;

select min(var) into : min from have;

quit;

 

data buckets;

interval = (&max - &min) / 10000;

fmtname = 'bucket';

do label = 1 to 10000;

   start = &min + (label - 1) * interval;

   end = start + interval;

   if label=10000 then end=&max;

   output;

end;

run;

proc format cntlin=buckets;

run;

 

This gives you a format that translates each range into a bucket (1 to 10000).  For example, you could use:

 

data want;

set have;

bucket = put(var, bucket.);

run;

 

In this example, BUCKET would be character, but you could always convert to numeric if you want to:

 

bucket = input(put(var, bucket.), 5.);

 

The end point of the final bucket gets hard-coded to the max, because 10,000 iterations of floating point math might produce a slight difference in the endpoint.

 

The code is untested.  It looks right, but you will have to do the testing.

 

Solution
‎06-14-2017 02:37 PM
Super User
Posts: 10,500

Re: Assign bins to variable

Without knowing how this is going to be used and which assignment makes sense for some I would start with this.

data have;
   do x = -200 to 3000;
      output;
   end;
run;

proc sql;
select max(x) into : max from have;
select min(x) into : min from have;
quit;
 
data buckets;
   set have;
   bin = floor((x-&min)/((&max-&min)/10000)) + 1;
run;

Note that with a range of -200 to 3000 if your values are integers you are not going to use all 10,000 bins. Ever. You only have 3200 values if integer valued so roughly two thirds of the bins will receive no assignment.

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 134 views
  • 0 likes
  • 5 in conversation