BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jkurka
Fluorite | Level 6

Hello SAS community! These posts have been very helpful in the past, so I'm hoping some may have solutions for my task.

 

Data: one variable ranging from 0.000 - 5.000

Task: create groups based on ranges of values

     For example:

Values <= 0 to <= 0.039 = group 1

Values >= 0.04 to <=0.0.79 = group 2

Values >=0.08 to <=0.119 = group 3, etc

...all the way up to 5.000 (this should result in 128 groups).

 

I have over 10 million observations and about 900 datasets to process this on, so I would like it to be fairly quick too.

 

Can anyone help? Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Possibly a slight modification of @FreelanceReinh's code:

 

group = floor(x/(5/128))+1;

 

When I iterate at 0.01 between 0 and 5 that generates 128 groups of 4 values each

View solution in original post

10 REPLIES 10
jkurka
Fluorite | Level 6

I should mention that the goal is to avoid having to code in 128 if-then statements.

Reeza
Super User

Use a format, and you can build your format by using a formula. 

 

data range_fmt;
fmtname='Range_fmt';
type='N';
EEXCL='Y'; *exclude end value of range;
label=0; *group value;

do i=0 to 5 by 0.04;
	start=i;
	end=i+0.04;
	label+1;
	output;
end;

drop i;
run;

proc format cntlin=range_fmt;
run;
jkurka
Fluorite | Level 6

Thank you. However, the end of label 1 is the same as the start of label 2. I'd like label 1 to be 0 <= 0.039 and label 2 to be 0.04 <= 0.079. Also, how would you match up the format with the original data set?

Reeza
Super User

In my opinion <0.04 is the same as <=0.039. The EEXCL option included above, states that 0.04 is NOT included in the boundary condition.

FreelanceReinh
Jade | Level 19

If you prefer to have a new variable group with values 1, 2, 3, etc., you could define it as follows (for a variable X containing the values to be categorized):

data want;
set have;
group=floor(x/0.04)+1;
run;

Whether you create a new variable or a format (as suggested by Reeza) or a function (with PROC FCMP), you should pay particular attention to the correct assignment of values at the boundary between two ranges. Otherwise, numeric representation issues could lead to surprising group assignments.

 

Your outlined definition is not quite clear about the categorization of values x with 0.039 < x < 0.04 etc. (if any). Also, how do you know there will be 128 groups? Continuing your definition I obtain 5/0.04=125 plus 1 group for value 5.000 alone.

 

FreelanceReinh
Jade | Level 19

Here's what I meant by "surprising group assignments":

/* Create test data */
data have;
do _n_=0 to 5000;
  x=_n_/1000;
  output;
end;
run; /* 5001 obs.: x=0, 0.001, 0.002, ..., 4.999, 5 */

/* Compare the two group definitions that have been suggested so far */
data chk;
set have;
group=floor(x/0.04)+1;
groupf=input(put(x, range_fmt.), 8.);
if group ne groupf;
run; /* 56 obs.! */

proc print data=chk(obs=10) noobs;
run;

Result (with SAS 9.4 on Windows 7):

  x     group    groupf

0.44      12       11
0.52      14       13
0.76      20       19
0.84      22       21
0.92      24       23
1.00      26       25
1.08      28       27
1.16      30       29
1.24      32       31
1.32      34       33
jkurka
Fluorite | Level 6

All values are only to the thousandths, so there are no values between 0.039 and 0.040. I'd like the range to be 0.039...and 5/0.039 is about 128. Another way of explaining would be that I want values from 0.000 to 5.000 split into 128 groups of equal range.

ballardw
Super User

Possibly a slight modification of @FreelanceReinh's code:

 

group = floor(x/(5/128))+1;

 

When I iterate at 0.01 between 0 and 5 that generates 128 groups of 4 values each

FreelanceReinh
Jade | Level 19
@jkurka wrote:

All values are only to the thousandths, so there are no values between 0.039 and 0.040. I'd like the range to be 0.039...and 5/0.039 is about 128. Another way of explaining would be that I want values from 0.000 to 5.000 split into 128 groups of equal range.


It's true that 5/0.039=128.2..., but due to the structural gaps (like 0.039 to 0.04) between your intervals the number of ranges will not exceed 126 with the first two definitions that had been discussed.

 

A definition of "128 groups of equal range," however, would require a different definition. Please note that 5001 is not divisible by 128 (5001/128=39.07...). So, if you assign each of the 5001 values 0, 0.001, 0.002, ..., 4.999, 5 to 128 ranges, the numbers of elements per range will at best vary between 39 and 40.

 

Suggestion: Use 5/128=0.0390625 as the interval length, right-open intervals, but assign 5 to group 128 (so as to avoid a 129th group for value 5 alone). This can be achieved by applying ballardw's suggestion to values x with 0<=x<5 and classifying 5 into group 128. 

 

This definition creates 119 categories with 39 values and 9 with 40 values.

TomKari
Onyx | Level 15

My advice is to use the "cntlin" option that @Reeza posted. Use the "EEXCL" option to prevent the end of your range including the number.

 

Tom

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 19975 views
  • 0 likes
  • 5 in conversation