Hi All, I have the following data. I'd like to create 3 groups (1, 3, 5) based on the percent variable. I'd like the group = 1 to represent the closest approximation to 20%. For example, .18 and .23 exist but .18 is closer to .2 so I'd like that to be the cutoff. Essentially, the tie goes to the number that minimizes the difference from .2. The same logic applies for group 5 but instead I'd like the cutoff to be .8, again I'd like the tie to go towards the number that minimizes the difference from 0.8. For all the values in between, they can be assigned group = 3. The want data is below. Thanks!
data have;
input percent;
datalines;
.01
.05
.16
.18
.23
.46
.55
.67
.69
.698
.74
.78
.81
.91
.82
;
run;
data want;
input percent group;
datalines;
.01 1
.05 1
.16 1
.18 1
.23 3
.46 3
.55 3
.67 3
.69 3
.698 3
.74 3
.78 3
.81 5
.91 5
.92 5
;
run;
Here is some SQL that finds the 'breakpoints' and uses them to apply the group values.
The 1 and 5 groups are computed to contain the largest number of percents in case of ties, which means ties for group 1 break at value > 0.20 and ties for group 5 break at value < 0.80. You will want to ROUND results to ensure ties are properly identified.
Example:
data have;
input percent;
datalines;
.01
.05
.16
.18
.22 /* higher tie */
.23
.46
.55
.67
.69
.698
.74
.78 /* lower tie */
.82
.91
.92
; proc sql; create table breakpoints as select percent, group, diff from ( ( select percent , 1 as group , round(abs(percent-0.20),0.01) as diff from have having diff = min(diff) ) union (select percent , 5 as group , round(abs(percent-0.80),0.01) as diff from have having diff = min(diff) ) ) group by group having group = 1 and percent=max(percent) /* tie processing */ or group = 5 and percent=min(percent) ; create table want as select have.* , coalesce(group, 3) as group from have left join breakpoints on have.percent <= breakpoints.percent and group = 1 or have.percent >= breakpoints.percent and group = 5 order by percent ;
Sorry the .82 in the have data should have been .92, here it is again:
data have;
input percent;
datalines;
.01
.05
.16
.18
.23
.46
.55
.67
.69
.698
.74
.78
.81
.91
.92
;
run;
data want;
input percent group;
datalines;
.01 1
.05 1
.16 1
.18 1
.23 3
.46 3
.55 3
.67 3
.69 3
.698 3
.74 3
.78 3
.81 5
.91 5
.92 5
;
run;
From your description i expected 0.81 to be in group 3, because the distance to 0.8 is smaller than the distance between 0.78 and 0.8
What if you have a tie, e.g. one value of .18 and one value of .22? They are the same distance from .2, so which one becomes your cutoff for group 1?
And what if your lowest value is, say, .35? Does it get put in group 1?
Here is some SQL that finds the 'breakpoints' and uses them to apply the group values.
The 1 and 5 groups are computed to contain the largest number of percents in case of ties, which means ties for group 1 break at value > 0.20 and ties for group 5 break at value < 0.80. You will want to ROUND results to ensure ties are properly identified.
Example:
data have;
input percent;
datalines;
.01
.05
.16
.18
.22 /* higher tie */
.23
.46
.55
.67
.69
.698
.74
.78 /* lower tie */
.82
.91
.92
; proc sql; create table breakpoints as select percent, group, diff from ( ( select percent , 1 as group , round(abs(percent-0.20),0.01) as diff from have having diff = min(diff) ) union (select percent , 5 as group , round(abs(percent-0.80),0.01) as diff from have having diff = min(diff) ) ) group by group having group = 1 and percent=max(percent) /* tie processing */ or group = 5 and percent=min(percent) ; create table want as select have.* , coalesce(group, 3) as group from have left join breakpoints on have.percent <= breakpoints.percent and group = 1 or have.percent >= breakpoints.percent and group = 5 order by percent ;
Hi this code is almost perfect, I really appreciate it! One thing I noticed though is when there are duplicate percent values, the step when creating the want table, it deletes the duplicates so that only one unique value of percent can exist. With the have data below, your code produces the want data and see that .23 and .18 duplicates are both missing. Is there a way to resolve that? It says there there might be a cartesian product problem?
data have;
input percent;
datalines;
.01
.05
.16
.18
.18
.23
.23
.46
.55
.67
.69
.698
.74
.78
.82
.91
.92
;
run;
data want;
input percent group;
datalines;
.01 1
.05 1
.16 1
.18 1
.23 3
.46 3
.55 3
.67 3
.69 3
.698 3
.74 3
.78 3
.82 5
.91 5
.92 5
;
run;
ignore my last comment it was a problem on my end, works great thanks so much!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.