Solved: Re: Create ID groups by criteria

PeterBr · Posted 08-31-2020 09:05 PM

Hi All, I have the following data. I'd like to create 3 groups (1, 3, 5) based on the percent variable. I'd like the group = 1 to represent the closest approximation to 20%. For example, .18 and .23 exist but .18 is closer to .2 so I'd like that to be the cutoff. Essentially, the tie goes to the number that minimizes the difference from .2. The same logic applies for group 5 but instead I'd like the cutoff to be .8, again I'd like the tie to go towards the number that minimizes the difference from 0.8. For all the values in between, they can be assigned group = 3. The want data is below. Thanks!


data have;
   input percent;
   datalines;
.01
.05
.16
.18
.23
.46
.55 
.67
.69
.698
.74
.78
.81
.91
.82
;
run;

data want;
   input percent group;
   datalines;
.01 1
.05 1
.16 1
.18 1
.23 3
.46 3
.55 3
.67 3
.69 3
.698 3
.74 3
.78 3
.81 5
.91 5
.92 5
;
run;

RichardDeVen · Posted 09-01-2020 02:25 AM

Here is some SQL that finds the 'breakpoints' and uses them to apply the group values.

The 1 and 5 groups are computed to contain the largest number of percents in case of ties, which means ties for group 1 break at value > 0.20 and ties for group 5 break at value < 0.80. You will want to ROUND results to ensure ties are properly identified.

Example:

data have;
input percent;
datalines;
.01
.05
.16
.18
.22 /* higher tie */
.23
.46
.55 
.67
.69
.698
.74
.78 /* lower tie */
.82
.91
.92
;

proc sql;
  create table breakpoints as
  select percent, group, diff
  from 
  (
    ( select
      percent
    , 1 as group
    , round(abs(percent-0.20),0.01) as diff
    from have
    having diff = min(diff)
    )
    union
    (select
      percent
    , 5 as group
    , round(abs(percent-0.80),0.01) as diff
    from have
    having diff = min(diff)
    )
  )
  group by group
  having group = 1 and percent=max(percent) /* tie processing */
      or group = 5 and percent=min(percent)
  ;

  create table want as
  select 
    have.*
  , coalesce(group, 3) as group
  from have 
  left join breakpoints
  on 
  have.percent <= breakpoints.percent and group = 1
  or
  have.percent >= breakpoints.percent and group = 5
  order by percent
  ;

View solution in original post

PeterBr · Posted 08-31-2020 10:28 PM

Sorry the .82 in the have data should have been .92, here it is again:

data have;
   input percent;
   datalines;
.01
.05
.16
.18
.23
.46
.55 
.67
.69
.698
.74
.78
.81
.91
.92
;
run;

data want;
   input percent group;
   datalines;
.01 1
.05 1
.16 1
.18 1
.23 3
.46 3
.55 3
.67 3
.69 3
.698 3
.74 3
.78 3
.81 5
.91 5
.92 5
;
run;

andreas_lds · Posted 09-01-2020 02:23 AM

From your description i expected 0.81 to be in group 3, because the distance to 0.8 is smaller than the distance between 0.78 and 0.8

mkeintz · Posted 08-31-2020 10:37 PM

What if you have a tie, e.g. one value of .18 and one value of .22? They are the same distance from .2, so which one becomes your cutoff for group 1?

And what if your lowest value is, say, .35? Does it get put in group 1?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

RichardDeVen · Posted 09-01-2020 02:25 AM

Here is some SQL that finds the 'breakpoints' and uses them to apply the group values.

The 1 and 5 groups are computed to contain the largest number of percents in case of ties, which means ties for group 1 break at value > 0.20 and ties for group 5 break at value < 0.80. You will want to ROUND results to ensure ties are properly identified.

Example:

data have;
input percent;
datalines;
.01
.05
.16
.18
.22 /* higher tie */
.23
.46
.55 
.67
.69
.698
.74
.78 /* lower tie */
.82
.91
.92
;

proc sql;
  create table breakpoints as
  select percent, group, diff
  from 
  (
    ( select
      percent
    , 1 as group
    , round(abs(percent-0.20),0.01) as diff
    from have
    having diff = min(diff)
    )
    union
    (select
      percent
    , 5 as group
    , round(abs(percent-0.80),0.01) as diff
    from have
    having diff = min(diff)
    )
  )
  group by group
  having group = 1 and percent=max(percent) /* tie processing */
      or group = 5 and percent=min(percent)
  ;

  create table want as
  select 
    have.*
  , coalesce(group, 3) as group
  from have 
  left join breakpoints
  on 
  have.percent <= breakpoints.percent and group = 1
  or
  have.percent >= breakpoints.percent and group = 5
  order by percent
  ;

PeterBr · Posted 09-01-2020 09:37 PM

Hi this code is almost perfect, I really appreciate it! One thing I noticed though is when there are duplicate percent values, the step when creating the want table, it deletes the duplicates so that only one unique value of percent can exist. With the have data below, your code produces the want data and see that .23 and .18 duplicates are both missing. Is there a way to resolve that? It says there there might be a cartesian product problem?


data have;
   input percent;
   datalines;
.01
.05
.16
.18
.18
.23
.23
.46
.55
.67
.69
.698
.74
.78
.82
.91
.92
;
run;


data want;
   input percent group;
   datalines;
.01 1
.05 1
.16 1
.18 1
.23 3
.46 3
.55 3
.67 3
.69 3
.698 3
.74 3
.78 3
.82 5
.91 5
.92 5
;
run;

PeterBr · Posted 09-01-2020 09:42 PM

ignore my last comment it was a problem on my end, works great thanks so much!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away