Solved: Output number closest to X

GS2 · Posted 06-25-2020 10:21 AM

Using SAS 9.4

I have a variable C_mos indicated the months from event A to event B and I want to create a new variable and choose the C_v score that is closest to 6 month between 5-7 months and leave the rest blank. However, I want to prioritize 6 months as the closest but then prioritize 7 months over 5 months as they have reached the 6 month threshold. So I envision the data to look like below:

ID	C_MOS	C_V	newvar
1	5.25	1	.
1	6.02	2	2
1	7.01	3	.
1	5.98	4	.
1	10	5	.
1	5	6	.
2	5.02	1	.
2	5.5	2	.
2	7.5	3	.
2	7.06	4	4
2	8	5	.

So as this shows the closest to 6.0 is taken as long as 6 is achieved but where there is not a 6 month score month 7 is given the priority over month 5 even though the month 5 (5.02) is closer to 6.0 than then 7 month (7.06). Thank you

mkeintz · Posted 06-25-2020 11:53 PM

Then you need to read each ID series twice. The first time, you track the optimal value (call it _X67) in the range [6,8), and also track the optimal value in the range [5,6), call it _x5 (in case no _X67 is found). Note x in the range [6,8) means 6<=x<8.

Take _x67 if it exists, otherwise take _x5 as the optimal _X value.

In the second reading of each ID, compare c_mos to _X, and set NEWVAR=C_V when C_MOS=_X.

data have;
input
ID	C_MOS	C_V	expected_var;
datalines;
1	5.25	1	.
1	6.02	2	2
1	7.01	3	.
1	5.98	4	.
1	10	5	.
1	5	6	.
2	5.02	1	.
2	5.5	2	.
2	7.5	3	.
2	7.06	4	4
2	8	5	.
run;

data want (drop=_:);
  do until (last.id);
    set have;
    by id;
    if (6<=c_mos<8) then _x67=min(_x67,c_mos) ; else
    if (5<=c_mos<6) then _x5=max(_x5,c_mos);
  end;
  _x=coalesce(_x67,_x5);
  do until (last.id);
    set have;
    by id;
    newvar=ifn(c_mos=_x,c_v,.);
    output;
  end;
run;

The "do until (last.id)" loops, with embedded "set have; by id", each reads a complete series for a single ID. The first loop establishes _X5 and _X67. The second loop uses those values to identify the desired record.

If there is a tie, then two records will get a value for NEWVAR. And if there are no values in the range [5,8) then that ID will have only missing values for NEWVAR.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 06-25-2020 11:18 AM

You want to choose "c_v score that is closest to 6 month", yet for ID=2 you choose 7.06 rather than the much closer 5.5. I don't understand the rule. I presume you would therefore also want 6.9 over 5.5. In what situation would you choose a value 5.5?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

GS2 · Posted 06-25-2020 12:15 PM

I want the the 6 month first, then 7 month and finally 5 month. So, in your example while 5.5 is technically closer it has not reached the 6 month threshold and therefore a 7.06 would be chosen. So it almost a hierarchy where I want the closest C_mos between 6-6.99, if not available then 7-7.99, if not available then 5-5.9. But within each group I would want the closest to 6.0 per ID. Does that make more sense? Thank you

mkeintz · Posted 06-25-2020 11:53 PM

Then you need to read each ID series twice. The first time, you track the optimal value (call it _X67) in the range [6,8), and also track the optimal value in the range [5,6), call it _x5 (in case no _X67 is found). Note x in the range [6,8) means 6<=x<8.

Take _x67 if it exists, otherwise take _x5 as the optimal _X value.

In the second reading of each ID, compare c_mos to _X, and set NEWVAR=C_V when C_MOS=_X.

data have;
input
ID	C_MOS	C_V	expected_var;
datalines;
1	5.25	1	.
1	6.02	2	2
1	7.01	3	.
1	5.98	4	.
1	10	5	.
1	5	6	.
2	5.02	1	.
2	5.5	2	.
2	7.5	3	.
2	7.06	4	4
2	8	5	.
run;

data want (drop=_:);
  do until (last.id);
    set have;
    by id;
    if (6<=c_mos<8) then _x67=min(_x67,c_mos) ; else
    if (5<=c_mos<6) then _x5=max(_x5,c_mos);
  end;
  _x=coalesce(_x67,_x5);
  do until (last.id);
    set have;
    by id;
    newvar=ifn(c_mos=_x,c_v,.);
    output;
  end;
run;

The "do until (last.id)" loops, with embedded "set have; by id", each reads a complete series for a single ID. The first loop establishes _X5 and _X67. The second loop uses those values to identify the desired record.

If there is a tie, then two records will get a value for NEWVAR. And if there are no values in the range [5,8) then that ID will have only missing values for NEWVAR.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 06-25-2020 11:40 PM

Something like below could work.

data have;
  infile datalines dsd;
  input ID C_MOS C_V expected;
datalines;
1,5.25,1,.
1,6.02,2,2
1,7.01,3,.
1,5.98,4,.
1,10,5,.
1,5,6,.
2,5.02,1,.
2,5.5,2,.
2,7.5,3,.
2,7.06,4,4
2,8,5,.
3,9.5,2,.
3,4.5,3,.
3,3.06,4,.
3,8,5,.
;

proc sql;
/*  create table want as*/
    select 
      o.*,
      i.c_v as derived
    from have o
    left join
    (
      select id,c_v
      from
      (
      select 
        *,
        case(floor(c_mos))
          when 6 then 10000-c_mos
          when 7 then 1000-c_mos
          when 5 then 100-c_mos
          else .
          end as calc
      from have
      )
      group by id
      having 
        max(calc) = calc 
        and not missing(calc)
    ) i
    on o.id=i.id and o.c_v=i.c_v
    ;
quit;

Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Re: Output number closest to X

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!