Solved: Re: Cutting a dataset according to a specified value

BlueNose · Posted 04-05-2016 06:18 AM

Hello all,

I am facing a tricky operation. To simplify, let's say I have a dataset with 3 columns: Scenario (getting values 1-4), time, and survival. Where time and survival are a result of the lifetest procedure (survival analysis). For each scenario, I ran the lifetest procedure and sent the results to output.

In addition, I have a specified time point, for example, 730 days (two years, my time is in days). I want to do the following thing. I want to cut the data and to keep only the time points up to 730 days. But not so simply. If I have the exact number (730) in the data, I wish to simply take only observations smaller or equal to it. If I don't (this is the tricky part), I wish the take the closest observations below, and the closest obsetvation above (I mean the survival value), to average them, and to add a new observation, with time=730 and survival = the averaged value.

I have no idea how to do it, can you please assist me ? Many thanks in advance !

Kurt_Bremser · Posted 04-05-2016 08:10 AM

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

proc print;
run;

Now, we want to get rid of those two observations that were used for the average:

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

%let time1=x;
%let time2=x;

data want;
set have;
old_time = lag(time);
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    * keep the time values of the obs to be deleted;
    call symput('time1',put(old_time,best.));
    call symput('time2',put(time,best.));
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

* just to check;
%put time1=&time1;
%put time2=&time2;

%macro cleanup;
%if "&time1" ne "x" %then %do; /* conditionally execute code */
data want;
set want;
if time not in (&time1,&time2);
run;
%end;
%mend;
%cleanup;

proc print;
run;

New output:

                                                Obs    Time    Survival    old_time

                                                 1        0      1.000          .  
                                                 2      150      0.970          0  
                                                 3      150      0.880        150  
                                                 4      200      0.930        150  
                                                 5      310      0.790        200  
                                                 6      393      0.750        310  
                                                 7      420      0.680        393  
                                                 8      568      0.610        420  
                                                 9      730      0.555        639

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 04-05-2016 07:22 AM

How about this:

%let cutoff=730;

proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival);
output;
if survival >= &cutoff
then do;
  time = &cutoff;
  if survival > &cutoff
  then survival = (survival + old_sur) / 2;
  output;
  stop;
end;
drop old_sur;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BlueNose · Posted 04-05-2016 07:41 AM

Hello. Thank you for the effort. This code does not do what I need. I will give a numerical example, maybe it will be easier to understand what I need exactly.

data temp;
input Time Survival;
datalines;
0     1
150 0.97
200 0.93
150 0.88
310 0.79
393 0.75
420 0.68
568 0.61
639 0.57
792 0.54
880 0.52
930 0.49
1050 0.47
1200 0.35
;
run;

My cutoff is 730 days. If I had 730 days in this data, I want to delete all observations with time larger than 730 (so 730 remains the largest one). If I don't have 730 days (like in this example), I want to take the survival time of the two closest observations (in this case 639 and 792 days - note: it has to be one of each side of 730) and average them: (0.57+0.54)/2, and then I wish to add an observation: 730, 0.555.

Kurt_Bremser · Posted 04-05-2016 07:47 AM

Ok, it's clearer now:

data have;
input Time Survival;
datalines;
0     1
150 0.97
200 0.93
150 0.88
310 0.79
393 0.75
420 0.68
568 0.61
639 0.57
792 0.54
880 0.52
930 0.49
1050 0.47
1200 0.35
;
run;

%let cutoff=730;

proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival);
output;
if time >= &cutoff
then do;
  if time > &cutoff
  then do;
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop;
end;
drop old_sur;
run;

proc print;
run;

Result:

                                                      Obs    Time    Survival

                                                        1       0      1.000 
                                                        2     150      0.970 
                                                        3     150      0.880 
                                                        4     200      0.930 
                                                        5     310      0.790 
                                                        6     393      0.750 
                                                        7     420      0.680 
                                                        8     568      0.610 
                                                        9     639      0.570 
                                                       10     792      0.540 
                                                       11     730      0.555

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BlueNose · Posted 04-05-2016 07:51 AM

Wow, this seems to be working, apart from one small issue which I did not specify, when I add the new observation, I want to delete the other two I used for averaging, or at least the one larger than 730. But in addition, I have one tiny request. Can you generally explain what you did ? I can see that it's working, but I don't understand what it does. This is a very nice code, not long, yet sophisticated. I would have never thought of it myself. Thank you, I am amazed how easy that was for you.

Kurt_Bremser · Posted 04-05-2016 08:10 AM

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

proc print;
run;

Now, we want to get rid of those two observations that were used for the average:

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

%let time1=x;
%let time2=x;

data want;
set have;
old_time = lag(time);
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    * keep the time values of the obs to be deleted;
    call symput('time1',put(old_time,best.));
    call symput('time2',put(time,best.));
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

* just to check;
%put time1=&time1;
%put time2=&time2;

%macro cleanup;
%if "&time1" ne "x" %then %do; /* conditionally execute code */
data want;
set want;
if time not in (&time1,&time2);
run;
%end;
%mend;
%cleanup;

proc print;
run;

New output:

                                                Obs    Time    Survival    old_time

                                                 1        0      1.000          .  
                                                 2      150      0.970          0  
                                                 3      150      0.880        150  
                                                 4      200      0.930        150  
                                                 5      310      0.790        200  
                                                 6      393      0.750        310  
                                                 7      420      0.680        393  
                                                 8      568      0.610        420  
                                                 9      730      0.555        639

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BlueNose · Posted 04-05-2016 08:16 AM

Thank you, incredible ! I think I get it better now, but just to be sure, if my data will have an observation of exactly 730 days, will it keep only the observations smaller or equal to this observation, without changing anything else ?

Kurt_Bremser · Posted 04-05-2016 08:23 AM

Yes. With

if time > &cutoff

I prevented the addition of an observation if time equals cutoff, and I also did not set the two macro variables, so they keep their initial value of x.

This then prevents the execution of the data step in the macro.

Just try it with test data, just to be sure.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BlueNose · Posted 04-05-2016 08:26 AM

I will, thank you greatly. Another small question. If I will have several such datasets appended into one large dataset, distinguished by a parameter (index = 1,2,3,4). Can I add a "by" statement to this code so it will run separately for each value of index ?

Kurt_Bremser · Posted 04-05-2016 08:34 AM

The first thing that came to my mind was:

- add a WHERE condition so only a certain index is processed at a time, preferably when sorting (and do that into an intermediate dataset).

- encapsulate the whole shebang in a macro defintion that takes one index value as a parameter (use that macro variable in the WHERE)

- call the macro with the indexes wanted

- if number of indexes is sufficiently large, call the macro with call execute from a data step for further automation.

- add a PROC APPEND step within the macro if you want one consolidated output dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BlueNose · Posted 04-05-2016 08:43 AM

OK, I will try this. Thank you. As you can imagine, what I asked here is just a small piece of what I need. What I actually need, is to take the output (for each index) of your code, to get the survival values and time values into a string, and to send it into PROC POWER...fairly complicated.

BlueNose · Posted 04-06-2016 06:12 AM

Hello Kurt,

I was thinking about your solution, and I don't think I know how to proceed with it. I expect to have around 40 different scenarios, so this macro will need to be called many times.

Kurt_Bremser · Posted 04-06-2016 07:10 AM

Wrap the code into a macro, use the variable parts as parameters in the %macro definition. In the code, replace variable parts with the parameters.

Test manually to verify that the macro works.

Write a data step that reads data internally from CARDS; to store the parameters for scenarios.

Run the macro automatically from the data&colon;

data _null_;
set scenarios;
length command $200;
command = '%macroname(' !! trim(param1) !! ',' !! trim(param2) !! ',' !! trim(param3) !! ');';
call execute(command);
run;

Note that it is necessary to quote the macro call with single quotes, so that the macro is not resolved immediately, but after the data step has finished.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

BlueNose · Posted 04-06-2016 07:24 AM

I have done something slightly different.

I wrote a macro, with %do loops, something like:

%do i = 1 %to 5;
       %do j = 1 to 3;
             data have_treatment;
              ....
              where group = treatment;
             run;
             
             data have_control;
              ....
              where group = control;
             run;

             data want_treatment;
             ........
      %end;
%end;

I am not sure if this will even work, but I don't know what to do with the %let time1 = x; Should I define it again for control ? I think that time1 must have some values in now that should be erased.

Your suggested code is nice, but my SAS skill is not good enough to figure it out, unfortunatelly.

Just to make it clear, I have several scenarios by two variables (i and j) and I have two treatments. Eventaually, want_trt and want_cnt need to be appended somehow

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away