Cutting a dataset according to a specified value

Accepted Solution Solved
Reply
Regular Contributor
Posts: 180
Accepted Solution

Cutting a dataset according to a specified value

Hello all,

 

I am facing a tricky operation. To simplify, let's say I have a dataset with 3 columns: Scenario (getting values 1-4), time, and survival. Where time and survival are a result of the lifetest procedure (survival analysis). For each scenario, I ran the lifetest procedure and sent the results to output.

 

In addition, I have a specified time point, for example, 730 days (two years, my time is in days). I want to do the following thing. I want to cut the data and to keep only the time points up to 730 days. But not so simply. If I have the exact number (730) in the data, I wish to simply take only observations smaller or equal to it. If I don't (this is the tricky part), I wish the take the closest observations below, and the closest obsetvation above (I mean the survival value), to average them, and to add a new observation, with time=730 and survival = the averaged value.

 

I have no idea how to do it, can you please assist me ? Many thanks in advance !


Accepted Solutions
Solution
‎04-05-2016 08:42 AM
Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

proc print;
run;

Now, we want to get rid of those two observations that were used for the average:

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

%let time1=x;
%let time2=x;

data want;
set have;
old_time = lag(time);
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    * keep the time values of the obs to be deleted;
    call symput('time1',put(old_time,best.));
    call symput('time2',put(time,best.));
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

* just to check;
%put time1=&time1;
%put time2=&time2;

%macro cleanup;
%if "&time1" ne "x" %then %do; /* conditionally execute code */
data want;
set want;
if time not in (&time1,&time2);
run;
%end;
%mend;
%cleanup;

proc print;
run;

New output:

                                                Obs    Time    Survival    old_time

                                                 1        0      1.000          .  
                                                 2      150      0.970          0  
                                                 3      150      0.880        150  
                                                 4      200      0.930        150  
                                                 5      310      0.790        200  
                                                 6      393      0.750        310  
                                                 7      420      0.680        393  
                                                 8      568      0.610        420  
                                                 9      730      0.555        639  
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

How about this:

%let cutoff=730;

proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival);
output;
if survival >= &cutoff
then do;
  time = &cutoff;
  if survival > &cutoff
  then survival = (survival + old_sur) / 2;
  output;
  stop;
end;
drop old_sur;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

[ Edited ]

Hello. Thank you for the effort. This code does not do what I need. I will give a numerical example, maybe it will be easier to understand what I need exactly.

 

 

data temp;
input Time Survival;
datalines;
0     1
150 0.97
200 0.93
150 0.88
310 0.79
393 0.75
420 0.68
568 0.61
639 0.57
792 0.54
880 0.52
930 0.49
1050 0.47
1200 0.35
;
run;

 

 

My cutoff is 730 days. If I had 730 days in this data, I want to delete all observations with time larger than 730 (so 730 remains the largest one). If I don't have 730 days (like in this example), I want to take the survival time of  the two closest observations (in this case 639 and 792 days - note: it has to be one of each side of 730) and average them: (0.57+0.54)/2, and then I wish to add an observation: 730, 0.555.

Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

Ok, it's clearer now:

data have;
input Time Survival;
datalines;
0     1
150 0.97
200 0.93
150 0.88
310 0.79
393 0.75
420 0.68
568 0.61
639 0.57
792 0.54
880 0.52
930 0.49
1050 0.47
1200 0.35
;
run;

%let cutoff=730;

proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival);
output;
if time >= &cutoff
then do;
  if time > &cutoff
  then do;
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop;
end;
drop old_sur;
run;

proc print;
run;

Result:

                                                      Obs    Time    Survival

                                                        1       0      1.000 
                                                        2     150      0.970 
                                                        3     150      0.880 
                                                        4     200      0.930 
                                                        5     310      0.790 
                                                        6     393      0.750 
                                                        7     420      0.680 
                                                        8     568      0.610 
                                                        9     639      0.570 
                                                       10     792      0.540 
                                                       11     730      0.555 
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

Wow, this seems to be working, apart from one small issue which I did not specify, when I add the new observation, I want to delete the other two I used for averaging, or at least the one larger than 730. But in addition, I have one tiny request. Can you generally explain what you did ? I can see that it's working, but I don't understand what it does. This is a very nice code, not long, yet sophisticated. I would have never thought of it myself. Thank you, I am amazed how easy that was for you.
Solution
‎04-05-2016 08:42 AM
Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

data want;
set have;
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

proc print;
run;

Now, we want to get rid of those two observations that were used for the average:

* I declare a macro variable to make it easier if the cutoff time changes;
%let cutoff=730;

* sort, just to be sure;
proc sort data=have;
by time;
run;

%let time1=x;
%let time2=x;

data want;
set have;
old_time = lag(time);
old_sur = lag(survival); * preserve the value of the preceding observation;
output;
* since another output statement follows, this one is necessary to
replace the implicit output that normally occurs at the end of a data step
iteration;
if time >= &cutoff
then do;
  if time > &cutoff
  then do; * we have to build the additional observation;
    * keep the time values of the obs to be deleted;
    call symput('time1',put(old_time,best.));
    call symput('time2',put(time,best.));
    time = &cutoff;
    survival = (survival + old_sur) / 2;
    output;
  end;
  stop; * end the data step once the cutoff value has been reached;
  * we don't want to read the rest of the input dataset for nothing;
end;
drop old_sur;
run;

* just to check;
%put time1=&time1;
%put time2=&time2;

%macro cleanup;
%if "&time1" ne "x" %then %do; /* conditionally execute code */
data want;
set want;
if time not in (&time1,&time2);
run;
%end;
%mend;
%cleanup;

proc print;
run;

New output:

                                                Obs    Time    Survival    old_time

                                                 1        0      1.000          .  
                                                 2      150      0.970          0  
                                                 3      150      0.880        150  
                                                 4      200      0.930        150  
                                                 5      310      0.790        200  
                                                 6      393      0.750        310  
                                                 7      420      0.680        393  
                                                 8      568      0.610        420  
                                                 9      730      0.555        639  
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

Thank you, incredible ! I think I get it better now, but just to be sure, if my data will have an observation of exactly 730 days, will it keep only the observations smaller or equal to this observation, without changing anything else ?
Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

Yes. With

if time > &cutoff

I prevented the addition of an observation if time equals cutoff, and I also did not set the two macro variables, so they keep their initial value of x.

This then prevents the execution of the data step in the macro.

Just try it with test data, just to be sure.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

I will, thank you greatly. Another small question. If I will have several such datasets appended into one large dataset, distinguished by a parameter (index = 1,2,3,4). Can I add a "by" statement to this code so it will run separately for each value of index ?
Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

The first thing that came to my mind was:

- add a WHERE condition so only a certain index is processed at a time, preferably when sorting (and do that into an intermediate dataset).

- encapsulate the whole shebang in a macro defintion that takes one index value as a parameter (use that macro variable in the WHERE)

- call the macro with the indexes wanted

- if number of indexes is sufficiently large, call the macro with call execute from a data step for further automation.

- add a PROC APPEND step within the macro if you want one consolidated output dataset.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

OK, I will try this. Thank you. As you can imagine, what I asked here is just a small piece of what I need. What I actually need, is to take the output (for each index) of your code, to get the survival values and time values into a string, and to send it into PROC POWER...fairly complicated.
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

Hello Kurt,

 

I was thinking about your solution, and I don't think I know how to proceed with it. I expect to have around 40 different scenarios, so this macro will need to be called many times.

Super User
Posts: 6,928

Re: Cutting a dataset according to a specified value

Wrap the code into a macro, use the variable parts as parameters in the %macro definition. In the code, replace variable parts with the parameters.

Test manually to verify that the macro works.

Write a data step that reads data internally from CARDS; to store the parameters for scenarios.

Run the macro automatically from the data:

data _null_;
set scenarios;
length command $200;
command = '%macroname(' !! trim(param1) !! ',' !! trim(param2) !! ',' !! trim(param3) !! ');';
call execute(command);
run;

Note that it is necessary to quote the macro call with single quotes, so that the macro is not resolved immediately, but after the data step has finished.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Regular Contributor
Posts: 180

Re: Cutting a dataset according to a specified value

[ Edited ]

I have done something slightly different.

 

I wrote a macro, with %do loops, something like:

 

 

%do i = 1 %to 5;
       %do j = 1 to 3;
             data have_treatment;
              ....
              where group = treatment;
             run;
             
             data have_control;
              ....
              where group = control;
             run;

             data want_treatment;
             ........
      %end;
%end;

 

I am not sure if this will even work, but I don't know what to do with the %let time1 = x; Should I define it again for control ? I think that time1 must have some values in now that should be erased.

 

Your suggested code is nice, but my SAS skill is not good enough to figure it out, unfortunatelly.

 

Just to make it clear, I have several scenarios by two variables (i and j) and I have two treatments. Eventaually, want_trt and want_cnt need to be appended somehow

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 828 views
  • 1 like
  • 2 in conversation