Solved: Creating a variable that corresponds to number of values missing in su...

RAGC · Posted 01-21-2019 06:08 PM

Hello everyone,

I am working on a discrete-time survival analysis, and I have quite a bit of missing information on the outcome at various time points. To resolve this, I would like to create a variable that corresponds to the number of missing time points (i.e. the different intervals) to include as a covariate in the analysis. My apologies if my explanation is not that clear, but I have attached two tables that provide a visual example of what I mean. Table 1 is an example of how the dataset currently looks, and table 2 represents an example of how I would like it to look once it has been transformed appropriately.

I can get the basic array that would restructure the data from long to wide:

DATA long;
SET wide;
DO Time=1 TO duration;
IF Time=duration AND event=1 THEN death=1;
ELSE death=0;
OUTPUT;
END;
RUN;

It is just adding in that variable that corresponds to the exact number that were missing in between, and then flagging it as dying or not dying at that final time period where there is data.

Any suggestions would be much appreciated!

mkeintz · Posted 01-21-2019 08:22 PM

For each incoming record you want to output as many records as there are 0's or 1's in the DEATH1-DEATH6 variables. So if there are three such values, you want to output a record for timeptn=1, timeptn=2, and timeptn=3.

And for each output record the interval_length is just the count of columns between the current death=0/1 and the preceding death=0/1. Or for timeptn=1 it's just the position of the 0/1 value in the list of death variables. The sum of interval_length values should equal duration:

data have;
  infile datalines missover;
  input ID Duration Event Death1 Death2 Death3 Death4 Death5 Death6;
datalines;
1 3 1 . 0 1  
2 5 0 0 0 . . 0   
3 3 0 . . 0       
4 1 1 1           
5 6 0 . . . . . 0
run;
 
data want (drop=t);
  set have;
  array dth {6} death1-death6;
  timeptn=1;
  do t=1 to duration;
    if dth{t}=0 or dth{t}=1 then do;
      interval_length=ifn(timeptn=1,t,dif(t));
      death=dth{t};
      output;
      timeptn=timeptn+1;
    end;
  end;
run;

The interesting part is within the "if dth{t}=0 or dth{t}=1 then do" group.

It calculates interval_length as either the value for t (if timeptn=1) or otherwise the difference between the current t and the most recent t for a death=0 or 1. The DIF function is defined as dif(x)=x-lag(x).
The death dummy assignment is obvious.
Then the record is output and timeptn is incremented

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Reeza · Posted 01-21-2019 06:26 PM

Is interval length what you want to calculate? If so, how is that being calculated. To me, for ID 1 they should be the same?

PS Please post tables and such directly into the forum, attachments are a pain to work with.

@RAGC wrote:

Hello everyone,

I am working on a discrete-time survival analysis, and I have quite a bit of missing information on the outcome at various time points. To resolve this, I would like to create a variable that corresponds to the number of missing time points (i.e. the different intervals) to include as a covariate in the analysis. My apologies if my explanation is not that clear, but I have attached two tables that provide a visual example of what I mean. Table 1 is an example of how the dataset currently looks, and table 2 represents an example of how I would like it to look once it has been transformed appropriately.

I can get the basic array that would restructure the data from long to wide:

DATA long;
SET wide;
DO Time=1 TO duration;
IF Time=duration AND event=1 THEN death=1;
ELSE death=0;
OUTPUT;
END;
RUN;

It is just adding in that variable that corresponds to the exact number that were missing in between, and then flagging it as dying or not dying at that final time period where there is data.

Any suggestions would be much appreciated!

RAGC · Posted 01-21-2019 07:48 PM

Hi @Reeza,

I have included the tables below (So sorry for not doing so the first time!!), and made the new variable names I am looking to create in blue. When I have previously done this there was no missing values so each time point (i.e. Death1 to Deathn) corresponded to 1 time point each (i.e. an interval of 1) so the array statement would create the timeptn variable from the i variable in the do statement, the new death variable would be created for each person, and there was no need to create this interval variable to account for the greater time periods.

The interval variable refers to how many sequential time points (called Death1 to Deathn) that data point corresponds to. For example, ID 1 does not have an observation at Death1 (i.e. time point 1 which corresponds to 1 interval). However, they do have a value at Death2 (i.e. time point 2, or the next time interval). As such, I want to create this interval variable to let me know that this first time point (that corresponds to this row for that patient) actually represents two possible time points (i.e. 2 intervals of time) - Death1 and Death2. Hopefully that clears things up, and thank you so much for taking a look!

Table 1: Wide data without flags

ID	Duration	Event	Death1	Death2	Death3	Death4	Death5	Death6
1	3	1	.	0	1
2	5	0	0	0	.	.	0
3	3	0	.	.	0
4	1	1	1
5	6	0	.	.	.	.	.	0

Table 2: Long Data Transformed Correctly

ID	Duration	Event	Timeptn	Death1	Death2	Death3	Death4	Death5	Death6	Interval_Length	Death
1	3	1	1	.	0	1				2	0
1	3	1	2	.	0	1				1	1
2	5	0	1	0	0	.	.	0		1	0
2	5	0	2	0	0	.	.	0		1	0
2	5	0	3	0	0	.	.	0		3	0
3	3	0	1	.	.	0				3	0

mkeintz · Posted 01-21-2019 08:22 PM

For each incoming record you want to output as many records as there are 0's or 1's in the DEATH1-DEATH6 variables. So if there are three such values, you want to output a record for timeptn=1, timeptn=2, and timeptn=3.

And for each output record the interval_length is just the count of columns between the current death=0/1 and the preceding death=0/1. Or for timeptn=1 it's just the position of the 0/1 value in the list of death variables. The sum of interval_length values should equal duration:

data have;
  infile datalines missover;
  input ID Duration Event Death1 Death2 Death3 Death4 Death5 Death6;
datalines;
1 3 1 . 0 1  
2 5 0 0 0 . . 0   
3 3 0 . . 0       
4 1 1 1           
5 6 0 . . . . . 0
run;
 
data want (drop=t);
  set have;
  array dth {6} death1-death6;
  timeptn=1;
  do t=1 to duration;
    if dth{t}=0 or dth{t}=1 then do;
      interval_length=ifn(timeptn=1,t,dif(t));
      death=dth{t};
      output;
      timeptn=timeptn+1;
    end;
  end;
run;

The interesting part is within the "if dth{t}=0 or dth{t}=1 then do" group.

It calculates interval_length as either the value for t (if timeptn=1) or otherwise the difference between the current t and the most recent t for a death=0 or 1. The DIF function is defined as dif(x)=x-lag(x).
The death dummy assignment is obvious.
Then the record is output and timeptn is incremented

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

RAGC · Posted 01-21-2019 08:43 PM

@mkeintz thank you so much for your help!! It works perfectly 🙂 Your explanation was also really helpful in following along the code. Thank you!

Creating a variable that corresponds to number of values missing in subsequent columns for each row

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Creating a variable that corresponds to number of values missing in subsequent columns for each row

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Re: Creating a variable that corresponds to number of values missing in subsequent columns for each

Registration is open

SAS Training: Just a Click Away