Hi there,
I have a data set containing the dose that each patient received at each visit. When the dose has changed from the previous visit, its value is included. If not, it is assumed to be the same and is left missing. To make some calculations, I would like to write code that, when the dose is maintained (samedose = 'Y'), assigns the previous dose to the current record.
Let me use a small example to explain what I mean. Starting from this data set:
data sample_data;
infile datalines delimiter=',';
input pt $ visit dose samedose $;
datalines;
001,0,7.4,
001,1,.,Y
001,2,.,Y
001,3,.,Y
002,0,3.7,
002,1,2.3,N
002,2,.,Y
002,3,.,Y
003,0,5.4,
003,1,.,Y
003,2,2.7,N
003,3,.,Y
004,0,5.4,
004,1,3.2,N
004,2,.,Y
004,3,4.8,N
;
run;
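I would like to obtain the following completed data set, where every samedose = 'Y' record carries the last recorded dose of the same patient:
pt   visit  dose
001  0      7.4
001  1      7.4
001  2      7.4
001  3      7.4
002  0      3.7
002  1      2.3
002  2      2.3
002  3      2.3
003  0      5.4
003  1      5.4
003  2      2.7
003  3      2.7
004  0      5.4
004  1      3.2
004  2      3.2
004  3      4.8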
For this purpose I thought that the LAG function could be a good option. However, it is behaving unexpectedly:
This code is not working at all:
data sample_lag_output;
set sample_data;
if samedose = 'Y' then dose = lag1(dose);
run;
This code correctly assigns only the values whose previous value was originally non-missing:
data sample_lag_output;
set sample_data;
lagdose = lag1(dose);
if samedose = 'Y' then dose = lagdose;
run;
data sample_data;
infile datalines delimiter=',';
input pt $ visit dose samedose $;
datalines;
001,0,7.4,
001,1,.,Y
001,2,.,Y
001,3,.,Y
002,0,3.7,
002,1,2.3,N
002,2,.,Y
002,3,.,Y
003,0,5.4,
003,1,.,Y
003,2,2.7,N
003,3,.,Y
004,0,5.4,
004,1,3.2,N
004,2,.,Y
004,3,4.8,N
;
run;
data want;
update sample_data(obs=0) sample_data;
by pt;
lag=lag(dose);
output;
run;
Dear @Ksharp,
Thank you for your quick response. Although your code works for this simple example, I don't know if it will fit my needs for the full data set, in which some cells are expected to remain missing if samedose ne 'Y'. From what I see, you are not using any conditional structure in your code to force the LAG value to be used only when samedose='Y'. Or maybe you are including this feature and I'm not fully understanding your code.
Could you please explain in a bit more detail the purpose of each line?
Thank you very much in advance!
"some cells are expected to remain missing if samedose ne 'Y'. From what I see, you are not using any conditional structure in your code to force the LAG value to be used only when samedose='Y'. "
I don't understand what you mean. An example would be the best way to explain the question.
You're right, @Ksharp, I tried to simplify the input sample as much as possible and lost some of its features along the way.
Let me introduce a new variable called "withdrawn", which is set to Y if the patient is withdrawn from the study before completing the 4 scheduled visits (0, 1, 2, 3). In that case, the dose value should remain missing for the visits the patient did not attend. Consider this modified sample data set, in which patients 003 and 004 have an early termination:
data sample_data;
infile datalines delimiter=',';
input pt $ visit dose samedose $ withdrawn $;
datalines;
001,0,7.4, ,
001,1,.,Y,
001,2,.,Y,
001,3,.,Y,
002,0,3.7, ,
002,1,2.3,N,
002,2,.,Y,
002,3,.,Y,
003,0,5.4, ,
003,1,.,Y,
003,2,2.7,N,
003,3,., ,Y
004,0,5.4, ,
004,1,3.2,N,
004,2,., ,Y
004,3,., ,Y
;
run;
In this case, when I run your code I obtain this result, in which I have highlighted the unwanted values:
Could your code be modified to include this?
Thanks for your patience!
That is easy.
data sample_data;
infile datalines delimiter=',';
input pt $ visit dose samedose $ withdrawn $;
datalines;
001,0,7.4, ,
001,1,.,Y,
001,2,.,Y,
001,3,.,Y,
002,0,3.7, ,
002,1,2.3,N,
002,2,.,Y,
002,3,.,Y,
003,0,5.4, ,
003,1,.,Y,
003,2,2.7,N,
003,3,., ,Y
004,0,5.4, ,
004,1,3.2,N,
004,2,., ,Y
004,3,., ,Y
;
run;
data want;
update sample_data(obs=0) sample_data;
by pt;
lag=lag(dose);
if withdrawn='Y' then call missing(dose,samedose);
output;
run;
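Since the purpose of each line was asked about earlier in the thread, here is the same step again with comments. The comments are an interpretation of how the UPDATE statement behaves with its default settings, not something stated by the author of the code, and `want_annotated` is just a name chosen here to avoid overwriting `want`.

```sas
data want_annotated;
  /* Master = an empty copy of sample_data (obs=0 reads no rows, but keeps  */
  /* the variables); transaction = sample_data itself.                      */
  /* UPDATE applies each transaction row onto the current master record,    */
  /* and by default a missing transaction value does NOT overwrite the      */
  /* value already held, so the last non-missing dose is carried forward.   */
  update sample_data(obs=0) sample_data;

  /* A new master record starts for every patient, so nothing leaks from    */
  /* one pt to the next. The input must be sorted by pt.                    */
  by pt;

  /* Extra column with the previous row's dose; it is not needed for the    */
  /* fill itself.                                                           */
  lag = lag(dose);

  /* Withdrawn visits must stay empty, so undo the carry-forward there.     */
  if withdrawn = 'Y' then call missing(dose, samedose);

  /* Without an explicit OUTPUT, UPDATE writes only one row per BY group;   */
  /* OUTPUT here writes a row for every visit.                              */
  output;
run;
```

Note that there is still no test on samedose: every missing dose is filled unless the visit is flagged as withdrawn.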
Using lag() in a conditional branch is always dangerous: lag() feeds its FIFO queue only when it is called, and the value it stores is the current one at that moment. As a result, you will propagate a missing value as soon as two or more missing values appear in succession. Use a retained variable instead:
data want;
set sample_data;
by pt;
retain _dose;
if first.pt then _dose = .;
if dose ne .
then _dose = dose;
else dose = _dose;
drop _dose;
run;
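The step above fills every missing dose, including the visits flagged as withdrawn in the modified sample. A possible variant, offered only as a sketch, restricts the fill to records with samedose = 'Y'. It assumes that sample_data is sorted by pt and that samedose is blank on withdrawn visits, as in the sample data; `want_samedose` is a name made up for this example.

```sas
data want_samedose;
  set sample_data;
  by pt;
  retain _dose;                               /* keeps its value across rows  */
  if first.pt then _dose = .;                 /* reset for each new patient   */
  if not missing(dose) then _dose = dose;     /* remember the last real dose  */
  else if samedose = 'Y' then dose = _dose;   /* fill only when dose was kept */
  drop _dose;
run;
```

With the modified sample, visit 3 of patient 003 and visits 2 and 3 of patient 004 would stay missing, because samedose is blank on those records.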
Dear @Kurt_Bremser,
I thought about doing it with the RETAIN statement as you are suggesting, but now that I've tried it with LAG, I want to understand how this function works for future applications.
Thank you very much for your quick answer!
As @Kurt_Bremser said, the lag() function only updates its queue when it is called.
Because you call it inside a condition, it is not executed on every row:
if samedose = 'Y' then dose = lag1(dose);
so it misses some updates.
Using an extraneous variable as you did is the proper way to do what you want.
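To see the queue behaviour in isolation, a small sketch like the one below may help. The data set `lag_demo` and the variables `x`, `flag`, `prev_always` and `prev_cond` are made up for illustration; each LAG call in a program keeps its own queue.

```sas
data lag_demo;
  input x flag $;
  prev_always = lag(x);                    /* executed on every row         */
  if flag = 'Y' then prev_cond = lag(x);   /* executed only on flagged rows */
  datalines;
1 N
2 Y
3 N
4 Y
5 Y
;
run;

/* prev_always: . 1 2 3 4   (always the value from the previous row)        */
/* prev_cond  : . . . 2 4   (the value of x the last time this call ran)    */
proc print data=lag_demo;
run;
```

On the fourth row prev_cond is 2 rather than 3, because the conditional call was skipped on the third row and its queue still held the value from the second.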