Hello,
Here is the sample data set similar to what I am handling now.
data temp; infile datalines dlm='#'; input id $ dttm :datetime20. var1 var2 var3 var4; format dttm datetime20.; datalines; A # 20JAN2021:13:37:39.000 # 1 # # # A # 10FEB2021:09:45:46.000 # 1 # 0 # 0 # 0 A # 04SEP2021:11:17:24.000 # 1 # # # A # 09OCT2021:18:01:40.000 # 1 # 0 # 0 # 0 A # 10OCT2021:06:15:52.000 # 0 # # # A # 13OCT2021:13:42:22.000 # 1 # # # A # 17OCT2021:10:44:29.000 # 0 # 0 # # A # 11DEC2021:09:54:47.000 # 0 # 1 # 1 # A # 15DEC2021:16:55:44.000 # 1 # 0 # 0 # A # 15MAY2022:11:33:08.000 # 0 # # # ; run;
Here are the codes I wrote.
data temp1;set temp; by id; length var_combined $20.; if first.id then do; ct_one=.; ct_zero=.; end; var_combined=catx(', ', of var:); ct_one=count(var_combined, '1'); ct_zero=count(var_combined, '0'); if (ct_one>0 and ct_zero=0) then outcome=1; if (ct_one=0 and ct_zero>0) then outcome=0; if (ct_one>0 and ct_zero>0) then outcome=1-lag(outcome); run;
Explanations.
Any advice or help would be greatly appreciated! Please let me know if you need additional information. Thank you.
Can you clarify the rules?
You mentioned "previous date", but your data has DATETIME values, not DATE values.
Can you have two observations on the date, but different times of day. In that case how will that impact your rule #2 ?
If you did mean the previous DATETIME value then can two observations have the same DATETIME value? If so then what do you want to do then?
Of is the "DATE" actually only used for sorting and the actual rule is just to toggle from the pervious result?
What is the rule when the FIRST observation for an ID has mixed 1 and 0 values?
Can you provide the output you want for the given input?
If the rule is that when the first observation has mixed values then new variable is missing then this is my interpretation of what you EXPECT.
data temp;
input id $ dttm :datetime. var1-var4 expect reason $40.;
format dttm datetime19.;
datalines;
A 20JAN2021:13:37:39 1 . . . 1 Only one and missing
A 10FEB2021:09:45:46 1 0 0 0 0 Toggle previous
A 04SEP2021:11:17:24 1 . . . 1 Only one and missing
B 09OCT2021:18:01:40 1 0 0 0 . Mixed on first observation
B 10OCT2021:06:15:52 0 . . . 0 Only zero and missing
B 13OCT2021:13:42:22 1 . . . 1 Only one and missing
B 17OCT2021:10:44:29 0 0 . . 0 Only zero and missing
B 11DEC2021:09:54:47 0 1 1 . 1 Toggle previous
B 11DEC2021:09:54:47 1 1 0 0 0 Toggle previous
C 15DEC2021:16:55:44 1 0 0 . . Mixed on first observation
C 15MAY2022:11:33:08 0 . . . 0 Only zero and missing
;
Which you can get by doing something like this:
data temp1;
set temp;
by id;
if whichn(1,of var:) and 0=whichn(0,of var:) then want=1;
else if 0=whichn(1,of var:) and whichn(0,of var:) then want=0;
else if first.id then want=.;
else want = not want;
retain want;
run;
Now if you really did want to toggle the value from the previous DATETIME (and not just the previous observation) then you could retain that "lagged" value instead of the actual value and only update it on the last observation for a DATETIME value.
data temp;
input id $ dttm :datetime. var1-var4 expect reason $40.;
format dttm datetime19.;
datalines;
A 20JAN2021:13:37:39 1 . . . 1 Only one and missing
A 10FEB2021:09:45:46 1 0 0 0 0 Toggle previous
A 04SEP2021:11:17:24 1 . . . 1 Only one and missing
B 09OCT2021:18:01:40 1 0 0 0 . Mixed on first observation
B 10OCT2021:06:15:52 0 . . . 0 Only zero and missing
B 13OCT2021:13:42:22 1 . . . 1 Only one and missing
B 17OCT2021:10:44:29 0 0 . . 0 Only zero and missing
B 11DEC2021:09:54:47 0 1 1 . 1 Toggle previous datetime
B 11DEC2021:09:54:47 1 1 0 0 1 Toggle previous datetime
C 15DEC2021:16:55:44 1 0 0 . . Mixed on first observation
C 15MAY2022:11:33:08 0 . . . 0 Only zero and missing
;
data temp1;
set temp;
by id dttm;
if whichn(1,of var:) and 0=whichn(0,of var:) then want=1;
else if 0=whichn(1,of var:) and whichn(0,of var:) then want=0;
else if first.id then want=.;
else want = not lag_want;
output;
if last.dttm then lag_want=want;
if last.id then lag_want=.;
retain lag_want;
run;
Instead of
if (ct_one>0 and ct_zero>0) then outcome=1-lag(outcome);
you should use
prev_outcome=lag(outcome);
if (ct_one>0 and ct_zero>0) then outcome=1-prev_outcome;
Thank you, @PaigeMiller. I replaced the line with your suggested two lines, and it partially works. The outcome is missing when the first obs has multiple values (id=C, var_combined='1, 0, 0'). So I made the codes into two steps, and it seems working as I want now. It seems like where and when the lag function is executed makes a difference. Please let me know if you see any issues. Below are the codes that I tried.
data temp1;set temp; length var_combined $20.; by id; var_combined=catx(', ', of var:); ct_one=count(var_combined, '1'); ct_zero=count(var_combined, '0'); if (ct_one>0 and ct_zero=0) then outcome=1; if (ct_one=0 and ct_zero>0) then outcome=0; if first.id and (ct_one>0 and ct_zero>0) then outcome=var1; run; data temp2;set temp1; by id; pre_outcome=lag(outcome); if first.id then pre_outcome=.; if not first.id and (ct_one>0 and ct_zero>0) then outcome=1-pre_outcome; run;
The results.
Well, I tested it with the real data, and it doesn't work perfectly. When there are two consecutive observations with multiple values, the second one doesn't have any value. I updated the data set (temp) and then ran the codes again. See the results below. Please advise!
data temp; infile datalines dlm='#'; input id $ dttm :datetime20. var1 var2 var3 var4; format dttm datetime20.; datalines; A # 20JAN2021:13:37:39.000 # 1 # # # A # 10FEB2021:09:45:46.000 # 1 # 0 # 0 # 0 A # 04SEP2021:11:17:24.000 # 1 # # # B # 09OCT2021:18:01:40.000 # 1 # 0 # 0 # 0 B # 10OCT2021:06:15:52.000 # 0 # # # B # 13OCT2021:13:42:22.000 # 1 # # # B # 17OCT2021:10:44:29.000 # 0 # 0 # # B # 11DEC2021:09:54:47.000 # 0 # 1 # 1 # B # 11DEC2021:09:54:47.000 # 1 # 1 # 0 # 0 C # 15DEC2021:16:55:44.000 # 1 # 0 # 0 # C # 15MAY2022:11:33:08.000 # 0 # # # ; run; data temp1;set temp; length var_combined $20.; by id; var_combined=catx(', ', of var:); ct_one=count(var_combined, '1'); ct_zero=count(var_combined, '0'); if (ct_one>0 and ct_zero=0) then outcome=1; if (ct_one=0 and ct_zero>0) then outcome=0; if first.id and (ct_one>0 and ct_zero>0) then outcome=0; /* first obs and multiple values, then assign "sinus" (0) */ run; data temp2;set temp1; by id; pre_outcome=lag(outcome); if first.id then pre_outcome=.; if not first.id and (ct_one>0 and ct_zero>0) then outcome=1-pre_outcome; run;
The results.
LAG() does not look back into the observation read previously from a dataset. Instead it sets up a FIFO queue, and puts a value into it every time it is called. By calling it in a conditional branch, you uncouple the filling of the queue from the progress through the incoming dataset.
LAG cannot be used to calculate a value and also put it into the queue; when LAG is called in your code, outcome will always be missing (not yet set), and that value will end up in the queue for the next call.
Instead, make outcome a RETAINed variable, and initialize it properly at first.id.
This part
if first.id then do;
ct_one=.;
ct_zero=.;
end;
makes no sense, as the variables are not retained, and will be set to missing at the start of each DATA step iteration anyway.
Thank you, @Kurt_Bremser. First, I removed the part you pointed out. Yes, it doesn't make sense (it was a leftover from my previous tries!). Then I added the retain statement. Still, it doesn't work as I want. Here are the codes that I tried.
data temp1;set temp; by id; length var_combined $20.; retain outcome; if first.id then outcome=.; var_combined=catx(', ', of var:); ct_one=count(var_combined, '1'); ct_zero=count(var_combined, '0'); if (ct_one>0 and ct_zero=0) then outcome=1; if (ct_one=0 and ct_zero>0) then outcome=0; if (ct_one>0 and ct_zero>0) then outcome=1-lag(outcome); run;
The results.
Once again, this
if (ct_one>0 and ct_zero>0) then outcome=1-lag(outcome);
won't work. At the moment LAG is called, it puts the current value of outcome (retained from the previous iteration) into the queue, not the one that is calculated.
And you still call LAG conditionally, which will load the queue inconsistently.
Also point out which values of outcome you expect for your example data.
Can you clarify the rules?
You mentioned "previous date", but your data has DATETIME values, not DATE values.
Can you have two observations on the date, but different times of day. In that case how will that impact your rule #2 ?
If you did mean the previous DATETIME value then can two observations have the same DATETIME value? If so then what do you want to do then?
Of is the "DATE" actually only used for sorting and the actual rule is just to toggle from the pervious result?
What is the rule when the FIRST observation for an ID has mixed 1 and 0 values?
Can you provide the output you want for the given input?
If the rule is that when the first observation has mixed values then new variable is missing then this is my interpretation of what you EXPECT.
data temp;
input id $ dttm :datetime. var1-var4 expect reason $40.;
format dttm datetime19.;
datalines;
A 20JAN2021:13:37:39 1 . . . 1 Only one and missing
A 10FEB2021:09:45:46 1 0 0 0 0 Toggle previous
A 04SEP2021:11:17:24 1 . . . 1 Only one and missing
B 09OCT2021:18:01:40 1 0 0 0 . Mixed on first observation
B 10OCT2021:06:15:52 0 . . . 0 Only zero and missing
B 13OCT2021:13:42:22 1 . . . 1 Only one and missing
B 17OCT2021:10:44:29 0 0 . . 0 Only zero and missing
B 11DEC2021:09:54:47 0 1 1 . 1 Toggle previous
B 11DEC2021:09:54:47 1 1 0 0 0 Toggle previous
C 15DEC2021:16:55:44 1 0 0 . . Mixed on first observation
C 15MAY2022:11:33:08 0 . . . 0 Only zero and missing
;
Which you can get by doing something like this:
data temp1;
set temp;
by id;
if whichn(1,of var:) and 0=whichn(0,of var:) then want=1;
else if 0=whichn(1,of var:) and whichn(0,of var:) then want=0;
else if first.id then want=.;
else want = not want;
retain want;
run;
Now if you really did want to toggle the value from the previous DATETIME (and not just the previous observation) then you could retain that "lagged" value instead of the actual value and only update it on the last observation for a DATETIME value.
data temp;
input id $ dttm :datetime. var1-var4 expect reason $40.;
format dttm datetime19.;
datalines;
A 20JAN2021:13:37:39 1 . . . 1 Only one and missing
A 10FEB2021:09:45:46 1 0 0 0 0 Toggle previous
A 04SEP2021:11:17:24 1 . . . 1 Only one and missing
B 09OCT2021:18:01:40 1 0 0 0 . Mixed on first observation
B 10OCT2021:06:15:52 0 . . . 0 Only zero and missing
B 13OCT2021:13:42:22 1 . . . 1 Only one and missing
B 17OCT2021:10:44:29 0 0 . . 0 Only zero and missing
B 11DEC2021:09:54:47 0 1 1 . 1 Toggle previous datetime
B 11DEC2021:09:54:47 1 1 0 0 1 Toggle previous datetime
C 15DEC2021:16:55:44 1 0 0 . . Mixed on first observation
C 15MAY2022:11:33:08 0 . . . 0 Only zero and missing
;
data temp1;
set temp;
by id dttm;
if whichn(1,of var:) and 0=whichn(0,of var:) then want=1;
else if 0=whichn(1,of var:) and whichn(0,of var:) then want=0;
else if first.id then want=.;
else want = not lag_want;
output;
if last.dttm then lag_want=want;
if last.id then lag_want=.;
retain lag_want;
run;
data temp;
infile datalines dlm='#';
input id $ dttm :datetime20. var1 var2 var3 var4;
format dttm datetime20.;
datalines;
A # 20JAN2021:13:37:39.000 # 1 # # #
A # 10FEB2021:09:45:46.000 # 1 # 0 # 0 # 0
A # 04SEP2021:11:17:24.000 # 1 # # #
B # 09OCT2021:18:01:40.000 # 1 # 0 # 0 # 0
B # 10OCT2021:06:15:52.000 # 0 # # #
B # 13OCT2021:13:42:22.000 # 1 # # #
B # 17OCT2021:10:44:29.000 # 0 # 0 # #
B # 11DEC2021:09:54:47.000 # 0 # 1 # 1 #
B # 11DEC2021:09:54:47.000 # 1 # 1 # 0 # 0
C # 15DEC2021:16:55:44.000 # 1 # 0 # 0 #
C # 15MAY2022:11:33:08.000 # 0 # # #
;
run;
data temp1; set temp;
by id dttm;
if whichn(1, of var:) and 0=whichn(0, of var:) then var=1;
else if 0=whichn(1, of var:) and whichn(0, of var:) then var=0;
else if first.id then var=0;
else var= not lag_var;
output;
if last.dttm then lag_var=var;
if last.id then lag_var=.;
retain lag_var;
run;
The results (which I wanted).
Don't call the new variable VAR and then use OF VAR: in the code because the NEW variable will be in included in that list (at least in the ELSE clause after the data step compiler has had a chance to see that VAR exists).
Either use VAR1-VAR4 or call the variable something that does not start with VAR.
Also don't indent lines of data. It will just confuse you (and SAS) about where the data starts on the line. And if you have accidentally turned off the editor option to replace tabs with spaces you might end up with actual TAB characters in your data if you run the code without using Display Manager.
Thank you for the tips, @Tom. It actually solved the problem I had with the real data. My codes didn't work perfectly when I used the real data, and not using the same prefix (var) solved the problem!
The final codes.
data temp1; set temp;
by id dttm;
if whichn(1, of var:) and 0=whichn(0, of var:) then outcome=1;
else if 0=whichn(1, of var:) and whichn(0, of var:) then outcome=0;
else if first.id then outcome=0;
else outcome= not lag_outcome;
output;
if last.dttm then lag_outcome=outcome;
if last.id then lag_outcome=.;
retain lag_outcome;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.