I have a dataset that looks like this:
Case_ind | Dateofvisit | Personid |
0 | 3/20/2016 | 1 |
0 | 3/24/2016 | 1 |
1 | 4/3/2016 | 1 |
0 | 4/3/2016 | 1 |
0 | 4/4/2016 | 1 |
0 | 4/20/2016 | 1 |
0 | 6/20/2017 | 1 |
0 | 6/31/2017 | 1 |
0 | 7/5/2017 | 1 |
0 | 7/9/2017 | 1 |
1 | 7/10/2017 | 1 |
0 | 7/11/2017 | 1 |
0 | 7/12/2017 | 1 |
0 | 7/20/2017 | 1 |
0 | 1/16/2016 | 2 |
0 | 1/16/2016 | 2 |
1 | 1/17/2016 | 2 |
0 | 1/18/2016 | 2 |
0 | 1/25/2016 | 2 |
1 | 5/9/2016 | 3 |
0 | 9/16/2016 | 3 |
0 | 9/18/2016 | 3 |
1 | 9/20/2016 | 3 |
0 | 9/27/2016 | 3 |
It is a dataset that has doctor visits 3 weeks before and 3 weeks after for a person that was diagnosed (case_ind=1) for a condition at the doctor's office. As you can see, this could happen multiple times for one person. I want to group the visits into *visit before diagnosis* and *visit after diagnosis* so I can look into what happened in a health care setting before and after that person was diagnosed.
This would presumably look like this:
Case_ind | Dateofvisit | Personid | Place |
0 | 3/20/2016 | 1 | 1 |
0 | 3/24/2016 | 1 | 1 |
1 | 4/3/2016 | 1 | 0 |
0 | 4/3/2016 | 1 | 2 |
0 | 4/4/2016 | 1 | 2 |
0 | 4/20/2016 | 1 | 2 |
0 | 6/20/2017 | 1 | 1 |
0 | 6/31/2017 | 1 | 1 |
0 | 7/5/2017 | 1 | 1 |
0 | 7/9/2017 | 1 | 1 |
1 | 7/10/2017 | 1 | 0 |
0 | 7/11/2017 | 1 | 2 |
0 | 7/12/2017 | 1 | 2 |
0 | 7/20/2017 | 1 | 2 |
0 | 1/16/2016 | 2 | 1 |
0 | 1/16/2016 | 2 | 1 |
1 | 1/17/2016 | 2 | 0 |
0 | 1/18/2016 | 2 | 2 |
0 | 1/25/2016 | 2 | 2 |
1 | 5/9/2016 | 3 | 0 |
0 | 9/16/2016 | 3 | 1 |
0 | 9/18/2016 | 3 | 1 |
1 | 9/20/2016 | 3 | 0 |
0 | 9/27/2016 | 3 | 2 |
I created the following code after messing around for awhile:
data trial_final_data;
set pre_final_data;
by personid ;
if first.personid then place=1;
else if dif(visitdate)>7 then place=1;
if case_ind>0 then place+1;
run;
data trial_final_data1;
set trial_final_data;
if case_ind>0 then place=0;
run;
The code almost worked but there are random errors after the first data step. PersonID 3 would have something that looked like this:
1 | 5/9/2016 | 3 | 2 |
0 | 9/16/2016 | 3 | 2 |
0 | 9/18/2016 | 3 | 2 |
1 | 9/20/2016 | 3 | 3 |
0 | 9/27/2016 | 3 | 3 |
You'll notive that the 2nd observation should be a 1 for the place...but it is a 2. This only happened with like 7 out of 15k people.
Does anyone have an idea why this is going on or another approach to get the end result I would like?
Thank you so much!
The LAG() and DIF() functions do NOT reference the previous observation. They reference the stack of previous values that are built when the function executes. Having the DIF() function call only in the ELSE clause of an IF/THEN/ELSE statement is going to cause your stack of previous values to be messed the **** up.
I good way to avoid this is save the value into a variable that you can then conditionally use.
data trial_final_data;
set pre_final_data;
by personid ;
dif_visitdate=dif(visitdate);
if first.personid then place=1;
else if dif_visitdate >7 then place=1;
if case_ind>0 then place+1;
run;
But for this problem you can just change the order of your tests so that the DIF() always runs.
Are you sure you don't want the last statement to use an ELSE so that if the first record for a person has CASE_IND> 0 it doesn't start with PLACE=2?
data trial_final_data;
set pre_final_data;
by personid ;
if dif(visitdate) >7 then place=1;
else if first.personid then place=1;
else if case_ind>0 then place+1;
run;
The LAG() and DIF() functions do NOT reference the previous observation. They reference the stack of previous values that are built when the function executes. Having the DIF() function call only in the ELSE clause of an IF/THEN/ELSE statement is going to cause your stack of previous values to be messed the **** up.
I good way to avoid this is save the value into a variable that you can then conditionally use.
data trial_final_data;
set pre_final_data;
by personid ;
dif_visitdate=dif(visitdate);
if first.personid then place=1;
else if dif_visitdate >7 then place=1;
if case_ind>0 then place+1;
run;
But for this problem you can just change the order of your tests so that the DIF() always runs.
Are you sure you don't want the last statement to use an ELSE so that if the first record for a person has CASE_IND> 0 it doesn't start with PLACE=2?
data trial_final_data;
set pre_final_data;
by personid ;
if dif(visitdate) >7 then place=1;
else if first.personid then place=1;
else if case_ind>0 then place+1;
run;
Ahhhh that explains a lot! Thank you so much!
Do you know of any resources that could help me understand how SAS processes the lag/dif functions? Thanks!
From the online help for the function:
The queue for each occurrence of LAGn is initialized with n missing values, where n is the length of the queue (for example, a LAG2 queue is initialized with two missing values). When an occurrence of LAGn is executed, the value at the top of its queue is removed and returned, the remaining values are shifted upward, and the new value of the argument is placed at the bottom of the queue. Hence, missing values are returned for the first n executions of each occurrence of LAGn, after which the lagged values of the argument begin to appear.
The key part is the "EACH OCCURENCE". When you have an If Thisvar>5 then Value=Lag3(othervar); the Occurence is not every record only the record(s) where ThisVar is > 5.
Please see
data example; do outer=1 to 5; do inner= 3 to 7; output; end; end; run; data result; set example; if inner> 5 then do; newval=lag3(outer); other = lag2(inner); end; run;
Tracing down why "other" = inner and not the value two before will be educational. It has to do that the stack is only updated when Inner is > 5, so 6 and 7 are the only values the lag2(inner) sees.
And the first non-missing value of newval = 1 because the lag stack is 3 deep and the first time the stack has a non-missing value for Lag3 is looking at the outer=1 and inner=6.
You might expect to get a value for inner=4,5,6,7
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.