Hi All, I have been fighting with this data file..
Problem : There are Visits and Medication number associated with PatientID. For the missing medication, there will be reCalc flag of1. So, what I need to do here is whenever there is 1 I will have to identify the just previous visit(only) for the particular PatientID and need to find how many medications were given in the previous visit(It could be 1 or 2 and N number of medications given during the visits as per the plan). Once found this as well, need to add those medications in the current visit which is flagged as 1.
PatientID | Visit | Medication | reCalc |
1 | Visit1 | 111111 | |
1 | Visit2 | 222222 | |
1 | Visit2 | 33333 | |
1 | Visit3 | 1 | |
1 | Visit4 | 888888 | |
……. | |||
12 | Visit1 | 34343 | |
12 | Visit2 | 12345 | |
12 | Visit2 | 67891 | |
12 | Visit2 | 54321 | |
12 | Visit4 | 1 | |
12 | Visit5 | 43333 | |
12 | Visit6 | 1 |
below is the expected output,
PatientID | Visit | Medication | reCalc |
1 | Visit1 | 111111 | |
1 | Visit2 | 222222 | |
1 | Visit2 | 33333 | |
1 | Visit3 | 222222 | |
1 | Visit3 | 33333 | 1 |
1 | Visit4 | 888888 | |
……. | |||
12 | Visit1 | 34343 | |
12 | Visit2 | 12345 | |
12 | Visit2 | 67891 | |
12 | Visit2 | 54321 | |
12 | Visit4 | 12345 | |
12 | Visit4 | 67891 | |
12 | Visit4 | 54321 | 1 |
12 | Visit5 | 43333 | |
12 | Visit6 | 43333 | 1 |
Please please look into this.
So now the business rule needs to be expanded to accommodate one or more visit records with no medications falling between a complete visit and the recalc-visit. So using the LAG function based on the event of first.visit=1 no longer is the rule. It would have to be based on a more complex rule. And using _N_-1 as the end of the prior visit is no longer accurate, meaning a separate conditional lag would be needed.
In this case, I'd drop the use of lag and just set _prior_visit_start and _prior_visit_end variables whenever there is a real visit in hand with real medications:
data have;
infile cards truncover;
input PatientID Visit $ Medication reCalc;
cards;
1 Visit1 111111
1 Visit2 222222
1 Visit2 33333
1 Visit3 . 1
1 Visit4 888888
12 Visit1 34343
12 Visit2 12345
12 Visit2 67891
12 Visit2 54321
12 Visit4 . 1
12 Visit5 43333
12 Visit6 . 1
13 Visit1 454545
13 Visit1 778788
13 Visit2 .
13 Visit3 .
13 Visit4 .
13 Visit5 .
13 Visit6 .
13 Visit7 . 1
14 Visit1 545450
14 Visit1 3434454
14 Visit2 . 1
14 Visit3 . 1
;
data want (drop=_:);
set have;
by patientid visit;
retain _prior_visit_start _prior_visit_end;
if first.visit and medication^=. then _prior_visit_start=_n_;
if last.visit and medication^=. then _prior_visit_end=_n_;
if recalc^=1 then output;
else do ptr=_prior_visit_start to _prior_visit_end;
set have (keep=medication) point=ptr;
output;
end;
run;
Remember, this program assumes that each ID has at least one "real" visit sometime before any visits with recalc=1. If that were not the case, then a recalc record from one id could end up with visit data from the preceding id.
Editted explanatory note:
How this works:
The program reads the incoming data, expecting it to be sorted by patientid/visit (the BY statement). When the first record for a visit is found (first.visit=1) which is also a record with a valid medication (medication^=.) then you are at the start of a visit whose data may be needed later on. So for that record save the record's position in the entire dataset (_prior_visit_start=_N_). Now _N_ is not always the actual position in the dataset, but it is in this simple data step. Do the same for the last record for any such visit (_prior_visit_end=_N_).
The two variables used to record the record position are "retained", so SAS will not default to automatically resetting them to missing values in each iteration of the data step. Which makes them available to guide the re-reading of data when needed.
Now if the incoming record doesn't have recalc=1, there is no need to fetch data from a prior visit, so just output.
But if recalc=1 then re-read each record ("do ptr=_prior_visit_start to _prior_visit_end") in the most recent visit that had actual medications, and re-output each of them. This is done by using another SET statement with the "point=ptr" option. But also note this extra SET statement only reads the medication variable ("(keep=medication)"), so it doesn't overwrite the current visit number with the prior visit number, .. or any other variables that might be in the data.
data have;
infile cards truncover;
input PatientID Visit $ Medication reCalc;
cards;
1 Visit1 111111
1 Visit2 222222
1 Visit2 33333
1 Visit3 . 1
1 Visit4 888888
12 Visit1 34343
12 Visit2 12345
12 Visit2 67891
12 Visit2 54321
12 Visit4 . 1
12 Visit5 43333
12 Visit6 . 1
;
data want;
set have;
by patientid visit;
retain t;
if last.visit and Medication then t=Medication;
if recalc then Medication=t;
run;
If the current visit is a single record with recalc=1, you want to repeat the entire medication sequence for the preceding visit by the same patient. So if the preceding visit had 3 medication records, you want to replicate them with the current visit.
This is one of those times, when the mistakenly-abhorred use of a lag function in the IF statement is just the ticket:
data have;
infile cards truncover;
input PatientID Visit $ Medication reCalc;
cards;
1 Visit1 111111
1 Visit2 222222
1 Visit2 33333
1 Visit3 . 1
1 Visit4 888888
12 Visit1 34343
12 Visit2 12345
12 Visit2 67891
12 Visit2 54321
12 Visit4 . 1
12 Visit5 43333
12 Visit6 . 1
;
data want (drop=_:);
set have;
by patientid visit;
retain _prior_visit_start;
if first.visit then _prior_visit_start=lag(_n_);
if recalc^=1 then output;
else do ptr=_prior_visit_start to _n_-1;
set have (keep=medication) point=ptr;
output;
end;
run;
Assumptions:
Back to the subject of a lag function inside an IF statement. What you want is to know where in dataset HAVE is the beginning of the prior visit. The "if first.visit" test says to do an action only at the start of the visit. And that action is the lag function, which really should be thought of as an "update queue" function. So instead of looking back one observation, the lag here is looking back one "instance" - namely the prior time first.visit=1. Thus one keeps track of the observation number starting the prior visit.
And of course, the end of the prior visit is simply _N_-1.
Also whenever recalc=1 just re-read only the medication sequence for the prior group outputting once per medication. This is done via the "point=" option on the set statement. Otherwise, if recalc^=1 simply output.
The only way this deviates from your sample want, is that it has recalc=1 in every one of the generated records for the recalc-visit, not just the last one. That would not be hard to fix, but would detract from the central point.
Thanks a lot for all your effort. I added your code snippet in my code.
We are almost there,,,, this code is taking most part of my datafile Except that, the below scenario has to be covered,
data have;
infile cards truncover;
input PatientID Visit $ Medication reCalc;
cards;
1 Visit1 111111
1 Visit2 222222
1 Visit2 33333
1 Visit3 . 1
1 Visit4 888888
12 Visit1 34343
12 Visit2 .
12 Visit2 .
12 Visit2 .
12 Visit4 . 1
12 Visit5 43333
12 Visit6 . 1
;
Please try this. When there are many null values, it is nit considering. Please help me with this as well.
I have told you what I understand the task to be - as a specific "business rule".
You have shown additional input data that my understanding does not address. That's find, but you have NOT shown what you want the output to look like. I am not a good mind reader. Please show (1) what the program produced, (2) what you want it to produce, and (3) a description of the actual business rule you are attempting to implement.
I appreciate you for your sincere effort on this tricky problem. The solution you have provided was covering my code for Patient 1 and 12. But not for 13 and 14. I have added the scenario here for reference with the desire output. Please have a look.
data have;
infile cards truncover;
input PatientID Visit $ Medication reCalc;
cards;
1 Visit1 111111
1 Visit2 222222
1 Visit2 33333
1 Visit3 . 1
1 Visit4 888888
12 Visit1 34343
12 Visit2 12345
12 Visit2 67891
12 Visit2 54321
12 Visit4 . 1
12 Visit5 43333
12 Visit6 . 1
13 Visit1 454545
13 Visit1 778788
13 Visit2 .
13 Visit3 .
13 Visit4 .
13 Visit5 .
13 Visit6 .
13 Visit7 . 1
14 Visit1 545450
14 Visit1 3434454
14 Visit2 . 1
14 Visit3 . 1
;
Run;
Desire Output :
PatientID | Visit | Medication | reCalc |
1 | Visit1 | 111111 | |
1 | Visit2 | 222222 | |
1 | Visit2 | 33333 | |
1 | Visit3 | 222222 | 1 |
1 | Visit3 | 33333 | 1 |
1 | Visit4 | 888888 | |
12 | Visit1 | 34343 | |
12 | Visit2 | 12345 | |
12 12 | Visit2 | 67891 | |
12 | Visit2 | 54321 | |
12 | Visit4 | 12345 | 1 |
12 | Visit4 | 67891 | 1 |
12 | Visit4 | 54321 | 1 |
12 12 | Visit5 | 43333 | |
12 | Visit6 | 43333 | 1 |
13 | Visit1 | 454545 | |
13 | Visit1 | 778788 | |
13 | Visit2 | ||
13 | Visit3 | ||
13 | Visit4 | ||
13 | Visit5 | ||
13 | Visit6 | ||
13 | Visit7 | 454545 | 1 |
13 | Visit7 | 778788 | 1 |
14 | Visit1 | 545450 | |
14 | Visit1 | 3434454 | |
14 | Visit2 | 545450 | 1 |
14 | Visit2 | 3434454 | |
14 | Visit3 | 3434454 | 1 |
14 | Visit2 | 3434454 |
So now the business rule needs to be expanded to accommodate one or more visit records with no medications falling between a complete visit and the recalc-visit. So using the LAG function based on the event of first.visit=1 no longer is the rule. It would have to be based on a more complex rule. And using _N_-1 as the end of the prior visit is no longer accurate, meaning a separate conditional lag would be needed.
In this case, I'd drop the use of lag and just set _prior_visit_start and _prior_visit_end variables whenever there is a real visit in hand with real medications:
data have;
infile cards truncover;
input PatientID Visit $ Medication reCalc;
cards;
1 Visit1 111111
1 Visit2 222222
1 Visit2 33333
1 Visit3 . 1
1 Visit4 888888
12 Visit1 34343
12 Visit2 12345
12 Visit2 67891
12 Visit2 54321
12 Visit4 . 1
12 Visit5 43333
12 Visit6 . 1
13 Visit1 454545
13 Visit1 778788
13 Visit2 .
13 Visit3 .
13 Visit4 .
13 Visit5 .
13 Visit6 .
13 Visit7 . 1
14 Visit1 545450
14 Visit1 3434454
14 Visit2 . 1
14 Visit3 . 1
;
data want (drop=_:);
set have;
by patientid visit;
retain _prior_visit_start _prior_visit_end;
if first.visit and medication^=. then _prior_visit_start=_n_;
if last.visit and medication^=. then _prior_visit_end=_n_;
if recalc^=1 then output;
else do ptr=_prior_visit_start to _prior_visit_end;
set have (keep=medication) point=ptr;
output;
end;
run;
Remember, this program assumes that each ID has at least one "real" visit sometime before any visits with recalc=1. If that were not the case, then a recalc record from one id could end up with visit data from the preceding id.
Editted explanatory note:
How this works:
The program reads the incoming data, expecting it to be sorted by patientid/visit (the BY statement). When the first record for a visit is found (first.visit=1) which is also a record with a valid medication (medication^=.) then you are at the start of a visit whose data may be needed later on. So for that record save the record's position in the entire dataset (_prior_visit_start=_N_). Now _N_ is not always the actual position in the dataset, but it is in this simple data step. Do the same for the last record for any such visit (_prior_visit_end=_N_).
The two variables used to record the record position are "retained", so SAS will not default to automatically resetting them to missing values in each iteration of the data step. Which makes them available to guide the re-reading of data when needed.
Now if the incoming record doesn't have recalc=1, there is no need to fetch data from a prior visit, so just output.
But if recalc=1 then re-read each record ("do ptr=_prior_visit_start to _prior_visit_end") in the most recent visit that had actual medications, and re-output each of them. This is done by using another SET statement with the "point=ptr" option. But also note this extra SET statement only reads the medication variable ("(keep=medication)"), so it doesn't overwrite the current visit number with the prior visit number, .. or any other variables that might be in the data.
Perfect!!!
The code you shared works in 100 percent perfection.
And I had accepted this as Solution.
Could you please help me understand how does this code work. I just tried your code blindly. And worked. But, did not understand the logicality of it. I am really not as techy as you.
Kudos to you technical skill.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.