Hello Community,
I hope you're doing well. I have a dataset with three key variables: patient, time, and drug. My goal is to create a new variable (group) that reflects the sequence of drugs each patient receives, based on their prescription time.
For instance, here’s a small sample of the data:
patient | time | drug |
1 | 1 | A |
1 | 2 | B |
2 | 1 | B |
2 | 2 | A |
In this example:
Patient 1 received drug A at time 1 and drug B at time 2, so their group should be "AB".
Patient 2 received drug B at time 1 and drug A at time 2, so their group should be "BA".
I need to accumulate a string for each patient based on the drug sequence, where the order of the drugs is determined by the time variable.
What I’ve Tried So Far:
I initially thought of using a BY-group processing approach in a DATA step to accumulate the drug names for each patient, but I'm running into some challenges with the concatenation and with ensuring the drugs are ordered correctly.
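For reference, here is a rough sketch of the kind of DATA step I have been experimenting with (the dataset name have is just illustrative, and it assumes the data are already sorted by patient and time):
data attempt;
set have;
by patient time;
length group $20;
retain group;
/* start a new sequence for each patient */
if first.patient then group = drug;
else group = cats(group, drug);
/* group only holds the complete sequence on a patient's last record here,
   and I am not sure how to copy it back onto every row */
run;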
Solution Requirements:
Sort by patient and time: To ensure the drug sequence is accumulated in the correct order.
Concatenate drug names: For each patient, accumulate the drug sequence into a new variable called group.
Handle each patient separately: The accumulated group should be assigned to each patient, but only once per patient (after all their records are processed).
Expected Output:
For each patient, the output should be:
patient | time | drug | group |
1 | 1 | A | AB |
1 | 2 | B | AB |
2 | 1 | B | BA |
2 | 2 | A | BA |
Seeking Guidance:
Best Approach: What's the most efficient way to handle this type of problem? Should I continue with a BY-group processing approach, or is there another technique that would work better (such as using PROC SQL or another method)?
Challenges with Concatenation: How can I ensure that the drug names are concatenated in the right order for each patient, even if the data might not always come sorted by time?
Optimisation Tips: If there’s a more optimised solution to handle larger datasets, I’d appreciate any pointers!
Conclusion:
This seems like a common issue where you need to group data by patient and time, but I’m hoping to make the solution as efficient as possible. Any advice or solutions from the community would be greatly appreciated!
For further context, I am also exploring tools for analysing healthcare data, which might help with more advanced drug sequence analysis. If anyone has experience combining SAS with such tools, I would love to hear your thoughts!
Thank you in advance!
For data processing on the SAS side, DATA step BY-group processing is a very valid approach. If your data resides in a database, then pushing the processing to the database using SQL (possibly DB-native SQL) would be another option.
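If you go the database route, most DBMS offer an ordered string-aggregation function that builds the sequence in one pass. Below is only a rough sketch using explicit pass-through: it assumes an Oracle-style LISTAGG, a pre-assigned database libref mydb, and a DB table named have, so adjust names and syntax to your environment. It also concatenates every row per patient (repeated drugs included), and the result column is called drug_seq because GROUP is a reserved word in SQL.
proc sql;
connect using mydb; /* mydb = pre-assigned database libref (placeholder) */
create table grp as
select * from connection to mydb
( select patient,
         listagg(drug, '|') within group (order by time) as drug_seq
    from have
   group by patient );
quit;
You would still need to join grp back onto the detail rows to get the value onto every record, as done in the SAS options below.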
Below is code for processing on the SAS side.
I've amended your sample data a bit with cases of repeated drugs at different time points. The logic for the group variable currently takes only the first occurrence of a drug per patient. If you need something else, then you need to tell us and also provide sample data that contains such cases.
It's also always appreciated if you provide sample data as working SAS code, as done below:
data have;
infile datalines truncover dsd dlm=',';
input patient time drug :$8.;
datalines;
1,1,A
1,2,B
2,1,B
2,2,A
2,3,B
3,1,AB
3,2,AB
3,3,BA
3,4,B
3,5,A B
3,5,b
;
run;
However you implement it, you will in any case need to read the data twice: once for constructing the group variable and once for adding the group to all the rows of your source data.
Only sort the data if you can't be sure it's already sorted. If it could be sorted but you're not sure, then you can use the PRESORTED option with PROC SORT - but only if the data could already be sorted, because PRESORTED does a prescan (read) of the data, so it's only beneficial in cases where the data actually is already sorted. If it's not already sorted, then the option will add processing time.
proc sort data=have presorted;
by patient time;
run;
About your Best Approach question:
I'd say that depends on your needs. Coding for performance often means more complicated and therefore harder-to-maintain code. In my mind it's often more about keeping the balance and choosing the right approach for a specific situation.
If you want to take performance tuning to the "extreme", then you also need to take your environment into consideration - throughput to disk for both WORK and permanent tables, available memory, data volume to process, etc.
Below are two coding options.
Option 1: Keeps the code simple, but not the best performance. Creates an intermediary table.
/* create and populate group variable per patient */
data inter(keep=patient group);
set have;
by patient time;
/* Declare variables */
length group $20;
retain group;
/* First time seeing a patient — initialize group */
if first.patient then group = drug;
/* add new drug if not already in group */
else
if findw(group,drug,'|','it')=0 then group = catx('|', group, drug);
if last.patient then output;
run;
/* Repopulate full group value across all rows for each patient */
data want;
merge have inter;
by patient;
run;
Option 2: Performance tuned code with processing in a single data step. Avoids writing to an intermediary table.
Uses a "double DOW": the first DOW loop populates the group variable, the second DOW loop writes all rows to the target table with the populated group variable included. This approach also requires the source data to be sorted by patient and time.
There are multiple papers discussing the DOW loop technique. Here one of them: Practical Uses of the DOW Loop
data want2(drop=_:);
dcl hash h1();
h1.defineKey('_drug');
h1.defineDone();
do _n_=1 by 1 until(last.patient);
set have;
by patient;
/* Declare variables */
length group $20;
_drug=upcase(drug);
/* First time seeing a patient — empty hash lookup table and start populating group */
if first.patient then
do;
_rc=h1.clear();
group=_drug;
_rc=h1.add();
end;
else
/* for same patient: add new drugs to group */
if h1.check() ne 0 then
do;
group = catx('|', group, _drug);
_rc=h1.add();
end;
end;
do _n_=1 by 1 until(last.patient);
set have;
by patient;
output;
end;
run;
That would be easy if there were no duplicated drugs within a patient. Otherwise, you need PROC SORT with the NODUPKEY option to eliminate these duplicated drugs.
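As a rough sketch of that de-duplication (the dataset name have_first is just my placeholder): assuming have is already sorted by patient and time, the first sort keeps only the earliest record per patient/drug combination, and the second restores the time order; you would then point the DATA step below at have_first instead of have.
/* keep the first (earliest-time) record per patient/drug */
proc sort data=have out=have_first nodupkey;
by patient drug;
run;
/* restore the time order for the accumulation step */
proc sort data=have_first;
by patient time;
run;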
data have;
infile datalines truncover dsd dlm=',';
input patient time drug :$8.;
datalines;
1,1,A
1,2,B
2,1,B
2,2,A
;
data want;
/* first DOW loop: accumulate the drug sequence across one patient's rows */
do until(last.patient);
set have;
by patient;
length group $ 80;
group=cats(group,drug);
end;
/* second DOW loop: re-read the same patient's rows and output each one
   with the completed group value */
do until(last.patient);
set have;
by patient;
output;
end;
run;
Hi @ArdigenSupport, I copied my answer to @pink_poodle's question from the other post here. The basic idea is: (1) first use the lag() function or proc transpose, and then the cat() function, to create the patient group column, and then (2) use an SQL join to produce the final table. And BTW, I think the lag() function and proc transpose are suitable techniques for dealing with data from this kind of typical crossover or repeated-measures design.
(1) use lag() function
data have1;
input patient time drug $;
datalines;
1 1 A
1 2 B
2 1 B
2 2 A
;
run;
proc print data=have1;run;
data want1;
set have1;
by patient;
drugprev1=lag1(drug);
if first.patient then drugprev1=' ';
if last.patient then group=cat(compress(drugprev1),
compress(drug));
run;
proc print data=want1;run;
proc sql;
create table patientgrp as
select distinct patient,group
from want1
where group is not null;
select * from patientgrp;
quit;
proc sql;
create table wantfinal as
select w.patient,w.time,pg.group
from want1 as w left join
patientgrp as pg
on w.patient=pg.patient;
select * from wantfinal;
quit;
(2) use proc transpose
data have;
input patient time drug $;
datalines;
1 1 A
1 2 B
1 3 B
2 1 B
2 2 A
;
run;
proc print data=have;run;
data have1;
set have;
timechar=cat('t',put(time,1.));
run;
proc print data=have1;run;
proc transpose data=have1 out=want;
by patient;
id timechar;
var drug;
run;
proc print data=want;run;
data want1;
set want;
group=compress(cat(t1,t2,t3));
run;
proc print data=want1;run;
proc sql;
select h.*,
w1.group
from have as h left join
want1 w1
on h.patient=w1.patient
order by h.patient,h.time;
quit;