For data processing on the SAS side, DATA step BY-group processing is a very valid approach. If your data resides in a database, then pushing the processing to the database via SQL (possibly DB-native SQL) would be another option.
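For the database route, below is a sketch of what explicit pass-through with DB-native SQL could look like. It assumes a DBMS that supports LISTAGG (e.g. Oracle); the libref MYDBLIB, the alias DB, and the table name HAVE_DB are illustrative placeholders, not from your environment. Like the SAS code below, it builds the group from the first occurrence of each drug per patient:

/* Sketch only: assumes a DBMS with LISTAGG support (e.g. Oracle).      */
/* Connection libref, alias, and the table name HAVE_DB are placeholders. */
proc sql;
   connect using mydblib as db;   /* MYDBLIB: a pre-assigned DB libref */
   create table want_sql as
      select * from connection to db
         ( select t.patient, t.time, t.drug, g.drug_group
             from have_db t
             join ( select patient,
                           listagg(drug, '|')
                              within group (order by first_time) as drug_group
                      from ( select patient, drug, min(time) as first_time
                               from have_db
                              group by patient, drug )
                     group by patient ) g
               on t.patient = g.patient
         );
   disconnect from db;
quit;

Note that GROUP is a reserved word in most SQL dialects, hence the name DRUG_GROUP here. Also, unlike the FINDW-based SAS logic further below, this comparison is case-sensitive unless your DB collation says otherwise.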
Below is code for processing on the SAS side.
I've amended your sample data a bit with cases for repeated drugs at different time points. The logic for the group variable currently takes only the first occurrence of a drug per patient. If you need something else, then please tell us and also provide sample data that contains such cases.
It's also always appreciated if you provide sample data via working SAS code, as done below:
data have;
   infile datalines truncover dsd dlm=',';
   input patient time drug :$8.;
   datalines;
1,1,A
1,2,B
2,1,B
2,2,A
2,3,B
3,1,AB
3,2,AB
3,3,BA
3,4,B
3,5,A B
3,5,b
;
run;
However you implement it, you will in any case need to read the data twice: once to construct the group variable and once to add the group to all the rows of your source data.
Only sort the data if you can't be sure it's already sorted. If it might already be sorted but you're not certain, you can use the PRESORTED option with PROC SORT. Note that PRESORTED does a prescan (read) of the data, so it's only beneficial when the data is in fact already sorted; if it's not, the option adds processing time.
proc sort data=have presorted;
   by patient time;
run;
About your Best Approach question:
I'd say that depends on your needs. Coding for performance often means more complicated and thus harder-to-maintain code. In my mind it's about striking a balance and choosing the right approach for the specific situation. If you want to take performance tuning to the extreme, then you also need to take your environment into consideration: throughput to disk for both WORK and permanent tables, available memory, data volume to process, etc.
Below are two coding options.
Option 1: Simple code, but not the best performance; creates an intermediary table.
/* create and populate group variable per patient */
data inter(keep=patient group);
   set have;
   by patient time;

   /* declare variables */
   length group $20;
   retain group;

   /* first time seeing a patient: initialize group */
   if first.patient then group = drug;
   /* add new drug if not already in group */
   else if findw(group, drug, '|', 'it') = 0 then group = catx('|', group, drug);

   if last.patient then output;
run;
/* Repopulate full group value across all rows for each patient */
data want;
   merge have inter;
   by patient;
run;
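For reference, these are the group values the merge should produce with the sample data above (derived by stepping through the logic; please verify against an actual run):

proc print data=want noobs;
run;
/* Expected group per patient:
   patient 1: A|B
   patient 2: B|A
   patient 3: AB|BA|B|A B   (the final b is skipped because FINDW with
                             the 'i' modifier matches case-insensitively;
                             "A B" is kept as a distinct value)          */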
Option 2: Performance-tuned code with processing in a single DATA step; avoids writing an intermediary table.
Uses a "double DOW": the first DOW loop populates the group variable; the second DOW loop writes all rows to the target table with the populated group variable included. This approach also requires the source data to be sorted by patient and time. There are multiple papers discussing the DOW-loop technique. Here is one of them: Practical Uses of the DOW Loop
data want2(drop=_:);
   dcl hash h1();
   h1.defineKey('_drug');
   h1.defineDone();

   do _n_=1 by 1 until(last.patient);
      set have;
      by patient;

      /* declare variables */
      length group $20;
      _drug = upcase(drug);

      /* first time seeing a patient: empty the hash lookup table and start populating group */
      if first.patient then
         do;
            _rc = h1.clear();
            group = _drug;
            _rc = h1.add();
         end;
      /* for the same patient: add new drugs to group */
      else if h1.check() ne 0 then
         do;
            group = catx('|', group, _drug);
            _rc = h1.add();
         end;
   end;

   do _n_=1 by 1 until(last.patient);
      set have;
      by patient;
      output;
   end;
run;
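To check that both options return the same result, you can run a quick comparison. One caveat: Option 2 upcases the drug values it stores in GROUP, so for mixed-case source data the two results can differ in letter case; with the sample data above they happen to match.

proc compare base=want compare=want2;
run;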