BookmarkSubscribeRSS Feed
ArdigenSupport
Fluorite | Level 6

Hello  Community,

I hope you're doing well. I have a dataset with three key variables: patient, time, and drug. My goal is to create a new variable (group) that reflects the sequence of drugs that each patient receives based on their prescription time.

For instance, here’s a small sample of the data:

patient time drug
11A
12B
21B
22A
 

In this example:

  • Patient 1 received drug A at time 1 and drug B at time 2, so their group should be "AB".

  • Patient 2 received drug B at time 1 and drug A at time 2, so their group should be "BA".

I need to accumulate a sub-string for each patient based on the drug sequence, where the order of the drugs is determined by the time variable.

What I’ve Tried So Far:

I initially thought of using a BY group processing approach in a DATA step to accumulate the drug names for each patient, but I’m running into some challenges with concatenation and ensuring the drugs are ordered correctly.

Solution Requirements:

  1. Sort by patient and time: To ensure the drug sequence is accumulated in the correct order.

  2. Concatenate drug names: For each patient, accumulate the drug sequence into a new variable called group.

  3. Handle each patient separately: The accumulated group should be assigned to each patient, but only once per patient (after all their records are processed).

Expected Output:

For each patient, the output should be:

patient time drug group
11AAB
12BAB
21BBA
22ABA
 

Seeking Guidance:

  1. Best Approach: What’s the most efficient way to  handle this type of problem? Should I continue with a BY group processing approach, or is there another technique that would work better (such as using PROC SQL or another method)?

  2. Challenges with Concatenation: How can I ensure that the drug names are concatenated in the right order for each patient, even if the data might not always come sorted by time?

  3. Optimisation Tips: If there’s a more optimised solution to handle larger datasets, I’d appreciate any pointers!

Conclusion:

This seems like a common issue where you need to group data by patient and time, but I’m hoping to make the solution as efficient as possible. Any advice or solutions from the community would be greatly appreciated!

For further context,

I am also exploring  to analysing healthcare data, which might help with more advanced drug sequence analysis. If anyone has experience combining with tools, I would love to hear your thoughts!

Thank you in advance!

 

 

3 REPLIES 3
Patrick
Opal | Level 21

For data processing on the SAS side using data step by group processing is a very valid approach. If your data resides in a database then pushing processing to the database using SQL (eventually DB native SQL) would be another option.

Below code for processing on the SAS side.

 

I've amended your sample data a bit with cases for repeated drugs at different time points. The logic for the group var takes currently only the first occurrence of a drug per patient. If you need something else then you need to tell us and also provide sample data that contains such cases.

 

It's also always appreciated if you provide sample data already via working SAS code as done below:

data have;
  infile datalines truncover dsd dlm=',';
  input patient time drug :$8.;
  datalines;
1,1,A
1,2,B
2,1,B
2,2,A
2,3,B
3,1,AB
3,2,AB
3,3,BA
3,4,B
3,5,A B
3,5,b
;
run;

However you implement, you will in any case need to read the data twice, once for constructing the group variable and once for adding the group to all the rows of your source data. 

Only sort the data if you can't be sure it's already sorted. If it could be sorted but you're not sure then you can use the presorted option with Proc Sort - but only if the data could already be sorted because presorted does a prescan (read) of the data so it's only beneficial for cases where the data is actually already sorted. If it's not already sorted the the option will add processing time.

proc sort data=have presorted;
  by patient time;
run;

 

About your Best Approach question:

I'd say that depends on your needs. Coding for performance often means more complicated and though harder to maintain code. In my mind it's often more about keeping the balance and what's the right approach for a specific situation.
If you want to take performance tuning to the "extreme" then you need also to take your environment into consideration - throughput to disk both for WORK and permanent tables , available memory, data volume to process, etc.

 

Below two coding options.

 

Option 1: Keep code simple but not best performance. Creates an intermediary table.

/* create and populate group variable per patient */
data inter(keep=patient group);
  set have;
  by patient time;

  /* Declare variables */
  length group $20;
  retain group;

  /* First time seeing a patient — initialize group */
  if first.patient then group = drug;
  /* add new drug if not already in group */
  else  
  if findw(group,drug,'|','it')=0 then group = catx('|', group, drug);

  if last.patient then output;
run;

/* Repopulate full group value across all rows for each patient */
data want;
  merge have inter;
  by patient;
run;

 

Option 2: Performance tuned code with processing in a single data step. Avoids writing to an intermediary table.

Uses a "double DOW". The first loop populates the group variable, the 2nd DOW loop writes all rows to the target table with the populated group variable included. Also this approach requires the source data to be sorted by patient and time.
There are multiple papers discussing the DOW loop technique. Here one of them: Practical Uses of the DOW Loop

data want2(drop=_:);

  dcl hash h1();
  h1.defineKey('_drug');
  h1.defineDone();
  
  do _n_=1 by 1 until(last.patient);
    set have;
    by patient;
    /* Declare variables */
    length group $20;
    _drug=upcase(drug);
  
    /* First time seeing a patient — empty hash lookup table and start populating group */
    if first.patient then
      do;
        _rc=h1.clear();
        group=_drug;
        _rc=h1.add();
      end;
    else
    /* for same patient: add new drugs to group */
    if h1.check() ne 0 then
      do;
        group = catx('|', group, _drug);
        _rc=h1.add();
      end;
  end;
 
  do _n_=1 by 1 until(last.patient);
    set have;
    by patient;
    output;
  end;
run;

 

 

 

 

 

Ksharp
Super User

That would be easy if there were not duplicated drug within a patient.

Otherwise,you need PROC SORT+NODUPKEY option to eliminate these duplicated drug.

 

data have;
  infile datalines truncover dsd dlm=',';
  input patient time drug :$8.;
  datalines;
1,1,A
1,2,B
2,1,B
2,2,A
;

data want;
do until(last.patient);
 set have;
 by patient;
 length group $ 80;
 group=cats(group,drug);
end;
do until(last.patient);
 set have;
 by patient;
 output;
end;
run;
dxiao2017
Lapis Lazuli | Level 10

Hi @ArdigenSupport , I copied my answer to @pink_poodle 's question in the other post here. The basic idea is: (1) first use lag() function or proc transpose, and then cat() function to create the patient group column, and then (2) use sql join to produce the final table. And BTW, I think lag() function and proc transpose are suitable techniques for dealing with data from this kind of typical crossover or repeated measure design.

 

(1) use lag() function

data have1;
	input patient time drug $;
	datalines;
1 1 A
1 2 B
2 1 B
2 2 A
;
run;
proc print data=have1;run;
data want1;
   set have1;
   by patient;
   drugprev1=lag1(drug);
   if first.patient then drugprev1=' ';
   if last.patient then group=cat(compress(drugprev1),
                                  compress(drug));
run;
proc print data=want1;run;
proc sql;
create table patientgrp as
select distinct patient,group
   from want1
   where group is not null;
select * from patientgrp;
quit;
proc sql;
create table wantfinal as
select w.patient,w.time,pg.group
   from want1 as w left join
        patientgrp as pg
   on w.patient=pg.patient;
select * from wantfinal;
quit;

dxiao2017_0-1754660246175.png

 

(2) use proc transpose

data have;
	input patient time drug $;
	datalines;
1 1 A
1 2 B
1 3 B
2 1 B
2 2 A
;
run;
proc print data=have;run;
data have1;
   set have;
   timechar=cat('t',put(time,1.));
run;
proc print data=have1;run;
proc transpose data=have1 out=want;
   by patient;
   id timechar;
   var drug;
run;
proc print data=want;run;
data want1;
   set want;
   group=compress(cat(t1,t2,t3));
run;
proc print data=want1;run;
proc sql;
select h.*,
       w1.group
   from have as h left join
        want1 w1
   on h.patient=w1.patient
   order by h.patient,h.time;
quit;

dxiao2017_3-1754661163390.png

dxiao2017_4-1754661218892.png

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 516 views
  • 2 likes
  • 4 in conversation