Solved: Want to subset a champion/Challenger dataset from larger population

TheNovice · Posted 08-15-2020 12:49 AM

Hi all,

The dataset is like below with accounts undergoing treatment R. At some point they will switch to treatment S. Once they switch to S they will not switch back to R. I would like to identify 2 consecutive records where an ID had R and then received S

Note: Some id's will never move to S

ID	date	type
99997887	05-May-20	R
99997887	04-Jun-20	R
99998960	02-Jun-20	S
99998960	03-Jul-20	S
99998960	03-Aug-20	S
100003216	14-Jun-20	R
100003216	15-Jul-20	R
100005355	05-May-20	R
100005355	04-Jun-20	R
100005355	05-Jul-20	R
100005981	03-Jun-20	R
100005981	03-Aug-20	S
100009933	10-May-20	R
100009933	10-Jul-20	R
100009933	09-Aug-20	R
100010577	22-May-20	R
100010577	22-Jul-20	S

Shmuel · Posted 08-15-2020 02:31 AM

Next is a fixed tested code:

data have;
infile datalines dlm='09'x;
  input ID	date :date9. type :$1. ;
  format date date9. ;
datalines;
99997887	05-May-20	R
99997887	04-Jun-20	R
99998960	02-Jun-20	S
99998960	03-Jul-20	S
99998960	03-Aug-20	S
100003216	14-Jun-20	R
100003216	15-Jul-20	R
100005355	05-May-20	R
100005355	04-Jun-20	R
100005355	05-Jul-20	R
100005981	03-Jun-20	R
100005981	03-Aug-20	S
100009933	10-May-20	R
100009933	10-Jul-20	R
100009933	09-Aug-20	R
100010577	22-May-20	R
100010577	22-Jul-20	S
run;
data want;
 set have;
   by ID;
      retain lag_type  lag_date;
      lag_type = lag(type); lag_date=lag(date); 
      if first.ID then do; lag_type=" "; lag_date=.; end;

if ID in (100005981	, 100010577	) then 
    putlog  ID= date= type= lag_type= lag_date;

     if not first.ID and type='S' and lag_type='R' then do;
        R_type = lag_type; R_date=lag_date;
        S_type = type;  S_date = date;
        output;
       keep ID R_: S_: ;     end;    /* line added */
run;

View solution in original post

mkeintz · Posted 08-15-2020 12:59 AM

Do you want a report? Or a dataset? What do you want it to look like?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

TheNovice · Posted 08-15-2020 01:02 AM

A dataset so i can further analysis.

so i would want all id's where there was an event R and then an event S after.

I only care about extracting 2 rows per ID this way. R would be my champion and S would be the challenger

mkeintz · Posted 08-15-2020 01:11 AM

Assuming that (1) the data are sorted by ID/DATE, and (2) you want both the R record and the S record for all occasions when a given ID has an R immediately followed by an S:

data have;
  input ID	date :date9. type :$1. ;
  format date date9. ;
datalines;
99997887	05-May-20	R
99997887	04-Jun-20	R
99998960	02-Jun-20	S
99998960	03-Jul-20	S
99998960	03-Aug-20	S
100003216	14-Jun-20	R
100003216	15-Jul-20	R
100005355	05-May-20	R
100005355	04-Jun-20	R
100005355	05-Jul-20	R
100005981	03-Jun-20	R
100005981	03-Aug-20	S
100009933	10-May-20	R
100009933	10-Jul-20	R
100009933	09-Aug-20	R
100010577	22-May-20	R
100010577	22-Jul-20	S
run;

data want (drop=nxt_:);
  set have (keep=id);
  by id;
  merge have have (firstobs=2 keep=type rename=(type=nxt_type));

  if (last.id=0 and type='R' and nxt_type='S')
  or (first.id=0 and type='S' and lag(type)='R');
run;

The SET statement only retrieves the ID variable, because (in combination with the subsequent BY statement) its only purpose is to provide a way to determine whether the record-in-hand is not the start of an ID (first.id=0) or not the end of an ID (last.id=0).
The MERGE retrieves each record and simultaneously one variable (type renamed to nxt_type) from the following record.
Once you know whether the record-in-hand is an R (not at the end of an ID) followed by an S, or is an S (not at the beginning of an ID) preceded by an R, then keep it.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

TheNovice · Posted 08-15-2020 02:09 AM

It worked. Thank you so much. I was going to go the proc sql way by subsetting and then joining again but your method is fast and elegant. Thanks for the explanation. I would just like to compare this to the second solution posted and see if there is a variance in output. Your extract is what i was looking for. Your assumption is bang on.

Shmuel · Posted 08-15-2020 01:36 AM

You can subset two records as required and may be prefer next output format

having both records data in one observation:

data want;
 set have;
   by ID;
      retain lag_type  lag_date;
      if first.ID then do; lag_type=" "; lag_date=.; end;
      else do; lag_type = type; lag_date=date; end;

     if not first.ID and type='S' and lag_type='R' then do;
        R_type = lag_type; R_date=lag_date;
        S_type = type;  S_date = date;
        output;
       keep ID R_: S_: ;
     end;    /* line added */
run;

TheNovice · Posted 08-15-2020 02:07 AM

Thank you Shmuel,

But the output was blank... no errors. the second do block was not closed so i closed it like this:

data want1;
set have;
   by ID;
      retain lag_type lag_date;
      if first.ID then do; lag_type=" "; lag_date=.; end;
      else do; lag_type = type; lag_date=date; end;

     if not first.ID and type='Symend' and lag_type='Regular' then do;
        R_type = lag_type; R_date=lag_date;
        S_type = type; S_date = date;
        output;
       keep ID R_: S_: ;end;
run;

The solution by mkeintz worked. I would love to run both and see if there is a variance. Would appreciate you looking into it.

Thank you again

Shmuel · Posted 08-15-2020 02:31 AM

Next is a fixed tested code:

data have;
infile datalines dlm='09'x;
  input ID	date :date9. type :$1. ;
  format date date9. ;
datalines;
99997887	05-May-20	R
99997887	04-Jun-20	R
99998960	02-Jun-20	S
99998960	03-Jul-20	S
99998960	03-Aug-20	S
100003216	14-Jun-20	R
100003216	15-Jul-20	R
100005355	05-May-20	R
100005355	04-Jun-20	R
100005355	05-Jul-20	R
100005981	03-Jun-20	R
100005981	03-Aug-20	S
100009933	10-May-20	R
100009933	10-Jul-20	R
100009933	09-Aug-20	R
100010577	22-May-20	R
100010577	22-Jul-20	S
run;
data want;
 set have;
   by ID;
      retain lag_type  lag_date;
      lag_type = lag(type); lag_date=lag(date); 
      if first.ID then do; lag_type=" "; lag_date=.; end;

if ID in (100005981	, 100010577	) then 
    putlog  ID= date= type= lag_type= lag_date;

     if not first.ID and type='S' and lag_type='R' then do;
        R_type = lag_type; R_date=lag_date;
        S_type = type;  S_date = date;
        output;
       keep ID R_: S_: ;     end;    /* line added */
run;

Jagadishkatam · Posted 08-15-2020 06:16 AM

you may also try the below code

data want;
length val $10.;
do until(last.id);
set have;
by id date;
retain val;
if first.id then val=type;
else val=cats(val,type);
end;
do until(last.id);
set have;
by id date;
if index(val,'RS') then output;
end;
run;

Thanks,
Jag

TheNovice · Posted 09-04-2020 11:52 AM

Sorry for the delay all. I was on vacation and taking a break from all work. I tested the next 2 suggestions and they both work.

Thanks again 🙂

Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Re: Want to subset a champion/Challenger dataset from larger population

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away