Solved: Re: Selecting rows with first unique observation

DathanMD · Posted 02-21-2020 03:39 AM

Dear

Hi Colleagues,

I am new to SAS (and to programming)! I am dealing with a dataset that contains patients (id) and medications (var=medicines) and their startdates and end_dates. Kindly note that the medicines are quite many (but won’t exceed 10) and are separated by a comma. I would like to select rows when a patient was exposed to atleast one medicine for the first time. The order of medicines should not matter since I am interested in retaining patients exposed to one or more medicines for the first time (once a patient is exposed to a single medicine that they haven’t been exposed before, I would like to retain that row).

Data I have

id	medicines	Start_date	End_date
1	abc,tdf,rtv,lpv	01jan2012	02dec2012
1	abc,tfd,drv	03dec2012	03dec2014
1	lpv,tdf	04dec2014	.
2	abc,tdf,rtv,lpv, idv	01jan2012	02dec2012
2	abc,tfd,drv,	03dec2012	03dec2014
2	tdf,abc ,rtv,lpv	04dec2014	06dec2016
2	Idv,abc	07dec2017	11march2019
2	idv	12march 2019	.

Data I want;

id	medicines	Start_date	End_date
1	abc,tdf,rtv,lpv	01jan2012	02dec2012
1	Abc,tfd,drv	03dec2012	03dec2014
2	abc,tdf,rtv,lpv, idv	01jan2012	02dec2012
2	abc,tfd,drv	03dec2012	03dec2014

Will someone hack the code for me please?

Thanks a bunch.

Dathan Byonanebye

ed_sas_member · Posted 02-21-2020 04:07 AM

Hi @DathanMD

Here is an attempt to do this:

data vital;
	infile datalines dlm="09"x dsd missover;
	input id medicines:$100. Start_date:date9. End_date:date9.;
	format Start_date End_date date9.;
	datalines;
1	abc,tdf,rtv,lpv	01jan2012	02dec2012
1	abc,tfd,drv	03dec2012	03dec2014 
1	lpv,tdf	04dec2014	.
2	abc,tdf,rtv,lpv, idv	01jan2012	02dec2012
2	abc,tfd,drv,	03dec2012	03dec2014
2	tdf,abc ,rtv,lpv	04dec2014	06dec2016 
2	Idv,abc	07dec2017	11mar2019
2	idv	12march2019	.
;
run;

data vital_exp;
	set vital;
	count + 1;
	do i=1 to countw(medicines);
		medicine_unit = scan(upcase(medicines),i);
		output;
	end;
	drop i;
run;

proc sort data=vital_exp;
	by id medicine_unit Start_date;
run;

data vital_select;
	set vital_exp;
	by id medicine_unit Start_date;
	medicines=propcase(medicines);
	if first.medicine_unit then output;
run;

proc sort data=vital_select out=want(drop=count medicine_unit) nodupkey;
	by id medicines count;
run;

View solution in original post

PeterClemmensen · Posted 02-21-2020 04:04 AM

You can do something like this..

data have;
input id medicines $ 3-22 (Start_date End_date) (:date9.);
format Start_date End_date date9.;
datalines;
1 abc,tdf,rtv,lpv     01jan2012 02dec2012
1 abc,tfd,drv         03dec2012 03dec2014
1 lpv,tdf             04dec2014 .        
2 abc,tdf,rtv,lpv,idv 01jan2012 02dec2012
2 abc,tfd,drv         03dec2012 03dec2014
2 tdf,abc,rtv,lpv     04dec2014 06dec2016
2 idv,abc             07dec2017 11mar2019
2 idv                 12mar2019 .        
;

data want(drop=m i flag);
    declare hash h ();
    h.definekey ('id', 'm');
    h.definedone();

    do until (lr);
        set have end=lr;
        flag=0;
        do i = 1 to countw(medicines, ',');
           m = scan(medicines, i, ',');
           if h.check() ne 0 then flag=1;
           h.ref();
        end;
        if flag then output;
    end;
run;

Result:

id  medicines            Start_date  End_date 
1   abc,tdf,rtv,lpv      01JAN2012   02DEC2012 
1   abc,tfd,drv          03DEC2012   03DEC2014 
2   abc,tdf,rtv,lpv,idv  01JAN2012   02DEC2012 
2   abc,tfd,drv          03DEC2012   03DEC2014

The DATA to DATA Step Macro
Blog: SASnrd

yabwon · Posted 02-21-2020 05:22 AM

Nice use of hash table! 🙂

Small additional thing, if I may.

I would add Strip() function to

m = STRIP(scan(medicines, i, ','));

When I was playing your code I've noticed that if there is space between comma and drug name SAS is keeping that space, e.g.

data have;
input id medicines $ 3-22 (Start_date End_date) (:date9.);
format Start_date End_date date9.;
datalines;
1 abc,tdf,rtv,lpv     01jan2012 02dec2012
1 abc,tfd,drv         03dec2012 03dec2014
1 lpv,tdf             04dec2014 .        
2 abc,tdf,rtv,lpv,idv 01jan2012 02dec2012
2 abc,tfd,drv         03dec2012 03dec2014
2 tdf,abc,rtv,lpv     04dec2014 06dec2016
2 idv,abc             07dec2017 11mar2019
2 idv, xxx            12mar2019 .        
2 xxx                 .         .
;
run;

last two rows.

All the best

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

PeterClemmensen · Posted 02-26-2020 03:05 AM

@yabwon thanks 🙂 And yes, I agree completely.

The DATA to DATA Step Macro
Blog: SASnrd

DathanMD · Posted 02-26-2020 03:04 PM

Thanks @yabwon and @PeterClemmensen

ed_sas_member · Posted 02-21-2020 04:07 AM

Hi @DathanMD

Here is an attempt to do this:

data vital;
	infile datalines dlm="09"x dsd missover;
	input id medicines:$100. Start_date:date9. End_date:date9.;
	format Start_date End_date date9.;
	datalines;
1	abc,tdf,rtv,lpv	01jan2012	02dec2012
1	abc,tfd,drv	03dec2012	03dec2014 
1	lpv,tdf	04dec2014	.
2	abc,tdf,rtv,lpv, idv	01jan2012	02dec2012
2	abc,tfd,drv,	03dec2012	03dec2014
2	tdf,abc ,rtv,lpv	04dec2014	06dec2016 
2	Idv,abc	07dec2017	11mar2019
2	idv	12march2019	.
;
run;

data vital_exp;
	set vital;
	count + 1;
	do i=1 to countw(medicines);
		medicine_unit = scan(upcase(medicines),i);
		output;
	end;
	drop i;
run;

proc sort data=vital_exp;
	by id medicine_unit Start_date;
run;

data vital_select;
	set vital_exp;
	by id medicine_unit Start_date;
	medicines=propcase(medicines);
	if first.medicine_unit then output;
run;

proc sort data=vital_select out=want(drop=count medicine_unit) nodupkey;
	by id medicines count;
run;

ChrisNZ · Posted 02-21-2020 04:16 AM

Likes this?

data WANT;
 set HAVE;
 length OLDMEDS NEWMEDS MED $20;
 retain OLDMEDS;
 keep ID MEDS START_DATE END_DATE NEWMEDS;
 by ID;
 if first.ID then call missing(OLDMEDS);
 do MEDNO=1 to countw(MEDS);
   MED=scan(MEDS,MEDNO);
   if ^find(OLDMEDS,MED,'t') then do;
     OLDMEDS=catx(',',OLDMEDS,MED);
     NEWMEDS=catx(',',NEWMEDS,MED);
   end;
 end;
 if NEWMEDS ne ' ' then output;
run;

ID	MEDS	Start_date	End_date	NEWMEDS
1	abc,tdf,rtv,lpv	01JAN2012	02DEC2012	abc,tdf,rtv,lpv
1	abc,tfd,drv	03DEC2012	03DEC2014	tfd,drv
2	abc,tdf,rtv,lpv,idv	01JAN2012	02DEC2012	abc,tdf,rtv,lpv,idv
2	abc,tfd,drv	03DEC2012	03DEC2014	tfd,drv

High-Performance SAS Coding - Third Edition

RichardDeVen · Posted 02-27-2020 08:22 AM

Really like tracking the new meds, as they are the core feature of the processing.

You might want to make OLDMEDS longer to accomodate the new meds it is accumulating.

 length OLDMEDS $500 NEWMEDS MED $20;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away