For the following problem I'm having, I'm not sure where to even begin when it comes to the code, so I don't have much code to present here, but I will describe my problem.
1. I need to find all patients who have the same MRN, then look at their date of surgeries. If they are listed more than once in the dataset by MRN, and they have the same date of surgery, this means they had more than one tumor and both tumors were removed on the same day. I need to remove the record with the lower tumor size.
2. If the patient is in the dataset more than once (their MRN will be in there more than once), but their date of surgeries are different, I then need to remove the observation(s) with the higher sequence number(s).
This is for analysis purposes. I will be running things like PROC FREQ and PROC LOGISTIC on this data set. Can anyone help, please?
DateOfSurgery1 | SequenceNum1 | MRN1 | TumorSizecm1 |
12OCT2018:00:00:00 | 2 | 223XXX | 4 |
12OCT2018:00:00:00 | 1 | 223XXX | 2.5 |
14SEP2013:00:00:00 | 2 | 328XXX | 2 |
14SEP2013:00:00:00 | 1 | 328XXX | 0.6 |
10JAN2013:00:00:00 | 2 | 336XXX | 4.5 |
20AUG2011:00:00:00 | 1 | 336XXX | 4.5 |
17JUL2015:00:00:00 | 2 | 337XXX | 4 |
06MAY2002:00:00:00 | 1 | 337XXX | . |
09AUG2017:00:00:00 | 3 | 350XXX | 1.2 |
04OCT2017:00:00:00 | 4 | 350XXX | 0.3 |
09JUL2011:00:00:00 | 2 | 350XXX | 2.5 |
30AUG2012:00:00:00 | 1 | 350XXX | 1 |
07JUN2016:00:00:00 | 4 | 354XXX | 1.2 |
19NOV2006:00:00:00 | 3 | 354XXX | 1 |
17FEB2014:00:00:00 | 1 | 355XXX | 2.8 |
08JAN2013:00:00:00 | 2 | 355XXX | 1.2 |
10DEC2013:00:00:00 | 2 | 359XXX | 0.7 |
10DEC2013:00:00:00 | 1 | 359XXX | 2.2 |
27MAR2015:00:00:00 | 2 | 368XXX | 0.6 |
27MAR2015:00:00:00 | 1 | 368XXX | 0.9 |
The expected dataset should look like this:
DateOfSurgery1 | SequenceNum1 | MRN1 | TumorSizecm1 |
12OCT2018:00:00:00 | 2 | 223XXX | 4 |
14SEP2013:00:00:00 | 2 | 328XXX | 2 |
20AUG2011:00:00:00 | 1 | 336XXX | 4.5 |
06MAY2002:00:00:00 | 1 | 337XXX | . |
30AUG2012:00:00:00 | 1 | 350XXX | 1 |
19NOV2006:00:00:00 | 3 | 354XXX | 1 |
17FEB2014:00:00:00 | 1 | 355XXX | 2.8 |
10DEC2013:00:00:00 | 1 | 359XXX | 2.2 |
27MAR2015:00:00:00 | 1 | 368XXX | 0.9 |
so, how come I am able to match your expected output
Here i tested again,
this I added one more to the sample at the top 223XXU with count of mrn at 1 meaning 1 time patient to make sure those are not deleted
data have;
input DateOfSurgery1 :datetime20. SequenceNum1 MRN1 $ TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00 2 223XXU 5
12OCT2018:00:00:00 2 223XXX 4
12OCT2018:00:00:00 1 223XXX 2.5
14SEP2013:00:00:00 2 328XXX 2
14SEP2013:00:00:00 1 328XXX 0.6
10JAN2013:00:00:00 2 336XXX 4.5
20AUG2011:00:00:00 1 336XXX 4.5
17JUL2015:00:00:00 2 337XXX 4
06MAY2002:00:00:00 1 337XXX .
09AUG2017:00:00:00 3 350XXX 1.2
04OCT2017:00:00:00 4 350XXX 0.3
09JUL2011:00:00:00 2 350XXX 2.5
30AUG2012:00:00:00 1 350XXX 1
07JUN2016:00:00:00 4 354XXX 1.2
19NOV2006:00:00:00 3 354XXX 1
17FEB2014:00:00:00 1 355XXX 2.8
08JAN2013:00:00:00 2 355XXX 1.2
10DEC2013:00:00:00 2 359XXX 0.7
10DEC2013:00:00:00 1 359XXX 2.2
27MAR2015:00:00:00 2 368XXX 0.6
27MAR2015:00:00:00 1 368XXX 0.9
;
proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;
OUTPUT:
SAS Output
The SAS System |
DateOfSurgery1 | SequenceNum1 | MRN1 | TumorSizecm1 |
---|---|---|---|
12OCT2018:00:00:00 | 2 | 223XXU | 5.0 |
12OCT2018:00:00:00 | 2 | 223XXX | 4.0 |
14SEP2013:00:00:00 | 2 | 328XXX | 2.0 |
20AUG2011:00:00:00 | 1 | 336XXX | 4.5 |
06MAY2002:00:00:00 | 1 | 337XXX | . |
30AUG2012:00:00:00 | 1 | 350XXX | 1.0 |
19NOV2006:00:00:00 | 3 | 354XXX | 1.0 |
17FEB2014:00:00:00 | 1 | 355XXX | 2.8 |
10DEC2013:00:00:00 | 1 | 359XXX | 2.2 |
27MAR2015:00:00:00 | 1 | 368XXX | 0.9 |
data have;
input DateOfSurgery1 :datetime20. SequenceNum1 MRN1 $ TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00 2 223XXX 4
12OCT2018:00:00:00 1 223XXX 2.5
14SEP2013:00:00:00 2 328XXX 2
14SEP2013:00:00:00 1 328XXX 0.6
10JAN2013:00:00:00 2 336XXX 4.5
20AUG2011:00:00:00 1 336XXX 4.5
17JUL2015:00:00:00 2 337XXX 4
06MAY2002:00:00:00 1 337XXX .
09AUG2017:00:00:00 3 350XXX 1.2
04OCT2017:00:00:00 4 350XXX 0.3
09JUL2011:00:00:00 2 350XXX 2.5
30AUG2012:00:00:00 1 350XXX 1
07JUN2016:00:00:00 4 354XXX 1.2
19NOV2006:00:00:00 3 354XXX 1
17FEB2014:00:00:00 1 355XXX 2.8
08JAN2013:00:00:00 2 355XXX 1.2
10DEC2013:00:00:00 2 359XXX 0.7
10DEC2013:00:00:00 1 359XXX 2.2
27MAR2015:00:00:00 2 368XXX 0.6
27MAR2015:00:00:00 1 368XXX 0.9
;
proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
order by mrn1,SequenceNum1 desc;
quit;
Novinosrin,
Your solution gets rid of everyone in the data set who is also in there only once. I don't need to do anything with those one-time cancer patients, They need to stay in the data set. I only need to work on the patients who are in the data set more than once and filter them by tumor size and/or date of surgery. So in my mind, it's more like an IF statement. IF the patient is in the data set more than once (by MRN), then DO the following:...
Does that make sense?
ok
try adding
count(mrn1)=1 in having clause
i.e
proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;
The revised having
/*Condition 1*/
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
/*Condition 2*/
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
/*Condition 3= one time patients undisturbed in the dataset*/
count(mrn1)=1
Basically , I didn't think of the one time patients as your sample didn't have any and so my objective was to match your sample output. The revised one should handle all 3
Proc sort data=have; by mrn1 dateofsurgery1 tumorsizecm1; run; data want; set have; by mrn1 dateofsurgery1 tumorsizecm1; if last.dateofsurgery1; run;
But this requires the date to be an actual date value and not a character value.
Also when I see so many variables ending in 1 is suspect there may be similar ending in 2, 3 or 4. Which means that we may not have a complete description of your data.
The 1's in the variable names are only due to a merger between two data sets so that similar variables wouldn't automatically merge. The similar variables have already been merged on a one-by-one basis in order to conserve the integrity of the data as data from one data set was considered more "credible" than data from the other data set. The 1s are just leftover from the merger.
data have;
input DateOfSurgery1 :datetime20. SequenceNum1 MRN1 $ TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00 2 223XXX 4
12OCT2018:00:00:00 1 223XXX 2.5
14SEP2013:00:00:00 2 328XXX 2
14SEP2013:00:00:00 1 328XXX 0.6
10JAN2013:00:00:00 2 336XXX 4.5
20AUG2011:00:00:00 1 336XXX 4.5
17JUL2015:00:00:00 2 337XXX 4
06MAY2002:00:00:00 1 337XXX .
09AUG2017:00:00:00 3 350XXX 1.2
04OCT2017:00:00:00 4 350XXX 0.3
09JUL2011:00:00:00 2 350XXX 2.5
30AUG2012:00:00:00 1 350XXX 1
07JUN2016:00:00:00 4 354XXX 1.2
19NOV2006:00:00:00 3 354XXX 1
17FEB2014:00:00:00 1 355XXX 2.8
08JAN2013:00:00:00 2 355XXX 1.2
10DEC2013:00:00:00 2 359XXX 0.7
10DEC2013:00:00:00 1 359XXX 2.2
27MAR2015:00:00:00 2 368XXX 0.6
27MAR2015:00:00:00 1 368XXX 0.9
;
data want;
call missing(min_seq,max_tum,n1);
do n=1 by 1 until(last.mrn1);
set have;
by mrn1 DateOfSurgery1 notsorted;
if first.DateOfSurgery1 then n1+1;
min_seq=min(SequenceNum1,min_seq);
max_tum=max(TumorSizecm1,max_tum);
end;
do until(last.mrn1);
set have;
by mrn1 DateOfSurgery1 notsorted;
if n>1 and n1=1 and TumorSizecm1=max_tum then output;
else if n>1 and n1>1 and SequenceNum1=min_seq then output;
else if n=1 then output;
end;
drop n: min_seq max_tum;
run;
Everyone's suggestions aren't working correctly yet, so....
How about something like this? I haven't finished the code, but I got it started. Right now there is an error and I don't know what it means, but I'll post it below.
proc sql; DELETE FROM practice WHERE COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) = TumorSizecm1; QUIT;
Error message:
7589 proc sql;
7590 DELETE FROM practice
7591 WHERE COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) =
7591! TumorSizecm1;
ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.
7592 QUIT;
Are you deleting records where this condition
COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) = TumorSizecm1;
is true?
If so, your expected output at the very top is contradicitng the latest post of yours
Take a look at
DateOfSurgery1 | SequenceNum1 | MRN1 | TumorSizecm1 |
12OCT2018:00:00:00 | 2 | 223XXX | 4 |
12OCT2018:00:00:00 | 1 | 223XXX | 2.5 |
If you are deleting the bold row, why is that in the expected output?
Actually, that was sort of a cut and paste of your code that didn't work. What your code did was replace all of the tumor sizes with the MAX tumor size between two records of the same patient.
My original logic still applies, the observation associated with the minimum tumor size is the one that needs to be deleted if the MRNs and dates of surgeries are the same.
so, how come I am able to match your expected output
Here i tested again,
this I added one more to the sample at the top 223XXU with count of mrn at 1 meaning 1 time patient to make sure those are not deleted
data have;
input DateOfSurgery1 :datetime20. SequenceNum1 MRN1 $ TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00 2 223XXU 5
12OCT2018:00:00:00 2 223XXX 4
12OCT2018:00:00:00 1 223XXX 2.5
14SEP2013:00:00:00 2 328XXX 2
14SEP2013:00:00:00 1 328XXX 0.6
10JAN2013:00:00:00 2 336XXX 4.5
20AUG2011:00:00:00 1 336XXX 4.5
17JUL2015:00:00:00 2 337XXX 4
06MAY2002:00:00:00 1 337XXX .
09AUG2017:00:00:00 3 350XXX 1.2
04OCT2017:00:00:00 4 350XXX 0.3
09JUL2011:00:00:00 2 350XXX 2.5
30AUG2012:00:00:00 1 350XXX 1
07JUN2016:00:00:00 4 354XXX 1.2
19NOV2006:00:00:00 3 354XXX 1
17FEB2014:00:00:00 1 355XXX 2.8
08JAN2013:00:00:00 2 355XXX 1.2
10DEC2013:00:00:00 2 359XXX 0.7
10DEC2013:00:00:00 1 359XXX 2.2
27MAR2015:00:00:00 2 368XXX 0.6
27MAR2015:00:00:00 1 368XXX 0.9
;
proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;
OUTPUT:
SAS Output
The SAS System |
DateOfSurgery1 | SequenceNum1 | MRN1 | TumorSizecm1 |
---|---|---|---|
12OCT2018:00:00:00 | 2 | 223XXU | 5.0 |
12OCT2018:00:00:00 | 2 | 223XXX | 4.0 |
14SEP2013:00:00:00 | 2 | 328XXX | 2.0 |
20AUG2011:00:00:00 | 1 | 336XXX | 4.5 |
06MAY2002:00:00:00 | 1 | 337XXX | . |
30AUG2012:00:00:00 | 1 | 350XXX | 1.0 |
19NOV2006:00:00:00 | 3 | 354XXX | 1.0 |
17FEB2014:00:00:00 | 1 | 355XXX | 2.8 |
10DEC2013:00:00:00 | 1 | 359XXX | 2.2 |
27MAR2015:00:00:00 | 1 | 368XXX | 0.9 |
I don't know what to tell you. WHen I run the following code to be sure your code actually deleted the files I wanted:
PROC SQL; SELECT DateOfSurgery1, SequenceNum, MRN1, TumorSizecm1 FROM practice GROUP BY MRN1 HAVING COUNT(MRN1) > 1; QUIT;
I get the following truncated results:
SAS Output
20NOV2008:00:00:00 | 2 | 0036XXXX | 2 |
20NOV2008:00:00:00 | 1 | 0036XXXX | 2 |
21DEC2011:00:00:00 | 2 | 0204XXXX | 1.2 |
21DEC2011:00:00:00 | 1 | 0204XXXX | 1.2 |
27OCT2009:00:00:00 | 2 | 1010XXXX | 0.5 |
27OCT2009:00:00:00 | 1 | 1010XXXX | 0.5 |
14JUL2005:00:00:00 | 1 | 2281XXXX | 4.8 |
14JUL2005:00:00:00 | . | 2281XXXX | 4.8 |
09OCT2007:00:00:00 | 5 | 2304XXXX | 0.1 |
09OCT2007:00:00:00 | 5 | 2304XXXX | 0.1 |
11OCT2012:00:00:00 | 2 | 2401XXXX | 1.2 |
11OCT2012:00:00:00 | 1 | 2401XXXX | 1.2 |
13SEP2004:00:00:00 | 5 | 2816XXXX | 2.8 |
Actually, give me just a second to check the data again, a little bit more thoroughly this time. I apologize.
Okay, I have checked things more thoroughly this time and your code is working just like it should. It turns out I have patients with more than one record and those records have the same tumor size as well. I will have to look into these as to why. I thought I'd deleted all duplicates, so it may be multiple tumors or it may not.
Thank you for your helpful code.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.