BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lady8506
Quartz | Level 8

For the following problem I'm having, I'm not sure where to even begin when it comes to the code, so I don't have much code to present here, but I will describe my problem.

 

1. I need to find all patients who have the same MRN, then look at their date of surgeries. If they are listed more than once in the dataset by MRN, and they have the same date of surgery, this means they had more than one tumor and both tumors were removed on the same day. I need to remove the record with the lower tumor size.

 

2. If the patient is in the dataset more than once (their MRN will be in there more than once), but their date of surgeries are different, I then need to remove the observation(s) with the higher sequence number(s).

 

This is for analysis purposes. I will be running things like PROC FREQ and PROC LOGISTIC on this data set. Can anyone help, please?

 

DateOfSurgery1SequenceNum1MRN1TumorSizecm1
12OCT2018:00:00:002223XXX4
12OCT2018:00:00:001223XXX2.5
14SEP2013:00:00:002328XXX2
14SEP2013:00:00:001328XXX0.6
10JAN2013:00:00:002336XXX4.5
20AUG2011:00:00:001336XXX4.5
17JUL2015:00:00:002337XXX4
06MAY2002:00:00:001337XXX.
09AUG2017:00:00:003350XXX1.2
04OCT2017:00:00:004350XXX0.3
09JUL2011:00:00:002350XXX2.5
30AUG2012:00:00:001350XXX1
07JUN2016:00:00:004354XXX1.2
19NOV2006:00:00:003354XXX1
17FEB2014:00:00:001355XXX2.8
08JAN2013:00:00:002355XXX1.2
10DEC2013:00:00:002359XXX0.7
10DEC2013:00:00:001359XXX2.2
27MAR2015:00:00:002368XXX0.6
27MAR2015:00:00:001368XXX0.9

 

The expected dataset should look like this:

DateOfSurgery1SequenceNum1MRN1TumorSizecm1
12OCT2018:00:00:002223XXX4
14SEP2013:00:00:002328XXX2
20AUG2011:00:00:001336XXX4.5
06MAY2002:00:00:001337XXX.
30AUG2012:00:00:001350XXX1
19NOV2006:00:00:003354XXX1
17FEB2014:00:00:001355XXX2.8
10DEC2013:00:00:001359XXX2.2
27MAR2015:00:00:001368XXX0.9
1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

so, how come I am able to match your expected output

Here i tested again,

this I added one more to the sample at the top 223XXU with count of mrn at 1 meaning 1 time patient to make sure those are not deleted

data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXU	5
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;



proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;

OUTPUT:

 

SAS Output

The SAS System

DateOfSurgery1 SequenceNum1 MRN1 TumorSizecm1
12OCT2018:00:00:00 2 223XXU 5.0
12OCT2018:00:00:00 2 223XXX 4.0
14SEP2013:00:00:00 2 328XXX 2.0
20AUG2011:00:00:00 1 336XXX 4.5
06MAY2002:00:00:00 1 337XXX .
30AUG2012:00:00:00 1 350XXX 1.0
19NOV2006:00:00:00 3 354XXX 1.0
17FEB2014:00:00:00 1 355XXX 2.8
10DEC2013:00:00:00 1 359XXX 2.2
27MAR2015:00:00:00 1 368XXX 0.9

View solution in original post

15 REPLIES 15
novinosrin
Tourmaline | Level 20
data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;

proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
order by mrn1,SequenceNum1 desc;
quit;
lady8506
Quartz | Level 8

 

 

novinosrin
Tourmaline | Level 20

ok

try adding

count(mrn1)=1 in having clause

 

i.e

 

proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;
novinosrin
Tourmaline | Level 20

The revised having

/*Condition 1*/

having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or

/*Condition 2*/
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or

 

/*Condition 3= one time patients undisturbed in the dataset*/
count(mrn1)=1

 

Basically , I didn't think of the one time patients as your sample didn't have any and so my objective was to match your sample output. The revised one should handle all 3

ballardw
Super User
Proc sort data=have;
   by mrn1 dateofsurgery1 tumorsizecm1;
run;

data want;
   set have;
   by mrn1 dateofsurgery1 tumorsizecm1;
   if last.dateofsurgery1;
run;

But this requires the date to be an actual date value and not a character value.

 

Also when I see so many variables ending in 1 is suspect there may be similar ending in 2, 3 or 4. Which means that we may not have a complete description of your data.

 

lady8506
Quartz | Level 8

The 1's in the variable names are only due to a merger between two data sets so that similar variables wouldn't automatically merge. The similar variables have already been merged on a one-by-one basis in order to conserve the integrity of the data as data from one data set was considered more "credible" than data from the other data set. The 1s are just leftover from the merger.

novinosrin
Tourmaline | Level 20
data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;
data want;
call missing(min_seq,max_tum,n1);
do n=1 by 1 until(last.mrn1);
	set have;
	by mrn1 DateOfSurgery1 notsorted;
	if first.DateOfSurgery1 then n1+1;
	min_seq=min(SequenceNum1,min_seq);
	max_tum=max(TumorSizecm1,max_tum);
end;
do until(last.mrn1);
	set have;
	by mrn1 DateOfSurgery1 notsorted;
if n>1 and n1=1 and  TumorSizecm1=max_tum then output;
else if n>1 and n1>1 and  SequenceNum1=min_seq then output;
else if n=1 then output;
end;
drop n: min_seq max_tum;
run;
lady8506
Quartz | Level 8

Everyone's suggestions aren't working correctly yet, so....

 

How about something like this? I haven't finished the code, but I got it started. Right now there is an error and I don't know what it means, but I'll post it below.

proc sql;
	DELETE FROM practice
		WHERE COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) = TumorSizecm1;
QUIT;

Error message:


7589 proc sql;
7590 DELETE FROM practice
7591 WHERE COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) =
7591! TumorSizecm1;
ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.
7592 QUIT;

novinosrin
Tourmaline | Level 20

Are you deleting records where this condition

COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) = TumorSizecm1;

 

is true?

 

If so, your expected output  at the very top  is contradicitng the latest post of yours 

 

Take a look at 

DateOfSurgery1 SequenceNum1 MRN1 TumorSizecm1
12OCT2018:00:00:00 2 223XXX 4
12OCT2018:00:00:00 1 223XXX 2.5

If you are deleting the bold row, why is that in the expected output?

lady8506
Quartz | Level 8

Actually, that was sort of a cut and paste of your code that didn't work. What your code did was replace all of the tumor sizes with the MAX tumor size between two records of the same patient. 

 

My original logic still applies, the observation associated with the minimum tumor size is the one that needs to be deleted if the MRNs and dates of surgeries are the same.

novinosrin
Tourmaline | Level 20

so, how come I am able to match your expected output

Here i tested again,

this I added one more to the sample at the top 223XXU with count of mrn at 1 meaning 1 time patient to make sure those are not deleted

data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXU	5
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;



proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;

OUTPUT:

 

SAS Output

The SAS System

DateOfSurgery1 SequenceNum1 MRN1 TumorSizecm1
12OCT2018:00:00:00 2 223XXU 5.0
12OCT2018:00:00:00 2 223XXX 4.0
14SEP2013:00:00:00 2 328XXX 2.0
20AUG2011:00:00:00 1 336XXX 4.5
06MAY2002:00:00:00 1 337XXX .
30AUG2012:00:00:00 1 350XXX 1.0
19NOV2006:00:00:00 3 354XXX 1.0
17FEB2014:00:00:00 1 355XXX 2.8
10DEC2013:00:00:00 1 359XXX 2.2
27MAR2015:00:00:00 1 368XXX 0.9
lady8506
Quartz | Level 8

I don't know what to tell you. WHen I run the following code to be sure your code actually deleted the files I wanted:

 

PROC SQL;
	SELECT DateOfSurgery1, SequenceNum, MRN1, TumorSizecm1
	FROM practice
	GROUP BY MRN1
	HAVING COUNT(MRN1) > 1;
QUIT;

I get the following truncated results:

SAS Output

DateOfSurgery1 SequenceNum1 MRN1 TumorSizecm1
20NOV2008:00:00:0020036XXXX2
20NOV2008:00:00:0010036XXXX2
21DEC2011:00:00:0020204XXXX1.2
21DEC2011:00:00:0010204XXXX1.2
27OCT2009:00:00:0021010XXXX0.5
27OCT2009:00:00:0011010XXXX0.5
14JUL2005:00:00:0012281XXXX4.8
14JUL2005:00:00:00.2281XXXX4.8
09OCT2007:00:00:0052304XXXX0.1
09OCT2007:00:00:0052304XXXX0.1
11OCT2012:00:00:0022401XXXX1.2
11OCT2012:00:00:0012401XXXX1.2
13SEP2004:00:00:0052816XXXX2.8
 

 

lady8506
Quartz | Level 8

Actually, give me just a second to check the data again, a little bit more thoroughly this time. I apologize.

lady8506
Quartz | Level 8

Okay, I have checked things more thoroughly this time and your code is working just like it should. It turns out I have patients with more than one record and those records have the same tumor size as well. I will have to look into these as to why. I thought I'd deleted all duplicates, so it may be multiple tumors or it may not. 

 

Thank you for your helpful code.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 7931 views
  • 0 likes
  • 3 in conversation