Solved: Re: Find variables having count > 1

lady8506 · Posted 11-05-2018 10:08 AM

For the following problem I'm having, I'm not sure where to even begin when it comes to the code, so I don't have much code to present here, but I will describe my problem.

1. I need to find all patients who have the same MRN, then look at their date of surgeries. If they are listed more than once in the dataset by MRN, and they have the same date of surgery, this means they had more than one tumor and both tumors were removed on the same day. I need to remove the record with the lower tumor size.

2. If the patient is in the dataset more than once (their MRN will be in there more than once), but their date of surgeries are different, I then need to remove the observation(s) with the higher sequence number(s).

This is for analysis purposes. I will be running things like PROC FREQ and PROC LOGISTIC on this data set. Can anyone help, please?

DateOfSurgery1	SequenceNum1	MRN1	TumorSizecm1
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9

The expected dataset should look like this:

DateOfSurgery1	SequenceNum1	MRN1	TumorSizecm1
12OCT2018:00:00:00	2	223XXX	4
14SEP2013:00:00:00	2	328XXX	2
20AUG2011:00:00:00	1	336XXX	4.5
06MAY2002:00:00:00	1	337XXX	.
30AUG2012:00:00:00	1	350XXX	1
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	1	368XXX	0.9

novinosrin · Posted 11-05-2018 01:48 PM

so, how come I am able to match your expected output

Here i tested again,

this I added one more to the sample at the top 223XXU with count of mrn at 1 meaning 1 time patient to make sure those are not deleted

data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXU	5
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;



proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;

OUTPUT:

SAS Output

The SAS System

DateOfSurgery1	SequenceNum1	MRN1	TumorSizecm1
12OCT2018:00:00:00	2	223XXU	5.0
12OCT2018:00:00:00	2	223XXX	4.0
14SEP2013:00:00:00	2	328XXX	2.0
20AUG2011:00:00:00	1	336XXX	4.5
06MAY2002:00:00:00	1	337XXX	.
30AUG2012:00:00:00	1	350XXX	1.0
19NOV2006:00:00:00	3	354XXX	1.0
17FEB2014:00:00:00	1	355XXX	2.8
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	1	368XXX	0.9

View solution in original post

novinosrin · Posted 11-05-2018 10:32 AM

data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;

proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
order by mrn1,SequenceNum1 desc;
quit;

lady8506 · Posted 11-05-2018 10:43 AM

Novinosrin,

Your solution gets rid of everyone in the data set who is also in there only once. I don't need to do anything with those one-time cancer patients, They need to stay in the data set. I only need to work on the patients who are in the data set more than once and filter them by tumor size and/or date of surgery. So in my mind, it's more like an IF statement. IF the patient is in the data set more than once (by MRN), then DO the following:...

Does that make sense?

novinosrin · Posted 11-05-2018 10:49 AM

ok

try adding

count(mrn1)=1 in having clause

i.e

proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;

novinosrin · Posted 11-05-2018 10:54 AM

The revised having

/*Condition 1*/

having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or

/*Condition 2*/
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or

/*Condition 3= one time patients undisturbed in the dataset*/
count(mrn1)=1

Basically , I didn't think of the one time patients as your sample didn't have any and so my objective was to match your sample output. The revised one should handle all 3

ballardw · Posted 11-05-2018 10:50 AM

Proc sort data=have;
   by mrn1 dateofsurgery1 tumorsizecm1;
run;

data want;
   set have;
   by mrn1 dateofsurgery1 tumorsizecm1;
   if last.dateofsurgery1;
run;

But this requires the date to be an actual date value and not a character value.

Also when I see so many variables ending in 1 is suspect there may be similar ending in 2, 3 or 4. Which means that we may not have a complete description of your data.

lady8506 · Posted 11-05-2018 12:44 PM

The 1's in the variable names are only due to a merger between two data sets so that similar variables wouldn't automatically merge. The similar variables have already been merged on a one-by-one basis in order to conserve the integrity of the data as data from one data set was considered more "credible" than data from the other data set. The 1s are just leftover from the merger.

novinosrin · Posted 11-05-2018 12:17 PM

data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;
data want;
call missing(min_seq,max_tum,n1);
do n=1 by 1 until(last.mrn1);
	set have;
	by mrn1 DateOfSurgery1 notsorted;
	if first.DateOfSurgery1 then n1+1;
	min_seq=min(SequenceNum1,min_seq);
	max_tum=max(TumorSizecm1,max_tum);
end;
do until(last.mrn1);
	set have;
	by mrn1 DateOfSurgery1 notsorted;
if n>1 and n1=1 and  TumorSizecm1=max_tum then output;
else if n>1 and n1>1 and  SequenceNum1=min_seq then output;
else if n=1 then output;
end;
drop n: min_seq max_tum;
run;

lady8506 · Posted 11-05-2018 01:26 PM

Everyone's suggestions aren't working correctly yet, so....

How about something like this? I haven't finished the code, but I got it started. Right now there is an error and I don't know what it means, but I'll post it below.

proc sql;
	DELETE FROM practice
		WHERE COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) = TumorSizecm1;
QUIT;

Error message:

7589 proc sql;
7590 DELETE FROM practice
7591 WHERE COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) =
7591! TumorSizecm1;
ERROR: Summary functions are restricted to the SELECT and HAVING clauses only.
7592 QUIT;

novinosrin · Posted 11-05-2018 01:32 PM

Are you deleting records where this condition

COUNT(MRN1) > 1 AND COUNT(DISTINCT DateOfSurgery1) = 1 AND MAX(TumorSizecm1) = TumorSizecm1;

is true?

If so, your expected output at the very top is contradicitng the latest post of yours

Take a look at

DateOfSurgery1	SequenceNum1	MRN1	TumorSizecm1
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5

If you are deleting the bold row, why is that in the expected output?

lady8506 · Posted 11-05-2018 01:42 PM

Actually, that was sort of a cut and paste of your code that didn't work. What your code did was replace all of the tumor sizes with the MAX tumor size between two records of the same patient.

My original logic still applies, the observation associated with the minimum tumor size is the one that needs to be deleted if the MRNs and dates of surgeries are the same.

novinosrin · Posted 11-05-2018 01:48 PM

so, how come I am able to match your expected output

Here i tested again,

this I added one more to the sample at the top 223XXU with count of mrn at 1 meaning 1 time patient to make sure those are not deleted

data have;
input DateOfSurgery1 :datetime20.	SequenceNum1	MRN1 $	TumorSizecm1;
format DateOfSurgery1 datetime20.;
cards;
12OCT2018:00:00:00	2	223XXU	5
12OCT2018:00:00:00	2	223XXX	4
12OCT2018:00:00:00	1	223XXX	2.5
14SEP2013:00:00:00	2	328XXX	2
14SEP2013:00:00:00	1	328XXX	0.6
10JAN2013:00:00:00	2	336XXX	4.5
20AUG2011:00:00:00	1	336XXX	4.5
17JUL2015:00:00:00	2	337XXX	4
06MAY2002:00:00:00	1	337XXX	.
09AUG2017:00:00:00	3	350XXX	1.2
04OCT2017:00:00:00	4	350XXX	0.3
09JUL2011:00:00:00	2	350XXX	2.5
30AUG2012:00:00:00	1	350XXX	1
07JUN2016:00:00:00	4	354XXX	1.2
19NOV2006:00:00:00	3	354XXX	1
17FEB2014:00:00:00	1	355XXX	2.8
08JAN2013:00:00:00	2	355XXX	1.2
10DEC2013:00:00:00	2	359XXX	0.7
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	2	368XXX	0.6
27MAR2015:00:00:00	1	368XXX	0.9
;



proc sql;
create table want as
select *
from have
group by mrn1
having (count(mrn1)>1 and count(distinct DateOfSurgery1)=1) and max(TumorSizecm1)=TumorSizecm1
or
(count(mrn1)>1 and count(distinct DateOfSurgery1)>1) and SequenceNum1=min(SequenceNum1)
or
count(mrn1)=1
order by mrn1,SequenceNum1 desc;
quit;

OUTPUT:

SAS Output

The SAS System

DateOfSurgery1	SequenceNum1	MRN1	TumorSizecm1
12OCT2018:00:00:00	2	223XXU	5.0
12OCT2018:00:00:00	2	223XXX	4.0
14SEP2013:00:00:00	2	328XXX	2.0
20AUG2011:00:00:00	1	336XXX	4.5
06MAY2002:00:00:00	1	337XXX	.
30AUG2012:00:00:00	1	350XXX	1.0
19NOV2006:00:00:00	3	354XXX	1.0
17FEB2014:00:00:00	1	355XXX	2.8
10DEC2013:00:00:00	1	359XXX	2.2
27MAR2015:00:00:00	1	368XXX	0.9

lady8506 · Posted 11-05-2018 01:56 PM

I don't know what to tell you. WHen I run the following code to be sure your code actually deleted the files I wanted:

PROC SQL;
	SELECT DateOfSurgery1, SequenceNum, MRN1, TumorSizecm1
	FROM practice
	GROUP BY MRN1
	HAVING COUNT(MRN1) > 1;
QUIT;

I get the following truncated results:

SAS Output

DateOfSurgery1 SequenceNum1 MRN1 TumorSizecm1

20NOV2008:00:00:00	2	0036XXXX	2
20NOV2008:00:00:00	1	0036XXXX	2
21DEC2011:00:00:00	2	0204XXXX	1.2
21DEC2011:00:00:00	1	0204XXXX	1.2
27OCT2009:00:00:00	2	1010XXXX	0.5
27OCT2009:00:00:00	1	1010XXXX	0.5
14JUL2005:00:00:00	1	2281XXXX	4.8
14JUL2005:00:00:00	.	2281XXXX	4.8
09OCT2007:00:00:00	5	2304XXXX	0.1
09OCT2007:00:00:00	5	2304XXXX	0.1
11OCT2012:00:00:00	2	2401XXXX	1.2
11OCT2012:00:00:00	1	2401XXXX	1.2
13SEP2004:00:00:00	5	2816XXXX	2.8

lady8506 · Posted 11-05-2018 02:04 PM

Actually, give me just a second to check the data again, a little bit more thoroughly this time. I apologize.

lady8506 · Posted 11-05-2018 02:11 PM

Okay, I have checked things more thoroughly this time and your code is working just like it should. It turns out I have patients with more than one record and those records have the same tumor size as well. I will have to look into these as to why. I thought I'd deleted all duplicates, so it may be multiple tumors or it may not.

Thank you for your helpful code.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away