Solved: Re: Delete subjects that have duplicates; not the duplicates

Mystik · Posted 11-27-2019 05:17 PM

Hello,

Pls can someone help me to delete subjects that have duplicates; not the duplicates but any subject that has duplicates recordings (variable is hx_60), that subject must be deleted.

The prog below does not work for me but only deletes duplicates but not the subject having duplicates;

proc sort data=krag.px_60 out=krag.px_60a noduprecs;
by subject_ID hx_60;
run;

Reeza · Posted 11-27-2019 06:21 PM

Then this is pretty trivial, use BY group processing.

proc sort data=have;
by ID;
run;

data want;
set have;
by ID:

if first.id and last.id then output;

run;

The record can only be the first and last if its a single record.

View solution in original post

ballardw · Posted 11-27-2019 05:23 PM

@Mystik wrote:

Hello,

Pls can someone help me to delete subjects that have duplicates; not the duplicates but any subject that has duplicates recordings (variable is hx_60), that subject must be deleted.

The prog below does not work for me but only deletes duplicates but not the subject having duplicates;

proc sort data=krag.px_60 out=krag.px_60a noduprecs;
by subject_ID hx_60;
run;

There are a couple of ways your requirement may be interpreted. It may help to provide a small example of the data, you would only need the subject_id and hx_60 values, include some records that should be removed and some that shouldn't. Then show what the desired output for that example should look like.

Tom · Posted 11-27-2019 05:25 PM

So if subject 1 has 5 observations and two or more of them have the same value of HX_60 you want to remove all 5 observations?

And if subject 2 has 5 observations with 5 different values of HX_60 you want to keep all of them?

What about missing values of HC_60? Do you want to delete subjects with more than one observations with a missing value. What if there are two missing values, but one is normal missing value and the other is a special missing value?

Anyway you could try this.

proc sort data=krag.px_60 out=krag.px_60a ;
  by subject_ID hx_60;
run;
data krag_px_60a ;
do until (last.subject_id);
  set krag_px_60a ;
  by subject_ID hx_60;
  flag=min(flag,first.hx_60);
end;
do until (last.subject_id);
  set krag_px_60a ;
  by subject_ID hx_60;
  if not flag then output;
end;
run;

The FIRST.HX_60 variable will be true (=1) when it is the first observation for that value of HX_60 (for this SUBJECT_ID). So if the MIN() of all of them is 1 then all of them are 1. So when FLAG=0 (FALSE) it means that there was at least one duplicate HX_60 value and you want to remove ALL of the observations for that subject.

PS The NODUPREC options is almost never what you want to use with PROC SORT. Especially if the BY statement does not included every variable in the dataset.

Mystik · Posted 11-27-2019 05:47 PM

Thanks.

Actually there are no missing values;

for instance this is how the data looks like;

ID indexdate Grp 1 Grp2 hxdate hx_60

T01 11Jun2006 OCS OCS 11Jun2006 1

T01 11Jun2006 OCS ** 21Jul2006 1 **=MISSING

T01 11Jun2006 OCS OCS 16Sep2006 1

T02 06Jan2010 BASA BASA 06Jan2010 1

T03 12May2013 LOMA LOMA 12May2013 1

T04 17Sep2007 ABAS ABAS 17Sep2007 1

T04 17Sep2007 ABAS ** 19Oct2007 1

T04 17Sep2007 ABAS ** . 11Dec2007 1

As depicted above, Subject with ID=T01 has 3 duplicates and needs to be excluded, i.e delete T01 and not the duplicates. Also T04

unison · Posted 11-27-2019 06:09 PM

I'm having trouble understanding what you mean. Do you want delete the whole observation or the value that is in ID?

Here's deleting the value of ID which has duplicates:

data have;
length id $3. grp1 $4. grp2 $4.;
input ID $ indexdate :date9. Grp1 $ Grp2 $ hxdate :date9. hx_60;
format indexdate hxdate date9.;
datalines;
T01 11Jun2006 OCS OCS 11Jun2006 1
T01 11Jun2006 OCS . 21Jul2006 1
T01 11Jun2006 OCS OCS 16Sep2006 1
T02 06Jan2010 BASA BASA 06Jan2010 1
T03 12May2013 LOMA LOMA 12May2013 1
T04 17Sep2007 ABAS ABAS 17Sep2007 1
T04 17Sep2007 ABAS . 19Oct2007 1
T04 17Sep2007 ABAS . 11Dec2007 1
;
run;

proc sql;
create table counts as 
select id, count(hx_60) as cnt
from have
group by id
having cnt>1
;quit;

data want;
merge counts(in=cnts drop=cnt) have(in=hve);
by id;
if cnts then id='';
run;

-unison

Mystik · Posted 11-27-2019 06:22 PM

i have over 30,000 subjects with over 120, 000 observations and 15 variables.

Pls your first data step with the datalines wouldn't work.

I gave the above dataline just to give a brief description of the data.

if you kindly give me another data steps to run the first step.

thank you.

Reeza · Posted 11-27-2019 06:33 PM

@Mystik wrote:

i have over 30,000 subjects with over 120, 000 observations and 15 variables.

Pls your first data step with the datalines wouldn't work.

I gave the above dataline just to give a brief description of the data.

if you kindly give me another data steps to run the first step.

thank you.

The purpose of the datalines is to create sample data ONLY. You would replace references to HAVE, with whatever your data set name is instead.

Reeza · Posted 11-27-2019 06:14 PM

So what would be the expected output if this was the input data set?

@Mystik wrote:

Thanks.

Actually there are no missing values;

for instance this is how the data looks like;

ID indexdate Grp 1 Grp2 hxdate hx_60

T01 11Jun2006 OCS OCS 11Jun2006 1

T01 11Jun2006 OCS ** 21Jul2006 1 **=MISSING

T01 11Jun2006 OCS OCS 16Sep2006 1

T02 06Jan2010 BASA BASA 06Jan2010 1

T03 12May2013 LOMA LOMA 12May2013 1

T04 17Sep2007 ABAS ABAS 17Sep2007 1

T04 17Sep2007 ABAS ** 19Oct2007 1

T04 17Sep2007 ABAS ** . 11Dec2007 1

As depicted above, Subject with ID=T01 has 3 duplicates and needs to be excluded, i.e delete T01 and not the duplicates. Also T04

Mystik · Posted 11-27-2019 06:17 PM

sorry this is a data with over 120,000 observations and 16 variables.

the expected out put is to work with subjects (such as T02 and T03) that do not have duplicates.

This stage is the exclusion stage before the incoming logistic analyses.

Reeza · Posted 11-27-2019 06:21 PM

Then this is pretty trivial, use BY group processing.

proc sort data=have;
by ID;
run;

data want;
set have;
by ID:

if first.id and last.id then output;

run;

The record can only be the first and last if its a single record.

Mystik · Posted 11-27-2019 06:33 PM

WORKED!!!

THANKS

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away