BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mystik
Obsidian | Level 7

Hello,

Pls can someone help me to delete subjects that have duplicates; not the duplicates but any subject that has duplicates recordings (variable is hx_60), that subject must be deleted.

 

The prog below does not work for me but only deletes duplicates but not the subject having duplicates;

 

proc sort data=krag.px_60 out=krag.px_60a noduprecs;
by subject_ID hx_60;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Then this is pretty trivial, use BY group processing.

 

proc sort data=have;
by ID;
run;

data want;
set have;
by ID:

if first.id and last.id then output;

run;

The record can only be the first and last if its a single record.

 

View solution in original post

10 REPLIES 10
ballardw
Super User

@Mystik wrote:

Hello,

Pls can someone help me to delete subjects that have duplicates; not the duplicates but any subject that has duplicates recordings (variable is hx_60), that subject must be deleted.

 

The prog below does not work for me but only deletes duplicates but not the subject having duplicates;

 

proc sort data=krag.px_60 out=krag.px_60a noduprecs;
by subject_ID hx_60;
run;


There are a couple of ways your requirement may be interpreted. It may help to provide a small example of the data, you would only need the subject_id and hx_60 values, include some records that should be removed and some that shouldn't. Then show what the desired output for that example should look like.

Tom
Super User Tom
Super User

So if subject 1 has 5 observations and two or more of them have the same value of HX_60 you want to remove all 5 observations?

And if subject 2 has 5 observations with 5 different values of HX_60 you want to keep all of them?

What about missing values of HC_60?  Do you want to delete subjects with more than one observations with a missing value.  What if there are two missing values, but one is normal missing value and the other is a special missing value?

 

Anyway you could try this.

proc sort data=krag.px_60 out=krag.px_60a ;
  by subject_ID hx_60;
run;
data krag_px_60a ;
do until (last.subject_id);
  set krag_px_60a ;
  by subject_ID hx_60;
  flag=min(flag,first.hx_60);
end;
do until (last.subject_id);
  set krag_px_60a ;
  by subject_ID hx_60;
  if not flag then output;
end;
run;

The FIRST.HX_60 variable will be true (=1) when it is the first observation for that value of HX_60 (for this SUBJECT_ID).  So if the MIN() of all of them is 1 then all of them are 1.  So when FLAG=0 (FALSE) it means that there was at least one duplicate HX_60 value and you want to remove ALL of the observations for that subject.

 

PS  The NODUPREC options is almost never what you want to use with PROC SORT. Especially if the BY statement does not included every variable in the dataset.

Mystik
Obsidian | Level 7

Thanks.

Actually there are no missing values;

 

for instance this is how the data looks like;

ID      indexdate           Grp 1      Grp2      hxdate             hx_60

T01   11Jun2006         OCS        OCS      11Jun2006       1

T01   11Jun2006         OCS        **           21Jul2006         1              **=MISSING

T01   11Jun2006         OCS        OCS      16Sep2006       1

T02   06Jan2010         BASA      BASA     06Jan2010       1

T03   12May2013        LOMA     LOMA    12May2013       1

T04   17Sep2007        ABAS      ABAS     17Sep2007       1

T04   17Sep2007        ABAS      **            19Oct2007        1

T04   17Sep2007        ABAS      ** .          11Dec2007       1

 

As depicted above, Subject with ID=T01 has 3 duplicates and needs to be excluded, i.e delete T01 and not the duplicates. Also T04

unison
Lapis Lazuli | Level 10

I'm having trouble understanding what you mean. Do you want delete the whole observation or the value that is in ID?

 

Here's deleting the value of ID which has duplicates:

 

data have;
length id $3. grp1 $4. grp2 $4.;
input ID $ indexdate :date9. Grp1 $ Grp2 $ hxdate :date9. hx_60;
format indexdate hxdate date9.;
datalines;
T01 11Jun2006 OCS OCS 11Jun2006 1
T01 11Jun2006 OCS . 21Jul2006 1
T01 11Jun2006 OCS OCS 16Sep2006 1
T02 06Jan2010 BASA BASA 06Jan2010 1
T03 12May2013 LOMA LOMA 12May2013 1
T04 17Sep2007 ABAS ABAS 17Sep2007 1
T04 17Sep2007 ABAS . 19Oct2007 1
T04 17Sep2007 ABAS . 11Dec2007 1
;
run;

proc sql;
create table counts as 
select id, count(hx_60) as cnt
from have
group by id
having cnt>1
;quit;

data want;
merge counts(in=cnts drop=cnt) have(in=hve);
by id;
if cnts then id='';
run;

-unison

-unison
Mystik
Obsidian | Level 7

i have over 30,000 subjects with over 120, 000 observations and 15 variables.

Pls your first data step with the datalines wouldn't work.

I gave the above dataline just to give a brief description of the data.

if you kindly give me another data steps to run the first step.

thank you.

Reeza
Super User

@Mystik wrote:

i have over 30,000 subjects with over 120, 000 observations and 15 variables.

Pls your first data step with the datalines wouldn't work.

I gave the above dataline just to give a brief description of the data.

if you kindly give me another data steps to run the first step.

thank you.


The purpose of the datalines is to create sample data ONLY. You would replace references to HAVE, with whatever your data set name is instead. 

 

Reeza
Super User

So what would be the expected output if this was the input data set?

 


@Mystik wrote:

Thanks.

Actually there are no missing values;

 

for instance this is how the data looks like;

ID      indexdate           Grp 1      Grp2      hxdate             hx_60

T01   11Jun2006         OCS        OCS      11Jun2006       1

T01   11Jun2006         OCS        **           21Jul2006         1              **=MISSING

T01   11Jun2006         OCS        OCS      16Sep2006       1

T02   06Jan2010         BASA      BASA     06Jan2010       1

T03   12May2013        LOMA     LOMA    12May2013       1

T04   17Sep2007        ABAS      ABAS     17Sep2007       1

T04   17Sep2007        ABAS      **            19Oct2007        1

T04   17Sep2007        ABAS      ** .          11Dec2007       1

 

As depicted above, Subject with ID=T01 has 3 duplicates and needs to be excluded, i.e delete T01 and not the duplicates. Also T04


 

Mystik
Obsidian | Level 7

sorry this is a data with over 120,000 observations and 16 variables.

the expected out put is to work with subjects (such as T02 and T03) that do not have duplicates.

This stage is the exclusion stage before the incoming logistic analyses.

Reeza
Super User

Then this is pretty trivial, use BY group processing.

 

proc sort data=have;
by ID;
run;

data want;
set have;
by ID:

if first.id and last.id then output;

run;

The record can only be the first and last if its a single record.

 

Mystik
Obsidian | Level 7

WORKED!!!

THANKS

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3452 views
  • 1 like
  • 5 in conversation