Hello,
Pls can someone help me to delete subjects that have duplicates; not the duplicates but any subject that has duplicates recordings (variable is hx_60), that subject must be deleted.
The prog below does not work for me but only deletes duplicates but not the subject having duplicates;
proc sort data=krag.px_60 out=krag.px_60a noduprecs;
by subject_ID hx_60;
run;
Then this is pretty trivial, use BY group processing.
proc sort data=have;
by ID;
run;
data want;
set have;
by ID:
if first.id and last.id then output;
run;
The record can only be the first and last if its a single record.
@Mystik wrote:
Hello,
Pls can someone help me to delete subjects that have duplicates; not the duplicates but any subject that has duplicates recordings (variable is hx_60), that subject must be deleted.
The prog below does not work for me but only deletes duplicates but not the subject having duplicates;
proc sort data=krag.px_60 out=krag.px_60a noduprecs;
by subject_ID hx_60;
run;
There are a couple of ways your requirement may be interpreted. It may help to provide a small example of the data, you would only need the subject_id and hx_60 values, include some records that should be removed and some that shouldn't. Then show what the desired output for that example should look like.
So if subject 1 has 5 observations and two or more of them have the same value of HX_60 you want to remove all 5 observations?
And if subject 2 has 5 observations with 5 different values of HX_60 you want to keep all of them?
What about missing values of HC_60? Do you want to delete subjects with more than one observations with a missing value. What if there are two missing values, but one is normal missing value and the other is a special missing value?
Anyway you could try this.
proc sort data=krag.px_60 out=krag.px_60a ;
by subject_ID hx_60;
run;
data krag_px_60a ;
do until (last.subject_id);
set krag_px_60a ;
by subject_ID hx_60;
flag=min(flag,first.hx_60);
end;
do until (last.subject_id);
set krag_px_60a ;
by subject_ID hx_60;
if not flag then output;
end;
run;
The FIRST.HX_60 variable will be true (=1) when it is the first observation for that value of HX_60 (for this SUBJECT_ID). So if the MIN() of all of them is 1 then all of them are 1. So when FLAG=0 (FALSE) it means that there was at least one duplicate HX_60 value and you want to remove ALL of the observations for that subject.
PS The NODUPREC options is almost never what you want to use with PROC SORT. Especially if the BY statement does not included every variable in the dataset.
Thanks.
Actually there are no missing values;
for instance this is how the data looks like;
ID indexdate Grp 1 Grp2 hxdate hx_60
T01 11Jun2006 OCS OCS 11Jun2006 1
T01 11Jun2006 OCS ** 21Jul2006 1 **=MISSING
T01 11Jun2006 OCS OCS 16Sep2006 1
T02 06Jan2010 BASA BASA 06Jan2010 1
T03 12May2013 LOMA LOMA 12May2013 1
T04 17Sep2007 ABAS ABAS 17Sep2007 1
T04 17Sep2007 ABAS ** 19Oct2007 1
T04 17Sep2007 ABAS ** . 11Dec2007 1
As depicted above, Subject with ID=T01 has 3 duplicates and needs to be excluded, i.e delete T01 and not the duplicates. Also T04
I'm having trouble understanding what you mean. Do you want delete the whole observation or the value that is in ID?
Here's deleting the value of ID which has duplicates:
data have;
length id $3. grp1 $4. grp2 $4.;
input ID $ indexdate :date9. Grp1 $ Grp2 $ hxdate :date9. hx_60;
format indexdate hxdate date9.;
datalines;
T01 11Jun2006 OCS OCS 11Jun2006 1
T01 11Jun2006 OCS . 21Jul2006 1
T01 11Jun2006 OCS OCS 16Sep2006 1
T02 06Jan2010 BASA BASA 06Jan2010 1
T03 12May2013 LOMA LOMA 12May2013 1
T04 17Sep2007 ABAS ABAS 17Sep2007 1
T04 17Sep2007 ABAS . 19Oct2007 1
T04 17Sep2007 ABAS . 11Dec2007 1
;
run;
proc sql;
create table counts as
select id, count(hx_60) as cnt
from have
group by id
having cnt>1
;quit;
data want;
merge counts(in=cnts drop=cnt) have(in=hve);
by id;
if cnts then id='';
run;
-unison
i have over 30,000 subjects with over 120, 000 observations and 15 variables.
Pls your first data step with the datalines wouldn't work.
I gave the above dataline just to give a brief description of the data.
if you kindly give me another data steps to run the first step.
thank you.
@Mystik wrote:
i have over 30,000 subjects with over 120, 000 observations and 15 variables.
Pls your first data step with the datalines wouldn't work.
I gave the above dataline just to give a brief description of the data.
if you kindly give me another data steps to run the first step.
thank you.
The purpose of the datalines is to create sample data ONLY. You would replace references to HAVE, with whatever your data set name is instead.
So what would be the expected output if this was the input data set?
@Mystik wrote:
Thanks.
Actually there are no missing values;
for instance this is how the data looks like;
ID indexdate Grp 1 Grp2 hxdate hx_60
T01 11Jun2006 OCS OCS 11Jun2006 1
T01 11Jun2006 OCS ** 21Jul2006 1 **=MISSING
T01 11Jun2006 OCS OCS 16Sep2006 1
T02 06Jan2010 BASA BASA 06Jan2010 1
T03 12May2013 LOMA LOMA 12May2013 1
T04 17Sep2007 ABAS ABAS 17Sep2007 1
T04 17Sep2007 ABAS ** 19Oct2007 1
T04 17Sep2007 ABAS ** . 11Dec2007 1
As depicted above, Subject with ID=T01 has 3 duplicates and needs to be excluded, i.e delete T01 and not the duplicates. Also T04
sorry this is a data with over 120,000 observations and 16 variables.
the expected out put is to work with subjects (such as T02 and T03) that do not have duplicates.
This stage is the exclusion stage before the incoming logistic analyses.
Then this is pretty trivial, use BY group processing.
proc sort data=have;
by ID;
run;
data want;
set have;
by ID:
if first.id and last.id then output;
run;
The record can only be the first and last if its a single record.
WORKED!!!
THANKS
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.