Solved: Re: Removing duplicate pairs i.e keeping only unique values that weren...

JBOrlov · Posted 11-14-2013 08:15 AM

Hello all,

I managed to resolve my first (harder) query, I'm now stuck which this one, which should be easy!

I have a list which contains ~50 variables per observation. Some observations (which I'm usinag as the ID) are duplicates, sorted by observation/ID.

I'm wanting to delete all pairs (i.e both entires) leaving me with only the observations which only appear once i.e were not one of a pair.

I've scoured the web and found nothing! But loads about removing duplicates to leave one of the duplicate pair, which isn't what I'm after!

Any help is truly appreciated, it should be so easy I'll kick myself.

Many thanks,

afhood · Posted 11-14-2013 08:25 AM

proc sql;

select *

from table

where ID_columns not in (

select

ID_columns

from (

select

ID_columns

, count(*) as records

from table

group by ID_columns

)

where records > 1

);

quit;run;

View solution in original post

afhood · Posted 11-14-2013 08:25 AM

proc sql;

select *

from table

where ID_columns not in (

select

ID_columns

from (

select

ID_columns

, count(*) as records

from table

group by ID_columns

)

where records > 1

);

quit;run;

JBOrlov · Posted 11-14-2013 08:40 AM

Thanks,

Could I be a fool and ask you to spell it out, I'm having a try but getting a few errors (cannot find work.data.table etc)

Thanks!

afhood · Posted 11-14-2013 08:44 AM

Replace ID_columns with the columns that make up your id you mentioned.

Replace table with the dataset you are using (eg. work.some_ds or whatever).

I don't see any specific details about the dataset you are using in this post.

c36911 · Posted 11-14-2013 03:58 PM

/* maybe a sample dataset in the example would help */

data shoes;

set sashelp.shoes;id_columns = _n_;

run;

data shoes1; set sashelp.shoes(obs=11) ;

id_columns = _n_;run;

data shoes1;

set shoes shoes1;

run;

proc sql;

create table shoes2 as select *

from shoes1 where ID_columns not in ( select ID_columns from ( select ID_columns , count(*) as records

from shoes1

group by ID_columns )

where records > 1 );

quit;

run;

data ck; set shoes2;

run;

afhood · Posted 11-15-2013 10:59 AM

How about this? A little cleanup.

/* maybe a sample dataset in the example would help */

data shoes;

set sashelp.shoes;id_columns = _n_;

run;

data shoes1; set sashelp.shoes(obs=11) ;

id_columns = _n_;run;

data shoes1;

set shoes shoes1;

run;

proc sql;

select

region

, product

, subsidiary

, max(ID_columns) as id_columns

from shoes1

group by

region

, product

, subsidiary

;

quit;

proc sql;

create table shoes2 as select *

from shoes1

where id_columns in (

select

max(ID_columns) as id_columns

from shoes1

group by

region

, product

, subsidiary

);

quit;

run;

data ck; set shoes2;

run;

Vince28_Statcan · Posted 11-15-2013 01:26 PM

I struggle to understand how your dataset is built esp with the naming of ID as Observation but anyway from what I get

you have a dataset, let's call it "have" as by the standards around the forums. It is already sorted by a variable called ID and you have duplicate IDs. You wish to subset to have only records that have no duplicates? If so, and again, this relies on the existing sortation, you can use by processing to trivialize the job.

data want;

set have;

by ID;

if first.id and last.id then output;

run;

The above outputs a record to the dataset WANT only if it is simultaneously the first and the last value of a by-group effectively only outputting IDs that have no duplicates. It is also significantly faster than query on a subquery with count() from SQL if you are not running on a DBMS

Vincent

Haikuo · Posted 11-15-2013 01:56 PM

Just for the sake of SQL, subquery is not needed, but I am not sure how that affects performance:

proc sql;

create table want as

select * from have

group by id

having count(*) <2;

quit;

Haikuo

afhood · Posted 11-18-2013 01:52 PM

This won't work. You will be completely removing observations with more than 1 record. You want to keep one observation.

By the way, you could also use proc sort and dedupkey.

Haikuo · Posted 11-18-2013 02:02 PM

Not sure about OP's real intention, but this is the quote from the initial post, maybe the course has been shifted down the line?

"I'm wanting to delete all pairs (i.e both entires) leaving me with only the observations which only appear once i.e were not one of a pair."

You are right, it will remove both, because that is exactly what OP desires, or at least appeared to be so, judging from the above statement. Well, at the same time, I agree this may NOT be a case that is 'beyond reasonable doubt", per se .

Haikuo

Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Re: Removing duplicate pairs i.e keeping only unique values that weren't part of a pair

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!