BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JBOrlov
Calcite | Level 5

Hello all,

I managed to resolve my first (harder) query, I'm now stuck which this one, which should be easy!

I have a list which contains ~50 variables per observation. Some observations (which I'm usinag as the ID) are duplicates, sorted by observation/ID.

I'm wanting to delete all pairs (i.e both entires) leaving me with only the observations which only appear once i.e were not one of a pair.

I've scoured the web and found nothing! But loads about removing duplicates to leave one of the duplicate pair, which isn't what I'm after!

Any help is truly appreciated, it should be so easy I'll kick myself.

Many thanks,

1 ACCEPTED SOLUTION

Accepted Solutions
afhood
Calcite | Level 5

proc sql;

select *

from table

where ID_columns not in (

     select

     ID_columns

     from (

          select

          ID_columns

          , count(*) as records

          from table

          group by ID_columns

     )

     where records > 1

);

quit;run;

View solution in original post

9 REPLIES 9
afhood
Calcite | Level 5

proc sql;

select *

from table

where ID_columns not in (

     select

     ID_columns

     from (

          select

          ID_columns

          , count(*) as records

          from table

          group by ID_columns

     )

     where records > 1

);

quit;run;

JBOrlov
Calcite | Level 5

Thanks,

Could I be a fool and ask you to spell it out, I'm having a try but getting a few errors (cannot find work.data.table etc)

Thanks!

afhood
Calcite | Level 5

Replace ID_columns with the columns that make up your id you mentioned.

Replace table with the dataset you are using (eg. work.some_ds or whatever).

I don't see any specific details about the dataset you are using in this post.

c36911
Calcite | Level 5

/* maybe a sample dataset in the example would help */

data shoes;

set sashelp.shoes;id_columns = _n_;

run;

data shoes1; set sashelp.shoes(obs=11) ;

id_columns = _n_;run;

data shoes1;

set shoes shoes1;

run;

proc sql;

create table shoes2 as select *

from shoes1 where ID_columns not in (     select     ID_columns     from (           select           ID_columns           , count(*) as records         

from shoes1         

group by ID_columns     )   

where records > 1 );

quit;

run;

data ck; set shoes2;

run;

afhood
Calcite | Level 5

How about this? A little cleanup.

/* maybe a sample dataset in the example would help */

data shoes;

set sashelp.shoes;id_columns = _n_;

run;

data shoes1; set sashelp.shoes(obs=11) ;

id_columns = _n_;run;

data shoes1;

set shoes shoes1;

run;

proc sql;

    

  select          

  region

  , product

  , subsidiary

  , max(ID_columns) as id_columns

  from shoes1        

  group by

  region

  , product

  , subsidiary

;

quit;

proc sql;

create table shoes2 as select *

from shoes1

where id_columns in (

  select          

  max(ID_columns) as id_columns

  from shoes1        

  group by

  region

  , product

  , subsidiary

);

quit;

run;

data ck; set shoes2;

run;

Vince28_Statcan
Quartz | Level 8

I struggle to understand how your dataset is built esp with the naming of ID as Observation but anyway from what I get

you have a dataset, let's call it "have" as by the standards around the forums. It is already sorted by a variable called ID and you have duplicate IDs. You wish to subset to have only records that have no duplicates? If so, and again, this relies on the existing sortation, you can use by processing to trivialize the job.

data want;

     set have;

     by ID;

     if first.id and last.id then output;

run;

The above outputs a record to the dataset WANT only if it is simultaneously the first and the last value of a by-group effectively only outputting IDs that have no duplicates. It is also significantly faster than query on a subquery with count() from SQL if you are not running on a DBMS

Vincent

Haikuo
Onyx | Level 15

Just for the sake of SQL, subquery is not needed, but I am not sure how that affects performance:

proc sql;

  create table want as

    select * from have

   group by id

   having count(*) <2;

quit;

Haikuo

afhood
Calcite | Level 5

This won't work. You will be completely removing observations with more than 1 record. You want to keep one observation.

By the way, you could also use proc sort and dedupkey.

Haikuo
Onyx | Level 15

Not sure about OP's real intention, but this is the quote from the initial post, maybe the course has been shifted down the line?

"I'm wanting to delete all pairs (i.e both entires) leaving me with only the observations which only appear once i.e were not one of a pair."

You are right, it will remove both, because that is exactly what OP desires, or at least appeared to be so, judging from the above statement. Well, at the same time, I agree this may NOT be a case that is 'beyond reasonable doubt", per se Smiley Happy.

Haikuo

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 5336 views
  • 2 likes
  • 5 in conversation