Hello all,
I managed to resolve my first (harder) query, I'm now stuck which this one, which should be easy!
I have a list which contains ~50 variables per observation. Some observations (which I'm usinag as the ID) are duplicates, sorted by observation/ID.
I'm wanting to delete all pairs (i.e both entires) leaving me with only the observations which only appear once i.e were not one of a pair.
I've scoured the web and found nothing! But loads about removing duplicates to leave one of the duplicate pair, which isn't what I'm after!
Any help is truly appreciated, it should be so easy I'll kick myself.
Many thanks,
proc sql;
select *
from table
where ID_columns not in (
select
ID_columns
from (
select
ID_columns
, count(*) as records
from table
group by ID_columns
)
where records > 1
);
quit;run;
proc sql;
select *
from table
where ID_columns not in (
select
ID_columns
from (
select
ID_columns
, count(*) as records
from table
group by ID_columns
)
where records > 1
);
quit;run;
Thanks,
Could I be a fool and ask you to spell it out, I'm having a try but getting a few errors (cannot find work.data.table etc)
Thanks!
Replace ID_columns with the columns that make up your id you mentioned.
Replace table with the dataset you are using (eg. work.some_ds or whatever).
I don't see any specific details about the dataset you are using in this post.
/* maybe a sample dataset in the example would help */
data shoes;
set sashelp.shoes;id_columns = _n_;
run;
data shoes1; set sashelp.shoes(obs=11) ;
id_columns = _n_;run;
data shoes1;
set shoes shoes1;
run;
proc sql;
create table shoes2 as select *
from shoes1 where ID_columns not in ( select ID_columns from ( select ID_columns , count(*) as records
from shoes1
group by ID_columns )
where records > 1 );
quit;
run;
data ck; set shoes2;
run;
How about this? A little cleanup.
/* maybe a sample dataset in the example would help */
data shoes;
set sashelp.shoes;id_columns = _n_;
run;
data shoes1; set sashelp.shoes(obs=11) ;
id_columns = _n_;run;
data shoes1;
set shoes shoes1;
run;
proc sql;
select
region
, product
, subsidiary
, max(ID_columns) as id_columns
from shoes1
group by
region
, product
, subsidiary
;
quit;
proc sql;
create table shoes2 as select *
from shoes1
where id_columns in (
select
max(ID_columns) as id_columns
from shoes1
group by
region
, product
, subsidiary
);
quit;
run;
data ck; set shoes2;
run;
I struggle to understand how your dataset is built esp with the naming of ID as Observation but anyway from what I get
you have a dataset, let's call it "have" as by the standards around the forums. It is already sorted by a variable called ID and you have duplicate IDs. You wish to subset to have only records that have no duplicates? If so, and again, this relies on the existing sortation, you can use by processing to trivialize the job.
data want;
set have;
by ID;
if first.id and last.id then output;
run;
The above outputs a record to the dataset WANT only if it is simultaneously the first and the last value of a by-group effectively only outputting IDs that have no duplicates. It is also significantly faster than query on a subquery with count() from SQL if you are not running on a DBMS
Vincent
Just for the sake of SQL, subquery is not needed, but I am not sure how that affects performance:
proc sql;
create table want as
select * from have
group by id
having count(*) <2;
quit;
Haikuo
This won't work. You will be completely removing observations with more than 1 record. You want to keep one observation.
By the way, you could also use proc sort and dedupkey.
Not sure about OP's real intention, but this is the quote from the initial post, maybe the course has been shifted down the line?
"I'm wanting to delete all pairs (i.e both entires) leaving me with only the observations which only appear once i.e were not one of a pair."
You are right, it will remove both, because that is exactly what OP desires, or at least appeared to be so, judging from the above statement. Well, at the same time, I agree this may NOT be a case that is 'beyond reasonable doubt", per se .
Haikuo
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.