I want to remove the samples that are repeated. For example, if I have data like:
Sample v1
A 1
B 1
B 2
C 1
I want the output like:
Sample v1
A 1
C 1
So not just remove the duplicate (I know how to do that).
So you still want the records but the values to be erased?
You can set them to missing using CALL MISSING() and the same technique you would use to remove duplicates.
data want;
set have;
by sample;
if not (first.sample and last.sample) then call missing(sample, v1);
run;
So you still want the records but the values to be erased?
You can set them to missing using CALL MISSING() and the same technique you would use to remove duplicates.
data want;
set have;
by sample;
if not (first.sample and last.sample) then call missing(sample, v1);
run;
Thank you for the help, I want to remove them, not changing to missing calls.
Based on your code, I can just change
if not (first.sample and last.sample) then call missing(sample, v1);
to
if not (first.sample and last.sample) then delete;
Works perfect, thanks again.
Are you looking to "remove" or assign missing values to the group that has more than one record?
data have;
input Sample:$ v1;
cards;
A 1
B 1
B 2
C 1
;
proc sql;
create table want as select sample,sum(v1) as sum, v1 from have group by sample having sum=1;
quit;
Hi @Jagadishkatam I do like the sql approach possibly even extending the sort to be performed intrinsically at in database level however I'm afraid logic of your code will produce erroneous results in case of the just an addition of one more group D as illustrated below. On a sorted dataset, @Reeza solution tweaked by OP @y_fu is simplest
data have;
input Sample:$ v1;
cards;
A 1
B 1
B 2
C 1
D 0
D 1
;
dm log 'clear';
proc sql;
create table want_yours as
select sample,sum(v1) as sum, v1
from have
group by sample
having sum=1;
quit;
data want_reeza_and_OP;
set have;
by sample;
if not (first.sample and last.sample) then delete;
run;
proc sql;
create table want_mine as
select *
from have
group by sample
having count(sample)=1;
quit;
My 2 cents for what it's worth & Regards!
More notes, Sample being sorted The performance of data step approach should essentially beat the SQL in my opinion for the reason "remerging is an overhead or in other words an extra pass as I believe first and last pointers are quicker as opposed to group, count , remerge and filter in sql.
And if your dataset isn't sorted, Hash is handy
Fun:
data have;
input Sample:$ v1;
cards;
A 1
B 2
C 1
D 0
B 1
D 1
;
data _null_;
if _n_=1 then do;
dcl hash h(ordered:'y');
h.definekey ("sample") ;
h.definedata ("sample","v1",'_N_') ;
h.definedone () ;
end;
set have end=lr;
if h.check() ne 0 then do; _N_=1; h.replace();end;
else do;_N_=_N_+1; h.replace();end;
if lr then h.output(dataset:'want(where=(_N_=1))');
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.