Solved: Re: remove samples that are repeated

y_fu · Posted 08-23-2018 01:24 PM

I want to remove the samples that are repeated. For example, if I have data like:

Sample v1

A 1

B 1

B 2

C 1

I want the output like:

Sample v1

A 1

C 1

So not just remove the duplicate (I know how to do that).

Reeza · Posted 08-23-2018 01:30 PM

So you still want the records but the values to be erased?

You can set them to missing using CALL MISSING() and the same technique you would use to remove duplicates.

data want;
set have;
by sample;

if not (first.sample and last.sample) then call missing(sample, v1);

run;

View solution in original post

Reeza · Posted 08-23-2018 01:30 PM

So you still want the records but the values to be erased?

You can set them to missing using CALL MISSING() and the same technique you would use to remove duplicates.

data want;
set have;
by sample;

if not (first.sample and last.sample) then call missing(sample, v1);

run;

y_fu · Posted 08-23-2018 01:41 PM

Thank you for the help, I want to remove them, not changing to missing calls.

Based on your code, I can just change

if not (first.sample and last.sample) then call missing(sample, v1);

to

if not (first.sample and last.sample) then delete;

Works perfect, thanks again.

novinosrin · Posted 08-23-2018 01:32 PM

Are you looking to "remove" or assign missing values to the group that has more than one record?

Jagadishkatam · Posted 08-23-2018 01:45 PM

data have;
input Sample:$ v1;
cards;  
A   1
B   1
B   2
C   1
;

proc sql;
create table want as select sample,sum(v1) as sum, v1 from have group by sample having sum=1;
quit;

Thanks,
Jag

novinosrin · Posted 08-23-2018 02:04 PM

Hi @Jagadishkatam I do like the sql approach possibly even extending the sort to be performed intrinsically at in database level however I'm afraid logic of your code will produce erroneous results in case of the just an addition of one more group D as illustrated below. On a sorted dataset, @Reeza solution tweaked by OP @y_fu is simplest

data have;
input Sample:$ v1;
cards;  
A   1
B   1
B   2
C   1
D   0
D   1
;
dm log 'clear';
proc sql;
create table want_yours as 
select sample,sum(v1) as sum, v1 
from have 
group by sample 
having sum=1;
quit;
data want_reeza_and_OP;
set have;
by sample;
if not (first.sample and last.sample) then delete;
run;

proc sql;
create table want_mine as 
select * 
from have 
group by sample 
having count(sample)=1;
quit;

My 2 cents for what it's worth & Regards!

novinosrin · Posted 08-23-2018 02:12 PM

More notes, Sample being sorted The performance of data step approach should essentially beat the SQL in my opinion for the reason "remerging is an overhead or in other words an extra pass as I believe first and last pointers are quicker as opposed to group, count , remerge and filter in sql.

novinosrin · Posted 08-23-2018 02:45 PM

And if your dataset isn't sorted, Hash is handy

Fun:

data have;
input Sample:$ v1;
cards;  
A   1
B   2
C   1
D   0
B   1
D   1
;
data _null_;
if _n_=1 then do;
dcl hash h(ordered:'y');
 h.definekey  ("sample") ;
 h.definedata ("sample","v1",'_N_') ;
 h.definedone () ;
end;
set have end=lr;
if h.check() ne 0 then do; _N_=1; h.replace();end;
else do;_N_=_N_+1; h.replace();end;
if lr then h.output(dataset:'want(where=(_N_=1))');
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away