topic Re: remove samples that are repeated in SAS Programming

remove samples that are repeated

y_fu — Thu, 23 Aug 2018 17:24:54 GMT

I want to remove the samples that are repeated. For example, if I have data like:

Sample v1

A 1

B 1

B 2

C 1

I want the output like:

Sample v1

A 1

C 1

So not just remove the duplicate (I know how to do that).

Re: remove samples that are repeated

Reeza — Thu, 23 Aug 2018 17:30:02 GMT

So you still want the records but the values to be erased?

You can set them to missing using CALL MISSING() and the same technique you would use to remove duplicates.

data want;
set have;
by sample;

if not (first.sample and last.sample) then call missing(sample, v1);

run;

Re: remove samples that are repeated

novinosrin — Thu, 23 Aug 2018 17:32:39 GMT

Are you looking to "remove" or assign missing values to the group that has more than one record?

Re: remove samples that are repeated

y_fu — Thu, 23 Aug 2018 17:41:48 GMT

Thank you for the help, I want to remove them, not changing to missing calls.

Based on your code, I can just change

if not (first.sample and last.sample) then call missing(sample, v1);

if not (first.sample and last.sample) then delete;

Works perfect, thanks again.

Re: remove samples that are repeated

Jagadishkatam — Thu, 23 Aug 2018 17:45:58 GMT

data have;
input Sample:$ v1;
cards;  
A   1
B   1
B   2
C   1
;

proc sql;
create table want as select sample,sum(v1) as sum, v1 from have group by sample having sum=1;
quit;

Re: remove samples that are repeated

novinosrin — Thu, 23 Aug 2018 18:04:13 GMT

Hi @Jagadishkatam I do like the sql approach possibly even extending the sort to be performed intrinsically at in database level however I'm afraid logic of your code will produce erroneous results in case of the just an addition of one more group D as illustrated below. On a sorted dataset, @Reeza solution tweaked by OP @y_fu is simplest

data have;
input Sample:$ v1;
cards;  
A   1
B   1
B   2
C   1
D   0
D   1
;
dm log 'clear';
proc sql;
create table want_yours as 
select sample,sum(v1) as sum, v1 
from have 
group by sample 
having sum=1;
quit;
data want_reeza_and_OP;
set have;
by sample;
if not (first.sample and last.sample) then delete;
run;

proc sql;
create table want_mine as 
select * 
from have 
group by sample 
having count(sample)=1;
quit;

My 2 cents for what it's worth & Regards!

Re: remove samples that are repeated

novinosrin — Thu, 23 Aug 2018 18:12:47 GMT

More notes, Sample being sorted The performance of data step approach should essentially beat the SQL in my opinion for the reason "remerging is an overhead or in other words an extra pass as I believe first and last pointers are quicker as opposed to group, count , remerge and filter in sql.

Re: remove samples that are repeated

novinosrin — Thu, 23 Aug 2018 18:45:52 GMT

And if your dataset isn't sorted, Hash is handy

Fun:

data have;
input Sample:$ v1;
cards;  
A   1
B   2
C   1
D   0
B   1
D   1
;
data _null_;
if _n_=1 then do;
dcl hash h(ordered:'y');
 h.definekey  ("sample") ;
 h.definedata ("sample","v1",'_N_') ;
 h.definedone () ;
end;
set have end=lr;
if h.check() ne 0 then do; _N_=1; h.replace();end;
else do;_N_=_N_+1; h.replace();end;
if lr then h.output(dataset:'want(where=(_N_=1))');
run;