DATA Step, Macro, Functions and more

Deleting duplicates based on multiple criteria

Reply
Occasional Contributor
Posts: 8

Deleting duplicates based on multiple criteria

[ Edited ]

hello,

 

there are few edits in the below code 

 

I am stuck with something please help me out with this..

I have a table as-

 

wr_no    rfa_no   stk_ref     status

1               1          x             fp

1               2          x             fp

1               1          y             fp

1               1          x             nu

2               1

2               1          x             fp

1               1          x             gd

1               2

1               1          x             gp

1               1          x             gc

 

I want to create  a flag as 0 for records having status 'nu' i.e for every  'nu' of  stk_ref i want to flag 0  'fp' but not ('gd','gp','gc') with reference to my stk_ref ,wr_no and rfa_no.so basically here i want to flag 0 1st and 4th record 

because for the same stk_ref i got 'nu' which has unique  wr_no and unique rfa_no.there are  blank stk_ref and status also for which the flag should be zero.

 

And the table contains 5billion rows.

 

 

Frequent Contributor
Posts: 104

Re: Deleting duplicates based on multiple criteria

Hi,

 

something like that?

DATA have;
input wr_no rfa_no stk_ref $ status $;
datalines;
1 1 x fp
1 2 x fp
1 1 y fp
1 1 x nu
2 1 x fp
;
run;

PROC SQL;
   CREATE TABLE help AS
      SELECT a.*,b.flag
      FROM have a
      LEFT JOIN
        (SELECT *,1 AS flag
         FROM have
         WHERE status eq 'nu') b 
      ON a.wr_no eq b.wr_no
      AND a.rfa_no eq b.rfa_no
      AND a.stk_ref eq b.stk_ref
   ;
quit;

data want;
   set help;
   where flag ne 1;
   drop flag;
run;

Cheers

 

________________________

- That still only counts as one -

Occasional Contributor
Posts: 8

Re: Deleting duplicates based on multiple criteria

Amazing!!thankyou so much..

just 1 que can you please explain me how for the other status='FP' it takes flag=1.?

 

Frequent Contributor
Posts: 104

Re: Deleting duplicates based on multiple criteria

(SELECT *,1 AS flag FROM have WHERE status eq 'nu') b
is a subquery table where every row gets flag=1 and names it 'b' henceforth.

SELECT a.*,b.flag
this selects all columns from 'a' and the column 'flag' from 'b' (which value is 1 everywhere)

The join performed is done by comparing the wr_no, rft_no and stk_ref rows. For each match between table have(a) and the subquery(b), the value of the column flag is appended, that is appended with '1'

Hope this is clear.
There are many join explanations on the web.
Maybe you can refer to:
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

Cheers
________________________

- That still only counts as one -

Super User
Posts: 6,964

Re: Deleting duplicates based on multiple criteria

With a data step:

data have;
input wr_no rfa_no stk_ref $ status $;
cards;
1               1          x              fp
1               2          x              fp
1               1           y             fp
1               1           x             nu
2               1           x             fp
;
run;

proc sort data=have;
by wr_no rfa_no stk_ref;
run;

data want;
merge
  have (in=a)
  have (in=b rename=(status=_status) where=(_status = 'nu'))
;
by wr_no rfa_no stk_ref;
if a and not b;
drop _status;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 8

Re: Deleting duplicates based on multiple criteria

Thankyou the code works fine.

Frequent Contributor
Posts: 104

Re: Deleting duplicates based on multiple criteria

[ Edited ]

Hi,

 

with 5 billions rows, hash may be a much faster than my SQL:

data _NULL_;
   if 0 then set have;
   
   if _N_ eq 1 then do;
      declare hash hnu(dataset: 'work.have (where=(status eq "nu")) duplicate: "add"');
      hnu.defineKey('wr_no','rfa_no','stk_ref');
      hnu.defineDone();
      
      declare hash w(ordered: 'ascending');
      w.defineKey('wr_no','rfa_no','stk_ref');
      w.defineData('wr_no','rfa_no','stk_ref','status');
      w.defineDone();
   end;
   set have end=eof;

   if hnu.find()>0;
   if w.add()>=0;
   if eof then rc=w.output(dataset: 'work.want');
run;

 

________________________

- That still only counts as one -

Super User
Posts: 6,964

Re: Deleting duplicates based on multiple criteria


Oligolas wrote:

Hi,

 

with 5 billions rows, hash may be a much faster than my SQL:

 

 


Only if you have enough memory available:

assuming that 10 % of observations have status = 'nu', you get 26 * 500,000,000 ~ 12.4 Gbytes raw size for the hash object, without the hash tree itself.

Operations on datasets, OTOH, scale until you run out of disk space, which comes much later.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 9,687

Re: Deleting duplicates based on multiple criteria

data have;
input wr_no rfa_no stk_ref $ status $;
cards;
1               1          x              fp
1               2          x              fp
1               1           y             fp
1               1           x             nu
2               1           x             fp
;
run;
proc sql;
select *
 from have
  where catx('|',wr_no ,rfa_no ,stk_ref)
   not in (select catx('|',wr_no ,rfa_no ,stk_ref) from have where status='nu');
quit;
Ask a Question
Discussion stats
  • 8 replies
  • 183 views
  • 6 likes
  • 4 in conversation