Re: Deleting duplicates based on multiple criteria

nid197 · Posted 07-19-2017 04:38 AM

Hello,

There are a few edits in the code below.

I am stuck on something; please help me out with this.

I have a table like this:

wr_no  rfa_no  stk_ref  status
1      1       x        fp
1      2       x        fp
1      1       y        fp
1      1       x        nu
2      1
2      1       x        fp
1      1       x        gd
1      2
1      1       x        gp
1      1       x        gc

I want to create a flag that is 0 for records with status 'nu'. For every 'nu' record I also want to flag as 0 the 'fp' records (but not 'gd', 'gp' or 'gc') that have the same stk_ref, wr_no and rfa_no. So here I want to flag the 1st and 4th records as 0, because the 4th record is 'nu' and the 1st record has the same stk_ref, wr_no and rfa_no.

There are also records with a blank stk_ref and status, for which the flag should be 0 as well.

The table contains 5 billion rows.

Oligolas · Posted 07-19-2017 04:57 AM

Hi,

Something like this?

DATA have;
input wr_no rfa_no stk_ref $ status $;
datalines;
1 1 x fp
1 2 x fp
1 1 y fp
1 1 x nu
2 1 x fp
;
run;

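PROC SQL;
   /* left join keeps every row of have; b.flag is 1 on a key match, missing otherwise */
   CREATE TABLE help AS
      SELECT a.*,b.flag
      FROM have a
      LEFT JOIN
        /* b: only the 'nu' rows, each tagged with flag=1 */
        (SELECT *,1 AS flag
         FROM have
         WHERE status eq 'nu') b 
      ON a.wr_no eq b.wr_no
      AND a.rfa_no eq b.rfa_no
      AND a.stk_ref eq b.stk_ref
   ;
quit;

/* keep only the rows whose key did not match a 'nu' row */
data want;
   set help;
   where flag ne 1;
   drop flag;
run;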

Cheers

________________________
- Cheers -

nid197 · Posted 07-19-2017 06:48 AM

Amazing! Thank you so much.

Just one question: can you please explain how the other rows with status 'fp' get flag=1?

Oligolas · Posted 07-19-2017 07:13 AM

(SELECT *,1 AS flag FROM have WHERE status eq 'nu') b
is a subquery that returns only the 'nu' rows, adds flag=1 to each of them, and is referred to as 'b' from then on.

SELECT a.*,b.flag
selects all columns from 'a' plus the column 'flag' from 'b' (whose value is 1 everywhere).

The join compares wr_no, rfa_no and stk_ref. Every row of have (a) that matches a row of the subquery (b) on all three columns gets flag=1, whatever its own status is; that is why the 'fp' row with the same keys as the 'nu' row is flagged too. Rows without a match get a missing flag.

Hope this is clear.
There are many join explanations on the web.
Maybe you can refer to:
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
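
To see the flag values described above for yourself, here is a minimal sketch. It assumes the 5-row have dataset from the first reply still exists in WORK; the ORDER BY is only added here so that the printed order is predictable.

```sas
/* Assumes work.have (the 5 sample rows) already exists. */
proc sql;
   select a.*, b.flag
   from have a
   left join
     (select *, 1 as flag
      from have
      where status eq 'nu') b
   on  a.wr_no   eq b.wr_no
   and a.rfa_no  eq b.rfa_no
   and a.stk_ref eq b.stk_ref
   order by a.wr_no, a.rfa_no, a.stk_ref, a.status;
quit;

/* Printed result for the sample data:
   wr_no  rfa_no  stk_ref  status  flag
   1      1       x        fp      1
   1      1       x        nu      1
   1      1       y        fp      .
   1      2       x        fp      .
   2      1       x        fp      .
*/
```

Only the two rows with the key (1, 1, x) carry flag=1, so those are the rows the final data step drops.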

Cheers


Kurt_Bremser · Posted 07-19-2017 05:48 AM

With a data step:

data have;
input wr_no rfa_no stk_ref $ status $;
cards;
1               1          x              fp
1               2          x              fp
1               1           y             fp
1               1           x             nu
2               1           x             fp
;
run;

proc sort data=have;
by wr_no rfa_no stk_ref;
run;

data want;
merge
  have (in=a)
  have (in=b rename=(status=_status) where=(_status = 'nu'))
;
by wr_no rfa_no stk_ref;
if a and not b;
drop _status;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

nid197 · Posted 07-19-2017 06:44 AM

Thank you, the code works fine.

Oligolas · Posted 07-19-2017 08:22 AM

Hi,

with 5 billion rows, a hash may be much faster than my SQL:

data want;
   if 0 then set have;   /* defines the host variables for the hash at compile time */

   if _N_ eq 1 then do;
      /* load only the keys of the 'nu' rows; a repeated key is stored once */
      declare hash hnu(dataset: 'work.have (where=(status eq "nu"))');
      hnu.defineKey('wr_no','rfa_no','stk_ref');
      hnu.defineDone();
   end;

   set have;

   /* check() returns 0 when the key exists, so keep only rows without a 'nu' match */
   if hnu.check() ne 0;
run;


Kurt_Bremser · Posted 07-19-2017 08:36 AM

@Oligolas wrote:

Hi,

with 5 billion rows, a hash may be much faster than my SQL:

Only if you have enough memory available:

assuming that 10 % of observations have status = 'nu', you get 26 bytes * 500,000,000 = 13,000,000,000 bytes, roughly 12.1 GiB, raw size for the hash object, without the hash tree itself.

Operations on datasets, OTOH, scale until you run out of disk space, which comes much later.
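
A rough way to redo this estimate on your own data, as a sketch only: the macro variable names n_nu and key_bytes are made up for this example, and 26 bytes per item is simply the figure used above, not a measured size.

```sas
/* Assumption: 26 bytes per hash item, taken from the estimate above. */
%let key_bytes = 26;

proc sql noprint;
   /* number of rows that would be loaded into the hash;
      format=20. avoids scientific notation for large counts */
   select count(*) format=20. into :n_nu trimmed
   from have
   where status eq 'nu';
quit;

/* raw size only, without the overhead of the hash structure itself */
%put NOTE: &n_nu items x &key_bytes bytes = %sysevalf(&n_nu * &key_bytes / 1024**3) GiB raw;
```

If the number written to the log comes anywhere near the memory available to the SAS session, the sort and merge approach is the safer choice.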


Ksharp · Posted 07-19-2017 09:16 AM

data have;
input wr_no rfa_no stk_ref $ status $;
cards;
1               1          x              fp
1               2          x              fp
1               1           y             fp
1               1           x             nu
2               1           x             fp
;
run;
proc sql;
select *
 from have
  where catx('|',wr_no ,rfa_no ,stk_ref)
   not in (select catx('|',wr_no ,rfa_no ,stk_ref) from have where status='nu');
quit;
