DATA Step, Macro, Functions and more

Appending Non-Duplicate based on three variables

Accepted Solution Solved
Reply
Contributor
Posts: 65
Accepted Solution

Appending Non-Duplicate based on three variables

Hello,

 

I've two datasets, Lets say A and B,  there are  215 Variables in each of the dataset and datasize is 90 G.

 

I want to append B dataset into A but not to append the duplicate values which are common based on "Three Variables, not based on all 215 variable".

 

I can't use (syncadd=no uniquesave=yes) and also can't use below code, as it will check all variables for duplicity. 

 


proc sql feedback;
create table lib_w.want
select * from lib_A.dsA
union corresponding
select * from lib_B.dsB
;
quit;

 

I'm thininking to creat new varialbe as  catx(all three variable) in both A and B and then  "A.catx_var ne B.Catx_Var"

 

is there is other way around ?

 

Please help, Thanks in Advance. 

 

 

 

 


Accepted Solutions
Solution
‎06-09-2017 07:10 AM
Contributor
Posts: 65

Re: Appending Non-Duplicate based on three variables

[ Edited ]
Posted in reply to atul_desh

This is how I'm doing it now.. now it is less time consuming and giving what I wanted : 

 


proc sort data=A;
by Var1 Var2 Var3;
run;

 

proc sort data=B;
by Var1 Var2 Var3;
run;


data pre_want;
merge A (in=a) B (in=b);
byVar1 Var2 Var3;
if not a and b;
run;

proc append base=A data=pre_want;
run;

View solution in original post


All Replies
Super User
Super User
Posts: 7,942

Re: Appending Non-Duplicate based on three variables

Posted in reply to atul_desh

So, you have 2 datasets which are 90gb each?  That is a really large amount of data.  Do you want to drop data from A or B based on being in both?  As a suggestion, you could get a distinct list of the 3 variables from the table you want to keep data, and then drop from the other before doing a proc append:

proc sort data=a out=distlist nodupkey;
  by var1 var2 var3;
run;

data _null_;
  set distlist end=last;
  if _n_=1 then call execute('data b_proc; set b;');
  call execute('if var1="'||strip(var1)||'" and var2="'||strip(var2)||'" and var3="'||strip(var3)||'" then delete;');
  if last then call execute('run;');
run;

proc append base=a data=b_proc force;
run;

This will create a big datastep with lots of if statements to drop the data from B based on a distinct list of var1-var3 from A.  

 

 

 

 

Contributor
Posts: 65

Re: Appending Non-Duplicate based on three variables

what if I append all the observation and in last I do nodupkey ?

Super User
Super User
Posts: 7,942

Re: Appending Non-Duplicate based on three variables

Posted in reply to atul_desh

Theres only one way to know, try it.  The proc append literally should just drop the header block and tag the data onto the existing dataset, so that shouldn't matter either way, but the proc sort/datastep is where the real work is being done, so anything to minimize that.

Super User
Posts: 10,020

Re: Appending Non-Duplicate based on three variables

[ Edited ]
Posted in reply to atul_desh

Did you try Hash Table ? Or try SQL.

 

proc sql;
create table only_in_B as
select var1,var2,var3  from B
except
select var1,var2,var3 from A ;
quit;
Contributor
Posts: 65

Re: Appending Non-Duplicate based on three variables

how can use only_in_b for append b to a ??
Solution
‎06-09-2017 07:10 AM
Contributor
Posts: 65

Re: Appending Non-Duplicate based on three variables

[ Edited ]
Posted in reply to atul_desh

This is how I'm doing it now.. now it is less time consuming and giving what I wanted : 

 


proc sort data=A;
by Var1 Var2 Var3;
run;

 

proc sort data=B;
by Var1 Var2 Var3;
run;


data pre_want;
merge A (in=a) B (in=b);
byVar1 Var2 Var3;
if not a and b;
run;

proc append base=A data=pre_want;
run;

PROC Star
Posts: 102

Re: Appending Non-Duplicate based on three variables

Posted in reply to atul_desh

@atul_desh: You could try with a hash table for checking if a row is in table A (assuming your key variables are called VAR1, VAR2 and VAR3):

Data A; /* we are modifying A in place */
  if 0 then modify A; /* we are not reading any obs. here */
  if _N_=1 then do;
    declare hash h(dataset: 'A(keep=var1 var2 var3)');
    rc=h.definekey('var1','var2','var3');
    h.definedone();
    end;
  set B;
  if h.find() then
    output A; /* if key not found, append to A */
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 145 views
  • 0 likes
  • 4 in conversation