topic Re: Appending Non-Duplicate based on three variables in SAS Programming

Appending Non-Duplicate based on three variables

atul_desh — Thu, 08 Jun 2017 10:46:15 GMT

Hello,

I've two datasets, Lets say A and B, there are 215 Variables in each of the dataset and datasize is 90 G.

I want to append B dataset into A but not to append the duplicate values which are common based on "Three Variables, not based on all 215 variable".

I can't use (syncadd=no uniquesave=yes) and also can't use below code, as it will check all variables for duplicity.

proc sql feedback;
create table lib_w.want
select * from lib_A.dsA
union corresponding
select * from lib_B.dsB
;
quit;

I'm thininking to creat new varialbe as catx(all three variable) in both A and B and then "A.catx_var ne B.Catx_Var"

is there is other way around ?

Please help, Thanks in Advance.

Re: Appending Non-Duplicate based on three variables

RW9 — Thu, 08 Jun 2017 10:56:18 GMT

So, you have 2 datasets which are 90gb each? That is a really large amount of data. Do you want to drop data from A or B based on being in both? As a suggestion, you could get a distinct list of the 3 variables from the table you want to keep data, and then drop from the other before doing a proc append:

proc sort data=a out=distlist nodupkey;
  by var1 var2 var3;
run;

data _null_;
  set distlist end=last;
  if _n_=1 then call execute('data b_proc; set b;');
  call execute('if var1="'||strip(var1)||'" and var2="'||strip(var2)||'" and var3="'||strip(var3)||'" then delete;');
  if last then call execute('run;');
run;

proc append base=a data=b_proc force;
run;

This will create a big datastep with lots of if statements to drop the data from B based on a distinct list of var1-var3 from A.

Re: Appending Non-Duplicate based on three variables

atul_desh — Thu, 08 Jun 2017 11:06:16 GMT

what if I append all the observation and in last I do nodupkey ?

Re: Appending Non-Duplicate based on three variables

RW9 — Thu, 08 Jun 2017 11:16:49 GMT

Theres only one way to know, try it. The proc append literally should just drop the header block and tag the data onto the existing dataset, so that shouldn't matter either way, but the proc sort/datastep is where the real work is being done, so anything to minimize that.

Re: Appending Non-Duplicate based on three variables

Ksharp — Thu, 08 Jun 2017 12:24:30 GMT

Did you try Hash Table ? Or try SQL.

proc sql;
create table only_in_B as
select var1,var2,var3  from B
except
select var1,var2,var3 from A ;
quit;

Re: Appending Non-Duplicate based on three variables

atul_desh — Thu, 08 Jun 2017 14:01:59 GMT

how can use only_in_b for append b to a ??

Re: Appending Non-Duplicate based on three variables

atul_desh — Fri, 09 Jun 2017 11:10:58 GMT

This is how I'm doing it now.. now it is less time consuming and giving what I wanted :

proc sort data=A;
by Var1 Var2 Var3;
run;

proc sort data=B;
by Var1 Var2 Var3;
run;

data pre_want;
merge A (in=a) B (in=b);
byVar1 Var2 Var3;
if not a and b;
run;

proc append base=A data=pre_want;
run;

Re: Appending Non-Duplicate based on three variables

s_lassen — Fri, 09 Jun 2017 11:27:07 GMT

@atul_desh: You could try with a hash table for checking if a row is in table A (assuming your key variables are called VAR1, VAR2 and VAR3):

Data A; /* we are modifying A in place */
  if 0 then modify A; /* we are not reading any obs. here */
  if _N_=1 then do;
    declare hash h(dataset: 'A(keep=var1 var2 var3)');
    rc=h.definekey('var1','var2','var3');
    h.definedone();
    end;
  set B;
  if h.find() then
    output A; /* if key not found, append to A */
run;