AZFXL
Calcite | Level 5

Dear SAS community, 

 

proc sql;
  update libname.dataset1 as u
    set columnc=(select columnc from libname.dataset2 as n
                 where u.columna=n.columna and u.columnb=n.columnb);
quit;

I was wondering whether there is a more efficient way to update one dataset from another. The PROC SQL code above meets my need, but the runtime is far too long, probably because of the size of the datasets: dataset1 has around 3-5 million records, while dataset2 has about 1 million.

 

Thank you very much in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star


View solution in original post

5 REPLIES
SASJedi
Ammonite | Level 13

As long as both datasets are sorted by COLUMNA and COLUMNB, try this:

data libname.dataset1;
   update libname.dataset1 libname.dataset2 (keep=columna columnb columnc);
   by columna columnb;
run;
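
If either dataset is not already in that order (the question doesn't say), a PROC SORT beforehand would be needed before the UPDATE/BY step, a sketch:

proc sort data=libname.dataset1;
   by columna columnb;
run;
proc sort data=libname.dataset2;
   by columna columnb;
run;

Sorting the larger dataset1 is the more expensive of the two, which is worth weighing against the PROC SQL runtime.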
Check out my Jedi SAS Tricks for SAS Users
AZFXL
Calcite | Level 5

I have more than 100 columns in dataset1. Does that mean I have to list all 100 columns in the KEEP= option?

Tom
Super User

You should KEEP only the variables from the second dataset that are (1) needed to do the merge or (2) that you want to update.

 

You can include as many variables from the second dataset as you want to update. But if the list of variables in the second dataset that you do not want to include in the result is shorter than the list you do want, use the DROP= dataset option instead of the KEEP= dataset option.

 

You might also want to use the IN= dataset option so you can avoid adding new observations for combinations of the BY variables that only appear in the second dataset.

 

data want;
  merge dataset1 (in=in1) dataset2 (keep=by1 by2 var1);
  by by1 by2;
  if in1;
run;
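
For the DROP= alternative mentioned above, a comparable sketch (VAR_X and VAR_Y are hypothetical names for the few variables you do not want to bring in from the second dataset):

data want;
  merge dataset1 (in=in1) dataset2 (drop=var_x var_y);
  by by1 by2;
  if in1;   /* keep only BY combinations that exist in dataset1 */
run;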
SASJedi
Ammonite | Level 13

No, you don't have to list the columns in the KEEP= option if they are all part of the update. However, in the dataset containing the updates (dataset2), only the columns you want to update should have values in them. If a column contains a missing value in the update dataset, the corresponding column in the master (dataset1) will not be modified.
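A minimal sketch of that behavior, reusing the dataset and column names from the question and assuming both datasets are sorted by the key columns:

data libname.dataset1;
   update libname.dataset1 libname.dataset2;
   by columna columnb;
run;

Any variable that is non-missing on a matching transaction observation replaces the master value, while missing values leave the master value unchanged, so unrelated columns need not be listed in KEEP=.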

Check out my Jedi SAS Tricks for SAS Users
mkeintz
PROC Star

If the data are not sorted, you may want to use a hash object with all the observations from DATASET2 that have non-missing values of COLUMNC.

 

In addition, if DATASET2 is small and DATASET1 is large, you might save a lot of disk output activity by updating in place, via the MODIFY statement in a DATA step.  This avoids re-writing all the observations in DATASET1.

 

data libname.dataset1;
  modify libname.dataset1;   /* update observations in place instead of re-writing the file */
  if _n_=1 then do;          /* load the lookup table once, on the first iteration */
    declare hash h (dataset:'libname.dataset2 (where=(not missing(columnc)))');
      h.definekey('columna','columnb');
      h.definedata('columnc');
      h.definedone();
  end;
  /* find() returns 0 on a key match and copies columnc into the PDV; */
  /* replace then rewrites the current observation in place           */
  if h.find()=0 then replace;
run;

 

I used the "not missing(columnc)" filter to replicate the behavior of the UPDATE statement, i.e. to prevent COLUMNC values in the master dataset from being replaced by missing values.

 

Edited note: This assumes that DATASET2 is a subset of DATASET1. It also assumes one observation per COLUMNA/COLUMNB combination in DATASET2.
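
One way to check that one-observation-per-key assumption before loading the hash is to list any COLUMNA/COLUMNB pairs that occur more than once in DATASET2, for example:

proc sql;
  select columna, columnb, count(*) as n_obs
  from libname.dataset2
  group by columna, columnb
  having count(*) > 1;
quit;

If this returns no rows, each key maps to a single COLUMNC value and the hash load is unambiguous.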

 




Discussion stats
  • 5 replies
  • 2869 views
  • 4 likes
  • 4 in conversation