AZFXL
Calcite | Level 5

Dear SAS community, 

 

proc sql;
  update libname.dataset1 as u
    set columnc=(select columnc from libname.dataset2 as n
                 where u.columna=n.columna and u.columnb=n.columnb);
quit;

I was wondering whether there is a more efficient way to update one dataset from another. The PROC SQL code above meets my need, but the runtime is far too long, probably because of the size of the datasets: dataset1 has around 3-5 million records, while dataset2 has about 1 million.

 

Thank you very much in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star


View solution in original post

5 REPLIES
SASJedi
Ammonite | Level 13

As long as both datasets are sorted by COLUMNA and COLUMNB, try this:

data libname.dataset1;
   update libname.dataset1 libname.dataset2 (keep=columna columnb columnc);
   by columna columnb;
run;
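
If either dataset is not already in that order (the question doesn't say), a PROC SORT beforehand would be needed before the UPDATE/BY step, a sketch:

proc sort data=libname.dataset1;
   by columna columnb;
run;
proc sort data=libname.dataset2;
   by columna columnb;
run;

Sorting the larger dataset1 is the more expensive of the two, which is worth weighing against the PROC SQL runtime.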
Check out my Jedi SAS Tricks for SAS Users
AZFXL
Calcite | Level 5

I have more than 100 columns in dataset1. Does that mean I have to list all 100 columns in the KEEP= option?

Tom
Super User

You should KEEP only the variables from the second dataset that are (1) needed to do the merge or (2) that you want to update.

 

You can include as many variables from the second dataset as you want to update. But if the list of variables in the second dataset that you do not want to include in the result is shorter than the list you do want, use the DROP= dataset option instead of the KEEP= dataset option.

 

You might also want to use the IN= dataset option so you can avoid adding new observations for combinations of the BY variables that only appear in the second dataset.

 

data want;
  merge dataset1 (in=in1) dataset2 (keep=by1 by2 var1);
  by by1 by2;
  if in1;
run;
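
For the DROP= alternative mentioned above, a comparable sketch (VAR_X and VAR_Y are hypothetical names for the few variables you do not want to bring in from the second dataset):

data want;
  merge dataset1 (in=in1) dataset2 (drop=var_x var_y);
  by by1 by2;
  if in1;   /* keep only BY combinations that exist in dataset1 */
run;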
SASJedi
Ammonite | Level 13

No, you don't have to list the columns in the KEEP= option if they are all part of the update. However, in the dataset containing the updates (dataset2), only the columns you want to update should have values in them. If a column contains a missing value in the update dataset, the corresponding column in the master (dataset1) will not be modified.
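A minimal sketch of that behavior, reusing the dataset and column names from the question and assuming both datasets are sorted by the key columns:

data libname.dataset1;
   update libname.dataset1 libname.dataset2;
   by columna columnb;
run;

Any variable that is non-missing on a matching transaction observation replaces the master value, while missing values leave the master value unchanged, so unrelated columns need not be listed in KEEP=.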

Check out my Jedi SAS Tricks for SAS Users
mkeintz
PROC Star

If the data are not sorted, you may want to use a hash object with all the observations from DATASET2 that have non-missing values of COLUMNC.

 

In addition, if DATASET2 is small and DATASET1 is large, you might save a lot of disk output activity by updating in place, via the MODIFY statement in a DATA step.  This avoids re-writing all the observations in DATASET1.

 

data libname.dataset1;
  modify libname.dataset1;   /* update observations in place instead of re-writing the file */
  if _n_=1 then do;          /* load the lookup table once, on the first iteration */
    declare hash h (dataset:'libname.dataset2 (where=(not missing(columnc)))');
      h.definekey('columna','columnb');
      h.definedata('columnc');
      h.definedone();
  end;
  /* find() returns 0 on a key match and copies columnc into the PDV; */
  /* replace then rewrites the current observation in place           */
  if h.find()=0 then replace;
run;

 

I used the "not missing(columnc)" filter to replicate the behavior of the UPDATE statement, i.e. to prevent COLUMNC values in the master dataset from being replaced by missing values.

 

Edited note: This assumes that DATASET2 is a subset of DATASET1. It also assumes one observation per COLUMNA/COLUMNB combination in DATASET2.
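
One way to check that one-observation-per-key assumption before loading the hash is to list any COLUMNA/COLUMNB pairs that occur more than once in DATASET2, for example:

proc sql;
  select columna, columnb, count(*) as n_obs
  from libname.dataset2
  group by columna, columnb
  having count(*) > 1;
quit;

If this returns no rows, each key maps to a single COLUMNC value and the hash load is unambiguous.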

 




Discussion stats
  • 5 replies
  • 2869 views
  • 4 likes
  • 4 in conversation