Hello,
I have a very large dataset that needs to be updated using another very large reference dataset. I've tried Proc Sql Update but it is taking a very long time. I've come to know that running Proc Sql Update with Call Execute is a better alternative. I would appreciate it if someone can provide the Call Execute codes for the following Proc Sql:
proc sql;
update datawant a
set var1=(select var1 from datahave where a.var=var),
var2=(select var2 from datahave where a.var=var);
quit;
Your log shows this from a data _null_ step creating the call execute:
27 call execute("update work.datawant a set proc_start_date_time1='"||proc_start_date_time1||"' where 27 ! TRANSACTION_ID="||TRANSACTION_ID||";"); 28 if final then call execute('quit;'); 29 run;
Pay very close attention to the bit
set proc_start_date_time1='"
As that is what generates the error shown in the log
+ update work.datawant a set proc_start_date_time1=' 1776505260' where TRA
So the call execute result is telling the update to use the character value within quotes ' 1776505260' . Which is not numeric.
Remove the single quotes from the call execute.
I suspect borrowing some code from an example updating a character variable.
If you are updating a data set where there is a unique key value to match your data on then a datastep update might be faster and would allow updating multiple variables in a single step. For this to work you would need to have the data sorted by that key variable(s), use a by statement with those variables and the base data cannot have any repeats of the key variable combinations.
A very brief program would look like this assuming the data is sorted prior:
data bigdata; update bigdata transactiondata ; by keyvariable; run;
By default the values of variables from the transactiondata set would replace the values of the same named variables on the matching records in bigdata. An option on the UPDATE statement, UPDATEMODE sets whether missing values in the transaction data would replace values in the bigdata set. UPDATEMODE=MISSINGCHECK , the default, prevents missing from replacing the values, NOMISSINGCHECK would allow replace values with missing from the transaction data.
Note that with this approach any other variables in the transaction data will be added to the data so you may need to drop some variables so only the key variable(s) and the desired variables to update are present.
I've come to know that running Proc Sql Update with Call Execute is a better alternative.
I'm not aware of this. It seems to me that the execution time of a PROC SQL call doesn't depend on how the PROC SQL call was initiated, via plain old PROC SQL, or by CALL EXECUTE, or via a macro. Please explain.
I would appreciate it if someone can provide the Call Execute codes for the following Proc Sql:
proc sql;
update datawant a
set var1=(select var1 from datahave where a.var=var),
var2=(select var2 from datahave where a.var=var);
quit;
I see absolutely no reason and no benefit to doing this from CALL EXECUTE, which allows you to dynamically change the code based upon the contents of a data set. I see no need in your code or in your explanation where this dynamic capability is needed.
The dataset that needs updating and the reference dataset have about 4 million rows. When I ran the Proc Sql I let SAS run for about 30min before I stopped it. Although the question have the Proc Sql code for updating of 2 variables, the dataset, in fact, has 35 variables that need updating (mostly date variables).
I first came across the following webpage but I'm having trouble adapting its Call Execute codes to my needs: https://heuristically.wordpress.com/2010/08/06/make-proc-sql-update-faster-sas-call-execute-data-ste...
Thank you
Okay, I did not know that, and still not sure I understand it. However...
I'm having trouble adapting its Call Execute codes to my needs:
As always here in the SAS Community, show us your SASLOG and other indication of problems. We can't help you if you just say you are having trouble, you have to be specific and show us exactly what you did and what isn't working.
I would like to see if the Call Execute works with one variable (date variable) before I add another. Here's the code:
proc sql;
update work.datawant a
set proc_start_date_time1=(select proc_start_date_time1 from work.datahave where a.transaction_id=transaction_id);
quit;
data _null_;
set work.datahave end=final;
if _n_ = 1 then call execute("proc sql ;");
call execute("update work.datawant a set proc_start_date_time1='"||proc_start_date_time1||"' where transaction_id="||transaction_id||";");
if final then call execute('quit;');
run;
I've attached the log.
ERROR: proc_start_date_time1, a numeric column, can only be updated with a numeric expression.
Seems pretty clear to me, the variable proc_start_date_time1 is numeric, and you are trying to insert a character value. You can't do that. You can only insert numeric varlues.
Both are date variables:
Here's the summary table for both datasets:
Your log shows this from a data _null_ step creating the call execute:
27 call execute("update work.datawant a set proc_start_date_time1='"||proc_start_date_time1||"' where 27 ! TRANSACTION_ID="||TRANSACTION_ID||";"); 28 if final then call execute('quit;'); 29 run;
Pay very close attention to the bit
set proc_start_date_time1='"
As that is what generates the error shown in the log
+ update work.datawant a set proc_start_date_time1=' 1776505260' where TRA
So the call execute result is telling the update to use the character value within quotes ' 1776505260' . Which is not numeric.
Remove the single quotes from the call execute.
I suspect borrowing some code from an example updating a character variable.
If you are updating a data set where there is a unique key value to match your data on then a datastep update might be faster and would allow updating multiple variables in a single step. For this to work you would need to have the data sorted by that key variable(s), use a by statement with those variables and the base data cannot have any repeats of the key variable combinations.
A very brief program would look like this assuming the data is sorted prior:
data bigdata; update bigdata transactiondata ; by keyvariable; run;
By default the values of variables from the transactiondata set would replace the values of the same named variables on the matching records in bigdata. An option on the UPDATE statement, UPDATEMODE sets whether missing values in the transaction data would replace values in the bigdata set. UPDATEMODE=MISSINGCHECK , the default, prevents missing from replacing the values, NOMISSINGCHECK would allow replace values with missing from the transaction data.
Note that with this approach any other variables in the transaction data will be added to the data so you may need to drop some variables so only the key variable(s) and the desired variables to update are present.
Thank you very much, ballardw! Removal of the single quotes worked.
If I were to update two variables, the code is this:
data _null_;
set work.datahave end=final;
if _n_ = 1 then call execute("proc sql ;");
call execute("update work.datawant a set proc_start_date_time1="||proc_start_date_time1||" where transaction_id="||transaction_id||";");
call execute("update work.datawant a set proc_start_date_time2="||proc_start_date_time2||" where transaction_id="||transaction_id||";");
if final then call execute('quit;');
run;
When I tried to update the large dataset using Call Execute the SAS server stopped around 4min mark. I'll try to get this sorted and try your data step suggestion as well.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.