I am running some PROC SQL statements which first alter the table to add a new column and then update that column.
This is taking forever. It is faster to do a SELECT of all the columns, even though there are 50+ of them, and create the new column inside the SELECT statement. This does not make sense to me.
I tried to find an answer to this, and the only thing I found was this comment:
"FYI - SQL Update unfortunately as supported in SAS is almost always terrible. SQL has many uses in SAS, but update should almost never be one, unless it's a single value update that's not sourced from a table. For whatever reason, that always seems to be incredibly slow, even for not-all-that-large datasets, and even for updates that in SQL Server are quite quick. – Joe Jun 10 '15 at 14:27"
From https://stackoverflow.com/questions/30708324/proc-sql-update-efficiency-for-large-datasets
Why is UPDATE so inefficient?
PROC SQL can be slow in general, especially with large data. The reason is that SQL is not the primary language in SAS; it is a sub-component, provided as a procedure with a basic SQL compiler. Base SAS is the programming language, and it is optimised for data processing. It is generally recommended to use Base SAS over SQL, unless there is a specific benefit to using SQL.
As for why: turn on the _method and _tree options and see what the SQL compiler is doing:
http://www2.sas.com/proceedings/sugi30/101-30.pdf
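A minimal sketch of turning those options on (the table names have_a and have_b are illustrative placeholders, not from the original question):

```sas
/* _method and _tree make PROC SQL print its execution plan
   and parse tree to the log, so you can see how it resolves
   the query. Table and column names here are placeholders. */
proc sql _method _tree;
    create table want as
    select a.*, b.extra_col
    from have_a as a
         inner join have_b as b
         on a.id = b.id;
quit;
```

The _method codes in the log (e.g. sqxjm for a sort-merge join) show which internal strategy PROC SQL chose.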
I suspect it is likely reading/writing each observation individually, as there is no PDV. A data step reads each observation into the PDV and then writes it out, which is more efficient. PROC SQL does not have that, and it has none of the tricks that databases use to speed up query execution.
Another point: if you are aiming to use SQL a lot, look at normalising your data. SQL works best with few columns and many rows; 50+ columns is a lot of columns.
Please post your full log, including elapsed times and row counts.
I've had good success getting reasonable performance with SQL Server updates by choosing an appropriate method along with tweaking database update options.
But then that is using the SQL Server parser and pass-through, is it not? My impression of the question was that the OP was using SQL within SAS alone. If it is running on a database then yes, it should be a lot quicker.
Here is an example.
Create a random data set:
data A;
    call streaminit(123);
    do i = 1 to 100000000;
        u = rand("Uniform");
        output;
    end;
run;
A query that creates a new column while writing out a whole new data set:
proc sql;
    create table B as
    select
        *,
        1 as id
    from A;
quit;
Perform the same as above, but using ALTER TABLE and UPDATE:
proc sql;
    alter table A
        add id num;
quit;

proc sql;
    update A
        set id = monotonic();
quit;
Solution 1: 10.2 seconds
Solution 2: 44 seconds
It takes 4x longer, even though the first solution also has to copy all the other columns. The gap also grows with data size: at 100 million rows it is 40 seconds vs 7.5 minutes.
MONOTONIC() is an undocumented, unsupported SAS function. Does it perform similarly with a supported function like RAND()?
I confess I have only ever used your solution 1, so I have never hit the problem in solution 2. Using SET in SAS SQL is seriously limiting, as a separate sub-query is required to populate each SET from another dataset. There are better ways to update in SAS.
I update mainly with DATA steps, and that certainly gives the best performance with SAS data.
I see that my example was flawed, as I used a constant 1 in one example and monotonic() in the other. But you can replace monotonic() with 1 and experience the same slowness.
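For reference, the constant variant of the ALTER/UPDATE pattern would look like this (assuming the data set A created earlier; this is a sketch of the comparison described above, not timed code from the thread):

```sas
/* Same ALTER + UPDATE pattern, but assigning a constant
   instead of monotonic(); per the discussion above, it is
   still far slower than recreating the table with
   CREATE TABLE ... AS SELECT. */
proc sql;
    alter table A
        add id num;
    update A
        set id = 1;
quit;
```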
Would you mind showing a short example of how you update a column in a data step?
The equivalent of the monotonic() example would be:
data a;
    set a;
    id = _n_;
run;
_n_ is an automatic variable holding the data step iteration number, which here corresponds to the observation number.
It is very easy to update, alter, and manipulate data using the data step language; it is, at the end of the day, the basis of the whole SAS system. PROC SQL is a simple ANSI SQL parser accessed through Base SAS and hence is never going to be as good as the base language. Think of it like a VB script generating a C++ program.
I am aware of _n_. But is this really the "proper" way to add a column:
data a; set a; id=_n_; run;
From my understanding, the above rewrites the entire file, which seems inefficient when adding only one column. Or is this the "correct" way to add a new column?
That depends on the scenario; however, in general, yes, that is the proper way to add variables.
In certain circumstances the data step UPDATE statement can be applied:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm
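A minimal sketch of that data step UPDATE statement (the data set names master and trans are illustrative; both data sets must be sorted by the BY variable):

```sas
/* Apply transaction records to a master data set by key.
   By default, missing values in the transaction data set
   do not overwrite the corresponding master values. */
proc sort data=master; by id; run;
proc sort data=trans;  by id; run;

data master;
    update master trans;
    by id;
run;
```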
In big data scenarios the whole programming approach tends to be quite different. However, for most programming, the data step is the way to go. It will also be the most efficient, as it is a one-directional stream. SQL creates intermediate steps where the data is sorted, merged, and otherwise processed, which can lead to many reads and writes. On a proper database there are processes in place to handle this.