I am running some PROC SQL statements which first alter the table to add a new column and then update that column.
This is taking forever. It is faster to do a SELECT of all the columns, even though there are 50+ of them, and create the new column inside the SELECT statement. This does not make sense to me.
I tried to find an answer to this, and the only thing I found was this comment:
"FYI - SQL Update unfortunately as supported in SAS is almost always terrible. SQL has many uses in SAS, but update should almost never be one, unless it's a single value update that's not sourced from a table. For whatever reason, that always seems to be incredibly slow, even for not-all-that-large datasets, and even for updates that in SQL Server are quite quick. – Joe Jun 10 '15 at 14:27"
From https://stackoverflow.com/questions/30708324/proc-sql-update-efficiency-for-large-datasets
Why is UPDATE so inefficient?
PROC SQL can be slow in general, especially with large data. The reason is that SQL is not the primary language in SAS; it is a sub-component, provided as a procedure with a basic SQL compiler. Base SAS is the programming language, and it is optimised for data processing. It is generally recommended to use Base SAS over SQL, unless there is a specific benefit to using SQL.
As for why: turn on the _method and _tree options and see what the SQL compiler is doing:
http://www2.sas.com/proceedings/sugi30/101-30.pdf
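A minimal sketch of turning those options on (the table names have_a and have_b are illustrative placeholders, not from the original question):

```sas
/* _method and _tree make PROC SQL print its execution plan
   and parse tree to the log, so you can see how it resolves
   the query. Table and column names here are placeholders. */
proc sql _method _tree;
    create table want as
    select a.*, b.extra_col
    from have_a as a
         inner join have_b as b
         on a.id = b.id;
quit;
```

The _method codes in the log (e.g. sqxjm for a sort-merge join) show which internal strategy PROC SQL chose.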
I suspect it is likely reading/writing each observation individually, as there is no PDV. A data step reads each observation into the PDV and then writes it out, which is more efficient. PROC SQL does not have that, and it has none of the tricks that databases use to speed up query execution.
Another point: if you are aiming to use SQL a lot, look at normalising your data. SQL works best with few columns and many rows; 50+ columns is a lot of columns.
Please post your full log, including elapsed times and row counts.
I've had good success getting reasonable performance with SQL Server updates by choosing an appropriate method along with tweaking database update options.
But then that is using the SQL Server parser and pass-through, is it not? My impression of the question was that the OP was using SQL within SAS alone. If it is running on a database then yes, it should be a lot quicker.
Here is an example.
Create a random data set:
data A;
    call streaminit(123);
    do i = 1 to 100000000;
        u = rand("Uniform");
        output;
    end;
run;
A query that creates a new column while writing out a whole new data set:
proc sql;
    create table B as
    select
        *,
        1 as id
    from A;
quit;
Perform the same as above, but using ALTER TABLE and UPDATE:
proc sql;
    alter table A
        add id num;
quit;

proc sql;
    update A
        set id = monotonic();
quit;
Solution 1: 10.2 seconds
Solution 2: 44 seconds
It takes 4x longer, even though the first solution also has to copy all the other columns. The gap also grows with data size: at 100 million rows it is 40 seconds vs 7.5 minutes.
MONOTONIC() is an undocumented, unsupported SAS function. Does it perform similarly with a supported function like RAND()?
I confess I have only ever used your solution 1, so I have never hit the problem in solution 2. Using SET in SAS SQL is seriously limiting, as a separate sub-query is required to populate each SET from another dataset. There are better ways to update in SAS.
I update mainly with DATA steps, and that certainly gives the best performance with SAS data.
I see that my example was flawed, as I used a constant 1 in one example and monotonic() in the other. But you can replace monotonic() with 1 and experience the same slowness.
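For reference, the constant variant of the ALTER/UPDATE pattern would look like this (assuming the data set A created earlier; this is a sketch of the comparison described above, not timed code from the thread):

```sas
/* Same ALTER + UPDATE pattern, but assigning a constant
   instead of monotonic(); per the discussion above, it is
   still far slower than recreating the table with
   CREATE TABLE ... AS SELECT. */
proc sql;
    alter table A
        add id num;
    update A
        set id = 1;
quit;
```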
Would you mind showing a short example of how you update a column in a data step?
The equivalent of the monotonic() example would be:
data a;
    set a;
    id = _n_;
run;
_n_ is an automatic variable holding the data step iteration number, which here corresponds to the observation number.
It is very easy to update, alter, and manipulate data using the data step language; it is, at the end of the day, the basis of the whole SAS system. PROC SQL is a simple ANSI SQL parser accessed through Base SAS and hence is never going to be as good as the base language. Think of it like a VB script generating a C++ program.
I am aware of _n_. But is this really the "proper" way to add a column:
data a; set a; id=_n_; run;
From my understanding, the above rewrites the entire file, which seems inefficient when adding only one column. Or is this the "correct" way to add a new column?
That depends on the scenario; however, in general, yes, that is the proper way to add variables.
In certain circumstances the data step UPDATE statement can be applied:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm
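A minimal sketch of that data step UPDATE statement (the data set names master and trans are illustrative; both data sets must be sorted by the BY variable):

```sas
/* Apply transaction records to a master data set by key.
   By default, missing values in the transaction data set
   do not overwrite the corresponding master values. */
proc sort data=master; by id; run;
proc sort data=trans;  by id; run;

data master;
    update master trans;
    by id;
run;
```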
In big data scenarios the whole programming approach tends to be quite different. However, for most programming, the data step is the way to go. It will also be the most efficient, as it is a one-directional stream. SQL creates intermediate steps where the data is sorted, merged, and otherwise processed, which can lead to many reads and writes. On a proper database there are processes in place to handle this.