I'm connecting to a database hosted somewhere in the cloud, via an ODBC LIBNAME statement.
When I try to load the data into memory with a DATA step, it takes around 4 minutes to bring in ~25,000 observations. Is this slower than expected? Is there a method I can use that makes bringing the data into my WORK folder faster? Would using a hash table speed this up, or are hash tables more useful for joins?
DATA want;
set cloud.have (keep=id year /* plus the other variables needed */);
where year >= 2019;
RUN;
Look at the speed of your network connection.
It might also be that evaluating your WHERE condition takes too much time because no index is defined on year.
For diagnostic purposes, run the select natively on the database for comparison; a pass-through sketch follows.
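A minimal sketch of that comparison using explicit SQL pass-through; the DSN name and the table name as the database sees it are assumptions, so substitute your own connection details:
/* Run the count natively on the database and compare its runtime to the LIBNAME pull. */
proc sql;
  connect to odbc (dsn="myCloudDSN");   /* placeholder connection options */
  select * from connection to odbc
    ( select count(*) as n
      from have
      where year >= 2019
    );
  disconnect from odbc;
quit;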
Thanks for the ideas. I also discovered some of the variables were stored as massive strings (1000+ characters). I can't imagine that helps with processing time...
Test a row count from your ODBC connection:
proc sql;
select count(*)
from cloud.have
where year >= 2019;
quit;
If this runs a lot faster, then you can conclude that your network connection is what is slowing you down.
@MB_Analyst wrote:
Thanks for the ideas. I also discovered some of the variables were stored as massive strings (1000+ characters). I can't imagine that helps with processing time...
Network throughput is often the bottleneck, and the length of the variables will impact performance.
If you know that the variable lengths are too big, then you could reduce them on the DB side before transferring the data over the network. This would certainly work with explicit SQL pass-through, but you could also give the code below a try (a pass-through version follows it).
I'm not sure if SAS will push the length definition down to the database, so you'll have to check in the log what the SQL actually sent to the DB looked like (the options posted below will write that information to the SAS log).
options sastrace='d' sastraceloc=saslog nostsuffix;
proc sql;
  create table want as
  select
    c.id,
    strip(c.varLen1000) as varLen length=$100
    /* , ... remaining columns ... */
  from cloud.have c
  where c.year >= 2019
  ;
quit;
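If the log shows that SAS does not push the LENGTH= down, an explicit pass-through version does the trimming on the database side before anything crosses the network. A rough sketch; the DSN, column names, and the substring/cast syntax are placeholders that must match your DBMS's SQL dialect:
proc sql;
  connect to odbc (dsn="myCloudDSN");   /* placeholder connection options */
  create table want as
  select * from connection to odbc
    ( select id,
             cast(substring(varLen1000, 1, 100) as varchar(100)) as varLen
      from have
      where year >= 2019
    );
  disconnect from odbc;
quit;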
Massive strings may be handled by adding the DBMAX_TEXT= option to your LIBNAME statement:
Determines the length of any very long DBMS character data type, such as BLOB or CLOB, that is read into SAS or written from SAS when using a SAS/ACCESS engine.
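A minimal sketch of the LIBNAME statement with that option; the DSN name and the 1024-byte cap are example values only:
/* Cap long DBMS character columns at 1024 bytes when they are read into SAS.
   "myCloudDSN" is a placeholder for your ODBC data source. */
libname cloud odbc dsn="myCloudDSN" dbmax_text=1024;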
That option defines the maximum string length exchanged between SAS and a DB and avoids truncation of long strings. It has no impact on performance (unless there is string truncation).
Agreed in some contexts. However, I've seen multiple customer use cases where it does impact performance, especially when leveraging in-database PROCs (e.g., MEANS, TABULATE, FREQ, REPORT, PRINT); the cost shows up during the read operation. Also, connecting via a LIBNAME in EG (or another SAS UI) and viewing Hadoop data (for example) that has 32767-byte columns will lock your session even if it is only one column.
Of course, the optimal "best practice" is to format the data properly at its source and not leave columns defined with a default STRING type and no column width specification. Typically this requires the Data Management team to be more intentional about data quality.
The problem can also be solved by creating well-structured views of the data if the source tables cannot be modified; a sketch follows.
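A hedged sketch of creating such a view from SAS via explicit pass-through; the DSN, view name, columns, and the substring/cast syntax are placeholders and must follow your DBMS's SQL dialect (you also need CREATE VIEW rights on the database):
proc sql;
  connect to odbc (dsn="myCloudDSN");   /* placeholder connection options */
  execute (
    create view have_trimmed as
      select id, year,
             cast(substring(varLen1000, 1, 100) as varchar(100)) as varLen
      from have
  ) by odbc;
  disconnect from odbc;
quit;
SAS then reads cloud.have_trimmed instead of cloud.have, so only the trimmed columns cross the network.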
Try the READBUFF= option, which controls how many rows are fetched from the database per round trip:
libname cloud odbc ...... readbuff=10000;