SAS Programming

Haris · Posted 02-03-2024 05:32 PM

I am reading from a large MS-SQL database that's split into three separate tables due to column limitation. I only have SAS/ACCESS ODBC -- no specific SQL driver.

The PROC SQL joins the three tables (all of them have identical unique rowIDs) and filters the intake with the WHERE clause. Unfortunately, it appears that SAS reads all three tables in full and merges them on my desktop.

Here's the code:

proc sql;
create table want as
select a.<VarList>, b.<VarList>, c.<VarList>
from         MSSQL_A as a
  inner join MSSQL_B as b  ON a.ID=b.ID
  inner join MSSQL_C as c  ON a.ID=c.ID
where a.Variable = x
  AND b.Variable = y
  AND c.Variable = z
;
quit;

The source tables have millions of records and thousands of columns; however, the query only returns a handful of rows after where-filter and about a hundred columns that are manually listed in the SELECT clause.

Looking for advice on how to make this more efficient. It takes way too long to read all the data into SAS and I even run out of room on the disk. How can I push the filtering task from SAS to the server? Will pass-through method work on MS-SQL database if I only have ODBC driver? Is it worth purchasing SAS/ACCESS for SQL specifically? I used to work with SAS/ACCESS ORACLE and that shaved of a lot of time. New to ODBC/MS-SQL.

Should I import the three tables into SAS separately, filtering each as best as I can, and then join them in SAS?

Patrick · Posted 02-03-2024 05:56 PM

The SQL you share can't be what you're actually executing - starting with one-level table names which SAS would interpret as WORK.<table name>.

Please share also your libname that points to MS-SQL. For performance reasons make sure to use libname option readbuff=<some value> because the default is normally too low.

I can't see anything in your "prototype" SQL that SAS couldn't send to the DB for execution.

Passing Joins to the DBMS

Passing SAS Functions to Microsoft SQL Server

Your where clause looks a bit suspicious. There are comparisons between the source tables missing - a.key_var=b.key_var ... - which means that you've got potentially a cartesian join creating a massive number of rows.

where a.Variable = x
  AND b.Variable = y
  AND c.Variable = z

To see in the SAS log what SQL SAS sends to the database add below options to your script.

options sastrace=(,,,d) nostsuffix sastraceloc=saslog;

Haris · Posted 02-03-2024 06:11 PM

Fair. I used shortcuts that I thought were obvious.

libname MySQL ODBC dsn=MySQL schema=DBO;

MySQL is defined as an ODBC Data Source 64-bit via ODBC Driver 17 for SQL Server.

Each table in the FROM clause is MySQL.TableA, MySQL.TableB and MySQL.TableC.

I don't think I am doing a cartesian join as column ID is a unique row identifier and each table contains exactly the same number of rows for each patient/ID in the database.

The WHERE clause is meant to highlight the fact that Surgery Date is one table, Patient Age is in anotehr, and Patient Status is the third table. I want only a small SurgDate range, Patients of a certain Age, and Status. Hence the three WHERE conditions. I think using the same name Variable for all three is misleading -- they are all three different variables in each table. The WHERE clause is not meant to eliminate cartesian joint like it is often used.

I agree with you, this code is relatively simple and should be executed on the server by default. But it is not! WHY?

Haris · Posted 02-03-2024 06:13 PM

Updated Query -- hope it's more clear this way:

proc sql;
create table want as
select a.<VarList>, b.<VarList>, c.<VarList>
from           MySQL.MSSQL_A as a
  inner join MySQL.MSSQL_B as b  ON a.IDVarA=b.IDVarB
  inner join MySQL.MSSQL_C as c  ON a.IDVarA=c.IDVarC
where a.SurgDate > x
  AND b.Age > y
  AND c.Status = z
;
quit;

Tom · Posted 02-03-2024 07:03 PM

Just some general guidelines.

1) Make sure the query is being run efficiently inside the remote database so that only the data you need to transfer to SAS has to make the transfer.

You would probably get better answers on this by asking the owner of the data how they would make sure a query in the database they use. I know a lot of databases (oracle, teradata, snowflake) have ways you can ask it to show you how it plans to solve the problem you have given it. Sometimes a simple change in the way the query is written can have a drastic performance change. For your example it might be better to move the WHERE clauses to the place they impact so fewer observations are being joined.

from   (select a.variables from MySQL.MSSQL_A where surgdate > x) as a
  inner join ...

Or perhaps they need to create indexes on some of the variables.

2) Watch out for long character variables. A lot of database systems are using variable length character strings and some of them are defaulting to a ridiculously large maximum possible length. This can cause real trouble when trying to move into SAS datasets where characters strings are fixed length. There is no need to move 65,000 bytes to store a field that can only have YES or NO as the result. You might want to cast() those variables to something shorter when pulling.

Try just getting a summary of how many observations the join returns before trying to move the data. If they query generates a billion observations you probably don't want to move it into a SAS dataset.

connect using mysql;
select nobs from connection to mysql
(select count(*) as nobs from 
  ( your query here
  )
);

Kurt_Bremser · Posted 02-04-2024 03:53 AM

Are x, y and z literal values, or do you have something more complex there, like function or macro calls?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Haris · Posted 02-04-2024 01:31 PM

Straight up SAS date ‘ddmmmyyyy’d, and two numbers — an age and a category number for a status.

Patrick · Posted 02-03-2024 08:01 PM

I want only a small SurgDate range, Patients of a certain Age, and Status. Hence the three WHERE conditions.

Agree, it's not a cartesian join. I've "missed" the on clauses in your code.

I believe with SQL Server the joins get executed before the where clauses. If one of your where conditions significantly reduces the data volumes then it's eventually worth to amend your SQL so the data in the source table gets sub-set prior to the join.

I agree with you, this code is relatively simple and should be executed on the server by default. But it is not! WHY?

How do you know that the code doesn't execute in full on the DB? Have you verified in the SAS log or on the DB side which SQL actually gets executed on the DB? Have you run your code with below options defined?

options sastrace=(,,,d) nostsuffix sastraceloc=saslog;

I'm not sure if NOEXEC will still test what SQL syntax it can send to the DB - but it's worth a shot as it would help to speed-up testing and tweaking your SQL to ensure full in-database processing.

I've also added in below code how you could sub-set a table prior to a join. The disadvantage of such an approach is of course that the join then can't use indexes from this table in case they exist - but if the where clause significantly reduces volumes prior to the join then it's highly likely still the better option.

options sastrace=(,,,d) nostsuffix sastraceloc=saslog;

proc sql /*noexec*/;
  create table want as
    select a.<VarList>, b.<VarList>, c.<VarList>
    from 
      (select * from MySQL.MSSQL_A where a.SurgDate > x) as a
      inner join MySQL.MSSQL_B as b  
        ON a.IDVarA=b.IDVarB

      inner join MySQL.MSSQL_C as c  
        ON a.IDVarA=c.IDVarC

    where b.Age > y
          AND c.Status = z
  ;
quit;

...and of course you can also write explicit pass-through SQL the way @Tom already proposed for counting rows. In such a case it's 100% ensured that the SQL runs fully in-database because all that SAS does is to send the SQL code as-is to the DB and then transfer the result set back to SAS.

And last but not least: Check on the SAS side into what character lengths the SQL Server variables map. If you see on the SAS side a character variable with a length of 32KB then you might need to cast the variable on the SQL side (using explicit pass-through SQL) to a data type with a length that still doesn't truncate the string but maps into a shorter character length on the SAS side.

Haris · Posted 02-05-2024 11:00 AM

CULPRIT FOUND!

Thank you all for the input. I've used some of the proposed ideas and will try the rest too.

SAS MISSING() in the WHERE statement was causing the failure to pass the querry to DBMS for fetching data! As soon as I replaced the commented out code that contains "missing()" with equivalent "is missing" syntax, no more memory hogging on the WORK folder and execution in seconds rather than >30 minutes.

	where SurgDt between '01Oct2022'd and '30Sep2023'd
	  and Age >= 18
	  and CountryUS = 'Y'

/***	  and missing(ExpiredInOR)                                 */
/***	  and not(  (missing(OpValve) or OpValve = '2')	           */
/***		and (missing(OpCAB)   or OpCAB   = '2') )  */

	  and ExpiredInOR is missing
	  and not(  (OpValve is missing or OpValve = '2')	
		and (OpCAB   is missing or OpCAB   = '2') )

Any input on MISSING() function? Weird, isn't it?

As to how I knew that SAS was importing the data: the original querry would (1) take way too long to retrieve <50 rows of data and (2) the memory on the c:\ drive would disappear like rain in the desert while the query was executing. I have well over 0.5TB on the main drive and it would all get gobbled up by SAS. No way a dataset with <50 rows and <200 columns consumes >700GB of memory.

Kurt_Bremser · Posted 02-05-2024 11:15 AM

That was the point of my previous question. Posting only part of the code does not help in finding the root cause of issues 😞

Using SAS functions in SQL code will often prevent the implicit pass-through to the database.

Do a Google search for "sas implicit pass through" to find out more.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 02-05-2024 07:52 PM

Any input on MISSING() function? Weird, isn't it?

It's fully documented which SAS functions can get sent to the DB for execution.

Passing SAS Functions to ODBC

As to how I knew that SAS was importing the data: the original querry would (1) take way too long to retrieve <50 rows of data and (2) the memory....

You don't need to derive such information. Option sastrace writes to the SAS log which SQL gets sent to the database. If you would have used it and looked at the log you would have seen that SAS function missing didn't get translated to MS SQL syntax and though had to execute on the SAS side.

options sastrace=(,,,d) nostsuffix sastraceloc=saslog;

SAS Programming

Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

Re: Read data from MS-SQL into SAS via SAS/ACCESS ODBC

SAS/ACCESS: Universal SAS Methods to Access Just About Any Data, Anywh...

Reading MS SQL Server Views, not tables, from an ODBC Connection

Sneak Preview - SAS/ACCESS Interface to Snowflake

SAS/ACCESS in SAS 9.4M4 – SAS In-Database Procedures push-down to more...

[SAS 활용 FAQ] SAS/ACCESS Interface to ODBC 라이선스로 Oracle DB 연결하는 방법을 알고 ...

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...