BookmarkSubscribeRSS Feed

Using FedSQL to Pre-process Database Tables In-Database before Loading to CAS

Started ‎09-08-2023 by
Modified ‎09-08-2023 by
Views 1,097

To process data in CAS, data from CASLIB data sources are first copied into a memory resident CAS tables (see CAS Fundamentals: Data Lifecycle). When the size of the data source is large it is best load only the data that you require for CAS processing to limit I/O, network transport of data, and memory resources.

 

The LOAD statement in PROC CASUTIL and the LoadTable CAS action provide WHERE= and VAR= options to limit rows and columns to load and these are surely great options to use to your advantage.

 

But that's not what I'm talking about here. I'm talking about how you can use FEDSQL to execute a database query on a CASLIB that uses database tables as it's data source. IF (and only if) your CASLIB is a database CASLIB, you can execute a database query (perhaps a summary query, perhaps a join, any valid query) directly on the data source BEFORE it is loaded into CAS and only the result of that query is loaded to CAS.  An earlier post by Nicolas Roberts  covered the implicit pass-through method and I'll say some more about that here. In addition I'll show you a more recently supported explicit pass-through method.  I will show examples of these techniques using a CASLIB that connects to a postgres database, but these methods works for many other database CASLIBs as well.

 

Both implicit and explicit pass-through with CAS are accomplished with PROC FEDSQL. Implicit pass-through involves writing a query using FEDSQL syntax. The CASLIB data connector engine translates this on your behalf into a native database query, sends that query to the database for execution, and returns the result to CAS where further processing is done, if necessary, to complete the query request. With explicit pass-through, you write a database  query using the native syntax of the database and embed that query within your FEDSQL code. The native query is submitted directly to the database for execution and the results are returned to CAS.

 

FEDSQL Implicit SQL pass-through example

 

In the example below, after starting a CAS session, the CASLIB called caspgr connects to a Postgres database. In PROC FEDSQL the SESSREF= option will direct the procedure to execute in CAS, rather than the SAS compute server. The CREATE table statement will create a new in-memory CAS table called revenuesummary and it will be populated with the results of the SELECT query. That query names the "orders_large" table from the Postgres caspgr CASLIB. The syntax used is not native to Postgres and is written entirely in FEDSQL syntax.

 

cas mysession;
CASLIB caspgr datasource=(srctype="postgres" database=DCON
			  server="server.demo.sas.com" port=30432 
                          user=student password='Metadata0' ) libref=caspgr;
proc fedsql sessref=mysession;
 create table caspgr.revenuesummary{options replace=true} as
   select order_type, sum(total_sale) as revenue
   from caspgr."orders_large"
   where order_date < date'2015-01-01' 
   group by order_type;
quit;

 

If a caspgr.orders_large table does not exist as a CAS table in CAS memory, the select query will be converted as much as possible into a native Postgres query, sent to Postgres to execute, and the results will be returned to populate the CAS table caspgr.revenuesummary. Thus,  filtered and summarized query result is all that is returned to CAS. (Notice that when naming the table "orders_large" on the FROM clause, the name must be quoted when naming you are accessing from the caslib data source. This is necessary for Postgres tables, but not for other types of database caslibs.)

 

Let's check the SAS log to compare performance for the implicit pass-through method (method A)  versus pre-loading the orders_large table into memory and running the same query on the in-memory table (method B). The orders_large Postgres table contains 2,000,000 rows and 31 columns and the code was run in a small-scale development environment with a single CAS node.

 

Method A: Implicit SQL Pass-Through  

 

84   proc fedsql sessref=mysession ;
85    create table caspgr.revenuesummary{options replace=true} as
86      select order_type, sum(total_sale) as revenue
87      from caspgr."large_orders"
88      where order_date < date'2015-01-01'
89      group by order_type;
NOTE: The SQL statement was fully offloaded to the underlying data source via full pass-through
90   quit;
NOTE: PROCEDURE FEDSQL used (Total process time):
      real time           0.32 seconds
      cpu time            0.02 seconds

 

Method B: Load Database Table to CAS and Query

 

92   proc casutil;
NOTE: The UUID '399c10ee-2aed-ae4e-89e2-c5fbd28d203a' is connected using session MYSESSION.
93     load casdata="large_orders" inCASLIB="caspgr" outCASLIB="caspgr" casout="large_orders" replace;
NOTE: Performing serial LoadTable action using SAS Data Connector to PostgreSQL.
NOTE: Cloud Analytic Services made the external data from large_orders available as table LARGE_ORDERS in CASLIB caspgr.
NOTE: The Cloud Analytic Services server processed the request in 38.5743 seconds.
94   run;
95   quit;
NOTE: PROCEDURE CASUTIL used (Total process time):
      real time           38.59 seconds
      cpu time            0.08 seconds
 
96   proc fedsql sessref=mysession ;
97    create table caspgr.revenuesummary{options replace=true} as
98      select order_type, sum(total_sale) as revenue
99      from caspgr."large_orders"
100     where order_date < date'2015-01-01'
101     group by order_type;
NOTE: Table REVENUESUMMARY was created in CASLIB CASPGR with 3 rows returned.
102  quit;
NOTE: PROCEDURE FEDSQL used (Total process time):
      real time           1.06 seconds
      cpu time            0.02 seconds

 

A note following the first PROC FEDSQL (method A) indicates that the "SQL statement was fully offloaded to the underlying data source via full pass-through" and the entire process to execute that query and load the revenuesummary table completes in 0.32 seconds of real time. In contrast, in method 2, when a PROC CASUTIL LOAD statement is used to first load the entire orders_large source table, that process requires  38.59 seconds. The final PROC FEDSQL step re-creates the revenuesummary table by executing the same FEDSQL query on the pre-loaded orders_large CAS table, which takes 1.06 seconds.

 

By querying the data source with implicit pass-through you can execute processes more quickly, take advantage of the database capacity to process data natively, reduce the volume of data you must transport across the network, and reduce the memory required in CAS to create the CAS table.

 

Sometimes the implicit pass-through method only converts some of the query elements for in-database execution. For instance, it may convert the WHERE clause, but not perform the summarization. Sometimes it does no more than simply generate a database query to return ALL the data from the data source and all the rest is done in CAS. In the query shown in the log below, the original query (see method A above) is modified to use a KURTOSIS function in place of the  SUM function. KURTOSIS isn't  converted into an equivalent summary function in Postgres and the log no longer shows the message that the SQL statement was fully offloaded to the database. In this case, only the WHERE clause is offloaded to the database but many more rows were returned to CAS and CAS then completes the summarization to calculate the KURTOSIS. The real time to process the second query was thus longer at 2.31 seconds versus 0.32 seconds when the SUM function was used.

 

87   proc fedsql sessref=mysession ;
88    create table caspgr.revenuekurtosis{options replace=true} as
89      select order_type, kurtosis(total_sale) as kurtosis
90      from caspgr."large_orders"
91      where order_date < date'2015-01-01'
92      group by order_type;
NOTE: Table REVENUEKURTOSIS was created in CASLIB CASPGR with 3 rows returned.
93   quit;
NOTE: PROCEDURE FEDSQL used (Total process time):
      real time           2.81 seconds
      cpu time            0.03 seconds

 

During code development, if you need to evaluate the extent of database pushdown of implicit queries you can use drivertrace options on the CASLIB statement.

 

FEDSQL Explicit SQL pass-through example

 

An alternative approach to query database tables for loading data to CAS is to submit native database queries directly with explicit SQL pass-through. This ensures that the query you write in the native language will execute entirely in-database and you can take advantage of existing native code and all available functionality of the database. For the example below native Postgres SQL that leverages the databases capability to manipulate JSON data is used with explicit SQL pass-through to parse a JSON column to extract and summarize revenue from customer order information. Only the summary result is returned to CAS making it available for further analysis within CAS.

 

The Postgres table contains an integer column called orderid and a column of the JSON data type called orderinfo. Orderinfo stores information about customer orders.

 

DG_1-original-json-table.png

Select any image to see a larger version.

Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

Let's focus first on how I construct a native Postgres query to extract the information I want to bring to CAS. The query below uses JSON operators to extract the values for the customer name, customer state, product, and qty object keys within the orderinfo JSON column of the jsonorders table.

 

select orderinfo -> 'customer' ->>'name' as customername, 
       orderinfo -> 'customer' ->>'state' as customerstate,
       orderinfo -> 'items' ->> 'product' as product,
       orderinfo -> 'items' ->> 'qty' as quantity
from jsonorders

 

This result yields:

 

DG_2-extract-from-rows.png

 

We will use the first query above as sub-select for an outer query to summarize the quantity ordered by customerstate. Here is the complete native postgres query:

 

select customerstate,
       min(quantity) as minquantity,
       max(quantity) as maxquantity,
       sum(quantity) as sumquantity
from       
       (select orderinfo -> 'customer' ->>'state' as customerstate,
               orderinfo -> 'items' ->> 'product' as product,
     	       cast(orderinfo -> 'items' ->> 'qty' as integer) as quantity
       from jsonorders) t
group by customerstate

 

Which yields the final summary result:

 

DG_3-summarize.png

 

With the native postgres query in hand we can now include this in FedSQL pass-through code so that it is sent to execute in Postgres and so that the summary result is returned to CAS:

 

cas mysession; 
CASLIB caspgr datasource=(srctype="postgres" database=DCON
						 server="server.demo.sas.com" port=30432 
                         user=student password='Metadata0' ) libref=caspgr;
proc fedsql sessref=mysession;
create table caspgr.StateSalesSummary as
   select customerstate, minquantity, maxquantity, sumquantity from connection to caspgr
       (select customerstate,
               min(quantity) as minquantity,
               max(quantity) as maxquantity,
		       sum(quantity) as sumquantity
        from       
             (select orderinfo -> 'customer' ->>'state' as customerstate,
                     orderinfo -> 'items' ->> 'product' as product,
     		         cast(orderinfo -> 'items' ->> 'qty' as integer) as quantity
              from jsonorders) t
        group by customerstate;
quit;

 

Let's examine this step by step:

 

  1.  The CAS statement starts a CAS session named mysession.
  2.  A CASLIB statement defines a CASLIB called caspgr that connects to the postgres database.
  3.  A PROC FEDSQL statement with the SESSREF option starts PROC FEDSQL in the CAS session.
  4.  A CREATE TABLE statement creates a CAS table called caspgr.StateSalesSummary.
  5.  A SELECT clause in the create table statement will select data to populate the table.
  6. The FROM clause for the SELECT, indicates to read data FROM CONNECTION TO CASPGR. CASPGR is the CASLIB connection to Postgres defined in step 2.
  7.  The FROM CONNECTION TO CASPGR is followed by our native Postgres query enclosed in a set of parentheses. This native query will be executed by Postgres and the results will be returned to CAS and read by the SELECT clause from step 5 above the results will be stored in the CAS table caspgr.StateSalesSummary.

 

Any valid native database query can be used in this way for most database caslibs.

 

Both implicit and explicit pass-through capabilities of FEDSQL allow us to take advantage of database processing of data before returning results to CAS. Implicit pas-through allows us to write FedSQL syntax without the requirement to conform to the specific syntax of the database. With implict pass-through, logging methods available to evaluate how much processing was passed to the database but it is possible that some or most of the FedSQL syntax is not translated to the database and more data will be returned to CAS for processing. With explicit pass-through methods, since you write the native database code yourself, you have full knowledge and control over what executes in database and what is returned to CAS. It also allows you to leverage database methods that may be difficult to reproduce within SAS.

 

Further Reading

 

SAS Documentation: FEDSQL Implicit and Explicit Pass-Through Facilities,

SAS Documentation: FedSQL for CAS Reference

 

 

Find more articles from SAS Global Enablement and Learning here.

Comments

AWESOME Article @DavidGhan !!!

Version history
Last update:
‎09-08-2023 09:12 AM
Updated by:
Contributors

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags