We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Loading a subset of a data table to CAS

by SAS Employee UttamKumar on ‎05-25-2017 10:06 AM (665 Views)

In SAS® Viya™, SAS Data Connector enables you to access and load data to Cloud Analytic Services (CAS). The SAS Data Connector contains connection information and specifics to connect with such data sources as Hadoop, Oracle, or SAS data sets. While loading data to CAS, SAS Data Connector supports data filtration. You can load a subset of data to CAS from source data with execution of the data subset condition at the source environment. The source environment processes the data condition and sends only the subset of data to CAS over the network.

 

For simplicity, I am referring only to SAS Data Connector; the SAS Data Connect Accelerator also supports the same data filtration feature. However, SAS Data Connector is meant for serial data load from source to CAS and SAS Data Connect Accelerator is meant for parallel data load from source to CAS.  

 

The parameter values that you supply for SAS Data Connector are specific to the data source. In general, you specify parameters for SAS Data Connector in CAS code within the datasource= parameter when adding a ‘caslib’ or in the options= / dataSourceOptions= parameter when loading a data table to CAS using PROC CASUTIL or PROC CAS table action.  

 

The options= / dataSourceOptions= parameter is supported in PROC CASSUTIL and PROC CAS. It enables you to supply a “where” clause and a varlist= parameter value (a list of column names from a table) in order to process data filtration at the source environment.  

 

For example, the following CAS code loads data from the “stocks” Hive table to CAS with three columns and with data satisfying “stock” ticker equal to ‘IBM’ and “close” value greater than 100. SAS Data Connector sends a query with a list of columns and a “where” clause to the Hive server to process, and returns only a subset of data to CAS. When varlist= the parameter value supplied, the SAS Data Connector SQL construct contains a list of column names instead of ‘select *’.  

 

Example code:

proc casutil;

   load casdata="stocks" casout="stocks" outcaslib="hiveEP" incaslib="hiveEP"

   varlist={"stock", "open", "close"}

   options={where="stock='IBM' and close > 100" } ;

quit;  

 

 

Log Extract from code execution:

subset_data_cas_1.png  

 

The hadoop application job list subset_data_cas_2.png  

 

The Hadoop application job list shows that loading the “stock” table from Hive to CAS was a two-step process. In the first step, SAS Data Connector executed the query with a list of columns and a “where” clause to create a temporary Hive table with a subset of data. In the second step, SAS Data Connector accelerator reads out of temp the Hive table (subset of data) to load into CAS.  

 

To support data filtration at the source environment (Hadoop), you must have sufficient temp space (on HDFS) to process the query submitted by SAS Data Connector. The following Hadoop application job logs show creation of a temp table/file while working on the data load request. Notice the RECORD_IN: value and the record written value.

 

subset_data_cas_3.png  

When using varlist= and options= or dataSourceOptions= parameters in PROC CASUTIL and PROC CAS, there are some rules to follow for how to supply the column names and the “where” clause.

  • Column names casing must match casing of DBMS.
  • Column names need to be quoted if they contain special characters or do not match the default case for the DBMS.
  • A SQL statement uses the \ (backslash) as an escape character for double quotes.

 

 

If you are loading a subset of data from an external database table, such as PostgreSQL or Teradata, then use the dbmsWhere= data connector parameter. With this parameter, the “where” clause that you specify is passed directly to the external database for use while loading the data.

 

proc casutil;

    load casdata="cars" incaslib="tdlib" casout="cars_CAS"

    vars=((name="make"), (name="model"))

    dataSourceOptions=(dbmswhere="cylinders=8") ;

quit;  

 

Related reading

For more information on Loading a subset of Data table see, Loading a subset of data table .

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.