Woo23
Fluorite | Level 6

Hi everyone,

We are using SAS Viya and CAS, and for some data preparation tasks my peers use PROC SQL (create table as select col1, col2 ... from table2). When I checked the automatically created tables, they always use the fixed-length CHAR(n) data type for text/character attributes. This results in very large tables in CAS.

 

I am a Data Engineer; when I create a new table in a database (SQL, NoSQL, streaming), I always define a DDL (Data Definition Language) script with exact data types and lengths for the table. But I couldn't find official SAS documentation for this; in most examples the tables are simply created with a DATA step, CASL, or PROC SQL/FedSQL.

 

What are your strategies for creating and managing tables in CAS? Do you use a DDL for a new table? Do you have any best practices?

 

Thanks 

12 REPLIES
Woo23
Fluorite | Level 6

For example, here is a database table lifecycle process I have used:

1. Write a DDL with columns, data types, and lengths for the new table.

2. Use DB provisioning tools such as Flyway, Liquibase, or a SQL Server database project to promote the table through the dev, test, and prod databases.

3. Insert/update data in the table created in step 2.

4. Use the table for writing/reading data (e.g. training datasets, OLTP/OLAP tables).

5. Make backward-compatible changes to some columns, and use the DB provisioning tools to deploy the change to the existing tables.

6. If the table is no longer used, write a DDL script to drop it with the DB provisioning tools.

 

Because CAS is an in-memory database, I think we should design CAS tables carefully and use the smallest possible data lengths for columns.

How do you do this on SAS CAS in-memory DB?
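One way to get DDL-style control in CAS is to run PROC FEDSQL against an active CAS session, since FedSQL supports explicit column types including VARCHAR. A minimal sketch, assuming a personal CASUSER caslib; the session, table, and column names here are made up for illustration:

```sas
cas mysess;   /* start a CAS session */

proc fedsql sessref=mysess;
   /* DDL with explicit types and lengths, as in a database project */
   create table casuser.customers
      (id       double,
       name     varchar(100),   /* variable-length instead of CHAR(n) */
       country  char(2));
quit;
```

A script like this can live in version control and be run from a deployment pipeline, much like a Flyway or Liquibase migration.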

LinusH
Tourmaline | Level 20

Using DDLs is one of many ways of creating data sets in CAS. But this is perhaps not a technical issue.

The question I think is relevant is who creates data sets in CAS, and for what purposes.

Generally speaking, analysts who use data for ad-hoc reporting are usually not that concerned with data modelling. In this scenario I suggest some kind of educational approach.

For data used in applications such as VA, VS and others, you should have better control over your data design and require that this data be modelled according to your guidelines.

Data never sleeps
Woo23
Fluorite | Level 6

Thanks for your answer and suggestion!

Yes, you are right! Most data analysts use and analyze the data, but they don't design the data model for VA or data/feature sets.

The big problem with PROC SQL is that the data types and lengths of the automatically generated columns are too big, something like CHAR(400) or CHAR(10000) as fixed-length strings; with multiple millions of records, this overburdens memory usage.

 

Do you have any examples/guidelines for automating table deployment/provisioning, as on database/streaming apps?

 

DerylHollick
SAS Employee

When the data set is loaded into CAS, you can convert the character strings to VARCHAR to save memory. Another option is to compress the table. In this example, I specified that strings of $16 and greater be converted, with the compress option commented out.

 

data test;
   length x $10000 y $8;
   x='abcdefghijk';
   y='abcdefgh';
run;

* Load SAS data set from the client;
%let path=%sysfunc(pathname(WORK));
proc casutil outcaslib="CASUSER";
   load file="&path./test.sas7bdat" importoptions=(filetype="BASESAS" varcharconversion=16) casout="test" /*compress*/;
quit;

proc cas;
   table.columninfo / table={caslib="CASUSER" name="test"};
quit;

columns.PNG

Woo23
Fluorite | Level 6

Thanks for your answer!

I asked the question because I want to manage the global CAS tables with a version control system; they should be created automatically and in an audit-proof way from a deployment pipeline / SAS scripts.

I think creating/promoting a global CAS table with a DATA step or an uncontrolled "create table x as select x from table1" is not safe and can cause unexpected problems.

 

Do you have any best practices for CAS data/schema management like we have in other data systems (databases, Apache Kafka, Avro, etc.)?

DerylHollick
SAS Employee

CAS is an in-memory engine, but it is not a DBMS with schema/DDL requirements.  "Loading data to CAS" usually refers to loading into memory, but a CASLIB also defines the "permanent" or "physical" location of the data.  A DBMS might be chosen for this location, depending on the requirements. 

 

In a pipeline-type process, I like to centralize all the CAS loads and use PROC CASUTIL. This allows me to check all the column types/lengths and table sizes, decide whether to compress, etc. If loading *.sas7bdat files from a Viya or SAS 9 compute server, I usually use the varcharconversion option, and I also save the tables to *.sashdat files so they can easily be reloaded without rerunning the process.
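A centralized load step along these lines might look like the following sketch; the caslib, file path, and table name are illustrative, not taken from the thread:

```sas
proc casutil outcaslib="PUBLIC";
   /* load a SAS data set, converting strings of $16+ to varchar */
   load file="/data/sales.sas7bdat"
        importoptions=(filetype="BASESAS" varcharconversion=16)
        casout="sales" promote;

   /* persist to sashdat so the table can be reloaded later
      without rerunning the upstream process */
   save casdata="sales" casout="sales.sashdat" replace;
quit;
```

Keeping all loads in one place like this makes it easy to review column definitions and sizes before a table is promoted for shared use.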

nvdwalt
Calcite | Level 5

Hi @DerylHollick 

 

What version of SAS are you using? varcharconversion= seems to be ignored in my code.

 

Thanks,

nvdwalt
Calcite | Level 5

nvdwalt_0-1688470245421.png

 

DerylHollick
SAS Employee

I used 3.5.  Your types show as varchar, so it wasn't ignored.  I get a similar result when running on Viya 4.  It looks like columnInfo shows the max actual length in 3.5 and the max possible length in 4.  

 

proc contents shows this for both 3.5 and 4:

contents.PNG

 

nvdwalt
Calcite | Level 5
Awesome, thanks @DerylHollick !!
Sajid01
Meteorite | Level 14

Hello @Woo23 
From your first post I see this:
"we are using SaS Viya and CAS, and for some data preparation tasks my peers use proc sql (create table as select col1, col2 ... from table2).  Once I checked the automatically created tables, these always use the Data type CHAR(n) fixed length string for text/character attributes. This causes a very big table in CAS."

The basic fact is that SAS in general, and PROC SQL in particular, does not support the VARCHAR data format. All character fields are fixed length, so have realistic expectations about data set sizes. The manner in which the tables are created does not matter.
PROC FEDSQL, the federated query language procedure for querying RDBMS databases, does support the VARCHAR format. But this may not be of any help in your case.

Woo23
Fluorite | Level 6

We use the CAS database, which does support VARCHAR. This isn't like SAS 9.4, where there are just numeric and CHAR data types. PROC SQL only knows these two data types, but with PROC FEDSQL we can create tables with VARCHAR in the CAS session. What we need is a data management guide for CAS: how a table can be created, modified, and dropped in a version-safe way.
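For the original "create table as select" case, the oversized CHAR(n) columns can be avoided by doing the CTAS in PROC FEDSQL with an explicit CAST to VARCHAR. A minimal sketch, assuming an active CAS session and source table (names here are hypothetical):

```sas
proc fedsql sessref=mysess;
   /* recreate the table with a right-sized varchar
      instead of the fixed-length char the source carries */
   create table casuser.slim as
      select id,
             cast(name as varchar(100)) as name
      from casuser.wide;
quit;
```

Because the column types are stated explicitly, a script like this can be checked into version control and rerun by a pipeline, unlike a PROC SQL CTAS that inherits whatever fixed lengths the source table has.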