Automatic Data Loading with a CAS Data Source Star Schema in Viya 3.5

2 Likes

Can you create a star schema directly on source files or database tables, without loading them first in CAS? Yes, you can! The CAS data source star schema automatically loads data, when you query it. You do not need to load the tables or refresh the star schema. Read the post, learn the benefits and avoid the most common pitfalls.

In the previous post Query Performance? Use a CAS Star Schema in Viya 3.5 you defined a CAS Star Schema on loaded tables. You saw different techniques to improve query performance.

Star Schema Data Model

CAS Data Source Star Schema Model.png

The fact table MAILORDER is an ORACLE table. PRODUCTS, CUSTOMERS and CATCODE are the dimensions:

Dimension CUSTOMERS is a .sashdat file in a DNFS caslib;
Dimensions PRODUCT is a .csv file in a DNFS caslib
Dimension CATCODE is a CSV file in a PATH caslib;

A star schema is a join between one fact table and one or more dimension tables. Each dimension table is joined to the fact table through a key.

Data Volumes

To test the star schema, I used the following tables: a decent fact table with 9 million records and a dimension with 28 million customers.

CAS Data Source Star Schema 20 Data Volumes.png

The time in seconds is the real time displayed in the SAS log.

Comparison:

at the bottom, the time to load a table, triggered by the star schema query
at the top, the time to load data (serial) from a PATH caslib
under, the time to load data (in parallel) from a DNFS caslib or multi-threaded from an ORACLE caslib.

Scenarios Tested and Results

In white: CAS Star data source star schema. The focus of this post.
In orange: CAS Star Schema (tables loaded serially)
In green: CAS Star Schema (tables loaded in parallel)
Orange and green were the focus of the previous post Query Performance? Use a CAS Star Schema in Viya 3.5.

I added the loading time, as a CAS Star data source star schema triggers the loading.

CAS Data Source Star Schema 30 Scenarios.png

Key Points

To summarize the tables above:

Use the CAS star schema on data sources when data is refreshed often, but the queries are infrequent.
Particularly suitable for any source that allows parallel loading e.g.: data sources stored as .sashdat files, tables in HIVE, or any sources that allow parallel loading.
Loading is initiated by the cas star schema at query time. Load time is dependent on source, access engine, type of load and many other factors.
In this scenario, it took 74 seconds in total. 14 seconds to load and 60 seconds to summarize. If Oracle would load multi-threaded (it can), instead of loading serial (as it does in a star schema), the total time would drop to 62 seconds, very comparable with the optimal scenario from the previous post Query Performance? Use a CAS Star Schema in Viya 3.5.

Findings

Data Load

Dimensions or facts are files: tested with .SASHDAT and .CSV files (loaded on demand).
It works with many other file types: EXCEL, XLS, BASESAS, DTA, ESP, FMT, HDAT, JMP, LASR, SPSS. See documentation
If you load a source table from Hive, "mailorder" for example, you need to define a caslib first, unless already defined:

* define HIVE caslib;
caslib cashive datasource=(srctype="hadoop",
server="sashdp01.race.sas.com",
username="hive",
dataTransferMode="auto",
hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf",
hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib",
schema="cashive"
dfdebug=epall) global;

It is faster to load from HIVE when you have the Embedded Process active, configured and the transferMode=parallel is specified in the caslib definition.
You can also use transferMode=auto. If the Embedded Process fails (parallel load), it will retry to load serially, without the Embedded Processing. Obviously, this is slower.
An Embedded Process runs the code in the source technology, lets the source do the heavy lifting, instead of CAS.
Source table from Oracle: even if you define the ORACLE caslib with numreadnode=x, the cas star schema view does not actually use the option. The load happens serially, not multi-threaded, as defined below.

* define ORA caslib;
caslib casora datasource=(srctype="oracle",username="CASDM",password='mypassword',
path="//sasdb.race.sas.com:1521/xe",schema="CASDM",
numreadnodes=10, numwritenodes=10) global;

Tried to load files from AWS S3 .SASHDAT and .CSV: didn’t work for me.

Data Join

Keys in the dimension tables must be unique. I will repeat: one record, one unique key.
In practice, you might have surprises, especially when you work with data you do not know. Check validity flags, dates, before you create the star schema. Bottom line: understand the data you want to join.
The key between the fact and the dimensions must be one column (multiple join keys are not supported)

View Details

Do not add execute tabledetails on a data source cas star schema! Tables will be loaded just to display the results!
Although you are safe with the following:

Proc cas;
table.tableinfo / caslib='casuser' name='mail_view'; 
table.columninfo / table='mail_view'; quit;

Data Refresh:

When dimensions or facts are updated (files or tables):

In SAS Studio V, you just need to refresh the star schema. The fact that data is loaded at query time, saves you a step. In the previous post, where the star schema was on a table, you had to reload the table then refresh.
For Visual Analytics:
- Run the code below to recreate your star schema (the operation is very fast).
- Then refresh the data in the report or, enable the ‘Periodically reload page data’ in the report page options
Only the view must be promoted, not all the facts and dimensions (because they are stored in files anyways). You will gain a step here.

CAS Data Source Star Schema

Create the CAS star schema

cas mySession sessopts=(caslib="casuser" timeout=1800 locale="en_US" 
	metrics="true" dqLocale="(ENUSA)");
* Drop CAS view/table if existent*;
proc casutil;
	droptable casdata="mail_view" incaslib="casuser" quiet;
	quit;
* Create Star Schema;
proc cas; 
table.view / caslib='casuser' name='mail_view' replace=true
tables={
	{caslib='dm_oradm' name='mailorder', 
	varlist={'Expenses', 'Revenue', 'Qty'}, as='m'},
	{keys={'m_pcode = p_pcode'},
    caslib='dm_dnfs' name='products.csv',  varlist={'price','cost', 'descrip','type'}, 
	computedVars={{name="price_cost_ratio"}}, computedVarsProgram="price_cost_ratio = price/cost;",	as='p'},
	{keys={'m_custnum = c_custnum'},
    caslib='dm_dnfs' name='customers.sashdat', 
	varList={'addr1','addr2','city','phone',
	'region','state','zip'}, as='c'},
	{keys={'m_catcode = cat_catcode'},
    caslib='dataproc' name='catcode.csv', varlist={'catalog'}, as='cat'}
	};
table.tableinfo / caslib='casuser' name='mail_view';
table.columninfo / table='mail_view';
quit;

Reminder:

The Fact MAILORDER is an ORACLE table.
Dimension CUSTOMERS is a .sashdat file in a DNFS caslib.
Dimensions PRODUCT is a .csv file in a DNFS caslib.
Dimension CATCODE is a CSV file in a PATH caslib.

Query the Star Schema: use cas actions

Summary + transpose cas actions gave good performance, slightly better than FedSQL. Summary used by Visual Analytics for aggregations.

* Query with summary;
proc cas;
simple.summary result=r status=s/
groupByLimit=25000,inputs={{name="v_Profit"},{name="m_QTY"},{name="p_COST"},
{name="p_PRICE"}},orderBy={"p_TYPE","p_DESCRIP","c_STATE"},orderByAgg={},orderByDesc={},subSet={"SUM"},
table={caslib="CASUSER",computedOnDemand="false",
computedVars={{name="v_Profit"}},
computedVarsProgram="'v_Profit'n=('m_QTY'n * ('p_PRICE'n - 'p_COST'n));;",
groupBy={{format="$.",name="p_TYPE"},{format="$.",name="p_DESCRIP"},
{format="$.",name="c_STATE"}},name="MAIL_VIEW"},
casout={name="mail_view_agg", caslib="casuser" replace=True};
run;
   if (s.severity == 0) then do; 
  transpose.transpose /
      table={name='mail_view_agg', caslib='casuser',
         groupBy={"p_TYPE","p_DESCRIP","c_STATE" }     },
      id={'_Column_'},
      casOut={name='MAIL_VIEW_TR', caslib='casuser', replace=true}, 
      transpose={'_Sum_'} ;
   end;
quit;
cas mySession terminate;

Conclusions

Viya 3.5 and the CAS star schemas can help you gain time when it comes to joining multiple tables, aggregating or calculating new columns on-demand.

With a CAS star schema directly on a data source (fact and dimensions are files, or database tables), you will save a step to load data. Data is loaded on demand, at query time.

CAS star schema refresh seems to be more suited for programmatic use, than in Visual Analytics.

Lastly, you need to understand and (try to) avoid the common CAS star schema pitfalls.

Documentation and Useful Resources

Want to Learn More about Viya 3.5?

Acknowledgements: Gordon Keener.

Thank you for your time reading this post. Please comment and share your experience with the CAS Star Schemas and help others.