Can you create a star schema directly on source files or database tables, without loading them first in CAS? Yes, you can! The CAS data source star schema automatically loads data, when you query it. You do not need to load the tables or refresh the star schema. Read the post, learn the benefits and avoid the most common pitfalls.
In the previous post Query Performance? Use a CAS Star Schema in Viya 3.5 you defined a CAS Star Schema on loaded tables. You saw different techniques to improve query performance.
The fact table MAILORDER is an ORACLE table. PRODUCTS, CUSTOMERS and CATCODE are the dimensions:
A star schema is a join between one fact table and one or more dimension tables. Each dimension table is joined to the fact table through a key.
To test the star schema, I used the following tables: a decent fact table with 9 million records and a dimension with 28 million customers.
The time in seconds is the real time displayed in the SAS log.
Comparison:
I added the loading time, as a CAS Star data source star schema triggers the loading.
To summarize the tables above:
* define HIVE caslib;
caslib cashive datasource=(srctype="hadoop",
server="sashdp01.race.sas.com",
username="hive",
dataTransferMode="auto",
hadoopconfigdir="/opt/sas/viya/config/data/hadoop/conf",
hadoopjarpath="/opt/sas/viya/config/data/hadoop/lib",
schema="cashive"
dfdebug=epall) global;
* define ORA caslib;
caslib casora datasource=(srctype="oracle",username="CASDM",password='mypassword',
path="//sasdb.race.sas.com:1521/xe",schema="CASDM",
numreadnodes=10, numwritenodes=10) global;
Proc cas;
table.tableinfo / caslib='casuser' name='mail_view';
table.columninfo / table='mail_view'; quit;
When dimensions or facts are updated (files or tables):
cas mySession sessopts=(caslib="casuser" timeout=1800 locale="en_US"
metrics="true" dqLocale="(ENUSA)");
* Drop CAS view/table if existent*;
proc casutil;
droptable casdata="mail_view" incaslib="casuser" quiet;
quit;
* Create Star Schema;
proc cas;
table.view / caslib='casuser' name='mail_view' replace=true
tables={
{caslib='dm_oradm' name='mailorder',
varlist={'Expenses', 'Revenue', 'Qty'}, as='m'},
{keys={'m_pcode = p_pcode'},
caslib='dm_dnfs' name='products.csv', varlist={'price','cost', 'descrip','type'},
computedVars={{name="price_cost_ratio"}}, computedVarsProgram="price_cost_ratio = price/cost;", as='p'},
{keys={'m_custnum = c_custnum'},
caslib='dm_dnfs' name='customers.sashdat',
varList={'addr1','addr2','city','phone',
'region','state','zip'}, as='c'},
{keys={'m_catcode = cat_catcode'},
caslib='dataproc' name='catcode.csv', varlist={'catalog'}, as='cat'}
};
table.tableinfo / caslib='casuser' name='mail_view';
table.columninfo / table='mail_view';
quit;
Reminder:
Summary + transpose cas actions gave good performance, slightly better than FedSQL. Summary used by Visual Analytics for aggregations.
* Query with summary;
proc cas;
simple.summary result=r status=s/
groupByLimit=25000,inputs={{name="v_Profit"},{name="m_QTY"},{name="p_COST"},
{name="p_PRICE"}},orderBy={"p_TYPE","p_DESCRIP","c_STATE"},orderByAgg={},orderByDesc={},subSet={"SUM"},
table={caslib="CASUSER",computedOnDemand="false",
computedVars={{name="v_Profit"}},
computedVarsProgram="'v_Profit'n=('m_QTY'n * ('p_PRICE'n - 'p_COST'n));;",
groupBy={{format="$.",name="p_TYPE"},{format="$.",name="p_DESCRIP"},
{format="$.",name="c_STATE"}},name="MAIL_VIEW"},
casout={name="mail_view_agg", caslib="casuser" replace=True};
run;
if (s.severity == 0) then do;
transpose.transpose /
table={name='mail_view_agg', caslib='casuser',
groupBy={"p_TYPE","p_DESCRIP","c_STATE" } },
id={'_Column_'},
casOut={name='MAIL_VIEW_TR', caslib='casuser', replace=true},
transpose={'_Sum_'} ;
end;
quit;
cas mySession terminate;
Viya 3.5 and the CAS star schemas can help you gain time when it comes to joining multiple tables, aggregating or calculating new columns on-demand.
With a CAS star schema directly on a data source (fact and dimensions are files, or database tables), you will save a step to load data. Data is loaded on demand, at query time.
CAS star schema refresh seems to be more suited for programmatic use, than in Visual Analytics.
Lastly, you need to understand and (try to) avoid the common CAS star schema pitfalls.
Acknowledgements: Gordon Keener.
Thank you for your time reading this post. Please comment and share your experience with the CAS Star Schemas and help others.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.