I wonder why my query takes nearly 4 minutes for completion.
It's the part of my code which performs slowest despite using exclusively castables and fedsql.
Only the left most table has over 1 million rows, the other tables are aggregation tables.
What should I do to speed up this query?
%let outcas=casuser;
proc cas;
source ETL_AC;
create table PUBLIC.X_TEST1{options replace=true} as
select a.*, b.Mean as mean_recipe, b.Std as std_recipe,
c.Mean as mean_batch_id, c.Std as std_batch_id,
d._NDis_ as _NDis_batch_id,
e._NDis_ as _NDis_recipe
from PUBLIC.X_TEST a
left join &outcas..SUMMARY_A1 b
on a.RECIPE=b.RECIPE and a._NAME_=b.column
left join &outcas..SUMMARY_A2 c
on a.batch_id=c.batch_id and a._NAME_=c.column
left join &outcas..NDIST_A2 d
on a.batch_id=d.batch_id and a._NAME_=d._column_
left join &outcas..NDIST_A1 e
on a.recipe=e.recipe and a._NAME_=e._column_ ;
endsource;
fedSQL.execDirect / query=ETL_AC
quit;
My log states:
I don't understand the timestamp comment. Is it the culprit of the poor performance?
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE
PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: Table X_TEST1 was created in caslib Public with 1295986 rows returned.
NOTE: PROCEDURE CAS used (Total process time):
real time 3:58.86
cpu time 0.43 seconds
I've found the culprit for the slow performance.
One of the columns had a length of 32167 ! despite only holding values at length 21 as maximum.
Now I've changed to varchar in a previous data step where I create the column.
Thanks for your help.
Hey @acordes! If you are working in a distributed CAS environment, you can try to make the small tables repeated tables prior to the join on the big table. This copies all rows of the small tables to all CAS nodes. This reduces the amount of shuffling needed, especially when joining a big table with small tables. I don't know if this will work in your case but it's worth a shot to see if it improves speed. For the DATA Step, this is the duplicate=yes option. For PROC CASUTIL or the loadTable action, this is the repeat option.
Examples:
data public.small(duplicate=yes);
set small;
run;
proc casutil;
load data=small outcaslib='public' repeat;
run;
For more information on this, see: CAS data distribution: DUPLICATE a REPLICATION using COPIES. Can you REPEAT?
Thanks @Stu_SAS , I've applied your trick. Nevertheless, the run time is more or less the same.
@acordes If you look into the link I've shared earlier you can see that FedSQL might not perform very well.
If you don't want to try the SAS data step with hash lookups then I'd suggest that you at least change your SQL along left joins with the same sort order.
Depending on how data is distributed over the worker nodes, data movement between nodes can consume quite a bit of time.
I have my doubts but who knows may-be the CAS is smart enough to move data blocks only once if there are two consecutive left joins that require the same sort order.
Also consider to run your FedSQL with the method option to get an idea how the query is performed. I would hope that at least your small tables use hash join.
If the relationship of table PUBLIC.X_TEST to the lookup tables is many:1 (or zero) then using a data step with hash lookups could perform better.
CAS Table Lookup (Left Outer Join) on SAS Viya
As long as the base table and the target table are in CAS the source for the hash tables doesn't need to be a CAS table for the process to execute in CAS.
Below untested code - but should be close.
%let outcas=casuser;
data PUBLIC.X_TEST1(drop=_rc);
if _N_ = 1 then do;
/* create variables */
if 0 then
do;
set
PUBLIC.X_TEST
&outcas..SUMMARY_A1(keep=column mean Std rename=(column=_name_ mean=mean_recipe Std=std_recipe))
&outcas..SUMMARY_A2(keep=column Mean Std rename=(column=_name_ Mean=mean_batch_id Std=std_batch_id))
&outcas..NDIST_A2 (keep=_column_ _NDis_ rename=(_column_=_name_ _NDis_=_NDis_batch_id))
&outcas..NDIST_A1 (keep= _column_ _NDis_ rename=(_column_=_name_ _NDis_=_NDis_recipe))
;
end;
/* define hash tables */
declare hash h_summary_a1(dataset: "&outcas..SUMMARY_A1(rename=(column=_name_ mean=mean_recipe Std=std_recipe))");
h_summary_a1.defineKey('RECIPE', '_name_');
h_summary_a1.defineData('mean_recipe', 'std_recipe');
h_summary_a1.defineDone();
declare hash h_summary_a2(dataset: "&outcas..SUMMARY_A2(rename=(column=_name_ Mean=mean_batch_id Std=std_batch_id))");
h_summary_a2.defineKey('batch_id', '_name_');
h_summary_a2.defineData('mean_batch_id', 'std_batch_id');
h_summary_a2.defineDone();
declare hash h_ndist_a2(dataset: "&outcas..NDIST_A2(rename=(_column_=_name_ _NDis_=_NDis_batch_id))");
h_ndist_a2.defineKey('batch_id', '_name_');
h_ndist_a2.defineData('_NDis_batch_id');
h_ndist_a2.defineDone();
declare hash h_ndist_a1(dataset: "&outcas..NDIST_A1(rename=(_column_=_name_ _NDis_=_NDis_recipe))");
h_ndist_a1.defineKey('recipe', '_name_');
h_ndist_a1.defineData('_NDis_recipe');
h_ndist_a1.defineDone();
end;
call missing(of _all_);
set PUBLIC.X_TEST;
/* lookup */
_rc = h_summary_a1.find();
_rc = h_summary_a2.find();
_rc = h_ndist_a2.find();
_rc = h_ndist_a1.find();
run;
I don't know how exactly this works but I have been told that as long as you only load the hash table without any further write operations (hash.add() etc.) there will only be a single instance of the hash table and not a copy per node.
@LinusH @Patrick @Stu_SAS Thanks to your advices.
I haven't tried the hash option, because I refrain from learning it 🙂
The best result in terms of speed comes from defining a customized action that, within the builtin action set, combines an aggregate action and a fedsql.
I share the code. Some variables of the right joined table in my problem statement are still missing, but I don't expect big performance issues.
I would say that I cut the execution time by factor 0.2.
cas mySession sessopts=(caslib="casuser");
/*create an action set called tblMakers*/
%let _sv_quotelenmax=%sysfunc(getoption(quotelenmax));
options noquotelenmax;
proc cas;
builtins.defineActionSet /
name="aggregator"
type="CASL"
actions={
{
name="aggr"
desc="Return aggregation"
parms={
{name="caslib" type="string" required=TRUE}
{name="table" type="string" required=TRUE}
{name="var1" type="string" required=TRUE}
{name="type1" type="string" required=TRUE}
{name="type2" type="string" required=false}
{name="gr1" type="string" required=TRUE}
{name="gr2" type="string" required=false}
{name="clause" type="string" required=false}
}
definition=
"
if exists('Clause') then whereClause=catx(' ','where',whereClause);
else Clause='1=1';
aggregation.aggregate /
table={ name=table,
caslib=caslib,
where=clause,
groupby={gr1, gr2}
},
casout={name='_have', replace=true}
varspecs={
{summarySubset={type1}, Name='measurement', columnnames={'_' || type1 || '_'}},
{summarySubset={type2}, Name='measurement', columnnames={'_' || type2 || '_'}}
},
savegroupbyraw=true, savegroupbyformat=false, raw=false;
"
}
{
name="joiner"
parms={ {name="tblName" type="STRING" required=True}
{name="table" type="STRING" required=True}
{name="caslib" type="STRING" required=True}
{name="type1" type="string" required=false}
{name="type2" type="string" required=false}
{name="gr1" type="string" required=TRUE}
{name="gr2" type="string" required=false}
{name="whereClause" type="STRING" required=False}
}
definition=
" if exists('whereClause') then whereClause=catx(' ','where',whereClause);
else whereClause='';
fedSQL.execDirect / query='create table casuser.'||tblName||' {options replace=true} as
select a.*, _'||type1||'_, _'||type2||'_
from '||caslib||'.'||table||' a inner join CASUSER._have b on a.'||gr1||'=b.'||gr1||' and a.'||gr2||'=b.'||gr2||' ';
"
}
};
run;
options &_sv_quotelenmax;
%let t1=std;
%let t2=mean;
%let g1=recipe;
%let g2=_name_;
/* run the tblMakers.tableCreateParam Action */
proc cas;
aggregator.aggr / table="TTT" caslib="public" var1="measurement" type1="&t1" type2="&t2" gr1="&g1" gr2="&g2" clause="datepart(datetime) > '13feb2025'd" ;
aggregator.joiner / tblName="want3" table="TTT" type1="&t1" type2="&t2" gr1="&g1" gr2="&g2" caslib="public";
run;
PROC CAS;
table.fetch / table={ caslib="casuser" name="want3", VARS={"&g1", "&g2", "_&t1._", "_&t2._"}} to=5000;
RUN;
I haven't tried the hash option, because I refrain from learning it
Well, your decision but you're certainly missing out. It's not that hard to understand.
I've found the culprit for the slow performance.
One of the columns had a length of 32167 ! despite only holding values at length 21 as maximum.
Now I've changed to varchar in a previous data step where I create the column.
Thanks for your help.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.