BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
acordes
Rhodochrosite | Level 12

I wonder why my query takes nearly 4 minutes for completion. 

It's the part of my code which performs slowest despite using exclusively castables and fedsql.

Only the left most table has over 1 million rows, the other tables are aggregation tables.

 

What should I do to speed up this query?

 



%let outcas=casuser;

proc cas;
source ETL_AC;
		create table PUBLIC.X_TEST1{options replace=true} as 
			select a.*, b.Mean as mean_recipe, b.Std as std_recipe,
						c.Mean as mean_batch_id, c.Std as std_batch_id,
						d._NDis_ as _NDis_batch_id,
						e._NDis_ as _NDis_recipe		
               from PUBLIC.X_TEST a 
					left join &outcas..SUMMARY_A1 b 
						on a.RECIPE=b.RECIPE and a._NAME_=b.column
					left join &outcas..SUMMARY_A2 c 
						on a.batch_id=c.batch_id and a._NAME_=c.column
					left join &outcas..NDIST_A2 d 
						on a.batch_id=d.batch_id and a._NAME_=d._column_
					left join &outcas..NDIST_A1 e 
						on a.recipe=e.recipe and a._NAME_=e._column_    ;
endsource;
fedSQL.execDirect / query=ETL_AC
quit;

My log states:

I don't understand the timestamp comment. Is it the culprit of the poor performance?

 

NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: CASDAL driver. Creation of a TIMESTAMP column has been requested, but is not supported by the CASDAL driver. A DOUBLE 
      PRECISION column will be created instead. A DATETIME format will be associated with the column.
NOTE: Table X_TEST1 was created in caslib Public with 1295986 rows returned.
NOTE: PROCEDURE CAS used (Total process time):
      real time           3:58.86
      cpu time            0.43 seconds
1 ACCEPTED SOLUTION

Accepted Solutions
acordes
Rhodochrosite | Level 12

I've found the culprit for the slow performance. 

One of the columns had a length of 32167 ! despite only holding values at length 21 as maximum. 

Now I've changed to varchar in a previous data step where I create the column.

Thanks for your help. 

View solution in original post

9 REPLIES 9
Stu_SAS
SAS Employee

Hey @acordes! If you are working in a distributed CAS environment, you can try to make the small tables repeated tables prior to the join on the big table. This copies all rows of the small tables to all CAS nodes. This reduces the amount of shuffling needed, especially when joining a big table with small tables. I don't know if this will work in your case but it's worth a shot to see if it improves speed. For the DATA Step, this is the duplicate=yes option. For PROC CASUTIL or the loadTable action, this is the repeat option.

 

Examples:

 

data public.small(duplicate=yes);
    set small;
run;

proc casutil;
    load data=small outcaslib='public' repeat;
run;

 

 

For more information on this, see: CAS data distribution: DUPLICATE a REPLICATION using COPIES. Can you REPEAT? 

acordes
Rhodochrosite | Level 12

Thanks @Stu_SAS , I've applied your trick. Nevertheless, the run time is more or less the same. 

Patrick
Opal | Level 21

@acordes If you look into the link I've shared earlier you can see that FedSQL might not perform very well. 

If you don't want to try the SAS data step with hash lookups then I'd suggest that you at least change your SQL along left joins with the same sort order.

Patrick_0-1739958440589.png

Depending on how data is distributed over the worker nodes, data movement between nodes can consume quite a bit of time.

I have my doubts but who knows may-be the CAS is smart enough to move data blocks only once if there are two consecutive left joins that require the same sort order. 

 

Also consider to run your FedSQL with the method option to get an idea how the query is performed. I would hope that at least your small tables use hash join.

Patrick_1-1739959033037.png

 

 

Patrick
Opal | Level 21

If the relationship of table PUBLIC.X_TEST to the lookup tables is many:1 (or zero) then using a data step with hash lookups could perform better. 
CAS Table Lookup (Left Outer Join) on SAS Viya

As long as the base table and the target table are in CAS the source for the hash tables doesn't need to be a CAS table for the process to execute in CAS.

Below untested code - but should be close.

%let outcas=casuser;

data PUBLIC.X_TEST1(drop=_rc);
    if _N_ = 1 then do;
        /* create variables */
        if 0 then
          do;
            set 
              PUBLIC.X_TEST
              &outcas..SUMMARY_A1(keep=column mean Std  rename=(column=_name_   mean=mean_recipe   Std=std_recipe))
              &outcas..SUMMARY_A2(keep=column Mean Std  rename=(column=_name_   Mean=mean_batch_id Std=std_batch_id))
              &outcas..NDIST_A2  (keep=_column_ _NDis_  rename=(_column_=_name_ _NDis_=_NDis_batch_id))
              &outcas..NDIST_A1  (keep= _column_ _NDis_ rename=(_column_=_name_ _NDis_=_NDis_recipe))
              ;
          end;
        /* define hash tables */
        declare hash h_summary_a1(dataset: "&outcas..SUMMARY_A1(rename=(column=_name_ mean=mean_recipe Std=std_recipe))");
        h_summary_a1.defineKey('RECIPE', '_name_');
        h_summary_a1.defineData('mean_recipe', 'std_recipe');
        h_summary_a1.defineDone();

        declare hash h_summary_a2(dataset: "&outcas..SUMMARY_A2(rename=(column=_name_ Mean=mean_batch_id Std=std_batch_id))");
        h_summary_a2.defineKey('batch_id', '_name_');
        h_summary_a2.defineData('mean_batch_id', 'std_batch_id');
        h_summary_a2.defineDone();

        declare hash h_ndist_a2(dataset: "&outcas..NDIST_A2(rename=(_column_=_name_ _NDis_=_NDis_batch_id))");
        h_ndist_a2.defineKey('batch_id', '_name_');
        h_ndist_a2.defineData('_NDis_batch_id');
        h_ndist_a2.defineDone();

        declare hash h_ndist_a1(dataset: "&outcas..NDIST_A1(rename=(_column_=_name_ _NDis_=_NDis_recipe))");
        h_ndist_a1.defineKey('recipe', '_name_');
        h_ndist_a1.defineData('_NDis_recipe');
        h_ndist_a1.defineDone();
    end;
    call missing(of _all_);

    set PUBLIC.X_TEST;

    /* lookup */
    _rc = h_summary_a1.find();
    _rc = h_summary_a2.find();
    _rc = h_ndist_a2.find();
    _rc = h_ndist_a1.find();

run;

I don't know how exactly this works but I have been told that as long as you only load the hash table without any further write operations (hash.add() etc.) there will only be a single instance of the hash table and not a copy per node.

 

LinusH
Tourmaline | Level 20
This sounds insanely slow. A corresponding join in PROC SQL would be done in under lets say 20 seconds. What happens if you try the query in SAS Compute?
Data never sleeps
acordes
Rhodochrosite | Level 12

@LinusH  @Patrick  @Stu_SAS Thanks to your advices.

 

I haven't tried the hash option, because I refrain from learning it 🙂

 

The best result in terms of speed comes from defining a customized action that, within the builtin action set,  combines an aggregate action and a fedsql. 

 

I share the code. Some variables of the right joined table in my problem statement are still missing, but I don't expect big performance issues. 

I would say that I cut the execution time by factor 0.2. 

 

cas mySession sessopts=(caslib="casuser");
/*create an action set called tblMakers*/
%let _sv_quotelenmax=%sysfunc(getoption(quotelenmax));
options noquotelenmax;
proc cas;
  builtins.defineActionSet / 
    name="aggregator" 
    type="CASL"
    actions={
      {
        name="aggr" 
		desc="Return aggregation"
		parms={
		{name="caslib" type="string" required=TRUE}
		{name="table" type="string" required=TRUE}
		{name="var1" type="string" required=TRUE}
		{name="type1" type="string" required=TRUE}
		{name="type2" type="string" required=false}
		{name="gr1" type="string" required=TRUE}
		{name="gr2" type="string" required=false}
		{name="clause" type="string" required=false}
              }
definition=
" 
if exists('Clause') then whereClause=catx(' ','where',whereClause);
            else Clause='1=1';
aggregation.aggregate /
    table={ name=table,
caslib=caslib, 
where=clause,
groupby={gr1, gr2}
}, 
casout={name='_have', replace=true}
varspecs={
{summarySubset={type1}, Name='measurement', columnnames={'_' || type1 || '_'}}, 
{summarySubset={type2}, Name='measurement', columnnames={'_' || type2 || '_'}}
}, 
savegroupbyraw=true, savegroupbyformat=false, raw=false;
"
}

      {
        name="joiner" 
        parms={ {name="tblName" type="STRING" required=True}
				{name="table" type="STRING" required=True}
				{name="caslib" type="STRING" required=True}	
				{name="type1" type="string" required=false}
				{name="type2" type="string" required=false}	
				{name="gr1" type="string" required=TRUE}
				{name="gr2" type="string" required=false}
                {name="whereClause" type="STRING" required=False}
              } 
        definition=
          " if exists('whereClause') then whereClause=catx(' ','where',whereClause);
            else whereClause='';
            fedSQL.execDirect / query='create table casuser.'||tblName||' {options replace=true} as
              select a.*, _'||type1||'_, _'||type2||'_
              from '||caslib||'.'||table||' a inner join CASUSER._have b on a.'||gr1||'=b.'||gr1||' and a.'||gr2||'=b.'||gr2||' ';
          "
      }
    };
run;

options &_sv_quotelenmax;

%let t1=std;
%let t2=mean;
%let g1=recipe;
%let g2=_name_;

/* run the tblMakers.tableCreateParam Action */
proc cas;
  aggregator.aggr / table="TTT" caslib="public" var1="measurement" type1="&t1" type2="&t2" gr1="&g1" gr2="&g2" clause="datepart(datetime) > '13feb2025'd" ;
  aggregator.joiner / tblName="want3" table="TTT" type1="&t1" type2="&t2" gr1="&g1" gr2="&g2" caslib="public";
run;

PROC CAS;
  table.fetch / table={ caslib="casuser" name="want3", VARS={"&g1", "&g2", "_&t1._", "_&t2._"}} to=5000;
RUN;

 

 

Patrick
Opal | Level 21

@acordes 

I haven't tried the hash option, because I refrain from learning it 

Well, your decision but you're certainly missing out. It's not that hard to understand.

 

acordes
Rhodochrosite | Level 12

I've found the culprit for the slow performance. 

One of the columns had a length of 32167 ! despite only holding values at length 21 as maximum. 

Now I've changed to varchar in a previous data step where I create the column.

Thanks for your help. 

Stu_SAS
SAS Employee
Ahhh yep, that'll do it! That table was much bigger than it should've been. Always good to use varchar where possible to help with this!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2126 views
  • 4 likes
  • 4 in conversation