Re: D.I. Studio - etls_recordCheck macro issues

AndrewHowell · Posted 04-28-2014 09:23 AM

Greetings all:

SAS 9.3 / DI Studio 4.3 / Windows server / SQL-Server database

I am attempting to use the DI Studio transformations to create an SQL extract (small subset from a very large SQL-Server table).

The trouble is the generated code always inserts a "pre-code" etls_recordCheck macro to count the number of records from the table, along the lines of:

%macro etls_recordCheck;

%let etls_recCheckExist = %eval(%sysfunc(exist(SQL_DB.Table, DATA)) or

%sysfunc(exist(SQL_DB.Table, VIEW)));

%if (&etls_recCheckExist) %then

%do;

proc sql noprint;

select count(*) into :etls_recnt from SQL_DB.Table;

quit;

%end;

%mend etls_recordCheck;

%etls_recordCheck;

MPRINT(ETLS_RECORDCHECK): proc sql noprint;

MPRINT(ETLS_RECORDCHECK): select count(*) into :etls_recnt from SQL_DB.Table;

..which takes AGES, before then generating the desired SQL code (which only extracts the previous 24 hours of data from a table of many years..)

There is then some "post-code" to insert the etls_recnt value into a SAS performance macro.

The subset extract takes fractions of a second, however the inserted count(*) macro takes many minutes.

Questions:

Why count the entire table, when only the subset is processed? What value is the total table size to the performance stats when only the subset is actually extracted?
Beyond substituting all generated transformations with user-defined code omitting the macro, can anyone offer options?

Thanks,

Andrew.

gregorysnelson · Posted 04-28-2014 09:59 AM

Hi Andrew

This is disturbing behavior indeed! I can't stand the fact that they do a select count(*) from the table. I ran into this with a billion row table and needless to say, the DBA was not pleased. To get around this, I unmapped the input table but i am not sure that this is possible for all use cases. I personally think that this is a horrible bug.

-greg

AndrewHowell · Posted 04-29-2014 02:25 AM

Hi Greg,

The figure obtained in the inserted code at the start of the task is passed in a %perfstop() macro at the end of the task.

%perfstop(metrVal6=%sysfunc(max(&etls_recnt,-1)));

I understand the reason for logging performance metrics, however I cannot fathom the reason for storing the size of an entire table for an SQL subset extraction; surely it would make more sense to log the size of the subset extracted?

Also, there seems to be no ability to control the ETLS_RecordCheck macro code being inserted into the task. The ETLS_RecordCheck macro generates different code; below are examples from an SCD2 Loader, some Base data, some Oracle, some ODBC - but there does not appear to be the ability to surpress (or redefine) this macro. Why hard code the macro each time? Why not have an autocall macro with the smarts to test the data source to determine the most efficient way to obtain the metric? For example, interrogate an Oracle or SQL-Server's dictionary to obtain the value, rather than perform a count(*).

The macro is too "absolute" and inflexible.

Base

LIBNAME STG BASE "Z:\Data\Staging";

%let etls_recCheckExist = 0;
%let etls_recnt = 0;
%macro etls_recordCheck;
%let etls_recCheckExist = %eval(%sysfunc(exist(STG.ACCOUNTS, DATA)) or
%sysfunc(exist(STG.S_FINANCIAL_ACCOUNT, VIEW)));

%if (&etls_recCheckExist) %then
%do;
   %local etls_syntaxcheck;
   %let etls_syntaxcheck = %sysfunc(getoption(syntaxcheck));
   /* Turn off syntaxcheck option to perform following steps */
   options nosyntaxcheck;

proc contents data = STG.ACCOUNTS out = work.etls_contents(keep = nobs) noprint;
run;

   data _null_;
    set work.etls_contents (obs = 1);
    call symput("etls_recnt", left(put(nobs,32.)));
   run;

   proc datasets lib = work nolist nowarn memtype = (data view);
    delete etls_contents;
   quit;

/* Reset syntaxcheck option to previous setting */
options &etls_syntaxcheck;
%end;
%mend etls_recordCheck;
%etls_recordCheck;

BTW - why bother with PROC CONTENTS, PROC DATASETS, etc? What's wrong with:

data _null_;
set STG.ACCOUNTS(obs = 1) obs=nobs;
symputx("etls_recnt", nobs);
run;

ORACLE

LIBNAME MYORA ORACLE PATH=ORA_SVR SCHEMA=SAS_ORA AUTHDOMAIN="ORA_Auth" ;
%rcSet(&syslibrc);

%let etls_recCheckExist = 0;
%let etls_hasRows = 0;
%macro etls_recordCheck;
%let etls_recCheckExist = %eval(%sysfunc(exist(MYORA.ACCOUNTS, DATA)) or
%sysfunc(exist(MYORA.ACCOUNTS, VIEW)));

%if (&etls_recCheckExist) %then
%do;
%local etls_syntaxcheck;

   %let etls_syntaxcheck = %sysfunc(getoption(syntaxcheck));
   /* Turn off syntaxcheck option to perform following steps */
   options nosyntaxcheck;

   data _null_;
    set MYORA.ACCOUNTS( obs=1 );
    call symput("etls_hasRows",'1');
   run;
   /* Reset syntaxcheck option to previous setting */
   options &etls_syntaxcheck;
%end;
%mend etls_recordCheck;
%etls_recordCheck;

but in the same transformation I've also found this for an Oracle table:

%let etls_recnt = 0;
%macro etls_recordCheck;
%let etls_recCheckExist = %eval(%sysfunc(exist(MYORA.FINANCIALS, DATA)) or
%sysfunc(exist(MYORA.FINANCIALS, VIEW)));

%if (&etls_recCheckExist) %then
%do;
   proc sql noprint;
    select count(*) into :etls_recnt from MYORA.FINANCIALS;
   quit;
%end;
%mend etls_recordCheck;
%etls_recordCheck;

ODBC

LIBNAME MY_ODBC ODBC DATASRC="CORP" SCHEMA=DBO AUTHDOMAIN="Corp_Auth" ;

%let etls_recnt = 0;
%macro etls_recordCheck;
%let etls_recCheckExist = %eval(%sysfunc(exist(MY_ODBC.DAILY, DATA)) or
%sysfunc(exist(MY_ODBC.DAILY, VIEW)));

%if (&etls_recCheckExist) %then
%do;
   proc sql noprint;
    select count(*) into :etls_recnt from MY_ODBC.DAILY;
   quit;
%end;
%mend etls_recordCheck;
%etls_recordCheck;

LinusH · Posted 04-29-2014 07:33 AM

We have experienced a similar issue, but in our case &etls_recCheckExist did note generate any select count(*) code.

But, the exist function itself generated a select count(*) when the target object is a view...

So in our case, it'a SAS/ACCESS thing, not DI Studio.

We are using SQL Server via OLEDB.

SAS has reproduced this behavior, but claim it's feature, not a bug. So now they have put this as a "Future request", whatever that means/will lead to...

Data never sleeps

MikeFrost · Posted 05-02-2014 02:27 PM

what is the support track for this? I'm happy to investigate where this stands and let you know.

LinusH · Posted 05-05-2014 05:03 AM

Gr8!

7610822316

Slight correction, ACCESS to OLEDB uses OpenRowSet(), which in turn executes the view. And my interpretation is that this gives the same result (performance wise) as as a select count(*).

Data never sleeps

jdmarshg · Posted 12-15-2014 11:30 AM

Hi Mike has there been an update for this issue.

Regards,

Jonathan

TomKari · Posted 04-28-2014 10:23 AM

This really doesn't make any sense. My guess is that some performance monitoring facilities have been turned on; hopefully turning them off will get rid of this.

The consequences of doing a select count(*) on many brands of DBMS is well known, so it's hard to imagine that this is production code.

Try opening a track with SAS Tech Support. They have the specialized knowledge to help with this kind of thing.

Please repost when you find out more...we'd love to know what's up!

Tom

Quentin · Posted 04-28-2014 12:54 PM

I share your surprise/disappointment.

There was discussion in EG community recently about pros/cons to having EG's macro source code defined in the EG binary, rather than as autocall macros.

If the definition of %ETLS_RecordCheck were provided as an autocall macro, it would be easy to turn it off by modifying the macro definition. But I assume like the EG macros, the definition of the DIS macros is stored somewhere in a binary. Since the macro source code is being generated and then immediately invoked by the code generated by DI studio transformation, I think you're left hoping that there is a way to turn this off this level of monitoring, as Tom suggests.

Mostly I've been sticking to user-define code in my DI jobs. Which works well for a programmer, but perhaps not for the main intended DIS user.

--Q.

jdmarshg · Posted 12-15-2014 10:42 AM

Has there been a resolution for this?

Regards,

Jonathan

AndrewHowell · Posted 12-15-2014 06:39 PM

Jonathan,

No update on my part - we chose to retain the existing process rather than start replacing them with lots of user-written code, but would like to see the transformation updated (corrected?) to be more flexible in how it gathers its stats.

Regards,

Andrew.

MikeFrost · Posted 12-17-2014 01:58 PM

Sorry for the delay in my response to this thread. There are a lot of angles to this that I wanted to make sure we address.

First, the original issue reported here. To summarize, Andrew Howell's comment concerns the "etls_recordCheck" macro - he notes that DI Studio always seems to insert this macro into the code, which passes a :select count(*) statement to SQL Server to count the rows in a table. This statement takes a very long time to return.

From what I can tell, Andrew would ultimately like a way to control whether the "etls_recordCheck" macro" is inserted into the code. This can be done by turning off the option that enables it, which is "Collect Table Statistics". This option is enabled by default but can be disabled for a given job in DI Studio by right-clicking within the job diagram and unchecking the option "Collect Table Statistics". This can also be turned off for any job in the preferences for DI Studio by going to the "Tools" menu, selecting "Options", clicking on the "Job Editor" tab, and in the middle of the dialog box, unchecking the option "Collect Table Statistics".

A secondary questions concerns how the code that relates to this option is generated. The code for this must be capable of running against multiple data sources, so it needs to be generic for any data source. We could enhance the option to first query to determine the data source, and then pass a more data source-specific query to collect these table statistics, but understand that we would be making one or more roundtrips to the database to run a macro that can easily be disabled when it is not desirable and those roundtrips impact performance as well. Also, take into account that if we do this for SQL Server, Oracle, etc. customers will expect us to make and maintain these database-specific changes for this setting for each and every data source they could be accessing, and keep them current for database changes into the future. Because of this, we think it is preferable for those customers who want to have more fine-grained controls over this table statistics code to make code changes in it themselves to suit the unique needs of their own environments.

Finally, LinusH's issue concerns the use of the ACCESS engine for OLEDB, which, as he correctly notes, is caused by is the use of OpenRowset () against the views. The reason why this is not a bug is that OpenRowset () will perform this way when you run this query natively - in other words, the query runs as fast it would outside of SAS code. One way to make this query faster, however, would be to set the property DBPROP_MAXROWS and specify an upper limit of the number of rows returned to something manageable, like 100, 1000, etc. We do not expose a way in the ACCESS engine to set this property, thus the "enhancement" nature of this, an enhancement that we have thus far elected not to implement.

I hope that provides some clarification. Please post any additional questions about these issues.

Mike F.

AndrewHowell · Posted 12-18-2014 01:08 AM

Hi Mike,

Firstly, thanks for the thorough response.

It's not that I mind the %etls_recordCheck() macro being inserted; I understand it's purpose in collecting statistics. What I cannot understand is why an extraction of (for example) one day's data from a multi-year historic table is preceeded by an %etls_recordCheck() macro which performs a count(*) of the ENTIRE table - not only is this time-consuming but I cannot see any value in this statistic; surely it is the count(*) of the extracted records which is of value.

As to how dynamic to make the underlying code embedded within the %etls_recordCheck() macro - I concur the "multiple roundtrips" to the data source would be undesirable. My thoughts were to interrogate the ENGINE field from the dictionary.libnames table to determine the type of data source, and from that construct the most efficient datasource-specific code.

Andrew.

Patrick · Posted 12-18-2014 05:39 AM

Why don't you post this as an "idea". I will get my vote.

ScottBass · Posted 12-18-2014 06:54 AM

One idea I've had for our current ETL is a metric collecting job at the end of the job stream, that would use the dictionary tables to capture the record count of all our tables altered by the ETL. Since it would regularly capture the record count after each run, the records added each day (for the daily jobs) would be the difference between the two dates (assuming records are not deleted by the ETL). It would be easy to graph data growth over time, etc. It would also be easy to have this metadata driven; simply have a table for the datasets you wanted to track (we'd probably just track all the tables in key libraries).

One minor caveat is it would only collect the metrics at the end of the successful run of all jobs, rather than individually after each job. If that was an issue, just alter this approach to process a single table, and add that macro to the end of all your jobs.

IMO it would be trivial to create such a job. If this meets (or exceeds) the metrics collected by %etls_recordCheck, you could just turn off the "Collect Table Statistics" for all your jobs, and take matters into your own hands, rather than run the inefficient code in %etls_recordCheck.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

Registration is open