Filfrat
Calcite | Level 5

Hi SAS,

The Live ETL process seems to check data availability. In fact, if a flow has been registered with input table X, and table X is not present at the time of the last Live ETL, then instance creation for that flow fails: an error is raised saying that the table is not there, so flow creation fails.

 

Is there a system parameter or option to make Live ETL *NOT* check data availability?

 

I am asking because, in our case, some input tables might not be present in the landing area at the time of the last Live ETL, since tables arrive progressively in our system. We would therefore rather not perform a new Live ETL each time a new set of tables becomes available. Instead, we would like to save tables directly in the landing area, rather than going via the normal route input area -> Live ETL -> landing area. This would save time and batch processing.

 

In short, we would rather not perform a Live ETL just for data availability. We would like to run Live ETL only when infrastructural changes are needed (such as new flows, a new base date, a new config set, and so on). The data availability check performed by Live ETL is redundant anyway, since SAS will fail regardless if the data is not there.

 

So, once again: can we 'live ETL' without checking data availability, so that flows will start even if input data is not present? Is there an IRM system parameter for that?

 

Thanks.

 

 

7 REPLIES
ArnoldP
Obsidian | Level 7

Hi,

Nice question and idea. I have also been wondering about something like this for a long time. Maybe some SAS expert can give us some insight into Live ETL. Looking forward to some answers.

Cheers Arnold

sbxtog
SAS Employee

Hi, 

Live ETL just copies all files from the input area to the landing area. As far as I know, it doesn't check whether all tables are there. However, IRM won't run without all the necessary tables in the landing area: you won't be able to create an instance.

Patrick
Opal | Level 21

To my knowledge (and implementation experience), Live ETL just copies objects from input_area to landing_area.

It will outdate any existing IRM instance for which newer data gets copied to landing_area - which in my opinion is the correct behavior.

It will not fail if some of the data required by an IRM flow doesn't exist yet.

 

If you're building IRM flows in Dev under your Personal Federated Area (PFA) using the scripting host, then yes, the source data must exist in the landing_area folder under your PFA. But that's a different use case.

 

I know of sites that decided to push the data directly to the landing_area folder, with the exception of the entity table. They still execute Live ETL for the entity table, and they run this Live ETL at the end of the upstream process, prior to creating any new instance for the new month.

 

For the site where I'm currently working, we decided to stay as close to the documented process as possible. I've created a macro that deletes any object in the input_area folder with a date the same as or older than the matching object in the landing_area folder, and only then triggers Live ETL. We still trigger Live ETL whenever we need to push new data to the landing_area folder and haven't observed any issues yet.
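This is only a sketch of the idea, not the actual macro: the folder paths are placeholders, and the 'Last Modified' FINFO item name and its date format are OS-dependent assumptions (shown here for UNIX):

  %let input_area   = /sas/irm/input_area;    /* placeholder path */
  %let landing_area = /sas/irm/landing_area;  /* placeholder path */

  filename indir "&input_area.";

  data _null_;
    length member in_path land_path $512;
    did = dopen('indir');                     /* browse input_area */
    if did <= 0 then stop;
    do i = 1 to dnum(did);
      call missing(in_mod, lan_mod);
      member    = dread(did, i);
      in_path   = catx('/', "&input_area.", member);
      land_path = catx('/', "&landing_area.", member);
      if fileexist(land_path) then do;
        rc = filename('inf',  in_path);
        rc = filename('lanf', land_path);
        fid_in  = fopen('inf');
        fid_lan = fopen('lanf');
        if fid_in > 0 and fid_lan > 0 then do;
          /* modification timestamps; item name/format are OS-dependent */
          in_mod  = input(finfo(fid_in,  'Last Modified'), anydtdtm32.);
          lan_mod = input(finfo(fid_lan, 'Last Modified'), anydtdtm32.);
        end;
        if fid_in  > 0 then rc = fclose(fid_in);
        if fid_lan > 0 then rc = fclose(fid_lan);
        /* drop the input copy when it is not newer than the landing copy */
        if n(in_mod, lan_mod) = 2 and in_mod <= lan_mod then
          rc = fdelete('inf');
      end;
    end;
    rc = dclose(did);
  run;

  /* ...and only then trigger Live ETL (site-specific mechanism) */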

 

You also mention changing IRM flows (the definition). That's a totally different use case, and something that shouldn't be required for SAS-provided content packs.

Please be aware that you can't push a change to an existing IRM flow (the definition) under the same name to the same Federated Area without deleting ALL existing instances for any date that were created using the previous IRM definition.

If you really need to change an IRM flow definition, then I suggest you ask a new question about what you need to consider when deploying such a change. I'd also recommend that you don't change any IRM flow definition of a SAS-provided content pack unless absolutely necessary, and only after raising a Tech Support track to get some guidance.

Changing an IRM flow definition in a SAS provided content pack is changing the product with potentially significant impact on both SAS support and migration effort to newer versions.

sinpad
SAS Employee

Nice response, @Patrick! You have touched on so many good points and best practices which, when not adhered to, could result in issues that might be very difficult to resolve.

Filfrat
Calcite | Level 5

Hi.

 

Thanks for the response. I am going to give more details.

We use IRM for flows that we built ourselves, related to solutions outside what SAS provides. That is, we have custom processing units written in the SAS language that exchange data with the solution packages provided by SAS, so it makes sense to run those custom units in IRM. For these custom units, we create the flows from scratch and modify them as needed.

 

Let's denote our new custom processing unit in IRM as XX. What happens is the following. We build the XX flow with all required nodes and headers. We restart the IRM server to get the flow into the system. We test it and it runs like a charm: we can create instances and they run to the end.

Then the next month comes. We need a Live ETL to refresh the entity table as well as the links to the base date. We do it, say, on the 1st of the month. Data for flow XX, however, is not yet there on the 1st; it arrives on the 6th calendar day, and we save it directly into the landing area. We then want to start the XX instance with the irm_rest_create_jobflow_instance macro (...XX...) provided by SAS. The macro fails, saying "Input data is not there". Bang.

 

The whole point is that on the 6th calendar day, when we want to start the instance, the input data IS there.

However, at the last Live ETL, which happened on the 1st calendar day, the input data WAS NOT there.
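One way to demonstrate the mismatch is a plain file-system check just before the call. In this hedged sketch the path and table name are placeholders, and the actual parameters of irm_rest_create_jobflow_instance are site-specific, so the call is left commented out:

  %let landing_area = /sas/irm/landing_area;     /* placeholder path */
  %let xx_table     = xx_input_table.sas7bdat;   /* placeholder name */

  %macro check_then_create;
    %if %sysfunc(fileexist(&landing_area./&xx_table.)) %then %do;
      %put NOTE: &xx_table. is physically present in the landing area.;
      /* ...and yet instance creation still fails until a new Live ETL runs: */
      %* irm_rest_create_jobflow_instance(...XX...);
    %end;
    %else %put ERROR: &xx_table. not found in &landing_area.;
  %mend check_then_create;
  %check_then_create;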

 

Therefore, somehow, it looks as if Live ETL checks data availability and keeps track of which flows can be run (data is there) and which cannot (data is not there). This information seems to be written 'in stone' somewhere in the Postgres database, so we cannot start those instances for which data WAS NOT there at the last Live ETL. As soon as we perform a new Live ETL, we can start the instance.

 

So, I would like to see a parameter somewhere in the IRM system files that says:

  check_data_availability_in_Live_ETL = Y

so that I can turn it off:

  check_data_availability_in_Live_ETL = N

 

or something similar.

 

Best regards and thanks for replying.

sinpad
SAS Employee
To my knowledge there is no such parameter.
Patrick
Opal | Level 21

@Filfrat 

I'm starting to ask myself if we're working at the same site. The one I'm at also uses custom solutions and also gets data delivered on different calendar days.

You can run Live ETL multiple times in the same month with no issue. Only IRM instances for which new data gets dropped into the input_area folder get outdated.

 

If you don't run Live ETL but drop the data directly into the landing_area folder, then you're not following the documented process and thus shouldn't complain if things aren't working as desired.

If you drop the new data into the input_area folder and then trigger Live ETL, things will work. That's what we are doing.
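For completeness, a minimal sketch of that documented route, with placeholder paths (RECFM=N makes FCOPY do a byte-for-byte copy; how Live ETL is triggered afterwards is site-specific):

  /* copy the newly arrived table into input_area (paths are placeholders) */
  filename src  "/staging/xx_input_table.sas7bdat" recfm=n;
  filename dest "/sas/irm/input_area/xx_input_table.sas7bdat" recfm=n;

  data _null_;
    rc = fcopy('src', 'dest');            /* 0 = success */
    if rc ne 0 then put 'ERROR: copy failed: ' rc=;
  run;

  /* ...then trigger Live ETL so IRM registers the new data */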

 

You say that you're also building custom IRM flows. The challenge you will face is when you need to update an already deployed and used IRM flow definition.
Basically: You can't deploy a change to an existing IRM flow definition into the same FA and with the same name without deleting all instances created using the previous version.