SAS IRM Job Flow Processing Using Data Object Pooling


SAS Infrastructure for Risk Management (SAS IRM) is a key product on the SAS Risk Stratum Platform for performing complex computations and step-by-step job execution. Essentially, it's a high-performance job execution engine designed to integrate with risk-specific industry solutions such as SAS Solution for Current Expected Credit Loss (CECL), SAS Solution for IFRS 9, and SAS Solution for IFRS 17. It provides a transparent computing environment that is easy to use, traceable, well documented, and flexible.

 

Data Object Pooling in SAS IRM

SAS IRM achieves its featured high-performance execution in various ways, including data partitioning, data object pooling, Live ETL, GPU processing, and task orchestration. In this post, we present a broad overview of one of these techniques: data object pooling.

 

In SAS IRM, we define a set of individual tasks, where each task has a set of inputs and outputs. We then declare the relationships between these tasks (that is, which task comes first, which one comes next, which one is last, and so on). Together, these tasks along with their inputs and outputs comprise jobs, and SAS IRM schedules the jobs and executes them by leveraging the high-performance methodologies mentioned earlier.
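To make this structure concrete, here is a minimal sketch in Python (not actual SAS IRM syntax; all names are hypothetical) of how a job flow can be modeled as tasks with declared inputs and outputs:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """One unit of work in a job flow, with named inputs and outputs."""
    name: str
    inputs: frozenset    # names of the data objects this task reads
    outputs: frozenset   # names of the data objects this task writes

# A job flow is a collection of tasks; the ordering is implied by
# matching one task's outputs to another task's inputs.
job_flow = [
    Task("get_cardinality",
         inputs=frozenset({"staging.cashflows"}),
         outputs=frozenset({"mk_card.card_cashflows_byn"})),
    Task("partition_cashflows",
         inputs=frozenset({"staging.cashflows", "mk_card.card_cashflows_byn"}),
         outputs=frozenset({"mk_part.cashflows_partitioned"})),
]
```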

 

The most important aspect of SAS IRM is its capability to figure out the best way to orchestrate tasks in parallel. In a single-machine configuration, all the code and all the data reside on that one machine; because the data never needs to move between machines, processing is faster. In a multiple-machine configuration, SAS IRM tries to determine the best place to execute each task in a job flow by minimizing the movement of data across servers. One way SAS IRM accomplishes this is by first determining which tasks in the current job are independent of other tasks, that is, which tasks can be executed simultaneously or in parallel, thereby increasing computational efficiency.
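As an illustration of this dependency analysis, the sketch below (continuing the hypothetical Task model above, not SAS IRM's scheduler) groups a flow into levels such that every task in a level depends only on outputs produced in earlier levels; tasks within one level are mutually independent and can run in parallel:

```python
def parallel_levels(tasks):
    """Group tasks into dependency levels; each level can run in parallel."""
    internal = {o for t in tasks for o in t.outputs}  # objects produced in-flow
    produced, remaining, levels = set(), list(tasks), []
    while remaining:
        # A task is ready when every in-flow input it needs is available;
        # inputs produced by no task (e.g., staging tables) are external.
        ready = [t for t in remaining
                 if all(i in produced or i not in internal for i in t.inputs)]
        if not ready:
            raise ValueError("cyclic dependency in job flow")
        levels.append(ready)
        produced |= {o for t in ready for o in t.outputs}
        remaining = [t for t in remaining if t not in ready]
    return levels

# For the two-task flow above this yields two levels:
# [[get_cardinality], [partition_cashflows]].
```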

 

Another way SAS IRM increases computational efficiency is by pooling data objects between shared computational nodes. When we create multiple instances of the same job in SAS IRM, the entire job with all its calculations is not re-executed for the second and subsequent instances. This is where data object pooling comes into play: a technique used in high-performance computing to dramatically accelerate computation times.

 

SAS IRM recomputes something only if it has changed for the current instance. For example, if two job flow instances contain the same nodes/tasks, the results of the first instance are stored and reused when the second instance is executed. This speeds up calculations. If we change the input data before executing the second instance, SAS IRM reruns only those nodes/tasks of the second instance that are impacted by the changes.

 

How does data object pooling improve performance in SAS IRM?

SAS IRM keeps track of all the data objects, as well as the history of every single data object that is created. If SAS IRM determines, before the start of an execution, that the input data for a particular node/task has not changed, it will not rerun the calculations for that node/task; it will simply return the results of the previous execution.
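This bookkeeping behaves like a content-signature cache. The following Python sketch illustrates the idea only; it is not SAS IRM's internal implementation. Results are keyed by a hash of the task and its input data, so an unchanged task is served from the pool:

```python
import hashlib
import pickle

pool = {}  # signature -> result of a previous execution

def signature(task_name, *input_tables):
    """Digital signature of a task run: hash of task name plus input contents."""
    h = hashlib.sha256(task_name.encode())
    for table in input_tables:
        h.update(pickle.dumps(table))
    return h.hexdigest()

def run_pooled(task_name, compute, *input_tables):
    """Serve pooled results when the signature is known; recompute otherwise."""
    sig = signature(task_name, *input_tables)
    if sig not in pool:                      # first run, or an input changed
        pool[sig] = compute(*input_tables)   # execute the task for real
    return pool[sig]
```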

 

SAS IRM maintains a pool of tables that stores the results of previous job executions. Depending on what we ask it to execute next, it determines whether the required results have already been computed and stored in a designated pool area. If it does not have the required results, it triggers another execution, but only for the portions of the job that are new.

 

Another feature of data object pooling is that it applies across multiple users. To understand this, assume that UserA has executed a job and obtained the relevant results. If UserB then runs an identical job, UserB gets the results instantaneously, because SAS IRM takes the results from UserA's execution of the same job and provides them to UserB. Hence, data object pooling in SAS IRM enables sharing of results across all users, which again speeds up executions.
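With a cache like the hypothetical run_pooled sketch above, this sharing falls out naturally: the signature depends only on the task and its inputs, not on who submits the job, so an identical run by a second user is a pool hit:

```python
cashflows = [("loan1", 0.05), ("loan2", 0.07)]  # toy input table

run_pooled("get_cardinality", len, cashflows)   # UserA: computed and pooled
run_pooled("get_cardinality", len, cashflows)   # UserB: served from the pool
```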

 

The improvement in performance through data object pooling does not come at the expense of security. Suppose UserA has the required permissions to access data that contains sensitive information. When UserA executes a job flow instance on this sensitive data, the output displays the results for all the rows in the data. Now suppose UserB runs the same job using the same sensitive data, but UserB does not have the credentials to access the same sensitive rows as UserA. As usual, data object pooling speeds up the execution of the job flow instance for UserB, but when the results borrowed from UserA's execution are displayed to UserB, SAS IRM first filters out the information related to the sensitive rows.
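One way to reconcile pooling with row-level security, again only a sketch of the idea rather than SAS IRM's actual mechanism, is to keep the pooled result complete and apply the requesting user's row filter at display time:

```python
def visible_rows(pooled_result, user_allowed):
    """Filter a complete pooled result down to the rows a user may see.
    The pool keeps one full copy; each user receives a filtered view."""
    return [row for row in pooled_result if user_allowed(row)]

pooled = [("loan1", 0.05), ("loan2", 0.07)]                        # full pooled result
user_b_view = visible_rows(pooled, lambda row: row[0] != "loan1")  # sensitive row hidden
```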

 

How does SAS IRM perform data object pooling under the hood?

Every time we create a job instance, SAS IRM creates a subfolder named after the Instance ID within the data folder in the persistent area (pa). The default location of the persistent area is given below:

  • In Windows: \SAS-configuration-directory\Lev<n>\AppData\SASIRM\pa
  • In Linux: /SAS-configuration-directory/Lev<n>/AppData/SASIRM/pa

Let’s look at a scenario where we start by creating the first instance of a job flow. After the instance has been executed in SAS IRM, we can obtain the Instance ID of the job flow instance and then look for a folder with the same name as the Instance ID in the pa (persistent area) folder.

 

[Figure: sd_1_SASIRMJobFlow06_1.png (the executed job flow instance and its Instance ID)]

 

[Figure: sd_2_SASIRM_PersistentArea01_1.png (the Instance ID subfolder in the persistent area)]

 

The first instance of our job flow (Simple Partition Flow byN) is allotted an Instance ID=1653076037 (this will be different for different installations). So, we should be able to find the 1653076037 folder within the data folder in the persistent area.

 

When we run multiple instances of this job flow, we might expect to end up with multiple copies of the same subfolder and multiple copies of the same output tables. But when we look within the data folder in the persistent area, we do not see duplicates of the 1653076037 subfolder.

 

To understand why this is the case, let us take a look at the job flow we have considered (the explanation would be no different with another job flow). Our job flow has four separate tasks: task 2 depends on task 1, task 3 depends on task 2, and task 4 depends on task 3. As a result, SAS IRM executes all four tasks on the same machine to minimize data movement across servers and maximize disk caching benefits.

 

[Figure: sd_3_SASIRMJobFlow01_1.png (the Simple Partition Flow byN job flow diagram)]

 

 

[Figure: sd_4_SASIRM_Out_1_1.png (inputs and outputs of the job flow tasks)]

 

Task 1 (Get Cardinality Cash Flows ByN) calculates the cardinality for the cash flows table based on a user-specified number of partitions (in this case, 50), while task 2 (Partition Cash Flows ByN) partitions the cash flows table into 50 partitions. The output of task 1 is used as one of the inputs to task 2. At the same time, both tasks use the cash flows table (Fact table for cash flows) as their other input. Tasks 3 and 4 can be described in a similar fashion (that is, the purpose of each task, its inputs, and its outputs).


 

[Figure: sd_5_SASIRM_task1_Out1_02.png (output of task 1)]

 

Furthermore, the output of task 1 is a SAS data set called card_cashflows_byn.sas7bdat, and this is saved in the mk_card library. The mk_card library maps onto a directory/folder called mk_card in the data folder within the persistent area.

 

As noted earlier, if we run this job instance multiple times, we do not get multiple copies of the same output from these executions; we have only one copy of this output (mk_card.card_cashflows_byn.sas7bdat) in the mk_card directory/folder.

 

For a closer look, when we open the mk_card folder under the 1653076037 folder, we notice that it does not contain any data. Rather, it contains a data object link that points to where the actual table is stored in the persistent area, so there is no question of the output tables being overwritten by their latest versions in the mk_card (or some other) folder. In fact, we can right-click the data object link in the mk_card folder and choose Open file location to reveal where the actual table is stored. Once we click Open file location, we are taken to a directory/folder (in this case, _30) in the pool subfolder of the persistent area, and we also notice that the actual table name is different from what we saw in the job flow diagram.
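This link-into-the-pool layout can be mimicked with content-addressed files and symbolic links. The sketch below is a rough analogy with hypothetical paths; SAS IRM's actual on-disk format may differ:

```python
import hashlib
from pathlib import Path

def store_in_pool(pa: Path, instance_id: str, lib: str, name: str, data: bytes) -> Path:
    """Write a table once into the pool, keyed by its content hash, and
    place a link in the instance's library folder that points to it."""
    digest = hashlib.sha256(data).hexdigest()
    pooled = pa / "pool" / ("_" + digest[:8]) / name   # one physical copy per signature
    pooled.parent.mkdir(parents=True, exist_ok=True)
    if not pooled.exists():                            # identical content is never duplicated
        pooled.write_bytes(data)
    link = pa / "data" / instance_id / lib / name      # what the job flow instance sees
    link.parent.mkdir(parents=True, exist_ok=True)
    if not link.exists():
        link.symlink_to(pooled)
    return pooled
```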

 

[Figure: sd_6_SASIRM_Out_1_oploc_Inst01_1.png (Open file location on the data object link, instance 1)]
[Figure: sd_8_SASIRM_Out_1_loc_Inst01_1.png (the pooled table location, instance 1)]

 

This is the table (shown above) that SAS IRM keeps a close watch on. In fact, SAS IRM has its own database to keep track of all these data objects: where they are stored and what their digital signatures are. SAS IRM leverages these digital signatures to maintain a single version of each data set in the pool folder. When we create multiple instances of a job, unless something is recomputed in one of those instances, the results always point to the existing tables in the pool folder.

 

[Figure: sd_9_SASIRM_Out_1_oploc_Inst02_1.png (Open file location on the data object link, instance 2)]
[Figure: sd_11_SASIRM_Out_1_loc_Inst02_1.png (the pooled table location, instance 2)]

 

When we create a second instance of the same job flow (Instance ID=171389299), SAS IRM verifies the contents of the output tables and checks whether their digital signatures already exist in its database. In this situation, because we are not modifying anything in the second instance, the digital signatures of the required tables already exist in the SAS IRM database. This allows SAS IRM to surface the same tables from the pool folder as the results for the second instance.

 

[Figure: sd_12_SASIRM_PreChange_data01_1.png (input table before the change)]
[Figure: sd_14_SASIRM_PostChange_data01_1.png (input table after modifying the discount rate)]

 

On the other hand, if we modify something in the job flow, some of the tasks will need to be recalculated. To create a scenario where one of the input tables is modified in the third instance, we download the Fact Table for Cash Flows (staging.cashflows) from task 1, modify the discount rate in the first row, and then re-upload this table as the modified Fact table for Cash Flows (shown above).

 

[Figure: sd_15_SASIRM_Out_1_oploc_Inst03_1.png (Open file location on the data object link, instance 3)]
[Figure: sd_17_SASIRM_Out_1_loc_Inst03_1.png (the pooled table location, instance 3)]

 

As a result of this modification, task 1 (Get Cardinality Cash Flows byN) has to be recomputed when the third instance is executed (Instance ID=902557363). Because task 1 is recalculated, its output, mk_card.card_cashflows_byn.sas7bdat, will be different from what it was when the first and second instances were executed. In fact, when we open the mk_card folder associated with the third instance, we find the same data object link card_cashflows_byn.sas7bdat, but this time the link points to a table located in a different subfolder (_35) within the pool folder (shown above). This clearly indicates that data object pooling was not available for task 1 during the execution of the third instance. Furthermore, because of the dependencies among all four tasks, data object pooling was not available for any of the other tasks during execution of the third instance.
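In terms of the earlier hypothetical run_pooled sketch, the three instances behave like this: an unchanged input is a pool hit, while the edited discount rate changes the signature and forces a recompute into a new pool entry:

```python
cashflows = [("loan1", 0.05), ("loan2", 0.07)]
modified  = [("loan1", 0.06), ("loan2", 0.07)]   # discount rate edited in the first row

run_pooled("get_cardinality", len, cashflows)    # instance 1: computed and pooled
run_pooled("get_cardinality", len, cashflows)    # instance 2: pool hit, nothing rerun
run_pooled("get_cardinality", len, modified)     # instance 3: new signature, recomputed
```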

 

Access Information

For more in-depth training on SAS IRM, refer to the course SAS Infrastructure for Risk Management Overview.

 

For a list of courses offered by SAS, refer to the Complete Course List from SAS Education.
