Understanding the Scoring of a Data Grid in a SAS Intelligent Decisioning Flow

2 Likes

Data grids are very useful when defining the logic through rule sets within a decision flow, especially if the data you are working with takes on a table-like structure i.e. rows and columns. There are numerous data grid functions one can employ to manipulate the data grids to achieve a desired outcome. But there are a few caveats one should be aware of when using data grids and the functions available.

One of those caveats is referencing a specific item in a data grid especially - and here is the catch - when that item appears more than once in the data grid. I like to think of a data grid as an array or matrix in this context. Let me explain. A two-dimensional matrix structure may consist of columns and rows and therefore each cell in the matrix has a distinct and unique "address" or coordinate. See the diagram below. If I want to reference the cell in which the word "Apple" appears I may have to define a function which uses the column and row references or perhaps use a search function. In this example, Apple is found column 2 row 2 as well as column 2 row 4.

So you may ask, "what does this have to do with a data grid". Well, everything. Most data grid functions reference a "cell" in the data grid by way of a column name and a row index. Take for example this data grid function: DATAGRID_GET (dataGrid, column, rowNumber). This function returns the value of the cell in the specified row and column. Notice that the function arguments require a dataGrid reference, as well as the column and rowNumber in order to return the value contained in that combination.

For example the statement: DATAGRID_GET (cust_hist_dg, 'Procedure Code', 3) will return the value 'PROC987' (you can read this as; get the value from data grid "cust_hist_dg" where the column name is "procedure code" using row number 3)

Although this function is great at getting a single "cell" value in the data grid it can be tedious to retrieve for example the values from multiple rows in the data grid.

To further illustrate the point I'm trying to get to is the following function example, the DATAGRID_FILTEREDGET(dataGrid,columnToSearch,filterColumn,operator,variableOrValue) function. This function returns the value in the first row in the specified column for which the specified comparison evaluates to true.

For example, the statement DATAGRID_FILTEREDGET(cust_hist_dg,'CustomerID','Procedure Code','EQ','PROC987'), determines whether the value of the Procedure Code column is equal PROC987, and if so, returns the value of the Customer ID column i.e. "C1". But notice that it will only return the first row it finds that matches these criteria.

In most cases, the above functions would suffice perfectly fine, but how would one configure a rule set or a decision flow which will allow one to go through the data grid row by row and return all of the values it finds to fulfil a specific requirement without having to create complex logic? This is the question I'd like to address in the next sections.

To loop or not to loop

If you are not into coding, like me, the majority of the time configuring decision flows will require you to define rule sets in order to use the data grid functions provided. We also know that typically a decision flow containing rules, passes the logic flow from top to bottom through the flow, and there is no ability to "loop" the logic within a flow. To reitterate, there is no capability available at present to redirect the flow logic in a decision flow to a previously executed node during the decision flow execution. However, there is a way to achieve some of this ...

Let's review this use case: The requirement is to evaluate the following data grid <claim_history_dg> (screenshot below) by going through it row by row, filtering out all records where CUST_ID = 'C1' and the PROC_CODE='P567_0001'. These rows are then to be written to a new data grid. A new column must be added to the data grid: AGGREGATE_TOOTH_TREATED and this column must aggregate the number of teeth treated for CUST_ID = 'C1'. Sounds simple enough right ... ?

The resulting output should represent the following:

To pull this off we will employ the services of this little gem of a tick box :

This feature will allow us to execute the node object logic (rule set) against each row in a data grid. When you select this option, the application maps decision variables to columns in the data grid if the variable and column have the same name. The important factor here is that you do not need to use data grid functions to process columns in the data grid as you typically would.

IMPORTANT When the node object is a filtering rule set, and you select Score rows in this data grid, rows that do not meet the criteria defined by the rules are removed from the data grid.

Step-by-step Walkthrough

Let's run through the example step-by-step. To keep things simple we will create the following decision flow with two rule set nodes. Both rules set nodes are of "assignment" type.

The first node: zafgev_create_dg_json (1.0) will help us create the input data grid with associated data. The second node: zafgev_score_dg (1.0) will be used to score the data grid we created in the first node. Note that the "zafgev_" in the node name are my user credential alias and I use this as a prefix to find rules and artefacts I have created quickly using the search function in the user interfaces. So you are welcome to substitute that with your initials.

Prior to defining the various rules in the rule set make sure to create the variables first using the variables tab.

Predefining the structure of the data grid "claim_history_dg" created using the DATAGRID_CREATE expression is an important step. Note that the DATAGRID_CREATE function does not create the columns and data types of the data grid explicitly when used in a rule expression. I would suggest you create the data grid structure by specifying the column names and types before running the DATAGRID_CREATE function.

... next define the data grid column names and types ...

The rule set node definition creates a new data grid called "claim_history_dg" and uses the JSON argument to populate the data grid with data. The rule set also sets initialisation values for the variables defined i.e. _row_count, _row_index etc.

The code window below contains the expression including the JSON to create the data grid and populate the data grid data.

DATAGRID_CREATE(claim_history_dg,'[{"metadata":[{"CUST_ID":"string"},{"C_DATE":"decimal"},{"LINE_ITEM":"decimal"},{"PROC_CODE":"string"},{"TOOTH_TREATED":"decimal"}]},{"data":[["C1",21915,1,"P25435   ",1],["C1",21915,2,"P567_0001",1],["C1",21915,3,"P678     ",1],["C2",21919,1,"P25435   ",1],["C3",21924,1,"P567_0001",3],["C4",21934,1,"P123     ",1],["C1",21935,1,"P567_0001",1],["C1",22056,1,"P567_0001",5]]}]')

Next create the second node: zafgev_score_dg (1.0). Start by defining the variables required. Take note of the input and output selection columns below and the assignment of each variable.

Again make sure to define the structure of the data grid object "new_datagrid".

Again we will manually define the structure of the output data grid we require "new_datagrid". The purpose of this data grid is to capture the scored rows from the "claims_hisotry_dg" data grid. Note the difference here when "scoring" a data grid i.e. we do not have the claims_history_dg as an input data grid into this rule set node. If we were not scoring the data grid we would have had to define the input data grid, but with scoring, we don't. More on why this is later.

Next, define the rules for the rule set.

A few notes in this rule set - from here on I will refer to "input rows". Each row in the zafgev_claims_hist data grid will be injected into the rule set, one row at a time until the last row in the data grid is reached. Think of it as a LOOP wrapper placed around the rule set, repeating the rule set as many times as there are rows in the data grid zafgev_claims_hist, to be scored.

the rule set starts by filtering out only the rows matched where _dg_cust_id = 'C1' as well as where the _dg_proc_code = <Input variable> (from upstream variable input i.e. req_proc_code)
add a new row to the "new_datagrid" using the DATAGRID_ADDROW expression - doing this will add a new row but all variables in the data grid row will be null at this point
next count the number of rows in the "new_datagrid" - this will allow us to use the variable "_row_count" to insert values into the new data grid. we do this because data grid manipulation functions are very specific with regard to pinpointing the column and row to be manipulated.
we set the "new_datagrid" column values using the DATAGRID_SET functions. Notice here the use of the "_row_count" variable to figure out in which row to insert the new values for the data grid. In this scenario always set the values in the last row of the growing data grid.
The "assign" type rules for example "_aggreate_tooth_treated" helps increment the aggregate value. So every time a new row is added to the data grid the "_aggregate_tooth_treated" variable increments by a value of 1. This also proves later that the zafgev_claims_hist is being scored row by row.

Now that both rule sets are created we can add them to the decision flow. When completed the decision flow should resemble the following flow:

Once you have added the rule sets to the decision flow be sure to select the "Add missing variables" option in the decision flow "Actions" menu. This will make sure the decision flow is up to date on any input and output variables sourced from the rule sets.

A view of the variables tab of the decision flow should represent the following variables:

Selecting the "zafgev_score_dg" rule set node in the decision flow should present the following mapped input variables:

Here we reach the "secret sauce" to the exercise - the ability to score a data grid. Note in the screenshot above that the input variables mapping table allows the user to check the "score rows in this data grid" option as well as a selection drop-down list where the relevant data grid can be selected.

An important distinction must be made here regarding how we map the input variables. Notice that the following variables; _dg_cust_id, _dg_proc_code and _dg_tooth_treated, are mapped to the corresponding selected data grid columns NOT the upstream decision flow variables. This is denoted with the table style icon next to the variable. In essence, what this implies is that the data grid is read row by row, and the variables mapped are passed to the matching rule set variables and processed one row at a time until all rows in the data grid are processed. This capability is super useful when you need to process each row in a data grid in a loop-style fashion.

Scoring the decision flow

The following screenshot depicts the "scenario" based scoring test performed on the decision flow.

Note here that for the test we are passing req_proc_code = 'P567_0001'. Below is the output from the test case:

A few important observations from the result set is noted below which are important in understanding a few unique nuances in scoring a data grid:

_row_count = 3
- From this output, we are certain that the business rule within the rule set was executed a total of 3 times - note that the data grid contained a total of 8 rows however only three rows matched the filter criteria. Also, remember that the value of _row_count is assigned by incrementing its value on each subsequent pass through the filter criteria.
_row_index = 1 vs _row_index2 = 4
- Why do these values differ from one another when both of these variables increment their own respective values in the filtering rule?
- The reason lies in the definition of these variables in the variables tab of the rule set.
  - Note that the _row_index variable in the variables tab does not "output" its value for post-processing. The important detail here is that as the data grid is being scored each cycle through the rule set starts with the _row_index value reinitialising rather than using its previous set value within the rule set.
  - Note that the _row_index2 variable in the variables tab does "output" its value for post-processing. This is an important detail to remember as here the _row_index2 value is retained as the data grid is scored through each cycle effectively passing the variable value from each cycle to the next, effectively retaining the value of the previous cycle.

Summary

In conclusion, the ability to "Score" a data grid can solve many issues where you are required to step through rows in a data grid for your use case. When first exploring this option it is important to keep it simple and small until you are comfortable with the concept and the prerequisites before venturing onto more complex use cases.