
Tips and Tricks for Power Users of SAS® Visual Text Analytics: Problematic CDRs


 

NOTE: The following article references the use of APIs that are not officially documented by SAS for external use and are not supported by SAS Technical Support.

 

Introduction:

 

Last year, we posted an article, Tips and Tricks for Power Users of SAS® Visual Text Analytics: Part 2 of 3 (API Hacks), that described the programmatic manipulation of custom concepts in SAS® Visual Text Analytics (VTA). This post expands on that article to describe a method to programmatically find, within a custom concept, the concept definition rule (CDR) that matches a text string of interest. This method can be used to quickly and efficiently remediate existing VTA projects to improve model performance.

 

After an information extraction model has been promoted and has been producing results, unexpected matches not observed during the development phase might appear. A new round of development is then needed to modify the model to capture false negative (FN) results, to prevent false positive (FP) results, or both. An example of an FP result is shown in Figure 1 below for the VTA project named Color_Project, where the matched text of ‘blue’ is not desired because the X32 model was discontinued.

 

Figure 1. For custom concept, T_COLOR, the observation with RowID 435-5-80 shows an undesirable match for 'blue'.

Remediation Strategy:

 

One strategy to correct any FN or FP results is to find the CDR responsible for the undesired result and then modify the CDR or add a new CDR to produce the desired result. We call this process remediation. A challenging task during the remediation process is to find the CDR that is generating the undesirable match, especially since most information extraction models use nested custom concepts. A nested concept is a concept used within another concept. An example of a nested concept is the helper concept, H_COLORS, shown in Figure 2 below, which is referenced within many rules of the target concept, T_COLOR, shown previously in Figure 1.

 

Figure 2. For custom concept, H_COLORS, the observation with RowID 435-5-80 shows a match for 'blue'.

For a VTA project in SAS Model Studio, visually finding the matching CDR is not a problem when there are only a few CDRs present in the custom concept of interest. For example, the matched text of ‘blue’ in H_COLORS, Figure 2 above, is due to the CDR on line 10, CLASSIFIER:blue. This match is propagated within T_COLOR, Figure 1 above, since H_COLORS is referenced in the CDRs on lines 12-17. By visually reviewing these CDRs in T_COLOR, the match for ‘blue’ in the observation with RowID 435-5-80 is determined to be caused by the CDR on line 12, due to the presence of 'X32 model' in both the rule and the observation. The REMOVE_ITEM rule on line 10 in T_COLOR can be leveraged to remove this FP result using the disambiguation concept D_COLOR, shown in Figure 3 below in its original state with no match for any of the three observations in the data source. NOTE: The custom concept H_NEGATIVE is not shown because it plays no part in this particular example.

 

Figure 3. For custom concept, D_COLOR, the observation with RowID 435-5-80 shows no matches.

With the matching CDR known (line 12 of T_COLOR), we can test new CDRs to remove this match from T_COLOR. We chose the disambiguation custom concept, D_COLOR, for this purpose; the REMOVE_ITEM rule on line 10 in T_COLOR leverages this concept. Figure 4 below shows the match result for 'blue' after executing the Test Sample Text feature for a new CDR in the Sandbox. Adding this CDR to the D_COLOR custom concept will remove the match in T_COLOR for the observation with RowID 435-5-80. For an explanation of the REMOVE_ITEM rule, see the SAS documentation.
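For readers unfamiliar with the rule type, the general shape of a REMOVE_ITEM rule is sketched below. This is an illustration of the documented LITI form only; the exact rule on line 10 of T_COLOR may differ:

```
# Remove a T_COLOR match when it aligns with a D_COLOR match
REMOVE_ITEM:(ALIGNED, "_c{T_COLOR}", "D_COLOR")
```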

 

Figure 4. Using the Sandbox and Sample Text feature to test a new CDR. The match for 'blue' in the text string indicates that placing this CDR in D_COLOR will remove the match from T_COLOR, due to the REMOVE_ITEM rule in T_COLOR (see Figure 1).

The remediation of this FP result was relatively straightforward for this simple model. However, when there are hundreds of CDRs in the custom concept of interest, using the SAS Model Studio user interface to visually investigate all the CDRs is not humanly possible, at least for this developer. Using the Sandbox and Sample Text features to sequentially test individual CDRs is another option but would be very time-consuming. To eliminate these repetitive and time-consuming manual efforts, we created a SAS program, leveraging API calls, to automate the task of finding the problematic CDRs.

 

Remediation Solution:

 

The ability to generate a concept model binary (a machine-readable, binary-encoded file that contains compiled linguistic rules) and score data with that file, solely within the SAS Studio programming interface, was key to the success of this automation effort. This binary file represents the rules-based model. The compileConcept action generates the binary file that is required to score data. See Generate a Concept Model Using the compileConcept Action for details. Below we describe a method to interrogate a custom concept and capture, in a SAS data set, all CDRs matching the text string of interest. The matching/problematic CDRs are then modified by the developer, or new CDRs are created, to remediate the model and improve model performance by eliminating FN and/or FP results.
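As a rough sketch, compiling a rules table into a concept model binary looks something like the following. The caslib and table names are hypothetical, and the exact parameter syntax should be verified against the compileConcept action documentation for your release:

```sas
/* Sketch only: assumes an active CAS session and a rules table */
/* (CASUSER.MODEL) in the layout produced by the project export */
proc cas;
   textRuleDevelop.compileConcept /
      table={caslib="casuser", name="model"}                     /* concept rules in */
      casOut={caslib="casuser", name="model_bin", replace=true}; /* model binary out */
run;
quit;
```

The resulting binary table (model_bin here) is what the scoring step consumes.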

 

Figure 5. Schematic diagram of the splitting method used to programmatically find the problematic CDR that matches ‘blue’ in T_COLOR. The terminal matching model is shown in green. The terminal non-matching models are shown in dark blue.

As shown in Figure 5 above, this method sequentially splits in half only the list of CDRs within a single custom concept, leaving all other custom concepts untouched, and creates two new models (binary files). Each new model is a replica of the previous model but with only half the rules of the one custom concept of interest. For the first split, the original model, Model, is split to produce two new models, Model1 and Model2. These two models are compiled, using the compileConcept action, to produce concept model binaries that are used to score the observation containing the inappropriately matched text string. If there is a match for the text string (Model1 in this example), the splitting and scoring process continues. The process repeats for each model that matches the text string (Model11 for the second split) until only one rule remains in the model for the concept of interest (Model112 for the third split). The splitting terminates for models that do not match the text string (Model2, Model12, and Model111). Figures 6-8 below display some of the SAS data sets that were used as input for the compileConcept action to generate new models for scoring the observation of interest.

 

Figure 6. The SAS data set named ‘MODEL’ shows all four custom concepts extracted from the VTA project Color_Project shown in Figure 1. The concept of interest, T_COLOR, is highlighted by red and blue boxes. The red box outlines the metadata and one CDR, the REMOVE_ITEM rule. The blue box outlines the concept CDRs that were split or will be split.

 


 

Figure 7. The SAS data sets ‘MODEL1’ and ‘MODEL2’ showing the first split result. Only the custom concept of interest, T_COLOR, is displayed. Score results using ‘MODEL1’ produced a match. The red box outlines the metadata and one CDR, the REMOVE_ITEM rule. The blue box outlines the concept CDRs that were split or will be split.

 


 

Figure 8. The SAS data sets ‘MODEL111’ and ‘MODEL112’ showing the third split result. Only the custom concept of interest, T_COLOR, is displayed. Score results using ‘MODEL112’ produced a match. The red box outlines the metadata and one CDR, the REMOVE_ITEM rule. The blue box outlines the concept CDRs that were split or will be split. Note: Data sets for the second split are not shown.

In summary, each new model is scored to determine if there is a match for the observation containing the text string of interest. If a match exists for the new model, then at least one of the CDRs in that list is of interest and the interrogation continues with the same logic – split in half the one custom concept of interest for each matching model. The splitting continues until all paths terminate when either a single matching CDR remains in the custom concept of interest or there is no matching CDR.
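The search logic above can be summarized with a macro skeleton like the one below. This is a simplified sketch, not the actual Find_matching_CDRs.sas program: the helper macros %COMPILE_AND_SCORE and %SPLIT_RULES are hypothetical stand-ins for the real program's sections, and the real program drives the recursion with CALL EXECUTE rather than direct nested calls:

```sas
/* Sketch: recursively bisect the CDRs of the concept of interest.      */
/* &rules_ds holds that concept's CDRs for the current candidate model. */
%macro find_cdr(rules_ds);
   %local n_rules has_match;
   proc sql noprint;                          /* count remaining CDRs */
      select count(*) into :n_rules trimmed from &rules_ds;
   quit;
   /* hypothetical helper: rebuild the full model with these CDRs, */
   /* compile it, score the single observation, set &has_match=1/0 */
   %compile_and_score(&rules_ds)
   %if &has_match = 0 %then %return;          /* dead branch: stop splitting */
   %if &n_rules = 1 %then %do;                /* terminal match: report the CDR */
      proc print data=&rules_ds; run;
      %return;
   %end;
   /* hypothetical helper: split into &rules_ds.1 (top) and &rules_ds.2 (bottom) */
   %split_rules(&rules_ds)
   %find_cdr(&rules_ds.1)
   %find_cdr(&rules_ds.2)
%mend find_cdr;
```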

 

Use Case Example:

 

Using the VTA project named Color_Project, we share an example of a false positive and the three-step process we used to remediate the problematic match:

 

  1. Find the custom concept to interrogate
  2. Execute the program that finds the problematic CDRs
  3. Remediate the model in SAS Model Studio

 

Step 1 – find the custom concept

 

In Figure 1, an undesirable match for ‘blue’ is shown for the text string ‘The discontinued X32 model was available in blue’ for the custom concept named T_COLOR. We chose to interrogate T_COLOR for this example after noting that the nested custom concept H_COLORS contained the original matching text of ‘blue’ using a CLASSIFIER rule.

 

Step 2 – execute the program 

Having identified the custom concept of interest and the matched text string ‘blue’, we execute a single program named Find_matching_CDRs.sas to extract, from only one custom concept, all CDRs responsible for the match. NOTE: this program will find all CDRs that match the text string. Each section of the code is described briefly below, showing only snippets of the code. You can access the full SAS program in our public repository: sas-vta-examples.

 

Section 1

This is the only section for which the user needs to make changes prior to executing the SAS program; modify the macro variable values to match your specifications.

 

/* ******* NOTICE: ******* */
/* the values for the variable names must not contain spaces */
/* the unique_row_id_name variable must be of type character */
/* hence the unique_row_id_value will be of type character */

/******** SECTION 1 ********/
/* USER ASSIGNED macro variables */

/* Data Source Information */
%let datacaslib=%str(vtadata); 
%let datatablename=%str(ds_color); 
%let text_variable_name = %str(pagetext); *see NOTICE section;
%let unique_row_id_name = %str(rowid); *see NOTICE section;

/* Search Criteria */
%let matched_text = %upcase(blue); *value is not case sensitive;
%let unique_row_id_value = %str(435-5-80); *see NOTICE section;
%let search_concept = %upcase(t_color); *value is not case sensitive;

/* VTA Project Information */
%let sas_project_name = %upcase(color_project); *value is not case sensitive;
 
Section 2

The contents of the named VTA project are exported from SAS Model Studio to a SAS data set as described in an earlier article, Tips and Tricks for Power Users of SAS® Visual Text Analytics: Part 2 of 3 (API Hacks).

 

Section 3

The data source is subset to a single observation using the unique row identifier value; this observation must contain the inappropriately matched text string of interest. In our example, the observation contains the text 'The discontinued X32 model was available in blue' where 'blue' is matched.

/******** SECTION 3 ********/
/* SUBSET source data to observation of interest and move to CASUSER */
/*************************/
data CASUSER.TEST_OBS;
    set &datacaslib..&datatablename;
    where &unique_row_id_name. = "&unique_row_id_value.";
run;
 
Sections 4-6

Macro variables are created to identify and track the results of the splitting strategy, and an empty SAS data set is created with values appended to it for each iteration of a split. A recursive macro that executes the splitting strategy is compiled. The macro is then called, and the results are captured in a SAS data set and printed to the Results tab.

 

Briefly, the recursive macro program splits in half the rules in the custom concept of interest (denoted top and bottom), and creates two new models, one containing the top half and the other containing the bottom half, all other custom concepts remaining the same. NOTE: any REMOVE_ITEM rules in the custom concept of interest must be kept in both new models to maintain the integrity of the original model.
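A sketch of that splitting step is shown below. The data set and variable names (work.rules, rule_text) are hypothetical stand-ins for the structures shown in Figures 6-8; the point is that REMOVE_ITEM rules are copied into both halves while the remaining CDRs are divided:

```sas
/* Split the concept's rules into top/bottom halves, duplicating any */
/* REMOVE_ITEM rules so both new models keep the original behavior.  */
data work.top work.bottom;
   set work.rules nobs=total;                     /* total = rule count */
   if index(upcase(rule_text), 'REMOVE_ITEM:') > 0 then do;
      output work.top;                            /* keep in both halves */
      output work.bottom;
   end;
   else if _n_ <= total/2 then output work.top;   /* first half of CDRs */
   else output work.bottom;                       /* second half of CDRs */
run;
```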

 

A concept model binary is created for each new model (see the SAS documentation on generating a concept model), and the rules are validated (see validating concept rule syntax). The binary file is then used to score the data set created in Section 3 for each new model.
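A rough sketch of that validate-and-score sequence follows; the caslib and table names are hypothetical, and the action parameter syntax should be checked against the linked documentation for your release:

```sas
proc cas;
   /* validate the candidate model's rule syntax */
   textRuleDevelop.validateConcept /
      table={caslib="casuser", name="model1"}
      casOut={caslib="casuser", name="rule_errors", replace=true};
   /* score the single observation from Section 3 with the compiled binary */
   textRuleScore.applyConcept /
      table={caslib="casuser", name="test_obs"}
      docId="rowid"                                /* unique row identifier */
      text="pagetext"                              /* text variable to score */
      model={caslib="casuser", name="model1_bin"}  /* concept model binary */
      casOut={caslib="casuser", name="score_out", replace=true};
run;
quit;
```

Checking score_out for a match against the concept of interest determines whether that branch of the split continues.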

 

Finally, the macro program calls itself, via the CALL EXECUTE routine, and splitting continues based on the following logic:

 

  • If the two models for a split iteration both match and the number of remaining CDRs in the custom concept of interest is greater than one in each model, then the splitting continues with both models.
  • If only one of the models matches and the number of remaining CDRs is greater than one, then splitting continues only for the model that matches.

 

Section 7

This section is optional. It simply removes all intermediate filerefs, librefs, and data sets.

 

NOTE: execution time for the program depends on the number of CDRs within the custom concept of interest and the number of CDRs that match the text string. All matching CDRs will be returned.

 

Step 3 – remediate the model

 

With the matching/problematic CDRs in hand, modify and then test any new or modified CDRs using the Sample Text feature in SAS Model Studio, in either the Concepts tab or the Sandbox. NOTE: REMOVE_ITEM rules do not function in the Sandbox. Finally, score the observation with the applyConcept action in SAS Studio, using the concept model binary from the SAS Model Studio project score code, to ensure that the added or modified CDRs return the expected results and that the undesired results have been remediated.

 

Conclusion:

 

The ability to programmatically find the CDRs responsible for matching the text string of interest has resulted in significant time savings for this developer! You can access the SAS program in our public repository: sas-vta-examples. We have shared an example of how leveraging the internal text analytics APIs has enabled efficient interrogation of a false positive result, leading to improved model performance. Please continue to share your ideas and your use of these or other methods. Thank you to my SAS collaborators and teammates for contributing their time and knowledge to this article. Find previous articles by these same authors at Tips and Tricks for Power Users of SAS Visual Text Analytics: Part 1 of 3 (Structuring Concepts), Part 2 of 3 (API Hacks), and Part 3 of 3 (Tracking Concept Rules).

 