I start with a few questions for you:
If any of the above apply to you, then read on. I have personally been using SAS for almost 30 years and started purely as a programmer, so I have an appetite to understand the SAS code behind the GUIs. Often there are additional parameters and options available via code that are not surfaced in the UI, and a coding approach makes it easier to build a repeatable, schedulable process to improve automation. If you're a coder at heart like me, you'll be able to quickly leverage the ideas in this post. If you're not so comfortable with code but are comfortable working with VTA projects and nodes, I hope this article will help you ease into coding with VTA.
Like most new Visual Text Analytics users, I gravitated toward the visual interface in Model Studio. The interactivity of the nodes was, and still is, very valuable for interrogating and understanding my text data. Eventually, I became curious about the underlying SAS code that produced these analytic outputs, which led me to investigate the log output from each node.
Let's start by taking a closer look at the log from a Text Parsing node (below). We see a number of CAS actions getting loaded and called: tpParse, tpSpell, tpAccumulate and tpSVD. In each case, the CAS action is loaded and then called with a set of parameters that are captured inside the curly braces of the Lua string for that CAS action.
proc cas ;
builtins.loadActionSet / actionSet="TextParse" ;
TextParse.tpParse /
docId="__uniqueid__",
entities="NONE",
language="ENGLISH",
liti={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="FB4E14C0-4388-4F5D-92B9-60738E2544CC_CONCEPT_MODELS"},
nounGroups=true,
offset={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016_PARSING_POSITION_OUT_fb4e14c0-4388-4f5d-92b9-60738e2544cc", promote=true},
outComplexTag=true,
parseConfig={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016_PARSING_CONFIGURATION_OUT_fb4e14c0-4388-4f5d-92b9-60738e2544cc", promote=true},
table={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016"},
text="SYMPTOM_TEXT" ;
run ;
quit ;
%let ActionSetandAction = TextParse.tpParse ;
%let ActionSet = %scan(&ActionSetandAction.,1,%str(.)) ;
%let ActionParms = docId="__uniqueid__",entities="NONE",language="ENGLISH", liti={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="FB4E14C0-4388-4F5D-92B9-60738E2544CC_CONCEPT_MODELS"}, nounGroups=true, offset={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016_PARSING_POSITION_OUT_fb4e14c0-4388-4f5d-92b9-60738e2544cc", promote=true}, outComplexTag=true, parseConfig={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016_PARSING_CONFIGURATION_OUT_fb4e14c0-4388-4f5d-92b9-60738e2544cc", promote=true}, table={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016"}, text="SYMPTOM_TEXT";
proc cas ;
builtins.loadActionSet / actionSet="&ActionSet." ;
&ActionSetandAction. / &ActionParms. ;
run ;
quit ;
The caslib subparameter that repeats throughout these calls is the caslib for the VTA project, and the remaining subparameters name the CAS table inputs and outputs. The "offset" and "parseConfig" parameters are outputs of the tpParse action. Their caslib subparameters should be changed to a different output caslib, such as CASUSER, and the output table names can be changed as well. I like to choose table names of 32 characters or fewer so that I can see them in the Libraries section of SAS Studio.
The "table" and "liti" parameters are CAS table inputs to the tpParse action. The "table" input is expected to contain the text column ("SYMPTOM_TEXT" in my example above) and a unique identifier column ("__uniqueid__" in my example above). The "liti" parameter is populated in my example because I had a Concepts node prior to my Text Parsing node in my VTA pipeline. It is simplest to keep the inputs the same and continue to source them from the VTA project caslib, but be sure to change the outputs to a different caslib, which is what I did in the modified version of the code above.
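As a sketch of that modification: the inputs still come from the project caslib, while the two outputs are redirected to CASUSER with shortened names (the names PARSING_POSITION_OUT and PARSING_CONFIG_OUT are my choices; your project caslib and input table names will differ):

```sas
proc cas ;
   builtins.loadActionSet / actionSet="TextParse" ;
   TextParse.tpParse /
      docId="__uniqueid__",
      entities="NONE",
      language="ENGLISH",
      /* inputs: unchanged, still sourced from the VTA project caslib */
      liti={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="FB4E14C0-4388-4F5D-92B9-60738E2544CC_CONCEPT_MODELS"},
      nounGroups=true,
      /* outputs: redirected to CASUSER with names under 32 characters */
      offset={caslib="casuser", name="PARSING_POSITION_OUT", promote=true},
      outComplexTag=true,
      parseConfig={caslib="casuser", name="PARSING_CONFIG_OUT", promote=true},
      table={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016"},
      text="SYMPTOM_TEXT" ;
run ;
quit ;
```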
When I run the modified code, I get information about the newly created outputs in the Results:
Let's take a quick look at them:
The "config" output table has just one row, containing the parameters that were passed to the tpParse action.
The "position" output contains many rows; each row is a term from the original document along with its position, sentence and paragraph within the document. While stemming=TRUE was not explicitly specified in the code, it is the default, so terms are grouped into parent terms. (Technically, stemming in this context is more accurately described as lemmatization.) Likewise, tagging=TRUE was not explicitly specified but is also the default, and it results in the part of speech being included in the output in the _role_ column.
Note that the Text Parsing node in VTA does not expose any output tables (these would appear in Results from the right-click menu). But now that we are running the tpParse action directly, we can create and access these tables.
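For example, once the tables exist you can peek at the position output with the table.fetch action (the CASUSER location and the PARSING_POSITION_OUT name below are illustrative; substitute whatever caslib and name you specified for the offset output):

```sas
proc cas ;
   /* fetch the first 10 rows of the parsing position output table */
   table.fetch /
      table={caslib="casuser", name="PARSING_POSITION_OUT"},
      to=10 ;
run ;
quit ;
```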
CAS was designed not to overwrite existing CAS tables that are promoted. So if you try running the above code a second time using the same output caslib and table names, you'll get an error:
You can do as the error message suggests and choose new names for your output tables, and the code will run. Alternatively, you can add code to delete the tables if they already exist. Here is example code that deletes the two outputs, using the "quiet" option so that no error is raised if the tables don't already exist:
proc casutil;
droptable casdata="PARSING_POSITION_OUT" incaslib="casuser" quiet;
droptable casdata="PARSING_CONFIG_OUT" incaslib="casuser" quiet;
run;
Another option is to change the promote=true option on the CAS table outputs to replace=true, which gives the CAS tables only session scope and replaces them if they already exist:
proc cas ;
builtins.loadActionSet / actionSet="TextParse" ;
TextParse.tpParse /
docId="__uniqueid__",
entities="NONE",
language="ENGLISH",
liti={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="FB4E14C0-4388-4F5D-92B9-60738E2544CC_CONCEPT_MODELS"},
nounGroups=true,
offset={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016_PARSING_POSITION_OUT_fb4e14c0-4388-4f5d-92b9-60738e2544cc", replace=true},
outComplexTag=true,
parseConfig={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016_PARSING_CONFIGURATION_OUT_fb4e14c0-4388-4f5d-92b9-60738e2544cc", replace=true},
table={caslib="Analytics_Project_9185b008-3dcc-4dc0-9220-b0c43b172902", name="VAERS2016"},
text="SYMPTOM_TEXT" ;
run ;
quit ;
Armed with an understanding of how to reconstruct the tpParse action from the Text Parsing node log, I leave it to you to perform the same steps for the tpSpell, tpAccumulate and textMining.tmSVD actions. The tpSpell action only shows up in the log when you select the "Enable misspelling detection" checkbox in the Text Parsing node settings. The tpAccumulate action is always run by the Text Parsing node; it takes the offset output and produces additional aggregated outputs for parents and terms that are used by downstream nodes.
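As a starting point for that exercise, here is a rough sketch of what a tpAccumulate call can look like. Treat everything below as illustrative: the table names are my own, and the exact parameter set in your log will differ, so copy the call from your own Text Parsing node log rather than trusting this verbatim.

```sas
proc cas ;
   builtins.loadActionSet / actionSet="TextParse" ;
   TextParse.tpAccumulate /
      /* input: the offset (position) table produced by tpParse */
      offset={caslib="casuser", name="PARSING_POSITION_OUT"},
      /* outputs: aggregated tables consumed by downstream nodes */
      parent={caslib="casuser", name="PARSING_PARENT_OUT", replace=true},
      child={caslib="casuser", name="PARSING_CHILD_OUT", replace=true},
      terms={caslib="casuser", name="PARSING_TERMS_OUT", replace=true} ;
run ;
quit ;
```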
Hopefully, this will help you make progress down the path of coding in VTA, which will enable you to leverage the numerous VTA algorithms not available as nodes in Model Studio. For example, if you have used the VTA Topics node, you've been using the Singular Value Decomposition algorithm for text topics. There is an alternative method for discovering text topics called Latent Dirichlet Allocation (LDA), documented in the LDA Topic Modeling Action Set (2024.11).
If you are using the Categories node to derive and leverage Boolean rules for text classification, you could try some machine learning alternatives, such as the trainTextClassifier action, which uses the BERT (Bidirectional Encoder Representations from Transformers) algorithm, or the rnnTrain action, which uses a Recurrent Neural Network to build a text classifier model.
If you have not already checked it out, you may find my previous post Behind the Scenes with SAS Visual Text Analytics Part I useful. The Part I post shows VTA users how to find and copy the existing CAS tables created by the VTA nodes.
Disclaimer: it is possible that a future update to VTA could change things in a way that breaks the approach above, but I think that is unlikely.