How the NLP - Extract Rule Configuration custom step saved my sorry... soul.
Okay, let me abuse the term 'stream of consciousness' and provide a quick outline of a situation I faced (and rectified within 20 minutes) yesterday.
Among other things, I dabble in Natural Language Processing (NLP). I also dabble in another seemingly unrelated area of creating low-code reusable components, which is basically a posh way of referring to wrappers on top of SAS programs. We call them SAS Studio Custom Steps. Both capabilities are offered on SAS Viya, a unified analytics platform which is highly addictive for people like me who like to ... well, dabble.
Which is why, yesterday, upon facing one of those weird errors that crop up in Visual Text Analytics (the SAS Viya product offering NLP), I panicked for only 15 minutes. Me knowing me, that's a record.
The error occurred because I decided to do something which I have decided I won't ever do in future: change the structure of an input table for a Visual Text Analytics project. Although VTA does offer an option to replace a data source (which is useful in many cases such as refreshing input data for some crazy and complicated projects), there is a link maintained between columns in the input table and metadata in a text analytics project. A change to a data source of the same structure (same number of columns with the same names) should be pretty seamless around 99% of the time (not 100%, because there is a god or something like that), but a change to the structure of input data (such as a new column added or an important text column getting deleted) increases the chances of error (because, well, there does seem to be a god).
I don't know the root cause of the error yet, but I will probably be able to rectify it soon. The error is not the point of this article. The impact of the error is my main area of focus. I had been working on something pretty interesting: a fairly comprehensive project with information extraction and categorisation rules for a complex taxonomy on technical paper abstracts for PharmaSUG 2025. Not heard of PharmaSUG? You should attend if you are interested in applications of SAS and allied open source technologies to improve processes and outcomes in Life Sciences. Read here to learn more.
The point is, I had a project comprising 30 different information extraction rules and 20 different categorisation rules, which no human can remember. I needed to take remedial measures. But, first, I needed to scream.
Figure 1: The VTA error encountered. One day, I will find out what happened...
Minutes 16 to 20
As mentioned, the first 15 minutes consisted of various childish activities. Then, I remembered an indiscretion of my somewhat recent (1-1.5 years) past. Back then, I had contributed a SAS Studio Custom Step, the "NLP - Extract Rule Configuration" step, to the SAS Studio Custom Steps GitHub repository. It had seemed fun at the time, and was motivated primarily by the following factors:
1. Transparently surface rule logic for aiding understanding by stakeholders
2. Identify changes to a set of rules
3. Help satisfy governance requirements.
I had even written an article about it previously, available here. Now, the time had come to add one more requirement satisfied by this step:
- A recovery mechanism in case of stuff happening
You may understand why this tends not to be the primary message behind positioning the custom step, as it hints at the possibility of an error, which nobody likes to talk about. But, the reality is, stuff does happen, more often than you think, and no system is immune or foolproof (even if one exists, well, stuff just hasn't happened yet). It's beneficial to build in mechanisms which help you to be resilient.
Back to action. The Extract Rule Configuration step helps extract rule configurations from an existing Visual Text Analytics project. This may refer to either a Concepts (information extraction) model or a Categories model. The step requires a reference to the VTA project in order to get started. More specifically, the project is tied to an Analytics caslib (a folder location for CAS tables) which contains all back-end tables and metadata created by the project. The easiest possible way of obtaining this is to refer to the front page of the VTA project.
Figure 2: Identifying the Project caslib location for VTA projects in Model Studio
After copying this somewhere, I then opened SAS Studio and created a new SAS Studio Flow. Since the input tables in the project caslib happen to be CAS tables, I first established a connection to SAS Cloud Analytics Services (CAS) as follows:
/* Start a CAS session; "ss" is the name given to the session */
cas ss;
/* Optional: assign a SAS libref to every caslib available to the session */
caslib _all_ assign;
Then, I dragged a copy of the NLP - Extract Rule Configuration step on to the canvas. Refer here for instructions on how to make a custom step available in your SAS Viya environment.
In my particular case, I had two nodes to extract the rules for. To identify them, I needed to obtain a list of rule configuration tables within the caslib, which happens to be the first option available in the step (Generate a list of rule configurations). Select the first option and then provide the name of the analytics caslib in the space provided. Finally, attach an output port (right click -> attach output port) to the step and provide the name of a SAS dataset (which can be located in WORK; this is only a temporary dataset to hold the names of the config tables).
Figure 3: Generate a list of rule configurations
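If you want to peek inside the analytics caslib by hand before running the step, a minimal sketch follows. It is not what the custom step does internally (the step narrows things down to the rule configuration tables for you); it only lists everything in the caslib. It assumes the CAS session from earlier is active, and the caslib name shown is a hypothetical placeholder for the one copied from the project's front page.

```sas
/* Sketch only: inspect the project's analytics caslib by hand.       */
/* "Analytics_Project_xxxx" is a hypothetical placeholder; substitute */
/* the caslib name copied from the VTA project's front page.          */
%let project_caslib = Analytics_Project_xxxx;

proc casutil;
   /* Tables currently loaded in memory in the caslib */
   list tables incaslib="&project_caslib";
   /* Files persisted in the caslib's data source */
   list files incaslib="&project_caslib";
quit;
```

The listing tends to be long for a real project, which is exactly why having the step generate a focused list of rule configurations is the more comfortable route.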
Now that we have a list of config tables ready, let's go ahead and extract them. For this purpose, drag a copy of the same step again to the canvas and this time, select the option named "Extract all rule configurations as per an input list". Connect the ruleconfig list (the WORK dataset created during the first run of the step) to this step. In this case, all you have to do is to provide the name of a libref pointing to a CAS engine (I prefer PUBLIC since it's a shared caslib and easy to access), and this happens to be the place where the configuration tables get output. The names of the tables that get saved are located in the config list (WORK.RULECONFIG) created earlier.
Figure 4: The list of rule configuration tables generated. Note that there's a Concept and a Category table
Figure 5: Extract Rule Configuration tables to a Caslib
Load these tables to memory and open them using SAS Visual Analytics, the simplest application through which you can take a quick look at the rules. From this stage onwards, I found it extremely easy to export the rules to an Excel sheet. I then created a new Visual Text Analytics project (my recovery project), added a Concepts node followed by a Categories node, and used my Excel sheet to quickly copy and paste all the rule names and rules into the new project. In very little time, I was back up and running, and had shelved my previous error for later analysis. My project's purring along perfectly now, thank you very much.
Figure 6: An example of the rule configuration table in Visual Analytics, which I then exported to Excel to easily copy over to a new project. Time: ~5 mins (end-to-end)
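For those who would rather stay in SAS Studio than go through Visual Analytics, the load-and-export part can also be sketched in code. Treat this as an illustration under assumptions, not the route I took: the table name CONCEPT_RULECONFIG, the .sashdat file suffix, the libref name and the output path are all hypothetical placeholders (the real table names are in WORK.RULECONFIG), and the XLSX export relies on your environment supporting that engine.

```sas
/* Sketch only: names and paths below are hypothetical placeholders. */

/* Load a saved rule configuration table into memory.                */
/* Skip this step if the table is already loaded in the caslib.      */
proc casutil;
   load casdata="CONCEPT_RULECONFIG.sashdat" incaslib="PUBLIC"
        casout="CONCEPT_RULECONFIG" outcaslib="PUBLIC" replace;
quit;

/* Point a SAS libref at the caslib so regular procedures can read it */
libname pub cas caslib="PUBLIC";

/* Write the rules to an Excel workbook for copy and paste */
proc export data=pub.CONCEPT_RULECONFIG
   outfile="/path/to/recovered_rules.xlsx"
   dbms=xlsx replace;
   sheet="concepts";
run;
```

Repeating the last two steps for the Categories table, with a different sheet name, would keep both rule sets in one workbook.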
In Summary, Did I Learn Anything?
Yes, actually, and not just on the personal discovery side (such as: I never knew I knew such words). For one, this was one more reinforcement of how a mid-project change carries risk. In my case, the risk came from deciding to change my input table structure. The other lesson is not to trust software capabilities alone. Even the best-designed software can be stumped by new edge cases, which is why testing continues to be relevant in any software process. My support goes out to all software developers and testers out there (lest they put a hex on me or something).
The biggest lesson I learnt, as mentioned earlier, was on the importance of being resilient in the face of a setback. I'm not going to go all 'Taleb' on you and venture into philosophy (though this is a pretty darn good book, if only I can get past page 101). What I appreciate most, from this experience, is the value of going beyond the UI. Automated GUIs with their bells and whistles are nice, but it always helps to understand how things work under the hood. That knowledge helped me design the custom step a while back, and I found that it came to the rescue this time.
A final point is that there are always other options for recovery, and this specific case is no exception. As a classic example of being wiser after the fact, I acknowledge that saving my rules periodically to the Exchange (favourite number one in this, by the way), which acts as a repository of templates, would have significantly reduced rework. At the same time, maybe I wouldn't want to use the Exchange for one-off projects with little scope for reuse. Another option would have been to refrain from changing the main project's table at all, and to create a new project that uses the new table structure instead. You live and learn.
Do you have any recovery 'miracles', or perhaps sob stories to share? Or even further questions about the NLP-related Custom Steps? Drop me an email and we can chat more.