This old nursery rhyme described a challenge the woman had to deal with. Keeping track of children takes a lot of patience and energy, as many of you know. In SAS Visual Text Analytics, the Concepts node is where you enter detailed rules that extract information from document collections. These rules are powerful, and as they accumulate over time you might forget the clever ideas you used in individual rules to build up your overall process.
In a recent SAS Visual Text Analytics class, while teaching the concept rules section, I was inspired by a discussion with my students. They are experienced text analytics users and actively run hundreds of concept rules in their own environment. They were wondering if there was a way to get a nice, detailed inventory listing of the concept rules and the syntax of all LITI statements used in a Concepts node.
As of the date this post is being written, there is no button available in the user interface that gives you this information. This post shows how you can produce a list of the concept rules and the syntax of all LITI statements used in the Concepts node. There are a few steps involved in the process, so I'll walk through them one at a time.
First, I found a Concepts node in a project that included custom concept rules. The next step was to find the actual concept rules table so I could extract the detailed rule syntax. If you are new to SAS Visual Text Analytics, this post will provide some background.
This is a list of the custom concept rules from the project I selected. Notice the naming convention incorporates underscores and all capital letters to differentiate the rule name from any text that might occur in a document. This is a recommended rule naming convention.
I know that there are tables “behind the scenes” that contain vital information for each project. Refer to this post by Noah Powers for the details. I first opened the Concepts node’s log to find the internal names of the project as well as the specific Concepts node.
I noted the names from the log (shown below) for use in my next program. It was easy to get the information I needed to start looking for the detailed rule syntax.
In case you don’t have a magnifying glass handy, the "wrapped" log portion below highlights the relevant lines from the previous log. The first highlight is the project name and the second is the CONCEPT_RULESCONFIG table name.
The following program uses the previously identified project and node names. It outputs a table of the rules from that Concepts node.
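If the log is hard to read, another way to spot the table is to list the in-memory tables in the project caslib. The step below is a minimal sketch, assuming you already have an active CAS session and know the project caslib name (the one shown is the project used in this post; yours will differ). The table you want is the one whose name ends in _CONCEPT_RULESCONFIG.

```sas
/* Sketch: list the in-memory tables in the project caslib.          */
/* Assumes an active CAS session; substitute your own project caslib. */
proc casutil incaslib="Analytics_Project_ce2afc9d-0ac4-438d-afa3-d417ae037fc5";
   list tables;
run;
```

A project with several Concepts nodes can show more than one RULESCONFIG table, so the node name from the log is still the safest way to pick the right one.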
cas;
caslib _all_ assign;
/* enter project and concept table name */
proc casutil incaslib="Analytics_Project_ce2afc9d-0ac4-438d-afa3-d417ae037fc5" outcaslib="casuser";
   save casdata="646dc624-5b67-47b9-b070-9ff6f91bb7a2_CONCEPT_RULESCONFIG" casout="concept_rules.sashdat" replace;
run;
/* load the results table so it can be viewed */
proc casutil incaslib="casuser" outcaslib="casuser";
   load casdata="concept_rules.sashdat" casout="concept_rules" replace;
run;
I named the output table “concept_rules”. It has the extracted rule and the system-generated (not user-friendly) ruleId.
OK, we are making progress but are not quite there yet. The concept rules table contains a “representation” of the rule syntax, but it includes an extra "rule name" in every line of the code, making it hard to read. In addition, I don’t want to keep the internal ruleId column.
Let’s consider the first row in the previous table. It has the syntax for all LITI statements in the _CONDITION_ rule, but also has the extra text “_CONDITION_:” in each line of the rule that is not part of the original syntax. I want to remove it from my results.
The last four lines of each rule are not part of the original rule syntax either, but they tell us whether the rule is enabled, the full path of the rule, its priority, and whether it is case sensitive. You could write some code to exclude those lines from your listing, but I thought they would be useful to keep.
Since I'm working towards receiving a 'good data steward' award one day, I ran this optional step next to identify the maximum size of a rule in the node, so that the output column created in a later step is no larger than it needs to be.
/* verify the maximum length of the concept rules in the node */
proc contents data=casuser.concept_rules;
run;
In this example, the largest rule uses 307 bytes. Your rules probably take more space, so check this value in your own project; you will need it when setting a column length in a later step.
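If you would rather see the figure as a single number, a query along these lines is one option. It is a sketch, not part of the original program: it assumes the casuser.concept_rules table created above and its config column, and the column alias max_rule_length is just an illustrative name.

```sas
/* Sketch: report the length of the longest value in the config column. */
/* Assumes casuser.concept_rules exists as created earlier in this post. */
proc sql;
   select max(lengthn(config)) as max_rule_length
      from casuser.concept_rules;
quit;
```

Whichever method you use, leave a little headroom when you choose the length of the output column.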
To create a new understandable RuleName column with the actual rule name instead of the internal ruleId, the following code scans the config column for the second term using a colon as the delimiter. The resulting table is below the code.
/* extract the RULENAME as a separate column in a temporary working dataset */
data casuser.one;
   length RuleName $36. ; /* modify rulename length if desired */
   set casuser.concept_rules;
   RuleName = scan(config, 2, ':'); /* save each rulename from the table */
run;
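To see what SCAN is doing, here is a small standalone sketch that runs without CAS. The config value is a hypothetical one-line rule made up for illustration; the RuleType and RuleBody variables are also just for the demonstration and are not used elsewhere in this post.

```sas
/* Sketch: split a hypothetical rule line into colon-delimited tokens. */
data _null_;
   length config $200 RuleType RuleName $36 RuleBody $100;
   config = "CLASSIFIER:_CONDITION_:asthma"; /* hypothetical rule line */
   RuleType = scan(config, 1, ':'); /* first token: the LITI rule type */
   RuleName = scan(config, 2, ':'); /* second token: the rule name     */
   RuleBody = scan(config, 3, ':'); /* third token: the rule text      */
   put RuleType= RuleName= RuleBody=;
run;
/* log shows: RuleType=CLASSIFIER RuleName=_CONDITION_ RuleBody=asthma */
```

Only the second token is needed for the inventory. The rule text itself can contain colons, so a third-token extract like RuleBody would not be reliable for real rules.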
The final challenge is to remove the rule name from each line of LITI code in the config column, and save the results in a table containing all rules in the node.
This code replaces every occurrence of the second token from the config column, together with its trailing colon, with a blank. It also sets the length of the cleaned output column to a value slightly above the maximum reported by proc contents earlier.
/* remove the rulename from each rule to see the original syntax */
data casuser.two (keep=rulename cleaned);
   set casuser.one;
   /* change length based on proc contents output to allow for longer rules */
   length cleaned $320. ;
   /* remove the rulename from the table output as well as the extra colon */
   cleaned = tranwrd(config, (scan(config, 2, ':')||":"), " ");
   put cleaned=;
run;
In this example, TRANWRD uses the value returned by the SCAN function to remove the extra rule name from each line, producing the result in the "cleaned" column in the screen capture below. (Thanks Kathy and Danny!)
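The same replacement can be tried on a single value outside of CAS. This is a sketch under stated assumptions: the config string is a hypothetical two-statement rule, shown on one line for simplicity, and STRIP is added here only because RuleName is stored in a fixed-length variable first.

```sas
/* Sketch: remove every "rulename:" occurrence from a hypothetical rule. */
data _null_;
   length config cleaned $200 RuleName $36;
   config = "CLASSIFIER:_CONDITION_:asthma CLASSIFIER:_CONDITION_:diabetes";
   RuleName = scan(config, 2, ':');
   /* strip() drops the trailing blanks that pad RuleName to 36 bytes */
   cleaned = tranwrd(config, strip(RuleName) || ":", " ");
   put cleaned=;
run;
/* log shows: cleaned=CLASSIFIER: asthma CLASSIFIER: diabetes */
```

Because TRANWRD replaces all matches, one call cleans every statement in the rule, however many lines it has.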
Notice how the results above compare with the original rules that were entered in the SAS Visual Text Analytics application (shown below). Here is the original _ACTIVITY_ rule for example.
The resulting table is an inventory of your rules, including the complete syntax of your concept rules. You can run reports against the table to look for redundant rules and document advanced LITI syntax and best practices used in your organization. You can adapt this technique to work on other "behind the scenes" project tables.
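As one example of such a report, the step below lists the rules that use a particular LITI rule type. It is only a sketch: it assumes the casuser.two table built above, and PREDICATE_RULE is an arbitrary choice of rule type to search for.

```sas
/* Sketch: list rules whose syntax contains a given LITI rule type. */
/* Assumes casuser.two (rulename, cleaned) exists as created above. */
proc print data=casuser.two noobs;
   where find(cleaned, 'PREDICATE_RULE', 'i') > 0; /* 'i' ignores case */
   var rulename cleaned;
run;
```

Changing the search string lets you inventory other rule types or look for a specific term across every rule in the node.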
Next Steps
If this topic resonates with you, I strongly recommend that you check out this post on Programmatic Manipulation of Concept Rules. It is Part 2 of an excellent 3-part series for Power Users that takes this idea to the next level. In the post, you will find a reference to our public repository containing complete sample code for writing out the rules from all concept nodes in all pipelines that are in a Visual Text Analytics project!
Part 1 of the series provides recommended practices for structuring concept rules, while Part 3 provides sample code for integrating version control and quality assurance for your project using a Git repository.
Thanks for reading!
Here is the complete code used in this post.
cas;
caslib _all_ assign;
/* enter project and concept table name */
proc casutil incaslib="Analytics_Project_ce2afc9d-0ac4-438d-afa3-d417ae037fc5"
      outcaslib="casuser";
   save casdata="646dc624-5b67-47b9-b070-9ff6f91bb7a2_CONCEPT_RULESCONFIG"
      casout="concept_rules.sashdat" replace;
run;
/* load the results table so it can be viewed */
proc casutil incaslib="casuser" outcaslib="casuser";
   load casdata="concept_rules.sashdat" casout="concept_rules" replace;
run;
/* verify the maximum length of the concept rules in the node */
proc contents data=casuser.concept_rules;
run;
/* extract the RULENAME as a separate column in a temporary working dataset */
data casuser.one;
   /* modify rulename length if desired */
   length RuleName $36.;
   set casuser.concept_rules;
   /* save each rulename from the table */
   RuleName=scan(config, 2, ':');
run;
/* remove the rulename from each rule to see the original syntax */
data casuser.two (keep=rulename cleaned);
   set casuser.one;
   /* change length of output field based on proc contents output to allow for longer rules */
   length cleaned $320.;
   /* remove the rulename from the table output as well as the extra colon */
   /* tranwrd and scan functions are used */
   cleaned=tranwrd(config, (scan(config, 2, ':')||":"), " ");
   put cleaned=;
run;
Find more articles from SAS Global Enablement and Learning here.