Tracking and Interrogating Concept Rules in SAS® Visual Text Analytics (VTA)
In Part 1 and Part 2 of this series, we suggested ways to organize and manage information extraction models created with SAS® Visual Text Analytics (VTA). This third article builds on the first two to implement version control and quality assurance.
Version Control
We implemented version control for VTA projects by leveraging two of the SAS programs described in Part 2 of this series. The following process uses Git to track every custom concept in a VTA project.
Step 1: A Git repository was assigned to the directory named vta_version_control on the SAS Server.
Step 2: The program named export_concept_rules.sas was executed to export the contents of each custom concept in the VTA project named Color_Project to a SAS dataset.
Step 3: The program named concept_rules_2text_files.sas was executed, using the dataset output from Step 2 as input, to save the concept contents and any metadata for each custom concept to individual files on the SAS Server (see Figure 1).
Figure 1: SAS Server file structure showing two files associated with each custom concept in the VTA project named Color_Project.
An example of an exported custom concept file is shown in Figure 2 and an example of an exported metadata file is shown in Figure 3.
Figure 2: Example of a text file containing the exported custom concept contents.
Figure 3: Example of an info file containing the exported metadata.
Step 4: Git commands were executed to manage changes to the repository.
When a project's custom concepts are exported and updated in Git on a regular basis using the process described above, any information extraction VTA project gains the benefits of version control throughout development. Note that the export preserves the taxonomy hierarchy and names each text file after its custom concept.
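Step 4 does not list the specific Git commands. As a minimal sketch, assuming a conventional stage-and-commit workflow, a regular snapshot could be scripted as shown here. It is written in Python purely for illustration and is not one of the SAS programs from Part 2; the function names and the commit message are hypothetical, and only the standard Git subcommands status, add, and commit are used.

```python
import subprocess
from pathlib import Path


def build_git_commands(repo_dir, message):
    """Return the Git commands that stage and commit every change in repo_dir.

    Hypothetical helper: the article does not say which Git commands were run.
    """
    repo = str(Path(repo_dir))
    return [
        ["git", "-C", repo, "add", "--all"],           # stage new, changed, and deleted files
        ["git", "-C", repo, "commit", "-m", message],  # record the snapshot
    ]


def commit_snapshot(repo_dir, message):
    """Commit the exported concept files; return False if nothing changed."""
    status = subprocess.run(
        ["git", "-C", str(Path(repo_dir)), "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    if not status.stdout.strip():
        return False  # the export produced no differences, so skip the commit
    for command in build_git_commands(repo_dir, message):
        subprocess.run(command, check=True)
    return True
```

Calling commit_snapshot("vta_version_control", "Export Color_Project concepts") after Steps 2 and 3 would record one commit per export and return False when the export changed nothing.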
Quality Assurance
Before any model is sent into production, quality assurance checks are performed to ensure that it complies with expected standards. The standards for an information extraction VTA model, described in Part 1 of this series, can be easily checked by applying the methods described here. These methods have proved valuable to us in ensuring that a quality project is placed into production.
The methods described here are certainly not exhaustive. The intention is to share the ability to interrogate the contents of the VTA model by using some of the SAS code described in Part 2 of this series. This ability can be extended as the reader sees fit.
Referenced Custom Concepts
Since the name of each text file on the file server reflects the name of the corresponding custom concept, a quality check can be performed to ensure that each custom concept is referenced by at least one other custom concept. Indeed, helper and disambiguation custom concepts serve their purpose only if used within other custom concepts. See Part 1 for descriptions of the types of custom concepts mentioned here.
The VTA project from Part 1 of this series was modified to contain a stale custom concept, that is, a concept not referenced by any other custom concept in the VTA project. Figure 4 shows the custom concepts for the VTA project named Color_Project. The H_STALE custom concept has been added and contains only header information.
Figure 4: The custom concepts for the VTA project named Color_Project.
After implementing the version control process described above, the content of each custom concept in the project is available on the SAS Server as a text file. The quality check program named CDR_Reference_Check.sas loops through these text files, looks for the name of each custom concept within every other text file, and returns any custom concept that is not referenced by another custom concept. Results from this program for Color_Project are shown in Figure 5.
Figure 5: Output from CDR_Reference_Check.sas showing the list of custom concepts not referenced by any other custom concept in the project.
The results show one value, as expected: the H_STALE concept added for this test. In the same way, this program would identify any stale concept that was intended for use as a helper or disambiguation concept but is not actually referenced anywhere in the project.
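The logic of this check can be sketched in a few lines. The example below is an illustration in Python, not the CDR_Reference_Check.sas program itself. It assumes the exported concept files carry a .txt extension and that each file name (without extension) is the concept name; the function name and the whole-name matching rule are also assumptions.

```python
import re
from pathlib import Path


def find_unreferenced_concepts(root):
    """Return the names of concepts that no other concept file mentions.

    Assumption: every exported concept is a .txt file under root, named after
    the concept. Subdirectories (the taxonomy hierarchy) are searched too.
    """
    contents = {
        path.stem: path.read_text(encoding="utf-8")
        for path in Path(root).rglob("*.txt")
    }
    unreferenced = []
    for name in contents:
        # Match the whole name only, so H_SHADE is not "found" inside H_SHADE_2.
        pattern = re.compile(
            r"(?<![A-Za-z0-9_])" + re.escape(name) + r"(?![A-Za-z0-9_])"
        )
        referenced = any(
            pattern.search(text)
            for other, text in contents.items()
            if other != name
        )
        if not referenced:
            unreferenced.append(name)
    return sorted(unreferenced)
```

A concept's own file is excluded from the search because the header comments typically repeat the concept name. Note that a genuine top-level concept is also unreferenced by design, so the output still needs a human review.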
Rule Search
The text files produced by the version control process described above can also be searched to determine whether a specific string, or combination of strings, is present in each custom concept. The SAS program that executes this search is named CDR_Text_Check.sas. For example, any custom concept containing the REMOVE_ITEM concept definition rule (CDR) can be identified and then reviewed to confirm that this CDR is located at the top of the custom concept. The location of this CDR is a standard described in Part 1 of this series.
Figure 6: Output from CDR_Text_Check.sas showing the list of custom concepts containing the user assigned text string.
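A search of this kind can be sketched as follows. This is a Python illustration under stated assumptions, not the CDR_Text_Check.sas program: the .txt extension, the function name, and the choice to report line numbers are all hypothetical. Line numbers are returned here because they make it easier to review whether a rule such as REMOVE_ITEM sits near the top of a concept.

```python
from pathlib import Path


def find_concepts_containing(root, needle, case_sensitive=False):
    """Map each concept name to the line numbers on which needle appears.

    Assumption: every exported concept is a .txt file under root, named after
    the concept. Concepts without a match are left out of the result.
    """
    target = needle if case_sensitive else needle.lower()
    hits = {}
    for path in sorted(Path(root).rglob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for line_number, line in enumerate(lines, start=1):
            haystack = line if case_sensitive else line.lower()
            if target in haystack:
                hits.setdefault(path.stem, []).append(line_number)
    return hits
```

To list the concepts that do not contain the string, compare the keys of the result with the full set of file names under the same directory.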
Header Search
To confirm that the header of each custom concept contains all the elements suggested in Part 1 of this series, we executed Concept_Header_Check.sas. This program searches the comments within a custom concept for seven elements: text representing the concept name, plus the specific text strings ‘document type’, ‘purpose’, ‘supports’, ‘created’, ‘copyright’ and ‘all rights reserved’. Any custom concept lacking one or more of these elements is listed in the results, so the developer can address the missing elements in the identified custom concepts. Note that the H_STALE concept is not listed in the results (see Figure 7), since it contains all the desired header elements.
Figure 7: Output from Concept_Header_Check.sas showing the list of custom concepts without the desired header elements.
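The seven-element check can be sketched as below. This is a Python illustration, not the Concept_Header_Check.sas program. It assumes that exported concept files use a .txt extension, that comment lines begin with #, and that matching is case-insensitive; the function name and the report format are hypothetical.

```python
from pathlib import Path

# The six fixed strings named in the article; the concept name is the seventh element.
REQUIRED_STRINGS = (
    "document type",
    "purpose",
    "supports",
    "created",
    "copyright",
    "all rights reserved",
)


def missing_header_elements(root):
    """Map each concept name to the header elements absent from its comments.

    Assumptions: concept files are .txt files named after the concept, and a
    comment is any line whose first non-blank character is '#'.
    Concepts with a complete header are left out of the result.
    """
    report = {}
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        comments = "\n".join(
            line for line in text.splitlines() if line.lstrip().startswith("#")
        ).lower()
        expected = (path.stem.lower(),) + REQUIRED_STRINGS
        missing = [element for element in expected if element not in comments]
        if missing:
            report[path.stem] = missing
    return report
```

An empty result means every concept passed. Restricting the search to comment lines keeps a rule that happens to contain a word such as "created" from masking a missing header entry.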
Conclusion: Part 3
The ability to track changes to all our information extraction VTA projects has provided consistency and efficiency to our model development process. The ability to interrogate the contents of an information extraction VTA project has proven valuable for assuring a quality model is placed into production. You can access the code in our public repository: sas-vta-examples.
We have shared just a few examples of how leveraging the internal text analytics APIs has enabled new capabilities for tracking and interrogating VTA projects. Please continue to share your ideas and your use of these or other methods for organizing, reviewing, and managing a VTA model.
Thank you to my SAS collaborators and teammates for contributing their time and knowledge to this three-part series of articles.