USAID began working with SAS Content Categorization almost two years ago, and what a long, strange trip it’s been. USAID has an online archive of about 175,000 documents called the Development Experience Clearinghouse. Documents are added to the Clearinghouse (or the DEC, as we call it) at an average rate of 500 per month. USAID purchased SAS Content Categorization Studio as a means to manage the influx of documents submitted to the DEC.
One of the first decisions we made when starting our Text Analytics project was whether to use the statistical categorizer or a rules-based approach. With guidance from Denise Bedford, the USAID DEC staff chose the rules-based approach for a number of our profiles for the following reasons:
Because of these factors, it made sense to model the profiles on the way the human catalogers were cataloging rather than using statistical categorization.
But first we had to figure out how the humans actually cataloged and indexed documents. So much of cataloging and indexing is second nature to us that we had to deconstruct each step in the process of cataloging and indexing a document.
We mapped out every step a cataloger took to catalog and upload a document to the DEC system. <see attached document>
We then conducted extensive interviews with catalogers about how they decided what data to put into each field. Some of the questions that came up during this interview phase are below:
Then we began trying to translate this knowledge into SAS Content Categorization Studio in a way the application could use. Many of the questions above seem like basic, common-sense decisions that the catalogers no longer needed to consciously stop and think about for every single document. But as we tried to write rules to extract data from a document the way a human does, we found out just how complex these common-sense decisions actually were.
Using the example above, the rule that Mission Directors are not included as authors is a piece of institutional knowledge that the catalogers learned long ago and now apply without thinking about it. But somehow you have to tell SAS Content Categorization Studio that if a person’s name appears within a certain proximity of the title “Mission Director,” then that person is not an author. Oh, and you have to write the rule in such a way that it doesn’t cancel out any of the other rules you’ve written.
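To make the proximity idea concrete, here is a minimal Python sketch of the same logic. This is an illustration only: the real rules are written in SAS Content Categorization Studio’s own rule syntax, not Python, and the window size, names, and sample text below are invented for demonstration.

```python
import re

TITLE = "Mission Director"
WINDOW = 30  # characters of context to scan around a candidate name (assumed value)

def extract_authors(text, candidate_names):
    """Keep a candidate name as an author only if TITLE does not appear
    within WINDOW characters of the name's first occurrence."""
    authors = []
    for name in candidate_names:
        match = re.search(re.escape(name), text)
        if match is None:
            continue  # name not present in this document
        start = max(0, match.start() - WINDOW)
        end = match.end() + WINDOW
        if TITLE not in text[start:end]:
            authors.append(name)
    return authors

text = "Prepared by Jane Smith. Approved by John Doe, Mission Director."
print(extract_authors(text, ["Jane Smith", "John Doe"]))  # ['Jane Smith']
```

Even this toy version hints at the interaction problem: widen the window to catch one title and you start excluding legitimate authors who merely appear near a Mission Director’s name.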
We discovered the difficulties of writing rules for institutional knowledge as we went through an extensive testing phase, trying to find what gave the best results. If you are just starting a SAS Content Categorization project, be prepared to spend a lot of time testing and evaluating. If I could start over, I would have a more extensive plan for testing and evaluating the application’s output.
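The kind of evaluation plan I have in mind could be as simple as scoring the application’s output for each field against what the human catalogers recorded. A hypothetical sketch (the field values below are invented, and this is not part of the SAS toolset):

```python
def precision_recall(predicted, gold):
    """Set-based precision/recall for one metadata field on one document:
    predicted = values the application extracted, gold = the cataloger's record."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # values both the tool and the human agreed on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Example: the tool extracted two authors, the cataloger recorded one
p, r = precision_recall(["Jane Smith", "John Doe"], ["Jane Smith"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=1.00
```

Tracking numbers like these per field, per rule revision, makes it much easier to see whether a new rule improved the output or quietly broke something else.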
Questions: Can someone give an account of their experience building a statistical categorizer? Why did you choose the statistical categorizer? What was your process in building it? Did you stick to statistically generated rules, or did you move toward writing rules to refine the output?