
Using a SAS Enterprise Content Categorization project within SAS Contextual Analysis

by SAS Employee CraigDeVault on 03-30-2017 12:45 PM - edited on 03-30-2017 12:48 PM by Community Manager

When SAS Contextual Analysis was first released, there was no way to import a SAS Enterprise Content Categorization project.  The ability to import these projects was a roadmap item planned for a future release.  Starting in SAS Contextual Analysis 13.2, the ability to import a SAS Enterprise Content Categorization project was introduced.

 

On step 1 of the Create New Project wizard, there is an option to Import a SAS Enterprise Content Categorization project as seen below:

 

ecc_1.png

 

When selecting the SAS Enterprise Content Categorization project, make sure to choose the project file with an extension of .tk2.  The project folder needs to be accessible by the Workspace Server that is being used for SAS Contextual Analysis.

 

Here are some important notes from the documentation regarding the importing of a project:

  • Concepts that were defined by using the LITI (language interpretation and text interpretation) syntax in an imported SAS Enterprise Content Categorization project can be used in your SAS Contextual Analysis project.
  • Categories that were defined using Boolean rules (MCAT syntax) in an imported SAS Enterprise Content Categorization project can be used in your SAS Contextual Analysis project.
  • Concepts that were created using linguistic rules in SAS Enterprise Content Categorization are not supported.
  • In order for the LITI concepts to be parsed correctly in SAS Contextual Analysis, the parsing priority for disabled concepts must be honored. To ensure this, open your existing project in SAS Enterprise Content Categorization. For any child concept that was disabled, modify its parent concept so that the parent has a higher priority than the child. Save the project before you import it into SAS Contextual Analysis.
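
The priority requirement in the last note can be checked mechanically before you import. Here is a minimal Python sketch of that check, not something either SAS product provides: the `Concept` record, its field names, and the sample concepts are all hypothetical, and it assumes that a larger number means a higher priority.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Concept:
    # Hypothetical record; not the on-disk format of a .tk2 project.
    name: str
    priority: int                 # assumption: larger number = higher priority
    enabled: bool = True
    parent: Optional[str] = None  # name of the parent concept, if any


def priority_violations(concepts):
    """Return names of disabled child concepts whose parent does not outrank them."""
    by_name = {c.name: c for c in concepts}
    violations = []
    for c in concepts:
        # Only disabled concepts that have a parent are subject to the note.
        if c.enabled or c.parent is None:
            continue
        parent = by_name.get(c.parent)
        if parent is not None and parent.priority <= c.priority:
            violations.append(c.name)
    return violations


# Illustrative data only.
concepts = [
    Concept("DISASTER", priority=10),
    Concept("QUAKE", priority=20, enabled=False, parent="DISASTER"),
    Concept("FLOOD", priority=5, enabled=False, parent="DISASTER"),
]
print(priority_violations(concepts))
```

Any concept the check reports would need its parent's priority raised in SAS Enterprise Content Categorization before the project is saved and imported.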

Importing existing SAS Enterprise Content Categorization projects is a practical way to move the well-established rules in your text projects to newer technology.  Combining those rules with traditional SAS Text Miner features, such as different input sources, synonym and stop lists, and topic generation, allows for a more integrated analysis of the document corpus.  One of the biggest advantages of importing a project is that score code can be generated in DS2 (DATA step 2) format instead of traditional DATA step code.

 

In this example project, we have extracted 100 web pages that discuss disasters.  The imported SAS Enterprise Content Categorization project is fairly simple, with only one category, which deals with earthquakes.  Here is what the rule for earthquakes looks like in the project:

 

ecc_2.png

 

 

Now, in SAS Contextual Analysis, the above rule has been imported into the project.  After running the project, the output is split into four separate windows.  The first one is Concepts.  If predefined concepts are used, or if user-written concepts have been added, any matches for their rules are shown here:

 

ecc_3.png

 

Next, the project shows the terms found in the documents. In this output window, terms that are treated as synonyms are grouped together (for example, the terms run, running, ran, jog, and sprint can all be combined under the parent term run).  A parent term is marked by the folder icon to its left. In addition, terms that are not important to the analysis (that is, terms that do not differentiate documents) can be dropped by right-clicking the term in the Kept Terms tab and choosing Drop Term, as seen here:

 

ecc_4.png
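
To make the idea of parent terms and dropped terms concrete, here is a minimal Python sketch. It is only an illustration of the concept, not how SAS Contextual Analysis implements it: the synonym map reuses the run example from the text, while the dropped terms and the sample tokens are hypothetical.

```python
from collections import Counter

# Child term -> parent term, taken from the "run" example in the text.
SYNONYMS = {"running": "run", "ran": "run", "jog": "run", "sprint": "run"}

# Hypothetical terms an analyst chose to drop.
DROPPED = {"the", "a"}


def roll_up(tokens):
    """Count terms after mapping synonyms to their parent and removing dropped terms."""
    counts = Counter()
    for token in tokens:
        term = token.lower()
        term = SYNONYMS.get(term, term)  # replace a child term with its parent
        if term in DROPPED:
            continue
        counts[term] += 1
    return counts


tokens = ["Run", "running", "ran", "jog", "sprint", "the", "race"]
print(roll_up(tokens))
```

In this sketch all five variants are counted under the single parent term run, and the dropped term no longer contributes to the analysis.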

 

Next, the project shows the output for the generated topics.  The most descriptive terms for each topic help us understand what topics were found:

 

ecc_5.png


Without diving deep into the topics individually, we get a pretty good idea of what each topic is about by looking at its descriptive terms.  The terms are listed in order of significance from left to right: the earlier a term is listed, the more important it is to the topic.

  • Tornado, Hurricane, Cyclone, Blizzard, and Tsunami show that this topic contains web pages that cover natural disasters.
  • War, German, Empire, Army, and Italian show that this topic contains web pages regarding World War II.
  • Lava, Volcano, Eruption, Magma, and Ash show that this topic contains web pages regarding volcanoes.
  • Sinkhole, Cave, Hole, Karst, and Ft show that this topic contains web pages regarding holes in the Earth. With only two documents found in this topic, the analyst may want to consider whether to keep it.
  • Tide, Fault, Magnitude, Intensity, and Earthquake show that this topic contains web pages regarding earthquakes, specifically their aftereffects.
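
The left-to-right ordering of descriptive terms can be pictured as a sort by term weight. The following Python sketch assumes each topic carries a numeric weight per term; the weights below are made up for illustration and are not values from the project, and the function name is hypothetical.

```python
def descriptive_terms(term_weights, top_n=5):
    """Return the top_n terms, most heavily weighted first."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_n]]


# Hypothetical weights for the volcano topic (illustrative only).
volcano_topic = {
    "ash": 0.21,
    "lava": 0.62,
    "magma": 0.30,
    "volcano": 0.55,
    "eruption": 0.41,
    "crater": 0.12,
}
print(descriptive_terms(volcano_topic))
```

With these made-up weights, the sketch lists lava, volcano, eruption, magma, and ash in that order, mirroring how the most important term appears first.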

The final section of output contains the results from the category rules that come from SAS Enterprise Content Categorization.  Notice that the earthquake rule matches 44 documents:

 

ecc_6.png
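
Conceptually, a Boolean category rule is evaluated against each document, and the documents that satisfy it are counted. The Python sketch below illustrates that idea only. It is not MCAT syntax and not the imported earthquake rule: the nested-tuple rule format, the sample rule, and the three sample documents are hypothetical, and matching is simplified to case-insensitive substring search.

```python
def matches(rule, text):
    """Evaluate a nested ("AND" | "OR", operand, ...) rule against a document."""
    if isinstance(rule, str):
        # Leaf: simplified case-insensitive substring match.
        return rule.lower() in text.lower()
    op, *operands = rule
    results = (matches(operand, text) for operand in operands)
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical rule and documents, for illustration only.
rule = ("OR", "earthquake", ("AND", "seismic", "magnitude"))
docs = [
    "A magnitude 6 earthquake struck offshore.",
    "Seismic sensors recorded a magnitude 4 event.",
    "Lava flowed from the volcano.",
]
matched = [doc for doc in docs if matches(rule, doc)]
print(len(matched))
```

The count printed here plays the same role as the document count shown for the earthquake category.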

 

If I look in the Edit Rules section, I can see the rule, which matches what was imported:

 

ecc_7.png

 

If I look in the Documents tab, I can see the documents where matches occurred for the rule:

 

ecc_8.png


Although category rules can be entered directly in SAS Contextual Analysis, importing a previously created SAS Enterprise Content Categorization project brings in the simple and complex rules that have already been built.  This lets you get more information out of your data more quickly and then apply the findings to future documents.
