SAS Contextual Analysis is a web-based text analytics application that uses contextual analysis to provide a comprehensive solution to the challenge of identifying and categorizing key textual data.
As an analyst, you may receive textual data in different formats. For example, it could be text-based documents stored within a directory in your network, prepared as a SAS data set, or recorded in a spreadsheet.
SAS Contextual Analysis is able to work with most of these formats; however, additional steps are sometimes required to bring in your data. In this article, we first discuss how SAS Contextual Analysis reads in textual data. Then we discuss two commonly seen hurdles and how to work around them.
When you define the data source in your project, there are basically two options to identify your data:
Option 1: A SAS data set. To enable this method, select the “Select variables from within a data set” option as shown below:
The content of the Text variable is either the actual text or the file locations of the text documents (if you check the “Text variable contains a file reference” box). The latter lets you refer directly to text documents that are stored in network directories.
The file reference approach also has an advantage when you work with large documents. The length of a SAS character variable cannot exceed 32,767 bytes, so the maximum length of text that can be stored and analyzed is around 32K. The exception is the file reference approach, where SAS Contextual Analysis trains on the entire content of each document.
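As a rough pre-check, the sketch below flags texts whose encoded size exceeds that limit, so you can tell which ones would need the file reference approach. It is written in Python purely for illustration and is not part of SAS Contextual Analysis; the function name and the default UTF-8 encoding are assumptions.

```python
SAS_CHAR_MAX_BYTES = 32767  # maximum length of a SAS character variable


def needs_file_reference(text, encoding="utf-8"):
    """Return True if the text would not fit in one SAS character variable."""
    # The limit is counted in bytes, so measure the encoded size, not len(text).
    return len(text.encode(encoding)) > SAS_CHAR_MAX_BYTES
```

Note that multi-byte characters count more than once, so a document can exceed the limit with fewer than 32,767 characters.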
If your documents are large, consider creating a variable that contains the file location of each document while you are preparing your data. Also be aware that these documents are required to be text files, that is, files with the .txt file extension.
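One way to prepare such a variable is to list the .txt files in a folder and write their locations to a small table. The following is a minimal sketch in Python, not a SAS feature: the function name and the column names "doc_id" and "file_path" are hypothetical, and the resulting CSV file would still have to be imported into a SAS data set and registered. It also assumes the resolved paths are reachable from the machine where SAS runs.

```python
import csv
from pathlib import Path


def write_file_reference_table(doc_dir, out_csv):
    """Write one row per .txt file under doc_dir: a document ID and its full path."""
    # rglob also descends into subfolders; only files ending in .txt are kept.
    paths = sorted(p.resolve() for p in Path(doc_dir).rglob("*.txt") if p.is_file())
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["doc_id", "file_path"])  # hypothetical column names
        for doc_id, path in enumerate(paths, start=1):
            writer.writerow([doc_id, str(path)])
    return len(paths)
```

The function returns the number of documents it listed, which gives a quick check against the number of files you expect.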
Option 2: A document collection. To enable this method, select the “Use a file in a directory” button as shown below.
The input documents can be in text-based file formats such as MS Office, OpenDocument (OpenOffice), PDF, XML, HTML, and others. You can browse to the directory that contains your corpus. Documents contained in the subfolders are also imported and analyzed.
This input option allows you to read documents with various data formats, which SAS Contextual Analysis normalizes into text files to be processed later on.
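Because subfolders are included, it can help to preview what a directory contains before you point a project at it. The sketch below is a hypothetical Python helper, not part of SAS Contextual Analysis; it only counts files by extension and makes no claim about which formats the product will accept.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(corpus_dir):
    """Count files under corpus_dir, subfolders included, by lowercased extension."""
    return Counter(
        p.suffix.lower() or "(none)"  # files without an extension are grouped together
        for p in Path(corpus_dir).rglob("*")
        if p.is_file()
    )
```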
Challenge 1: The SAS data sets must be registered in metadata
A SAS data set must be registered in metadata before it appears in your SAS Contextual Analysis data library during data source creation. This requirement also applies to the synonym list, stop list, and start list. Registration is most likely performed by your SAS administrator. The steps for registering data are given in the section “Quick Start Steps” of the SAS Contextual Analysis 14.2: Administrator's Guide.
This means that you must have metadata permission to register your SAS data sets in SAS Metadata. As a SAS Contextual Analysis user, you probably do not have the required permission. This becomes a challenge when you have many input tables to work on, and your SAS administrator has to register each table for you once it is ready to be analyzed.
One workaround is to have your SAS administrator (for example, “sasadm”) perform the following two steps just once so that you can register tables yourself.
Step 1: “sasadm” creates a metadata folder (here, “meta_dir_sasdemo”) under “/SharedData” for your user ID (for example, “sasdemo”) as shown below.
Step 2: “sasadm” grants the required metadata permissions to “sasdemo” on its own metadata folder “meta_dir_sasdemo” under “/SharedData”.
Once these steps are performed, “sasdemo” logs on to SAS Management Console and creates its own library (for instance, “demolib”) under the “meta_dir_sasdemo” metadata folder as shown in the following two screen shots:
The “sasdemo” user can now register its own data set (for instance, “COMMENTS”) in SAS Management Console without having to ask the SAS administrator.
Once the table “COMMENTS” is registered, “sasdemo” is able to find the table in the personal library during data source creation:
Challenge 2: My text data is stored in an Excel file
If your text data is stored as a Microsoft Excel file, SAS Contextual Analysis imports the entire content of a worksheet and converts it to a single text document. This behavior is probably not what you expect: you most likely want the text in each cell to be treated as a separate document.
To work around this, you can either create a SAS data set from the Excel file or create a folder containing the collection of documents (extracted from the text column in your Excel file). We will focus on the first approach by using File -> Import Data in SAS Enterprise Guide as shown in the following steps.
Step 1: Select the output library for the SAS data set to be created in. Here we use the same library “demolib” that was created in the example shown in the first challenge. Once the SAS data set is created, the only step left is to register the data set in your metadata library.
Step 2: Use File -> Import Data to import your Excel file:
Open your Excel file.
The first window asks for information about the source Excel file and the corresponding output data set to be created.
Next, adjust the settings according to how your data is stored in your worksheet.
Next, verify the attributes of your text variable and modify them if necessary:
Lastly, examine your text column to see if the content is as expected.
Once the SAS data set is created, you can register it using the workaround provided in the first challenge.
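The second approach mentioned earlier, a folder with one document per cell, is not walked through in this article. As a rough sketch only, and assuming the worksheet has first been saved as a CSV file with a header row, each value of the text column could be written to its own .txt file as follows. The function name, the file naming scheme, and the CSV export step are all assumptions made for illustration.

```python
import csv
from pathlib import Path


def rows_to_text_files(csv_path, text_column, out_dir):
    """Write each non-empty value of text_column to its own .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    # "utf-8-sig" reads plain UTF-8 and also drops the byte order mark
    # that some spreadsheet exports add at the start of the file.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        for row_number, row in enumerate(csv.DictReader(f), start=1):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip empty cells
            # A zero-padded row number keeps the files in worksheet order.
            (out / f"doc_{row_number:05d}.txt").write_text(text, encoding="utf-8")
            written += 1
    return written
```

The resulting folder can then be used as a document collection, as described under Option 2.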