Importing Text Data into SAS Viya
If you have worked with SAS and unstructured text before, you are probably familiar with the number 32,767, the maximum length of a character field in the SAS 9 engine. This limit often comes into play when importing text data from sources such as PDF, Word, and HTML files. SAS Viya introduces a new data type to SAS users: VARCHAR. A VARCHAR variable allows a maximum of 536,870,911 characters (UTF-8 encoding).
The VARCHAR data type is important because it saves space when you work with fields of varying length. A fixed-length CHAR field of 32,767 characters that is populated with the word “N/A” stores 32,764 trailing blanks, a large waste of memory. A VARCHAR field uses only the space that is needed (3 bytes for “N/A”) plus a little overhead (16 bytes). If you are trying to decide between CHAR and VARCHAR, use this guide:
· CHAR: use a fixed-width CHAR when the column data entries are similar in size. Fixed-width columns are usually faster to access.
· VARCHAR(n): use when the sizes of the column data entries vary considerably but you are reasonably certain they will not exceed a certain width.
· VARCHAR(*): use when the sizes of the column data entries vary considerably and might exceed any width limit you would place on the column.
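The storage arithmetic above can be checked with a short calculation. The Python sketch below is illustrative only: the function names are hypothetical, and the 16-byte overhead is the approximate figure quoted in this post, not an exact engine constant.

```python
CHAR_WIDTH = 32_767      # fixed width of the CHAR column in the example
VARCHAR_OVERHEAD = 16    # approximate per-value overhead quoted above (assumption)


def char_bytes(value: str, width: int = CHAR_WIDTH) -> int:
    """A fixed-width CHAR value always occupies the full declared width."""
    if len(value.encode("utf-8")) > width:
        raise ValueError("value does not fit in the declared CHAR width")
    return width


def varchar_bytes(value: str, overhead: int = VARCHAR_OVERHEAD) -> int:
    """A VARCHAR value occupies its own bytes plus a small overhead."""
    return len(value.encode("utf-8")) + overhead


value = "N/A"
padding = char_bytes(value) - len(value.encode("utf-8"))
print(padding)               # 32764 trailing blanks in the CHAR field
print(varchar_bytes(value))  # 19 bytes for the VARCHAR field
```

For a column where most values are short, the gap between the two totals grows with every row, which is why the choice matters for large text tables.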
In this post, I cover three topics:
· Review the Character Variable Padding (CVP) libname engine, which converts the CHAR variables in sas7bdat datasets to VARCHAR variables for SAS Viya
· Demonstrate code that imports a document collection programmatically
· Provide the steps to import a document collection through the UI
Converting CHAR Variables to VARCHAR
The CVP libname engine allows you to convert CHAR variables to VARCHAR before loading the data into a caslib. Specifying the CVPVARCHAR=YES option in a CVP libname statement converts all CHAR variables in the data to the VARCHAR data type.
The results of PROC CONTENTS show variables with the VARCHAR data type. The example dataset contains 31 variables and 848k observations, and its longest character field is 2,048 characters. Loaded into CAS with the VARCHAR data type, the data produces a 0.79 GB table, compared with 8.97 GB without VARCHAR, which is roughly 11 times smaller.
Importing Documents Programmatically
The code below shows an example of importing a document collection programmatically. The first step is to start a CAS session and define a caslib that points to the document location. Since the documents are split into multiple folders (/PDF and /HTML), it is important to set the “subdirs” option on the caslib. This option includes files in subdirectories in the import process.
The example imports seven files into a CAS table with columns for the file path, file name, file type, and the content of the imported file. For this example, we import four HTML pages from Frank Silva’s SAS blog and three PDF files with information on different SAS products.
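To make the shape of the resulting table concrete, here is a minimal Python sketch that walks a folder tree recursively and records the same kind of columns (path, file name, file type). This is not how SAS performs the import, and it does not extract document content. The function name `list_documents`, the column names, and the extension filter are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def list_documents(root, extensions=("pdf", "html")):
    """Return one row per matching file under root, including subfolders."""
    wanted = {ext.lower() for ext in extensions}
    rows = []
    # rglob("*") descends into subdirectories, similar in spirit to "subdirs"
    for path in sorted(Path(root).rglob("*")):
        file_type = path.suffix.lstrip(".").lower()
        if path.is_file() and file_type in wanted:
            rows.append(
                {"path": str(path), "fileName": path.name, "fileType": file_type}
            )
    return rows
```

Calling `list_documents` on a folder that contains /PDF and /HTML subfolders returns one row per PDF or HTML file and skips everything else, which mirrors the one-row-per-document layout described above.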
Importing Documents with the UI
The first step in importing a collection of documents is to select “Documents Directory” as the data import source and choose the caslib. This imports all potential text files located in the path where the caslib is defined.
The document directory import has options for searching the directories recursively for documents and for restricting the import to specific file types. When ready, you can import the files or create a SAS Job (via the SAS Job icon located next to the Import Item button). Creating a SAS Job allows the import process to be scheduled and run regularly with SAS Job Execution’s built-in scheduler.
Frequently Asked Questions
· Which file formats can be converted and imported using the Document Directory import?
o SAS can import the following file types:
· What about importing JSON files, since they aren’t included in the list above?
o JSON files can be imported using the JSON libname engine.
· What product do I need to be able to import a document directory?
o SAS Visual Analytics on Viya is all that is needed for importing a document directory.
· Once I have the text data imported, how do I analyze it?
o SAS Visual Analytics on Viya provides the ability to explore text topics and sentiment of unstructured text.
o For deeper analysis, additional text analytics capabilities can be found in SAS Visual Text Analytics and SAS Visual Data Mining and Machine Learning.