If you have worked with SAS and unstructured text before, you are probably familiar with the number 32,767, the maximum length of a character field in the SAS 9 engine. This limit often comes into play when importing text data from sources such as PDF, Word, and HTML files. SAS Viya introduces a new data type to SAS users: VARCHAR. A VARCHAR field allows a maximum of 536,870,911 characters (UTF-8 encoding).
The VARCHAR data type is important because it saves space when working with fields of varying length. A fixed-length CHAR field of 32,767 bytes that holds the value “N/A” stores 32,764 trailing blanks, a large waste of memory. A VARCHAR field uses only the space that is needed (3 bytes for “N/A”) plus a little overhead (16 bytes). If trying to decide between CHAR and VARCHAR, use this guide:
In this blog, I will cover three primary objectives:
The CVP libname engine allows you to convert CHAR variables to VARCHAR before loading the data into a caslib. Specifying the CVPVARCHAR=YES option in a CVP libname statement converts all CHAR variables in the data to the VARCHAR data type.
The PROC CONTENTS results show variables with the VARCHAR data type. The example dataset contains 31 variables and 848k observations, and its longest character field is 2,048 characters. Loaded into CAS with the VARCHAR data type, the table takes 0.79 GB, compared to 8.97 GB without VARCHAR.
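To see where savings of this kind come from, here is a rough back-of-the-envelope sketch. It is written in Python rather than SAS and is not how CAS measures table size. It simply applies the figures quoted earlier in this post: a CHAR value always occupies its declared width, while a VARCHAR value is assumed to cost its UTF-8 byte length plus 16 bytes of overhead. The function names and the sample values are hypothetical and for illustration only.

```python
# Simplified storage model based on the figures quoted in this post.
# Assumption: VARCHAR cost = UTF-8 byte length + 16 bytes of overhead.
VARCHAR_OVERHEAD_BYTES = 16


def char_bytes(declared_width: int) -> int:
    """A fixed-width CHAR value always occupies the full declared width."""
    return declared_width


def varchar_bytes(value: str) -> int:
    """A VARCHAR value occupies only its encoded length plus the overhead."""
    return len(value.encode("utf-8")) + VARCHAR_OVERHEAD_BYTES


def column_bytes(values, declared_width):
    """Return (CHAR total, VARCHAR total) in bytes for one column of values."""
    fixed = sum(char_bytes(declared_width) for _ in values)
    variable = sum(varchar_bytes(v) for v in values)
    return fixed, variable


# Hypothetical column: two short values and one that fills a 2,048-byte field.
values = ["N/A", "short note", "x" * 2048]
fixed, variable = column_bytes(values, declared_width=2048)
print(f"CHAR(2048): {fixed} bytes, VARCHAR: {variable} bytes")
# prints "CHAR(2048): 6144 bytes, VARCHAR: 2109 bytes"
```

Under this model the overhead also works in the other direction: a three-character value in a CHAR field of width 8 costs 8 bytes, but 19 bytes as VARCHAR. The savings come from columns whose declared width is much larger than most of the values they hold.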
The code below shows an example of importing a document collection programmatically. The first step is to start a CAS session and define a caslib that points to the document location. Since the documents are split into multiple folders (/PDF and /HTML), it is important to specify the “subdirs” option on the caslib. This allows files in subdirectories to be included in the import process.
The example imports seven files into a CAS table with columns for the file path, filename, file type, and the content of the imported file. For this example, we import four HTML pages from Frank Silva’s SAS blog and three PDF files with information on different SAS products.
Importing Documents with the UI
The first step in importing a collection of documents is selecting “Documents Directory” as the data import source and choosing the caslib. This will import all potential text files located in the path for which the caslib is defined.
The document directory import has options for searching the directories recursively for documents and for restricting the import to specific file types. When ready, the files can be imported or a SAS Job created (via the SAS Job icon located next to the Import Item button). Creating a SAS Job allows the import process to be scheduled and run regularly with SAS Job Execution’s built-in scheduler.
· What type of file format can be converted and imported using the Document Directory import?
o SAS can import the following file types:
· What about importing JSON files, since they aren’t included in the list above?
o JSON files can be imported using the JSON libname engine.
· What product do I need to be able to import a document directory?
o SAS Visual Analytics on Viya is all that is needed for importing a document directory.
· Once I have the text data imported, how do I analyze it?
o SAS Visual Analytics on Viya provides the ability to explore text topics and sentiment of unstructured text.