
Importing Text Data into SAS Viya

Started 11-06-2019, modified 11-08-2019

If you have worked with SAS and unstructured text before, you are probably familiar with the number 32,767, the maximum length of a character field in the SAS 9 engine. This limit often comes into play when importing text data from sources such as PDF, Word, and HTML files. SAS Viya introduces a new data type to SAS users: VARCHAR. A VARCHAR field allows a maximum of 536,870,911 characters (UTF-8 encoding).

The VARCHAR data type is important because it saves space when working with fields of varying length. A fixed-length CHAR field of 32,767 characters that is populated with the value “N/A” stores 32,764 trailing blanks, a large waste of memory. A VARCHAR field uses only the space that is needed (3 bytes for “N/A”) plus a little overhead (16 bytes). When deciding between CHAR and VARCHAR, use this guide:

  • CHAR – use a fixed-width CHAR when the sizes of the column data entries are similar. Fixed-width columns are usually accessed faster.
  • VARCHAR(n) – use when the sizes of the column data vary considerably but you are reasonably certain they will not exceed a certain width.
  • VARCHAR(*) – use when the sizes of the column data vary considerably and the column width might exceed any limits you might place on it.
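
To make the three choices in the guide concrete, here is a minimal sketch that declares one column of each kind in a CAS table. It assumes a running SAS Viya environment with access to the casuser caslib; the session name, libref, table name, and column names are hypothetical and not from the original article.

```sas
/* Hypothetical session and libref names; adjust for your environment */
cas mysess;
libname mycas cas caslib="casuser";

data mycas.char_vs_varchar;
   /* Fixed-width CHAR: every value is about the same size */
   length status $ 3;
   /* VARCHAR(n): sizes vary, but 200 characters is a safe ceiling */
   length summary varchar(200);
   /* VARCHAR(*): sizes vary and no sensible ceiling is known */
   length body varchar(*);

   status  = "N/A";
   summary = "Short note";
   body    = "Text of arbitrary length";
run;

/* The Type column of the output should distinguish Char from Varchar */
proc contents data=mycas.char_vs_varchar;
run;
```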

In this blog, I will cover three primary objectives:

  1. Review the Character Variable Padding (CVP) libname engine, which converts sas7bdat datasets with CHAR variables to VARCHAR variables for SAS Viya
  2. Demonstrate code for importing a document collection programmatically
  3. Provide the steps to import a document collection through the UI

 

Converting CHAR Variables to VARCHAR

The CVP libname engine allows you to convert CHAR variables to VARCHAR before loading the data into a caslib. Specifying the CVPVARCHAR=YES option in a CVP libname statement converts all CHAR variables in the data to the VARCHAR data type.

Import sas7bdat with VarChar.png
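
The code from the screenshot is not reproduced in the text, so the following is only a minimal sketch of the approach described above. The directory path, librefs, session name, and table name are hypothetical; the CVP engine and the CVPVARCHAR=YES option are the ones named in the article.

```sas
/* Hypothetical path: the directory that holds the sas7bdat files */
libname src cvp "/path/to/sas7bdat/files" cvpvarchar=yes;

/* Hypothetical session name */
cas mysess;

/* Load the table into CAS through the CVP libref so CHAR becomes VARCHAR */
proc casutil;
   load data=src.mytable casout="mytable" outcaslib="casuser" replace;
run;
quit;

/* Inspect the variable types of the loaded table */
libname mycas cas caslib="casuser";
proc contents data=mycas.mytable;
run;
```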

 

The PROC CONTENTS results show variables with the VARCHAR data type. The example dataset contains 31 variables and 848k observations, and its longest character field is 2,048 characters. Loaded into CAS with the VARCHAR data type, the table takes 0.79 GB, compared to 8.97 GB without VARCHAR.

Proc Contents Results.png

Importing Documents Programmatically

The code below shows an example of importing a document collection programmatically. The first step is to start a CAS session and define a caslib that points to the document location. Since the documents are split into multiple folders (/PDF and /HTML), it is important to set the “subdirs” option on the caslib. This option includes files in subdirectories in the import process.

Import Document Directory.png
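
The screenshot's code is not available as text, so here is a minimal sketch of the steps just described. The server path, session name, and output table name are hypothetical. The import options shown (fileType, recurse, tikaConv) are the ones that appear in the comment thread at the end of this article, and path="" is an assumption that an empty relative path refers to the root of the caslib.

```sas
/* Hypothetical session name */
cas mysess;

/* SUBDIRS exposes files in subdirectories such as /PDF and /HTML.
   The path is hypothetical and must exist on the SAS server. */
caslib my_docs datasource=(srctype="path") path="/path/to/documents" subdirs;

/* Assumption: path="" is relative to the caslib and means its root.
   recurse=TRUE walks the subdirectories of that root. */
proc cas;
   table.loadTable /
      caslib="my_docs"
      path=""
      importOptions={fileType="DOCUMENT", recurse=TRUE, tikaConv=TRUE}
      casOut={name="my_docs", caslib="casuser", replace=TRUE};
quit;
```

Note that the path given to loadTable is relative to the caslib. A leading slash makes it an absolute path, which is rejected.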

The example imports seven files into a CAS table with columns for the file path, file name, file type, and content of each imported file. For this example, we import four HTML pages from Frank Silva’s SAS blog and three PDF files with information on different SAS products.

Imported Documents_v2.png

Importing Documents with the UI

The first step in importing a collection of documents is to select “Documents Directory” as the data import source and choose the caslib. This imports all potential text files located in the path where the caslib is defined.

 

Select Document Directory.png

 

The document directory import has options for searching the directories recursively for documents and for restricting the import to specific file types. When ready, you can import the files or create a SAS Job (via the SAS Job icon located next to the Import Item button). Creating a SAS Job allows the import process to be scheduled and run regularly with SAS Job Execution’s built-in scheduler.

Import Wizard Options.png

Frequently Asked Questions

  • What file formats can be converted and imported using the Document Directory import?
      ◦ SAS can import the following file types:

Supported File Formats.png

  • What about importing JSON files, since JSON is not included in the list above?
      ◦ JSON files can be imported using the JSON libname engine.

  • What product do I need to be able to import a document directory?
      ◦ SAS Visual Analytics on Viya is all that is needed for importing a document directory.

  • Once I have the text data imported, how do I analyze it?
      ◦ SAS Visual Analytics on Viya provides the ability to explore text topics and sentiment of unstructured text.
      ◦ For deeper analysis, additional text analytics capabilities can be found in SAS Visual Text Analytics and SAS Visual Data Mining and Machine Learning.
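
For the JSON question above, a minimal sketch of reading a file with the JSON libname engine and copying the result into CAS might look like this. The file path, librefs, session name, and output table name are hypothetical. ALLDATA is the data set the JSON engine generates to hold every value in the file.

```sas
/* Hypothetical path to a JSON file on the SAS server */
libname myjson json "/path/to/file.json";

/* List the data sets the JSON engine derived from the file */
proc datasets lib=myjson;
quit;

/* Hypothetical session and libref names */
cas mysess;
libname mycas cas caslib="casuser";

/* Copy the flattened ALLDATA view of the file into a CAS table */
data mycas.json_alldata;
   set myjson.alldata;
run;
```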

Comments

I tried importing a document collection from my c drive and I got the error the path is not an absolute path.  What is an absolute path? 

Is the C drive a path on your local computer? The required path should point to a path on the SAS server. 

How do you move an entire collection from your local computer to the SAS server? I can only seem to move one doc at a time.

What method are you using to move the documents? You can use a third-party tool such as MobaXterm or WinSCP to transfer documents to the SAS server.

I go to import and it says drag local files here, but I can only drag one at a time.  Needing separate software seems like a major limitation, but it is what it is.  

You can also use SAS Studio (note: not SAS Studio V) to upload files to the SAS server. You can upload multiple files at a time with this method.

Can you post the steps or code or whatever?

Navigate to the /SASStudio URL and look at the panel on the left-hand side (see photo attached). Select the path you want to upload the data to and click the upload icon (the 4th of the 6 icons). This will allow you to upload multiple files from your local machine.

Studio_Upload.JPG

 

I am working with SAS Viya and have the following questions:

1. Will the method in "Importing Documents with the UI" convert PDF files into a SAS dataset?
2. In the snapshot under the heading "Importing Documents with the UI", I am unable to locate the folder MY_DOCS. If I want to create one and put all my files there, how can I do that? I am using the SAS Viya shared drive.

1. Will the method in "Importing Documents with the UI" convert PDF files into a SAS dataset?

- Yes, this process will convert machine-readable PDF files and import them into a SAS dataset.

 

2. In the snapshot under the heading "Importing Documents with the UI", I am unable to locate the folder MY_DOCS. If I want to create one and put all my files there, how can I do that? I am using the SAS Viya shared drive.

- MY_DOCS was a caslib I created in the programmatic example. You can also create a caslib in the GUI when selecting the location of the document directory. The screenshots below show the steps needed to set up a caslib.

 

Create Caslib Connection_1.jpg

 

Create Caslib Connection_2.jpg

 

Thank you so much @brumil, for making things clear!

 

I have one more question.

How do we make a caslib point to the highlighted folder structure in the attached snapshot? This is the SAS Viya shared drive.

Caslib issue.jpg

 

I am unable to find the path of the folder "TextAnalytics", so I cannot create a caslib pointing to the directory containing the PDFs. Or do we have to use SAS Studio for this? If so, how do we establish a connection between SAS Studio and the SAS shared drive?

The folder you are pointing to is a SAS content folder. To assign a caslib, you need to point to a file path on your server. In Viya 3.5, you can access these folders in /SASStudioV. If you are using a prior version, you can access them in /SASStudio. There is a screenshot of this in a previous comment on 11/8/2019.

Hi, I'm experiencing an error trying to load:

proc cas;
	table.loadTable /
		casOut={name="DIARIO_BELEM",
				caslib="casuser",
				replace=TRUE}
		caslib="my_docs"
		importOptions={fileType="DOCUMENT",
					   recurse=TRUE,
					   tikaConv=TRUE,
					   tikaPath="/opt/sas/viya/home/SASFoundation/lib/docconvjars"}
	path="/sasdata/";
run;

ERROR: An access control check was detected to a full, rather than a partial path: /sasdata/.
ERROR: Access denied.
ERROR: The action stopped due to errors.

 

I checked all paths and the permissions for the service user that we're using to connect to Studio, including full permission on the directories.

 

What could be causing this error?

 

Regards, 

 

@BilaTheLegend path="/sasdata/" is pointing to an absolute path. I am assuming sasdata is a subfolder that exists in the path for caslib="my_docs". Try switching your code to this:  path="sasdata/" 
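
As a sketch of that change, and still assuming sasdata is a subfolder under the path of the my_docs caslib, the call from the question would become the following. Everything except the path value is copied from the original post, including the tikaPath setting, which is specific to that installation.

```sas
proc cas;
   /* path is relative to the my_docs caslib, so it has no leading slash */
   table.loadTable /
      casOut={name="DIARIO_BELEM",
              caslib="casuser",
              replace=TRUE}
      caslib="my_docs"
      importOptions={fileType="DOCUMENT",
                     recurse=TRUE,
                     tikaConv=TRUE,
                     tikaPath="/opt/sas/viya/home/SASFoundation/lib/docconvjars"}
      path="sasdata/";
quit;
```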

 

 

