
SAS Viya 3.5: CAS loading supported file types and platforms

Started 01-09-2020 (modified 01-09-2020)

SAS Viya 3.5 greatly extends the list of file types that can be loaded into CAS. However, the supported file types differ from platform to platform (what we call “Platform Data Sources” in the CAS context). In this post, I’ll summarize what CAS can read and from which platform.

 

What would be better than a table to show this? Here it is.

 

nir_post44_cas_filetypes_table.png


 

Let me go into a little more detail.

 

Basically, we have the supported file types in rows, and platform CASLIBs in columns. Cells can be filled with “S” green circles (serial load is supported), “P” dark green circles (parallel load is supported) or can be empty (no support at all).

 

Orange triangles show new capabilities. Endnotes provide details on specific cases.

New platform data source

ADLS stands for Azure Data Lake Storage. Essentially, this is the Microsoft Azure object storage equivalent of Amazon S3. CAS requires an Azure Data Lake Storage Gen 2 directory. As the table indicates, only two file types are supported on this platform: CSV and ORC.
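
As a rough sketch of what this can look like, the following defines an ADLS CASLIB and loads a CSV file from it serially. Every value in angle brackets is a placeholder, the CASLIB name and the file name are hypothetical, and the data source option names are given from memory, so check them against the ADLS data source documentation for your release before relying on them.

```sas
/* Sketch only: replace every value in angle brackets with your own */
caslib adls datasource=(
      srctype="adls"
      accountname="<storage-account>"
      filesystem="<file-system>"
      dnsSuffix="dfs.core.windows.net"
      timeout=50000
      tenantid="<tenant-id>"
      applicationId="<application-id>"
   )
   path="/"
   subdirs ;

proc casutil incaslib="adls" outcaslib="adls" ;
   /* "orders.csv" is a hypothetical file in the ADLS Gen 2 directory */
   load casdata="orders.csv" casout="orders" replace
        importoptions=(filetype="csv") ;
quit ;
```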

New file types support

Among the big new capabilities of Viya 3.5 is the ability to read Apache Parquet and ORC columnar file formats. Parquet is readable serially from PATH, and in parallel from DNFS and S3. ORC files are available on PATH and Azure Data Lake Gen 2 platforms in serial mode.
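
To illustrate, here is a minimal sketch of loading Parquet and ORC files with PROC CASUTIL. The CASLIB names, paths and file names are hypothetical, and the CASLIBs are declared with the datasource=(srctype=...) form rather than the shorter form used elsewhere in this post. The statement is the same on every platform; the table above determines whether the load runs serially or in parallel.

```sas
/* Hypothetical CASLIB names, paths and file names throughout */
caslib mypath datasource=(srctype="path") path="/data/columnar" ;
caslib mydnfs datasource=(srctype="dnfs") path="/shared/columnar" ;

proc casutil ;
   /* Serial load of a Parquet file from a PATH CASLIB */
   load casdata="sales.parquet" incaslib="mypath" outcaslib="mypath"
        casout="sales_pq" replace importoptions=(filetype="parquet") ;
   /* Serial load of an ORC file from the same PATH CASLIB */
   load casdata="sales.orc" incaslib="mypath" outcaslib="mypath"
        casout="sales_orc" replace importoptions=(filetype="orc") ;
   /* Parquet load from a DNFS CASLIB, where parallel loading is supported */
   load casdata="sales.parquet" incaslib="mydnfs" outcaslib="mydnfs"
        casout="sales_pq_par" replace importoptions=(filetype="parquet") ;
quit ;
```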

Media files

Loading images, documents, audio and video files into CAS is not really new in SAS Viya 3.5. What is new is that:

  • it no longer requires specific CAS actions such as loadImages: the loadTable CAS action and the CASUTIL LOAD CASDATA statement can now load these file types very easily, even all at once
  • it now supports the DNFS and AWS S3 platforms and can run in parallel

Syntax example:

 

nir_post44_import_media_files_code-1024x475.png

 

In addition to the picture, the code is provided below:

 

/* PATH CASLIB; subdirs gives access to the sub-directories of the path */
caslib MYFILES type=path path="/gelcontent/demo/DM/data" subdirs libref=MYFILES ;
/* Each casdata= value below is a sub-directory of the CASLIB path */
proc casutil incaslib="MYFILES" outcaslib="MYFILES" ;
   load casdata="import_document" importOptions=(filetype="DOCUMENT") casout="MYDOCS" ;
   load casdata="import_images" importOptions=(filetype="IMAGE") casout="MYIMGS" ;
   load casdata="import_audio" importOptions=(filetype="AUDIO") casout="MYAUDIOS" ;
   load casdata="import_video" importOptions=(filetype="VIDEO") casout="MYVIDEOS" ;
   load casdata="import_any" importOptions=(filetype="ANY") casout="COMBINED" ;
quit ;
/* Load the documents stored directly in the main CASLIB path (path="") */
proc cas ;
   table.loadTable /
             caslib="MYFILES" path="" importOptions={fileType="DOCUMENT"}
             casout={caslib="MYFILES" name="MYDOCS2" replication=0 replace=true} ;
quit ;

 

PROC CASUTIL can only be used when the image, document, video and audio files are in sub-directories of the main CASLIB path, because casdata= cannot be empty. Behind the scenes, PROC CASUTIL calls the loadTable CAS action.

 

If you want to import files from the main directory, use the loadTable CAS action directly instead (second example).

 

Note the “ANY” file type keyword, which imports any supported image, document, audio and video file all at once.

 

The target CAS table will contain a varBinary field to handle the image, document, sound or video content.
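
One way to check this after a load is to ask CAS for the table's column metadata. The sketch below assumes the MYDOCS table created by the PROC CASUTIL example above is still in memory in the MYFILES CASLIB.

```sas
proc cas ;
   /* List the columns of the loaded table; the file content
      should appear as a varbinary column */
   table.columnInfo / table={caslib="MYFILES", name="MYDOCS"} ;
   /* Row count and other table-level details */
   table.tableInfo / caslib="MYFILES" name="MYDOCS" ;
quit ;
```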

 

Some useful links:

  • IMAGE supported formats
  • DOCUMENT supported formats
    • Please note that when you import a CSV/TXT file through the DOCUMENT file type, the contents of each file are loaded as one record into a varBinary column of the CAS table. The file is NOT parsed in a structured way (unlike native CSV reading).

“Others”

“Others” covers DTA (Stata), JMP, SAV (SPSS), XLS and XLSX files. They require the SAS Data Connector to PC Files.
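
As an illustration, loading an XLSX workbook could look like the sketch below. The CASLIB name, file name and sheet name are hypothetical, and the example assumes the SAS Data Connector to PC Files is installed and licensed.

```sas
proc casutil incaslib="mycaslib" outcaslib="mycaslib" ;
   /* "budget.xlsx" and the sheet name "Sheet1" are hypothetical */
   load casdata="budget.xlsx" casout="budget" replace
        importoptions=(filetype="excel" sheet="Sheet1" getnames=true) ;
quit ;
```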

Endnote #1

SAS7BDAT files can be loaded in parallel if they are accessible to all CAS workers at the same location (for instance, a SAS7BDAT file on an NFS share mounted on every CAS worker). You can then use the dataTransferMode="parallel" option. The CASLIB must be a PATH CASLIB, not a DNFS CASLIB.

 

Syntax example:

 

proc casutil ;
   load casdata="big_prdsale.sas7bdat" incaslib="caspath" casout="big_prdsale" outcaslib="caspath" 
   importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;

 

For more information, see Rob Collum's articles.

Endnote #2

The CSV file type is used to identify any delimited file. For example, to read a semicolon delimited file with a .txt suffix, you can specify CSV as the file type and specify the delimiter.

 

Syntax example:

 

proc casutil ;
   load casdata="prdsale.txt" incaslib="mycaslib" casout="prdsale_txt" outcaslib="mycaslib" replace
   importoptions=(filetype="csv" delimiter=";") ;
quit ;

Endnote #3

Multifile CSV import is supported and new in SAS Viya 3.5. The multiFile parameter enables loading multiple CSV files into one in-memory table. showFullPath adds an extra column that identifies the fully qualified path to the CSV file that contributed to the row.

 

Syntax example:

 

nir_post44_multifile_csv_import-1024x453.png

 

In addition to the picture, the code is provided below:

 

caslib mycsvs type=path path="/gelcontent/demo/DM/data/multicsv" subdirs ;
proc cas ;
   table.loadTable /
      caslib="mycsvs"
      path=""
      casout={caslib="mycsvs",name="combined",replace=True}
      importOptions={
         fileType="csv",
         multiFile=true,
         showFullpath=true,
         recurse=false
      } ;
quit ;
proc casutil ;
   load casdata="subdir" incaslib="mycsvs" outcaslib="mycsvs" casout="union" 
        importOptions=(fileType="csv",multiFile=true,showFullpath=true,recurse=false) ;
quit ;

 

All the CSV files contained in the specified directory must have the same number of columns and the columns must have the same data types. The file names must end with a .csv suffix. Note that multifile CSV import is NOT available on HDFS and ADLS CASLIBs.
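
A quick way to check these conditions is to list the source files before the load and inspect the resulting table afterwards. This is a minimal sketch that reuses the mycsvs CASLIB and the combined table from the example above.

```sas
proc casutil ;
   /* Before loading: list the files visible through the CASLIB
      and confirm that they all end with .csv */
   list files incaslib="mycsvs" ;
   /* After loading: show the columns of the combined table,
      including the extra path column added by showFullPath */
   contents casdata="combined" incaslib="mycsvs" ;
quit ;
```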

 

For more information, refer to the SAS documentation.

 

Thanks for reading.

