SAS Viya 3.5 greatly extends the set of file types that can be loaded into CAS. However, the list of supported file types differs from platform to platform (what we call “Platform Data Sources” in a CAS context). So, in this post I’ll try to summarize what CAS can read and from which platform.
What would be better than a table to show this? Here it is.
Let me explain it in a bit more detail.
Basically, we have the supported file types in rows, and platform CASLIBs in columns. Cells can be filled with “S” green circles (serial load is supported), “P” dark green circles (parallel load is supported) or can be empty (no support at all).
Orange triangles show new capabilities. Endnotes provide details on specific cases.
ADLS stands for Azure Data Lake Storage. Essentially, this is the Microsoft Azure object storage equivalent of Amazon S3. CAS requires an Azure Data Lake Storage Gen 2 directory. As indicated, only a few file types are supported: CSV and ORC.
Among the big new capabilities of Viya 3.5 is the ability to read Apache Parquet and ORC columnar file formats. Parquet is readable serially from PATH, and in parallel from DNFS and S3. ORC files are available on PATH and Azure Data Lake Gen 2 platforms in serial mode.
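To make the Parquet and ORC case concrete, here is a minimal sketch of a serial load from a PATH CASLIB. The CASLIB name mypath, the file names sales.parquet and sales.orc, and the output table names are hypothetical placeholders. I am also assuming the "parquet" and "orc" values of the filetype import option, so check them against the documentation for your release.

```sas
/* Hypothetical names: caslib "mypath" and both source files are placeholders */
proc casutil incaslib="mypath" outcaslib="mypath" ;
   /* Serial load of a Parquet file from a PATH caslib */
   load casdata="sales.parquet" importoptions=(filetype="parquet") casout="sales_pq" replace ;
   /* Same pattern for an ORC file, with filetype="orc" */
   load casdata="sales.orc" importoptions=(filetype="orc") casout="sales_orc" replace ;
quit ;
```

The only thing that changes between the two loads is the filetype value; the rest follows the same LOAD pattern used for the other file types in this post.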
Loading images, documents, audio and video files in CAS is not really new in SAS Viya 3.5. What is new is that:
Syntax example:
In addition to the picture, the code is provided below:
/* Define a PATH caslib; SUBDIRS makes its sub-directories accessible */
caslib MYFILES type=path path="/gelcontent/demo/DM/data" subdirs libref=MYFILES ;
proc casutil incaslib="MYFILES" outcaslib="MYFILES" ;
   /* Each casdata= value is a sub-directory of the caslib path */
   load casdata="import_document" importOptions=(filetype="DOCUMENT") casout="MYDOCS" ;
   load casdata="import_images" importOptions=(filetype="IMAGE") casout="MYIMGS" ;
   load casdata="import_audio" importOptions=(filetype="AUDIO") casout="MYAUDIOS" ;
   load casdata="import_video" importOptions=(filetype="VIDEO") casout="MYVIDEOS" ;
   /* ANY loads every supported document, image, audio and video file at once */
   load casdata="import_any" importOptions=(filetype="ANY") casout="COMBINED" ;
quit ;
proc cas ;
   /* path="" targets the main directory of the caslib */
   table.loadTable /
      path=""
      importOptions={fileType="DOCUMENT"}
      caslib="MYFILES"
      casout={caslib="MYFILES" name="MYDOCS2" replication=0 replace=true} ;
quit ;
PROC CASUTIL can only be used when you have images, documents, video and audio files in sub-directories of the main CASLIB path (casdata= cannot be null). loadTable is the CAS action used behind the scenes.
If you want to import files from the main directory, you should use the loadTable CAS action directly instead (second example).
Notice the “ANY” filetype keyword to import any supported image, document, audio and video file all at once.
The target CAS table will contain a varBinary field to handle the image, document, sound or video content.
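If you want to see what the loaded table looks like, one option is to ask CAS for its column metadata. This is a minimal sketch, assuming the MYDOCS table from the first example has already been loaded into the MYFILES CASLIB. The exact column names are not listed in this post, so inspect the output rather than assuming them.

```sas
proc cas ;
   /* Show the name and type of each column, including the varBinary content column */
   table.columnInfo / table={caslib="MYFILES", name="MYDOCS"} ;
quit ;
```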
Some useful links:
“Others” represent DTA (Stata), JMP, SAV (SPSS), XLS, XLSX files. They require the SAS Data Connector to PC Files.
SAS7BDAT files can be loaded in parallel if they are accessible to all CAS workers at the same location (for instance, a SAS7BDAT file on an NFS share mounted on every CAS worker). You can then use the dataTransferMode="parallel" option. The CASLIB must be a PATH CASLIB, not a DNFS CASLIB.
Syntax example:
proc casutil ;
   load casdata="big_prdsale.sas7bdat" incaslib="caspath" casout="big_prdsale" outcaslib="caspath"
      importoptions=(filetype="basesas" dataTransferMode="parallel") ;
quit ;
For more information, see Rob Collum's articles:
The CSV file type is used to identify any delimited file. For example, to read a semicolon delimited file with a .txt suffix, you can specify CSV as the file type and specify the delimiter.
Syntax example:
proc casutil ;
   load casdata="prdsale.txt" incaslib="mycaslib" casout="prdsale_txt" outcaslib="mycaslib" replace
      importoptions=(filetype="csv" delimiter=";") ;
quit ;
Multifile CSV import is supported and new in SAS Viya 3.5. The multiFile parameter enables loading multiple CSV files into one in-memory table. showFullPath adds an extra column that identifies the fully qualified path to the CSV file that contributed to the row.
Syntax example:
In addition to the picture, the code is provided below:
caslib mycsvs type=path path="/gelcontent/demo/DM/data/multicsv" subdirs ;
proc cas ;
   table.loadTable /
      caslib="mycsvs"
      path=""
      casout={caslib="mycsvs",name="combined",replace=True}
      importOptions={
         fileType="csv",
         multiFile=true,
         showFullpath=true,
         recurse=false
      } ;
quit ;
proc casutil ;
   load casdata="subdir" incaslib="mycsvs" outcaslib="mycsvs" casout="union"
      importOptions=(fileType="csv",multiFile=true,showFullpath=true,recurse=false) ;
quit ;
All the CSV files contained in the specified directory must have the same number of columns and the columns must have the same data types. The file names must end with a .csv suffix. Note that multifile CSV import is NOT available on HDFS and ADLS CASLIBs.
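Since the multifile load depends on what sits in the directory, it can help to check which files CAS sees before loading, and which tables exist afterwards. This is a minimal sketch using the mycsvs CASLIB from the example above. It only lists files and tables; it does not check that the column counts or data types match.

```sas
proc casutil ;
   /* List the source files visible in the caslib before the multifile load */
   list files incaslib="mycsvs" ;
   /* After the load, list in-memory tables to confirm the target table exists */
   list tables incaslib="mycsvs" ;
quit ;
```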
For more information, refer to the SAS documentation.
Thanks for reading.