In the recent stable releases of SAS Viya, a new feature went unnoticed, at least to me. 😉
Starting with 2021.1.4, CAS can ingest compressed CSV files. Let’s introduce this new capability.
Generally speaking, we always mention CSV when reading delimited text files in CAS, but in reality, it could be any delimited text file with any extension and with any delimiter. Just use "CSV" as the filetype parameter and specify the non-default delimiter in the delimiter option.
For compressed files, it is still valid. However, there is one exception for zip: the files included in the archive must have the csv extension even if they don't contain comma-separated values. See the requirements later in this article.
Two archive types are supported by CAS:
If you want some details about their advantages and differences, there are numerous articles on the web.
From a user perspective, here are a couple things to remember:
Starting with 2021.1.4, compressed delimited files are supported in the PATH and DNFS CASLIBs. Then, in 2021.1.6, this has been added to the S3 CASLIB. The latest LTS 2021.2 has all of them.
Here are a few requirements/observations compiled from the documentation and some early testing:
1 – reading a gzipped CSV file (single file)
Archive structure:
$ gunzip -l 1_prdsale_single.csv.gz
compressed uncompressed ratio uncompressed_name
1109 6331 83.0% 1_prdsale_single.csv
$
SAS code:
proc casutil incaslib="csvs" outcaslib="csvs" ;
load casdata="1_prdsale_single.csv.gz" casout="1_prdsale_single" replace
importOptions=(fileType="csv",archiveType='gz',getnames="true") ;
quit ;
2 – reading a zip archive with multiple CSV files
Archive structure:
$ unzip -v 4_prdsale_multiple.csv.zip
Archive: 4_prdsale_multiple.csv.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
6331 Defl:N 1074 83% 11-17-2021 15:23 d3570af4 prdsale100_1.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 prdsale100_2.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 prdsale100_3.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 prdsale100_4.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 prdsale100_5.csv
-------- ------- --- -------
31655 5370 83% 5 files
$
SAS code (all 5 files will be loaded into one single CAS table):
proc casutil incaslib="csvs" outcaslib="csvs" ;
load casdata="4_prdsale_multiple.csv.zip" casout="4_prdsale_multiple" replace
importOptions=(fileType="csv",archiveType='zip',getnames="true") ;
quit ;
3 – reading a zip archive with multiple CSV files across multiple folders
Archive structure:
$ unzip -v 5_prdsale_multiple_folder.csv.zip
Archive: 5_prdsale_multiple_folder.csv.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Stored 0 0% 11-17-2021 15:32 00000000 data/
0 Stored 0 0% 11-17-2021 15:26 00000000 data/subdir/
6331 Defl:N 1074 83% 11-17-2021 15:26 d3570af4 data/subdir/prdsale100_1.csv
6331 Defl:N 1074 83% 11-17-2021 15:26 d3570af4 data/subdir/prdsale100_2.csv
6331 Defl:N 1074 83% 11-17-2021 15:26 d3570af4 data/subdir/prdsale100_3.csv
6331 Defl:N 1074 83% 11-17-2021 15:23 d3570af4 data/prdsale100_1.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 data/prdsale100_2.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 data/prdsale100_3.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 data/prdsale100_4.csv
6331 Defl:N 1074 83% 11-17-2021 15:25 d3570af4 data/prdsale100_5.csv
-------- ------- --- -------
50648 8592 83% 10 files
$
SAS code (same code as before, all 8 files will be loaded into one single CAS table):
proc casutil incaslib="csvs" outcaslib="csvs" ;
load casdata="5_prdsale_multiple_folder.csv.zip" casout="5_prdsale_multiple_folder" replace
importOptions=(fileType="csv",archiveType='zip',getnames="true") ;
quit ;
4 – reading a gzipped semi-colon delimited file (single file, txt extension)
Archive structure:
$ gunzip -l 6_prdsale_single.dlm.txt.gz
compressed uncompressed ratio uncompressed_name
1113 6331 83.0% 6_prdsale_single.dlm.txt
$
SAS code (note the delimiter option):
proc casutil incaslib="csvs" outcaslib="csvs" ;
load casdata="6_prdsale_single.dlm.txt.gz" casout="6_prdsale_single" replace
importOptions=(fileType="csv",archiveType='gz',getnames="true",delimiter=";") ;
quit ;
5 – reading a zip archive with multiple semi-colon delimited files (they must have the csv extension though)
Archive structure:
$ unzip -v 8_prdsale_multiple.dlm.csv.zip
Archive: 8_prdsale_multiple.dlm.csv.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
6331 Defl:N 1074 83% 11-18-2021 13:59 955f2518 prdsale100_dlm_1.csv
6331 Defl:N 1074 83% 11-18-2021 14:09 955f2518 prdsale100_dlm_2.csv
-------- ------- --- -------
12662 2148 83% 2 files
$
SAS code (both files will be loaded into one single CAS table, note the delimiter option):
proc casutil incaslib="csvs" outcaslib="csvs" ;
load casdata="8_prdsale_multiple.dlm.csv.zip" casout="8_prdsale_multiple" replace
importOptions=(fileType="csv",archiveType='zip',getnames="true",delimiter=";") ;
quit ;
You can find some documentation here. This only mentions S3 but is valid for PATH and DNFS too. Thanks for reading.
Find more articles from SAS Global Enablement and Learning here.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.