BookmarkSubscribeRSS Feed

Read Compressed CSV Files in CAS

Started ‎11-30-2021 by
Modified ‎11-30-2021 by
Views 3,537

In the recent stable releases of SAS Viya, a new feature went unnoticed, at least to me. 😉

 

Starting with 2021.1.4, CAS can ingest compressed CSV files. Let’s introduce this new capability.  

 

Which file formats are supported?

Generally speaking, we always mention CSV when reading delimited text files in CAS, but in reality, it could be any delimited text file with any extension and with any delimiter. Just use "CSV" as the filetype parameter and specify the non-default delimiter in the delimiter option.

 

For compressed files, it is still valid. However, there is one exception for zip: the files included in the archive must have the csv extension even if they don't contain comma-separated values. See the requirements later in this article.

 

 

Which archive types are supported?

Two archive types are supported by CAS:

  • ZIP (common extension is zip)
  • GZIP (common extension is gz)

 

If you want some details about their advantages and differences, there are numerous articles on the web.

 

From a user perspective, here are a couple things to remember:

  • ZIP can compress AND archive multiple files (including sub-folders) and outputs the result in a separate archive file
  • GZIP usually only compresses one data file (one can still create an archive, for example with tar, and gzip it, but that is not supported by CAS) in place (it doesn’t keep a copy of the original file)

 

Which CASLIB types are supported?

Starting with 2021.1.4, compressed delimited files are supported in the PATH and DNFS CASLIBs. Then, in 2021.1.6, this has been added to the S3 CASLIB. The latest LTS 2021.2 has all of them.  

 

Requirements

Here are a few requirements/observations compiled from the documentation and some early testing:

 

  • GZIP
    • the documentation mentions that the file you are reading must be the only delimited file in the GZIP file
      • generally, gzipped files contain only one single file
    • a gzipped tar archive is not supported
    • reading gzipped files with non-csv extensions (for example "txt") and non-comma separated values (for example a semi-colon) works
  • ZIP
    • the data files in a zip archive must have the "csv" extension even if they don't contain comma-separated values
      • reading non-comma separated values (for example a semi-colon) is supported through the delimiter option
    • all csv files contained in a zip archive are read, including files in the archive's sub-folders
    • all CSV files included in one archive must have the same structure

 

Examples

1 – reading a gzipped CSV file (single file)

 

Archive structure:

 

$ gunzip -l 1_prdsale_single.csv.gz
         compressed        uncompressed  ratio uncompressed_name
               1109                6331  83.0% 1_prdsale_single.csv
$

 

SAS code:

 

proc casutil incaslib="csvs" outcaslib="csvs" ;
   load casdata="1_prdsale_single.csv.gz" casout="1_prdsale_single" replace
        importOptions=(fileType="csv",archiveType='gz',getnames="true") ;
quit ;

 

2 – reading a zip archive with multiple CSV files

 

Archive structure:

 

$ unzip -v 4_prdsale_multiple.csv.zip
Archive:  4_prdsale_multiple.csv.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    6331  Defl:N     1074  83% 11-17-2021 15:23 d3570af4  prdsale100_1.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  prdsale100_2.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  prdsale100_3.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  prdsale100_4.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  prdsale100_5.csv
--------          -------  ---                            -------
   31655             5370  83%                            5 files
$

 

SAS code (all 5 files will be loaded into one single CAS table):

 

proc casutil incaslib="csvs" outcaslib="csvs" ;
   load casdata="4_prdsale_multiple.csv.zip" casout="4_prdsale_multiple" replace
        importOptions=(fileType="csv",archiveType='zip',getnames="true") ;
quit ;

 

3 – reading a zip archive with multiple CSV files across multiple folders

 

Archive structure:

 

$ unzip -v 5_prdsale_multiple_folder.csv.zip
Archive:  5_prdsale_multiple_folder.csv.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
       0  Stored        0   0% 11-17-2021 15:32 00000000  data/
       0  Stored        0   0% 11-17-2021 15:26 00000000  data/subdir/
    6331  Defl:N     1074  83% 11-17-2021 15:26 d3570af4  data/subdir/prdsale100_1.csv
    6331  Defl:N     1074  83% 11-17-2021 15:26 d3570af4  data/subdir/prdsale100_2.csv
    6331  Defl:N     1074  83% 11-17-2021 15:26 d3570af4  data/subdir/prdsale100_3.csv
    6331  Defl:N     1074  83% 11-17-2021 15:23 d3570af4  data/prdsale100_1.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  data/prdsale100_2.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  data/prdsale100_3.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  data/prdsale100_4.csv
    6331  Defl:N     1074  83% 11-17-2021 15:25 d3570af4  data/prdsale100_5.csv
--------          -------  ---                            -------
   50648             8592  83%                            10 files
$

 

SAS code (same code as before, all 8 files will be loaded into one single CAS table):

 

proc casutil incaslib="csvs" outcaslib="csvs" ;
   load casdata="5_prdsale_multiple_folder.csv.zip" casout="5_prdsale_multiple_folder" replace
        importOptions=(fileType="csv",archiveType='zip',getnames="true") ;
quit ;

 

4 – reading a gzipped semi-colon delimited file (single file, txt extension)

 

Archive structure:

 

$ gunzip -l 6_prdsale_single.dlm.txt.gz
         compressed        uncompressed  ratio uncompressed_name
               1113                6331  83.0% 6_prdsale_single.dlm.txt
$

 

SAS code (note the delimiter option):

 

proc casutil incaslib="csvs" outcaslib="csvs" ;
   load casdata="6_prdsale_single.dlm.txt.gz" casout="6_prdsale_single" replace
        importOptions=(fileType="csv",archiveType='gz',getnames="true",delimiter=";") ;
quit ;

 

5 – reading a zip archive with multiple semi-colon delimited files (they must have the csv extension though)

 

Archive structure:

 

$ unzip -v 8_prdsale_multiple.dlm.csv.zip
Archive:  8_prdsale_multiple.dlm.csv.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    6331  Defl:N     1074  83% 11-18-2021 13:59 955f2518  prdsale100_dlm_1.csv
    6331  Defl:N     1074  83% 11-18-2021 14:09 955f2518  prdsale100_dlm_2.csv
--------          -------  ---                            -------
   12662             2148  83%                            2 files
$

 

SAS code (both files will be loaded into one single CAS table, note the delimiter option):

 

proc casutil incaslib="csvs" outcaslib="csvs" ;
   load casdata="8_prdsale_multiple.dlm.csv.zip" casout="8_prdsale_multiple" replace
        importOptions=(fileType="csv",archiveType='zip',getnames="true",delimiter=";") ;
quit ;

 

You can find some documentation here. This only mentions S3 but is valid for PATH and DNFS too. Thanks for reading.

 

Find more articles from SAS Global Enablement and Learning here.

Version history
Last update:
‎11-30-2021 03:32 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags