BookmarkSubscribeRSS Feed

Writing a CSV Zip Archive to CAS

Started ‎11-28-2023 by
Modified ‎11-28-2023 by
Views 466

I am loading some data into a Viya 3.5 implementation that is being landed on the CAS controller node. The data is in the form of multiple CSV files within a zip file. Each file representing a different target table.

 

sf_1_.png

 Select any image to see a larger version.

Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

Exploring Possible Load Strategies

I was hoping I could treat the zip file as a directory and use CASL's elegant loading capabilities. However, I couldn't get CAS to treat the zip file as a directory. So I gave up on that plan.

 

Next, I next looked into using Viya 2021.1.4's enhancements around reading compressed files. However, these capabilities only to work for zip files that contain multiple files all of the same format intended to be loaded in the same target table. So again, I moved on.

 

The FILENAME ZIP Access Method

Finally, I looked at modifying an approach I saw on this blog post that uses the FILENAME ZIP access method. This approach worked. The full code I used is below.

 

cas mysession sessopts=(metrics=true);
caslib _all_ assign;

filename inzip ZIP "/mnt/..../v1.3.0_csv_snapshot_2023_10_19_11_37_58.zip";
 
/* Read the files from the ZIP file and load their names into a work.contents */
data contents(keep=memname isFolder);
 length memname $200 isFolder 8;
 fid=dopen("inzip");
 if fid=0 then
  stop;
 memcount=dnum(fid);
 do i=1 to memcount;
  memname=dread(fid,i);
  /* check for trailing / in folder name */
  isFolder = (first(reverse(trim(memname)))='/');
  output;
 end;
 rc=dclose(fid);
run;

/* Create a macro to copy a zip file member, write it to work, and load it to CAS */
%macro spin(csvfile);
/* Write the file from the to the zip file to the work directory */
filename xl "%sysfunc(getoption(work))/&csvfile..csv" ;
%put %sysfunc(getoption(work))/&csvfile..csv;

data _null_;
   infile inzip(export/&csvfile.csv) 
       lrecl=256 recfm=F length=length eof=eof unbuf;
   file   xl lrecl=256 recfm=N;
   input;
   put _infile_ $varying256. length;
   return;
 eof:
   stop;
run;

/* Load the file from work into CAS */
proc casutil; 
  load file="%sysfunc(getoption(work))/&csvfile..csv"
   outcaslib="casuser" casout="&csvfile" replace copies=0;
quit;
%mend;

/* Spin over the zip file members */
data _null_;
set contents;
text=scan(memname,2);
call execute('%spin('!!strip(text)!!');');
run;

 

As you can see, the code first examines the zip file and writes the names of all files within to a work data set named CONTENTS. Then we create a macro, SPIN, that extracts a CSV file from the zip file, writes it to the SAS WORK directory, and then loads it to CAS. Finally we use the CONTENTS data set to drive the macro with CALL EXECUTE so each file within the zip is processed one at a time by the SPIN macro.

 

Next Steps

In our next post, we'll look at building a scheduling process for this code so that it can handle new files that appear.

 

Find more articles from SAS Global Enablement and Learning here.

Comments
Tom

Is there a reason you had to copy the file out of the ZIP file?

Why not use the ZIP engine of the FILENAME statement to read the file?

Does code like this not work?

 

proc casutil; 
  load file=inzip("export/&csvfile..csv")
   outcaslib="casuser" casout="&csvfile" replace copies=0;
quit;

 

 

That would certainly be better.  Less code and less I/O.  I'll definitely give it a try.  Thanks!

Tom

Double check the name it generates for the first column.  The ZIP engine has a nasty habit of including the BOM (Byte Order Mark) as part of the data.

The following tested code is along the line of what @Tom suggests.

The file option only allows for a file reference. I believe what's happening here is that the file will get extracted on the Compute side and then streamed to CAS with CAS then reading the .csv

Patrick_0-1701226521725.png

From how I understand things if one want to execute the unzipping on the CAS side then the .zip file needs first to get loaded into CAS memory of the active CAS library and then one needs to use the zip engine on the CAS side. 
Here an example that uses table action table.loadTable under Proc CAS: Load CSV Files in a ZIP Archive from a Caslib

 

Using Proc CAS has imho the advantage that we can use CASL. 
I'll play a bit more... like load the metadata (member names) of the zip file into a casl array variable and then loop over this variable calling the loadTable action once per member. This should allow to then extract the files into different CAS tables.

 

For a real implementation like some ETL process: Right now I'd still go for a "traditional" data step for reading a member from a zip file that also allows for error and exception handling, write the result to a parquet table and the load this parquet table in-parallel into CAS. I don't expect that such an approach would take much longer but it would certainly give more control reading the external file and also consume less CAS memory (for the .zip file).

 

Version history
Last update:
‎11-28-2023 03:28 PM
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags