
How to split a large file into smaller chunks using SAS

Started: 01-26-2024
Modified: 01-26-2024

In this article I share a SAS coding technique to split any file into several "chunks". Note that this is not about splitting large data sets into smaller data sets with fewer records -- there are well-established techniques for doing that. Instead, this approach is meant to help when you have a single file that is too large to manage in one bite (or "byte"?) when transferring or storing, so you need it to be in smaller pieces to accomplish the operation. My main use case: upload a large file using an API. If a file is too big to send all at once, we need to break it into pieces so that we can send them in sequence, and then the API service can reassemble the file on the other side. The Box.com API and Microsoft Graph API are two examples of services that require/support this piecemeal file upload for large files.

 

If your goal is to upload a large file from SAS to Microsoft Teams or SharePoint or OneDrive, you can use this technique as implemented in this GitHub repository: https://github.com/sascommunities/sas-microsoft-graph-api

 

How the %splitFile macro works

The %splitFile macro is a simple routine that lets you specify a single file to split, where you want the pieces stored, and the maximum size for each piece. The macro also produces an output data set with one record per chunk, including the byte range of content that chunk contains. You can use this information directly with the Box.com and Microsoft Graph APIs.

 

Sample usage:

%splitFile(
 sourceFile=/home/my-user-id/STAT1/data/spending2011.sas7bdat,
 maxsize=%sysevalf(1024*60),
 chunkLoc=/home/my-user-id/splitchunks,
 metadataOut=work._metaout
 );

In this example, we're splitting a sas7bdat file into 60KB chunks and storing those chunks in a folder named splitchunks. The _METAOUT data set summarizes the output and looks something like this:

 

[Screenshot: the _METAOUT data set, with columns for the original file, original size, chunk path, chunk size, and byte range]

The maxsize= argument is optional; it defaults to 320KB (327,680 bytes). Also, chunkLoc= will default to your WORK location unless you specify a different path.
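To make the metadata concrete, here is the byte-range arithmetic sketched in Python (not part of the original article; the function name is illustrative). It mirrors how the macro computes the number of chunks with ceil(filesize/maxSize) and builds each byterange value as "bytes start-end/total", where end is an inclusive byte offset:

```python
import math

def chunk_ranges(file_size: int, max_size: int):
    """Return (start, end, range_string) for each chunk, mirroring
    the byterange values that %splitFile writes to its metadata set."""
    num_chunks = math.ceil(file_size / max_size)
    ranges = []
    for i in range(num_chunks):
        start = i * max_size
        end = min(start + max_size, file_size) - 1  # inclusive last byte
        ranges.append((start, end, f"bytes {start}-{end}/{file_size}"))
    return ranges

# A 150,000-byte file split into 60KB (61,440-byte) chunks:
for start, end, label in chunk_ranges(150_000, 61_440):
    print(label)
```

This prints "bytes 0-61439/150000", "bytes 61440-122879/150000", and "bytes 122880-149999/150000" -- the same strings you would pass along in each upload request.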

 

The algorithm for creating the file pieces is straightforward. The DATA step code uses file functions like FREAD and FGET to stream the source file into a buffer, then uses FPUT and FWRITE to write that content to a series of output files, starting a new file when the current file reaches the target chunk size. The method relies on block I/O operations that work much like their analogues in other programming languages, such as C (fread, fgetc, fputc, fwrite).

 

Putting the file back together

I did not have a need for SAS code to reassemble the file from the pieces that %splitFile creates. In my use case, the APIs I'm using perform that step within their service. However, I did test "reassembly" using system tools on Windows and Linux. I verified that in each case the output was a binary clone of the original file, although with updated file attributes such as the date/time stamp.

 

On Windows using PowerShell, you can use the Get-Content (or gc) command to read the content of the file chunks and redirect it to a new destination file using Set-Content (or sc). Note that -Encoding Byte applies to Windows PowerShell 5.x; in PowerShell 7 and later, use -AsByteStream instead.

gc .\chunk_0001.dat,.\chunk_0002.dat,.\chunk_0003.dat,.\chunk_0004.dat -Encoding Byte | sc new.sas7bdat -Encoding Byte

 

On Linux, you can use the cat command to concatenate the file pieces into a new larger file:

cat ./chunk_0001.dat ./chunk_0002.dat ./chunk_0003.dat ./chunk_0004.dat > new.sas7bdat

 

Credits

I cribbed some of the file writing techniques from a %binaryCopyFile macro that my colleague @BrunoMueller created several years ago. Thanks Bruno!

 

Complete code for %splitFile and supporting macros 

I've included the complete code below; you can also find it on GitHub here.

/* Reliable way to check whether a macro value is empty/blank */
%macro isBlank(param);
  %sysevalf(%superq(param)=,boolean)
%mend;

/* We need this function for large file uploads, to telegraph */
/* the file size in the API.                                   */
/* Get the file size of a local file in bytes.                */
%macro getFileSize(localFile=);
  %local rc fid fidc _lfile;
  %local File_Size;
  %let rc=%sysfunc(filename(_lfile,&localFile));
  %let fid=%sysfunc(fopen(&_lfile));
  %let File_Size=%sysfunc(finfo(&fid,File Size (bytes)));
  %let fidc=%sysfunc(fclose(&fid));
  %let rc=%sysfunc(filename(_lfile));
  %sysevalf(&File_Size.)
%mend;

%macro splitFile(sourceFile=,
 maxSize=327680,
 metadataOut=,
 /* optional, will default to WORK */
 chunkLoc=);

  %local filesize maxSize numChunks buffsize ;
  %let buffsize = %sysfunc(min(&maxSize,4096));
  %let filesize = %getFileSize(localFile=&sourceFile.);
  %let numChunks = %sysfunc(ceil(%sysevalf( &filesize / &maxSize. )));
  %put NOTE: Splitting &sourceFile. into &numChunks parts;

  %if %isBlank(&chunkLoc.) %then %do;
    %let chunkLoc = %sysfunc(getoption(WORK));
  %end;

  /* This DATA step will do the chunking.                                 */
  /* It's going to read the original file in segments sized to the buffer */
  /* It's going to write that content to new files up to the max size     */
  /* of a "chunk", then it will move on to a new file in the sequence     */
  /* All resulting files should be the size we specified for chunks       */
  /* except for the last one, which will be a remnant                     */
  /* Along the way it will build a data set with the metadata for these   */
  /* chunked files, including the file location and byte range info       */
  /* that will be useful for APIs that need that later on                 */
  data &metadataOut.(keep=original originalsize chunkpath chunksize byterange);
    length 
      filein 8 fileid 8 chunkno 8 currsize 8 buffIn 8 rec $ &buffsize fmtLength 8 outfmt $ 12
      bytescumulative 8
      /* These are the fields we'll store in output data set */
      original $ 250 originalsize 8 chunkpath $ 500 chunksize 8 byterange $ 50;
    original = "&sourceFile";
    originalsize = &filesize.;
    rc = filename('in',"&sourceFile.");
    filein = fopen('in','S',&buffsize.,'B');
    bytescumulative = 0;
    do chunkno = 1 to &numChunks.;
      currsize = 0;
      chunkpath = catt("&chunkLoc./chunk_",put(chunkno,z4.),".dat");
      rc = filename('out',chunkpath);
      fileid = fopen('out','O',&buffsize.,'B');
      do while ( fread(filein)=0 ) ;
        call missing(outfmt, rec);
        rc = fget(filein,rec, &buffsize.);
        buffIn = fcol(filein);
        if (buffIn - &buffsize) = 1 then do;
          currsize + &buffsize;
          fmtLength = &buffsize.;
        end;
        else do;
          currsize + (buffIn-1);
          fmtLength = (buffIn-1);
        end;
        /* write only the bytes we read, no padding */
        outfmt = cats("$char", fmtLength, ".");
        rcPut = fput(fileid, putc(rec, outfmt));
        rcWrite = fwrite(fileid);      
        if (currsize >= &maxSize.) then leave;
      end;
      chunksize = currsize;
      bytescumulative + chunksize;
      byterange = cat("bytes ",bytescumulative-chunksize,"-",bytescumulative-1,"/",originalsize);
      output;
      rc = fclose(fileid);
    end;
    rc = fclose(filein);
  run;
%mend;
