Importing CSV Files with All Character Data

When using PROC IMPORT to read CSV files, the SCANTYPE=MIXED option is not available. This means that we do not have an option to force variables to be character. The GUESSINGROWS option can be used, but it can be costly for large files, or potentially not sufficient. Although it does require two passes of the data, the following technique can be used to force all columns to be character when using PROC IMPORT on a CSV file.

The technique works by inserting an asterisk ( * ) in each data column, before the file is seen by PROC IMPORT. A potential side benefit is that this allows us to use a smaller value for the GUESSINGROWS option, however remember that the variable length is determined during the scanning process and also depends on GUESSINGROWS.

First Step

Insert a row of asterisks, one for each incoming column, into the data steam. This requires a first pass of the data.

We read the first row which contains the names of the variables. These are counted and a series of asterisks are written on the first data line following the variable names. The asterisk forces PROC IMPORT to see each column as a character variable.

   * HOLDIT is a temporary location;
   filename holdit temp  lrecl=32000;
 
   * Point to the csv file;
   filename rawcsv "YOUR PATH.csv"  lrecl=32000;
 
   data _null_;
      file holdit;
      infile rawcsv end=done;
 
      * Read the first observation (variable names);
      input;
 
      * Write the var names;
      put _infile_;
      * Count the variables (the names are comma separated);
      wcount = countw(_infile_,',');
      * Write an asterisk for each variable - forces each to be character;
      * but will not set the var length;
      * string = catt('*',repeat(',*',wcount-2));
      put string;
 
      * Read and write the data portion of the raw data;
      do until(done);
         input;
         **********
         *  Pre processing of the data could be done here
         **********;
         * Write the record;
         put _infile_;
      end;
      stop;
      run;

Reading the Data

PROC IMPORT is then used to import the modified CSV file into a SAS data set. The GUESSINGROWS option can be set to a small number, and the DATAROW=2 is used to make sure that the row of asterisks is included in the read, but not the variable names.

   * Read the Altered CSV file into a SAS dataset;
   PROC IMPORT OUT= fromcsv 
               DATAFILE= holdit 
               DBMS=CSV 
               REPLACE;
      guessingrows=4;
      GETNAMES=YES;
      DATAROW=2; 
      run;
   * Clear the filerefs for the next data set;
   filename holdit;
   filename rawcsv;

Removing the Asterisk

The first time this new data set is used the asterisks can be removed using the FIRSTOBS option.

   data noasterisk;
      set fromcsv(firstobs=2);
      * Doing other stuff here.;
      run;

Originally posted by Art Carpenter on sasCommunity.org.

Tom · ‎07-05-2018

Interesting technique. Here is an easier way to write the lines of stars. It also includes using the M and Q modifiers to the COUNTW() function so that the proper number of fields are generated when the header line is complex.

data _null_;
   file holdit;
   infile rawcsv ;
   input ;
   put _infile_ ;
   if _n_=1 then do;
     do i=1 to countw(_infile_,',','mq');
       if i>1 then put ',' @;
       put '*' @ ;
     end;
     put;
   end;
run;

Christine_Lai · ‎08-27-2018

I have specific 4 fields: CustID, Date, Order item as text then a long comment text field. They line up properly most of the times but here and there either the long order field or the extreme long comment field jump to another row.

How can I tackle it?