I have a plethora of CSV files to read, each with a different starting row due to some comments at the beginning. For example:
Comment 1
Comment 2
,X,Y,Z
1988,0.00,0.00,0.00
1989,0.00,0.00,0.00
1990,0.00,0.00,0.00
1991,0.00,0.00,0.00
In another file, for example,
Comment 3
Comment 4
Comment 5
Comment 6
Comment 7
Comment 8
,P,Q,U,V,W
2001,0.00,0.00,0.00,0.00,0.00
2002,0.00,0.00,0.00,0.00,0.00
2003,0.00,0.00,0.00,0.00,0.00
2004,0.00,0.00,0.00,0.00,0.00
2005,0.00,0.00,0.00,0.00,0.00
So a %MACRO with a fixed FIRSTOBS is impossible. One common rule I found is that the data part of every CSV file starts with a comma at the beginning (e.g., ",X" in the first and ",P" in the second). Would there be any way to trim those starting comments from the CSV files (probably with some INFILE, FILE, INPUT, and PUT combination) until SAS recognizes the line that starts with a comma? After that, PROC IMPORT would be able to read the rest properly. Thanks.
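To pin the rule down precisely: skip leading lines until the first line whose first character is a comma, and keep everything from there on. A minimal sketch of that logic (in Python, purely for illustration; the file contents are the example above):

```python
def trim_leading_comments(lines):
    """Drop leading comment lines; keep everything from the first
    line that starts with a comma (the header row) onward."""
    for i, line in enumerate(lines):
        if line.startswith(","):
            return lines[i:]
    return []  # no header row found

raw = ["Comment 1", "Comment 2", ",X,Y,Z", "1988,0.00,0.00,0.00"]
print(trim_leading_comments(raw))  # [',X,Y,Z', '1988,0.00,0.00,0.00']
```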
Do you have the list of the filenames? Or are they all in the same directory? Or follow some simple naming pattern?
Let's assume you have a list of the filenames. If not, you can easily make one.
data filenames;
  infile "ls /mydir/*.csv" pipe truncover;
  input fname $255.;
run;
Use the list to read the top of each file. Perhaps you want to save the comments?
data
  comments (keep=filename comment)
  firstrow (keep=filename row)
;
  set filenames;
  filename = fname;
  infile csv filevar=fname truncover end=eof;
  do row=1 by 1 while (not eof);
    input comment $200.;
    if char(_infile_,1)=',' then leave;
    output comments;
  end;
  row = row + 1;  * first data row is one past the header row ;
  output firstrow;
run;
Then you can use the list of first row numbers to generate data step (or PROC IMPORT) code.
filename code temp;
data _null_;
  set firstrow;
  file code;
  fileno+1;
  put 'proc import datafile=' filename :$quote.
    / ' out=csv' fileno 'replace'
    / ' dbms=csv'
    / ';'
    / ' datarow=' row ';'
    / 'run;'
  ;
run;
%include code / source2;
How automatic does it need to be? I typically do NOT use PROC IMPORT to read CSV files. It has to guess what is in the file. You can guess much better than PROC IMPORT can, especially if your file only has numeric variables, like in your examples.
For your examples I would just look at the file using any text editor (the SAS program editor, for example) or a simple DATA _NULL_ step and figure out the number of columns and the first data row.
data one;
  infile 'one.txt' firstobs=5 dsd truncover;
  input year x y z;
run;
If you have a bunch of files with the same 4 columns, but a varying number of comment lines, you could make the step smart enough to skip the header on its own. Perhaps by looking for the header row?
data one;
  infile 'one.txt' firstobs=1 dsd truncover;
  if _n_=1 then do while (1=1);
    input;
    if _infile_=',X,Y,Z' then leave;
  end;
  input year x y z;
run;
Primarily, there are more than 200 CSVs with different variables (and different numbers of columns) as well as different comments. I would specify all the FIRSTOBS and INPUT statements manually, as you said, if there were only about 30 CSVs.
Very much appreciated, but there is one issue: even with DATAROW, PROC IMPORT uses the very first row to recognize the variable names, so the variable name information doesn't survive this way. Is there still a way to have SAS physically access and remove the comment rows until it reaches the row starting with a comma?
@Junyong wrote:
Very much appreciated, but there is one issue: even with DATAROW, PROC IMPORT uses the very first row to recognize the variable names, so the variable name information doesn't survive this way. Is there still a way to have SAS physically access and remove the comment rows until it reaches the row starting with a comma?
Yet another reason not to use PROC IMPORT. It should be a trivial modification to the data step I posted to add another loop that reads the rest of the file and writes it to a new file.
data filenames;
  infile "ls /mydir/*.csv" pipe truncover;
  input fname $255.;
run;

data _null_;
  set filenames;
  outfile = cats(fname,'.fixed');
  infile in filevar=fname end=eof;
  file out filevar=outfile;
  do while (not eof);
    input;
    if char(_infile_,1)=',' then found=1;
    if found then put _infile_;
  end;
run;
If your values are all numbers, like in your example, then just generate data step code instead. Let's make your two example files:
%let path=c:\downloads\;
filename txt1 "&path.file1.csv";
filename txt2 "&path.file2.csv";
options parmcards=txt1;
parmcards4;
Comment 1
Comment 2
,X,Y,Z
1988,0.00,0.00,0.00
1989,0.00,0.00,0.00
1990,0.00,0.00,0.00
1991,0.00,0.00,0.00
;;;;
options parmcards=txt2;
parmcards4;
Comment 3
Comment 4
Comment 5
Comment 6
Comment 7
Comment 8
,P,Q,U,V,W
2001,0.00,0.00,0.00,0.00,0.00
2002,0.00,0.00,0.00,0.00,0.00
2003,0.00,0.00,0.00,0.00,0.00
2004,0.00,0.00,0.00,0.00,0.00
2005,0.00,0.00,0.00,0.00,0.00
;;;;
And then analyze them:
data comments
     files(rename=(row=firstobs comment=varlist))
;
  fileno+1;
  length memname $32;
  memname = cats('CSV',fileno);
  input fname $80.;
  filename = fname;
  infile csv filevar=fname end=eof truncover;
  do row=1 by 1 while (not eof);
    input comment $100.;
    if char(comment,1)=',' then leave;
    output comments;
  end;
  row = row + 1;
  comment = translate('year'||comment,' ',',');
  output files;
cards;
c:\downloads\file1.csv
c:\downloads\file2.csv
;
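The line `comment=translate('year'||comment,' ',',')` turns the header row into an INPUT-ready variable list: it prefixes a name for the unnamed first column and swaps commas for spaces. A rough equivalent of that transformation, sketched in Python purely for illustration:

```python
def header_to_varlist(header_row: str, first_col: str = "year") -> str:
    """Turn a CSV header like ',X,Y,Z' into a space-separated
    variable list like 'year X Y Z' (mirrors the SAS translate trick)."""
    return (first_col + header_row).replace(",", " ")

print(header_to_varlist(",X,Y,Z"))      # year X Y Z
print(header_to_varlist(",P,Q,U,V,W"))  # year P Q U V W
```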
And then use the generated data to write code to read them:
filename code temp;
data _null_;
  set files;
  file code;
  put 'data ' memname ';'
    / ' infile ' filename :$quote. 'dsd truncover ' firstobs= ';'
    / ' input ' varlist ';'
    / 'run;'
  ;
run;
%include code / source2;
2120  %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file ...\#LN00157.
2121 +data CSV1 ;
2122 + infile "c:\downloads\file1.csv" dsd truncover firstobs=5 ;
2123 + input year X Y Z ;
2124 +run;

NOTE: The infile "c:\downloads\file1.csv" is:
      Filename=c:\downloads\file1.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=116,
      Last Modified=16Jan2020:16:55:22,
      Create Time=16Jan2020:16:47:06

NOTE: 4 records were read from the infile "c:\downloads\file1.csv".
      The minimum record length was 19.
      The maximum record length was 19.
NOTE: The data set WORK.CSV1 has 4 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

2125 +data CSV2 ;
2126 + infile "c:\downloads\file2.csv" dsd truncover firstobs=9 ;
2127 + input year P Q U V W ;
2128 +run;

NOTE: The infile "c:\downloads\file2.csv" is:
      Filename=c:\downloads\file2.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=235,
      Last Modified=16Jan2020:16:55:22,
      Create Time=16Jan2020:16:47:06

NOTE: 5 records were read from the infile "c:\downloads\file2.csv".
      The minimum record length was 29.
      The maximum record length was 29.
NOTE: The data set WORK.CSV2 has 5 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Thanks again, but I am still considering PROC IMPORT rather than a DATA step, notwithstanding the shortcomings, for another reason. Here I attach one example.
This file was created by CMPT_ME_RETS using the 201911 CRSP database.
It contains value- and equal-weighted returns for size portfolios.
Each record contains returns for:
    Negative (not used)
    30% 40% 30%
    5 Quintiles
    10 Deciles
The portfolios are constructed at the end of Jun.
The annual returns are from January to December.
Missing data are indicated by -99.99 or -999.

Value Weight Returns -- Monthly
,<= 0,Lo 30,Med 40,Hi 30,Lo 20,Qnt 2,Qnt 3,Qnt 4,Hi 20,Lo 10,Dec 2,Dec 3,Dec 4,Dec 5,Dec 6,Dec 7,Dec 8,Dec 9,Hi 10
192607, -99.99, 0.14, 1.59, 3.42, 0.37, 0.78, 1.68, 1.39, 3.67, -0.12, 0.52, -0.05, 1.31, 1.21, 2.04, 1.58, 1.29, 3.53, 3.71
192608, -99.99, 3.21, 2.77, 2.91, 2.26, 3.51, 3.75, 1.53, 3.07, 1.33, 2.55, 4.00, 3.20, 2.81, 4.45, 1.61, 1.49, 0.61, 3.79
192609, -99.99, -1.74, -0.88, 0.80, -1.39, -1.06, 0.05, -0.26, 0.81, 0.59, -2.00, -2.01, -0.46, -0.06, 0.14, -2.02, 0.74, -0.77, 1.25
The variable name row starts with a comma here too. However, the variable names are not clean enough for SAS to use directly, since they contain not only math symbols such as "<= 0" but also blanks such as "Lo 30". PROC IMPORT actually does a great job of dealing with those situations.
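For reference, the kind of cleanup PROC IMPORT performs on such headers amounts to replacing characters that are invalid in SAS names with underscores. A rough sketch of that idea in Python (illustrative only, not PROC IMPORT's exact algorithm):

```python
import re

def sanitize_name(raw: str) -> str:
    """Turn a messy CSV header cell into a valid SAS-style name:
    letters, digits, and underscores only, not starting with a digit.
    (Illustrative only -- not PROC IMPORT's exact algorithm.)"""
    name = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
    if not name or name[0].isdigit():
        name = "_" + name
    return name[:32]  # SAS names are limited to 32 characters

header = ",<= 0,Lo 30,Med 40,Hi 30"
print([sanitize_name(cell) for cell in header.split(",")[1:]])
```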
You could easily write a program to read that file and generate a data step like this to read it.
data want ;
infile 'myfile.txt' dsd firstobs=14 truncover ;
label id ="Value Weight Returns -- Monthly"
var1='<= 0'
var2='Lo 30'
var3='Med 40'
var4='Hi 30'
var5='Lo 20'
var6='Qnt 2'
var7='Qnt 3'
var8='Qnt 4'
var9='Hi 20'
var10='Lo 10'
var11='Dec 2'
var12='Dec 3'
var13='Dec 4'
var14='Dec 5'
var15='Dec 6'
var16='Dec 7'
var17='Dec 8'
var18='Dec 9'
var19='Hi 10'
;
input id var1-var19;
run;