Solved: Re: prevent reading blank rows in csvs

proctice · Posted 02-15-2017 02:30 PM

I am using the following code to read multiple csvs.

http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm

But my CSVs have blank rows in them. Any suggestions on how to modify this code so that it skips the blank rows? They are making it take a long time to run.

I am currently handling this in a second data step with this line of code:

if compress(cats(of _all_),'.')=' ' then delete; /*This line was also taken from SAS communities. Thanks!*/

LaurieF · Posted 02-15-2017 03:21 PM

It could make a difference depending on the file system and structure of the CSV files themselves, but you could try something like this:

if reclen < 2 then
   delete;

If you've used the length=reclen option on the infile statement, and your files have variable-length records, the blank lines will have a length of 0 or 1 (probably the former).
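
To check what length your blank lines actually report, here is a minimal sketch. It assumes nothing from the linked code: the fileref demo and the three test records are made up for illustration. It writes a small temporary file with one blank record, reads it back, and prints the value that length= puts in reclen for each record.

```sas
/* Hypothetical test file: two data records with a blank record between them. */
filename demo temp;

data _null_;
  file demo;
  put 'alpha,1';
  put;              /* a PUT with no arguments writes an empty record */
  put 'beta,2';
run;

/* Read the file back and log the record length reported by LENGTH=. */
data _null_;
  infile demo length=reclen;
  input;            /* null INPUT: loads the record without reading any variables */
  put _n_= reclen=; /* writes the record number and its length to the log */
run;
```

If the blank record shows a length of 1 rather than 0 on your real files, a likely cause is a leftover carriage return from files created on Windows and read elsewhere, which is why the test above is reclen < 2 rather than reclen = 0.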

proctice · Posted 02-15-2017 03:29 PM

LaurieF,

Where would I put this piece of code in the code in the link?

LaurieF · Posted 02-15-2017 03:37 PM

It's hard to tell, because I can't see your exact code!

If it's exactly as the link has it, I'd modify it to this:

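do while(not done);
  myfilename = filepath;
  input @;                       /* load the record and hold it; this sets reclen */
  if reclen >= 2 then do;
    input name $ x1 x2 x3 @;     /* parse the fields only for non-blank records */
    output;
  end;
  input;                         /* release the held record */
end;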

The input @; reads the record into the input buffer without parsing any fields and without releasing it; nevertheless, it populates reclen. If the length is 2 or more, the second input reads the fields you want and an observation is written out. Its trailing @ again holds the record. The final input; at the end of the do block reads nothing, but it releases the held record so that the next input @; moves on to the next record.

I should think that this will give you quite an improvement, especially if you've got a lot of variables to read.

Patrick · Posted 02-15-2017 04:15 PM

When you use the INPUT statement, the automatic variable _INFILE_ gets created. This variable points to the input buffer.

For a blank line in your source .csv, the input buffer will be blank, so the following code should work:

data dirlist;
  length fname $256;
  infile dirlist length=reclen;
  input fname $varying256. reclen;
  if missing(_INFILE_) then delete;
run;
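
The same test can go in the step that reads the CSV records themselves. This is only a sketch under assumptions: the fileref demo, the test data, and the output data set want are made up for illustration, and the variable list name x1 x2 x3 is borrowed from the linked example. A bare input @; loads the record and holds it, so _INFILE_ can be checked before any fields are parsed.

```sas
/* Hypothetical test file: two data rows with blank rows mixed in. */
filename demo temp;

data _null_;
  file demo;
  put 'alpha,1,2,3';
  put;                              /* blank record */
  put 'beta,4,5,6';
  put;                              /* blank record */
run;

data want;
  length name $25;
  infile demo dsd truncover;
  input @;                          /* load the record and hold it */
  if missing(_infile_) then delete; /* blank buffer: skip before parsing fields */
  input name $ x1 x2 x3;            /* parse the held record */
run;
```

Because the blank rows are dropped before the second input runs, no fields are parsed for them, and want should end up holding only the two non-blank rows.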

mkeintz · Posted 02-15-2017 11:01 PM

I think you mean _INFILE_, not _input_

Patrick · Posted 02-15-2017 11:10 PM

I do, and I've fixed it in my post. Thanks!