madara155
Obsidian | Level 7

Hi,

My CSV file has around 10000+ rows and 2000+ columns, and in most columns the first value appears in row 1005, row 2901, or row 7000 (see the figure below) -- i.e. the first many cells are empty in most of the columns.

 

(attached screenshot: madara155_0-1697164624350.png)

 

I was trying to PROC IMPORT this CSV file and getting a DATA step error. If I specify GUESSINGROWS=10000, the program works fine but takes ages to run.

 

I saw in another thread that people have suggested writing a DATA step to define the column details and then using PROC IMPORT.

As there are 2000+ columns, this is not very practical.

And I am trying not to hardcode things (e.g. instead of specifying the column names, I wish to import the data file and use the column names in the file, as my work needs to be extended for generic use).

Can someone help me with this?

Thanks

 

7 REPLIES
Patrick
Opal | Level 21

A .csv is a text file without any information telling SAS how to map the values into SAS variables (type, length, format, etc.).

Proc Import generates SAS data step code to read the file. To generate correct code, Proc Import needs a first pass to analyse the text file.

GUESSINGROWS defines how many rows of the text file get analysed. I normally use MAX (= analyse all rows) because it's always possible that there is a string on the very last row that changes things -- like a character on the last line after only having digits before.
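A minimal sketch of that approach (the file path and dataset name are hypothetical):

```sas
/* Sketch: let PROC IMPORT scan every row before guessing variable types.
   Path and dataset name are hypothetical. */
proc import datafile="/data/have.csv"
    out=work.have
    dbms=csv
    replace;
    guessingrows=max;  /* analyse all rows, not just the default 20 */
run;
```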

 

 "but takes ages to run"

What does that mean? How big is your source file, and where is it stored (e.g. on some network drive with low throughput)?

Kurt_Bremser
Super User

Do not use PROC IMPORT for text (csv) files. Write the DATA step yourself, according to the description/documentation of the file.
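A minimal sketch of such a hand-written DATA step for a comma-separated file; all names, lengths, and the path below are hypothetical and must come from the file's documentation:

```sas
/* Hand-written DATA step sketch; variable names and lengths are hypothetical */
data work.have;
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2 lrecl=32767;
    length id 8 name $40 value 8;
    input id name $ value;
run;
```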

Also, do not use Excel to inspect text files. Use a suitable editor (like Notepad++) for this. Excel will show you what it thinks about the values, not the real raw values.

andreas_lds
Jade | Level 19

As others have already said: reserving proc import for cases that cannot be solved by a data step is the way to go.

Writing a LENGTH statement for 2000+ variables is no fun at all, but you do it once and don't have to fix wrongly guessed types and lengths.
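If the names happen to follow a numbered pattern, SAS variable lists keep such a LENGTH statement compact (a sketch with hypothetical names and path):

```sas
/* Numbered variable lists shorten the LENGTH statement for wide files */
data work.have;
    length key 8 myvar_1-myvar_2000 $20;  /* 2000 character variables in one go */
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2 lrecl=1000000;
    input key myvar_1-myvar_2000;         /* types already set by LENGTH */
run;
```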

 

ballardw
Super User

I am not sure what your question actually is. You say you don't want to "hardcode" names and such, but complain that Proc Import takes too long with a large value of GUESSINGROWS.

 

You could use a much smaller value of GUESSINGROWS. The procedure will generate data step code. Copy that generated code from the log into the editor, then manually change the generated INFORMAT lines for however many variables are coming in wrong. If you don't have any good documentation about the content of the file to set those properties, this step is likely to take much longer than just waiting for a large value of GUESSINGROWS to complete.
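The code that PROC IMPORT writes to the log looks roughly like this (variable names and path are hypothetical); after copying it into the editor, you would widen any informat that was guessed too narrowly:

```sas
/* Shape of the code PROC IMPORT writes to the log (hypothetical names).
   Edit the informats/formats that were guessed wrong, then run it. */
data WORK.HAVE;
    infile '/data/have.csv' delimiter=',' missover dsd lrecl=32767 firstobs=2;
    informat id best32.;
    informat col1 $1.;   /* guessed from early rows; widen to e.g. $200. if needed */
    format id best12.;
    format col1 $1.;
    input id col1 $;
run;
```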

 

 

Personally, I am suspicious of the construction and content of any file with 2000+ columns.

 

 

madara155
Obsidian | Level 7

Thanks everyone for your thoughts. 

 

As per everyone, it seems I can't avoid using a DATA step. Will proceed to do that with the help of some SAS-generated code.

 

THANKS

 

 

Patrick
Opal | Level 21

@madara155 wrote:

Thanks everyone for your thoughts. 

 

As per everyone, it seems I can't avoid using a DATA step. Will proceed to do that with the help of some SAS-generated code.

 


If you use Proc Import to generate the data step code, it will generate separate code bits for every single variable.

A table with 2000 variables normally has variable names that follow some naming pattern, like myvar_1 to myvar_2000. If you've got such patterns, then your SAS data step code can become much simpler to "type", or potentially to generate via some script.

data demo;
  infile datalines dsd dlm=',' truncover;
  informat 
    key best32.
    col1-col6 $1.
    ;
  input key col1-col6;
  datalines;
1,a,b,c,d,e,f
2,g,h,i,,k,l
;

 

Kurt_Bremser
Super User

With 2000 columns where the names are simply numbered, this points to data in wide format. You can read such data by looping until you reach the end of line, and outputting non-missing values to separate observations. A long layout is usually better for further analysis.
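A sketch of that wide-to-long read, assuming a key in the first column followed by the 2000+ value columns (the path, names, lengths, and column count are hypothetical):

```sas
/* Read each value column, keep only non-missing ones as separate observations */
data work.long;
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2 lrecl=1000000;
    length key 8 colnum 8 value $50;
    input key @;                 /* hold the line after reading the key */
    do colnum = 1 to 2000;       /* one iteration per wide column */
        input value $ @;         /* TRUNCOVER yields missing past end of line */
        if not missing(value) then output;
    end;
    keep key colnum value;
run;
```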

 

If the columns are not numbered, you can copy the header line into the code and make the INPUT statement out of it.
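For example, if the header line were `id,age,height,weight` (a hypothetical header and path), you would paste it and turn the commas into blanks:

```sas
/* Header line "id,age,height,weight" pasted with commas replaced by blanks */
data work.have;
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2;
    input id age height weight;
run;
```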

 

For more detailed help, we need to see at least a partial example of the header line and of the real data (overwrite sensitive information with x's or 0's) from lines where there are values. Once again: do not use Excel; use a text editor for copying.
