madara155
Obsidian | Level 7

Hi,

My CSV file has around 10000+ rows and 2000+ columns, and in most columns the first value appears in row 1005, row 2901, or row 7000 (see the figure below) -- i.e. the first many cells are empty in most of the columns.

 

(attached screenshot: madara155_0-1697164624350.png)

 

I was trying to PROC IMPORT this CSV file and getting a DATA step error. If I specify GUESSINGROWS=10000, the program works fine but takes ages to run.

 

I saw in another thread that people have suggested writing a DATA step to define the column details and then using PROC IMPORT.

As there are 2000+ columns, this is not very practical.

And I am trying not to hardcode things (e.g. instead of specifying the column names, I wish to import the data file and use the column names in the file, as my work needs to be extended for generic use).

Can someone help me with this?

Thanks

 

7 REPLIES
Patrick
Opal | Level 21

A .csv is a text file without any information telling SAS how to map the values into SAS variables (type, length, format, etc.).

Proc Import generates SAS data step code to read the file. To generate correct code, Proc Import needs a first pass to analyse the text file.

GUESSINGROWS defines how many rows of the text file get analysed. I normally use MAX (= analyse all rows) because it's always possible that there is a string on the very last row that changes things -- like a character on the last line after only having digits before.
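A minimal sketch of that approach (the file path and dataset name are hypothetical):

```sas
/* Sketch: let PROC IMPORT scan every row before guessing variable types.
   Path and dataset name are hypothetical. */
proc import datafile="/data/have.csv"
    out=work.have
    dbms=csv
    replace;
    guessingrows=max;  /* analyse all rows, not just the default 20 */
run;
```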

 

 "but takes ages to run"

What does that mean? How big is your source file, and where is it stored (e.g. on some network drive with low throughput)?

Kurt_Bremser
Super User

Do not use PROC IMPORT for text (csv) files. Write the DATA step yourself, according to the description/documentation of the file.
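A minimal sketch of such a hand-written DATA step for a comma-separated file; all names, lengths, and the path below are hypothetical and must come from the file's documentation:

```sas
/* Hand-written DATA step sketch; variable names and lengths are hypothetical */
data work.have;
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2 lrecl=32767;
    length id 8 name $40 value 8;
    input id name $ value;
run;
```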

Also, do not use Excel to inspect text files. Use a suitable editor (like Notepad++) for this. Excel will show you what it thinks about the values, not the real raw values.

andreas_lds
Jade | Level 19

As others have already said: reserving proc import for cases that cannot be solved by a data step is the way to go.

Writing a LENGTH statement for 2000+ variables is no fun at all, but you do it once and don't have to fix wrongly guessed types and lengths.
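If the names happen to follow a numbered pattern, SAS variable lists keep such a LENGTH statement compact (a sketch with hypothetical names and path):

```sas
/* Numbered variable lists shorten the LENGTH statement for wide files */
data work.have;
    length key 8 myvar_1-myvar_2000 $20;  /* 2000 character variables in one go */
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2 lrecl=1000000;
    input key myvar_1-myvar_2000;         /* types already set by LENGTH */
run;
```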

 

ballardw
Super User

I am not sure what your question actually is. You say you don't want to "hardcode" names and such, but complain that Proc Import takes too long with a large value of GUESSINGROWS.

 

You could use a much smaller value of GUESSINGROWS. The procedure will generate data step code. Copy that generated code from the log into the editor, then manually change the generated INFORMAT lines for however many variables are coming in wrong. If you don't have any good documentation about the content of the file to set those properties, this step is likely to take much longer than just waiting for a large value of GUESSINGROWS to complete.
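The code that PROC IMPORT writes to the log looks roughly like this (variable names and path are hypothetical); after copying it into the editor, you would widen any informat that was guessed too narrowly:

```sas
/* Shape of the code PROC IMPORT writes to the log (hypothetical names).
   Edit the informats/formats that were guessed wrong, then run it. */
data WORK.HAVE;
    infile '/data/have.csv' delimiter=',' missover dsd lrecl=32767 firstobs=2;
    informat id best32.;
    informat col1 $1.;   /* guessed from early rows; widen to e.g. $200. if needed */
    format id best12.;
    format col1 $1.;
    input id col1 $;
run;
```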

 

 

Personally, I am suspicious of the construction and content of any file with 2000+ columns.

 

 

madara155
Obsidian | Level 7

Thanks everyone for your thoughts. 

 

As per everyone, it seems I can't avoid using a DATA step. Will proceed to do that with the help of some SAS-generated code.

 

THANKS

 

 

Patrick
Opal | Level 21

@madara155 wrote:

Thanks everyone for your thoughts. 

 

As per everyone, it seems I can't avoid using a DATA step. Will proceed to do that with the help of some SAS-generated code.

 


If you use Proc Import to generate the data step code, it will generate separate code bits for every single variable.

A table with 2000 variables normally has variable names that follow some naming pattern, like myvar_1 to myvar_2000. If you've got such patterns, then your SAS data step code can become much simpler to "type", or potentially to generate via some script.

data demo;
  infile datalines dsd dlm=',' truncover;
  informat 
    key best32.
    col1-col6 $1.
    ;
  input key col1-col6;
  datalines;
1,a,b,c,d,e,f
2,g,h,i,,k,l
;

 

Kurt_Bremser
Super User

With 2000 columns where the names are simply numbered, this points to data in wide format. You can read such data by looping until you reach the end of line, and outputting non-missing values to separate observations. A long layout is usually better for further analysis.
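A sketch of that wide-to-long read, assuming a key in the first column followed by the 2000+ value columns (the path, names, lengths, and column count are hypothetical):

```sas
/* Read each value column, keep only non-missing ones as separate observations */
data work.long;
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2 lrecl=1000000;
    length key 8 colnum 8 value $50;
    input key @;                 /* hold the line after reading the key */
    do colnum = 1 to 2000;       /* one iteration per wide column */
        input value $ @;         /* TRUNCOVER yields missing past end of line */
        if not missing(value) then output;
    end;
    keep key colnum value;
run;
```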

 

If the columns are not numbered, you can copy the header line into the code and make the INPUT statement out of it.
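For example, if the header line were `id,age,height,weight` (a hypothetical header and path), you would paste it and turn the commas into blanks:

```sas
/* Header line "id,age,height,weight" pasted with commas replaced by blanks */
data work.have;
    infile "/data/have.csv" dsd dlm=',' truncover firstobs=2;
    input id age height weight;
run;
```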

 

For more detailed help, we need to see at least a partial example of the header line and of the real data (overwrite sensitive information with x's or 0's) from lines where there are values. Once again: do not use Excel; use a text editor for copying.
