BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
biringstad
Calcite | Level 5

 

Hello,

 

When I use 'proc import' to import csv files into SAS some observations become missing (these are not missing in the original csv files). Each csv file has 2474 rows and 2466 columns. It is several observations in the bottom right of the excel table that becomes missing.

 

Below is the commands I use.

 

I am trying to import 14 csv files by using a macro (data starting on row 3), and some missing observations (in the bottom right of the table) appears in all 14 of them.

 

It is not reported anything about this in the log.

 

%macro readdat;

%do i = 2000 %to 2014;

PROC IMPORT datafile="/folders/myfolders/WIO&i..csv"

OUT=WIO&i

DBMS=DLM

REPLACE;

DELIMITER='2C'x;

Getnames=NO;

Datarow=3;

RUN;

...

 

%end;

%mend readdat;

%readdat;

 

What could be the problem? Help is very appreciated!

 

 

Best,

Birgitte Ringstad

 

 

 

 

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Why would you ask PROC IMPORT to guess what you data looks like if you know that there are 2466 variables already?

You can read the data for all 15 years into a single dataset much easier just using a data step.

 

data want ;
  length fname $200 ;
  do year = 2000 to 2014;
    fname = cats("/folders/myfolders/WIO",year,".csv");
    infile csv filevar=fname dsd dlm=',' firstobs=3 truncover end=eof;
    do while (not eof);
      input var1-var2466 ;
      output;
    end;
  end;
  stop;
run;

If your 2466 variables are not all numeric then just update the INPUT statement to reflect what they actually are.

View solution in original post

19 REPLIES 19
Kurt_Bremser
Super User

Inspect the csv file with a proper editor (preferred one that can display in hex format, so you can find "interesting" characters).

 

If you use dbms=csv, you can omit the delimiter= statement (csv automatically uses a comma as delimiter).

 

If inspecting the file does not give you a clue, post the relevant portion of the file.

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Tip 1 - CSV (Comma Separated Variable text file) is not the same as an Excel file.  Opening a CSV (or in fact any file) in Excel can invoke many Excel "features" where data is hidden, displayed differently or any number of other nice features.  Use a plain text editor to see the real file.

 

Tip 2 - CSV file format indicates row 1 contains variable names.  From row 2 onwards is data.  If your "data" starts on row 3 then it is not a CSV file. 

 

Tip 3 - Proc import is a guessing procedure.  It does not know your data.  It runs through some observations and makes a best guess on what that data is/contains.  If the guess is wrong you can end up with odd results.

 

Tip 4 - You can import multiple files using infile in a datastep (and fix tip 2 also) no need to do any macro processing.  Here is result 1 of a google on the subject:

http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm

 

Tip 5 - Putting data (in this case) into dataset names, or dataset variable names, will do several things.  It will first make your code vastly more complicated than it needs to be.  It will also be harder to maintain, i.e. less robust, and be less efficient.  Using the indsname option you can create a varible for date - so a proper date variable - and put all the data into one dataset - i.e no looping, need to know dataset names etc.  

Ksharp
Super User
Try option guessingrow=32767
Tom
Super User Tom
Super User

Why would you ask PROC IMPORT to guess what you data looks like if you know that there are 2466 variables already?

You can read the data for all 15 years into a single dataset much easier just using a data step.

 

data want ;
  length fname $200 ;
  do year = 2000 to 2014;
    fname = cats("/folders/myfolders/WIO",year,".csv");
    infile csv filevar=fname dsd dlm=',' firstobs=3 truncover end=eof;
    do while (not eof);
      input var1-var2466 ;
      output;
    end;
  end;
  stop;
run;

If your 2466 variables are not all numeric then just update the INPUT statement to reflect what they actually are.

Patrick
Opal | Level 21

@Tom

If I execute your job in my SAS environment then I get into an infinite loop if I don't add a "stop;" right before the "run;"

Tom
Super User Tom
Super User

That makes sense that the STOP is required. Normally SAS stops a data step when it reads past the end of the input file, but in this case the inner loops are prevented from reading past the end and the outer loop gets its data from the DO loop instead of a SET statement.  So the STOP is required.

biringstad
Calcite | Level 5

Thank you.

 

I tried this code now, including STOP; right before END;, and it works okey.

 

My first two variables are character variables, how do I specify that in the code?

 

Birgitte

 

 

Tom
Super User Tom
Super User

Change the INPUT statement.

Normally I like to define the variable first.  You can use a LENGTH statement to define the type and length of the variables.  You can even give them names that make sense if you want.  If you do you could change the variable list in the INPUT statement to use a positional list of names instead of a simple sequential list of names by using double hyphen .

 

length ID $20 NAME $30 var3-var5678 8 ;
input id -- var5678 ;
Tom
Super User Tom
Super User

STOP would go after the END statement or else it will only do one file.

Put the STOP before the RUN statement.

biringstad
Calcite | Level 5

Thank you!

 

I did it like this now, so far it works.

 

data WIO;
 length fname $200 ;
 do year = 2000 to 2014 ;
 fname = cats("/folders/myfolders/WIO",year,".csv");
 infile csv filevar=fname dsd dlm=',' firstobs=3 truncover end=eof;
 do while (not eof);

length ctr $20 sec $30 var3-var2466 8;
input ctr -- var2466;
output;
end;
end;
stop;
run;

 

The two character variables are in column 1 and 2, their names can be ctr and sec, and the rest of the variables are numeric variables. What does the $20, $30 and 8 indicate?

And also the $200 specification in the second line?

 

 

The log notes that the minimum length for each dataset was 27108 and the maximum record length was 32767 for each of the imported csv files. Should these two lengths be the same?

 

LOG example (it says the same for each of the 15 imported csv files):

NOTE: The infile CSV is:

Filename=/folders/myfolders/WIO2014.csv,
Owner Name=root,Group Name=vboxsf,
Access Permission=-rwxrwx---,
Last Modified=25. november 2016 kl 08.56,
File Size (bytes)=67310096
 
NOTE: 2472 records were read from the infile CSV.
The minimum record length was 27108.
The maximum record length was 32767.

 

Tom
Super User Tom
Super User

Look up the LENGTH statement in the man pages.  SAS has two types of variables floating point numbers and fixed length character strings. The $ indicates the variable is a string and the length is the maximum number of bytes it can store. For numbers it is possible to request SAS to use less than the full 8 bytes that it takes to store a floating point number, but it is not worth it.

 

The default LRECL for INFILE is 32767.  If you are seeing that as the maximum length read then most likely some of your lines have been truncated.  Run it again but add the LRECL= option to the INFILE statement and specify a longer length such as 1000000.

 

It is surprising that 32,767 is not enough space to store 2463 numbers.  That would imply that your average number is more than 10 digits long.  Perhaps you have floating point numbers with too many digits of precision?

biringstad
Calcite | Level 5

Yes, I do have 8 digits after comma on all the variables, could that cause a problem?

The maximum length seem to have changed after including the LRECL alternative in the code. The minimum length is the same for each of the 15 files, and the maximum code varies.

 

 

 

Another question:

If I already have names for each column in the csv file (say in the first row), would it be possible to use these names for the variables in the dataset in sas? How could I include this in the code?

 

(Before I have used GETNAMES when using the the PROC IMPORT statement, but here I could not make it work)

 

Sorry for all the questions, but I am a bit desperate after using days on this.

 

Thanks a lot.

 

 

 

 

Tom
Super User Tom
Super User
Not sure why the maximum length should change from 32K to 3K unless the files were changed. The maximum record length is just what it sounds like, the length of the longest line in the file. Since the length depends on the actual data in the line it should vary a little by file.
What I normally do when given a new file is copy and paste the variable names into the program editor and convert the list into a valid LENGTH statement by inserting the length. Note this will only work if the names are valid SAS variable names. Also not that if the you have a lot of variables next to each other that you want to define as having the same type and length then you just have to put the length value after the last of the list.
For example: length name $30 age height weight 8 ;
Then for the INPUT statement I just need to re-type the first and last variable names .
INPUT firsvar - lastvar ;
Having 8 or 9 digits after the decimal point will not be a problem. Just remember that you only have about 15 digits of precision in an 8 byte floating point number. So do not expect to store 20 digits. Note that orders of magnitude don't matter so 123 and 123 million both only have 3 significant digits.
Also don't confuse the storage length of 8 bytes for the floating point numbers with the number of digits it takes to represent the number as a text string in your CSV file. You don't need to worry about that as SAS will know how many characters to read by the relative positions of the delimiters.
biringstad
Calcite | Level 5

Good to know, thank you!

 

If I have a row (say row 1) with names for all the numeric variables (var3-var2466), and I would like to include these when I import the data from the csv files, how could I include that in the code?

 

The names are like this: AUS, AUS_1, AUS_2, AUS_3, AUS_4, AUT, AUT_1, AUT_2, AUT_3, AUT_4 and up to ZOW, ZOW_1, ZOW_2, ZOW_3, ZOW_4. So there are 5 variables for each of 44 countries, summing to 220.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 19 replies
  • 21781 views
  • 2 likes
  • 6 in conversation