11-18-2014 10:24 AM
I was wondering if someone has found a perfect way to import CSV/text files into SAS without having to change the formats afterwards.
Someone suggested in a post changing the default number for "GuessingRows" in the SAS Registry (regedit command), but there is no "perfect" number that works efficiently for every type of file.
Sometimes my data has missing values in the first 5 rows, sometimes the first half of the file has missing values, and sometimes character variables hold short values in the first rows and much longer ones later on.
Is there a way to import data perfectly without having to change some variables' formats afterwards?
11-18-2014 10:43 AM
As described in quite a few posts on here, the proc import syntax is shaky at best. You are allowing a generalized procedure to guess what you want to do. IMO, avoid proc import altogether. Look at your data, understand the data, and write code which imports that data:
length var1 var2 $10; /* declare lengths before input to avoid truncation */
input var1 $ var2 $;
This may seem like a bit more effort than letting proc import guess for you, but in the long term you:
1) Get complete control over the import
2) Catch errors early on
3) Understand the data structure.
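A fuller sketch of that approach, assuming a comma-delimited file with a header row at a hypothetical path (the file name, variable names, and lengths are placeholders you would replace after looking at your own data):

```sas
data want;
  /* dsd honors quoted fields and consecutive delimiters;
     firstobs=2 skips the header row;
     truncover prevents short lines from spilling into the next record */
  infile 'c:\data\myfile.csv' dsd dlm=',' firstobs=2 truncover;
  length var1 var2 $10;    /* set lengths BEFORE input */
  input var1 $ var2 $;
run;
```

The key point is that the length and input statements are written from your knowledge of the file, not guessed from a sample of rows.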
11-18-2014 01:01 PM
What is your assignment exactly? If it is "write a program which imports any csv file exactly as we want", I am afraid you are on a road to nowhere. There is no such thing as code which can handle any eventuality. Even if you go down the road of reading the complete file character by character and having some complex algorithm to work out each column, you are still going to come up with scenarios where the data just won't fit. You have to have some kind of idea of what the data is going to be structure-wise.
11-18-2014 01:44 PM
For CSV files there is no need to edit the registry; that registry value is only a default.
For CSV files, proc import accepts guessingrows as an option in the procedure call. The max value is 2147483647. If you have more rows than that before a variable changes behavior, then you're likely hosed anyway.
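For example (the file path and output dataset name here are hypothetical), guessingrows can be set right in the procedure call; in recent SAS releases MAX is shorthand for 2147483647:

```sas
proc import datafile='c:\data\myfile.csv'
    out=work.mydata
    dbms=csv
    replace;
  guessingrows=MAX; /* scan all rows before assigning types and lengths */
run;
```

Scanning every row is slower on large files, but it removes the "first N rows looked different" class of guessing errors.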
As I see it, the biggest issues have to do with numerically coded data with 1) significant leading zeroes and/or 2) 15 or more digits. Things like account numbers or identifiers should usually not be numeric, but proc import is likely to assign them as numeric. Account 0001 and 00001 aren't different as numbers (a bad numbering scheme, but the example works), and a value like 1234567891234567 may exceed the storage precision for integers (hint: you'll see values like 1.23E15).
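One way around both problems is to read such fields as character in your own data step rather than letting proc import type them (the file path, variable name, and length below are hypothetical placeholders):

```sas
data accounts;
  infile 'c:\data\accounts.csv' dsd firstobs=2 truncover;
  length acct $16;  /* character, so leading zeroes and all digits survive */
  input acct $;
run;
```

Stored as character, 0001 and 00001 remain distinct values, and a 16-digit identifier is kept exactly as it appears in the file.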