11-15-2012 11:57 AM
Hello - first time poster!
Using the infile engine I'm importing comma-delimited text files with millions of observations and want to ensure that i'm not truncating any character variables.
My first thought was that after reading in the data using an estimated set of widths, I would calculate the maximum length of the values of each character variable and compare the max length to the initial width. So, for example, if I set a width of $10. for charvar1 and one of the charvar1 values is "Hello worl", then the max length would be 10 and it would suggest I should rerun my import using a larger width value, perhaps $15., at which point that charvar1 value would be "Hello world" and the max length might drop to 11, suggesting I can stop.
The issue I ran into is that this doesn't work in the situation where my variable value is "Hello Bob McDonald". If I read it in using $10., my max length could be 9, generating a false sense of assurance that i'm not truncating anything. One solution would be for me to read in charvar1 such that it includes trailing blanks, so that the test string length is 10 in the Bob McDonald case, but i'm not sure if that's feasible. To make things concrete, below is a dataset showing the issue.
I have a text file with one line and the following text (i've also attached it):
The quick brown; fox
My datastep is as follows.
informat testvar1 $CHAR5. testvar2 $CHAR5.;
format testvar1 $5. testvar2 $5.;
infile "Filepath.txt" delimiter = ';' missover dsd lrecl = 32767;
input testvar1 $ testvar2 $;
lenvar1 = length(testvar1);
lenvar2 = length(testvar2);
11-15-2012 01:48 PM
I think you are making more work for yourself then is necessary. Yes, you could write code that does what you are asking, but SAS already has a procedure that does it. e.g., running the following code, I think, will accomplish the task for you:
PROC IMPORT OUT= WORK.want
11-15-2012 02:36 PM
Thanks very much for your input, I didn't realize that the GUESSINGROWS max had been boosted from 32,767 rows in 9.2 to 2,147,483,647 rows in 9.3.
Unfortunately though, proc import has at least one limitation that gives me pause. It automatically truncates leading zeros when reading in variables it sets as numeric. If I have a variable with the following values, those values should be distinct in the imported dataset. However both are imported as "123": "00123", "0123".
I understand if there is no easy answer, but is there a way to either solve that issue with proc import or solve the original issue with the infile statement?
11-15-2012 02:51 PM
After you run proc import, press function key 4. That will retrieve the code that was actually created.
Change the informats and formats for the fields that were incorrectly input as numbers, and rerun the code.
11-15-2012 03:03 PM
11-15-2012 03:57 PM
Anytime you're using a program with a parameter with GUESSING in the name you will have to rely on the guess. Or as we like to say in these parts "they call it guessing for a reason".
If you know what your fields are and how they should be INFORMATTED then write a program that uses what you know. PROC IMPORT is for the 80% of the time when it works.
11-15-2012 01:52 PM
This is one time I would try using proc import. You can tell it to use the GUESSINGROWS option to examine upto the first 32767 rows of data to set the lengths of your variables. If that doesn't work you may want to look at the online help example in the INFILE example Reading Files That Contain Variable-Length Records.
Once again Arthur answered while I was typing!
11-15-2012 04:09 PM
Whenever I have wanted to be 100% certain in these situations, I use your approach, but with two changes.
1. Don't use INFORMAT. Use LENGTH instead.
2. Use a much longer length than you could ever imagine needing, for testing.
length testvar1 testvar 2 $ 50;
input testvar1 testvar2;
Then follow up with testing for the longest length needed:
proc means data=test min max maxdec=0;
var lenvar1 lenvar2;
Then finally, go back and change the LENGTH statement to reflect the maximum length actually used.
Tedious, but it works.
11-15-2012 04:41 PM
ZFreeman: While I don't often disagree with data_null_, this time I will. Since we know what proc import is guessing, and you know what you want, why not compromise and give proc import what you want it to guess? The following rewrites the input file, putting double quotes around anything that starts with a zero, and then imports the re-written file:
do i=1 to 3;
if first(scan(_infile_,i,";")) eq "0" then var=catt('"',scan(_infile_,i,";"),'"');
if i lt 3 then do;
put var @;
else put var;
PROC IMPORT OUT= WORK.want