DATA Step, Macro, Functions and more

Re: Reading in Trailing Blanks in Delimited Text Files

Reply
New Contributor
Posts: 3

Re: Reading in Trailing Blanks in Delimited Text Files

Hello - first time poster!

Using the infile engine I'm importing comma-delimited text files with millions of observations and want to ensure that i'm not truncating any character variables.

My first thought was that after reading in the data using an estimated set of widths, I would calculate the maximum length of the values of each character variable and compare the max length to  the initial width. So, for example, if I set a width of $10. for charvar1 and one of the charvar1 values is "Hello worl", then the max length would be 10 and it would suggest I should rerun my import using a larger width value, perhaps $15., at which point that charvar1 value would be "Hello world" and the max length might drop to 11, suggesting I can stop.

The issue I ran into is that this doesn't work in the situation where my variable value is "Hello Bob McDonald". If I read it in using $10., my max length could be 9, generating a false sense of assurance that i'm not truncating anything. One solution would be for me to read in charvar1 such that it includes trailing blanks, so that the test string length is 10 in the Bob McDonald case, but i'm not sure if that's feasible. To make things concrete, below is a dataset showing the issue.

Thanks!

I have a text file with one line and the following text (i've also attached it):

The   quick brown; fox

My datastep is as follows.

data test;

  informat testvar1 $CHAR5. testvar2 $CHAR5.;

  format testvar1 $5. testvar2 $5.;

  infile "Filepath.txt" delimiter = ';' missover dsd lrecl = 32767;

  input testvar1 $ testvar2 $;

  lenvar1 = length(testvar1);

  lenvar2 = length(testvar2);

run;

Attachment
PROC Star
Posts: 7,468

Re: Reading in Trailing Blanks in Delimited Text Files

I think you are making more work for yourself then is necessary.  Yes, you could write code that does what you are asking, but SAS already has a procedure that does it.  e.g., running the following code, I think, will accomplish the task for you:

PROC IMPORT OUT= WORK.want

            DATAFILE= "c:\filepath.txt"

            DBMS=DLM REPLACE;

     DELIMITER='3B'x;

     GETNAMES=NO;

     DATAROW=1;

     GUESSINGROWS=1500000;

RUN;

New Contributor
Posts: 3

Re: Reading in Trailing Blanks in Delimited Text Files

Art,

Thanks very much for your input, I didn't realize that the GUESSINGROWS max had been boosted from 32,767 rows in 9.2 to 2,147,483,647 rows in 9.3.

Unfortunately though, proc import has at least one limitation that gives me pause. It automatically truncates leading zeros when reading in variables it sets as numeric. If I have a variable with the following values, those values should be distinct in the imported dataset. However both are imported as "123": "00123", "0123".

I understand if there is no easy answer, but is there a way to either solve that issue with proc import or solve the original issue with the infile statement?

PROC Star
Posts: 7,468

Re: Reading in Trailing Blanks in Delimited Text Files

After you run proc import, press function key 4.  That will retrieve the code that was actually created.

Change the informats and formats for the fields that were incorrectly input as numbers, and rerun the code.

New Contributor
Posts: 3

Re: Reading in Trailing Blanks in Delimited Text Files

Thanks a lot Arthur.

I'm trying to avoid manually editing code from the log if possible, but this seems to be the best solution. I appreciate the help!

I found another conversation with some similarities to this one, and it doesn't seem like there was ever a clean resolution either.

Respected Advisor
Posts: 3,799

Re: Reading in Trailing Blanks in Delimited Text Files

Anytime you're using a program with a parameter with GUESSING in the name you will have to rely on the guess.  Or as we like to say in these parts "they call it guessing for a reason".

If you know what your fields are and how they should be INFORMATTED then write a program that uses what you know.  PROC IMPORT is for the 80% of the time when it works.

Super User
Posts: 11,343

Re: Reading in Trailing Blanks in Delimited Text Files

This is one time I would try using proc import. You can tell it to use the GUESSINGROWS option to examine upto the first 32767 rows of data to set the lengths of your variables. If that doesn't work you may want to look at the online help example in the INFILE example Reading Files That Contain Variable-Length Records.

Once again Arthur answered while I was typing!

Super User
Posts: 5,498

Re: Reading in Trailing Blanks in Delimited Text Files

Whenever I have wanted to be 100% certain in these situations, I use your approach, but with two changes.

1. Don't use INFORMAT.  Use LENGTH instead.

2. Use a much longer length than you could ever imagine needing, for testing.

length testvar1 testvar 2 $ 50;

input testvar1 testvar2;

Then follow up with testing for the longest length needed:

proc means data=test min max maxdec=0;

   var lenvar1 lenvar2;

run;

Then finally, go back and change the LENGTH statement to reflect the maximum length actually used.

Tedious, but it works.

Good luck.

PROC Star
Posts: 7,468

Re: Reading in Trailing Blanks in Delimited Text Files

ZFreeman: While I don't often disagree with data_null_, this time I will.  Since we know what proc import is guessing, and you know what you want, why not compromise and give proc import what you want it to guess?  The following rewrites the input file, putting double quotes around anything that starts with a zero, and then imports the re-written file:

data _null_;

  infile "D:\Art\filepath.txt";

  file "D:\Art\temp.txt";

  input;

  i=1;

  do i=1 to 3;

    if first(scan(_infile_,i,";")) eq "0" then var=catt('"',scan(_infile_,i,";"),'"');

    else var=scan(_infile_,i,";");

    if i lt 3 then do;

      var=cats(var,";");

      put var @;

    end;

    else put var;

  end;

run;

PROC IMPORT OUT= WORK.want

            DATAFILE= "D:\Art\temp.txt"

            DBMS=DLM REPLACE;

     DELIMITER='3B'x;

     GETNAMES=NO;

     DATAROW=1;

     GUESSINGROWS=1500000;

RUN;

Respected Advisor
Posts: 3,799

Re: Reading in Trailing Blanks in Delimited Text Files

While you may think you're disagreeing I think you are doing exactly what I suggest albeit in a different way.:smileymischief:

PROC Star
Posts: 7,468

Re: Reading in Trailing Blanks in Delimited Text Files

Posted in reply to data_null__

That is definitely a relief!  Like I said, I seldom disagree with your advice.

Ask a Question
Discussion stats
  • 10 replies
  • 488 views
  • 6 likes
  • 5 in conversation