BookmarkSubscribeRSS Feed
ZFreeman
Calcite | Level 5

Hello - first time poster!

Using the infile engine I'm importing comma-delimited text files with millions of observations and want to ensure that i'm not truncating any character variables.

My first thought was that after reading in the data using an estimated set of widths, I would calculate the maximum length of the values of each character variable and compare the max length to  the initial width. So, for example, if I set a width of $10. for charvar1 and one of the charvar1 values is "Hello worl", then the max length would be 10 and it would suggest I should rerun my import using a larger width value, perhaps $15., at which point that charvar1 value would be "Hello world" and the max length might drop to 11, suggesting I can stop.

The issue I ran into is that this doesn't work in the situation where my variable value is "Hello Bob McDonald". If I read it in using $10., my max length could be 9, generating a false sense of assurance that i'm not truncating anything. One solution would be for me to read in charvar1 such that it includes trailing blanks, so that the test string length is 10 in the Bob McDonald case, but i'm not sure if that's feasible. To make things concrete, below is a dataset showing the issue.

Thanks!

I have a text file with one line and the following text (i've also attached it):

The   quick brown; fox

My datastep is as follows.

data test;

  informat testvar1 $CHAR5. testvar2 $CHAR5.;

  format testvar1 $5. testvar2 $5.;

  infile "Filepath.txt" delimiter = ';' missover dsd lrecl = 32767;

  input testvar1 $ testvar2 $;

  lenvar1 = length(testvar1);

  lenvar2 = length(testvar2);

run;

10 REPLIES 10
art297
Opal | Level 21

I think you are making more work for yourself then is necessary.  Yes, you could write code that does what you are asking, but SAS already has a procedure that does it.  e.g., running the following code, I think, will accomplish the task for you:

PROC IMPORT OUT= WORK.want

            DATAFILE= "c:\filepath.txt"

            DBMS=DLM REPLACE;

     DELIMITER='3B'x;

     GETNAMES=NO;

     DATAROW=1;

     GUESSINGROWS=1500000;

RUN;

ZFreeman
Calcite | Level 5

Art,

Thanks very much for your input, I didn't realize that the GUESSINGROWS max had been boosted from 32,767 rows in 9.2 to 2,147,483,647 rows in 9.3.

Unfortunately though, proc import has at least one limitation that gives me pause. It automatically truncates leading zeros when reading in variables it sets as numeric. If I have a variable with the following values, those values should be distinct in the imported dataset. However both are imported as "123": "00123", "0123".

I understand if there is no easy answer, but is there a way to either solve that issue with proc import or solve the original issue with the infile statement?

art297
Opal | Level 21

After you run proc import, press function key 4.  That will retrieve the code that was actually created.

Change the informats and formats for the fields that were incorrectly input as numbers, and rerun the code.

ZFreeman
Calcite | Level 5

Thanks a lot Arthur.

I'm trying to avoid manually editing code from the log if possible, but this seems to be the best solution. I appreciate the help!

I found another conversation with some similarities to this one, and it doesn't seem like there was ever a clean resolution either.

data_null__
Jade | Level 19

Anytime you're using a program with a parameter with GUESSING in the name you will have to rely on the guess.  Or as we like to say in these parts "they call it guessing for a reason".

If you know what your fields are and how they should be INFORMATTED then write a program that uses what you know.  PROC IMPORT is for the 80% of the time when it works.

ballardw
Super User

This is one time I would try using proc import. You can tell it to use the GUESSINGROWS option to examine upto the first 32767 rows of data to set the lengths of your variables. If that doesn't work you may want to look at the online help example in the INFILE example Reading Files That Contain Variable-Length Records.

Once again Arthur answered while I was typing!

Astounding
PROC Star

Whenever I have wanted to be 100% certain in these situations, I use your approach, but with two changes.

1. Don't use INFORMAT.  Use LENGTH instead.

2. Use a much longer length than you could ever imagine needing, for testing.

length testvar1 testvar 2 $ 50;

input testvar1 testvar2;

Then follow up with testing for the longest length needed:

proc means data=test min max maxdec=0;

   var lenvar1 lenvar2;

run;

Then finally, go back and change the LENGTH statement to reflect the maximum length actually used.

Tedious, but it works.

Good luck.

art297
Opal | Level 21

ZFreeman: While I don't often disagree with data_null_, this time I will.  Since we know what proc import is guessing, and you know what you want, why not compromise and give proc import what you want it to guess?  The following rewrites the input file, putting double quotes around anything that starts with a zero, and then imports the re-written file:

data _null_;

  infile "D:\Art\filepath.txt";

  file "D:\Art\temp.txt";

  input;

  i=1;

  do i=1 to 3;

    if first(scan(_infile_,i,";")) eq "0" then var=catt('"',scan(_infile_,i,";"),'"');

    else var=scan(_infile_,i,";");

    if i lt 3 then do;

      var=cats(var,";");

      put var @;

    end;

    else put var;

  end;

run;

PROC IMPORT OUT= WORK.want

            DATAFILE= "D:\Art\temp.txt"

            DBMS=DLM REPLACE;

     DELIMITER='3B'x;

     GETNAMES=NO;

     DATAROW=1;

     GUESSINGROWS=1500000;

RUN;

data_null__
Jade | Level 19

While you may think you're disagreeing I think you are doing exactly what I suggest albeit in a different way.:smileymischief:

art297
Opal | Level 21

That is definitely a relief!  Like I said, I seldom disagree with your advice.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1751 views
  • 6 likes
  • 5 in conversation