Yura2301
Quartz | Level 8

Hi all,

The following code reads a large file into a SAS table:

filename in "&infile";

data dset1;
    infile in dsd pad recfm=n lrecl=32767;
    input txt: $32767. @@;
    len=length(txt);
    row=_n_;
run;


But I noticed that sometimes the last character is lost in the dataset created from the file.

I also checked the total length of the text column:

proc sql;
    select sum(length(txt)) from dset1;
quit;

And I see that it is also a few bytes less than the real file size...

I noticed that the last character is usually lost from the column value in rows where the length equals the lrecl value that I set in the infile statement (32767).

So maybe someone has had the same problem and knows how to fix it?

Also, can somebody explain why each record in the target dataset "dset1" has a different length? Why can't it always be the length given in the lrecl option?

Thanks!

Tom
Super User

Not sure what you are talking about, but here are some hints that might help you solve it.

1) SAS character variables are limited to a length of 32767, but the LRECL of an INFILE is not. So if your real file has a record length longer than the maximum length of a character variable, then you cannot read it into a single variable. It should be easy to read it into multiple variables instead.

2A) Reading with the $ informat trims leading blanks. Use the $CHAR or $VARYING informat to preserve the leading spaces.

2B) The LENGTH() function ignores trailing blanks.

So the sum of the lengths of the lines could be smaller than the total size of the source file.
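
A minimal sketch of hints 2A and 2B, using made-up inline data rather than the real file: the same five columns are read once with $5. and once with $CHAR5., and then LENGTH() and LENGTHC() are compared. The variable names are only for illustration.

```sas
data _null_;
    /* Read columns 1-5 twice: $5. trims leading blanks, $CHAR5. keeps them */
    input @1 a $5. @1 b $char5.;
    len_a  = length(a);    /* 3: leading blanks trimmed, trailing blanks ignored */
    len_b  = length(b);    /* 5: leading blanks kept */
    lenc_a = lengthc(a);   /* 5: storage length, trailing blanks included */
    put len_a= len_b= lenc_a=;
datalines;
  abc
;
run;
```

Because LENGTH() never counts trailing blanks, any chunk of the file that ends in spaces contributes less to sum(length(txt)) than it occupies on disk.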

Yura2301
Quartz | Level 8

Hi Tom,

Regarding your answers:

1)...

Should be easy to read into multiple variables instead.

I actually read the data into multiple rows, so if the file is bigger than 32767 and has, for example, 33000 characters, I usually get a two-row table: the first row has 32767 characters and the second has the remaining 233. But not always; sometimes the code above reads such files into more than two rows of shorter length. It probably depends on the specifics of the data in the file.

2A) Reading with the $ informat trims leading blanks. Use the $CHAR or $VARYING informat to preserve the leading spaces.

But I actually used the $ informat in my code: input txt: $32767. @@;

I also tried the one that you suggested, $CHAR32767., and the result was the same: the last character in one row was missing.

2B) The LENGTH() function ignores trailing blanks.

Ok, it's clear.

To simplify the problem, I can show a simplified example of the issue:

Say the file is "123456789";

The code above reads it into a dataset with two rows:

1234

6789

so the "5" character is missing for some reason.

In the real case the problem is the same, only the record length is bigger (32767; I also tried 32000, with the same result).

Thanks!

Tom
Super User

What is the meaning of using DSD input with RECFM=N?  Also what would PAD be doing for RECFM=N?

Why are you using : modifier on the input statement?  Are you trying to read the text word  by word?  What happens when a word happens to span the artificial boundary set by the choice of setting for LRECL in the INFILE statement?

Another thing to consider is that you might be reading from a file that SAS considers to be using a different character encoding. So it might be transcoding one or more characters into more than a single byte in the internal representation and hence overflowing the content of the dataset variable.
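
As a rough first check of the encoding idea, the session encoding can be printed to the log and compared with whatever encoding the source file uses (the file's encoding is not stated in this thread, so that part is an assumption to verify separately):

```sas
/* Print the SAS session encoding to the log */
proc options option=encoding;
run;
```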

What if you made the length of your dataset variable longer than the LRECL setting?

data dset1;
    infile in /* dsd pad */ recfm=n lrecl=16384 /*32767*/;
    input txt: $32767. @@;
    len=length(txt);
    row=_n_;
run;

Yura2301
Quartz | Level 8

Hi again Tom,

So you asked:

What is the meaning of using DSD input with RECFM=N?

You are right, there are no delimiters in the file, but there are a lot of special characters in it (it's XML), and also a lot of text like: Name="test1,test2"...

Without this option the data step runs for a very long time; I actually just kill the process because it doesn't finish without this option.

Also what would PAD be doing for RECFM=N?

Sorry, I was just experimenting with this option. I remembered that it deals with blanks during reading, so I just tried it.

About the code that you provided: same problem, one character is lost.

I'll try to create some test file and reproduce the issue on it.

Thanks!

Yura2301
Quartz | Level 8

I reproduced the issue on a test file that contains just a sequence of characters 012345678901234567890...

The problem occurs because of the DSD option; without it, reading from the file works OK...

Without the DSD option the data step takes a long time and each row that is read is short (100 characters, 50, etc.). With the DSD option it reads many more characters per row (sometimes even 32767), so the data step runs much faster and produces fewer rows of greater length, but this ugly issue with the lost character occurs.

The file is attached to the conversation; the code is the one shown above with the dsd option.

To see the issue, you can copy the value of the first row into N++ and then copy the column value from the second row onto the next line; as you can see, the "7" character is missing.
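
The manual comparison could also be done in a data step. This is only a rough sketch, assuming the test file contains nothing but the repeating digits 0 to 9 and that dset1 was created by the code earlier in the thread. The dataset name check and the variables first_ch, last_ch, prev_last, expected and gap are made up for illustration.

```sas
data check;
    set dset1;
    length first_ch last_ch prev_last expected $1;
    retain prev_last;

    /* First and last non-blank-terminated characters of the current row */
    first_ch = substr(txt, 1, 1);
    last_ch  = substr(txt, length(txt), 1);

    if _n_ > 1 then do;
        /* In a 0123456789... sequence, the next digit is (previous + 1) mod 10 */
        expected = put(mod(input(prev_last, 1.) + 1, 10), 1.);
        gap = (first_ch ne expected);   /* 1 when a character was lost at the row boundary */
    end;

    output;
    prev_last = last_ch;   /* remember this row's last digit for the next row */
run;

/* List the rows whose first digit does not continue the sequence */
proc print data=check;
    where gap = 1;
    var row len prev_last first_ch expected;
run;
```

Rows listed by the PROC PRINT step would be the ones where a digit went missing between the end of the previous row and the start of the current one.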

