JohnKeighley
Obsidian | Level 7

I am using SAS 9.4 M5

 

I am reading a text file with approximately 700,000,000 lines of data as text.

 

I need to know the number of bytes from beginning of the file to the current line.

 

The application has been used in the Windows and Unix environments so one line might have a carriage return and line feed at the end while the next line only has a line feed.

 

The application stores one variable per line so they are variable length records.

 

Lines vary in length from 0 bytes (no data) to 1950 bytes.

 

I initially assumed that every line would have both characters so I added 2 to the length of every line and kept a running total.

 

But, approximately 100,000,000 lines only have one character.

 

The data has PHI so I can't post an example

 

The code shown doesn't calculate the running total. The text string doesn't include the control characters, and I don't know how to read the file so that the count includes those characters.

 

data pristine;
infile "R:\test\file1.txt" firstobs=2 truncover lrecl=2000 length=lv ignoredoseof;
input @1 v1 $varying2000. lv;
linelen=lv;
line=_n_;
run;

 

This code gives me a length excluding special characters.

Tom
Super User

Why not just ask the operating system how big the file is?

Or read it as a binary file and count the number of bytes?

Or read it using TERMSTR=LF and add one byte per line? You might need to remove CRs.
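
For the first option, a minimal sketch of asking the operating system for the file size from within SAS. It only gives the total size, which can serve as a cross-check on a running byte count. The path 'myfile' and the fileref name f are placeholders, and the information item name 'File Size (bytes)' is host-dependent (FOPTNAME can list the items available on your system), so treat it as an assumption.

```sas
data _null_;
  rc  = filename('f', 'myfile');      /* assign a temporary fileref (placeholder path) */
  fid = fopen('f');                   /* open the external file for input */
  if fid > 0 then do;
    /* FINFO returns a character value; the item name is host-dependent */
    size = input(finfo(fid, 'File Size (bytes)'), 20.);
    put 'File size in bytes: ' size;
    rc = fclose(fid);
  end;
  rc = filename('f');                 /* clear the fileref */
run;
```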

JohnKeighley
Obsidian | Level 7

This file has over 1,00,000 records of two lengths: 1001 lines for one type and 304 lines for the other type.

 

Pointers are used to jump to each record. Some of the pointers in the index file have been corrupted. I am trying to recover the pointers, each of which is the physical location of a record in the file in bytes, counting the carriage returns and line feeds. I need to count every byte.

 

If a record is saved multiple times they are daisy chained together in the file. Approximately 500 pointers have been corrupted. The data hasn't changed. Just lost the ability to look up the record from the application.

 

I haven't read any files using binary format. Can I convert it to bytes after reading it? From your reply I would say yes. Can you point me to some example code?

 

 

Tom
Super User

Sounds like the file actually is a binary file.  Does it really have CR and LF characters in it anyway?

You can use the RECFM=N to read the file byte by byte.

data test;
infile 'myfile' recfm=n ;
do row=1 by 1 until (ch='0A'x);
   do col=1 by 1 until (ch='0A'x);
      input ch $char1. ;
      output;
   end;
end;
run;
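
If only the byte offset of each line is needed, rather than one observation per byte, a running total can be kept in the same kind of loop. This is a sketch under stated assumptions, not tested against the real file: 'myfile' is a placeholder path, the names offsets, startpos and linelen are made up for illustration, and every line, including the last one, is assumed to end with a line feed (a final line without an LF would not be written out).

```sas
data offsets(keep=line startpos linelen);
  infile 'myfile' recfm=n;       /* stream mode: CR and LF are read as data */
  length ch $1;
  line + 1;                      /* line number, retained across iterations */
  do linelen=1 by 1 until (ch='0A'x);
    input ch $char1.;            /* one byte at a time */
  end;
  output;                        /* startpos = 0-based offset of the line's first byte */
  startpos + linelen;            /* linelen counts every byte, CR and LF included */
run;
```

Because every byte is counted as it is read, a CRLF line contributes two terminator bytes and an LF-only line contributes one, with no need to know in advance which kind a line is.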
JohnKeighley
Obsidian | Level 7

The file is ASCII. I can go into it with UltraEdit and read the text. I can't use Notepad++ because the file is over 1GB, so Notepad++ can't open it.

 

How do I count the bytes and include the carriage returns and line feeds? Can I include a count variable in the loop?

 

I think every line has a line feed, so can I use that as the end of line?

 

I can experiment with this.

 

Thank you

 

Tom
Super User

Doesn't matter if most (or all) of the bytes are valid ASCII codes.  If it has "pointers" then it is a binary format.

JohnKeighley
Obsidian | Level 7

I haven't used pointers since 1991 so I can't say much about that.

 

I am running the code and it appears to be working. I will post an update on the results by tomorrow.

JohnKeighley
Obsidian | Level 7

This has worked very well.

 

Thank you

 

Tom
Super User

If there are enough LFs in the data so that no "line" is longer than 32K bytes then you can simplify the process if you want.

data lines (compress=yes);
  retain startpos 0;
  infile 'myfile' termstr=lf length=ll end=eof;
  input @;                /* load the record first so LL holds its length */
  line_length=ll;
  input line $varying32767. line_length ;
  output;
  startpos + line_length + 1;
  if eof then put 'Total File length=' startpos;
run;
JohnKeighley
Obsidian | Level 7

I only need the line number and length so I changed your code to:

 

data test;
  infile test recfm=n ;
  do row=1 by 1 until (ch='0A'x);
    do col=1 by 1 until (ch='0A'x);
      input ch $char1. ;
      if _n_=1 then line=1;
      if _n_>1 and col=1 then line+1;
      output;
    end;
  end;
run;

data test1(keep=linelen line);
  set test(rename=(col=linelen));
  by line;
  if last.line;
run;

 

With your new suggestion, on the lines that have a crlf would the cr be embedded in the text string?

 

JohnKeighley
Obsidian | Level 7

I should have accepted your earlier code as the solution.

 

I didn't realize that it would select my reply as the solution.

 

First time I have done that

 

Tom
Super User

Doesn't really matter, but you should be able to change which answer you mark as the solution.

JohnKeighley
Obsidian | Level 7

This is code that I ran on a small test file that I created.

 

filename test 'c:\data\test.txt';

data test;
  infile test recfm=n ;
  do row=1 by 1 until (ch='0A'x);
    do col=1 by 1 until (ch='0A'x);
      input ch $char1. ;
      if _n_=1 then n=1;
      if _n_>1 and col=1 then n+1;
      output;
    end;
  end;
run;

proc print data=test (obs=1000);
run;

data test1(keep=linelen line);
  set test(rename=(col=linelen n=line));
  by line;
  if last.line;
run;

proc freq;
  table linelen;
run;

 

I have attached the data file.

 

Tom
Super User

Not sure why you added another variable LINE to replicate the ROW variable.  Perhaps it shouldn't have a DO loop?

data test;
  infile test recfm=n ;
  line+1;
  do col=1 by 1 until (ch='0A'x);
    input ch $char1. ;
    output;
  end;
run;
proc summary data=test;
  by line ;
  var col;
  output out=test1 max=linelen;
run;

If you read data with LF only as the end-of-line marker then CR is treated the same as any other character.
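
Since the CR stays in the record, the length reported by LENGTH= already counts it, and adding one for the LF gives the full byte count for both kinds of line ending. A rough sketch of that idea, assuming no line is longer than 32,767 bytes and the last line ends with an LF; 'myfile' is a placeholder path and the names offsets, startpos, linelen and has_cr are made up for illustration. On Windows you may also want to keep the IGNOREDOSEOF option from the original INFILE statement.

```sas
data offsets(keep=line startpos linelen has_cr);
  infile 'myfile' termstr=lf lrecl=32767 length=ll;
  input;                                  /* load the record so that LL is set */
  line + 1;
  /* flag lines whose last data byte is a carriage return (CRLF endings) */
  has_cr = (ll > 0 and char(_infile_, ll) = '0D'x);
  linelen = ll + 1;                       /* record bytes (CR included) plus the LF */
  output;                                 /* startpos = 0-based offset of this line */
  startpos + linelen;                     /* offset of the next line's first byte */
run;
```

This avoids reading byte by byte, and the has_cr flag makes it easy to check how many lines carry the extra character.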

JohnKeighley
Obsidian | Level 7

I have attached the output after changing the proc freq from a one way to a two-way with /list missing.

 

Note that the row variable kept the value 1 for the entire file.

That is why I dropped the row variable.

 
