I am using SAS 9.4 M5.
I am reading a text file with approximately 700,000,000 lines of data as text.
I need to know the number of bytes from beginning of the file to the current line.
The application has been used in the Windows and Unix environments so one line might have a carriage return and line feed at the end while the next line only has a line feed.
The application stores one variable per line so they are variable length records.
Lines vary in length from 0 bytes (no data) to 1950 bytes.
I initially assumed that every line would end with both characters, so I added 2 to the length of every line and kept a running total.
But approximately 100,000,000 lines end with only one character.
The data has PHI so I can't post an example.
The code shown doesn't calculate the running total. The text string doesn't include the control characters, and I don't know how to read the file and get a count that includes those characters.
data pristine;
infile "R:\test\file1.txt" firstobs=2 truncover lrecl=2000 length=lv ignoredoseof;
input @1 v1 $varying2000. lv;
linelen=lv;
line=_n_;
run;
This code gives me a length excluding special characters.
Why not just ask the operating system how big the file is?
Or read it as a binary file and count the number of bytes?
or read it using TERMSTR=LF and add one byte per line? You might need to remove CR's.
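As a rough sketch of the first suggestion, SAS can report what the operating system knows about the file, which gives a total to check any byte count against. The fileref name rawfile and the path 'myfile' are placeholders, and the names of the information items (including the one holding the size in bytes) depend on the host, so this just lists whatever is available:

```sas
/* Placeholder fileref and path; replace with the real file */
filename rawfile 'myfile';

data _null_;
  length name $64 value $256;
  /* Open for input as a binary file with a record length of 1 */
  fid = fopen('rawfile', 'I', 1, 'B');
  if fid > 0 then do;
    /* List every information item the host provides for this file */
    do i = 1 to foptnum(fid);
      name  = foptname(fid, i);
      value = finfo(fid, name);
      put name= value=;
    end;
    rc = fclose(fid);
  end;
  else put 'Could not open the file.';
run;
```

The total size does not locate individual records, but it should equal the start position plus length of the last line once the per-line counts are right.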
This file has over 1,00,000 records in it of two lengths. 1001 lines for one type and 304 lines for the other type.
Pointers are used to jump to each record. Some of the pointers in the index file have been corrupted. I am trying to recover the pointers, each of which is the physical location of a record in the file, in bytes, counting the carriage returns and line feeds. I need to count every byte.
If a record is saved multiple times they are daisy chained together in the file. Approximately 500 pointers have been corrupted. The data hasn't changed. Just lost the ability to look up the record from the application.
I haven't read any files using binary format. Can I convert it to bytes after reading it? From your reply I would say yes. Can you point me to some example code?
Sounds like the file actually is a binary file. Does it really have CR and LF characters in it anyway?
You can use the RECFM=N to read the file byte by byte.
data test;
infile 'myfile' recfm=n ;
do row=1 by 1 until (ch='0A'x);
do col=1 by 1 until (ch='0A'x);
input ch $char1. ;
output;
end;
end;
run;
The file is ASCII. I can go into it with UltraEdit and read the text. I can't use Notepad++ as the file is over 1GB so Notepad++ can't open it.
How do I count the bytes and include the carriage returns and line feeds? Can I include a count variable in the loop?
I think every line has a line feed, so can I use that as the end of line?
I can experiment with this.
Thank you
Doesn't matter if most (or all) of the bytes are valid ASCII codes. If it has "pointers" then it is a binary format.
I haven't used pointers since 1991 so I can't say much about that.
I am running the code and it appears to be working. I will post an update on the results by tomorrow.
This has worked very well.
Thank you
If there are enough LFs in the data so that no "line" is longer than 32K bytes then you can simplify the process if you want.
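data lines (compress=yes);
  retain startpos 0;
  infile 'myfile' termstr=lf length=ll end=eof;
  /* LENGTH= is only set once an INPUT statement has executed, */
  /* so load the record and hold it before copying the length. */
  input @;
  line_length=ll;
  input line $varying32767. line_length ;
  output;
  /* Add 1 for the LF; any CR is already counted in line_length. */
  startpos + line_length + 1;
  if eof then put 'Total File length=' startpos;
run;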
I only need the line number and length so I changed your code to:
data test;
infile test recfm=n ;
do row=1 by 1 until (ch='0A'x);
do col=1 by 1 until (ch='0A'x);
input ch $char1. ;
if _n_=1 then line=1;
if _n_>1 and col=1 then line+1;
output;
end;
end;
run;
data test1(keep=linelen line);
set test(rename=(col=linelen));
by line;
if last.line;
run;
With your new suggestion, on the lines that have a crlf would the cr be embedded in the text string?
I should have accepted your earlier code as the solution.
I didn't realize that it would select my reply as the solution.
First time I have done that
Doesn't really matter, but you should be able to change which answer you mark as the solution.
This is code that I ran on a small test file that I created.
filename test 'c:\data\test.txt';
data test;
infile test recfm=n ;
do row=1 by 1 until (ch='0A'x);
do col=1 by 1 until (ch='0A'x);
input ch $char1. ;
if _n_=1 then n=1;
if _n_>1 and col=1 then n+1;
output;
end;
end;
run;
proc print data=test (obs=1000);
run;
data test1(keep=linelen line);
set test(rename=(col=linelen n=line));
by line;
if last.line;
run;
proc freq;
table linelen;
run;
I have attached the data file.
Not sure why you added another variable LINE to replicate the ROW variable. Perhaps it shouldn't have a DO loop?
data test;
infile test recfm=n ;
line+1;
do col=1 by 1 until (ch='0A'x);
input ch $char1. ;
output;
end;
run;
proc summary data=test;
by line ;
var col;
output out=test1 max=linelen;
run;
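For a file with hundreds of millions of lines, writing one observation per byte may be more output than is practical. A possible variation, offered only as a sketch, keeps the byte-by-byte read but writes one observation per line with its starting byte offset. The variable names startpos, nextpos, and linelen and the path 'myfile' are placeholders, and the offset is assumed to be zero-based:

```sas
data offsets (keep=line startpos linelen);
  infile 'myfile' recfm=n;
  retain nextpos 0;
  line + 1;
  /* Zero-based byte offset of the first byte of this line */
  startpos = nextpos;
  /* Count every byte up to and including the LF; a CR counts too */
  do linelen = 1 by 1 until (ch = '0A'x);
    input ch $char1.;
  end;
  nextpos + linelen;
  output;
run;
```

The index variable is not incremented when the UNTIL condition ends the loop, so linelen holds the full byte count of the line including its terminator. If the file does not end with a line feed, the step would most likely stop at end of file before the last partial line is written, so that case needs a separate check.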
If you read data with LF only as the end-of-line marker then CR is treated the same as any other character.
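One way to see this, assuming the same placeholder path 'myfile', is to read with TERMSTR=LF and test whether the last byte of each record is a CR. The variable names has_cr and line_length are made up for this sketch:

```sas
data eol_check (keep=line line_length has_cr);
  infile 'myfile' termstr=lf length=ll;
  /* Load the next record into the input buffer; this sets ll */
  input;
  line + 1;
  /* Length excludes the LF but includes a trailing CR, if any */
  line_length = ll;
  has_cr = 0;
  if ll > 0 then has_cr = (char(_infile_, ll) = '0D'x);
run;
```

With this layout, the bytes a line occupies on disk are line_length + 1, and the text length without the line ending is line_length - has_cr.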
I have attached the output after changing the proc freq from a one way to a two-way with /list missing.
Note, the row variable kept a value of 1 for the entire file.
That is why I dropped the row variable.