I am using SAS 9.4 M5
I am reading a text file with approximately 700,000,000 lines of data as text.
I need to know the number of bytes from beginning of the file to the current line.
The application has been used in the Windows and Unix environments so one line might have a carriage return and line feed at the end while the next line only has a line feed.
The application stores one variable per line so they are variable length records.
Lines vary in length from 0 bytes (no data) to 1950 bytes.
I initially assumed that every line would have both characters so I added 2 to the length of every line and kept a running total.
But approximately 100,000,000 lines have only one terminating character (a line feed).
The data has PHI so I can't post an example.
The code shown doesn't calculate the running total. The text string doesn't include the control characters, and I don't know how to read the file and get a count that includes those characters.
data pristine;
  infile "R:\test\file1.txt" firstobs=2 truncover lrecl=2000 length=lv ignoredoseof;
  input @1 v1 $varying2000. lv;
  linelen=lv;
  line=_n_;
run;
This code gives me a length excluding special characters.
Why not just ask the operating system how big the file is?
Or read it as a binary file and count the number of bytes?
Or read it using TERMSTR=LF and add one byte per line? You might need to remove CRs.
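As a rough sketch of the first suggestion, a DATA step can ask the host for file attributes through the external-file functions. The path is only a placeholder, and the attribute names returned differ between Windows and Unix, so this loop lists every item rather than assuming a particular one (on many SAS 9.4 installations one of them is a file size in bytes, but treat that as something to confirm locally).

```sas
/* Placeholder path: substitute the real file. */
filename test 'c:\data\test.txt';

data _null_;
  length optname $100 optval $200;
  fid = fopen('test');                 /* open the file through its fileref */
  if fid > 0 then do;
    do i = 1 to foptnum(fid);          /* every information item this host exposes */
      optname = foptname(fid, i);
      optval  = finfo(fid, optname);
      put optname= optval=;            /* written to the SAS log */
    end;
    rc = fclose(fid);
  end;
  else put 'Could not open the file.';
run;
```

The total size does not locate individual lines, but it gives a figure to check the final running byte count against.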
This file has over 1,00,000 records in it of two lengths. 1001 lines for one type and 304 lines for the other type.
Pointers are used to jump to each record. Some of the pointers in the index file have been corrupted. I am trying to recover the pointers. Each pointer is the physical location of a record in the file, in bytes, counting the carriage returns and line feeds. I need to count every byte.
If a record is saved multiple times they are daisy chained together in the file. Approximately 500 pointers have been corrupted. The data hasn't changed. Just lost the ability to look up the record from the application.
I haven't read any files using binary format. Can I convert it to bytes after reading it? From your reply I would say yes. Can you point me to some example code?
Sounds like the file actually is a binary file. Does it really have CR and LF characters in it anyway?
You can use the RECFM=N to read the file byte by byte.
data test;
  infile 'myfile' recfm=n ;
  do row=1 by 1 until (ch='0A'x);
    do col=1 by 1 until (ch='0A'x);
      input ch $char1. ;
      output;
    end;
  end;
run;
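A possible variant of the loop above, offered as a sketch rather than tested production code: it uses a single loop, keeps a running byte offset, and writes one observation per line instead of one per byte. The names LINEPOS, LINE, STARTPOS and LINELEN are made up for illustration, and 'myfile' is the same placeholder as above. It assumes every line ends in a line feed; a final line with no LF would not be output, because the step stops when INPUT reaches the end of the file.

```sas
data linepos (keep=line startpos linelen);
  infile 'myfile' recfm=n;
  line + 1;                                /* sum statement: retained line counter */
  do linelen = 1 by 1 until (ch = '0A'x);  /* loop ends once the LF has been read */
    input ch $char1.;
  end;
  output;                                  /* one row per line */
  startpos + linelen;                      /* byte offset where the next line starts */
run;
```

Here STARTPOS is the zero-based offset of the first byte of each line, and LINELEN counts every byte on the line, including a CR if present and the LF.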
The file is ASCII. I can go into it with UltraEdit and read the text. I can't use Notepad++ as the file is over 1GB so Notepad++ can't open it.
How do I count the bytes and include the carriage returns and line feeds? Can I include a count variable in the loop?
I think every line has a line feed, so can I use that as the end of line?
I can experiment with this.
Thank you
Doesn't matter if most (or all) of the bytes are valid ASCII codes. If it has "pointers" then it is a binary format.
I haven't used pointers since 1991 so I can't say much about that.
I am running the code and it appears to be working. I will post an update on the results by tomorrow.
This has worked very well.
Thank you
If there are enough LFs in the data so that no "line" is longer than 32K bytes then you can simplify the process if you want.
data lines (compress=yes);
  retain startpos 0;
  infile 'myfile' termstr=lf length=ll end=eof;
  line_length=ll;
  input line $varying32767. line_length ;
  output;
  startpos + line_length + 1;
  if eof then put 'Total File length=' startpos;
run;
I only need the line number and length so I changed your code to:
data test;
  infile test recfm=n ;
  do row=1 by 1 until (ch='0A'x);
    do col=1 by 1 until (ch='0A'x);
      input ch $char1. ;
      if _n_=1 then line=1;
      if _n_>1 and col=1 then line+1;
      output;
    end;
  end;
run;
data test1(keep=linelen line);
  set test(rename=(col=linelen));
  by line;
  if last.line;
run;
With your new suggestion, on the lines that have a crlf would the cr be embedded in the text string?
I should have accepted your earlier code as the solution.
I didn't realize that it would select my reply as the solution.
This is the first time I have done that.
Doesn't really matter, but you should be able to change which answer you mark as the solution.
This is code that I ran on a small test file that I created.
filename test 'c:\data\test.txt';
data test;
  infile test recfm=n ;
  do row=1 by 1 until (ch='0A'x);
    do col=1 by 1 until (ch='0A'x);
      input ch $char1. ;
      if _n_=1 then n=1;
      if _n_>1 and col=1 then n+1;
      output;
    end;
  end;
run;
proc print data=test (obs=1000);
run;
data test1(keep=linelen line);
  set test(rename=(col=linelen n=line));
  by line;
  if last.line;
run;
proc freq;
  table linelen;
run;
I have attached the data file.
Not sure why you added another variable LINE to replicate the ROW variable. Perhaps it shouldn't have a DO loop?
data test;
  infile test recfm=n ;
  line+1;
  do col=1 by 1 until (ch='0A'x);
    input ch $char1. ;
    output;
  end;
run;
proc summary data=test;
  by line ;
  var col;
  output out=test1 max=linelen;
run;
If you read data with LF only as the end-of-line marker then CR is treated the same as any other character.
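To illustrate that point, here is a hedged sketch that reads with TERMSTR=LF, flags whether each record ends in a CR, and accumulates byte offsets without storing the text. The names LINES, HAS_CR, STARTPOS and LINELEN are illustrative, 'myfile' is a placeholder, and the sketch assumes no line exceeds the LRECL given. If the last line of the file has no trailing LF, its LINELEN would be overstated by one.

```sas
data lines (keep=line startpos linelen has_cr);
  infile 'myfile' termstr=lf lrecl=32767 length=ll;
  input;                          /* load the record; LL is set when INPUT executes */
  line + 1;                       /* retained line counter */
  has_cr = 0;
  /* With TERMSTR=LF a CR before the LF stays in the buffer as the last byte. */
  if ll > 0 then has_cr = (char(_infile_, ll) = '0D'x);
  linelen = ll + 1;               /* record bytes (including any CR) plus the LF */
  output;
  startpos + linelen;             /* zero-based offset of the next line */
run;
```

A frequency count of HAS_CR would then show how many lines came from each environment, and subtracting HAS_CR from LL gives the length of the text alone.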
I have attached the output after changing the proc freq from a one-way to a two-way table with /list missing.
Note that the row variable kept a value of 1 for the entire file.
That is why I dropped the row variable.