Hi SAS Experts,
I have a tricky situation. I am trying to read an external file (file has no delimiters) to SAS, and I only want to extract few strings from that external file to SAS dataset. Is it possible to programtically do in SAS?
I am copying the file I have below (it is a really small part of the actual file I have). Actual file that I have is really huge & each lines can be really long. (FYI - it is a FDF file, also attached "blankcrf.fdf")
What I want to achieve is write a SAS code so I can extract the text between 'Rect[' and ']' for each occurance then store this text under a variable as different records. Say in this case, it will be
VAR1 |
---|
257.777 308.923 272.177 320.923 |
114.308 424.398 128.708 436.398 |
118.974 511.878 133.374 523.878 |
EXTERNAL FILE I HAVE IS AS BELOW
ãÏÓ
2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
Initially I was trying to read full file to SAS datasets and then extract the text I want.
But as the file is so huge, it was taking a lot of time for me to convert it to a SAS dataset and then do the extraction.
Also this file is not favourably formatted and has no delimiters with long single lines adding to the difficulty.
Experts - any tricks to do it?
Appreciate your help.
- George
It the string is always the same length this will work. If not it will take a bit more fiddling with the string. The key is the @'string' syntax.
Thanks so much for the quick solution. This is great, it works perfectly with the example I provided using @ 'string' functionality. Always I can assume that the string length will be 31 and so it is not an issue.
But the issue I got stuck is with the length of the line.
In my real file I have everything (almost the entire data) added in a single line and giving a smaller value for LRECL causes truncation of data.
And I have no control over how I recieve the file.
In some cases length of the line could be really long that giving lrecl as 30000 won't still be enough. Also increasing LRECL to a very higher value is also slowing the performance drasticaly (with system getting stuck at times) .
I was trying to automate it in a standard way so it will work even if the length of a line in ascii file is really long / short. And I am finding it dangerous to give a lrecl a higher value always as I don't know what value will be safe enough to capture all the data.
Any suggestion or ideas?
Thanks again for the help!
In the example dataset you posted the field is NOT always 31 characters. Other than the c:\art part which you would have to change to reflect where your data are stored and where you can write a temporary file, you could try something like:
data _null_;
infile 'c:\art\blankcrf.fdf' recfm=n;
FILE 'c:\art\workfile.fdf' RECFM=N lrecl=1;
informat var2 $7.;
input VAR1 $CHAR1. @;
IF VAR1 eq '<' THEN do;
input var2 $char7. @;
if var2 eq '</Rect[' then do;
do while (var1 ne ']');
input VAR1 $CHAR1. @;
if var1 ne ']' then put var1@;
end;
put '0d'x@;
put '0a'x@;
end;
end;
run;
data want;
informat var1 $31.;
infile 'c:\art\workfile.fdf';
input var1 &;
run;
What does your SAS log report for the file if you don't specify LRECL = 512? What does it report for the minimum and maximum record lengths?
You should be able to specify a record length up to at least 32767 if that is what the SAS log reports.
I looked at the file blankcrf. The "records" you want are delimited by CR or '0d'x and have max length around 9000. There are also some CRLF delimited records too but I think for your purpose just go with CR.
NOTE: 4689 records were read from the infile FT67F001.
The minimum record length was 1.
The maximum record length was 8763.
You can use a data step like this to "see" the records.
We can add the column infile statement option and use @'string' to find the column numbers of the RECT values like this data step similar to what we did before.
Would a dlm= '[]' infile option simplify the input parsing to
length rect $64 ;
input @'<</Rect' rect & @@ ;
I tried that but could not get it to work. I don't know if it is because of the @'string' syntax or not.
Correction it does work. I guess I don't know what I'm talking about. Again.
Hi,
Here's another approach. A combination of the examples in the doc for call prxnext and call prxposn. I haven't counted occurrences of </Rect[...] in the source file, so I leave it to you to further check if this output is correct.
HTH,
Scott
P.S.: Good tip from re: termstr=cr!
filename in "C:\Temp\blankcrf.fdf";
data test;
infile in lrecl=32767 length=reclen termstr=cr;
input;
rx=prxparse("#</Rect\[(.*?)\]#io");
start=1;
stop=reclen;
call prxnext(rx, start, stop, _infile_, position, length);
do while (position > 0);
call prxposn(rx, 1, start, length);
found=substr(_infile_,start,length);
* put found= position= length= reclen=;
output;
call prxnext(rx, start, stop, _infile_, position, length);
end;
run;
Here's an excerpt from the log. If you see maximum record length >= 32767 or a truncated record message be worried. Which you get without the termstr=cr infile option.
NOTE: The infile IN is:
Filename=C:\Temp\blankcrf.fdf,
RECFM=V,LRECL=32767,File Size (bytes)=667689,
Last Modified=04 February 2014 17:39:46,
Create Time=04 February 2014 17:39:46
NOTE: 4689 records were read from the infile IN.
The minimum record length was 1.
The maximum record length was 8763.
NOTE: The data set WORK.TEST has 1019 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.07 seconds
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.