- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi SAS Experts,
I have a tricky situation. I am trying to read an external file (file has no delimiters) to SAS, and I only want to extract few strings from that external file to SAS dataset. Is it possible to programtically do in SAS?
I am copying the file I have below (it is a really small part of the actual file I have). Actual file that I have is really huge & each lines can be really long. (FYI - it is a FDF file, also attached "blankcrf.fdf")
What I want to achieve is write a SAS code so I can extract the text between 'Rect[' and ']' for each occurance then store this text under a variable as different records. Say in this case, it will be
VAR1 |
---|
257.777 308.923 272.177 320.923 |
114.308 424.398 128.708 436.398 |
118.974 511.878 133.374 523.878 |
EXTERNAL FILE I HAVE IS AS BELOW
ãÏÓ
2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
Initially I was trying to read full file to SAS datasets and then extract the text I want.
But as the file is so huge, it was taking a lot of time for me to convert it to a SAS dataset and then do the extraction.
Also this file is not favourably formatted and has no delimiters with long single lines adding to the difficulty.
Experts - any tricks to do it?
Appreciate your help.
- George
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It the string is always the same length this will work. If not it will take a bit more fiddling with the string. The key is the @'string' syntax.
data rect;
infile FT15F001;
input @'<</Rect[' rect $31. @@;
output;
parmcards4;
ãÏÓ
2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
;;;;
run;
proc print;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks so much for the quick solution. This is great, it works perfectly with the example I provided using @ 'string' functionality. Always I can assume that the string length will be 31 and so it is not an issue.
But the issue I got stuck is with the length of the line.
In my real file I have everything (almost the entire data) added in a single line and giving a smaller value for LRECL causes truncation of data.
And I have no control over how I recieve the file.
In some cases length of the line could be really long that giving lrecl as 30000 won't still be enough. Also increasing LRECL to a very higher value is also slowing the performance drasticaly (with system getting stuck at times) .
I was trying to automate it in a standard way so it will work even if the length of a line in ascii file is really long / short. And I am finding it dangerous to give a lrecl a higher value always as I don't know what value will be safe enough to capture all the data.
Any suggestion or ideas?
Thanks again for the help!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In the example dataset you posted the field is NOT always 31 characters. Other than the c:\art part which you would have to change to reflect where your data are stored and where you can write a temporary file, you could try something like:
data _null_;
infile 'c:\art\blankcrf.fdf' recfm=n;
FILE 'c:\art\workfile.fdf' RECFM=N lrecl=1;
informat var2 $7.;
input VAR1 $CHAR1. @;
IF VAR1 eq '<' THEN do;
input var2 $char7. @;
if var2 eq '</Rect[' then do;
do while (var1 ne ']');
input VAR1 $CHAR1. @;
if var1 ne ']' then put var1@;
end;
put '0d'x@;
put '0a'x@;
end;
end;
run;
data want;
informat var1 $31.;
infile 'c:\art\workfile.fdf';
input var1 &;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
What does your SAS log report for the file if you don't specify LRECL = 512? What does it report for the minimum and maximum record lengths?
You should be able to specify a record length up to at least 32767 if that is what the SAS log reports.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I looked at the file blankcrf. The "records" you want are delimited by CR or '0d'x and have max length around 9000. There are also some CRLF delimited records too but I think for your purpose just go with CR.
NOTE: 4689 records were read from the infile FT67F001.
The minimum record length was 1.
The maximum record length was 8763.
You can use a data step like this to "see" the records.
data _null_;
infile ft67f001 termstr=cr;
input;
list;
run;
We can add the column infile statement option and use @'string' to find the column numbers of the RECT values like this data step similar to what we did before.
infile ft67f001 termstr=cr column=c;
input @'<</Rect[' @; s=c;
input @']' @; e=c-2;
l=e-s+1;
input @s rect $varying64. l;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Would a dlm= '[]' infile option simplify the input parsing to
length rect $64 ;
input @'<</Rect' rect & @@ ;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I tried that but could not get it to work. I don't know if it is because of the @'string' syntax or not.
Correction it does work. I guess I don't know what I'm talking about. Again.
infile ft67f001 termstr=cr dlm='[]';
informat rect $64.;
input @'<</Rect' rect;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
Here's another approach. A combination of the examples in the doc for call prxnext and call prxposn. I haven't counted occurrences of </Rect[...] in the source file, so I leave it to you to further check if this output is correct.
HTH,
Scott
P.S.: Good tip from re: termstr=cr!
filename in "C:\Temp\blankcrf.fdf";
data test;
infile in lrecl=32767 length=reclen termstr=cr;
input;
rx=prxparse("#</Rect\[(.*?)\]#io");
start=1;
stop=reclen;
call prxnext(rx, start, stop, _infile_, position, length);
do while (position > 0);
call prxposn(rx, 1, start, length);
found=substr(_infile_,start,length);
* put found= position= length= reclen=;
output;
call prxnext(rx, start, stop, _infile_, position, length);
end;
run;
Here's an excerpt from the log. If you see maximum record length >= 32767 or a truncated record message be worried. Which you get without the termstr=cr infile option.
NOTE: The infile IN is:
Filename=C:\Temp\blankcrf.fdf,
RECFM=V,LRECL=32767,File Size (bytes)=667689,
Last Modified=04 February 2014 17:39:46,
Create Time=04 February 2014 17:39:46
NOTE: 4689 records were read from the infile IN.
The minimum record length was 1.
The maximum record length was 8763.
NOTE: The data set WORK.TEST has 1019 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.08 seconds
cpu time 0.07 seconds
Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.