BookmarkSubscribeRSS Feed
geoonline
Calcite | Level 5

Hi SAS Experts,

   I have a tricky situation. I am trying to read an external file (file has no delimiters) to SAS, and I only want to extract  few strings from that external file to SAS dataset.  Is it possible to programtically do in SAS?

I am copying the file I have below (it is a really small part of the actual file I have). Actual file that I have is really huge & each lines can be really long. (FYI - it is a FDF file, also attached "blankcrf.fdf")

What I want to achieve is write a SAS code so I can extract the text between 'Rect[' and ']' for each occurance then store this text under a variable as different records. Say in this case, it will be

VAR1
257.777 308.923 272.177 320.923
114.308 424.398 128.708 436.398
118.974 511.878 133.374 523.878

EXTERNAL FILE I HAVE IS AS BELOW

ãÏÓ

2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj

Initially I was trying to read full file to SAS datasets and then extract the text I want.

But as the file is so huge, it was taking a lot of time for me to convert it to a SAS dataset and then do the extraction.

Also this file is not favourably formatted and has no delimiters with long single lines adding to the difficulty.

Experts  - any tricks to do it?

Appreciate your help.

- George

8 REPLIES 8
data_null__
Jade | Level 19

It the string is always the same length this will work.  If not it will take a bit more fiddling with the string.  The key is the @'string' syntax.

filename FT15F001 temp lrecl=512;
data rect;
   infile FT15F001;
   input @'<</Rect[' rect $31. @@;
   output;
  
parmcards4;
ãÏÓ

2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
;;;;
   run;
proc print;
  
run;
geoonline
Calcite | Level 5

Thanks so much for the quick solution. This is great, it works perfectly with the example I provided using @ 'string' functionality. Always I can assume that the string length will be 31 and so it is not an issue.

But the issue I got stuck is with the length of the line.

In my real file I have everything (almost the entire data) added in a single line and giving a smaller value for LRECL causes truncation of data.

And I have no control over how I recieve the file.

In some cases length of the line could be really long that giving lrecl as 30000 won't still be enough. Also increasing LRECL to a very higher value is also slowing the performance drasticaly (with system getting stuck at times) .

I was trying to automate it in a standard way so it will work even if the length of a line in ascii file is really long / short. And I am finding it dangerous to give a lrecl a higher value always as I don't know what value will be safe enough to capture all the data.  

Any suggestion or ideas?

Thanks again for the help!

art297
Opal | Level 21

In the example dataset you posted the field is NOT always 31 characters.  Other than the c:\art part which you would have to change to reflect where your data are stored and where you can write a temporary file, you could try something like:

data _null_;

  infile 'c:\art\blankcrf.fdf' recfm=n;

  FILE 'c:\art\workfile.fdf' RECFM=N lrecl=1;

  informat var2 $7.;

  input VAR1 $CHAR1. @;

  IF VAR1 eq '<' THEN do;

    input var2 $char7. @;

    if var2 eq '</Rect[' then do;

      do while (var1 ne ']');

        input VAR1 $CHAR1. @;

        if var1 ne ']' then put var1@;

      end;

      put '0d'x@;

      put '0a'x@;

    end;

  end;

run;

data want;

  informat var1 $31.;

  infile 'c:\art\workfile.fdf';

  input var1 &;

run;

SASKiwi
PROC Star

What does your SAS log report for the file if you don't specify LRECL = 512? What does it report for the minimum and maximum record lengths?

You should be able to specify a record length up to at least 32767 if that is what the SAS log reports.

data_null__
Jade | Level 19

I looked at the file blankcrf.  The "records" you want are delimited by CR or '0d'x  and have max length around 9000.  There are also some CRLF delimited records too but I think for your purpose just go with CR.

NOTE: 4689 records were read from the infile FT67F001.

      The minimum record length was 1.

      The maximum record length was 8763.

You can use a data step like this to "see" the records.

filename FT67F001 '~/blankcrf.fdf' lrecl=32767;
data _null_;
  
infile ft67f001 termstr=cr;
   input;
  
list;
  
run;

We can add the column infile statement option and use @'string' to find the column numbers of the RECT values like this data step similar to what we did before.

data rect;
   infile ft67f001 termstr=cr column=c;
   input @'<</Rect[' @; s=c;
   input @']' @; e=c-2;
   l=e-s+
1;
  
input @s rect $varying64. l;
   run;
Peter_C
Rhodochrosite | Level 12

Would a dlm= '[]' infile option simplify the input parsing to

length rect $64 ;

input @'<</Rect' rect & @@ ;

data_null__
Jade | Level 19

I tried that but could not get it to work.  I don't know if it is because of the @'string' syntax or not.

Correction it does work.  I guess I don't know what I'm talking about. Again.

data rect1;
   infile ft67f001 termstr=cr dlm='[]';
  
informat rect $64.;
  
input @'<</Rect' rect;
   run;
ScottBass
Rhodochrosite | Level 12

Hi,

Here's another approach.  A combination of the examples in the doc for call prxnext and call prxposn.  I haven't counted occurrences of </Rect[...] in the source file, so I leave it to you to further check if this output is correct.

HTH,

Scott

P.S.:  Good tip from re: termstr=cr!

filename in "C:\Temp\blankcrf.fdf";

data test;

  infile in lrecl=32767 length=reclen termstr=cr;

  input;

rx=prxparse("#</Rect\[(.*?)\]#io");

  start=1;

stop=reclen;

  call prxnext(rx, start, stop, _infile_, position, length);

  do while (position > 0);

     call prxposn(rx, 1, start, length);

found=substr(_infile_,start,length);

     * put found= position= length= reclen=;

     output;

     call prxnext(rx, start, stop, _infile_, position, length);

  end;

run;

Here's an excerpt from the log.  If you see maximum record length >= 32767 or a truncated record message be worried.  Which you get without the termstr=cr infile option.

NOTE: The infile IN is:

      Filename=C:\Temp\blankcrf.fdf,

      RECFM=V,LRECL=32767,File Size (bytes)=667689,

      Last Modified=04 February 2014 17:39:46,

      Create Time=04 February 2014 17:39:46

NOTE: 4689 records were read from the infile IN.

      The minimum record length was 1.

      The maximum record length was 8763.

NOTE: The data set WORK.TEST has 1019 observations and 6 variables.

NOTE: DATA statement used (Total process time):

      real time           0.08 seconds

      cpu time            0.07 seconds


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1475 views
  • 4 likes
  • 6 in conversation