DATA Step, Macro, Functions and more

Data Parsing

Reply
Occasional Contributor
Posts: 8

Data Parsing

Hi SAS Experts,

   I have a tricky situation. I am trying to read an external file (file has no delimiters) to SAS, and I only want to extract  few strings from that external file to SAS dataset.  Is it possible to programtically do in SAS?

I am copying the file I have below (it is a really small part of the actual file I have). Actual file that I have is really huge & each lines can be really long. (FYI - it is a FDF file, also attached "blankcrf.fdf")

What I want to achieve is write a SAS code so I can extract the text between 'Rect[' and ']' for each occurance then store this text under a variable as different records. Say in this case, it will be

VAR1
257.777 308.923 272.177 320.923
114.308 424.398 128.708 436.398
118.974 511.878 133.374 523.878

EXTERNAL FILE I HAVE IS AS BELOW

ãÏÓ

2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj

Initially I was trying to read full file to SAS datasets and then extract the text I want.

But as the file is so huge, it was taking a lot of time for me to convert it to a SAS dataset and then do the extraction.

Also this file is not favourably formatted and has no delimiters with long single lines adding to the difficulty.

Experts  - any tricks to do it?

Appreciate your help.

- George

Attachment
Respected Advisor
Posts: 3,777

Re: Data Parsing

It the string is always the same length this will work.  If not it will take a bit more fiddling with the string.  The key is the @'string' syntax.

filename FT15F001 temp lrecl=512;
data rect;
   infile FT15F001;
   input @'<</Rect[' rect $31. @@;
   output;
  
parmcards4;
ãÏÓ

2 0 obj<</Rect[257.777 308.923 272.177 320.923]/CreationDate(D:20130715034903-04'00')/NM(30024a65-f161-4e17-a6ce-5348dee5cf05)/Subtype/FreeText/BS 3 0 R/F 4/M(D:20130715034909-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
3 0 obj<</W 0.0>>
endobj
4 0 obj<</Rect[114.308 424.398 128.708 436.398]/CreationDate(D:20130715035056-04'00')/NM(e13c51fe-b2b0-474d-9a8f-7e45e7b6d32d)/Subtype/FreeText/BS 5 0 R/F 4/M(D:20130715035104-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
5 0 obj<</W 0.0>>
endobj
6 0 obj<</Rect[118.974 511.878 133.374 523.878]/CreationDate(D:20130715035226-04'00')/NM(af556422-5caa-472e-bc23-c5865db207a8)/Subtype/FreeText/BS 7 0 R/F 4/M(D:20130715035312-04'00')/IT/FreeTextTypewriter/T(BINZME1)/DA(16.25 TL /Cour 12 Tf)/DS(font: Courier 12.0pt; text-align:left; color:#000000 )/Subj(Typewritten Text)/Type/Annot/Page 0>>
endobj
;;;;
   run;
proc print;
  
run;
Occasional Contributor
Posts: 8

Re: Data Parsing

Thanks so much for the quick solution. This is great, it works perfectly with the example I provided using @ 'string' functionality. Always I can assume that the string length will be 31 and so it is not an issue.

But the issue I got stuck is with the length of the line.

In my real file I have everything (almost the entire data) added in a single line and giving a smaller value for LRECL causes truncation of data.

And I have no control over how I recieve the file.

In some cases length of the line could be really long that giving lrecl as 30000 won't still be enough. Also increasing LRECL to a very higher value is also slowing the performance drasticaly (with system getting stuck at times) .

I was trying to automate it in a standard way so it will work even if the length of a line in ascii file is really long / short. And I am finding it dangerous to give a lrecl a higher value always as I don't know what value will be safe enough to capture all the data.  

Any suggestion or ideas?

Thanks again for the help!

PROC Star
Posts: 7,356

Re: Data Parsing

In the example dataset you posted the field is NOT always 31 characters.  Other than the c:\art part which you would have to change to reflect where your data are stored and where you can write a temporary file, you could try something like:

data _null_;

  infile 'c:\art\blankcrf.fdf' recfm=n;

  FILE 'c:\art\workfile.fdf' RECFM=N lrecl=1;

  informat var2 $7.;

  input VAR1 $CHAR1. @;

  IF VAR1 eq '<' THEN do;

    input var2 $char7. @;

    if var2 eq '</Rect[' then do;

      do while (var1 ne ']');

        input VAR1 $CHAR1. @;

        if var1 ne ']' then put var1@;

      end;

      put '0d'x@;

      put '0a'x@;

    end;

  end;

run;

data want;

  informat var1 $31.;

  infile 'c:\art\workfile.fdf';

  input var1 &;

run;

Super User
Posts: 3,101

Re: Data Parsing

What does your SAS log report for the file if you don't specify LRECL = 512? What does it report for the minimum and maximum record lengths?

You should be able to specify a record length up to at least 32767 if that is what the SAS log reports.

Respected Advisor
Posts: 3,777

Re: Data Parsing

I looked at the file blankcrf.  The "records" you want are delimited by CR or '0d'x  and have max length around 9000.  There are also some CRLF delimited records too but I think for your purpose just go with CR.

NOTE: 4689 records were read from the infile FT67F001.

      The minimum record length was 1.

      The maximum record length was 8763.

You can use a data step like this to "see" the records.

filename FT67F001 '~/blankcrf.fdf' lrecl=32767;
data _null_;
  
infile ft67f001 termstr=cr;
   input;
  
list;
  
run;

We can add the column infile statement option and use @'string' to find the column numbers of the RECT values like this data step similar to what we did before.

data rect;
   infile ft67f001 termstr=cr column=c;
   input @'<</Rect[' @; s=c;
   input @']' @; e=c-2;
   l=e-s+
1;
  
input @s rect $varying64. l;
   run;
Valued Guide
Posts: 2,174

Re: Data Parsing

Would a dlm= '[]' infile option simplify the input parsing to

length rect $64 ;

input @'<</Rect' rect & @@ ;

Respected Advisor
Posts: 3,777

Re: Data Parsing

I tried that but could not get it to work.  I don't know if it is because of the @'string' syntax or not.

Correction it does work.  I guess I don't know what I'm talking about. Again.

data rect1;
   infile ft67f001 termstr=cr dlm='[]';
  
informat rect $64.;
  
input @'<</Rect' rect;
   run;
Super Contributor
Posts: 376

Re: Data Parsing

Hi,

Here's another approach.  A combination of the examples in the doc for call prxnext and call prxposn.  I haven't counted occurrences of </Rect[...] in the source file, so I leave it to you to further check if this output is correct.

HTH,

Scott

P.S.:  Good tip from re: termstr=cr!

filename in "C:\Temp\blankcrf.fdf";

data test;

  infile in lrecl=32767 length=reclen termstr=cr;

  input;

rx=prxparse("#</Rect\[(.*?)\]#io");

  start=1;

stop=reclen;

  call prxnext(rx, start, stop, _infile_, position, length);

  do while (position > 0);

     call prxposn(rx, 1, start, length);

found=substr(_infile_,start,length);

     * put found= position= length= reclen=;

     output;

     call prxnext(rx, start, stop, _infile_, position, length);

  end;

run;

Here's an excerpt from the log.  If you see maximum record length >= 32767 or a truncated record message be worried.  Which you get without the termstr=cr infile option.

NOTE: The infile IN is:

      Filename=C:\Temp\blankcrf.fdf,

      RECFM=V,LRECL=32767,File Size (bytes)=667689,

      Last Modified=04 February 2014 17:39:46,

      Create Time=04 February 2014 17:39:46

NOTE: 4689 records were read from the infile IN.

      The minimum record length was 1.

      The maximum record length was 8763.

NOTE: The data set WORK.TEST has 1019 observations and 6 variables.

NOTE: DATA statement used (Total process time):

      real time           0.08 seconds

      cpu time            0.07 seconds

Ask a Question
Discussion stats
  • 8 replies
  • 496 views
  • 4 likes
  • 6 in conversation