SAS Programming

LuisB · Posted 01-20-2022 06:02 AM

Hi community.

I'm trying to parse an .fdf file that looks something like this:

File.fdf

-----------------------------------------------------------

askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>

hjkasdfkjhsdfe

-------------------------------------------------------

I basically want to recover (1) the variable names that are always in-between "Contents(" and the next ")", and (2) the corresponding page number that is always after "Page " and before the next ">>".

I have been trying to use a DATA step with an INFILE statement. The issue I keep running into is how to read several variables and pages from a single _INFILE_ line; I can only make it work for a single variable and page per input line. This is compounded by the fact that the lines in the .fdf file are very long (> 50,000 characters).

Is there a way in which perhaps I can split the file directly from SAS into different lines? This is more or less what I have tried so far:

SAS_SCRIPT

------------------------------------------------------------

data;

infile "file.fdf" linesize=32767 N=10000;

input;

/*Gather occurrence for the loop*/
vars =countc(_infile_,"Contents(");
pags=countc(_infile_,"Page ");

pos=1;

/* Start loop to find all occurrences of variables */
do i=0 TO min(vars,pags);

pos_var = find(_infile_, "Contents(",pos)+9;

if ( pos_var > 9 ) then do;

pos = pos_var;
length_var = find(_infile_, ")",pos_var) - pos_var;

if ( length_var > 0 ) then do;

pos_page = find(_infile_,")",pos_var) + 5;
if ( pos_page > 5 ) then do;
length_page = find(_infile_, ">",pos_page) - pos_page;
if ( length_page > 0 ) then do;
/* Now input */
var = substr(_infile_, pos_var, length_var);
page = substr(_infile_, pos_page, length_page);
end;

end;

end;
end;

end;

run;

------------------------------------------------------------

Expected output

-------------------------------------------------------

name page

variable1 1

variable1 2

variable2 1

.... etc

-------------------------------------------------------

Any ideas?

Tom · Posted 01-20-2022 09:21 AM

That does not look like the right output for that input file. One of the <</Contents( strings is missing the slash.

First let's make an actual text file we can use as input.

options parmcards=fdf ;
filename fdf temp;
parmcards4;
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>

hjkasdfkjhsdfe
;;;;

Now let's use the @'string' pointer control to find the starting points and the SCAN() function to remove the trailing characters.

data want ;
  infile fdf column=cc ;
  input @'<</Contents(' variable :$200. @;
  variable=scan(variable,1,')');
  input @'Page ' @ ;
  s = cc ;
  input string :$20. @;
  Page = scan(string,1,'>') ;
  output;
  input @s @@;
  drop s string;
run;

Result:

Obs    variable               Page

 1     variable1               1
 2     variable2               4
 3     variable2-akljsdfkj     2
 4     variable1               4
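
As a small illustration of what the two SCAN() calls above do, here is a sketch with made-up literal strings (raw_name and raw_page are hypothetical, shaped like the tokens that list input picks up, not values read from the real file):

```sas
data _null_ ;
  /* Hypothetical raw tokens, similar to what the INPUT statements return */
  raw_name = 'variable1)-akljsdfkj' ;
  raw_page = '4>>4564324<</Contents(' ;
  variable = scan(raw_name, 1, ')') ;  /* keep the text before the first ")" */
  page     = scan(raw_page, 1, '>') ;  /* keep the text before the first ">" */
  put variable= page= ;                /* log: variable=variable1 page=4 */
run;
```

Note that Page comes out as a character variable this way, since SCAN() returns text.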

If there are limits on what characters can be valid in the variable name then it could be even simpler:

data want ;
  infile fdf flowover dlm=')> ';
  input @'<</Contents(' variable :$200. @'Page ' page @@ ;
run;
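
For the real file, where single lines run past 50,000 characters, the INFILE statement would probably also need a record length large enough to hold a whole line, because longer records get truncated at LRECL. A sketch, assuming the path "file.fdf" from the question and an LRECL value of 100000 that I picked only as a guess at a safe upper bound:

```sas
data want ;
  /* Assumptions: "file.fdf" is the path from the question,   */
  /* and 100000 is larger than the longest line in that file. */
  infile "file.fdf" lrecl=100000 flowover dlm=')> ' ;
  input @'<</Contents(' variable :$200. @'Page ' page @@ ;
run;
```

Adjust LRECL= to whatever the longest line in your file actually is.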

LuisB · Posted 01-20-2022 06:36 AM

I also tried the following, unsuccessfully:
-----------------------------------------
data work.import;
infile 'file.fdf' linesize=32767;
length page $3.;

do i=0 to 1000 until (page='');
input
@'Contents(' variable $30.
@'Page ' page $3. @;
end;
stop;
run;
-------------------------------------

data_null__ · Posted 01-20-2022 09:00 AM

Given your example, show the expected output.

File.fdf

-----------------------------------------------------------

askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>

hjkasdfkjhsdfe

-------------------------------------------------------

Tom · Posted 01-20-2022 09:08 AM

Do you have a design document that describes the file format?

It does not look like an Adobe Form Data File. https://docs.appligent.com/fdfmerge/fdfmerge-form-data-format/

LuisB · Posted 01-20-2022 09:27 AM

Hi Tom, thank you for reading my topic. The example data I provided was indeed just an abstraction of an old FDF file. It doesn't represent the exact file format. I basically have an export of comments from a PDF that I'm unable to provide at the moment. I will look for a document describing the exact format.

LuisB · Posted 01-20-2022 09:27 AM

Added! thank you!

LuisB · Posted 01-20-2022 09:45 AM

@Tom

I think this, combined with the LRECL option, definitely solves my problem. I still don't fully understand why, though. Is the last @@ the trigger to move to the next 'line'? Why is the final input of 's' required even after you OUTPUT?

Tom · Posted 01-20-2022 09:54 AM

The double trailing @ holds the input pointer on the current line for the next iteration of the data step. So you can find multiple /Contents tags on the same "line" of the file.
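
A minimal sketch of the same idea with plain list input (the data set name and the values are made up for illustration):

```sas
data pairs ;
  /* @@ holds the line across iterations of the data step, */
  /* so each pass reads the next x y pair from the same line */
  input x y @@ ;
datalines;
1 2 3 4 5 6
;
run;
```

This gives three observations from the one line of data: (1,2), (3,4) and (5,6). With a single trailing @ the line would be released at the end of each iteration, so only the first pair would be read.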

The @S moves the pointer back to where it started reading the string used to find the page number, just in case that read accidentally went past the start of the next /Contents tag.
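
Here is a small sketch of that rewind on a made-up line (the data set name, the extra variable rest and the sample text are only for illustration):

```sas
data demo ;
  infile datalines column=cc ;  /* cc tracks the current pointer column */
  input @'Page ' @ ;            /* pointer now sits just after "Page " */
  s = cc ;                      /* remember that column */
  input string :$20. @ ;        /* list input runs on to the next blank: "12>>next" */
  input @s rest $char3. ;       /* jump back to column s and read again: "12>" */
  put s= string= rest= ;
datalines;
xx Page 12>>next
;
```

The read into string swallows everything up to the next blank, which in the real file could include the start of the next /Contents tag. Going back to column s means the next search for '<</Contents(' starts before that tag rather than after it.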

Reading an .fdf file into SAS
