Hi community.
I'm trying to parse an .fdf file that looks something like this:
File.fdf
-----------------------------------------------------------
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>
hjkasdfkjhsdfe
-------------------------------------------------------
I basically want to recover (1) the variable names that are always in-between "Contents(" and the next ")", and (2) the corresponding page number that is always after "Page " and before the next ">>".
I have been trying to use a DATA step with an INFILE statement. I can read a single variable and page per input line, but I cannot work out how to read several variable/page pairs from a single _INFILE_ line. This is compounded by the fact that the lines in the .fdf file are very long (> 50,000 characters).
Is there a way in which perhaps I can split the file directly from SAS into different lines? This is more or less what I have tried so far:
SAS_SCRIPT
------------------------------------------------------------
data;
  infile "file.fdf" linesize=32767 N=10000;
  input;
  /* Gather occurrences for the loop */
  vars = countc(_infile_,"Contents(");
  pags = countc(_infile_,"Page ");
  pos = 1;
  /* Start loop to find all occurrences of variables */
  do i=0 TO min(vars,pags);
    pos_var = find(_infile_, "Contents(",pos)+9;
    if ( pos_var > 9 ) then do;
      pos = pos_var;
      length_var = find(_infile_, ")",pos_var) - pos_var;
      if ( length_var > 0 ) then do;
        pos_page = find(_infile_,")",pos_var) + 5;
        if ( pos_page > 5 ) then do;
          length_page = find(_infile_, ">",pos_page) - pos_page;
          if ( length_page > 0 ) then do;
            /* Now input */
            var = substr(_infile_, pos_var, length_var);
            page = substr(_infile_, pos_page, length_page);
          end;
        end;
      end;
    end;
  end;
run;
------------------------------------------------------------
Expected output
-------------------------------------------------------
name page
variable1 1
variable1 2
variable 2 1
.... etc
-------------------------------------------------------
Any ideas?
That does not look like the right output for that input file. One of the <</Contents( strings is missing the slash.
First let's make an actual text file we can use as input.
options parmcards=fdf ;
filename fdf temp;
parmcards4;
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>
hjkasdfkjhsdfe
;;;;
Now let's use the @'string' pointer control to find the starting points and the SCAN() function to remove the trailing characters.
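data want ;
  infile fdf column=cc ;                     /* CC tracks the current column pointer */
  input @'<</Contents(' variable :$200. @;   /* jump past the next tag, read up to the next blank, hold the line */
  variable=scan(variable,1,')');             /* keep only the text before the first ")" */
  input @'Page ' @ ;                         /* move to just after "Page " */
  s = cc ;                                   /* remember that column */
  input string :$20. @;                      /* read the token that starts with the page number */
  Page = scan(string,1,'>') ;                /* keep only the text before the first ">" */
  output;
  input @s @@;                               /* return to the saved column and hold the line for the next iteration */
  drop s string;
run;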
Result:
Obs    variable               Page
 1     variable1              1
 2     variable2              4
 3     variable2-akljsdfkj    2
 4     variable1              4
If there are limits on what characters can be valid in the variable name then it could be even simpler:
data want ;
  infile fdf flowover dlm=')> ';
  input @'<</Contents(' variable :$200. @'Page ' page @@ ;
run;
Given your example, please show the expected output.
Do you have a design document that describes the file format?
It does not look like an Adobe Form Data File. https://docs.appligent.com/fdfmerge/fdfmerge-form-data-format/
Hi Tom, thank you for reading my topic. The example data I provided was indeed just an abstraction of an old FDF file. It doesn't represent the exact file format. I basically have an export of comments from a PDF that I'm unable to provide at the moment. I will look for a document describing the exact format.
I think this, combined with the LRECL option definitely solves my problem. I still don't fully understand why, though. Is the last @@ the trigger to move to the next 'line'? Why is the input of the last 's' required even after you OUTPUT?
The double trailing @ holds the input pointer on the current line for the next iteration of the data step. So you can find multiple /Contents tags on the same "line" of the file.
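As a small illustration of the double trailing @ on its own, here is a minimal sketch that is unrelated to the .fdf file. The data set name PAIRS and the variables NAME and VALUE are made up for the example.

```sas
/* Sketch: read several name/value pairs from each input line. */
/* PAIRS, NAME and VALUE are hypothetical names.               */
data pairs ;
  /* @@ holds the current line across iterations of the data step, */
  /* so each iteration picks up the next pair on the same line.    */
  input name $ value @@ ;
datalines;
a 1 b 2 c 3
d 4
;
run;
```

This should produce four observations, one per pair, even though there are only two data lines. Without the @@ each iteration would start on a new line and only the first pair of each line would be read.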
The @S moves the pointer back to where it started reading the string used to find the page number, just in case that read accidentally went past the start of the next /Contents tag.
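For the very long records mentioned in the question, a sketch of the same step with the LRECL= option added might look like this. The path "file.fdf" is taken from the question, and the value 1000000 is only an assumed upper bound on the record length, so adjust both to the real file. This has not been run against the actual export.

```sas
/* Sketch only: "file.fdf" and lrecl=1000000 are assumptions. */
data want ;
  infile "file.fdf" lrecl=1000000 column=cc ;
  input @'<</Contents(' variable :$200. @ ;
  variable = scan(variable,1,')') ;
  input @'Page ' @ ;
  s = cc ;                     /* column just after "Page " */
  input string :$20. @ ;       /* this read may run past ">>" into the next tag */
  Page = scan(string,1,'>') ;
  output ;
  input @s @@ ;                /* rewind to S so the next search starts before anything read over */
  drop s string ;
run ;
```

Because the search is done by the INPUT statement rather than on a copy of _INFILE_ held in a character variable, the 32,767 character limit on character variables should not get in the way of finding the tags.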