Hi community.
I'm trying to parse an .fdf file that looks something like this:
File.fdf
-----------------------------------------------------------
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>
hjkasdfkjhsdfe
-------------------------------------------------------
I basically want to recover (1) the variable names that are always in-between "Contents(" and the next ")", and (2) the corresponding page number that is always after "Page " and before the next ">>".
I have been trying to use a DATA step with an INFILE. The issue I keep running into is how to read several VARIABLES and PAGES from a single _INFILE_ line; I can only make it work for a single VARIABLE and PAGE per input line. This is compounded by the fact that the lines in the .fdf file are very long (> 50,000 characters).
Is there a way in which perhaps I can split the file directly from SAS into different lines? This is more or less what I have tried so far:
SAS_SCRIPT
------------------------------------------------------------
data;
  infile "file.fdf" linesize=32767 N=10000;
  input;
  /* Gather occurrences for the loop */
  vars = countc(_infile_, "Contents(");
  pags = countc(_infile_, "Page ");
  pos = 1;
  /* Start loop to find all occurrences of variables */
  do i = 0 to min(vars, pags);
    pos_var = find(_infile_, "Contents(", pos) + 9;
    if ( pos_var > 9 ) then do;
      pos = pos_var;
      length_var = find(_infile_, ")", pos_var) - pos_var;
      if ( length_var > 0 ) then do;
        pos_page = find(_infile_, ")", pos_var) + 5;
        if ( pos_page > 5 ) then do;
          length_page = find(_infile_, ">", pos_page) - pos_page;
          if ( length_page > 0 ) then do;
            /* Now input */
            var = substr(_infile_, pos_var, length_var);
            page = substr(_infile_, pos_page, length_page);
          end;
        end;
      end;
    end;
  end;
run;
------------------------------------------------------------
Expected output
-------------------------------------------------------
name page
variable1 1
variable1 2
variable 2 1
.... etc
-------------------------------------------------------
Any ideas?
Accepted Solutions
That does not look like the right output for that input file. One of the <</Contents( strings is missing the slash.
First let's make an actual text file we can use as input.
options parmcards=fdf ;
filename fdf temp;
parmcards4;
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>
hjkasdfkjhsdfe
;;;;
Now let's use the @ command to find the starting points and the SCAN() function to remove the trailing letters.
data want ;
infile fdf column=cc ;
input @'<</Contents(' variable :$200. @;
variable=scan(variable,1,')');
input @'Page ' @ ;
s = cc ;
input string :$20. @;
Page = scan(string,1,'>') ;
output;
input @s @@;
drop s string;
run;
Result:
Obs    variable               Page
1      variable1              1
2      variable2              4
3      variable2-akljsdfkj    2
4      variable1              4
If there are limits on what characters can be valid in the variable name then it could be even simpler:
data want ;
infile fdf flowover dlm=')> ';
input @'<</Contents(' variable :$200. @'Page ' page @@ ;
run;
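To see what the DLM= list does in that shorter version, here is a minimal self-contained sketch. It reads one made-up in-stream line through DATALINES instead of the fdf fileref; the data set name demo, the variable name "name", and the sample line are assumptions for illustration only, not part of the real file.

```sas
/* Sketch only: the sample line below is invented for the demo.       */
data demo;
  /* ')' , '>' and blank all act as delimiters for list input,        */
  /* so a value stops at the closing parenthesis or at '>>'.          */
  infile datalines flowover dlm=')> ';
  input @'<</Contents(' name :$20.   /* jump past the tag, read the name  */
        @'Page ' page                /* jump past 'Page ', read a number  */
        @@;                          /* hold the line for the next pass   */
  datalines;
<</Contents(a1)-x Page 1>><</Contents(b2) -y Page 4>>
;
run;

proc print data=demo;
run;
```

With this sample line the step should write two observations (a1 with page 1, b2 with page 4). It only works when the names themselves never contain a blank, ")" or ">", which is the limitation mentioned above.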
-----------------------------------------
data work.import;
  infile 'file.fdf' linesize=32767;
  length page $3.;
  do i=0 to 1000 until (page='');
    input
      @'Contents(' variable $30.
      @'Page ' page $3. @;
  end;
  stop;
run;
-------------------------------------
Given your example, please show the expected output.
File.fdf
-----------------------------------------------------------
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>
hjkasdfkjhsdfe
-------------------------------------------------------
Do you have a design document that describes the file format?
It does not look like an Adobe Form Data File. https://docs.appligent.com/fdfmerge/fdfmerge-form-data-format/
Hi Tom, thank you for reading my topic. The example data I provided was indeed just an abstraction of an old FDF file. It doesn't represent the exact file format. I basically have an export of comments from a PDF that I'm unable to provide at the moment. I will look for a document describing the exact format.
I think this, combined with the LRECL option, definitely solves my problem. I still don't fully understand why, though. Is the last @@ the trigger to move to the next 'line'? Why is the final "input @s" required even after you OUTPUT?
The double trailing @ holds the input pointer on the current line for the next iteration of the data step, so you can find multiple /Contents tags on the same "line" of the file.
The @S moves the pointer back to where it was before reading the string used to find the page number, just in case that read accidentally went past the start of the next /Contents tag.
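A small sketch may make the two pointer controls easier to see. It uses an invented in-stream line in which the tags are packed together without blanks, so the list-input read of STRING runs into the following "Page" token; the data set name demo and the sample line are assumptions for illustration only.

```sas
/* Sketch only: the sample line below is invented for the demo.       */
data demo;
  infile datalines column=cc;   /* cc tracks the current pointer column */
  input @'Page ' @;             /* jump to just after the next 'Page '  */
  s = cc;                       /* remember where the page number starts */
  input string :$20. @;         /* reads up to the next blank, here it   */
                                /* swallows '>>Page' of the next tag too */
  page = scan(string, 1, '>');  /* keep only the part before '>'         */
  output;
  input @s @@;                  /* rewind to s, and hold the line for    */
                                /* the next iteration of the data step   */
  drop s string;
  datalines;
Page 1>>Page 2>>Page 3>>
;
run;

proc print data=demo;
run;
```

With the rewind in place this should give three observations with page values 1, 2 and 3. Without the "input @s" the pointer would be left beyond the start of the next "Page " tag, so the following search would be likely to skip it. For the real file, the same INPUT logic would sit behind an INFILE statement with a suitably large LRECL= value, as mentioned in the reply above.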