LuisB
Fluorite | Level 6

Hi community.

 

I'm trying to parse an .fdf file that looks something like this:

File.fdf

-----------------------------------------------------------

askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>

hjkasdfkjhsdfe

-------------------------------------------------------

 

I basically want to recover (1) the variable names, which are always between "Contents(" and the next ")", and (2) the corresponding page number, which is always after "Page " and before the next ">>".

I have been trying to use a DATA step with an INFILE statement. I can read a single variable and page per input line, but I cannot work out how to read several variables and pages from a single _INFILE_ line. This is compounded by the fact that the lines in the .fdf file are very long (> 50,000 characters).

Is there a way to split the file into separate lines directly in SAS? This is more or less what I have tried so far:

 

SAS_SCRIPT

------------------------------------------------------------

data;

  infile "file.fdf" linesize=32767 N=10000;

  input;

  /*Gather occurrence for the loop*/
  vars =countc(_infile_,"Contents(");
  pags=countc(_infile_,"Page ");

  pos=1;

  /* Start loop to find all occurrences of variables */
  do i=0 TO min(vars,pags);

    pos_var = find(_infile_, "Contents(",pos)+9;

    if ( pos_var > 9 ) then do;

      pos = pos_var;
      length_var = find(_infile_, ")",pos_var) - pos_var;

      if ( length_var > 0 ) then do;

        pos_page = find(_infile_,")",pos_var) + 5;
        if ( pos_page > 5 ) then do;
          length_page = find(_infile_, ">",pos_page) - pos_page;
          if ( length_page > 0 ) then do;
            /* Now input */
            var = substr(_infile_, pos_var, length_var);
            page = substr(_infile_, pos_page, length_page);
          end;

        end;

      end;

    end;

  end;

run;

------------------------------------------------------------

 

 

Expected output

-------------------------------------------------------

name         page
variable1    1
variable1    2
variable2    1
.... etc

-------------------------------------------------------

Any ideas?

1 ACCEPTED SOLUTION

Tom
Super User

That does not look like the right output for that input file. One of the <</Contents( strings is missing the slash.

First let's make an actual text file we can use as input.

options parmcards=fdf ;
filename fdf temp;
parmcards4;
askdjf;lk
fdasfe
qweiopqwur
<</Contents(variable1)-akljsdfkj
Page 1>><<xj/variable2-akljsdfkj Page 1>><</Contents(variable2) -akljsdfkj
Page 4>>4564324<</Contents(variable2-akljsdfkj Page 2>><<Contents(ar/variable1)-akljsdfkj
Page 3>>
<</Contents(variable1) -akljsdfkj
Page 4>>

hjkasdfkjhsdfe
;;;;

Now let's use the @'string' pointer control to find the starting points and the SCAN() function to remove the trailing characters.

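data want ;
  /* COLUMN= keeps the current column pointer in CC */
  infile fdf column=cc ;
  /* jump past the next '<</Contents(' and read up to the next blank */
  input @'<</Contents(' variable :$200. @;
  /* keep only the text before the first ')' */
  variable=scan(variable,1,')');
  /* jump past the next 'Page ' and remember where the page number starts */
  input @'Page ' @ ;
  s = cc ;
  /* read the next word and keep the part before '>' */
  input string :$20. @;
  Page = scan(string,1,'>') ;
  output;
  /* go back to the saved column and hold the line for the next iteration */
  input @s @@;
  drop s string;
run;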

Result:

Obs    variable               Page

 1     variable1               1
 2     variable2               4
 3     variable2-akljsdfkj     2
 4     variable1               4

If the variable names can never contain ')', '>' or a blank, it could be even simpler, because those characters can then serve as delimiters:

data want ;
  infile fdf flowover dlm=')> ';
  input @'<</Contents(' variable :$200. @'Page ' page @@ ;
run;
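
For the very long lines mentioned in the question (over 50,000 characters), the same step can be pointed at the real file with a larger LRECL= value. The following is only a sketch: the path "file.fdf" is taken from the question, and the LRECL= value of 1000000 is an assumption that should be set to at least the longest line in the file (the allowed maximum depends on the SAS release and host).

```sas
/* Sketch only: the path and the LRECL= value are assumptions. */
data want ;
  /* LRECL= must be at least as long as the longest line in the file */
  infile "file.fdf" lrecl=1000000 flowover dlm=')> ';
  /* find each '<</Contents(' tag, then the next 'Page ', holding the line */
  input @'<</Contents(' variable :$200. @'Page ' page @@ ;
run;
```

If LRECL= is too small, long lines may be truncated or split, and tags beyond that point could be missed.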


9 REPLIES
LuisB
Fluorite | Level 6
I also tried the following, unsuccessfully:
-----------------------------------------
data work.import;
infile 'file.fdf' linesize=32767;
length page $3.;

do i=0 to 1000 until (page='');
input
@'Contents(' variable $30.
@'Page ' page $3. @;
end;
stop;
run;
-------------------------------------
data_null__
Jade | Level 19

Given your example, please show the expected output.


Tom
Super User

Do you have a design document that describes the file format?

It does not look like an Adobe Form Data File.  https://docs.appligent.com/fdfmerge/fdfmerge-form-data-format/

 

LuisB
Fluorite | Level 6
Hi Tom, thank you for reading my topic. The example data I provided was indeed just an abstraction of an old FDF file. It doesn't represent the exact file format. I basically have an export of comments from a PDF that I'm unable to provide at the moment. I will look for a document describing the exact format.
LuisB
Fluorite | Level 6
Added! thank you!
LuisB
Fluorite | Level 6

@Tom 

 

I think this, combined with the LRECL option, definitely solves my problem. I still don't fully understand why, though. Is the last @@ the trigger to move to the next 'line'? Why is the input of the last 's' required even after you OUTPUT?

Tom
Super User

The double trailing @ holds the input pointer on the current line for the next iteration of the data step.  So you can find multiple /Contents tags on the same "line" of the file.
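
A small, self-contained sketch of that behaviour follows. The data lines and the data set name PAGES are invented for illustration and are not part of the original file.

```sas
/* Sketch only: the data lines and the data set name PAGES are invented. */
data pages;
  /* treat '>' and blank as delimiters so that "1>>" reads as the number 1 */
  infile datalines dlm='> ';
  /* jump past the next 'Page ', read a number, and hold the line with @@ */
  input @'Page ' page @@;
datalines;
aaa Page 1>> bbb Page 2>> ccc
Page 3>>
;
run;

proc print data=pages;
run;
```

Because the line is held, the first data line should contribute two observations before the search for 'Page ' flows over to the second line, so the expectation is three observations in total (pages 1, 2 and 3).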

The @S is to move the pointer back to where it started to read the string used to find the page number, just in case that read accidentally went past the start of the next /Contents tag.
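
The save-and-restore idea can be tried in isolation with a sketch like the one below. The data line and the variable names FIRST_READ and SECOND_READ are made up for illustration.

```sas
/* Sketch only: the data line and variable names are invented. */
data _null_;
  infile datalines column=cc;      /* CC tracks the current column pointer */
  input @'Page ' @;                /* move just past 'Page ' and hold the line */
  s = cc;                          /* remember that column */
  input first_read :$20. @;        /* reading moves the pointer forward */
  input @s second_read :$20.;      /* @s moves it back to the saved column */
  put s= first_read= second_read=; /* write the values to the log */
datalines;
abc Page 12>> tail
;
run;
```

Both reads start at the same saved column, so FIRST_READ and SECOND_READ should hold the same text in the log. In the solution above, the same reset means the next search for '<</Contents(' starts right after 'Page ' instead of after the 20-character read.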
