Extracting string from line below known string.

New Contributor
Posts: 2

 I have a few hundred text documents from a survey that are in the following format.

 

Question1?

 

Response

 

Question2:

 

Response

 

Not all questions end in ?; some end in . or :. There is always a blank line between the question and the response and a blank line after the response, but the response can take up several lines (paragraph type). I would like to use the question strings to search for the response that is two lines below each one. Since the responses take up a different number of lines depending on the respondent, I can't simply use the line number. What is the best way to extract this data?

Grand Advisor
Posts: 10,211

Re: Extracting string from line below known string.

Are you saying that the actual text QUESTION1, QUESTION2, ..., QUESTION100 appears in the data as the first bit of a line? How many of these questions are there?

 

Do you expect to read " response can take up several lines (paragraph type)" into a single response value? Is it possible that the apparent multiple line response is actually the result of word wrapping a long line in a viewer or are there actual end-of-line characters?

 

How do you associate these responses to a specific respondent?

 

It may help us to know what filetype the existing data is in as well, text, excel, something else.

 

 

New Contributor
Posts: 2

Re: Extracting string from line below known string.

They are .txt files that were extracted from PDFs. There are 64 questions. The lines aren't word wrapped; there are actual end-of-line characters. The respondent's ID is always the first line of the document.

 

ex.

 

123456789

 

Do you have a dog?

 

Yes

 

What is your favorite color?

 

Blue

 

Please describe your experience with niantic servers?

 

Very long response.  Broken

into several

lines.

Grand Advisor
Posts: 10,211

Re: Extracting string from line below known string.

Is there any way to go to the source of those PDFs and see if there is an alternate data source or if they can export to a better structured file such as CSV?

Since the most likely next step after reading one of these files is to read more and then combine them, you have lots of potential for mismatched lengths of character variables (different length responses by different respondents) and possibly even mismatched data types.

If you can get the data in a better format you will likely be much happier.

 

I can give a stub of one approach, which may not be the slickest.

Read an entire line of data into a string variable and test its value. If it is a known question, input the line two below it as the value of that response. If it isn't known question text, the value should be concatenated to the response for the most recently read question.


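data raw;
   /* truncover: short or blank lines do not make INPUT flow to the next line */
   infile "filename" lrecl=256 truncover eof=EndOfFile;
   length Id $10 tstr Q1-Q64 $256;
   array qval {*} Q1-Q64;
   retain Id Q1-Q64;
   retain qnum 0;
   input tstr $char256.;   /* read the whole line, embedded blanks included */
   select (strip(tstr));
      when ('text of question 1') do; 
                                    input // Q1 $char256.;   /* response starts two lines down */
                                    qnum=1;
                                  end;
      when ('text of question 2') do; 
                                    input // Q2 $char256.;
                                    qnum=2;
                                  end;
      when ('text of question 3') do; 
                                    input // Q3 $char256.;
                                    qnum=3;
                                  end;
/* repeat the pattern*/
      when ('text of question 64') do; 
                                    input // Q64 $char256.;
                                    qnum=64;
                                  end;
      otherwise do;
                  /* ignore blank separator lines; the first text line is the ID */
                  if not missing(tstr) then do;
                     if qnum=0 then id=tstr;
                     else qval(qnum)=catx(' ',qval(qnum),tstr);
                  end;
                end;
   end; /*select*/
   return;   /* keep normal iterations from falling into the end-of-file block */

EndOfFile: if qnum=64 then output;
   stop;
run;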
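
Since there are a few hundred of these files, it may help to pull every line of every file into one table first and keep track of which file each line came from. The step below is only a sketch of that idea, not a tested solution: the folder C:\surveys is a made-up location, and the data set and variable names (all_lines, line, source, lineno) are placeholders. It relies on a wildcard in the INFILE path and on the FILENAME= option, which stores the name of the file currently being read.

```sas
/* Sketch only: the folder path and all names here are assumptions. */
data all_lines;
  length fname source $260;
  /* the wildcard reads every .txt file in the folder, one after another */
  infile "C:\surveys\*.txt" filename=fname lrecl=256 truncover;
  input line $char256.;
  /* the FILENAME= variable is not written to the output, so copy it */
  source = fname;
  /* restart the line counter whenever a new file begins */
  if source ne lag(source) then lineno = 0;
  lineno + 1;
run;
```

With the lines in a data set, the same question-matching logic can be applied with a SET statement instead of INFILE, and lineno=1 identifies the respondent ID line of each file.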
Respected Advisor
Posts: 3,836

Re: Extracting string from line below known string.

It appears that there is no unique text pattern (something like a line starting with 'Q<digits> : ') that clearly identifies a line of text as a question. That means you have to explicitly search for the full question text.

 

Below is a variant of what @ballardw already proposed.

 

/* create a sample source file */
filename myfile temp;
data _null_;
  file myfile;
  infile datalines;
  input;
  put _infile_;
  datalines;
123456789

Do you have a dog?
 
Yes
 
What is your favorite color?
 
Blue
 
Please describe your experience with niantic servers?
 
Very long response.  Broken
into several
lines.

Anything else?
Not really, no
;
run;


/*** option 1 using simple informat ***/

/* create informats to identify questions */
proc format;
   invalue isquestion (default=255)
      'DO YOU HAVE A DOG?'=1
      'WHAT IS YOUR FAVORITE COLOR?'=2
      'PLEASE DESCRIBE YOUR EXPERIENCE WITH NIANTIC SERVERS?'=3
      'ANYTHING ELSE?'=4
      other=.;
run;

/* read the data into a SAS table */
data want1;
  infile myfile lrecl=255 end=last;
  input;
  length id q_id 8 question $255 answer $5000;
  retain id q_id question answer;

  /* new survey: only digits in _infile_ */
  if _infile_ ne ' ' and compress(_infile_,,'d')=' ' then 
    do;
      call missing (of _all_);
      id=input(_infile_,best32.);
      return;
    end;

  /* new question: informat doesn't return a missing value */
  if input(upcase(_infile_),isquestion255.) ne . then
    do;
      if not missing(q_id) then output;
      call missing(answer);
      q_id=input(upcase(_infile_),isquestion255.);
      question=_infile_;
      return;
    end;

  /* answer: any line that is neither a new survey nor a question */
  if not missing(_infile_) then answer=catx(' ',answer,_infile_);

  /* end of file: check required to not miss the last item */
  if last then output;

run;



/*** option 2 using informat with regular expression ***/

/* create informats to identify questions */
proc format;
   invalue q01rxp (default=255) 
      '/^\s*Do you have a dog\?\s*$/i' (regexp) = 1     
      other=.;
   invalue q02rxp (default=255) 
      '/^\s*What is your favorite color\?\s*$/i' (regexp) = 2      
      other=[q01rxp255.];
   invalue q03rxp (default=255) 
      '/^\s*Please describe your experience with niantic servers\?\s*$/i' (regexp) = 3      
      other=[q02rxp255.];
   invalue qlastrxp (default=255) 
      '/^\s*anything else\?\s*$/i' (regexp) = 4      
      other=[q03rxp255.];
run;

/* read the data into a SAS table */
data want2;
  infile myfile lrecl=255 end=last;
  input;
  length id q_id 8 question $255 answer $5000;
  retain id q_id question answer;

  /* new survey: only digits in _infile_ */
  if prxmatch('/^\s*\d+\s*$/o',_infile_) then 
    do;
      call missing (of _all_);
      id=input(_infile_,best32.);
      return;
    end;

  /* new question: informat doesn't return a missing value */
  if input(_infile_,qlastrxp255.) ne . then
    do;
      if not missing(q_id) then output;
      call missing(answer);
      q_id=input(_infile_,qlastrxp255.);
      question=_infile_;
      return;
    end;

  /* answer: any line that is neither a new survey nor a question */
  if not missing(_infile_) then answer=catx(' ',answer,_infile_);

  /* end of file: check required to not miss the last item */
  if last then output;

run;
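
Both options produce one row per question. If one row per respondent is what is needed in the end, the long table can be pivoted afterwards. A minimal sketch, assuming the want2 table created above; the names want2_sorted and wide and the Q prefix are just placeholders:

```sas
/* Sketch: pivot the long table (one row per question)  */
/* into a wide one (one row per respondent).            */
proc sort data=want2 out=want2_sorted;
  by id q_id;
run;

proc transpose data=want2_sorted out=wide(drop=_name_) prefix=Q;
  by id;        /* one output row per respondent          */
  id q_id;      /* question number becomes the suffix     */
  var answer;   /* answer text becomes the column value   */
run;
```

For the sample file this should give a single row for respondent 123456789 with columns Q1 to Q4. Keeping the long layout until the very end has the advantage that answer has the same length for every question and every respondent.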
Grand Advisor
Posts: 9,578

Re: Extracting string from line below known string.

Is there any pattern that identifies which lines are question lines?
Either way, I think creating a group variable would give you a good start.



data have;
  input;
  have=_infile_;
  if missing(_infile_) then group+1;
  datalines;
123456789

Do you have a dog?
 
Yes
 
What is your favorite color?
 
Blue
 
Please describe your experience with niantic servers?
 
Very long response.  Broken
into several
lines.

Anything else?
Not really, no
;
run;
proc print noobs;run;
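
Once every line carries a group number, the lines of a block can be glued back together, which takes care of the responses that span several lines. This is a rough sketch that assumes the have data set from the step above; blocks and text are placeholder names:

```sas
/* Sketch: collapse the lines of each group into one string. */
data blocks;
  length text $5000;
  retain text;
  set have;
  by group;
  if first.group then call missing(text);
  /* catx skips blank arguments, so separator lines add nothing */
  text = catx(' ', text, have);
  if last.group and not missing(text) then output;
  keep group text;
run;
```

Note that this only works when a blank line really separates every question from its response. In the sample above, "Anything else?" and its answer have no blank line between them, so they end up in the same block.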

