SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Extract conditional information from text

Accepted Solution Solved
Reply
Regular Contributor
Posts: 228
Accepted Solution

Extract conditional information from text

Hi everyone.

Can I please ask your help on this because I have racked my brain for hours to solve this but sadly I still can't figure out how Smiley Sad

Your help would be very much appreciated Smiley Happy

I wish to produce data 'Want' based on variable 'Comment' and 'ID' from data 'Have'.

Data 'Have'
ID=1, Comment='qwertyuiopQ2asdfgh.Q35 jkl Q88zxcvbnm Q151: plmkjhq66 rfevs'
ID=2, Comment='Q98ertyu Q77: asdfgh.Q88zvbq q111: mjqq42 rvsq1plo'

Data 'Want'
ID Question   Comment
1             qwertyuiop
1  Q2         asdfgh.
1  Q35        jkl 
1  Q88        zxcvbnm
1  Q151       plmkjh
1  Q66	      rfevs	
2  Q98        ertyu
2  Q77        asdfgh. 
2  Q88        zvbq
2  Q111       mjq
2  Q42	      rvs 
2  Q1	      plo

 


Accepted Solutions
Solution
‎09-30-2016 07:42 AM
Respected Advisor
Posts: 4,173

Re: Extract conditional information from text

I couldn't come up with something simpler than in below code and it will eventually be a bit hard to digest in case you're not familiar with Regular Expressions ( http://support.sas.com/resources/papers/proceedings16/2480-2016.pdf )

data Have;
  ID=1;
  text='qwertyuiopQ2asdfgh.Q35 jkl Q88zxcvbnm Q151: plmkjhq66 rfevs';
  output;
  ID=2;
  text='Q98ertyu Q77: asdfgh.Q88zvbq q111: mjqq42 rvsq1plo';
  output;
  text='  ';
  output;
run;

data want(drop=_:);
  if _n_=1 then
    do;
      retain _prxid_q;
      _prxid_q=prxparse('/q\d+/oi');
    end;

  set have;

  if prxmatch('/^\s*q\d/i',text) ne 1 and lengthn(text) > 0 then _text='Q9999'||text;
  else _text=text;

  _start=1;
  _stop=length(_text);
  length question $5 answer $20;

  call prxnext(_prxid_q, _start, _stop, _text, _position, _length);
  do while (_position > 0);
    question = substr(_text, _position, _length);
    _start_answer=sum(_position,lengthn(question));

    call prxnext(_prxid_q, _start, _stop, _text, _position, _length);
    if _position>0 then
      do;
        answer=substrn(_text,_start_answer,_position-_start_answer);
      end;
      else
      do;
        answer=substrn(_text,_start_answer);
      end;
    output;
  end;

run;

View solution in original post


All Replies
Super User
Super User
Posts: 7,993

Re: Extract conditional information from text

Well, the problem you are facing is that there doesn't appear to be any logical pattern to the comment.  For instance, you can get some way there by using scan, see example below, however Q isn't constant, it is sometimes, and that could also be elsewhere in the text.  Othertimes its not even present.  This is why most programmers will refuse to process comments data, you may be quicker just manually going through the data yourself.

Data Have;
ID=1; Comment='qwertyuiopQ2asdfgh.Q35 jkl Q88zxcvbnm Q151: plmkjhq66 rfevs'; output;
ID=2; Comment='Q98ertyu Q77: asdfgh.Q88zvbq q111: mjqq42 rvsq1plo'; output;
run;

data want;
  set have;
  do i=1 to countw(comment,"Q");
    section=scan(comment,i,"Q");
    output;
  end;
run;
Super User
Posts: 7,859

Re: Extract conditional information from text

[ Edited ]

So you can use a capital Q as the separator?

In the set statement of the data step, rename comment to oldcomment.

Define a suitable length for comment

For the first record, I'd set question to blank, and comment to scan(oldcomment,1,'Q')

Then remove the first part by setting oldcomment to substr(oldcomment,index(oldcomment,'Q')+1)

Then in do a while loop (check for the presence of Q in the condition):

- scan for no-number characters at the beginning of the string with notdigit(); this lets you extract the question number

- then, substr() everything after the number up to the next Q

- output

- do substr(oldcomment,index(oldcomment,'Q')+1) once again

- after the loop has finished, process the remainder of oldcomment just like in the loop

- keep only id, question and comment

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Solution
‎09-30-2016 07:42 AM
Respected Advisor
Posts: 4,173

Re: Extract conditional information from text

I couldn't come up with something simpler than in below code and it will eventually be a bit hard to digest in case you're not familiar with Regular Expressions ( http://support.sas.com/resources/papers/proceedings16/2480-2016.pdf )

data Have;
  ID=1;
  text='qwertyuiopQ2asdfgh.Q35 jkl Q88zxcvbnm Q151: plmkjhq66 rfevs';
  output;
  ID=2;
  text='Q98ertyu Q77: asdfgh.Q88zvbq q111: mjqq42 rvsq1plo';
  output;
  text='  ';
  output;
run;

data want(drop=_:);
  if _n_=1 then
    do;
      retain _prxid_q;
      _prxid_q=prxparse('/q\d+/oi');
    end;

  set have;

  if prxmatch('/^\s*q\d/i',text) ne 1 and lengthn(text) > 0 then _text='Q9999'||text;
  else _text=text;

  _start=1;
  _stop=length(_text);
  length question $5 answer $20;

  call prxnext(_prxid_q, _start, _stop, _text, _position, _length);
  do while (_position > 0);
    question = substr(_text, _position, _length);
    _start_answer=sum(_position,lengthn(question));

    call prxnext(_prxid_q, _start, _stop, _text, _position, _length);
    if _position>0 then
      do;
        answer=substrn(_text,_start_answer,_position-_start_answer);
      end;
      else
      do;
        answer=substrn(_text,_start_answer);
      end;
    output;
  end;

run;
Regular Contributor
Posts: 228

Re: Extract conditional information from text

Dear @RW9 @KurtBremser @Patrick

Thank you very much for your help.

Yes, Patrick. PERL is indeed completely new to me but I will definately try to self-learn now.

Have an awesome weekend!

Super User
Posts: 10,044

Re: Extract conditional information from text

I love this question.


data Have;
  ID=1;
  text='qwertyuiopQ2asdfgh.Q35 jkl Q88zxcvbnm Q151: plmkjhq66 rfevs';
  output;
  ID=2;
  text='Q98ertyu Q77: asdfgh.Q88zvbq q111: mjqq42 rvsq1plo';
  output;
run;

data temp;
 set have;
 length temp $ 400;
 n=0;group=0;
 pid=prxparse('/q\d+:?/io');
 s=1;
 e=length(text);
 call prxnext(pid,s,e,text,p,l);
 do while(p gt 0);
 n+1;
  if n=1 then do;
    if p ne 1 then do;group+1;temp=' ';output;temp=substr(text,1,p-1);output;end;
    group+1;temp=compress(substr(text,p,l),':');output;
  end;
  else do;
   temp=substr(text,lag_p+lag_l,p-lag_p-lag_l);output;
   group+1;temp=compress(substr(text,p,l),':');output;
  end;
  
  lag_p=p;lag_l=l;
  call prxnext(pid,s,e,text,p,l);
 
  if p le 0 then do;temp=substr(text,lag_p+lag_l);output;end;
 
 end;
keep id temp group;
run;
proc transpose data=temp out=want;
 by id group;
 var temp;
run;
  

Super User
Posts: 11,343

Re: Extract conditional information from text

And a completely different approach:

Go to who ever created that mess a see what options for other data formats there may have been. This appears to be exported from some sort of data entry or survey software and suspect some one was lazy or did not pay attention to options when extracting data.

Super User
Super User
Posts: 7,993

Re: Extract conditional information from text

@ballardw, best advice of the thread.  In a proper process you could go back to the agreed upon Data Transfer Document, but it seems like nobody bothers with silliness such as documented, validated, repeatable procedures anymore.  Just dump any old garbage into Excel and jobs a good one.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 616 views
  • 6 likes
  • 6 in conversation