cqr525
Calcite | Level 5

Hi,

 

I have a large block of text that I am trying to extract sentences from. The sentences I'm interested in begin with the word "failed" and end with a period. However, the text sometimes contains several "failed to" phrases, and the code I'm using now captures only the first instance rather than each one.

 

This is an example of the large block of text I am working with:

Based on document review and interview, it was determined John failed to properly put away all  materials used during construction. This could result in damage to the work place and possible injury to co-works. It was also noted that Dave failed to secure the ladder at the end of his shift. Additionally, Deborah failed to properly shut down her computer before leaving for the day. 

 

Below is the code I have been using. Another possible start phrase is "it was determined" and another possible end phrase is "Findings", but I'm primarily concerned with extracting the text between "failed" and the first period.

 

data test3;
   set test2;
   failed = index(text,'failed');
   determined = index(text,'it was determined');
   findings = index(text,'Findings');
   /* "Findings" is present: extract up to "Findings" */
   if findings ne 0 then do;
      if failed ne 0 then do;
         tmp = substr(text,failed+0);
         put tmp;
         pos2 = index(tmp,"Findings");
         Extract1 = substr(tmp,1,pos2-1);
         put Extract1;
      end;
      else if failed = 0 then do;
         tmp2 = substr(text,determined+0);
         put tmp2;
         pos4 = index(tmp2,"Findings");
         Extract2 = substr(tmp2,1,pos4-1);
         put Extract2;
      end;
   end;
   /* "Findings" is absent: extract up to the first period */
   if findings = 0 then do;
      if failed ne 0 then do;
         tmp = substr(text,failed+0);
         put tmp;
         pos2 = index(tmp,'.');
         Extract1 = substr(tmp,1,pos2-1);
         put Extract1;
      end;
      else if failed = 0 then do;
         tmp2 = substr(text,determined+0);
         put tmp2;
         pos4 = index(tmp2,'.');
         Extract2 = substr(tmp2,1,pos4-1);
         put Extract2;
      end;
   end;
   keep text Extract1 Extract2;
run;

 

Thank you for any help!

4 REPLIES
AMSAS
SAS Super FREQ

Take a look at the CALL PRXNEXT routine.

Here is its example code with the regular expression adjusted:

data _null_;
   ExpressionID = prxparse('/failed.*?\./');
   text = 'The woods have a failed here for some reason. bat, cat, and failed here with some other text. a rat!';
   start = 1;
   stop = length(text);
      /* Use PRXNEXT to find the first instance of the pattern, */
      /* then use DO WHILE to find all further instances.       */
      /* PRXNEXT changes the start parameter so that searching  */
      /* begins again after the last match.                     */
   call prxnext(ExpressionID, start, stop, text, position, length);
      do while (position > 0);
         found = substr(text, position, length);
         put found= position= length=;
         call prxnext(ExpressionID, start, stop, text, position, length);
      end;
run;
cqr525
Calcite | Level 5

This worked and put all the instances of failed into the log. Is there a way to extract them into a new variable instead of being put in the log? 

ballardw
Super User

@cqr525 wrote:

This worked and put all the instances of failed into the log. Is there a way to extract them into a new variable instead of being put in the log? 


 

 

Basic approach: 1) Replace data _null_ with: Data yourdatasetnamegoeshere.

2) Found would be the name of the new variable. If you replace

put found= position= length=;

with

Output;

it will write the current record, including the variables Start, Stop, Found, Position and Length, to the data set each time a match is found. Use a DROP statement to keep any of those variables out of the data set. For example, the following statement keeps Start and Stop from being written:

drop start stop;
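
Putting those two changes together, a minimal sketch of the whole step might look like the following. The data set name `want` is hypothetical, the input `test2` is rebuilt here from two sentences of the sample text only so the step can run on its own, and `found` takes its length from `text` because no explicit length is set.

```sas
/* Stand-in for the poster's test2, built from part of the sample text */
data test2;
   length text $1000;
   text = 'It was also noted that Dave failed to secure the ladder at the end of his shift. Additionally, Deborah failed to properly shut down her computer before leaving for the day.';
run;

data want;                             /* hypothetical output data set name */
   set test2;
   /* A constant pattern is compiled only once by PRXPARSE */
   ExpressionID = prxparse('/failed.*?\./');
   start = 1;
   stop = length(text);
   call prxnext(ExpressionID, start, stop, text, position, length);
   do while (position > 0);
      found = substr(text, position, length);
      output;                          /* one row per match */
      call prxnext(ExpressionID, start, stop, text, position, length);
   end;
   /* Keep only TEXT and FOUND in the output */
   drop ExpressionID start stop position length;
run;
```

Because OUTPUT sits inside the loop, a block of text with no match writes no row at all, and a block with several matches writes one row per match with `text` repeated on each.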
cqr525
Calcite | Level 5
This is great and worked, thank you very much!

My last question: is it possible to output each extraction as its own variable? I'll have multiple blocks of text to look through, and if all the found text goes into one variable the output will be very long and messy to read.
So ideally my variables would be:

Text, Found1, Found2, Found3,... with each Found being an instance of "failed to..." within the block of text.

I really appreciate all the help!
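
One possible way to get that layout is to store each match in an array element instead of writing a row per match. This is only a sketch under stated assumptions: no block has more than 10 matches, each match fits in 500 characters, and the names `want_wide` and `n` are hypothetical. The `test2` step again stands in for the real input.

```sas
/* Stand-in for the poster's test2, built from part of the sample text */
data test2;
   length text $1000;
   text = 'It was also noted that Dave failed to secure the ladder at the end of his shift. Additionally, Deborah failed to properly shut down her computer before leaving for the day.';
run;

data want_wide;                        /* hypothetical output data set name */
   set test2;
   array found{10} $500;               /* creates Found1-Found10; both sizes are assumptions */
   ExpressionID = prxparse('/failed.*?\./');
   start = 1;
   stop = length(text);
   n = 0;                              /* matches stored so far for this row */
   call prxnext(ExpressionID, start, stop, text, position, length);
   do while (position > 0 and n < dim(found));
      n = n + 1;
      found{n} = substr(text, position, length);
      call prxnext(ExpressionID, start, stop, text, position, length);
   end;
   /* No OUTPUT statement: the implicit output writes one row per input row */
   drop ExpressionID start stop position length n;
run;
```

Matches beyond the tenth are silently ignored here, so the array size should be set to at least the largest number of matches expected in any one block. Unused Found variables are left blank.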
