Hi,
I have a large block of text that I am trying to extract sentences from. The sentences I'm interested in begin with the word "failed" and end with a period. However, the text sometimes includes several "failed to" phrases, and the code I'm using now captures only the first instance instead of each one.
This is an example of the large block of text I am working with:
Based on document review and interview, it was determined John failed to properly put away all materials used during construction. This could result in damage to the work place and possible injury to co-works. It was also noted that Dave failed to secure the ladder at the end of his shift. Additionally, Deborah failed to properly shut down her computer before leaving for the day.
Below is the code I have been using. Another possible start phrase is "it was determined" and another possible end phrase is "Findings", but I'm really primarily concerned with extracting between "failed" and the first period.
data test3;
    set test2;
    failed = index(text,'failed');
    determined = index(text,'it was determined');
    findings = index(text,'Findings');
    if findings ne 0 then do;
        if failed ne 0 then do;
            tmp = substr(text,failed+0);
            put tmp;
            pos2 = index(tmp,"Findings");
            Extract1 = substr(tmp,1,pos2-1);
            put Extract1;
        end;
        else if failed = 0 then do;
            tmp2 = substr(text,determined+0);
            put tmp2;
            pos4 = index(tmp2,"Findings");
            Extract2 = substr(tmp2,1,pos4-1);
            put Extract2;
        end;
    end;
    if findings = 0 then do;
        if failed ne 0 then do;
            tmp = substr(text,failed+0);
            put tmp;
            pos2 = index(tmp,'.');
            Extract1 = substr(tmp,1,pos2-1);
            put Extract1;
        end;
        else if failed = 0 then do;
            tmp2 = substr(text,determined+0);
            put tmp2;
            pos4 = index(tmp2,'.');
            Extract2 = substr(tmp2,1,pos4-1);
            put Extract2;
        end;
    end;
    keep text Extract1 Extract2;
run;
Thank you for any help!
Take a look at the CALL PRXNEXT routine.
Here is the example code from the documentation, with the regular expression adjusted to match from "failed" to the next period:
data _null_;
    ExpressionID = prxparse('/failed.*?\./');
    text = 'The woods have a failed here for some reason. bat, cat, and failed here with some other text. a rat!';
    start = 1;
    stop = length(text);
    /* Use PRXNEXT to find the first instance of the pattern, */
    /* then use DO WHILE to find all further instances. */
    /* PRXNEXT changes the start parameter so that searching */
    /* begins again after the last match. */
    call prxnext(ExpressionID, start, stop, text, position, length);
    do while (position > 0);
        found = substr(text, position, length);
        put found= position= length=;
        call prxnext(ExpressionID, start, stop, text, position, length);
    end;
run;
This worked and put all the instances of failed into the log. Is there a way to extract them into a new variable instead of being put in the log?
Basic approach:
1) Replace data _null_ with a DATA statement that names the output data set: data yourdatasetnamegoeshere;
2) Found is the name of the new variable. If you replace
put found= position= length=;
with
output;
the step writes the current record, including the variables ExpressionID, Start, Stop, Found, Position and Length, to the data set each time a match is found. Use a DROP statement to keep any of those variables out of the data set. For example, this keeps Start and Stop out of the data set:
drop start stop;
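Putting those two changes together, a minimal sketch of the whole step might look like the following. It assumes the input data set is test2 with a character variable text, as in the original question; the output data set name failed_sentences is made up for illustration. Because the step contains an explicit OUTPUT statement, records with no match write no rows at all.

```sas
/* Sketch only: one output row per "failed ... ." match.          */
/* Assumes test2 has a character variable named text (from the    */
/* question); failed_sentences is a hypothetical output name.     */
data failed_sentences;
    set test2;

    /* Compile the pattern once and keep the ID across records. */
    retain ExpressionID;
    if _n_ = 1 then ExpressionID = prxparse('/failed.*?\./');

    start = 1;
    stop  = length(text);

    /* Find the first match, then loop until no more are found. */
    call prxnext(ExpressionID, start, stop, text, position, length);
    do while (position > 0);
        /* found takes the same length as text on first assignment */
        found = substr(text, position, length);
        output;   /* writes the current record once per match */
        call prxnext(ExpressionID, start, stop, text, position, length);
    end;

    /* Keep only the original variables plus found. */
    drop ExpressionID start stop position length;
run;
```

With the sample paragraph from the question, this should give three rows, one each for the John, Dave and Deborah sentences, with the full text repeated on every row. If you only want the extracted sentences, add text to the DROP statement.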