Hi, I'm extracting information from a string based on keywords that are pre-defined by regex patterns. My question is how do I get 4 words before and after the keywords, and save them into two separate columns? BIG thanks!
Let's say the regex pattern is defined as:
patternID = prxparse('/a \w+ fruit/i');
Accepted Solutions
Use prxPosn to extract sub buffers:
data test;
    length line before after $100;
    input line &;
    /* Group 1 = up to 4 words before, group 3 = the keyword phrase, group 4 = up to 4 words after */
    prxId = prxparse("/((\w+\W+){0,4})(a \w+ fruit)\W+((\w+\W+){0,4})/i");
    if prxmatch(prxId, line) then do;
        before = prxposn(prxId, 1, line);
        after = prxposn(prxId, 4, line);
    end;
    drop prxId;
datalines;
Some pretend that a tomato is a real fruit, others say it's a vegetable
;
proc print data=test noobs; run;
line                                                                      before            after
Some pretend that a tomato is a real fruit, others say it's a vegetable   that a tomato is  others say it's
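To see what each capture buffer holds, here is a minimal sketch that reuses the pattern above on the same test line and writes buffers 1 through 5 to the log. The loop bound of 5 (the number of parenthesized groups in this pattern) and the variable name buf are illustrative choices, not part of the original answer.

```sas
data _null_;
    length buf $100;
    line = "Some pretend that a tomato is a real fruit, others say it's a vegetable";
    prxId = prxparse("/((\w+\W+){0,4})(a \w+ fruit)\W+((\w+\W+){0,4})/i");
    /* Print every capture buffer so the group numbering is visible */
    if prxmatch(prxId, line) then do i = 1 to 5;
        buf = prxposn(prxId, i, line);
        put i= buf=;
    end;
run;
```

Buffers 1 and 4 are the outer groups and hold the full run of words before and after the keyword. Buffers 2 and 5 are the inner repeated groups, which in Perl-style regular expressions keep only their last repetition, so the solution reads buffers 1 and 4.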
Can you show an example of what you want to do?
I'm not sure how to do it with regular expressions, but it should be fairly easy to do with the scan function, e.g.:
data want (drop=i j);
    length string before after $200;
    input string &;
    /* Walk the words, treating blanks and commas as delimiters */
    do i=1 to countw(string," ,");
        /* The 'i' modifier applies to the delimiter list only; the comparison with "fruit" is case-sensitive */
        if scan(string,i," ,",'i')="fruit" then do;
            /* Collect up to 4 words before the keyword */
            do j=max(1,i-4) to i-1;
                before=catx(' ',before,scan(string,j," ,",'i'));
            end;
            /* Collect up to 4 words after the keyword */
            do j=i+1 to min(i+4,countw(string," ,",'i'));
                after=catx(' ',after,scan(string,j," ,",'i'));
            end;
            leave; /* stop at the first occurrence */
        end;
    end;
cards;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;
Art, CEO, AnalystFinder.com
Note: Changed original post to include improved code
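The scan-based code compares each word with "fruit" exactly, so "Fruit" or "FRUIT" would be missed, whereas the pattern in the question uses the /i flag. A minimal sketch of a case-insensitive comparison, assuming that is what is wanted, lowercases each word before comparing. The test string and the variable name word are made up for illustration.

```sas
data _null_;
    length string word $200;
    /* Hypothetical test string, not from the thread */
    string = "Some say a Fruit, others disagree";
    do i = 1 to countw(string, " ,");
        word = scan(string, i, " ,");
        /* lowcase() makes the keyword test ignore case */
        if lowcase(word) = "fruit" then put "Keyword found at word " i;
    end;
run;
```

The same lowcase() call could be dropped into the if condition of the data step above without changing the rest of the logic.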
THANK YOU!!!
While I like the regular expression approach, the suggested expression doesn't quite do what you want: \w does not match an apostrophe, so "it's" is counted as two words and only three words are returned after the keyword.
I revised my suggested code to account for the test string that @PGStats posted, as well as a couple more variants.
Could someone offer whatever revisions @PGStats's suggested code needs so that it produces the same results as the following code and examples:
data want (drop=i j);
    length string before after $200;
    input string &;
    /* Walk the words, treating blanks and commas as delimiters */
    do i=1 to countw(string," ,");
        /* The 'i' modifier applies to the delimiter list only; the comparison with "fruit" is case-sensitive */
        if scan(string,i," ,",'i')="fruit" then do;
            /* Collect up to 4 words before the keyword */
            do j=max(1,i-4) to i-1;
                before=catx(' ',before,scan(string,j," ,",'i'));
            end;
            /* Collect up to 4 words after the keyword */
            do j=i+1 to min(i+4,countw(string," ,",'i'));
                after=catx(' ',after,scan(string,j," ,",'i'));
            end;
            leave; /* stop at the first occurrence */
        end;
    end;
cards;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;
Art, CEO, AnalystFinder.com
While I don't fully understand regular expressions, I played around with the one suggested by @PGStats and came up with the following, which I think correctly handles all of the examples I proposed in my last post. It adds the apostrophe to the word character class and matches on fruit alone:
data test;
    length line before after $200;
    input line &;
    /* [\w\'] treats an apostrophe as part of a word, so "it's" counts as one word */
    prxId = prxparse("/(([\w\']+\W+){0,4})(fruit)\W+(([\w\']+\W+){0,4})/i");
    if prxmatch(prxId, line) then do;
        before = prxposn(prxId, 1, line);
        after = prxposn(prxId, 4, line);
    end;
    drop prxId;
datalines;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;
Art, CEO, AnalystFinder.com
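One case the examples above do not cover is a word that merely ends in "fruit", such as "Grapefruit", since the pattern has no word boundary in front of the keyword. A possible refinement, offered as a sketch rather than a tested fix, is to wrap the keyword in \b anchors. The test line below is made up for illustration and is not one of the thread's examples.

```sas
data _null_;
    length line before after $200;
    /* Hypothetical test line, not from the thread */
    line = "Grapefruit juice is what most people drink, but fresh fruit is better";
    /* \b on both sides keeps "fruit" from matching inside "Grapefruit" */
    prxId = prxparse("/(([\w\']+\W+){0,4})\b(fruit)\b\W+(([\w\']+\W+){0,4})/i");
    if prxmatch(prxId, line) then do;
        before = prxposn(prxId, 1, line);
        after = prxposn(prxId, 4, line);
    end;
    put before= / after=;
run;
```

With the anchors in place the match should land on the standalone word "fruit" rather than on the tail of "Grapefruit"; without them, the first match would be the one inside "Grapefruit", with an empty before value.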