inyli
Calcite | Level 5

Hi, I'm extracting information from a string based on keywords that are pre-defined by regex patterns. My question is how do I get 4 words before and after the keywords, and save them into two separate columns? BIG thanks!

 

Let's say the regex pattern is defined as: 

patternID = prxparse('/a \w+ fruit/i'); 

 

 

ACCEPTED SOLUTION
PGStats
Opal | Level 21

Use prxPosn to extract sub buffers:

 

data test;
  length line before after $100;
  input line &;
  /* Buffer 1 = up to 4 words before the keyword phrase, buffer 4 = up to 4 words after it */
  prxId = prxparse("/((\w+\W+){0,4})(a \w+ fruit)\W+((\w+\W+){0,4})/i");
  if prxmatch(prxId, line) then do;
    before = prxposn(prxId, 1, line);
    after = prxposn(prxId, 4, line);
  end;
  drop prxId;
  datalines;
Some pretend that a tomato is a real fruit, others say it's a vegetable
;

proc print data=test noobs; run;

Output:

line:   Some pretend that a tomato is a real fruit, others say it's a vegetable
before: that a tomato is
after:  others say it's
PG
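
The choice of buffers 1 and 4 follows from how capture groups are numbered: each group gets its number from the position of its opening parenthesis, so the repeated inner groups `(\w+\W+)` take numbers 2 and 5. Below is a minimal sketch, assuming the same pattern and sample line, that dumps all five buffers so the numbering can be checked. The data set name `groups` and the variables `g1`-`g5` are made up for illustration.

```sas
data groups;
  length line $100 g1-g5 $100;
  array g[5] $ g1-g5;
  line = "Some pretend that a tomato is a real fruit, others say it's a vegetable";
  prxId = prxparse("/((\w+\W+){0,4})(a \w+ fruit)\W+((\w+\W+){0,4})/i");
  if prxmatch(prxId, line) then do k = 1 to 5;
    /* buffer k is the group whose opening parenthesis is the k-th one in the pattern */
    g[k] = prxposn(prxId, k, line);
  end;
  drop prxId k;
run;

proc print data=groups noobs; run;
```

In this sketch g1 and g4 should hold the full "before" and "after" context and g3 the keyword phrase, while g2 and g5 keep only the last repetition of the inner group, which is why they are not useful here.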


6 REPLIES
PeterClemmensen
Tourmaline | Level 20

Can you show an example of what you want to do?

art297
Opal | Level 21

I'm not sure how to do it with regular expressions, but it should be fairly easy to do with the scan function. e.g.:

 

data want (drop=i j);
  length string before after $200;
  input string &;
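  /* Words are delimited by blanks and commas */
  do i=1 to countw(string," ,");
    /* lowcase() makes the keyword test case-insensitive; the 'i' modifier
       of SCAN only affects the case of the delimiters */
    if lowcase(scan(string,i," ,"))="fruit" then do;
      /* up to 4 words before the keyword */
      do j=max(1,i-4) to i-1;
        before=catx(' ',before,scan(string,j," ,"));
      end;
      /* up to 4 words after the keyword */
      do j=i+1 to min(i+4,countw(string," ,"));
        after=catx(' ',after,scan(string,j," ,"));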
      end;
      leave;
    end;
  end;
  cards;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;

 

Art, CEO, AnalystFinder.com

 

Note: Changed original post to include improved code

inyli
Calcite | Level 5

THANK YOU!!!

art297
Opal | Level 21

While I like the regular expression approach, the suggested expression doesn't do what you want.

 

I revised my suggested code to account for the test string that @PGStats posted, as well as a couple more variants.

 

Could someone offer whatever revisions @PGStats's code needs so that it produces the same results as the following code and examples:

 

data want (drop=i j);
  length string before after $200;
  input string &;
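  /* Words are delimited by blanks and commas */
  do i=1 to countw(string," ,");
    /* lowcase() makes the keyword test case-insensitive; the 'i' modifier
       of SCAN only affects the case of the delimiters */
    if lowcase(scan(string,i," ,"))="fruit" then do;
      /* up to 4 words before the keyword */
      do j=max(1,i-4) to i-1;
        before=catx(' ',before,scan(string,j," ,"));
      end;
      /* up to 4 words after the keyword */
      do j=i+1 to min(i+4,countw(string," ,"));
        after=catx(' ',after,scan(string,j," ,"));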
      end;
      leave;
    end;
  end;
  cards;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;

 

Art, CEO, AnalystFinder.com

 

art297
Opal | Level 21

While I don't fully understand regular expressions, I played around with the one suggested by @PGStats and came up with the following, which I think correctly handles all of the examples I proposed in my last post:

 

data test;
  length line before after $200;
  input line &;
  prxId = prxparse("/(([\w\']+\W+){0,4})(fruit)\W+(([\w\']+\W+){0,4})/i");
  if prxmatch(prxId, line) then do;
    before = prxposn(prxId, 1, line);
    after = prxposn(prxId, 4, line);
  end;
  drop prxId;
  datalines;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;

Art, CEO, AnalystFinder.com
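
To see what the switch from `\w+` to `[\w\']+` changes, here is a minimal sketch that runs both patterns against one of the test lines and keeps the two "after" results side by side. It assumes the keyword is the plain word `fruit`, as in the code above; the data set name `compare` and the variables `after_w` and `after_apos` are made up for illustration.

```sas
data compare;
  length line after_w after_apos $200;
  line = "Some pretend that a tomato is a real fruit, others say it's a vegetable";
  /* \w does not match an apostrophe, so a contraction is consumed as two words */
  id_w = prxparse("/((\w+\W+){0,4})(fruit)\W+((\w+\W+){0,4})/i");
  /* adding the apostrophe to the character class keeps a contraction together */
  id_apos = prxparse("/(([\w\']+\W+){0,4})(fruit)\W+(([\w\']+\W+){0,4})/i");
  if prxmatch(id_w, line) then after_w = prxposn(id_w, 4, line);
  if prxmatch(id_apos, line) then after_apos = prxposn(id_apos, 4, line);
  drop id_w id_apos;
run;

proc print data=compare noobs; run;
```

With the first pattern the four "words" after the keyword should come out as `others say it's`, because `it` and `s` are counted separately. With the second the result should be `others say it's a`, which is the behaviour the scan-based code gives.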

 
