SAS Programming

inyli · Posted 01-11-2020 04:16 PM

Hi, I'm extracting information from a string based on keywords that are pre-defined by regex patterns. My question is how do I get 4 words before and after the keywords, and save them into two separate columns? BIG thanks!

Let's say the regex pattern is defined as:

patternID = prxparse('/a \w+ fruit/i');

PGStats · Posted 01-12-2020 12:29 AM

Use prxPosn to extract sub buffers:

data test;
length line before after $100;
input line &;
prxId = prxparse("/((\w+\W+){0,4})(a \w+ fruit)\W+((\w+\W+){0,4})/i");
if prxmatch(prxId, line) then do;
	before = prxposn(prxId, 1, line);
	after = prxposn(prxId, 4, line);
	end;
drop prxId;
datalines;
Some pretend that a tomato is a real fruit, others say it's a vegetable
;

proc print data=test noobs; run;

line 	before 	after
Some pretend that a tomato is a real fruit, others say it's a vegetable 	that a tomato is 	others say it's
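
For readers unsure why buffers 1 and 4 are the ones extracted: capture buffers are numbered by the position of their opening parenthesis, and buffer 0 is the whole match. In the pattern above, buffer 1 is the run of up to four preceding words, buffer 3 is the keyword phrase, and buffer 4 is the run of up to four following words. Buffers 2 and 5 hold only the last repetition of the inner group. A minimal sketch that writes every buffer to the log for the same test line (the variables `k` and `buf` are illustrative names, not part of the code above):

```sas
data _null_;
  length line buf $100;
  line = "Some pretend that a tomato is a real fruit, others say it's a vegetable";
  prxId = prxparse("/((\w+\W+){0,4})(a \w+ fruit)\W+((\w+\W+){0,4})/i");
  /* buffer 0 is the whole match; 1 to 5 follow the order of the opening parentheses */
  if prxmatch(prxId, line) then do k = 0 to 5;
    buf = prxposn(prxId, k, line);
    put k= buf=;
  end;
run;
```

Because `\w` does not match an apostrophe, "it's" is counted as two words here, which is why the after column stops at "others say it's".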

PG


PeterClemmensen · Posted 01-11-2020 05:03 PM

Can you show an example of what you want to do?


art297 · Posted 01-11-2020 06:14 PM

I'm not sure how to do it with regular expressions, but it should be fairly easy to do with the scan function. e.g.:

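data want (drop=i j);
  length string before after $200;
  input string &;
  /* walk the words, treating blanks and commas as delimiters */
  do i=1 to countw(string," ,");
    /* lowcase() makes the keyword test case-insensitive */
    if lowcase(scan(string,i," ,",'i'))="fruit" then do;
      /* collect up to 4 words before the keyword */
      do j=max(1,i-4) to i-1;
        before=catx(' ',before,scan(string,j," ,",'i'));
      end;
      /* collect up to 4 words after the keyword */
      do j=i+1 to min(i+4,countw(string," ,",'i'));
        after=catx(' ',after,scan(string,j," ,",'i'));
      end;
      leave; /* stop at the first keyword found */
    end;
  end;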
  cards;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;

Art, CEO, AnalystFinder.com

Note: Changed original post to include improved code

inyli · Posted 01-12-2020 12:23 PM

THANK YOU!!!

art297 · Posted 01-12-2020 01:30 PM

While I like the regular expression approach, the suggested expression doesn't do what you want.

I revised my suggested code to account for the test string that @PGStats posted, as well as a couple more variants.

I suggest that someone offer whatever revisions @PGStats's suggested code needs in order to produce the same results as the following code and examples:

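data want (drop=i j);
  length string before after $200;
  input string &;
  /* walk the words, treating blanks and commas as delimiters */
  do i=1 to countw(string," ,");
    /* lowcase() makes the keyword test case-insensitive */
    if lowcase(scan(string,i," ,",'i'))="fruit" then do;
      /* collect up to 4 words before the keyword */
      do j=max(1,i-4) to i-1;
        before=catx(' ',before,scan(string,j," ,",'i'));
      end;
      /* collect up to 4 words after the keyword */
      do j=i+1 to min(i+4,countw(string," ,",'i'));
        after=catx(' ',after,scan(string,j," ,",'i'));
      end;
      leave; /* stop at the first keyword found */
    end;
  end;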
  cards;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;

Art, CEO, AnalystFinder.com

art297 · Posted 01-13-2020 05:15 PM

While I don't fully understand regular expressions, I played around with the one suggested by @PGStats and came up with the following, which I think correctly handles all of the examples I proposed in my last post:

data test;
  length line before after $200;
  input line &;
  prxId = prxparse("/(([\w\']+\W+){0,4})(fruit)\W+(([\w\']+\W+){0,4})/i");
  if prxmatch(prxId, line) then do;
    before = prxposn(prxId, 1, line);
    after = prxposn(prxId, 4, line);
  end;
  drop prxId;
  datalines;
word1 word2 word3 word4 word5 word6 fruit word7 word8 word9 word10 word11
word1 word2's fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
word1 word2 fruit word3 word4 word5 word6 word7 word8 word9 word10 word11
Some pretend that a tomato is a real fruit, others say it's a vegetable
Some say that a tomato is a fruit, I'd say it is a vegetable
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;
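
One possible refinement, offered as a sketch rather than a tested fix: the pattern above has nothing that forces `fruit` to start at a word boundary, so a longer word that ends in "fruit" (for example "grapefruit" followed by a comma) could be picked up as the keyword with an empty before column. Adding `\b` on both sides of the keyword is one way to guard against that. The data set name `check` and the first sample line are made up for illustration; the second line is taken from the examples above.

```sas
data check;
  length line before after $200;
  input line &;
  /* \b on both sides: "fruit" must be a whole word, not the tail of "grapefruit" */
  prxId = prxparse("/(([\w\']+\W+){0,4})\b(fruit)\b\W+(([\w\']+\W+){0,4})/i");
  if prxmatch(prxId, line) then do;
    before = prxposn(prxId, 1, line);
    after  = prxposn(prxId, 4, line);
  end;
  drop prxId;
  datalines;
A grapefruit, unlike most citrus, is a bitter fruit to eat raw
Mr Afruiting told us not to eat fruit like apples, pears and oranges
;
run;

proc print data=check noobs; run;
```

Without the `\b`, the first match in the first line would be the "fruit" inside "grapefruit"; with it, the match should move to the standalone word near the end of the line.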

Art, CEO, AnalystFinder.com

Extract words before and after regex pattern from text
