Text mining and content categorization

Text parsing question

Contributor
Posts: 40

Text parsing question

I'm using a sample program from SAS here, but I want to get a different result.

data _null_;
   ExpressionID = prxparse('/(?:s|,?)([crb]at) ?(?:,)?/');
   text = 'The woods have a bat, cat and a rat';
   start = 1;
   stop = length(text);

      /* Use PRXNEXT to find the first instance of the pattern, */
      /* then use DO WHILE to find all further instances.       */
      /* PRXNEXT changes the start parameter so that searching  */
      /* begins again after the last match.                     */
   call prxnext(ExpressionID, start, stop, text, position, length);
   do while (position > 0);
      fnd = prxposn(ExpressionID,1,text);
      found = substr(text, position, length);
      put fnd= found= position= length= start= stop=;
      call prxnext(ExpressionID, start, stop, text, position, length);
   end;
run;

/*The following lines are written to the SAS log:*/
/*   found=bat position=18 length=3*/
/*   found=cat position=23 length=3*/
/*   found=rat position=34 length=3*/

What I want to get is this instead:

Found=The woods have a bat

Found= cat and

Found= a rat

Basically, I want to use each found word as a delimiter and split the string into parts (as many as it finds; 3 in this case).



All Replies
Grand Advisor
Posts: 16,908

Re: Text parsing question

What's your delimiter? If it were bat/cat/hat, your result would be:

the  woods have a bat

cat

and a rat


or possibly


the  woods have a bat

cat and a

rat

Contributor
Posts: 40

Re: Text parsing question

You are right, should be:

the  woods have a bat

cat

and a rat

Grand Advisor
Posts: 16,908

Re: Text parsing question

Forget the sample code from SAS. What are you trying to accomplish overall? Can you provide more than one example? Would sat also be a delimiter?

Respected Advisor
Posts: 4,766

Re: Text parsing question

Let's add to the list.  Which of these should be considered delimiters?

at

cats

matter

Secretariat

wheat

Contributor
Posts: 40

Re: Text parsing question

I have to use prxparse('/(?:\s|,?)([crb]at) ?(?:,)?/'); which means only the words cat, rat, and bat are delimiters. I just noticed a typo in the original post: it should be \s, which matches whitespace. The real program is much more complicated than this one. The goal is to extract substrings from a long string by breaking it at any of these 3 words (in this sample; the real case has many more).

Contributor
Posts: 40

Re: Text parsing question

And I know that in this case a word like brat will also match; that is fine. Thanks.

Solution
‎12-17-2014 03:30 PM
Respected Advisor
Posts: 4,766

Re: Text parsing question

OK, keeping your program as intact as possible ...

Add one statement just before DO WHILE:

prior_position = 1;

Then inside the DO WHILE loop, replace the FOUND = assignment with:

found = substr(text, prior_position, position - prior_position + 3);

prior_position = position + 3;

You may get ", cat" instead of "cat", but that's probably a decent result.

Good luck.
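
Assembled into a complete program, the suggestion above might look like the sketch below. It assumes the corrected pattern with \s. Because that pattern can also consume the blank in front of the word, the sketch locates the captured word inside the match with INDEX instead of adding a fixed 3. The variables word_start and word_end are illustrative names and do not appear in the original posts.

```sas
data _null_;
   ExpressionID = prxparse('/(?:\s|,?)([crb]at) ?(?:,)?/');
   text = 'The woods have a bat, cat and a rat';
   start = 1;
   stop = length(text);
   prior_position = 1;

   call prxnext(ExpressionID, start, stop, text, position, length);
   do while (position > 0);
      /* Capture buffer 1 holds the delimiter word itself. */
      fnd = prxposn(ExpressionID, 1, text);

      /* Locate that word inside the full match (hypothetical helper variables). */
      word_start = position + index(substr(text, position, length), trim(fnd)) - 1;
      word_end   = word_start + length(fnd) - 1;

      /* Keep everything from the end of the previous segment through this word. */
      found = substr(text, prior_position, word_end - prior_position + 1);
      put fnd= found=;

      prior_position = word_end + 1;
      call prxnext(ExpressionID, start, stop, text, position, length);
   end;
run;
```

For this sample text, found should hold "The woods have a bat", then ", cat", then " and a rat", which matches the ", cat" caveat mentioned above.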

Contributor
Posts: 40

Re: Text parsing question

Thank you Astounding! I think that works.

Contributor
Posts: 40

Re: Text parsing question

One more thing: is there a way to have the last found return 'and a rat abc' instead of 'and a rat'?

data _null_;
   ExpressionID = prxparse('/(?:\s|,?)([crb]at) ?(?:,)?/');
   text = 'The woods  have a bat, cat and a rat abc';
   start = 1;
   stop = length(text);

      /* Use PRXNEXT to find the first instance of the pattern, */
      /* then use DO WHILE to find all further instances.       */
      /* PRXNEXT changes the start parameter so that searching  */
      /* begins again after the last match.                     */
   call prxnext(ExpressionID, start, stop, text, position, length);
   prior_position = 1;
   do while (position > 0);
      fnd = prxposn(ExpressionID,1,text);
      found = substr(text, prior_position, position - prior_position + length(fnd)+1);
      prior_position = position + length(fnd)+1;
      put fnd= found= prior_position position= length= start= stop=;
      call prxnext(ExpressionID, start, stop, text, position, length);
   end;
run;

Respected Advisor
Posts: 4,766

Re: Text parsing question

No, it wouldn't be easy.  What would be easy would be to print one more line at the end:

Remaining text = abc
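
A minimal sketch of that extra line, built on the program from the previous post. It assumes, as that program does, that every match begins with exactly one blank in front of the word (hence the + 1). The variable remaining is an illustrative name, not part of the original code.

```sas
data _null_;
   ExpressionID = prxparse('/(?:\s|,?)([crb]at) ?(?:,)?/');
   text = 'The woods  have a bat, cat and a rat abc';
   start = 1;
   stop = length(text);
   prior_position = 1;

   call prxnext(ExpressionID, start, stop, text, position, length);
   do while (position > 0);
      fnd = prxposn(ExpressionID, 1, text);
      found = substr(text, prior_position, position - prior_position + length(fnd) + 1);
      prior_position = position + length(fnd) + 1;
      put fnd= found=;
      call prxnext(ExpressionID, start, stop, text, position, length);
   end;

   /* After the loop, anything past the last delimiter word is left over. */
   if prior_position <= stop then do;
      remaining = substr(text, prior_position);
      put 'Remaining text = ' remaining;
   end;
run;
```

With this sample text, remaining should hold " abc" (including its leading blank). If the last delimiter word ends the string, prior_position exceeds stop and no extra line is written.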
