Help using Base SAS procedures

Using SUBSTR + FIND to extract text segment

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Using SUBSTR + FIND to extract text segment

Hello SAS community.

I am trying to extract a text segment based on a keyword from long, unstandardized text strings.  I need the text string to include 30 characters prior to the key word and 50 characters after the key word.  I am using the following code to do this.  The starting position of the extracted string appears to be accurate, but the end point varies dramatically.  I also get many warning notes in the log.  I've tried any number of parenthetical configurations around the FIND arguments to no avail.  Does anyone have any advice?

[jobtext is the variable that I am searching through.  Experience is the key word that I'm looking for.]

   

extract= substr(jobtext,FIND(jobtext,"experience",'i')-20, FIND(jobtext,"experience",'i')+30);

Thank you,

Brandi


Accepted Solutions
Solution
‎07-21-2015 12:23 PM
Super User
Posts: 5,429

Re: Using SUBSTR + FIND to extract text segment

Posted in reply to BrandiLeach

From your description my guess is that you substring outside the variable boundaries. (Supplying any error /warning messages helps).

For the length parameter try to use a combination of length() and min() function.

Data never sleeps

View solution in original post


All Replies
Super User
Posts: 11,343

Re: Using SUBSTR + FIND to extract text segment

Posted in reply to BrandiLeach

It sounds like you are likely getting values for your extract variable but possibly not exactly what you want, correct?

What types of warnings are you getting?

Is there any pattern that actually indicates the end of the characters after the word that you want that might help?

It may be helpful to post some example data that demonstrates the issues.

Occasional Contributor
Posts: 6

Re: Using SUBSTR + FIND to extract text segment

Yes, I'm getting something, but not getting exactly what I want.  

The error message is "Invalid second argument to function SUBSTR."  This only happens when I add the '-20' after the FIND argument.

I am not setting the parameters outside of my variable.  The extracted portion cuts off before the end of the variable (they are verrrrry long text strings). 

I am thinking that the issue has to do with my adding '-20' and '+30' to the FIND argument but I don't know of another way to capture the text around my keyword.  Unfortunately there are no patterns that I can use to set an end value.

How would I use length and min within the FIND function?

Thanks for all of the feedback.  I appreciate the assistance. 

Occasional Contributor
Posts: 6

Re: Using SUBSTR + FIND to extract text segment

Posted in reply to BrandiLeach

I'm not sure if I can/should post my raw data here.  I'm searching through job postings, so the text is long and free-form.  Here's an example of one of my extracts:

"specialty. previous experience in geriatric primary care, or icu/er rn is preferred. knowledge of medicare reimbursement and coding, etc., and electronic medical records used for documentation. excellent oral and written communication skills, experienced working in a collaborative care team environment and functioning autonomously with minimal supervision. ability to provide exceptional customer service to patients about medical home team: medical home team is a healthcare management company with a unique medical practice model; aligning office-based physicians with specia" 

Thanks again!

Super User
Posts: 19,802

Re: Using SUBSTR + FIND to extract text segment

Posted in reply to BrandiLeach

FIND(jobtext,"experience",'i')-20

If FIND() is < 20 then argument becomes negative.

Use ifn to return the appropriate value instead.

Solution
‎07-21-2015 12:23 PM
Super User
Posts: 5,429

Re: Using SUBSTR + FIND to extract text segment

Posted in reply to BrandiLeach

From your description my guess is that you substring outside the variable boundaries. (Supplying any error /warning messages helps).

For the length parameter try to use a combination of length() and min() function.

Data never sleeps
Occasional Contributor
Posts: 6

Re: Using SUBSTR + FIND to extract text segment

Oh, you mean just set the length of the extract variable to what I want.  Of course.  Thanks.  That works.  I'd still like to know why the original syntax is incorrect or if there's a better way to extract text around a keyword.  The error messages are arising on cases where the argument is valid (i.e.I'm not asking SAS to return a value before the beginning/after the end).

Respected Advisor
Posts: 3,156

Re: Using SUBSTR + FIND to extract text segment

Posted in reply to BrandiLeach

Q 1. "why the original syntax is incorrect",

A 1. Please RTM of SUBSTR(). The third element is length of the extract, not the position of ending point.

Q 2. " if there's a better way",

A 2. Not sure about that, but there are definitely alternatives, for one, using PRX functions:

extract=prxchange('s/.+(.{20}experience.{30}).+/$1/io', -1, jobtext);

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 551 views
  • 6 likes
  • 5 in conversation