BrandiLeach
Calcite | Level 5

Hello SAS community.

I am trying to extract a text segment based on a keyword from long, unstandardized text strings. I need the text string to include 30 characters prior to the keyword and 50 characters after the keyword. I am using the following code to do this. The starting position of the extracted string appears to be accurate, but the end point varies dramatically. I also get many warning notes in the log. I've tried any number of parenthetical configurations around the FIND arguments, to no avail. Does anyone have any advice?

[jobtext is the variable that I am searching through. "experience" is the keyword that I'm looking for.]

   

extract= substr(jobtext,FIND(jobtext,"experience",'i')-20, FIND(jobtext,"experience",'i')+30);

Thank you,

Brandi

1 ACCEPTED SOLUTION

Accepted Solutions
LinusH
Tourmaline | Level 20

From your description, my guess is that you are substringing outside the variable's boundaries. (Supplying any error/warning messages helps.)

For the length parameter, try using a combination of the length() and min() functions.
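
One way to read this suggestion, as a minimal sketch rather than a definitive fix: cap the requested length at the number of characters that remain after the starting position. The sample text and the helper variables pos, start, wanted, and avail are made up for illustration, and the sketch only handles rows where the keyword starts after position 20.

```sas
data _null_;
  length jobtext $200 extract $60;
  /* hypothetical sample text; the keyword sits near the end */
  jobtext = "Requires five years of nursing experience.";
  pos = find(jobtext, "experience", 'i');     /* 0 when the keyword is absent */
  if pos > 20 then do;                        /* sketch assumes a valid start */
    start  = pos - 20;
    wanted = 20 + length("experience") + 30;  /* 60 characters in total */
    avail  = length(jobtext) - start + 1;     /* characters left from start */
    extract = substr(jobtext, start, min(wanted, avail));
  end;
  put extract=;
run;
```

Here min() picks the smaller of the wanted length and what is actually available, so the third argument to SUBSTR never reaches past the end of the text.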

Data never sleeps

7 REPLIES 7
ballardw
Super User

It sounds like you are likely getting values for your extract variable but possibly not exactly what you want, correct?

What types of warnings are you getting?

Is there any pattern that actually indicates the end of the characters after the word that you want that might help?

It may be helpful to post some example data that demonstrates the issues.

BrandiLeach
Calcite | Level 5

Yes, I'm getting something, but not getting exactly what I want.  

The error message is "Invalid second argument to function SUBSTR."  This only happens when I add the '-20' after the FIND argument.

I am not setting the parameters outside of my variable.  The extracted portion cuts off before the end of the variable (they are verrrrry long text strings). 

I am thinking that the issue has to do with my adding '-20' and '+30' to the FIND argument but I don't know of another way to capture the text around my keyword.  Unfortunately there are no patterns that I can use to set an end value.

How would I use length and min within the FIND function?

Thanks for all of the feedback.  I appreciate the assistance. 

BrandiLeach
Calcite | Level 5

I'm not sure if I can/should post my raw data here.  I'm searching through job postings, so the text is long and free-form.  Here's an example of one of my extracts:

"specialty. previous experience in geriatric primary care, or icu/er rn is preferred. knowledge of medicare reimbursement and coding, etc., and electronic medical records used for documentation. excellent oral and written communication skills, experienced working in a collaborative care team environment and functioning autonomously with minimal supervision. ability to provide exceptional customer service to patients about medical home team: medical home team is a healthcare management company with a unique medical practice model; aligning office-based physicians with specia" 

Thanks again!

Reeza
Super User

FIND(jobtext,"experience",'i')-20

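If FIND() returns 20 or less (including 0 when the keyword is not found), the second argument becomes zero or negative, which SUBSTR rejects as an invalid position.

Use IFN() to return an appropriate starting position instead.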
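
A minimal sketch of that guard, assuming it is acceptable to start at position 1 when fewer than 20 characters precede the keyword. The sample text and the variables pos, start, and len are illustrative only.

```sas
data _null_;
  length jobtext $200 extract $60;
  /* hypothetical sample text; the keyword starts at position 18 */
  jobtext = "RN with previous experience in geriatric primary care";
  pos = find(jobtext, "experience", 'i');     /* 0 when the keyword is absent */
  if pos > 0 then do;
    start = ifn(pos > 20, pos - 20, 1);       /* fall back to position 1 */
    /* characters before the keyword + keyword + 30 after; the end of the
       string is not clamped here because jobtext is declared wider */
    len = (pos - start) + length("experience") + 30;
    extract = substr(jobtext, start, len);
  end;
  put extract=;
run;
```

The `if pos > 0` test skips rows without the keyword, so SUBSTR is only called with a position of 1 or more.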

BrandiLeach
Calcite | Level 5

Oh, you mean just set the length of the extract variable to what I want. Of course. Thanks. That works. I'd still like to know why the original syntax is incorrect or if there's a better way to extract text around a keyword. The error messages are arising on cases where the argument is valid (i.e., I'm not asking SAS to return a value before the beginning or after the end).

Haikuo
Onyx | Level 15

Q 1. "why the original syntax is incorrect",

A 1. Please read the documentation for SUBSTR(). The third argument is the length of the extract, not the position of the ending point.
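
A small illustration of the difference, using a made-up ten-character string. In the original call, the third argument was FIND()+30, so the requested length grew with the keyword's position, which would explain why the end point varied so much.

```sas
data _null_;
  s = "abcdefghij";
  a = substr(s, 4, 6);   /* 6 characters starting at position 4: defghi */
  b = substr(s, 4, 3);   /* 3 characters starting at position 4: def    */
  put a= b=;
run;
```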

Q 2. " if there's a better way",

A 2. Not sure about that, but there are definitely alternatives, for one, using PRX functions:

extract=prxchange('s/.+(.{20}experience.{30}).+/$1/io', -1, jobtext);
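
One caveat with the pattern above: it needs at least 21 characters before the keyword and at least 31 after it, and when it does not match, PRXCHANGE returns the string unchanged. The greedy leading .+ also makes it pick the last occurrence of the keyword. A possible variant, sketched here under the assumption that the text contains no line breaks, uses bounded quantifiers with CALL PRXSUBSTR so that shorter contexts still match. The sample text and the variables re, start, and len are illustrative.

```sas
data _null_;
  length jobtext $200 extract $60;
  retain re;
  /* compile the pattern once: up to 20 characters before the keyword,
     the keyword itself (case-insensitive), up to 30 characters after */
  if _n_ = 1 then re = prxparse('/.{0,20}experience.{0,30}/i');
  /* hypothetical sample text */
  jobtext = "RN with previous experience in geriatric primary care";
  call prxsubstr(re, jobtext, start, len);    /* start = 0 when no match */
  if start > 0 then extract = substr(jobtext, start, len);
  put extract=;
run;
```

This version picks up the first occurrence of the keyword and leaves extract blank when the keyword is absent.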
