Bhuvaneswari
Obsidian | Level 7

Hello Everyone,

 

Here is my question: I have a long string that categorises a record into one of three categories, say AAA, BBB, and CCC. An observation, which is specific to a certain industry, can be assigned to one category, say 'AAA', now and be revised into a different category, say 'CCC', at a later time. All I need to capture is the most recently assigned category.

 

Example text: 'The industry x has so and so product lines and is previously assigned to AAA. currently assigned category is CCC. This is reviewed by so and so department and is documented for future purposes.'

 

What I need: I need to check whether any of AAA, BBB, CCC occurs after the term 'Category' within a specified range, say within 5-7 words after the word 'Category'. I need to search the string from the end and record the first occurrence in a separate variable. I tried using the find() and findw() functions, but they yielded incorrect results.

 

Any help is appreciated.

 

Regards,

Bhuvana

 

6 REPLIES
collinelliot
Barite | Level 11

More examples of the text would be helpful to know how robust a solution is really required. It's pretty simple to address this one case, but what cases are making it more complicated?

Bhuvaneswari
Obsidian | Level 7

Hello,

 

The search strings are in a specific case, i.e., upper case in this scenario (AAA BBB CCC). Also, 'category' can be replaced by synonyms like classification, classify, or sometimes follow-up (in case the category is modified for some reason). It would be helpful to get the result for each type specified above in a separate variable. I can then either use the coalesce function or examine such observations in detail (those that have more than one of those words in the target string, which I believe must be only a handful) and assign the final category based on domain knowledge.

 

Thanks for the help,

 

Regards,

Bhuvana

ballardw
Super User

Please describe the "incorrect results" from FIND and FINDW, if the problem is something other than the functions returning a character position when you wanted a word count.

 

Also, since you are looking for "words", what do you consider the boundaries of a word? Do any of your "words" contain special characters or punctuation (for example, company names sometimes have Inc. or Ltd., where the . should actually be part of the last "word")?

 

I am also not quite clear on what you want. Can you show 1) some of your search terms, 2) the text to search in, and 3) an example of what you envision the output to look like?

Bhuvaneswari
Obsidian | Level 7

Hello Ballard,

 

Thanks for the response. The data I'm working on is confidential, hence I couldn't reveal the actual data. I created sample data that is similar to the original. In the highlighted text, the words in blue are the ones I need to concentrate on, and the string should be searched from the end to capture only the first occurrence of a classification. I have randomly inserted the same words in the text (highlighted in red), but they shouldn't be considered during the search as they are not at the end.

 

Let me take one observation and explain in detail. See the second record: I want code that searches the string from the end and checks whether AAA, BBB, or CCC occurs within 4-5 words of any of the words 'follow-up', 'classify', 'classified', 'categorized', 'classification'. As the search starts from the end, I want it to capture only the last occurrence, which is CCC next to 'category'. I want the search to ignore BBB, which is also within 5 words of one of the five specified words, as it is not encountered first when searching from the end.

 

Sorry for making this complicated, wish I could simplify it more.

 

Thanks again.

 

Regards,

Bhuvana

 


Sample Text Data.PNG
collinelliot
Barite | Level 11

I highly doubt this will work for the full range of text you have in your data, but using a positive lookbehind for "current" and then pulling out the last following match of one of your codes works for the cases I tried. Maybe this will get you started?

 

 

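data _null_;
    /* & lets the value contain single blanks; it ends at two consecutive blanks */
    input text & :$500.;
    length classification $3;

    /* Lookbehind for "current", then a greedy .* so the last code after it is captured.
       The i flag makes the whole pattern case-insensitive. */
    prxLastCat = prxparse('/(?<=current).*(AAA|BBB|CCC)/i');

    if prxmatch(prxLastCat, text) then do;
        /* capture buffer 1 holds the matched code */
        classification = prxposn(prxLastCat, 1, text);
    end;

    put _all_;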
datalines;
The industry x has so and so product lines classify and is previously assigned to AAA. Currently assigned category is CCC. This is reviewed by so and so department and is documented for future purposes. 
The industry x has so and so product lines classify and follow-up is previously assigned to AAA. Currently assigned category is BBB. This is reviewed by so and so department and is documented for future purposes. 
This is an example of some text that does not AAA have any match. 
;
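
If it helps, here is a rough sketch of how the same idea might be extended to the synonyms mentioned earlier in the thread, so that the code has to appear within a few words of a keyword. Everything specific in it is an assumption rather than something taken from the real data: the data set name last_category, the variables keyword and category, the window of 0-5 intervening words, the keyword spellings, and the made-up datalines. The leading greedy .* makes the match land on the rightmost keyword that has a code close behind it, and leaving out the i flag keeps the codes upper case only.

```sas
data last_category;
    length text $200 keyword $20 category $3;
    retain re;
    infile datalines truncover;
    input text $char200.;

    /* Compile the pattern once. Keywords and the 0-5 word window are assumptions. */
    if _n_ = 1 then
        re = prxparse('/.*\b([Cc]ategor\w*|[Cc]lassif\w*|[Ff]ollow-up)\b(?:\W+\w+){0,5}?\W+(AAA|BBB|CCC)\b/');

    if prxmatch(re, text) then do;
        keyword  = prxposn(re, 1, text);   /* keyword that anchored the match */
        category = prxposn(re, 2, text);   /* code found within the window    */
    end;

    put keyword= category=;
    drop re;
datalines;
Previously classified as BBB after review. Current category is CCC.
Follow-up moved it to AAA. No category change is planned this year.
This text mentions AAA in passing only.
;
run;
```

On these made-up rows I would expect CCC for the first, AAA for the second (the later "category" has no code near it, so the match falls back to "Follow-up"), and a blank for the third. Note that a hyphenated word such as follow-up counts as two words inside the window.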
ballardw
Super User

This searches a string from left to right for two words; for each word that is in the string, you get the position of its first occurrence. To search from the right instead, add B or b to the 'e' in the FINDW modifiers: 'be' or 'eb' will search from the right. E or e tells FINDW to report the position in words instead of characters.

 

data example;
   str="I'm not going to try to retype information provided in a picture. Looking for the word classify with AAA a number of words from it";
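   /* ' .,' are the word delimiters; 'e' returns a word count, not a character position */
   ClassifyWord= findw(str,'classify',' .,','e');
   AAAWord  = findw(str,'AAA',' .,','e');
   /* only when both words were found (nonzero): distance between them in words */
   if classifyword and AAAword then distance= abs(classifyword-AAAword);

run;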

 

 

You could create arrays to hold 1) the first set of words to find, 2) their resulting positions, 3) the second set of words to find, 4) their positions, and 5) possibly an array of the various possible distance combinations. You will have to come up with rules for how to ignore matches.
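
One possible rule, sketched below, is to walk the words from the end and keep the first code that has a keyword within a few words before it. Note that this swaps findw and arrays for countw() and scan(), which makes the right-to-left walk explicit. The sample string, the variable names, the keyword list, and the 5-word window (maxgap) are all assumptions for illustration, not taken from the actual data.

```sas
data last_code;
    length str $300 word prev $40 category $3;
    str = "Classify as BBB first. Follow-up review set the category to CCC. Filed by the AAA desk.";
    dlm = ' .,';        /* same delimiters as the findw calls above */
    maxgap = 5;         /* assumed window, in words */
    category = ' ';

    /* Walk the words from the end of the string; stop at the first accepted code */
    do i = countw(str, dlm) to 1 by -1 while (category = ' ');
        word = scan(str, i, dlm);
        if word in ('AAA', 'BBB', 'CCC') then do;
            /* Look back up to maxgap words for one of the keywords */
            do j = i - 1 to max(1, i - maxgap) by -1;
                prev = lowcase(scan(str, j, dlm));
                if prev in ('category', 'categorized', 'classify', 'classified',
                            'classification', 'follow-up') then category = word;
            end;
        end;
    end;

    put category=;
    drop i j word prev dlm maxgap;
run;
```

For this sample string the trailing AAA is skipped because no keyword sits within five words before it, so the log should show category=CCC. Comparing codes without changing case keeps the match upper case only, while the keywords are lowercased before the comparison.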
