Hello Everyone,
Here is my question: I have a long string that categorises a record into one of three categories, say AAA, BBB, or CCC. An observation, which is specific to a certain industry, can be assigned to one category, say 'AAA', now and later be revised to a different category, say 'CCC'. All I need to capture is the most recently assigned category.
Example text: 'The industry x has so and so product lines and is previously assigned to AAA. currently assigned category is CCC. This is reviewed by so and so department and is documented for future purposes.'
What I need: I need to check whether any of AAA, BBB, or CCC occurs after the term 'Category' within a specified range, say within 5-7 words after the word 'Category'. I need to search the string from the end and record the first occurrence in a separate variable. I tried the find() and findw() functions, but they yielded incorrect results.
Any help is appreciated.
Regards,
Bhuvana
More examples of the text would be helpful to know how robust a solution is really required. It's pretty simple to address this one case, but what cases are making it more complicated?
Hello,
The search strings are in a specific case, i.e., upper case in this scenario (AAA, BBB, CCC). Also, 'category' can be replaced by synonyms like 'classification', 'classify', or sometimes 'follow-up' (in case the category was modified for some reason). It would be helpful to get the result for each of the types specified above in a separate variable. I can then either use the coalesce function or examine those observations in detail (those that have more than one of those words in the target string, which I believe must be only a handful) and assign the final category based on domain knowledge.
Thanks for the help,
Regards,
Bhuvana
Please describe the "incorrect results" from Find and Findw, unless the issue is simply that the functions returned a character position and you wanted a word count.
Also, since you are looking for "words", what do you consider to be the boundaries of a word? Do any of your "words" contain special characters or punctuation (example: company names sometimes have Inc. or Ltd., where the . should actually be part of the last "word")?
I am also not quite clear on what you want. Can you show 1) some of your search terms, 2) the text to search in, and 3) an example of what you envision the output to look like?
Hello Ballard,
Thanks for the response. The data I'm working on is confidential, so I can't reveal the actual data. I have created sample data that is similar to the original. In the highlighted text, the words in blue are the ones I need to concentrate on, and the string should be searched from the end to capture only the first occurrence of the classification. I have randomly inserted the same words elsewhere in the text (highlighted in red), but they shouldn't be considered during the search because they are not at the end.
Let me take one observation and explain in detail. See the second record: I want code that searches the string from the end and checks whether any of AAA, BBB, or CCC occurs within 4-5 words of any of the words 'follow-up', 'classify', 'classified', 'categorized', or 'classification'. As the search starts from the end, I want it to capture only the last occurrence, which is the CCC next to 'category'. I want the search to ignore BBB, which is also within 5 words of one of the specified words, because it is not the first one encountered when searching from the end.
Sorry for making this complicated, wish I could simplify it more.
Thanks again.
Regards,
Bhuvana
I highly doubt this will work for the full range of text you have in your data, but using a positive lookbehind for "current" and then pulling out the last following match of one of your codes works for the cases I tried. Maybe this will get you started?
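data _null_;
  /* Read one text value per line; the & modifier allows single embedded blanks */
  input text & :$500.;
  /* After "current", the greedy .* leaves the last code on the line in capture group 1. */
  /* The /i flag makes the whole pattern, including the codes, case-insensitive. */
  prxLastCat = prxparse('/(?<=current).*(AAA|BBB|CCC)/i');
  if prxmatch(prxLastCat, text) then do;
    classification = prxposn(prxLastCat, 1, text);
  end;
  put _all_;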
datalines;
The industry x has so and so product lines classify and is previously assigned to AAA. Currently assigned category is CCC. This is reviewed by so and so department and is documented for future purposes.
The industry x has so and so product lines classify and follow-up is previously assigned to AAA. Currently assigned category is BBB. This is reviewed by so and so department and is documented for future purposes.
This is an example of some text that does not AAA have any match.
;
This searches a string from left to right for two words; if they are in the string, you get the position of the first occurrence of each. If you need to search from the right, add B or b to the 'e' in the findw modifiers: 'be' or 'eb' will search from the right. E or e says to report a word count instead of a character position.
data example;
  str = "I'm not going to try to retype information provided in a picture. Looking for the word classify with AAA a number of words from it";
  ClassifyWord = findw(str, 'classify', ' .,', 'e');
  AAAWord = findw(str, 'AAA', ' .,', 'e');
  if ClassifyWord and AAAWord then distance = abs(ClassifyWord - AAAWord);
run;
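As a rough sketch of the right-to-left search mentioned above, the step below uses the B modifier on its own (so findw returns a character position) to see which of the three codes occurs last in the string. The sample text, the array name codes, and the variables lastCode and lastPos are made up for illustration, and the delimiter list ' .,' is an assumption about what separates words in the real data.

```sas
data _null_;
  length lastCode $3;
  /* Shortened, hypothetical version of the example text from the question */
  str = 'Previously assigned to AAA. Currently assigned category is CCC. Reviewed and documented.';
  array codes[3] $3 _temporary_ ('AAA' 'BBB' 'CCC');
  lastPos = 0;
  do k = 1 to dim(codes);
    /* 'b' scans from right to left, so pos is the start of the last occurrence (0 if absent) */
    pos = findw(str, codes[k], ' .,', 'b');
    if pos > lastPos then do;
      lastPos = pos;
      lastCode = codes[k];
    end;
  end;
  put lastCode= lastPos=;
run;
```

For this sample text the step should report lastCode=CCC. It does not check how close the code is to a keyword such as 'category', so it only answers "which code appears last".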
You could create arrays to hold 1) the first set of words to find, 2) their resulting positions, 3) the second set of words to find, 4) their positions, and 5) possibly the distances for the various combinations. You will have to come up with rules for which matches to ignore.
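One possible rule, sketched below with countw() and scan() instead of arrays: walk the words from the end of the string, and accept the first code that has one of the keywords within the 5 words before it. The data set names sample and want, the variable last_code, the keyword list, the 5-word window, and the delimiters ' .,' are all assumptions for illustration and would need adjusting to the real data.

```sas
/* Hypothetical sample data, shortened from the texts posted in this thread */
data sample;
  length text $500;
  text = 'Lines classify and is previously assigned to AAA. currently assigned category is CCC. This is reviewed by a department.';
  output;
  text = 'This is an example of some text that does not AAA have any match.';
  output;
run;

data want;
  set sample;
  length last_code $3 word $50;
  nwords = countw(text, ' .,');
  /* Walk the words from the end; stop as soon as a code has been accepted */
  do i = nwords to 1 by -1 while (last_code = '');
    word = scan(text, i, ' .,');
    if word in ('AAA', 'BBB', 'CCC') then do;
      /* Accept the code only if a keyword occurs in the 5 words before it */
      do j = i - 1 to max(1, i - 5) by -1;
        if lowcase(scan(text, j, ' .,')) in
           ('category', 'categorized', 'classify', 'classified', 'classification', 'follow-up')
        then do;
          last_code = word;
          leave;
        end;
      end;
    end;
  end;
  drop i j word nwords;
run;
```

With these two sample rows, last_code should come out as CCC for the first row and stay blank for the second, because the lone AAA there has no keyword within the window. A code that fails the proximity check is skipped and the scan continues toward the start of the string, which is one interpretation of the requirement rather than the only one.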