Hello,
I'm looking for help extracting a sentence that contains a specific word from a string (a character variable in SAS).
Say I'm looking for the word "experience" in a review and want to extract only the sentences that use it.
Input: Variable named comment has text "The worst experience I've ever had with them, especially after being with them for over a decade.I will not be recommending this gym and I will be contacting corporate as well as the Better Business Bureau."
Output: Variable named extract containing the sentence that has the word experience: "The worst experience I've ever had with them, especially after being with them for over a decade"
Kindly help.
Regards,
Bhuvana
Accepted Solutions
Here's one approach.
data example;
   comment = "The worst experience I've ever had with them, especially after being with them for over a decade.I will not be recommending this gym and I will be contacting corporate as well as the Better Business Bureau.";
   length extract sentence $200.;
   do i = 1 to countw(comment, '.');
      sentence = scan(comment, i, '.');
      if findw(sentence, 'experience', ' .,/', 'i') > 0 then do;
         extract = catt(sentence, '.');
         output;
      end;
   end;
   drop sentence i;
run;
You do not specify what to do if the target word occurs in two or more sentences. The above loop would create a separate record for each sentence found.
If you think you have other "sentence" delimiters involved, such as ;, add them to the delimiter list in both the COUNTW and SCAN calls.
The CATT call puts back the period that SCAN removes from the sentence.
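As a rough sketch of the delimiter point above, the variation below treats '.', '!', '?' and ';' as sentence ends. The sample comment text and the choice of delimiters are assumptions made for illustration only; adjust the list to match your own data. Note that whatever character ended the sentence is replaced by a period in the result, because SCAN discards the delimiter.

```sas
data example;
   /* Sample text made up for this illustration */
   comment = "What an experience! Staff were rude; the experience was poor. Never again.";
   length extract sentence $200;
   /* The same delimiter list must be given to COUNTW and SCAN */
   do i = 1 to countw(comment, '.!?;');
      /* STRIP removes the blank left over after the previous delimiter */
      sentence = strip(scan(comment, i, '.!?;'));
      if findw(sentence, 'experience', ' .,/', 'i') > 0 then do;
         extract = catt(sentence, '.');
         output;
      end;
   end;
   drop sentence i;
run;
```

With this sample text the loop should write one record for "What an experience." and one for "the experience was poor."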
Thanks for quick help. Have a good weekend!
I have another request: how can I put all such sentences in one record? I want to append those sentences to the output variable extract instead of having one record per sentence.
Thanks again!
Here is a modified example.
data example;
   comment = "The worst experience I've ever had with them. The worst service in a decade. Because of this experience I will not be recommending this gym.";
   length extract sentence $200.;
   do i = 1 to countw(comment, '.');
      sentence = scan(comment, i, '.');
      if findw(sentence, 'experience', ' .,/', 'i') > 0 then do;
         extract = catx(' ', extract, catt(sentence, '.'));
      end;
   end;
   drop sentence i;
run;
Note that I did change the original comment.
In practice, the length assigned to extract should probably match the length of the comment variable, since every sentence in a comment could contain the target word.
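One way to keep the two lengths in step, sketched here under assumptions: the data set names reviews and extracted, the $500 length, and the second sample comment are all hypothetical. The sketch relies on the SAS rule that a character variable created by an assignment from another character variable takes that variable's length, so extract never has to be sized by hand.

```sas
/* Hypothetical input: one review per row in a variable named comment */
data reviews;
   length comment $500;
   comment = "The worst experience I've ever had with them. The worst service in a decade. Because of this experience I will not be recommending this gym.";
   output;
   comment = "Great gym. No complaints.";
   output;
run;

data extracted;
   set reviews;
   length sentence $500;
   /* First reference to extract: it is created with the length of comment */
   extract = comment;
   /* Start each row with an empty result */
   extract = ' ';
   do i = 1 to countw(comment, '.');
      sentence = scan(comment, i, '.');
      if findw(sentence, 'experience', ' .,/', 'i') > 0 then
         extract = catx(' ', extract, catt(sentence, '.'));
   end;
   drop sentence i;
run;
```

Rows with no matching sentence, such as the second sample review, are kept with a blank extract.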
I have no more than 10 occurrences in the string, and I would like to have them all in one character variable instead of one record per occurrence.
Say the input variable comment has "The worst experience I've ever had with them, especially after being with them for over a decade.I will not be recommending this gym and I will be contacting corporate as well as the Better Business Bureau.
I hope my experience can be of help."
The expected output variable extract should have "The worst experience I've ever had with them, especially after being with them for over a decade.I hope my experience can be of help."
Thanks again!
Thanks a lot! That helped!!
Hi ballardw,
I experimented a bit with the code and noticed that if I write something like "very bad experience!", SAS won't output it, but if I remove the exclamation mark after the word "experience", SAS outputs it. So the code looks for the exact word "experience". Is it possible to make the code search for the presence of a string, even if it is part of a bigger string?
Thank you!
It comes down to delimiter choices: add or remove them as needed. I used only periods for the SCAN part; you could add any character you want, but make sure it really is a "sentence" delimiter in your data, where a sentence is whatever group of words you want to treat as one.
Note that this approach has a potentially more complex problem. "My experience meant I would like to rate it 4.5 but the entry window would not allow that" would require additional steps to identify whether the . in 4.5 is actually a sentence end. Likewise, for "For a price of 53.67 the experience was too expensive and I won't return", the extracted sentence would start with "67 the experience ...". Freeform language is rife with other such issues. Checking for a pattern like digit.digit might fix some cases, but what about the guy who writes $.02 for "two cents worth"? Or an email address?
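A minimal sketch of the digit.digit idea, not a complete fix: it hides any period that sits between two digits behind a placeholder before splitting, then restores it afterwards. The '~' placeholder is an assumption (choose a character that never occurs in your text), the sample comment is adapted from the example above, and cases such as $.02 or email addresses are still not handled.

```sas
data example;
   comment = "For a price of 53.67 the experience was too expensive. I won't return.";
   length extract sentence masked $200;
   /* Replace a period between two digits with '~' so SCAN does not split on it */
   masked = prxchange('s/(\d)\.(?=\d)/$1~/', -1, comment);
   do i = 1 to countw(masked, '.');
      sentence = scan(masked, i, '.');
      if findw(sentence, 'experience', ' .,/', 'i') > 0 then do;
         /* Turn the placeholder back into a period before saving the sentence */
         extract = catt(translate(sentence, '.', '~'), '.');
         output;
      end;
   end;
   drop sentence masked i;
run;
```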
Any text you search for has rules. Pick the appropriate tool: INDEX, FIND, INDEXW, FINDW, and sometimes pray. If the result quality needs to be high, then you often get a human involved, or a much better trained AI than I know how to access.
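To illustrate the difference between the word and substring tools, here is a small comparison, offered as a sketch only. The three sample sentences and the variable names word_hit and substr_hit are made up for the example. FINDW matches only when the word is bounded by the listed delimiters (so '!' and '?' have to be in the list for "experience!" to count), while FIND matches the string anywhere, including inside a longer word.

```sas
data example;
   length sentence $200;
   do sentence = "very bad experience!", "An inexperienced trainer", "Nothing to report";
      /* Whole-word match: '!', '?', ';' and ':' added to the delimiter list */
      word_hit = findw(sentence, 'experience', ' .,/!?;:', 'i') > 0;
      /* Substring match: also true when the text is part of a bigger word */
      substr_hit = find(sentence, 'experience', 'i') > 0;
      output;
   end;
run;
```

Here the first sentence should be flagged by both tests, the second only by substr_hit, and the third by neither, which shows why a plain substring search can pull in sentences you did not intend.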