10-31-2013 05:16 PM
Hello everyone. I have a database that has strings of comments. In this database, many of the comments are actually supposed to be the same comment.
The client missed an application of order 17.
Client Missed application order 17.
Client's missed app order #17.
All of the above comments mean the same thing.
I know I can strip punctuation and use the COMPBL function to collapse repeated blanks into single spaces, but does anyone know of a technique or function in Base SAS that will attempt to recognize that all of these strings are the same thing?
I'm thinking some kind of ranking system that says "comment A is 80% related to comment B", etc... I have never done this type of text mining before so I honestly don't know where to start!
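As a rough illustration of the "80% related" idea, here is a minimal Python sketch. Python rather than Base SAS is an assumption on my part, chosen only because its standard library ships a string matcher. It strips punctuation, collapses whitespace much as COMPBL would, and scores each pair of comments with difflib.SequenceMatcher. The names normalize and similarity are hypothetical, and the score is purely character-level, so treat it as a starting point rather than a real measure of meaning.

```python
import re
import string
from difflib import SequenceMatcher

def normalize(comment):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    no_punct = comment.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()

def similarity(a, b):
    """Return a 0-100 character-level similarity score for two comments."""
    return 100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()

comments = [
    "The client missed an application of order 17.",
    "Client Missed application order 17.",
    "Client's missed app order #17.",
]

# Score every unordered pair once.
for i in range(len(comments)):
    for j in range(i + 1, len(comments)):
        print(f"comment {i} vs comment {j}: {similarity(comments[i], comments[j]):.0f}%")
```

Pairs that score high are candidates for being "the same comment"; where to draw the cutoff has to be tuned on your own data.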
If this is not a task that I should be trying in SAS, please let me know! (I've heard that R can do this type of thing well; however, in my VERY limited experience, R was not very helpful.)
Thanks for your time.
11-02-2013 12:37 PM
Have a look at the "spelling distance" function SPEDIS()
and test the results!
I found values under 10 were very close and results above 20 were "unreliable",
but your data will produce different results.
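To make the idea of a spelling distance concrete, here is a small Python sketch of a plain Levenshtein edit distance, scaled by the length of the first string. This is a stand-in and not SPEDIS itself: as I understand it, SPEDIS applies its own costs to different kinds of edits, which this sketch does not reproduce, so the "under 10" and "above 20" thresholds above will not carry over directly. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance; every edit costs 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]

def scaled_distance(query, keyword):
    """Edit distance as a percentage of the query length (0 means identical)."""
    return 100 * levenshtein(query, keyword) / max(len(query), 1)

print(scaled_distance("application", "aplication"))
print(scaled_distance("application", "borrower"))
```

Lower is closer, as with SPEDIS, but any cutoff should be checked against a sample of your own comments.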
11-04-2013 10:46 AM
Thank you for the suggestions. I will look into these functions some more. I have a feeling this process just isn't going to work, even with text mining, given the structure of the request and the data that we have...
I will keep you posted on which one of these functions I find more helpful!
11-04-2013 10:55 AM
Take a look at the approach I used as part of a presentation I recently did at the MWSUG meeting (see Expert Panel Solution MWSUG 2013-Tabachneck - sasCommunity). I had the most success using the COMPGED function.
To use the same approach, you would simply have to create a second file that contains the strings you are interested in matching, compare each record with all of those strings, and capture the ones that have the lowest scores.
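The lookup described above (score each record against every reference string and keep the lowest score) can be sketched in Python as follows. This is only an illustration under assumptions: the distance here is 1 minus difflib's similarity ratio, not COMPGED's generalized edit distance, and the reference strings and the name best_match are made up for the example.

```python
from difflib import SequenceMatcher

def best_match(comment, reference_strings):
    """Return (reference, score) for the reference with the lowest distance score."""
    def distance(a, b):
        # 0.0 means identical, 1.0 means nothing in common.
        return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

    score, reference = min((distance(comment, ref), ref) for ref in reference_strings)
    return reference, score

# Hypothetical "second file" of strings you want to match against.
references = [
    "client missed application order 17",
    "could not contact borrower",
]

print(best_match("Client's missed app order #17.", references))
```

In SAS the same shape would be a cross join of the two tables with the score computed per pair, keeping the minimum per record.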
11-04-2013 11:01 AM
Thanks Arthur! I think herein lies the problem.
My task is to actually look at a bunch of user responses to comments, and to FIND the values that they should have answered (each response is supposed to be logically equivalent to similar responses) so that I can create a "correct" mapping table.
For example, user comments can look like below.
"A decision had not been made up through the review period."
"A loan decision was not approved during the scope of this review."
"A loss mitigation decision was not made on this account."
"A modification was not approved."
"No decision was made."
"No evidence of contact with borrower, as borrower’s phone was noted as disconnected."
"Couldn't get a hold of borrower"
"Did not respond to questioning"
All of these responses were to the same question. Note that for this one question there are about 190 distinct written answers. However, there are really only two distinct possibilities: one, the process was never done on the account (answer 1), or two, they couldn't get a hold of the borrower (answer 2).
Now the trick to this is, I have NO IDEA how many distinct responses there are supposed to be per question, and there are thousands of questions.
Given that, I don't think your approach would work, but please correct me if I misinterpreted you!
11-04-2013 11:09 AM
Then I agree with Fareeza .. either text miner (using clustering), Content Categorization Studio, or both, might be your best approach. I would repost your question in that forum.
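For a feel of what clustering does when the number of groups is not known in advance, here is a deliberately simple Python sketch. It is not how Text Miner works; it is a greedy pass that compares word overlap (Jaccard similarity) against the first comment of each existing cluster and opens a new cluster when nothing is close enough. The threshold of 0.5 and all function names are assumptions for illustration. Because it only looks at shared words, it will not connect synonyms such as "decision" and "modification".

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Share of distinct words two comments have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0

def greedy_cluster(comments, threshold=0.5):
    """Put each comment in the first cluster whose seed is similar enough,
    otherwise start a new cluster. The cluster count is not fixed up front."""
    clusters = []  # each cluster is a list; element 0 is its seed
    for comment in comments:
        for cluster in clusters:
            if jaccard(comment, cluster[0]) >= threshold:
                cluster.append(comment)
                break
        else:
            clusters.append([comment])
    return clusters

sample = [
    "No decision was made.",
    "A decision was not made.",
    "Couldn't get a hold of borrower",
]
for cluster in greedy_cluster(sample):
    print(cluster)
```

The result depends on the order of the input and on the threshold, which is one reason a proper text analytics tool is the better fit here.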
11-04-2013 11:19 AM
I'll second Reeza and Arthur. You need some form of Text Analytics here.
Sure, you could try things like FIND() and build some rules based on various terms. Here is the logic as a single data step statement (comment is a placeholder variable name):
if find(upcase(comment), "DECISION") or find(upcase(comment), "REVIEW") then comment_category = 1;
(I'd recommend upcasing your text before searching it.)
There are big limitations and caveats with approaches like these, such as spelling mistakes and synonyms, which Text Analytics easily addresses.
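The same keyword-rule idea, sketched in Python so the limitation is easy to see. The categories and keywords in RULES are hypothetical; only DECISION and REVIEW come from the rule above.

```python
RULES = {  # hypothetical category -> keywords that trigger it
    1: ("DECISION", "REVIEW"),
    2: ("BORROWER", "RESPOND"),
}

def categorize(comment, rules=RULES):
    """Return the first category with a keyword in the upcased comment, else None."""
    text = comment.upper()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None

for comment in [
    "No decision was made.",
    "Couldn't get a hold of borrower",
    "A modification was not approved.",
]:
    print(comment, "->", categorize(comment))
```

The third comment falls through uncategorized because it uses "modification" instead of "decision", which is exactly the synonym problem mentioned above.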
11-04-2013 11:22 AM
Yeah I agree. Unfortunately I do not know of any open source text analytics programs that I would be able to pick up and develop, without doing it from scratch.
I have looked into them, and the only one I ever find is R. I have tried this type of task in R before and frankly it was not that useful (although I could have easily been doing things incorrectly).
11-04-2013 11:28 AM
R has a text mining package. This helps you so that you don't have to "start from scratch".
I've used Python NLTK for various text analytics processing. You can always use more than one tool. SAS can be used up to a point, then export the data for import into an R or Python environment where you can process the text, then export to a format to import back into SAS.
If the work you are doing is meant to be an ongoing thing, then coming up with this more automated process is worth the effort.
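A minimal sketch of the middle step of that round trip, assuming the comments were exported from SAS as a CSV file with a column named comment (both the column name and the comment_clean output column are assumptions). It uses only the standard csv module rather than NLTK, and in-memory text stands in for the real files.

```python
import csv
import io

def add_clean_column(source, target, text_field="comment"):
    """Copy CSV rows from source to target, adding an upcased,
    single-spaced 'comment_clean' column."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames) + ["comment_clean"]
    writer = csv.DictWriter(target, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["comment_clean"] = " ".join(row[text_field].upper().split())
        writer.writerow(row)

# Stand-in for a file exported from SAS; real code would open files on disk.
exported = io.StringIO("question_id,comment\n1,No  decision was made.\n")
result = io.StringIO()
add_clean_column(exported, result)
print(result.getvalue())
```

The output can then be imported back into SAS, with the heavier text processing swapped in where the cleaning line is.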
11-04-2013 11:32 AM
Yeah, I do not like R's text mining package for this analysis.
The only thing I was able to get R to do is to find like words within strings after removing key words. I don't find that terribly helpful in this circumstance because you lose the original wording of the strings. This doesn't help me, as I don't care what words are commonly found across observations, as the actual sentence that is a logical group of these words is what I am after.
Perhaps I am looking into things incorrectly, however, so I will take a peek into R again. If you have a suggestion of specific packages within it, please let me know.
Sorry, to clarify my above answer, I need text mining by groups. That is, the responses for variable Group A need to be grouped differently than the responses for variable Group B (even though the answers will often be the same).
I basically need a ranking of "this is the most common answer for Group A, second most common, etc.." and I need it repeated for every group, solving for the questions I listed above.
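Once each response has been mapped to a canonical answer, the per-group ranking itself is the easy part. A Python sketch, assuming the data arrives as (group, answer) pairs that are already normalized; the function name and sample rows are hypothetical.

```python
from collections import Counter, defaultdict

def rank_answers_by_group(rows):
    """Map each group to its answers, ordered from most to least frequent."""
    counts = defaultdict(Counter)
    for group, answer in rows:
        counts[group][answer] += 1
    return {group: counter.most_common() for group, counter in counts.items()}

rows = [  # hypothetical, already-normalized data
    ("A", "no decision"),
    ("A", "no decision"),
    ("A", "no contact"),
    ("B", "no contact"),
]
for group, ranking in rank_answers_by_group(rows).items():
    print(group, ranking)
```

No manual splitting of groups is needed, since the grouping key does that work. The hard part remains the mapping from free text to canonical answers.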
The issue is I can't manually split out these groups as there are thousands of them, so I am not sure how to do this with R especially. Again I will continue to look into it however.
Within Base SAS this does not seem possible unless I were to make a distinct list of every single answer per group, compare that to all other groups, and look for the one that has the highest correlation with all other groups; that would then be the distinct "best value", etc. I might actually go down this route unless anyone sees anything wrong with it.
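One possible reading of that route, sketched in Python: within a single group, take the distinct answers and pick the one with the highest average similarity to all the others. Reading "correlation" as average character-level similarity within the group is my assumption, as is the use of difflib; the function name representative is hypothetical and it expects a non-empty list.

```python
from difflib import SequenceMatcher

def representative(answers):
    """Return the distinct answer with the highest mean similarity
    to the other distinct answers in the same group."""
    distinct = sorted(set(answers))
    if len(distinct) == 1:
        return distinct[0]

    def mean_similarity(candidate):
        others = [a for a in distinct if a != candidate]
        total = sum(
            SequenceMatcher(None, candidate.lower(), other.lower()).ratio()
            for other in others
        )
        return total / len(others)

    return max(distinct, key=mean_similarity)

group_answers = [  # hypothetical answers for one group
    "no decision was made",
    "no decision was made.",
    "no decision made",
    "borrower phone disconnected",
]
print(representative(group_answers))
```

One thing to watch: this yields a single "best value" per group, so a group that really has two distinct answers, like the example earlier in the thread, would need to be clustered first and then have a representative chosen per cluster. The pairwise comparison also grows quadratically with the number of distinct answers.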