04-26-2013 01:10 PM
Hello everyone. I have a distinct problem and am hoping someone has a cool trick to get around it.
Basically I have a dataset which has long comment data (2,000 - 31,000 bytes. I cannot split up the comments) and I have been tasked for scanning through the dataset for "common" answers to problems.
If a common value is found, I am tasked with populating a new column with a default answer (another long comment).
I am curious as to what the most efficient way to do this would be. I Know I could use a simple merge, but a merge on a 2,000 + string for millions of comments seems extremely inefficient.
Below is an example of a comment which I need to map too, and the new column I would want added (Comment is very small here, the smallest in my dataset is 1,000 chars).
Key Comment Answer
1 "The client that responded to the request did not specify the correct documentation, and pursuant to the BLAH BLAH BLAH, so on and so forth, more legal jargin...." "We disagree"
Thanks! Let me know if anyone needs any further clarification.
04-26-2013 05:32 PM
I would recommend to create a temporary numeric variable which may contain unique number for each unique answer. You may start using the new numeric variable instead of charcater variable!!
04-26-2013 08:16 PM
Such a number would be provided by function MD5 but I suspect that you need more sophisticated rules for identifying "types" of comments. You will certainly need a large training set for each default answer.
04-26-2013 09:46 PM
I don't think you are looking for a join but rather finding common expressions in the comments. These expressions may be a single word or a combination. In the example you gave, both "correct documentation" and "disagree" might be such key expressions. You could search for them in each comment using one of the index functions in SAS. You need to bear in mind that "incorrect documentation" and "correct doco" might also mean the same in context, and similarly "not agree". And then there are innumerable ways of misspelling "documentation" and so on.
One way to deal with this might be to output all the comments to a text file and then use external software to analyse word frequency, discarding results for insignificant words like "the" and "and". You could do this in SAS using the scan function to identify and output individaul words and then run it through Proc Freq or Proc means with 'N' specified, to get the frequency.
04-29-2013 10:20 AM
Hello all. While I understand the point you are all trying to make, the issue with any kind of text analytics (which I have worked with briefly for a few months) is that these are boiler plate legal representations, and text analytics can't differentiate the answers from eachother. This is combined with the fact that the people entering these criteria are NOT always doing so correctly, and the cost of identifying a comment as as "boiler" plate when it is truly not has to be 0%.
Comment should be:
"WE reserve the right to disagree: No additional Comments".
So theoritically the part "we reserve the right to disagree" could be removed and we could just scan for the words, no additional comments. HOWEVER, specific people are inputting things like
"WE reserve the right to disagree: No additional Comments, Your company did not provide documentation #1951". And these two comments are distinct, and even though the second one contains no additional comments, it is actually incorrect, and thus cannot be flagged as such.
I am admitally very new to text mining, so would there be something you could do to distinctfy issues like what I have listed above.
04-27-2013 03:20 AM
Sound to me like you would need SAS Text Analytics (eg. Content Categorization and/or Sentiment Analysis) Text Analytics | SAS.
Without some specialised software dealing with unstructured data it will be very tough to achieve what you're asking for.
04-27-2013 04:09 AM
Maybe you can make a window whose size is one word ,two words ........... move it through the whole comment, and find them the frequency in the entire dataset.
And better use some complev() function to calculate the spell distance.
04-29-2013 05:28 PM
I'd like to point out that apparently this problem is resolved completely in Sql server, as it can join on millions of rows of data, of the varchar type with lengths > 2,000 in a manner of seconds.
As thus, I went through and created a stored proc to do this, transferred my data to sql server and completed the task there.
Thanks everyone for the help and advice however!