Join on Very long comment data?

Reply
Super Contributor
Posts: 418

Join on Very long comment data?

Hello everyone. I have a distinct problem and am hoping someone has a cool trick to get around it.

Basically I have a dataset which has long comment data (2,000 - 31,000 bytes. I cannot split up the comments) and I have been tasked for scanning through the dataset for "common" answers to problems.

If a common value is found, I am tasked with populating a new column with a default answer (another long comment).

I am curious as to what the most efficient way to do this would be.  I Know I could use a simple merge, but a merge on a 2,000 + string for millions of comments seems extremely inefficient.

Below is an example of a comment which I need to map too, and the new column I would want added (Comment is very small here, the smallest in my dataset is 1,000 chars).

Key     Comment                                                                                                                                                                                                                                                                                               Answer

1          "The client that responded to the request did not specify the correct documentation, and pursuant to the BLAH BLAH BLAH, so on and so forth, more legal jargin...."              "We disagree"

Thanks! Let me know if anyone needs any further clarification.

Brandon

Occasional Contributor
Posts: 7

Re: Join on Very long comment data?

I would recommend to create a temporary numeric variable which may contain unique number for each unique answer. You may start using the new numeric variable instead of charcater variable!!Smiley Happy

Respected Advisor
Posts: 4,651

Re: Join on Very long comment data?

Such a number would be provided by function MD5 but I suspect that you need more sophisticated rules for identifying "types" of comments. You will certainly need a large training set for each default answer.

PG

PG
Super Contributor
Posts: 644

Re: Join on Very long comment data?

@Brandon

I don't think you are looking for a join but rather finding common expressions in the comments.  These expressions may be a single word or a combination.  In the example you gave, both "correct documentation" and "disagree" might be such key expressions.  You could search for them in each comment using one of the index functions in SAS.  You need to bear in mind that "incorrect documentation" and "correct doco" might also mean the same in context, and similarly "not agree".  And then there are innumerable ways of misspelling "documentation" and so on. 

One way to deal with this might be to output all the comments to a text file and then use external software to analyse word frequency, discarding results for insignificant words like "the" and "and".  You could do this in SAS using the scan function to identify and output individaul words and then run it through Proc Freq or Proc means with 'N' specified, to get the frequency.

Richard

Super Contributor
Posts: 418

Re: Join on Very long comment data?

Hello all. While I understand the point you are all trying to make, the issue with any kind of text analytics (which I have worked with briefly for a few months) is that these are boiler plate legal representations, and text analytics can't differentiate the answers from eachother. This is combined with the fact that the people entering these criteria are NOT always doing so correctly, and the cost of identifying a comment as as "boiler" plate when it is truly not has to be 0%.

Example:

Comment should be:

"WE reserve the right to disagree: No additional Comments".

So theoritically the part "we reserve the right to disagree" could be removed and we could just scan for the words, no additional comments. HOWEVER, specific people are inputting things like

"WE reserve the right to disagree: No additional Comments, Your company did not provide documentation #1951". And these two comments are distinct, and even though the second one contains no additional comments, it is actually incorrect, and thus cannot be flagged as such.

I am admitally very new to text mining, so would there be something you could do to distinctfy issues like what I have listed above.

Thanks!

Brandon 

"

Respected Advisor
Posts: 3,894

Re: Join on Very long comment data?

Sound to me like you would need SAS Text Analytics (eg. Content Categorization and/or Sentiment Analysis) Text Analytics | SAS.

Without some specialised software dealing with unstructured data it will be very tough to achieve what you're asking for.

Super User
Posts: 9,681

Re: Join on Very long comment data?

Maybe you can make a window whose size is one word ,two words ........... move it through the whole comment, and find them the frequency in the entire dataset.

And better use some complev() function to calculate the spell distance.

Ksharp

Super Contributor
Posts: 418

Re: Join on Very long comment data?

I'd like to point out that apparently this problem is resolved completely in Sql server, as it can join on millions of rows of data, of the varchar type with lengths > 2,000 in a manner of seconds.

As thus, I went through and created a stored proc to do this, transferred my data to sql server and completed the task there.

Thanks everyone for the help and advice however!

Ask a Question
Discussion stats
  • 7 replies
  • 308 views
  • 0 likes
  • 6 in conversation