BookmarkSubscribeRSS Feed
Anotherdream
Quartz | Level 8

Hello everyone. I have a distinct problem and am hoping someone has a cool trick to get around it.

Basically I have a dataset which has long comment data (2,000 - 31,000 bytes. I cannot split up the comments) and I have been tasked for scanning through the dataset for "common" answers to problems.

If a common value is found, I am tasked with populating a new column with a default answer (another long comment).

I am curious as to what the most efficient way to do this would be.  I Know I could use a simple merge, but a merge on a 2,000 + string for millions of comments seems extremely inefficient.

Below is an example of a comment which I need to map too, and the new column I would want added (Comment is very small here, the smallest in my dataset is 1,000 chars).

Key     Comment                                                                                                                                                                                                                                                                                               Answer

1          "The client that responded to the request did not specify the correct documentation, and pursuant to the BLAH BLAH BLAH, so on and so forth, more legal jargin...."              "We disagree"

Thanks! Let me know if anyone needs any further clarification.

Brandon

7 REPLIES 7
ChandraReddy
Calcite | Level 5

I would recommend to create a temporary numeric variable which may contain unique number for each unique answer. You may start using the new numeric variable instead of charcater variable!!Smiley Happy

PGStats
Opal | Level 21

Such a number would be provided by function MD5 but I suspect that you need more sophisticated rules for identifying "types" of comments. You will certainly need a large training set for each default answer.

PG

PG
RichardinOz
Quartz | Level 8

@Brandon

I don't think you are looking for a join but rather finding common expressions in the comments.  These expressions may be a single word or a combination.  In the example you gave, both "correct documentation" and "disagree" might be such key expressions.  You could search for them in each comment using one of the index functions in SAS.  You need to bear in mind that "incorrect documentation" and "correct doco" might also mean the same in context, and similarly "not agree".  And then there are innumerable ways of misspelling "documentation" and so on. 

One way to deal with this might be to output all the comments to a text file and then use external software to analyse word frequency, discarding results for insignificant words like "the" and "and".  You could do this in SAS using the scan function to identify and output individaul words and then run it through Proc Freq or Proc means with 'N' specified, to get the frequency.

Richard

Anotherdream
Quartz | Level 8

Hello all. While I understand the point you are all trying to make, the issue with any kind of text analytics (which I have worked with briefly for a few months) is that these are boiler plate legal representations, and text analytics can't differentiate the answers from eachother. This is combined with the fact that the people entering these criteria are NOT always doing so correctly, and the cost of identifying a comment as as "boiler" plate when it is truly not has to be 0%.

Example:

Comment should be:

"WE reserve the right to disagree: No additional Comments".

So theoritically the part "we reserve the right to disagree" could be removed and we could just scan for the words, no additional comments. HOWEVER, specific people are inputting things like

"WE reserve the right to disagree: No additional Comments, Your company did not provide documentation #1951". And these two comments are distinct, and even though the second one contains no additional comments, it is actually incorrect, and thus cannot be flagged as such.

I am admitally very new to text mining, so would there be something you could do to distinctfy issues like what I have listed above.

Thanks!

Brandon 

"

Patrick
Opal | Level 21

Sound to me like you would need SAS Text Analytics (eg. Content Categorization and/or Sentiment Analysis) Text Analytics | SAS.

Without some specialised software dealing with unstructured data it will be very tough to achieve what you're asking for.

Ksharp
Super User

Maybe you can make a window whose size is one word ,two words ........... move it through the whole comment, and find them the frequency in the entire dataset.

And better use some complev() function to calculate the spell distance.

Ksharp

Anotherdream
Quartz | Level 8

I'd like to point out that apparently this problem is resolved completely in Sql server, as it can join on millions of rows of data, of the varchar type with lengths > 2,000 in a manner of seconds.

As thus, I went through and created a stored proc to do this, transferred my data to sql server and completed the task there.

Thanks everyone for the help and advice however!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1813 views
  • 0 likes
  • 6 in conversation