topic Re: Best approach to classifying records based on manually entered text description. in SAS Data Science

Best approach to classifying records based on manually entered text description.

CurtisMackWSIPP — Thu, 14 Jan 2021 17:01:17 GMT

I have been writing traditional SAS code to classify building permit records according to whether they are new residential construction or demolitions. There are a couple of categorical fields that help, but most of the information comes from a long, manually entered text description. Up until now I have been writing code that searches that string for keywords and phrases and makes decisions based on those.

I am thinking there must be a better approach to this problem using Enterprise Miner or some other tools. Something where I can manually classify some records until some sort of machine learning algorithm is trained to make the decisions for me. I process about 50 files a year with 10,000 to 100,000 records each.

Does anybody have suggestions on how I should approach this? Maybe a paper?

Thanks!

Re: Best approach to classifying records based on manually entered text description.

Reeza — Thu, 14 Jan 2021 18:47:12 GMT

I think you'll need Text Miner to help with processing the text. Another option, if you know the types of words you're looking for, is to do a word analysis: take the keywords you've already identified, use them as predictors in a basic logistic regression model, and treat that as your 'baseline'. Any model from there on should give you better results than that baseline. This can very much be an ML problem, though.
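To make the baseline idea concrete, here is a minimal sketch in plain Python rather than SAS, only to show the mechanics: each keyword becomes a 0/1 indicator and a logistic regression is fit by gradient descent. The keyword list, the sample descriptions, and the function names are all made up for illustration; they are not from the original post, and in practice a SAS procedure or a library would do the fitting.

```python
import math

# Hypothetical keywords; in practice, use the ones already identified by hand.
KEYWORDS = ["new", "single family", "demo", "remodel"]

def featurize(text):
    """Return 0/1 indicators: 1.0 if the keyword occurs in the description.
    Note this is a naive substring match (e.g. "new" also matches "renewal")."""
    lowered = text.lower()
    return [1.0 if kw in lowered else 0.0 for kw in KEYWORDS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by plain batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(text, w, b):
    """Estimated probability that a description is new residential construction."""
    x = featurize(text)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Made-up, manually labeled examples: 1 = new residential construction, 0 = not.
labeled = [
    ("New single family residence with attached garage", 1),
    ("Construct new single family dwelling", 1),
    ("New SFR, 3 bedroom", 1),
    ("Demo existing garage", 0),
    ("Kitchen remodel and new cabinets", 0),
    ("Demolish and remodel bathroom", 0),
    ("Reroof existing house", 0),
]

X = [featurize(text) for text, _ in labeled]
y = [label for _, label in labeled]
w, b = fit_logistic(X, y)
```

Calling `predict_proba("New single family home", w, b)` then gives a score you can threshold, and the accuracy of this model on held-out labeled records is the number any fancier model has to beat.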

Re: Best approach to classifying records based on manually entered text description.

fierceanalytics — Fri, 15 Jan 2021 15:48:15 GMT

Hello,

You can search https://www.lexjansen.com/ for papers related to this topic. Document classification can be the end purpose in itself, mainly for data management; another popular use is prediction. There is a text mining procedure in SAS HPA you can consider. Its primary benefit is that it manages the intermediate datasets better than Text Miner does. If you are coming to SAS from a more open-source background, Viya is a better place for you to start.

Jia

Re: Best approach to classifying records based on manually entered text description.

AnnKuo — Thu, 21 Jan 2021 20:36:20 GMT

If you are licensed for SAS Text Miner, check out the Text Rule Builder node, which creates Boolean rules from small subsets of terms to predict a categorical target variable. The node generates an ordered set of rules that together are useful in describing and predicting the target. There is an example in the SAS Text Miner 15.2 documentation that shows how to predict a categorical target variable using this node.
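As a rough illustration of what an ordered set of Boolean rules looks like when it is applied, here is a small Python sketch. It does not reproduce how the Text Rule Builder node learns its rules; it only assumes that each rule lists terms that must be present, terms that must be absent, and a predicted label, and that the first matching rule wins. The rules, labels, and function names are hypothetical.

```python
import re

# Each rule: (terms that must all be present, terms that must all be absent, label).
# These rules are hypothetical; a real set would be generated from labeled data.
RULES = [
    ({"demolition"}, set(), "demolition"),
    ({"demo"}, {"remodel"}, "demolition"),
    ({"new", "residence"}, {"addition"}, "new_residential"),
]
DEFAULT_LABEL = "other"  # returned when no rule fires

def tokenize(text):
    """Lowercase the description and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(text):
    """Apply the rules in order and return the label of the first match."""
    tokens = tokenize(text)
    for required, forbidden, label in RULES:
        if required <= tokens and not (forbidden & tokens):
            return label
    return DEFAULT_LABEL
```

Because the rules are ordered, a description such as "New addition to existing residence" falls through the third rule (it contains the excluded term "addition") and ends up with the default label.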

Also, in the following SAS Global Forum paper

Classifying and Predicting Spam Messages using Text Mining in SAS® Enterprise Miner™

five other predictive models, including memory-based reasoning (MBR), logistic regression, decision tree, random forest, and neural network, were built and their performance was compared with the Text Rule Builder model. The best model was then used to classify messages as spam or ham (non-spam).

Hope this helps!

-Ann