In SAS Enterprise Miner Workstation 13.2, I'm using some Text Mining nodes to build Text Topics.
However, I noticed lots of phrases and tokens that I would like filtered out of the data before analysis. Examples include HTML tags such as "<p>", and boilerplate text such as "This description was written by the Martin Group." I tried adding these to the list of stop words, but that didn't help: the terms still appeared in the created topics.
Is there a way to filter out multi-word phrases? And is there a way to filter out regular expressions, such as "This description was written by .*"?
Half the people will recommend doing these transformations before importing the data into EM, and half will recommend doing them in EM.
If I were to do it in EM, I would do it in a Transform Variables node (use the SAS Code ellipsis!), an HP Transform node, or a SAS Code node.
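For the "before importing" route, here is a minimal sketch in Python of what that cleanup could look like. The two patterns and the function name `clean_text` are just illustrations built from the examples in the question, not anything EM provides. Note that the boilerplate pattern stops at the first period instead of using `.*`, on the assumption that real text may follow the boilerplate on the same line.

```python
import re

# Hypothetical patterns based on the examples in the question.
HTML_TAG = re.compile(r"<[^>]+>")
# Stops at the first period (assumption), unlike ".*" which would
# swallow everything up to the end of the line.
BOILERPLATE = re.compile(r"This description was written by [^.]*\.?")

def clean_text(text: str) -> str:
    """Strip HTML tags and boilerplate, then collapse leftover whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = BOILERPLATE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = "<p>Sunny two-bedroom flat.</p> This description was written by the Martin Group."
print(clean_text(raw))
```

If you would rather stay inside EM, the same two substitutions could likely be written in a SAS Code node using the PRXCHANGE function on the text variable before it reaches the Text Parsing node.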
Good luck!
Regular expression matching is very flexible. There is almost certainly a way to do what you describe. But we need something more concrete to suggest good examples. Please give us a list of phrases that you would want to check and what you would expect as a result.
Hi PG Stats,
I think I can handle constructing the regular expressions; that's not a problem. What I was trying to ask is: where do I put them? (Which node, which field?) I couldn't find it.
thanks!