Zachary
Obsidian | Level 7


I am having a lot of fun doing text mining with Enterprise Miner. The course was great, but I have not seen any examples of text mining with a continuous dependent variable. Most of the examples use a binary target, with a few nominal ones.

I have over 1 million documents to use; I am beginning my exploration with just 100,000. I am less interested in clustering the text items or exploring text topics than in isolating particular words and seeing whether they are related to my continuous dependent variable. Obviously I should not use mutual information when weighting my terms, because my target is continuous rather than nominal.

For example, an acronym like "IME", or the three-word phrase it stands for, "independent medical examination", is all I would want to find if it is discriminatory. It actually is: when I use the following code in Enterprise Guide, the differences are fairly striking:

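/* Flag comments containing the string 'IME' (INDEX is case-sensitive and matches substrings) */
data TempData1;
  set T_Text.SRS_TempComments100000;
  if (index(COMMENTTEXT,'IME')>0) then Flag_IME = 1;
  else Flag_IME = 0;
run;

/* Distribution of TOTALRESERVES across all comments */
proc univariate data = TempData1;
  var TOTALRESERVES;
  histogram TOTALRESERVES;
run;

/* Comments that mention IME */
proc univariate data = TempData1;
  var TOTALRESERVES;
  where Flag_IME = 1;
  histogram TOTALRESERVES;
run;

/* Comments that do not mention IME */
proc univariate data = TempData1;
  var TOTALRESERVES;
  where Flag_IME = 0;
  histogram TOTALRESERVES;
run;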

The only unfortunate part would be if "mime" appeared in the comments I am analyzing, but I hope that even with over a million documents there will not be a plethora of mime lovers out there.
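One possible way to guard against that kind of false match is to require a word boundary on each side of the term. Since INDEX is case-sensitive, lowercase "mime" would not actually match 'IME', but an uppercase word such as "TIME" would. The following is only a sketch under assumptions: it reuses the table and column names from the code above, while the output table TempData2 and the variable Flag_IME_Word are hypothetical names chosen for illustration.

```sas
/* Sketch: whole-word, case-insensitive flag. TempData2 and Flag_IME_Word are illustrative names. */
data TempData2;
  set T_Text.SRS_TempComments100000;
  /* \b requires a word boundary on both sides; the trailing i ignores case */
  Flag_IME_Word = (prxmatch('/\b(IME|independent medical examination)\b/i', COMMENTTEXT) > 0);
run;
```

Because the pattern is case-insensitive, a standalone lowercase "ime" would also be flagged; dropping the trailing i would restrict the match to the exact case written in the pattern.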

Please let me know what processes you might suggest, have tried in the past, or consider a "best practice." I am hoping to shortcut the long list of options within Enterprise Miner and really bring focus to this. The only other solution I am currently thinking of is to do the analysis on TOTALRESERVES quartiles, but that still does not help me isolate the words themselves without paying too much attention to the topics or clusters.
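For reference, the quartile idea could be sketched roughly as follows. This assumes the TempData1 table and Flag_IME variable created earlier; the output table TempData1_Q and the variable Reserve_Quartile are hypothetical names, and the chi-square test is just one plausible way to compare the groups, not a recommended practice.

```sas
/* Sketch: bin TOTALRESERVES into quartiles (0 = lowest, 3 = highest). */
/* TempData1_Q and Reserve_Quartile are illustrative names.            */
proc rank data = TempData1 groups = 4 out = TempData1_Q;
  var TOTALRESERVES;
  ranks Reserve_Quartile;
run;

/* Cross-tabulate the word flag against the reserve quartiles */
proc freq data = TempData1_Q;
  tables Flag_IME * Reserve_Quartile / chisq;
run;
```

This turns the continuous target into a four-level nominal one for a single term at a time, so it would still need to be repeated, or looped over a list of terms, to compare many words.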

Thank you very much.
