Text mining and content categorization

Seek Data Mining or Text Analytics Solution With a Continuous Dependent Variable - Just Words

I am having a lot of fun doing text mining with Enterprise Miner. The course was great, but I have not seen any examples of text mining with a continuous dependent variable. Most of the examples use a binary target, and a few use a nominal one.

I have over 1 million documents to use; I am beginning my exploration with just 100,000. I am less interested in clustering the documents or exploring text topics than in isolating particular words and seeing whether they are related to my continuous dependent variable. Obviously I should not use mutual information as the term weight, because my target is continuous rather than nominal.

For example, an acronym like "IME", or the three-word phrase it stands for, "independent medical examination", is all I would want to find, provided it is discriminatory. It actually is: when I use the following code in Enterprise Guide, the differences are fairly striking:

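/* Flag comments containing the string 'IME' (INDEX is a case-sensitive substring search) */
data TempData1;
  set T_Text.SRS_TempComments100000;
  if (index(COMMENTTEXT,'IME')>0) then Flag_IME = 1;
  else Flag_IME = 0;
run;

/* Distribution of TOTALRESERVES across all comments */
proc univariate data = TempData1;
  var TOTALRESERVES;
  histogram TOTALRESERVES;
run;

/* Distribution of TOTALRESERVES for comments that mention IME */
proc univariate data = TempData1;
  var TOTALRESERVES;
  where Flag_IME = 1;
  histogram TOTALRESERVES;
run;

/* Distribution of TOTALRESERVES for comments that do not mention IME */
proc univariate data = TempData1;
  var TOTALRESERVES;
  where Flag_IME = 0;
  histogram TOTALRESERVES;
run;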

The only unfortunate part would be if "mime" were in the comments I am analyzing, but I hope that with over a million documents there will not be a plethora of mime lovers out there.
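Because INDEX is a case-sensitive substring search, 'IME' would also match inside uppercase words such as "TIME" or "MIME". One possible way around this, offered as a minimal sketch rather than a tested solution, is a Perl regular expression with word boundaries via PRXMATCH. The output data set name TempData2 and the decision to ignore case are assumptions made for illustration.

```sas
/* Sketch: flag IME only when it appears as a whole word, or when the    */
/* spelled-out phrase appears. \b marks a word boundary, \s+ allows any  */
/* run of whitespace between words, and the trailing i ignores case.     */
data TempData2;  /* hypothetical output data set name */
  set T_Text.SRS_TempComments100000;
  Flag_IME = (prxmatch('/\b(IME|independent\s+medical\s+examination)\b/i',
                       COMMENTTEXT) > 0);
run;
```

The comparison evaluates to 1 or 0, so Flag_IME keeps the same coding as before. Dropping the i modifier would restrict the match to the uppercase acronym, which may be preferable if lowercase "ime" turns up as a typo or fragment.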

Please let me know what processes you might suggest, have tried in the past, or consider a "best practice." I am hoping to shortcut the long list of options within Enterprise Miner and bring real focus to this. The only other approach I am currently considering is to run the analysis on TOTALRESERVES quartiles, but that still does not help me isolate the words themselves without paying too much attention to topics or clusters.
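For reference, the quartile idea could be sketched roughly as follows, assuming the TempData1 data set and Flag_IME variable created above. The names TempRanked and Reserve_Quartile are hypothetical, and this is only one way to set it up.

```sas
/* Assign each document to a TOTALRESERVES quartile (0 = lowest, 3 = highest) */
proc rank data = TempData1 groups = 4 out = TempRanked;  /* hypothetical name */
  var TOTALRESERVES;
  ranks Reserve_Quartile;                                /* hypothetical name */
run;

/* Row percentages show the share of flagged comments within each quartile */
proc freq data = TempRanked;
  tables Reserve_Quartile * Flag_IME / nocol nopercent chisq;
run;
```

This reduces the continuous target to four bins, so it shows whether flagged comments concentrate in the higher reserve quartiles but discards the variation within each quartile.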

Thank you very much.
