Seek Data Mining or Text Analytics Solution With a Continuous Dependent Variable - Just Words

Zachary · Posted 03-13-2015 05:13 PM

I am having a lot of fun doing the Text Mining with Enterprise Miner. The course was great, but I am not seeing any examples of text mining with a continuous dependent variable. Most of the examples use a binary target, with a few nominal ones.

I have over 1 million documents to use; I am beginning my exploration with just 100,000. I am less interested in clustering the text items, or even in exploring text topics, than in isolating particular words and seeing whether there is a relationship between them and my continuous dependent variable. Obviously I should not use mutual information when weighting my terms, because my dependent variable is not nominal.

For example, an acronym like "IME", or the three-word phrase it stands for, "independent medical examination", would be all I would want to find if it is discriminatory. It actually is: when I use the following code in Enterprise Guide, the differences are fairly striking:

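data TempData1;
    set T_Text.SRS_TempComments100000;
    /* Flag comments that contain the substring 'IME' (INDEX is case-sensitive) */
    if (index(COMMENTTEXT,'IME')>0) then Flag_IME = 1;
    else Flag_IME = 0;
run;

/* Distribution of reserves across all comments */
proc univariate data = TempData1;
    var TOTALRESERVES;
    histogram TOTALRESERVES;
run;

/* Distribution of reserves where the comment mentions IME */
proc univariate data = TempData1;
    var TOTALRESERVES;
    where Flag_IME = 1;
    histogram TOTALRESERVES;
run;

/* Distribution of reserves where the comment does not mention IME */
proc univariate data = TempData1;
    var TOTALRESERVES;
    where Flag_IME = 0;
    histogram TOTALRESERVES;
run;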
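
As a possible shortcut, the three PROC UNIVARIATE steps above can probably be collapsed into one with a CLASS statement, which should give a comparative histogram for each level of Flag_IME. This is only a sketch and assumes the TempData1 data set created above. The PROC NPAR1WAY step is an added suggestion, not part of the original analysis; it assumes TOTALRESERVES is skewed enough that a rank-based test is a safer first check than a t-test.

```sas
/* Comparative histograms: one panel per value of Flag_IME */
proc univariate data = TempData1;
    class Flag_IME;
    var TOTALRESERVES;
    histogram TOTALRESERVES;
run;

/* Assumption: a Wilcoxon rank-sum test suits a skewed reserve amount */
proc npar1way data = TempData1 wilcoxon;
    class Flag_IME;
    var TOTALRESERVES;
run;
```

With a million documents almost any difference is likely to come out statistically significant, so the size of the shift in the histograms is probably more informative than the p-value.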

The only unfortunate part would be if "mime" were in the comments that I am analyzing, but I hope that with over a million documents there will not be a plethora of mime lovers out there.
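
One way to sidestep the substring problem is to match on word boundaries. Note that INDEX is case-sensitive, so a lowercase "mime" would not actually trip the flag, but uppercase words such as "TIME" or "MIME" would. The sketch below uses PRXMATCH with a Perl-style pattern; the output data set name TempData2 is hypothetical, and ignoring case (the i modifier) is an assumption about what is wanted here.

```sas
data TempData2;  /* hypothetical output name */
    set T_Text.SRS_TempComments100000;
    /* \b marks a word boundary, so TIME or mime do not match;
       the i modifier ignores case (an assumption about what is wanted) */
    Flag_IME = (prxmatch('/\b(IME|independent medical examination)\b/i',
                         COMMENTTEXT) > 0);
run;
```

PRXMATCH returns the position of the first match, or 0 when there is none, so the comparison yields the same 1/0 flag as the IF/ELSE version.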

Please let me know what processes you might suggest, have tried in the past, or feel are a "best practice." I am hoping to shortcut the long list of options within Enterprise Miner and really bring focus to this. The only other solution I am currently considering is to do the analysis on TOTALRESERVES quartiles, but that still does not help me isolate the words themselves without paying too much attention to the topics or clusters.
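
For reference, the quartile idea could be sketched roughly as follows. It assumes the TempData1 data set and Flag_IME variable from the earlier code; the names TempRanked and ReserveQuartile are hypothetical. If many reserves share the same value (for example, many zeros), the four groups may come out uneven.

```sas
/* GROUPS=4 assigns quartile ranks 0-3 to TOTALRESERVES */
proc rank data = TempData1 groups = 4 out = TempRanked;
    var TOTALRESERVES;
    ranks ReserveQuartile;
run;

/* Cross-tabulate the word flag against the reserve quartile */
proc freq data = TempRanked;
    tables Flag_IME * ReserveQuartile / chisq;
run;
```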

Thank you very much.
