I am having a lot of fun doing the Text Mining with Enterprise Miner. The course was great, but I am not seeing any examples of text mining when there is a continuous dependent variable. Most of the examples are binary, with a few nominal.
I have over 1 million documents to use; I am beginning my exploration with just 100,000. I am less interested in clustering the text items or exploring text topics than in isolating particular words and seeing whether they are related to my continuous dependent variable. Obviously I should not use mutual information for term weighting, because my target is not nominal.
For example, an acronym like "IME", or the three-word phrase "independent medical examination", is all I would want to find, provided it discriminates. It actually does: when I run the following code in Enterprise Guide, the differences are fairly striking:
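/* Flag comments that contain the string 'IME' (INDEX is case-sensitive) */
data TempData1;
    set T_Text.SRS_TempComments100000;
    if (index(COMMENTTEXT,'IME')>0) then Flag_IME = 1;
    else Flag_IME = 0;
run;
/* Distribution of reserves across all comments */
proc univariate data = TempData1;
    var TOTALRESERVES;
    histogram TOTALRESERVES;
run;
/* Distribution of reserves where the comment mentions IME */
proc univariate data = TempData1;
    where Flag_IME = 1;
    var TOTALRESERVES;
    histogram TOTALRESERVES;
run;
/* Distribution of reserves where the comment does not mention IME */
proc univariate data = TempData1;
    where Flag_IME = 0;
    var TOTALRESERVES;
    histogram TOTALRESERVES;
run;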
The only unfortunate part is that INDEX matches substrings, so an uppercase "MIME" (or "TIME", for that matter) in the comments I am analyzing would be flagged too, but I hope that with over a million documents there will not be a plethora of mime lovers out there.
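If that turns out to be a problem, one way I could tighten the match is with a word-boundary regular expression. This is only a rough sketch that I have not run against the full data: TempData2 is a made-up dataset name, and I am assuming the comments separate words with spaces or punctuation.

```sas
/* Sketch only: flag IME as a whole word, or the spelled-out phrase */
data TempData2;  /* hypothetical output dataset name */
    set T_Text.SRS_TempComments100000;
    /* \b is a word boundary, so MIME, TIME, or PRIME do not match; */
    /* the /i modifier makes the phrase match case-insensitive      */
    Flag_IME = (prxmatch('/\bIME\b/', COMMENTTEXT) > 0
             or prxmatch('/\bindependent\s+medical\s+examination\b/i', COMMENTTEXT) > 0);
run;

/* One step instead of three: CLASS splits the output by flag value */
proc univariate data = TempData2;
    class Flag_IME;
    var TOTALRESERVES;
    histogram TOTALRESERVES;
run;
```

The CLASS statement should give me the Flag_IME = 0 and Flag_IME = 1 summaries and histograms side by side, which makes the comparison easier than three separate runs.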
Please let me know what processes you might suggest, have tried in the past, or consider a "best practice." I am hoping to shortcut the long list of options within Enterprise Miner and bring some focus to this. The only other solution I can currently think of is to run the analysis on quartiles of TOTALRESERVES, which would give me a nominal target, but that still does not help me isolate the words themselves without paying too much attention to topics or clusters.
Thank you very much.