Re: A question about textmining (related to predictive modelling)

deleted_user · Posted 08-03-2009 09:32 AM

Hi everybody,
I would like feedback and suggestions on something I am planning to do.

I presented a paper at SAS Global Forum in DC this year on the prevention of work-related injuries with SAS Text Miner, using descriptive mining. Specifically, it covered hot and threat related injuries among police officers and security guards in Sweden. Now I want to go a step further and use those results in predictive mining.

The outcome variable I want to analyse is an ordered categorical variable: the degree of severity of the injury. This variable matters to the insurance company I work for, since the degree of severity determines the costs. There are three categories: sick leave of less than 31 days, sick leave of more than 30 days, and medical impairment. So this is basically an ordered probit framework in which some of the control variables are demographic variables that we have in the database.

Demographic variables alone, however, will not do much to predict the probability that a worker falls into one of these categories. They also carry the risk that the error term is correlated with the control variables (an omitted variable problem).

On the other hand, since the accident process is available as a text variable and I have already analysed it with text mining, I can use that text information in my predictive modelling. I have two choices here: one is to use roll-up terms as control variables; the other is to use singular value vectors, which are artificial (and very hard to interpret) concepts. I intend to use roll-up terms, and I believe this will considerably increase the predictive power of my work, since we would use all the accident process information together with the other demographic variables (such as age, gender, diagnosis, occupation and so on). (I am a bit wary of using SVD vectors, which represent extracted common meaning components of many different words and documents, as control variables.)

I also intend to submit that article to the next SAS Global Forum in Seattle.
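
To make the setup concrete, here is a minimal sketch of how an ordered probit turns demographic variables plus roll-up term indicators into probabilities for the three severity categories. It is plain Python, not SAS code, and everything numeric in it is an assumption for illustration: the term list, the coefficients, the cutpoints and the example narrative are all made up, and the roll-up terms are simplified to "is this term present in the text".

```python
import math

def norm_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(index, cutpoints):
    """Return P(y = k) for each ordered category k.

    index     : the linear predictor x'beta for one worker
    cutpoints : strictly increasing thresholds separating the categories
    """
    cdf = [0.0] + [norm_cdf(c - index) for c in cutpoints] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

def rollup_features(text, terms):
    """Simplified roll-up variables: 1.0 if the term occurs in the text."""
    tokens = set(text.lower().split())
    return [1.0 if t in tokens else 0.0 for t in terms]

# Hypothetical inputs, chosen only to illustrate the mechanics.
CATEGORIES = ["sick leave < 31 days", "sick leave > 30 days", "medically impaired"]
ROLLUP_TERMS = ["knife", "pushed", "fell"]   # assumed roll-up terms
BETA_AGE, BETA_FEMALE = 0.01, -0.10          # assumed demographic coefficients
BETA_TERMS = [0.60, 0.20, 0.35]              # assumed term coefficients
CUTPOINTS = [0.5, 1.5]                       # assumed thresholds

narrative = "Guard was pushed and fell during an arrest"
x_terms = rollup_features(narrative, ROLLUP_TERMS)

# Linear index: demographic part plus text-derived part.
index = BETA_AGE * 45 + BETA_FEMALE * 0 + sum(
    b * x for b, x in zip(BETA_TERMS, x_terms)
)
probs = ordered_probit_probs(index, CUTPOINTS)

for name, p in zip(CATEGORIES, probs):
    print(f"{name}: {p:.3f}")
```

In a real analysis the coefficients and cutpoints would be estimated from the data; the point here is only that each roll-up term enters the index as an ordinary, directly interpretable regressor.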

Do you have any suggestions? If you know of any related work or articles, I would appreciate it if you shared them with me.

regards
Kerem Tezic
AFA Swedish Labor Market Insurances
kerem.tezic@afaforsakring.se

JamesCoxPhD · Posted 10-23-2009 10:59 AM

It sounds like you have a good handle on the differences between roll-up terms and SVD. You are correct that the SVD variables can be difficult to interpret; however, we have found in many applications that you can build a more effective model using them rather than roll-up terms, so there is a tradeoff.
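
As a rough illustration of that tradeoff, the sketch below builds both kinds of features from the same toy document-by-term matrix using NumPy. It is not how SAS Text Miner implements either option: the matrix and term list are made up, the roll-up terms are picked by hand rather than by term weight, and the counts are left unweighted.

```python
import numpy as np

# Toy document-by-term count matrix (rows: documents, columns: terms).
# The data are invented purely for illustration.
terms = ["knife", "threat", "pushed", "fell", "arrest"]
A = np.array([
    [1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
], dtype=float)

# Roll-up style features: keep the columns of a few selected terms.
# Each feature is still an individual word, so it stays easy to interpret.
keep = ["knife", "fell"]  # hand-picked here (assumption)
rollup = A[:, [terms.index(t) for t in keep]]

# SVD style features: project each document onto the top-k right
# singular vectors. Each feature now mixes all terms at once.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
svd_features = A @ Vt[:k].T

print("roll-up features:\n", rollup)
print("SVD features:\n", svd_features)
print("term loadings of first SVD dimension:", dict(zip(terms, Vt[0].round(2))))
```

The loadings printed on the last line show why the SVD variables are harder to explain: every term contributes to every dimension, whereas each roll-up column corresponds to exactly one word.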

I look forward to seeing what kind of results you get from this. It sounds like a very interesting application.

Jim Cox, Development Manager for SAS Text Miner