I would like to transform a categorical predictor variable into a continuous one: from, say, character class labels into real-valued representations of those labels. I know of several ways to do this, for example substituting the frequency of a level for the level itself, or computing the entropy of a level.

What I want to do is generalize the Information Value of a variable from the binary "good/bad" classification frequently used in credit scoring to a multiclass 1-versus-many representation of the 1-of-N GLM encoding. For example, with 3 class labels 'A', 'B', 'C', I would compute the information value of each label in turn versus the other two: 'A' vs ('B', 'C'), 'B' vs ('A', 'C') and 'C' vs ('A', 'B'). That would let me represent a multiclass categorical variable numerically as a single real-valued variable.

I know that this technique produces only N distinct values, but it would let me use existing code that works well on continuous variables, and I do not know how to incorporate a GLM-encoded categorical variable into my work.

Is there a better way than Information Value to transform a categorical variable into a continuous variable? How does Enterprise Miner process categorical variables? Does EM convert a categorical variable into a real-valued variable and then use the real values when splitting on a target variable?
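To make the 1-versus-many idea concrete, here is a minimal Python sketch of one possible reading of it, not an Enterprise Miner procedure. It assumes a binary target (1 = "bad", 0 = "good"), treats each level as a two-bin variable (this level vs. all other levels), and computes the usual credit-scoring information value for those two bins. The function name `one_vs_rest_iv` and the smoothing constant `eps` are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def one_vs_rest_iv(levels, target, eps=0.5):
    """Map each level to the IV of the indicator "this level vs. the rest".

    Assumptions (not from the original post): `target` is binary with
    1 = "bad" and 0 = "good", and `eps` is a small count added to every
    cell so the logarithm stays finite when a cell is empty.
    """
    n_bad = sum(target)
    n_good = len(target) - n_bad
    bad_in = Counter(l for l, t in zip(levels, target) if t == 1)
    good_in = Counter(l for l, t in zip(levels, target) if t == 0)

    iv = {}
    for level in set(levels):
        total = 0.0
        # Two bins: rows in this level, and rows in every other level.
        bins = (
            (good_in[level], bad_in[level]),
            (n_good - good_in[level], n_bad - bad_in[level]),
        )
        for g, b in bins:
            p_good = (g + eps) / (n_good + 2 * eps)  # share of goods in bin
            p_bad = (b + eps) / (n_bad + 2 * eps)    # share of bads in bin
            total += (p_good - p_bad) * math.log(p_good / p_bad)
        iv[level] = total
    return iv


# Toy data, invented purely for illustration.
levels = ["A", "A", "A", "B", "B", "C", "C", "C"]
target = [1, 1, 0, 0, 0, 1, 0, 0]

iv = one_vs_rest_iv(levels, target)
encoded = [iv[l] for l in levels]  # single real-valued column
```

Each row is replaced by the information value of its own level, so the encoded column takes at most N distinct values. One thing to keep in mind with this reading: information value is non-negative by construction, so it measures how strongly a level separates the two target classes but not in which direction.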