Contributor
Posts: 43

Modelling Nominal categorical variable

Hi,

I am working on a dataset with around 600K cases. A sample of the data structure is shown below.

Type  ID  Weekday  PSUM  RSUM  DP  VNET  DEP  DSUM  DRET  DNET
999   5   Friday   0     -1    0   -1    X21  0     -1    -1
30    7   Friday   2     0     2   2     X52  1     0     1
30    7   Friday   2     0     2   2     X64  1     0     1
26    8   Friday   30    -2    7   28    X17  2     0     2
26    8   Friday   30    -2    7   28    X18  1     0     1
26    8   Friday   30    -2    7   28    X31  1     0     1

There are 38 levels for Type and 69 levels for Dep, and all the other numeric variables are discrete. I need to model Type using the other variables. Because the dataset is so sparse, I am having trouble working out the best approach. I know it is a kind of multinomial logistic regression problem, but I don't know how to proceed. Please advise me on possible methods, from simpler but less accurate to harder but more accurate. I would really appreciate any help, since I am in the middle of preparing for job interviews. If anyone can tell me about ensemble techniques using SAS, that would be great.

Thanks.

Frequent Contributor
Posts: 140

Re: Modelling Nominal categorical variable

I think your first task is to figure out just how sparse the data are, and the first step within that is to find the distribution of type. If it is fairly evenly distributed among your 600,000 cases, then you have approximately 16,000 cases per level of type (600,000 / 38), which is not sparse at all. However, if some levels of type have very few members, you might want to combine levels. You can do the same for all your other variables.
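As a rough sketch of combining small levels, something like the following could work. It assumes the data set is called mydata, that type is numeric, and that a minimum of 1000 cases per level is acceptable. The threshold (min_n), the pooled code (-1), and the names type_counts, mydata_pooled and type_pooled are all made up for illustration, so adjust them to your situation.

```sas
/* Hypothetical threshold: levels of type with fewer cases than this are pooled */
%let min_n = 1000;

/* Step 1: count the cases in each level of type */
proc freq data=mydata noprint;
   tables type / out=type_counts(keep=type count);
run;

/* Step 2: copy the data, recoding rare levels to a single pooled code (-1) */
proc sql;
   create table mydata_pooled as
   select a.*,
          case when b.count < &min_n then -1 else a.type end as type_pooled
   from mydata as a
        left join type_counts as b
        on a.type = b.type;
quit;
```

Whether pooling makes sense depends on whether the rare levels are meaningful on their own, so treat this as a starting point rather than a rule.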

Next, to get an overall sense of sparseness, the /LIST option on PROC FREQ can be very helpful, something like:

PROC FREQ data = mydata;

TABLE type*id*weekday*psum*rsum*dp*vnet*dep*dsum*dret*dnet/LIST;

RUN;

However, this may produce too many rows to look at, in which case you can do it by smaller sets of variables.
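For example, a two-way check of type against dep might look like the sketch below. With 38 levels of type and 69 levels of dep there are 2,622 possible cells. The SPARSE option writes the zero-count combinations to the output data set as well, so they can be counted. The output name type_dep_cells and the cutoff of 5 for a "small" cell are only assumptions for illustration.

```sas
/* Write every type-by-dep combination, including empty ones, to a data set */
proc freq data=mydata noprint;
   tables type*dep / out=type_dep_cells sparse;
run;

/* Summarise how many cells are empty or nearly empty */
proc sql;
   select count(*)                       as n_cells,
          sum(count = 0)                 as n_empty,
          sum(count > 0 and count < 5)   as n_small
   from type_dep_cells;
quit;
```

The same pattern can be repeated for other pairs or small groups of variables.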

The statistical technique should, I think, be multinomial logistic (as you suspected).  There are exact methods to deal with sparse tables, but they will take preposterous amounts of time with N = 600,000.  However, HPLOGISTIC may offer some savings of time, depending on your exact setup (see the documentation).
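A minimal sketch of a generalized logit (multinomial) fit in HPLOGISTIC is below. It assumes the data set is called mydata, that weekday and dep should be treated as classification variables, and that the remaining predictors enter as numeric effects. I have left id out because its role is not clear from the sample. Check the documentation for the options that suit your setup.

```sas
/* Sketch only: nominal response type modelled with a generalized logit link */
proc hplogistic data=mydata;
   class weekday dep;   /* categorical predictors (assumed) */
   model type = weekday dep psum rsum dp vnet dsum dret dnet / link=glogit;
run;
```

Keep in mind that with 38 levels of type there are 37 response functions, and dep alone contributes 68 parameters to each one, so this is a large model even before any interactions are added.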

Contributor
Posts: 43

Re: Modelling Nominal categorical variable

Please see the distribution of type here.

Type  Frequency  Percent  Cumulative Frequency  Cumulative Percent
40    174164     26.92    174164                26.92
39    95504      14.76    269668                41.68
37    38954      6.02     308622                47.70
38    29565      4.57     338187                52.27
25    27609      4.27     365796                56.53
7     23199      3.59     388995                60.12
8     22844      3.53     411839                63.65
36    21990      3.40     433829                67.05
44    20424      3.16     454253                70.20
42    19468      3.01     473721                73.21
24    18015      2.78     491736                76.00
999   17590      2.72     509326                78.71
9     16820      2.60     526146                81.31
32    13843      2.14     539989                83.45
5     13836      2.14     553825                85.59
35    12501      1.93     566326                87.52
33    9918       1.53     576244                89.06
15    7147       1.10     583391                90.16
3     6827       1.06     590218                91.22
43    6383       0.99     596601                92.20
41    5508       0.85     602109                93.05
30    4861       0.75     606970                93.81
34    4751       0.73     611721                94.54
27    4613       0.71     616334                95.25
21    4032       0.62     620366                95.88
22    3592       0.56     623958                96.43
6     3405       0.53     627363                96.96
20    3116       0.48     630479                97.44
18    2977       0.46     633456                97.90
28    2664       0.41     636120                98.31
26    2507       0.39     638627                98.70
12    2108       0.33     640735                99.02
29    2105       0.33     642840                99.35
31    1765       0.27     644605                99.62
19    1188       0.18     645793                99.81
4     901        0.14     646694                99.94
23    325        0.05     647019                99.99
14    35         0.01     647054                100.00

Can we judge the sparseness of the data from the distribution of Type alone?

Frequent Contributor
Posts: 140