Re: Modelling Nominal categorical variable

JVarghese · Posted 11-23-2015 12:08 PM

Hi,

I am working on a dataset with around 600K cases. A sample of the data structure is shown below.

Type	ID	Weekday	PSUM	RSUM	DP	VNET	DEP	DSUM	DRET	DNET
999	5	Friday	0	-1	0	-1	X21	0	-1	-1
30	7	Friday	2	0	2	2	X52	1	0	1
30	7	Friday	2	0	2	2	X64	1	0	1
26	8	Friday	30	-2	7	28	X17	2	0	2
26	8	Friday	30	-2	7	28	X18	1	0	1
26	8	Friday	30	-2	7	28	X31	1	0	1

Type has 38 levels and DEP has 69 levels; all the other numeric variables are discrete. I need to model Type using the other variables. Because the data are so sparse, I am not sure what the best approach would be. I know this is a kind of multinomial logistic regression problem, but I don't know how to proceed. Please advise me on possible methods, from simpler but less accurate to harder but more accurate. I would really appreciate any help, since I am preparing for job interviews. If anyone can also suggest ensemble techniques in SAS, that would be great.

Thanks.

plf515 · Posted 11-24-2015 06:34 AM

I think your first task is to figure out just how sparse the data are, and the first step within that is to find the distribution of type. If it is fairly evenly distributed among your 600,000 cases, then you have approximately 16,000 cases per level of type (600,000 / 38), which is not sparse at all. However, if some levels of type have very few members, you might want to combine levels. You can do the same for your other variables.
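As a starting point, the distribution of type can be tabulated with the most common levels listed first. This is only a minimal sketch: `mydata` is a placeholder dataset name, not the poster's actual dataset.

```sas
/* One-way frequency table of the response variable.       */
/* ORDER=FREQ lists levels by descending frequency.        */
/* mydata is a placeholder for the actual dataset name.    */
proc freq data=mydata order=freq;
   tables type;
run;
```

Levels at the bottom of that table are the candidates for combining.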

Next, to get an overall sense of sparseness, the /LIST option on PROC FREQ can be very helpful, something like:

PROC FREQ data = mydata;  /* mydata is a placeholder dataset name */
TABLE type*weekday*psum*rsum*dp*vnet*dep*dsum*dret*dnet / LIST;
RUN;

However, this may produce too many rows to look at, in which case you can do it by smaller sets of variables.
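One way to work with a smaller set of variables is to cross type with a single predictor, write the cell counts to a dataset, and print only the thin cells. The sketch below makes several assumptions: `mydata` and `type_dep` are placeholder dataset names, and the cutoff of 5 is an arbitrary illustration rather than a recommended threshold.

```sas
/* Cross type with dep and save the cell counts.                     */
/* SPARSE adds combinations that never occur (count = 0) to the      */
/* output dataset; NOPRINT suppresses the printed table.             */
proc freq data=mydata;
   tables type*dep / out=type_dep sparse noprint;
run;

/* List the cells with very few observations (cutoff is arbitrary). */
proc print data=type_dep;
   where count < 5;
run;
```

The same pattern can be repeated for type crossed with weekday or with any of the discrete numeric variables.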

The statistical technique should, I think, be multinomial logistic (as you suspected). There are exact methods to deal with sparse tables, but they will take preposterous amounts of time with N = 600,000. However, HPLOGISTIC may offer some savings of time, depending on your exact setup (see the documentation).
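For reference, a multinomial (generalized logit) model can be requested in PROC LOGISTIC with the LINK=GLOGIT option. This is a rough sketch under stated assumptions, not a tuned model: `mydata` is a placeholder dataset name, ID is left out on the assumption that it is only an identifier, and the discrete numeric variables are entered as continuous predictors, which may not be appropriate for these data.

```sas
/* Generalized logit model for the nominal response type.           */
/* weekday and dep are treated as classification variables;         */
/* the remaining predictors enter as numeric (an assumption).       */
proc logistic data=mydata;
   class weekday dep / param=ref;
   model type = weekday dep psum rsum dp vnet dsum dret dnet
         / link=glogit;
run;
```

With 38 response levels, the model estimates a separate set of coefficients for each non-reference level, so the parameter count grows quickly when a 69-level predictor such as dep is included.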

JVarghese · Posted 11-24-2015 03:00 PM

Please see the distribution of type here.

Type	Frequency	Percent	Cumulative Frequency	Cumulative Percent
40	174164	26.92	174164	26.92
39	95504	14.76	269668	41.68
37	38954	6.02	308622	47.70
38	29565	4.57	338187	52.27
25	27609	4.27	365796	56.53
7	23199	3.59	388995	60.12
8	22844	3.53	411839	63.65
36	21990	3.40	433829	67.05
44	20424	3.16	454253	70.20
42	19468	3.01	473721	73.21
24	18015	2.78	491736	76.00
999	17590	2.72	509326	78.71
9	16820	2.60	526146	81.31
32	13843	2.14	539989	83.45
5	13836	2.14	553825	85.59
35	12501	1.93	566326	87.52
33	9918	1.53	576244	89.06
15	7147	1.10	583391	90.16
3	6827	1.06	590218	91.22
43	6383	0.99	596601	92.20
41	5508	0.85	602109	93.05
30	4861	0.75	606970	93.81
34	4751	0.73	611721	94.54
27	4613	0.71	616334	95.25
21	4032	0.62	620366	95.88
22	3592	0.56	623958	96.43
6	3405	0.53	627363	96.96
20	3116	0.48	630479	97.44
18	2977	0.46	633456	97.90
28	2664	0.41	636120	98.31
26	2507	0.39	638627	98.70
12	2108	0.33	640735	99.02
29	2105	0.33	642840	99.35
31	1765	0.27	644605	99.62
19	1188	0.18	645793	99.81
4	901	0.14	646694	99.94
23	325	0.05	647019	99.99
14	35	0.01	647054	100.00

Can we judge the sparseness of the data from the distribution of type alone?

plf515 · Posted 11-24-2015 04:24 PM

Well, certainly type 14, with only 35 cases, is going to cause problems with all those IVs to be used. You can either drop that type or try to combine it with another.

Beyond that, I think you need to also look at the IVs and their distribution.
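Both steps could look something like the sketch below. Everything specific in it is an assumption made for illustration: `mydata` and `mydata2` are placeholder dataset names, type is assumed to be numeric, the choice of levels 14 and 23 (the two smallest in the table posted above) is arbitrary, and -1 is a made-up code for the pooled level. Which levels really belong together is a subject-matter decision.

```sas
/* Pool the rarest type levels into one hypothetical "other" level. */
data mydata2;
   set mydata;
   type_c = type;
   if type in (14, 23) then type_c = -1;  /* -1 is a made-up code */
run;

/* One-way distributions of the recoded response and of each IV. */
proc freq data=mydata2 order=freq;
   tables type_c weekday dep psum rsum dp vnet dsum dret dnet;
run;
```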
