11-23-2015 12:08 PM
Hi,
I am working on a dataset with around 600K cases. A sample of the data structure is shown below.
Type | ID | Weekday | PSUM | RSUM | DP | VNET | DEP | DSUM | DRET | DNET |
999 | 5 | Friday | 0 | -1 | 0 | -1 | X21 | 0 | -1 | -1 |
30 | 7 | Friday | 2 | 0 | 2 | 2 | X52 | 1 | 0 | 1 |
30 | 7 | Friday | 2 | 0 | 2 | 2 | X64 | 1 | 0 | 1 |
26 | 8 | Friday | 30 | -2 | 7 | 28 | X17 | 2 | 0 | 2 |
26 | 8 | Friday | 30 | -2 | 7 | 28 | X18 | 1 | 0 | 1 |
26 | 8 | Friday | 30 | -2 | 7 | 28 | X31 | 1 | 0 | 1 |
There are 38 levels for Type and 69 levels for DEP, and all other numeric variables are discrete. I need to model Type using the other variables. Because the dataset is so sparse, I am finding it hard to figure out what the best approach would be. I know it's a kind of multinomial logistic regression problem, but I don't know how to proceed. Please advise me on possible methods, from simple but less accurate to harder but more accurate. I would really appreciate any help with this, since I am in the middle of job interviews. If anyone can tell me about ensemble techniques using SAS, that would be great.
Thanks.
11-24-2015 06:34 AM
I think your first task is to figure out just how sparse the data are, and the first step within that is to find the distribution of type. If it is fairly evenly distributed among your 600,000 cases, then you have approximately 16,000 cases per level of type (600,000 / 38), which is not sparse at all. However, if some levels of type have very few members, you might want to combine levels. You can do the same for all your other variables.
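As a minimal sketch of that first step, a one-way PROC FREQ with ORDER=FREQ lists the levels of type from most to least frequent, which makes thin levels easy to spot. The dataset name mydata is the same placeholder used elsewhere in this thread; the variable name type is taken from the posted sample.

```sas
/* One-way frequency table of the response, largest level first.  */
/* mydata is a placeholder dataset name, not the poster's actual  */
/* dataset.                                                        */
proc freq data=mydata order=freq;
    tables type;
run;
```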
Next, to get an overall sense of sparseness, the LIST option on the TABLE statement of PROC FREQ can be very helpful, something like:
PROC FREQ data = mydata;
TABLE type*weekday*psum*rsum*dp*vnet*dep*dsum*dret*dnet / LIST;
RUN;
However, this may produce too many rows to look at, in which case you can do it by smaller sets of variables.
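One way to work with a smaller set of variables, sketched here under assumptions, is to cross type with a single predictor at a time, write the cell counts to a dataset, and summarize them instead of reading the printed table. The output dataset name type_dep_counts is hypothetical; COUNT is the variable PROC FREQ writes to an OUT= dataset.

```sas
/* Cross type with one predictor (dep) and save the cell counts.   */
/* SPARSE keeps the zero-count combinations in the OUT= dataset,   */
/* so empty cells are visible. type_dep_counts is a hypothetical   */
/* name.                                                           */
proc freq data=mydata noprint;
    tables type*dep / sparse out=type_dep_counts;
run;

/* Summarize how full the cells are: a minimum or lower quartile   */
/* of 0 means many type-by-dep combinations never occur.           */
proc means data=type_dep_counts n min p25 median max;
    var count;
run;
```

The same pattern can be repeated for type*weekday and the other predictors.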
The statistical technique should, I think, be multinomial logistic (as you suspected). There are exact methods to deal with sparse tables, but they will take preposterous amounts of time with N = 600,000. However, HPLOGISTIC may offer some savings of time, depending on your exact setup (see the documentation).
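For illustration only, a multinomial (generalized logit) model can be requested in PROC LOGISTIC with LINK=GLOGIT. This sketch uses PROC LOGISTIC rather than HPLOGISTIC; treating weekday and dep as classification variables and the remaining predictors as continuous is an assumption, as is the choice of 40 as the reference level (it is the most frequent level in the distribution posted in this thread).

```sas
/* Generalized logit model for a nominal response with many levels. */
/* Assumptions: weekday and dep are categorical, the other          */
/* predictors enter as continuous, and type 40 is the reference.    */
proc logistic data=mydata;
    class weekday dep / param=ref;
    model type(ref='40') = weekday dep psum rsum dp vnet dsum dret dnet
          / link=glogit;
run;
```

With 38 response levels and 69 levels of dep, this fits a large number of parameters, so checking the cell counts first is worthwhile.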
11-24-2015 03:00 PM
Please see the distribution of type here.
Type | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
40 | 174164 | 26.92 | 174164 | 26.92 |
39 | 95504 | 14.76 | 269668 | 41.68 |
37 | 38954 | 6.02 | 308622 | 47.70 |
38 | 29565 | 4.57 | 338187 | 52.27 |
25 | 27609 | 4.27 | 365796 | 56.53 |
7 | 23199 | 3.59 | 388995 | 60.12 |
8 | 22844 | 3.53 | 411839 | 63.65 |
36 | 21990 | 3.40 | 433829 | 67.05 |
44 | 20424 | 3.16 | 454253 | 70.20 |
42 | 19468 | 3.01 | 473721 | 73.21 |
24 | 18015 | 2.78 | 491736 | 76.00 |
999 | 17590 | 2.72 | 509326 | 78.71 |
9 | 16820 | 2.60 | 526146 | 81.31 |
32 | 13843 | 2.14 | 539989 | 83.45 |
5 | 13836 | 2.14 | 553825 | 85.59 |
35 | 12501 | 1.93 | 566326 | 87.52 |
33 | 9918 | 1.53 | 576244 | 89.06 |
15 | 7147 | 1.10 | 583391 | 90.16 |
3 | 6827 | 1.06 | 590218 | 91.22 |
43 | 6383 | 0.99 | 596601 | 92.20 |
41 | 5508 | 0.85 | 602109 | 93.05 |
30 | 4861 | 0.75 | 606970 | 93.81 |
34 | 4751 | 0.73 | 611721 | 94.54 |
27 | 4613 | 0.71 | 616334 | 95.25 |
21 | 4032 | 0.62 | 620366 | 95.88 |
22 | 3592 | 0.56 | 623958 | 96.43 |
6 | 3405 | 0.53 | 627363 | 96.96 |
20 | 3116 | 0.48 | 630479 | 97.44 |
18 | 2977 | 0.46 | 633456 | 97.90 |
28 | 2664 | 0.41 | 636120 | 98.31 |
26 | 2507 | 0.39 | 638627 | 98.70 |
12 | 2108 | 0.33 | 640735 | 99.02 |
29 | 2105 | 0.33 | 642840 | 99.35 |
31 | 1765 | 0.27 | 644605 | 99.62 |
19 | 1188 | 0.18 | 645793 | 99.81 |
4 | 901 | 0.14 | 646694 | 99.94 |
23 | 325 | 0.05 | 647019 | 99.99 |
14 | 35 | 0.01 | 647054 | 100.00 |
Can we judge the sparseness of the data from the distribution of type alone?
11-24-2015 04:24 PM
Well, the smallest type (14, with only 35 cases) is certainly going to cause problems with all those IVs to be used. You can either drop that type or try to combine it with another.
Beyond that, I think you need to also look at the IVs and their distribution.
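A rough sketch of both suggestions follows. The output dataset name mydata2 is hypothetical, and the level that type 14 is merged into (23 here) is an arbitrary placeholder; which level makes sense to merge with depends on what the types mean.

```sas
/* Handle the 35-case level (type 14) before modeling.             */
/* mydata2 is a hypothetical dataset name.                         */
data mydata2;
    set mydata;
    /* Option 1: drop the rare level entirely. */
    if type = 14 then delete;
    /* Option 2, used instead of option 1: merge it into another   */
    /* level. The target 23 is an arbitrary placeholder.           */
    /* if type = 14 then type = 23; */
run;

/* One-way distributions of the independent variables, largest     */
/* level first, to look for thin levels worth combining.           */
proc freq data=mydata2 order=freq;
    tables weekday dep psum rsum dp vnet dsum dret dnet;
run;
```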