JVarghese
Obsidian | Level 7

Hi,

I am working on a dataset with around 600K cases; a sample of the data structure is shown below.

Type ID Weekday PSUM RSUM DP VNET DEP DSUM DRET DNET
999 5 Friday 0 -1 0 -1 X21 0 -1 -1
30 7 Friday 2 0 2 2 X52 1 0 1
30 7 Friday 2 0 2 2 X64 1 0 1
26 8 Friday 30 -2 7 28 X17 2 0 2
26 8 Friday 30 -2 7 28 X18 1 0 1
26 8 Friday 30 -2 7 28 X31 1 0 1

There are 38 levels for Type and 69 levels for Dep, and all the other numeric variables are discrete. I need to model Type using the other variables. Because the dataset is so sparse, I am finding it hard to work out the best approach. I know it is a kind of multinomial logistic regression problem, but I don't know how to proceed. Please advise me on the available methods, from simple but less accurate to harder but more accurate. I would really appreciate any help, since I am in the middle of looking for job interviews. If anyone can tell me about ensemble techniques in SAS, that would be great.

 

Thanks.

plf515
Lapis Lazuli | Level 10

I think your first task is to figure out just how sparse the data are, and the first step within that is to find out the distribution of type. If it is fairly evenly distributed among your 600,000 cases, then you have approximately 15,800 cases per level of type (600,000 / 38), which is not sparse at all. However, if some levels of type have very few members, you might want to combine levels. You can do the same for all your other variables.
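
A minimal sketch of that first check, assuming the dataset is called mydata (a placeholder name) and the response variable is named type. ORDER=FREQ lists the levels from most to least common, so the thin levels collect at the bottom of the table:

```sas
/* mydata is a placeholder dataset name */
proc freq data=mydata order=freq;
  tables type;   /* one-way frequency table of the response */
run;
```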

 

Next, to get an overall sense of sparseness, the /LIST option on PROC FREQ can be very helpful, something like:

 

PROC FREQ data = mydata;
  /* LIST prints one row per observed combination of values */
  TABLE type*id*weekday*psum*rsum*dp*vnet*dep*dsum*dret*dnet/LIST;
RUN;

 

However, this may produce too many rows to look at, in which case you can do it by smaller sets of variables.
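
One way to work through smaller sets, sketched under assumptions: the dataset name mydata, the choice of type*weekday*dep as the subset, the output names combos and n_obs, and the cutoff of 5 are all illustrative. The OUT= dataset holds only combinations that actually occur, so completely empty cells are not counted here.

```sas
/* Write the observed combinations to a dataset instead of printing them */
proc freq data=mydata noprint;
  tables type*weekday*dep / out=combos(rename=(count=n_obs));
run;

/* Count the observed combinations and how many of them are thin */
proc sql;
  select count(*)       as n_cells,
         sum(n_obs < 5) as n_thin_cells   /* 5 is an arbitrary cutoff */
  from combos;
quit;
```

Repeating this for a few other variable subsets gives a rough picture of where the thin cells are without scrolling through a very long listing.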

 

The statistical technique should, I think, be multinomial logistic (as you suspected).  There are exact methods to deal with sparse tables, but they will take preposterous amounts of time with N = 600,000.  However, HPLOGISTIC may offer some savings of time, depending on your exact setup (see the documentation).
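
A hedged sketch of such a model in PROC HPLOGISTIC, not a recommended final specification. It assumes the dataset is called mydata, that Weekday and Dep are treated as classification variables, and that the remaining predictors enter as numeric effects. In the six sample rows posted above, VNET equals PSUM + RSUM and DNET equals DSUM + DRET; if that holds throughout the data, the sums are redundant, so the sketch leaves them out.

```sas
proc hplogistic data=mydata;
  class weekday dep;
  /* LINK=GLOGIT requests a generalized logit (nominal multinomial) model */
  model type = weekday dep psum rsum dp dsum dret / link=glogit;
run;
```

With 38 response levels, every effect contributes 37 sets of parameters, so it may be worth starting with only a few predictors and adding more once the model fits cleanly.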

JVarghese
Obsidian | Level 7

Please see the distribution of type here.

Type  Frequency  Percent  Cumulative Frequency  Cumulative Percent
40    174164     26.92    174164                26.92
39    95504      14.76    269668                41.68
37    38954      6.02     308622                47.70
38    29565      4.57     338187                52.27
25    27609      4.27     365796                56.53
7     23199      3.59     388995                60.12
8     22844      3.53     411839                63.65
36    21990      3.40     433829                67.05
44    20424      3.16     454253                70.20
42    19468      3.01     473721                73.21
24    18015      2.78     491736                76.00
999   17590      2.72     509326                78.71
9     16820      2.60     526146                81.31
32    13843      2.14     539989                83.45
5     13836      2.14     553825                85.59
35    12501      1.93     566326                87.52
33    9918       1.53     576244                89.06
15    7147       1.10     583391                90.16
3     6827       1.06     590218                91.22
43    6383       0.99     596601                92.20
41    5508       0.85     602109                93.05
30    4861       0.75     606970                93.81
34    4751       0.73     611721                94.54
27    4613       0.71     616334                95.25
21    4032       0.62     620366                95.88
22    3592       0.56     623958                96.43
6     3405       0.53     627363                96.96
20    3116       0.48     630479                97.44
18    2977       0.46     633456                97.90
28    2664       0.41     636120                98.31
26    2507       0.39     638627                98.70
12    2108       0.33     640735                99.02
29    2105       0.33     642840                99.35
31    1765       0.27     644605                99.62
19    1188       0.18     645793                99.81
4     901        0.14     646694                99.94
23    325        0.05     647019                99.99
14    35         0.01     647054                100.00

Can we judge the sparseness of the data from the distribution of type alone?

plf515
Lapis Lazuli | Level 10

Well, the smallest type (14, with only 35 cases) is certainly going to cause problems with all those IVs in the model. You can either drop that type or try to combine it with another.
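
Either option is a short DATA step. The sketch below assumes the level in question is Type 14 (35 cases in the frequency table) and, purely for illustration, pools it with Type 23, the next smallest level. Which levels belong together is a subject-matter decision; the names mydata, mydata2, and type2 and the code 998 are hypothetical.

```sas
data mydata2;
  set mydata;
  /* Option 1: pool the rare levels under a new code (998 is arbitrary) */
  if type in (14, 23) then type2 = 998;
  else type2 = type;
  /* Option 2 (instead of pooling): drop the rare level entirely */
  /* if type = 14 then delete; */
run;
```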

 

Beyond that, I think you need to also look at the IVs and their distribution.
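
As a starting point for that, the NLEVELS option on PROC FREQ reports how many distinct values each variable takes, alongside the one-way tables. This is a sketch that assumes the dataset is called mydata and uses the variable names from the posted sample:

```sas
proc freq data=mydata nlevels order=freq;
  /* one-way tables for each predictor, most common values first */
  tables weekday dep psum rsum dp vnet dsum dret dnet;
run;
```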
