Large value for variation explained using PROC VARCLUS

Hi all,

I am relatively new to SAS, having used it for the past month to run statistical analyses on small and medium-sized datasets. I am now working with a much larger dataset (~40,000 observations) with around 300 variables.

More than half of these 300 variables are non-numeric (binary or categorical), so I created another dataset with the same observations in which every variable is numeric. My conversion rules are as follows:

1. Binary variable: Y = 1, N = 0, missing data = 0

2. Categorical: 0, 1, 2, ..., n, where n is the number of categories; missing data = 0
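The two rules above could be written as a DATA step along the following lines. This is only a sketch: `rawtable`, `flag1`, `cat1`, and the levels 'A', 'B', 'C' are hypothetical placeholders rather than names from the actual data, and it assumes that 0 is reserved for missing in the categorical rule.

```sas
/* Hypothetical names: rawtable, flag1, cat1 and the levels 'A','B','C'
   are placeholders, not taken from the original data. */
data worktable;
    set rawtable;

    /* Rule 1: binary Y/N -> 1/0, missing -> 0 */
    if flag1 = 'Y' then flag1_num = 1;
    else flag1_num = 0;              /* 'N' and blank both become 0 */

    /* Rule 2: categories -> 1..n, missing -> 0 (assumed reading of the rule) */
    select (cat1);
        when ('A') cat1_num = 1;
        when ('B') cat1_num = 2;
        when ('C') cat1_num = 3;
        otherwise  cat1_num = 0;     /* blank (missing) becomes 0 */
    end;

    drop flag1 cat1;
run;
```

Note that under rule 1 a missing value becomes indistinguishable from a genuine 'N'.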

I then ran PROC VARCLUS on that data, hoping to reduce the number of variables and build a better prediction model:

proc varclus data=worktable maxeigen=0.7 outtree=tree maxclusters=2;
    var a-z;
run;

This gives a total variation explained of ~30.

Increasing to maxclusters=3 gives a total variation explained of ~50, and increasing all the way to maxclusters=30 gives ~120.

I then increased to maxclusters=40 and got a total variation explained of ~130; the value always goes up. From what I have read online, this value is normally in the 30 to 40 range and actually goes down once maxclusters exceeds 10.

I am aware that every dataset is different. However, my result seems quite abnormal. Do you have any suggestions, or an explanation for why the total variation explained is so large?

Thank you very much,

Thao

Re: Large value for variation explained using PROC VARCLUS

I would be wary of converting missing data to a valid category. Because VARCLUS omits missing values from the analysis, you are getting a very different result than if you use the following:

1. Binary variable: Y = 1, N = 0, missing data = '.'

2. Categorical: 1, 2, ..., n, where n is the number of categories; missing data = '.'
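As a rough sketch of this alternative coding, the DATA step below leaves missing values as SAS missing (`.`) instead of folding them into 0, then counts how many values are missing per variable. The names `rawtable`, `worktable_miss`, `flag1`, `cat1`, and the levels 'A', 'B', 'C' are hypothetical placeholders.

```sas
/* Hypothetical names: rawtable, worktable_miss, flag1, cat1 and the
   levels 'A','B','C' are placeholders for illustration only. */
data worktable_miss;
    set rawtable;

    /* Binary: Y = 1, N = 0, anything else stays missing (.) */
    if flag1 = 'Y' then flag1_num = 1;
    else if flag1 = 'N' then flag1_num = 0;
    else flag1_num = .;

    /* Categorical: 1..n, anything else stays missing (.) */
    select (cat1);
        when ('A') cat1_num = 1;
        when ('B') cat1_num = 2;
        when ('C') cat1_num = 3;
        otherwise  cat1_num = .;
    end;

    drop flag1 cat1;
run;

/* Report non-missing and missing counts for each converted variable */
proc means data=worktable_miss n nmiss;
    var flag1_num cat1_num;
run;
```

The NMISS column shows how much of each variable the 0-coding had been treating as real data.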

Re: Large value for variation explained using PROC VARCLUS

Hi Rick,

If I do not convert a missing value to a number, VARCLUS will not process any observation that contains one. My dataset has missing data in almost every variable, so without the conversion SAS reports ~39,000 observations omitted due to missing data.

Do you have a suggestion for this case?

Thanks,

Thao
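One possible way to keep most of the data, offered as a sketch and not as a recommendation made in this thread, is to compute the correlation matrix with pairwise deletion in PROC CORR and pass that matrix to PROC VARCLUS. The data set and variable names below (`worktable_miss`, `corrmat`, `flag1_num`, `cat1_num`) are hypothetical. Two caveats apply: a pairwise-deletion correlation matrix is not guaranteed to be positive semidefinite, and Pearson correlations on integer-coded nominal categories may not be meaningful.

```sas
/* Sketch only: worktable_miss, corrmat, flag1_num and cat1_num are
   hypothetical names. */

/* PROC CORR uses pairwise deletion unless NOMISS is specified, so each
   correlation is computed from the rows where both variables are present. */
proc corr data=worktable_miss outp=corrmat noprint;
    var flag1_num cat1_num;
run;

/* OUTP= writes a TYPE=CORR data set, which PROC VARCLUS accepts as input. */
proc varclus data=corrmat maxeigen=0.7 maxclusters=2;
    var flag1_num cat1_num;
run;
```

With this approach an observation contributes to every correlation for which it has both values, so it is not dropped entirely because of a single missing variable.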
