thaolp
Calcite | Level 5

Hi all,

I am relatively new to SAS, having used it for the last month to do statistical analysis on small and medium-sized datasets. I am now working with a much larger dataset (~40000 observations) with around 300 variables.

Of these 300 variables, more than half are not numerical (binary, categorical) so I have created another dataset with the same number of observations but all numerical variables. My conversion rules are as follows:

1. Binary variable: Y = 1, N = 0, missing data = 0

2. Categorical: 0, 1, 2 .. n where n is the number of categories. Missing data set to 0

I then run PROC VARCLUS on that data with the hope to be able to reduce the number of variables to make a better prediction model:

proc varclus data=worktable maxeigen=0.7 outtree=tree maxclusters=2;
    var a-z;
run;

This gives a total variation explained of about 30.

Increasing to maxclusters=3 gives a total variation explained of about 50, and going all the way to maxclusters=30 gives about 120.

I then increased to maxclusters=40 and got a total variation explained of about 130; the value always goes up. From what I have read online, this value is normally in the 30 to 40 range and actually goes down once maxclusters exceeds 10.

I am aware that each dataset is unique. However, the result I get seems quite abnormal. Do you have any suggestions, or an explanation for why the total variation explained is so large?

Thank you very much,

Thao

2 REPLIES 2
Rick_SAS
SAS Super FREQ

I would be wary of converting missing data to a valid category. Because VARCLUS omits missing values from the analysis, you are getting a very different result than if you use the following:

1. Binary variable: Y = 1, N = 0, missing data = '.'

2. Categorical: 1, 2 .. n where n is the number of categories. Missing data set to '.'
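
As a rough illustration of the recoding above, a DATA step along these lines keeps missing values as SAS missing ('.') instead of mapping them to 0. This is only a sketch: the input dataset name rawtable, the variables flag_yn and colour, their category labels, and the output variables flag_num and colour_num are all hypothetical stand-ins for the real columns. Only worktable comes from the original post.

```sas
/* Sketch only: rawtable, flag_yn, colour, flag_num, colour_num are hypothetical names */
data worktable;
    set rawtable;

    /* Binary: Y -> 1, N -> 0, anything else (including blank) stays missing */
    if flag_yn = 'Y' then flag_num = 1;
    else if flag_yn = 'N' then flag_num = 0;
    else flag_num = .;

    /* Categorical: codes 1..n, unmatched or blank values stay missing */
    select (colour);
        when ('red')   colour_num = 1;
        when ('green') colour_num = 2;
        when ('blue')  colour_num = 3;
        otherwise      colour_num = .;
    end;
run;

/* Count non-missing and missing values per recoded variable */
proc means data=worktable n nmiss;
    var flag_num colour_num;
run;
```

The PROC MEANS step is there to show how many values are missing in each recoded variable before the data goes to VARCLUS.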

thaolp
Calcite | Level 5

Hi Rick,

If an observation has a missing value in any variable and I do not convert it to a valid value, VARCLUS will not process that observation. My dataset has missing data in almost all variables, so without the conversion the log reports that ~39000 observations were omitted due to missing data.

Do you have a suggestion for this case?

Thanks,

Thao
