thaolp
Calcite | Level 5

Hi all,

I am relatively new to SAS, having used it for the past month to run statistical analyses on small and medium-sized datasets. I am now working with a much larger dataset (~40,000 observations) with around 300 variables.

Of these 300 variables, more than half are non-numeric (binary or categorical), so I have created another dataset with the same number of observations in which every variable is numeric. My conversion rules are as follows:

1. Binary variable: Y = 1, N = 0, missing data = 0

2. Categorical: 0, 1, 2 .. n where n is the number of categories. Missing data set to 0

I then ran PROC VARCLUS on that data, hoping to reduce the number of variables and build a better prediction model:

proc varclus data=worktable maxeigen=0.7 outtree=tree maxclusters=2;
   var a-z;
run;

This gives me a total variation explained of ~30. Increasing to maxclusters=3 gives ~50, and going all the way up to maxclusters=30 gives ~120.

I then increased to maxclusters=40, which gave ~130; the total variation explained always goes up. From what I have read online, these values are normally in the 30-40 range and actually go down once maxclusters exceeds 10.
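For reference, here is a sketch of how this statistic is usually defined. It assumes the PROC VARCLUS defaults: the analysis is based on correlations, and each cluster component is the first principal component of its cluster. The symbols K, p, and \lambda_1^{(k)} are notation introduced here, not taken from the SAS output.

```latex
\[
V_{\text{explained}} \;=\; \sum_{k=1}^{K} \lambda_1^{(k)},
\qquad
\text{Proportion explained} \;=\; \frac{V_{\text{explained}}}{p}
\]
% K               : number of clusters in the current solution
% \lambda_1^{(k)} : largest eigenvalue of the correlation matrix of the variables in cluster k
% p               : number of variables analyzed (the total variation under a correlation analysis)
```

Under these assumptions the statistic is measured in units of variables rather than percent, so it is bounded by p and would be expected to grow as clusters are split. If all of the roughly 300 variables were in the VAR list (an assumption, since the post shows only a placeholder list), 120 would correspond to a proportion of about 0.40 and 130 to about 0.43.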

I am aware that every dataset is different. However, my result seems quite abnormal. Do you have any suggestions, or an explanation for why the total variation explained is so large?

Thank you very much,

Thao

Rick_SAS
SAS Super FREQ

I would be wary of converting missing data to a valid category. Because VARCLUS omits missing values from the analysis, you are getting a very different result than if you use the following:

1. Binary variable: Y = 1, N = 0, missing data = '.'

2. Categorical: 1, 2 .. n where n is the number of categories. Missing data set to '.'
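A minimal sketch of that recoding, assuming the original character data lives in a table called RAW with a Y/N variable FLAG_CHAR and a categorical variable CAT_CHAR taking the values 'A', 'B', 'C'. All of these names and category values are hypothetical; only WORKTABLE comes from the original post.

```sas
/* Hypothetical informat: map category labels to 1..n, everything else to missing */
proc format;
   invalue catcode 'A' = 1
                   'B' = 2
                   'C' = 3
                   other = .;
run;

data worktable;
   set raw;                      /* RAW is an assumed input table name */

   /* Binary: Y -> 1, N -> 0, blank or anything else stays missing (.) */
   if flag_char = 'Y' then flag_num = 1;
   else if flag_char = 'N' then flag_num = 0;
   else flag_num = .;

   /* Categorical: apply the informat; unmatched or blank values become missing */
   cat_num = input(cat_char, catcode.);

   drop flag_char cat_char;
run;
```

Keeping missing values as `.` rather than 0 means they are not treated as a real category, which is the distinction the two rules above draw.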

thaolp
Calcite | Level 5

Hi Rick,

If an observation has a missing value in any variable and I do not convert it to a value, VARCLUS will not process that observation. My dataset has missing data in almost all variables, so if I leave the missing values as they are, SAS reports ~39,000 observations omitted due to missing data.
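One way to see how the missing values are distributed is sketched below. It assumes WORKTABLE holds the numeric variables with missing values left as `.`; the names MISS_CHECK, N_MISSING, and COMPLETE_CASE are hypothetical.

```sas
/* Per-variable counts of non-missing (N) and missing (NMISS) values */
proc means data=worktable n nmiss;
   var _numeric_;
run;

/* Per-observation count of missing numeric values */
data miss_check;
   set worktable;
   /* The array is declared before the new variables exist, */
   /* so it covers only the numeric variables read from WORKTABLE */
   array nums {*} _numeric_;
   n_missing     = nmiss(of nums[*]);
   complete_case = (n_missing = 0);   /* 1 if no value is missing, else 0 */
run;

/* How many observations have no missing values in any numeric variable? */
proc freq data=miss_check;
   tables complete_case;
run;
```

The PROC MEANS output shows whether the missing values are concentrated in a few variables or spread across most of them, and the PROC FREQ table gives the number of observations with no missing numeric values.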

Do you have a suggestion for this case?

Thanks,

Thao
