- Large value for variation explained using PROC VAR...

09-05-2012 10:41 AM

Hi all,

I am relatively new to SAS, having used it for the past month for statistical analysis on some small-to-medium datasets. I am now working with a much larger dataset (~40,000 observations) with around 300 variables.

Of these 300 variables, more than half are non-numeric (binary or categorical), so I have created another dataset with the same number of observations in which all variables are numeric. My conversion rules are as follows:

1. Binary variable: Y = 1, N = 0, missing data = 0

2. Categorical: 0, 1, 2 .. n where n is the number of categories. Missing data set to 0

I then run PROC VARCLUS on that data, hoping to reduce the number of variables and build a better prediction model:

proc varclus data=worktable maxeigen=0.7 outtree=tree maxclusters=2;

var a-z;

run;

This gives me a total variation explained of ~30.

Increasing to maxclusters=3 gives a total variation explained of ~50, and increasing all the way to maxclusters=30 gives ~120.

I then increased to maxclusters=40 and got a total variation explained of ~130; the value always goes up. From what I have read online, these values are normally in the 30-40 range and actually go down once maxclusters increases beyond 10.

I am aware that each dataset is unique. However, the result I have seems quite abnormal. Do you have any suggestions or an explanation for why the total variation explained is so large?

Thank you very much,

Thao

09-05-2012 10:57 AM

I would be wary of converting missing data to a valid category. Because VARCLUS omits missing values from the analysis, you are getting a very different result than if you use the following:

1. Binary variable: Y = 1, N = 0, missing data = '.'

2. Categorical: 1, 2 .. n where n is the number of categories. Missing data set to '.'
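The recoding suggested above might be sketched as a DATA step like the one below. The names rawtable, flag_c, and flag are hypothetical (they do not appear in the thread), and the step assumes the binary variable is stored as a character Y/N value.

```sas
/* Hypothetical names: rawtable (input dataset), flag_c (character Y/N), */
/* flag (new numeric variable). Adjust to the actual data.               */
data worktable;
    set rawtable;
    if flag_c = 'Y' then flag = 1;
    else if flag_c = 'N' then flag = 0;
    else flag = .;   /* keep missing as a numeric missing value, not 0 */
    drop flag_c;
run;
```

Leaving the value as missing keeps it distinguishable from a real N, at the cost of that observation being excluded from the VARCLUS analysis.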

09-05-2012 04:09 PM

Hi Rick,

If an observation has a missing value in any variable and I do not convert it to a valid value, VARCLUS will not process that observation. My dataset has missing data in almost all variables, so if I do not convert them, it reports that ~39,000 observations were omitted due to missing data.

Do you have a suggestion for this case?

Thanks,

Thao
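The extent of the omission described above can be checked before running VARCLUS. This is a minimal sketch, assuming the dataset is named worktable as in the original call and that missing values were left as numeric missing rather than recoded to 0. The variable n_incomplete is a hypothetical counter introduced for illustration.

```sas
/* Per-variable counts of non-missing (N) and missing (NMISS) values */
proc means data=worktable n nmiss;
    var _numeric_;
run;

/* Count observations that have at least one missing numeric value */
data _null_;
    set worktable end=last;
    if nmiss(of _numeric_) > 0 then n_incomplete + 1;  /* hypothetical counter */
    if last then put "Observations with at least one missing value: " n_incomplete;
run;
```

The PROC MEANS output shows which variables account for most of the missing data. The DATA step writes the number of incomplete observations to the log.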