Calcite | Level 5

## Proc Distance and Varclus

Hi, hope someone can help me with this one!

One of the "limitations" of Proc Varclus is that it only divides a set of numeric variables into disjoint or hierarchical clusters, from which you can then remove redundant variables, etc. However, more often than not, your data set is made up not only of numeric variables but also of ordinal, binary, and other types. So I thought I could use Proc Distance to produce a matrix to use as input for Varclus, but sadly Proc Varclus doesn't accept a type=DISTANCE data set as input.

I can produce a data set type=DISTANCE and then convert it manually to type=CORR in order to use it with Proc Varclus but I am not sure about the following:

• How will Varclus interpret this data set, where I have zeros on the diagonal instead of ones? Or
• does type=CORR only tell Varclus how to read the data, with no calculation in which the correlation is actually involved?
• Is there perhaps a different approach to this problem using different procedures?
• Does this idea make sense at all (statistically speaking)?

Many Thanks,

Alberto

SAS Super FREQ


I think there are some problems (statistically speaking) with your approach. You can't have a correlation matrix with zeros on the diagonal; VARCLUS will know that it can't compute with such a nonsensical matrix.
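To spell out the diagonal issue (a small illustration, with $R$ and $D$ as my own notation for the correlation and distance matrices of variables $x_1, \dots, x_p$):

$$
r_{ii} = \operatorname{corr}(x_i, x_i) = 1, \qquad d_{ii} = d(x_i, x_i) = 0, \qquad i = 1, \dots, p.
$$

Every variable is perfectly correlated with itself but at zero distance from itself, so a TYPE=DISTANCE matrix relabeled as TYPE=CORR claims that each variable has zero correlation with itself, which no valid correlation matrix can do.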

If you have ORDINAL character values (like "small", "medium", and "large"), you can recode the values in various ways. The simplest is to assign the value j to the jth ordered category, but there are other ways as well. You can use PROC FREQ to do this: use the SCORES= option on the TABLES statement and request a SCOREOUT data set http://support.sas.com/documentation/cdl/en/procstat/63963/HTML/default/viewer.htm#procstat_freq_sec...
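As a minimal sketch of the simplest recoding (value j for the jth ordered category), something like the following DATA step could work. The data set names `have` and `scored`, the character variable `size`, and the numeric variables `x1-x5` are all hypothetical placeholders, not names from your data.

```sas
/* Sketch only: HAVE, SIZE, and X1-X5 are assumed names. */
data scored;
   set have;
   /* Assign the integer j to the j-th ordered category */
   select (lowcase(strip(size)));
      when ('small')  size_score = 1;
      when ('medium') size_score = 2;
      when ('large')  size_score = 3;
      otherwise       size_score = .;   /* unknown level -> missing */
   end;
run;

/* The recoded ordinal variable is now numeric and can be
   clustered together with the other numeric variables. */
proc varclus data=scored;
   var size_score x1-x5;
run;
```

Equally spaced integer scores treat the gaps between adjacent categories as equal, which is itself an assumption; other scorings may suit your data better.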

If you have general nominal data (for example, "red," "green," and "blue") then I don't know how to make sense of your question.
