Standardize binary variables in cluster analysis?

05-03-2017 10:53 AM

I'm performing a cluster analysis on a health insurance dataset (using PROC DISTANCE and PROC CLUSTER) containing 4,343 observations with a mix of continuous and binary variables.

I understand the importance of standardizing continuous variables. However, given the wide range of values for some of my continuous variables (notably outlier values for hospital visit counts and total medical expenses), I'm *still* seeing maximum z-scores of 15 or higher for the standardized continuous variables, compared with a maximum of 1 for the unstandardized binary variables.

**Should binary variables be standardized as well to prevent undue weight being placed on continuous variables?**

For example, rare binary events such as MED_STROKE=1 (only 7 cases) would receive a standardized value of 24.9 given their "distance" from the mean value of MED_STROKE, which is close to zero.
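
To show where the 24.9 figure comes from, here is a minimal Python sketch of the arithmetic (not SAS code). It assumes the z-score uses the sample standard deviation with an n - 1 denominator; the variable names are only illustrative.

```python
import math

n = 4343       # observations in the dataset (from the post)
cases = 7      # observations with MED_STROKE = 1 (from the post)

p = cases / n                                # mean of the 0/1 variable
sd = math.sqrt(p * (1 - p) * n / (n - 1))    # sample standard deviation (n - 1 denominator)

z_one = (1 - p) / sd     # standardized value when MED_STROKE = 1
z_zero = (0 - p) / sd    # standardized value when MED_STROKE = 0

print(round(z_one, 1), round(z_zero, 2))
```

The seven stroke cases land roughly 25 standard deviations from the mean, while every other observation sits just below zero.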

05-03-2017 11:10 AM

How much have you explored the options for the VAR statement in Proc Distance?

05-03-2017 11:15 AM

I'm aware there is a range of standardization options. I'm considering a simple z-score (the STD=STD option on the PROC DISTANCE VAR statement) as a measure of the "distance" between the x=0 and x=1 observations in binary variables.
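
As a rough illustration of what z-scoring does to that 0-to-1 "distance", the Python sketch below (again, not SAS) computes the gap between the standardized values of x=0 and x=1 for a binary variable with mean p. It assumes the population standard deviation sqrt(p(1 - p)); `standardized_gap` is a hypothetical helper, and all prevalences except 7/4343 are made up for comparison.

```python
import math

def standardized_gap(p):
    """Gap between x=0 and x=1 after z-scoring a binary variable with mean p.

    Uses the population standard deviation sqrt(p * (1 - p)), so the gap is 1 / sd.
    """
    return 1.0 / math.sqrt(p * (1.0 - p))

# Hypothetical prevalences, except the last one (7 / 4343, from the original post).
for p in (0.5, 0.1, 0.01, 7 / 4343):
    print(f"p = {p:.4f}  gap = {standardized_gap(p):.1f}")
```

The gap grows as the event gets rarer, so under a Euclidean distance a mismatch on a rare flag would count for much more than a mismatch on a common one.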