Box Cox-Two Distributions in same data set

11-21-2016 01:33 PM

I have untransformed data whose histograms show two distributions.

Would it therefore be okay to apply the Box-Cox analysis separately to the distributions above and below a cutoff (CLud > 250000 and Clud < 250000 in this case)? I ask because when I do separate analyses, the resulting tests for normality and the Q-Q plots are vastly improved.

A follow-up question: should one analyze the data as two distributions if the transformed data still show two distinct distributions?

All Replies

11-21-2016 07:14 PM

What is your goal? Why do you want to transform the data to normality? For example, the data could be distributed like an exponential or gamma.

Try a smaller bin size (maybe 50,000) and see if a decreasing decaying distribution appears.

Are there subpopulations involved? For example, if you plot the heights of students in a classroom, you might get a distribution that is a MIXTURE of the heights of men and women. In that case, you should model the distribution as a finite mixture model. PROC FMM can do that. See "Modeling finite mixtures with the FMM procedure."
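
As a rough illustration of the finite mixture idea above, here is a minimal sketch in Python rather than SAS. It is not PROC FMM and does not reproduce its output; it fits a two-component normal mixture with a plain EM loop. The simulated heights, the group means and spreads, and the function names `normal_pdf` and `fit_two_normals` are all assumptions made up for the example.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_normals(data, iters=200):
    """EM for a two-component normal mixture.

    Returns (w, mu1, sigma1, mu2, sigma2), where w is the weight of component 1.
    """
    n = len(data)
    s = sorted(data)
    mean = sum(s) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in s) / n)
    # Crude starting values: quartiles as the two means, overall spread for both.
    w, mu1, mu2, sigma1, sigma2 = 0.5, s[n // 4], s[3 * n // 4], sd, sd
    for _ in range(iters):
        # E-step: probability that each observation belongs to component 1.
        resp = []
        for v in data:
            p1 = w * normal_pdf(v, mu1, sigma1)
            p2 = (1.0 - w) * normal_pdf(v, mu2, sigma2)
            resp.append(p1 / (p1 + p2))
        # M-step: update the weight, means, and standard deviations.
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * v for r, v in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * v for r, v in zip(resp, data)) / n2
        sigma1 = math.sqrt(sum(r * (v - mu1) ** 2 for r, v in zip(resp, data)) / n1)
        sigma2 = math.sqrt(sum((1.0 - r) * (v - mu2) ** 2 for r, v in zip(resp, data)) / n2)
        w = n1 / n
    return w, mu1, sigma1, mu2, sigma2

rng = random.Random(42)
# Simulated heights in cm; the group means and spreads are made-up values.
heights = [rng.gauss(160, 5) for _ in range(1000)] + [rng.gauss(180, 5) for _ in range(1000)]
w, mu1, sigma1, mu2, sigma2 = fit_two_normals(heights)
print(f"weight={w:.2f}  mean1={mu1:.1f} sd1={sigma1:.1f}  mean2={mu2:.1f} sd2={sigma2:.1f}")
```

The point of the sketch is that the mixture is estimated from all the data at once, so each observation gets a probability of belonging to each component instead of being split at a hard cutoff.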

11-21-2016 07:23 PM

My goal is to do a linear regression with the data, which, if not normal, would violate the assumptions for the regression.

I chose that bin size because it appeared to coincide with the observed histogram. There may be subpopulations involved; at this point we are not sure, but we are investigating.

I was not aware of PROC FMM. Are there any normality requirements for using PROC FMM?

Solution

11-22-2016
08:16 AM

11-21-2016 08:05 PM

Linear regression does not require that the variables be normal. Neither the response variable nor the explanatory variable(s) have to be normally distributed.

The assumptions for linear regression are presented in the SAS/STAT documentation. The important assumption for estimating the parameters is that the errors are independent and identically distributed. For some inferential statistics (standard errors, confidence intervals, ...), the errors are assumed to be normally distributed. Thus, for these statistics to be valid, a plot of the RESIDUALS should look approximately normal. This is much different than saying that the response is normal.
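
To make the distinction between the response and the residuals concrete, here is a small simulation sketch in Python (not SAS). Everything in it is hypothetical: the predictor is drawn from two made-up clusters, the true line is 5 + 2x, and `fit_line` and `excess_kurtosis` are helper names invented for the example. Excess kurtosis is used only as a crude shape summary: it is near 0 for normal data and strongly negative for two well-separated modes.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1

def excess_kurtosis(v):
    """Population-style excess kurtosis: fourth moment / variance^2 - 3."""
    m = statistics.fmean(v)
    s2 = sum((vi - m) ** 2 for vi in v) / len(v)
    m4 = sum((vi - m) ** 4 for vi in v) / len(v)
    return m4 / s2 ** 2 - 3.0

rng = random.Random(2016)
# Hypothetical predictor drawn from two clusters, so the response is bimodal too.
x = [rng.gauss(10, 1) for _ in range(500)] + [rng.gauss(30, 1) for _ in range(500)]
# Errors are independent normal draws; this is the only normality in the setup.
y = [5.0 + 2.0 * xi + rng.gauss(0, 1.5) for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
kurt_y = excess_kurtosis(y)
kurt_res = excess_kurtosis(residuals)
print(f"slope={b1:.3f}  excess kurtosis: response={kurt_y:.2f}, residuals={kurt_res:.2f}")
```

In this setup the response is clearly non-normal, yet the slope is recovered and the residuals look close to normal, which is the situation described above.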

11-21-2016 09:33 PM

Or you could try nonparametric regression with PROC LOESS.
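
For readers unfamiliar with the idea behind loess, here is a simplified sketch in Python (not SAS): a local linear fit with tricube weights at a single point. It is not what PROC LOESS computes in detail (for instance, it has no robustness iterations or smoothing-parameter selection), and the sine-curve data, the `frac` value, and the function name `loess_at` are assumptions chosen for the example.

```python
import math
import random

def loess_at(x0, xs, ys, frac=0.2):
    """Local linear fit at x0 using the nearest `frac` of the points, tricube-weighted."""
    k = max(3, math.ceil(frac * len(xs)))
    h = sorted(abs(xi - x0) for xi in xs)[k - 1]  # distance to the k-th nearest point
    wts = [(1.0 - (abs(xi - x0) / h) ** 3) ** 3 if abs(xi - x0) < h else 0.0 for xi in xs]
    # Weighted least squares for a straight line in the neighbourhood of x0.
    sw = sum(wts)
    mx = sum(w * xi for w, xi in zip(wts, xs)) / sw
    my = sum(w * yi for w, yi in zip(wts, ys)) / sw
    sxx = sum(w * (xi - mx) ** 2 for w, xi in zip(wts, xs))
    sxy = sum(w * (xi - mx) * (yi - my) for w, xi, yi in zip(wts, xs, ys))
    slope = sxy / sxx
    return my + slope * (x0 - mx)

rng = random.Random(7)
# Made-up data: a sine curve plus normal noise.
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(300)]
ys = [math.sin(xi) + rng.gauss(0.0, 0.2) for xi in xs]
peak = loess_at(math.pi / 2, xs, ys)
trough = loess_at(3 * math.pi / 2, xs, ys)
print(f"fit near the peak: {peak:.2f}, fit near the trough: {trough:.2f}")
```

No global functional form is assumed here; the curve is estimated from nearby points only, which is why this kind of fit is called nonparametric.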

11-22-2016 08:18 AM

Not sure if that would help but I may try it later.