## Outliers in simulation

Occasional Contributor
Posts: 12

# Outliers in simulation

Hi everyone, how can I determine the number of outliers in my simulation? Can anyone help?

Super User
Posts: 23,774

## Outliers in simulation

You need to define an outlier (0/1) then add that up.

Defining the outlier is the problem: is it something outside the 99.9% CI?
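As a minimal sketch of the "flag 0/1, then add up" idea, assuming a single standard normal variable and a two-sided 99.9% cutoff (the data set name `sim`, the seed, and the sample size are placeholders, not anything from the original question):

```sas
data sim;
   call streaminit(12345);                 /* arbitrary seed, for reproducibility */
   cutoff = quantile("Normal", 0.9995);    /* two-sided 99.9% cutoff for N(0,1) */
   do i = 1 to 1000;
      x = rand("Normal", 0, 1);
      outlier = (abs(x) > cutoff);         /* 1 if outside the cutoff, else 0 */
      output;
   end;
run;

/* The SUM of the 0/1 flag is the number of outliers */
proc means data=sim sum;
   var outlier;
run;
```

Any other definition of an outlier only changes the expression that sets `outlier`; the counting step stays the same.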

It really depends on what you're looking at and your modelling criteria. How many parameters are you looking at? Do you have a single outcome or multiple outcomes?

What counts as an outlier also depends on business context: for machinery it might be 99%, but for medical data it could be 95%.

We need more details on what you're simulating in order to help out.

Occasional Contributor
Posts: 12

## Re: Outliers in simulation

What I need is to generate the independent variables in a regression relationship, and I need the generated independent variables to contain outliers.

By outliers I simply mean values that are far away from the rest of the generated data (either high or low), not values defined by particular criteria or anything related to a CI. It is not for a business context; it is just an application exercise. Thanks for your effort.

Super User
Posts: 23,774

## Re: Outliers in simulation

Ok...same idea then.

Take each independent variable that was generated and flag whether it's an outlier or not.

AFAIK there really isn't an absolute statistical definition of what is an outlier, so you'll need to come up with that.

There are some suggested methods on Wikipedia:

http://en.wikipedia.org/wiki/Outlier
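One of the methods listed on that page is Tukey's fences (the 1.5 × IQR rule). A rough SAS/IML sketch of flagging and counting with that rule might look like the following. The simulated data, the seed, and the three planted extreme values are assumptions made only for illustration:

```sas
proc iml;
call randseed(2024);                  /* arbitrary seed */
x = j(100, 1);                        /* allocate a 100 x 1 vector */
call randgen(x, "Normal", 0, 1);      /* fill with N(0,1) draws */
x[1:3] = {8, -9, 12};                 /* plant three extreme values (illustration only) */

call qntl(q, x, {0.25 0.75});         /* first and third quartiles */
iqr   = q[2] - q[1];
lower = q[1] - 1.5*iqr;               /* Tukey's lower fence */
upper = q[2] + 1.5*iqr;               /* Tukey's upper fence */

outlier   = (x < lower) | (x > upper);   /* 0/1 flag for each observation */
nOutliers = sum(outlier);                /* number of flagged values */
print lower upper nOutliers;
quit;
```

The multiplier 1.5 is a convention, not a statistical law, so it can be changed to suit the definition you settle on.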

SAS Super FREQ
Posts: 4,245

## Re: Outliers in simulation

One way to do this is to use the idea of a "contaminated normal distribution," which is a specific kind of mixture distribution.

After you define the x variable, simulate the y variable as follows:

type = rand("Bernoulli", 0.1); /* outlier with 10% probability */

if type=1 then

error = rand("Normal", 0, 10); /* error is N(0, 10) */

else

error = rand("Normal", 0, 1); /* error is N(0, 1) */

y = intercept + beta*x + error;

outlier = (abs(error)>3);

Change the probability of contamination (0.1), the magnitude of the contamination (10) and the definition of an outlier (3) as your needs require.

For more info on the general case of sampling from a mixture distribution, see http://blogs.sas.com/content/iml/2011/09/21/generate-a-random-sample-from-a-mixture-distribution/
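For completeness, here is how the statements above might be wrapped in a full DATA step. This is only a sketch: the values of `intercept` and `beta`, the uniform distribution for `x`, the seed, and the sample size are all assumptions chosen for illustration.

```sas
data Sim;
   call streaminit(1);                   /* arbitrary seed */
   intercept = 1;  beta = 2;             /* hypothetical regression parameters */
   do i = 1 to 100;
      x = rand("Uniform");               /* hypothetical choice for the regressor */
      type = rand("Bernoulli", 0.1);     /* contaminated with 10% probability */
      if type = 1 then error = rand("Normal", 0, 10);   /* contaminated error */
      else             error = rand("Normal", 0, 1);    /* regular error */
      y = intercept + beta*x + error;
      outlier = (abs(error) > 3);        /* 0/1 flag */
      output;
   end;
   keep x y type outlier;
run;

/* Count how many observations were flagged */
proc freq data=Sim;
   tables outlier;
run;
```

Note that `type` records which component generated the error, while `outlier` records whether the error was actually large, so the two flags need not agree.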

Rick

Occasional Contributor
Posts: 12

## Re: Outliers in simulation

Thanks a lot for your effort, but I still have a problem with this part.

If I need the outliers in the independent variables (the x's), would I follow the same procedure? And how do I specify the correlation between the x's if I generate each x separately?

The other problem is that I am using NLPCG, and I set the first row of the blc matrix to zeros because I need my decision variables to be positive, but the resulting variables still have negative values. How can I solve this problem?

I also have another model that is linear in both the objective function and the constraints. What is the suitable call for it?

SAS Super FREQ
Posts: 4,245

## Outliers in simulation

If you want correlated data, generate your X's from some multivariate distribution with the given correlation structure. Then add outliers (from the same distribution but with a larger variance?).
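A rough SAS/IML sketch of that suggestion, assuming a multivariate normal distribution: the covariance matrix `Sigma`, the contamination probability `p`, and the scale factor `k` are made-up values for illustration.

```sas
proc iml;
call randseed(12345);                 /* arbitrary seed */
N  = 200;
mu = {0 0 0};
Sigma = {1.0 0.5 0.3,                 /* hypothetical covariance (here also correlation) */
         0.5 1.0 0.4,
         0.3 0.4 1.0};
p = 0.1;                              /* contamination probability (assumption) */
k = 10;                               /* std dev multiplier for outliers (assumption) */

x    = RandNormal(N, mu, Sigma);          /* regular observations */
xOut = RandNormal(N, mu, k##2 * Sigma);   /* same correlation, larger variance */

type = j(N, 1);
call randgen(type, "Bernoulli", p);       /* 1 = contaminated row */
idx = loc(type = 1);
if ncol(idx) > 0 then x[idx, ] = xOut[idx, ];   /* replace whole rows */

corrX = corr(x);                          /* sample correlation of the mixture */
print corrX;
quit;
```

Because the contaminating covariance is a scalar multiple of `Sigma` and both components share the same mean, the population correlation of the mixture is unchanged. The sample correlation can still wander from it, since a few extreme rows carry a lot of weight.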

I don't understand how you are using NLPCG. You haven't said what you are optimizing. Nevertheless, I don't see how you can get negative values if you specify the blc matrix correctly.  Make sure that your initial guess is valid.
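To illustrate what a bounds-only blc matrix and a feasible initial guess look like, here is a small sketch. The objective function `ObjFun` and the point `c` are hypothetical, since the actual objective was not posted:

```sas
proc iml;
/* Hypothetical objective: squared distance from the point c */
start ObjFun(x);
   c = {-1 2};
   return( ssq(x - c) );
finish ObjFun;

/* Bounds only: row 1 = lower bounds, row 2 = upper bounds (missing = none) */
con = {0 0,
       . .};
x0  = {1 1};          /* feasible initial guess: satisfies x >= 0 */
opt = {0 1};          /* opt[1]=0 requests minimization; opt[2] sets the print level */
call nlpcg(rc, xres, "ObjFun", x0, opt, con);

xres = xres <> 0;     /* elementwise max with 0: clamps roundoff-level negatives */
print rc xres;
quit;
```

The last assignment is optional. It only matters if the solver returns a value that is negative at the level of floating-point roundoff for a variable sitting on its lower bound.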

Occasional Contributor
Posts: 12

## Re: Outliers in simulation

• The constraints in NLPCG are put in matrix form, with the first row representing the lower bounds, so I defined the matrix as

con={0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  .  . ,

.  .  .  .  .  .  .  .  .  .  .  . ,

40. 51. 60. 24. 53. 80. 16. 34. 52. 84. 0. 42.894,

1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  0. 1.  };

but the resulting variables still have negative values such as -7.05E-18, so how can I solve this problem?

• The other thing is that when I generated the x's this way and added outliers to them (generated from the same distribution with a larger variance), the new variables do not have the same correlation that was specified at the beginning. How can I solve this problem?

• I also need to know how to write a condition so that:

if the correlation between y and x1 is greater than or equal to 0.5, x1 belongs to matrix H

if the correlation between y and x1 is less than 0.5, x1 belongs to matrix K
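The last condition could be sketched in SAS/IML along the following lines. The simulated `x` and `y` are stand-ins invented for the example, and the rule is applied to every column of `x` rather than to x1 alone:

```sas
proc iml;
call randseed(1);                        /* arbitrary seed */
x = j(100, 3);
call randgen(x, "Normal", 0, 1);         /* hypothetical regressors */
e = j(100, 1);
call randgen(e, "Normal", 0, 1);
y = 2*x[, 1] + 0.2*x[, 2] + e;           /* hypothetical response */

r   = corr(y || x);                      /* correlation matrix of [y x1 x2 x3] */
ryx = r[1, 2:ncol(r)];                   /* correlations between y and each x */

hIdx = loc(ryx >= 0.5);                  /* columns that go to H */
kIdx = loc(ryx <  0.5);                  /* columns that go to K */
if ncol(hIdx) > 0 then H = x[, hIdx];
if ncol(kIdx) > 0 then K = x[, kIdx];
print ryx, hIdx, kIdx;
quit;
```

The `ncol` checks are there because `loc` returns an empty matrix when no column satisfies the condition, and an empty index cannot be used as a subscript.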

Thanks
