BookmarkSubscribeRSS Feed
timtimalo
Obsidian | Level 7

I built confidence intervals for an unknown parameter (coefficient of variation) using different combinations of true parameters (mu and sigma) for simulation purpose and I calulated coverage probabilities of these confidence intervals and interval widths. The resulted confidence intervals have the same coverage probability and different interval widths in case that the sample sample is fixed and different combinatrions of mu and sigma values. Is it correct? What is the explantion? 
Thank you

17 REPLIES 17
Reeza
Super User

This is more of a stats than SAS question. Since you don't seem to be getting a response here, and keep posting the same question multiple time (please don't) I Suggest posting it at stats.stackexchange.com

Rick_SAS
SAS Super FREQ

Please review the suggestions for how to write a good question. Your question needs an example or sample code so that we have a better idea what you've tried and what you want to know..  Is this part of a simulation? If so, please post the code.

timtimalo
Obsidian | Level 7

Hello,

I used SAS to build confidence intervals using simulation work for an unknown parameter (coefficient of variation) for different combinations of true parameters (mu and sigma) and I got the same coverage probability for these confidence intervals. Is it correct? What is the explantion? 

Thank you

ballardw
Super User

Show some code and some data and/or results.

timtimalo
Obsidian | Level 7

Hello,

I used SAS to build confidence intervals of coefficient of variation for lognormal distribution using simulation work for different combinations of true parameters of lognormal distribution (mu and sigma). I tried sample size =15, mu=0 and different values of sigma such that sigma=0.04996, 0.09975, 0.19804, 0.38525, 0.47238.  I got the same coverage probability=0.9496 and interval widths =0.0349, 0.0702, 0.1436, 0.3125, 0.4141 respectively. Is it correct? What is the explanation? The following is the code.

Thank you.

Code of the program:

/*Prog to construct CI for COV of lognormal dist. using ranked set sampling technique*/  

proc iml;

sigma=0.47238;

sigmasq=sigma**2;/*the value of sigma in normal dist.*/

mu=1;

toi=sqrt(exp(sigmasq)-1);

*print toi;

mun=exp(mu+(0.5*sigmasq));/*the value of mean in lognormal dist.*/

print mun;

sigmasqn=exp((2*mu)+(2*sigmasq))-exp((2*mu)+(sigmasq));/*the value of sigmasquare in lognormal dist.*/

sigman=sqrt(sigmasqn);/*the value of sigma in lognormal dist.*/

na=5000;

nb=5000;

* Total sample size is 15 that is expressed as 3 cycles and each cycle contains 5 observations;

cycle=3;

n = 5; m = 5;

numobs=cycle*m;/*number of observations in the matrix*/

print numobs;

x1 = j(n,m);

x2 = j(n,m);

y1=j(cycle,m);

y2=j(cycle,m);

r=j(5000,1,0);

s=j(5000,1,0);

count=0;

*Generate 5000 data sets from normal dist. to calculate mean1, std1, estcov1;

do t=1 to na;

do l=1 to cycle;

call randseed(12345);

*x1 = j(n,m); /* allocate (n x m) matrix*/

call randgen(x1, "Normal", mu, sigma); /* fill it*/

do i = 1 to nrow(x1);

   v1 = x1[i, ]`;  /* get i_th row; transpose to column vector */

   call sort (v1, 1); /* sort it */

   x1[i, ] = v1`;  /* replace with sorted row */

end;/*end of i loop*/

c1=j(n,1);

c1=vecdiag(x1);/*to get diag elements of a matrix in a column*/

*print c1;

d1=c1`;

y1[l, ]=d1;

end;/*end of l loop*/

mean1=y1[:];/* calculate the mean of all elements of the matrix */

sumsq1=y1[##];/* calculate the sum of squares of all elements of the matrix */

var1=(1/(numobs-1))*(sumsq1-numobs*(mean1##2));

stdev1=sqrt(var1);

*Generate 5000 data sets from normal dist. to calculate indepent copy of the statistics mean2, std2, estcov2;

do k=1 to nb;

do h=1 to cycle;

call randseed(67891);

*x2 = j(n,m); /* allocate (n x m) matrix*/

call randgen(x2, "Normal", mu, sigma); /* fill it*/

do g = 1 to nrow(x2);

   v2 = x2[g, ]`;  /* get i_th row; transpose to column vector */

   call sort (v2, 1); /* sort it */

   x2[g, ] = v2`;  /* replace with sorted row */

end;/*end of g loop*/

c2=j(n,1);

c2=vecdiag(x2);/*to get diag elements of a matrix in a column*/

d2=c2`;

y2[h, ]=d2;

end;/*end of h loop*/

mean2=y2[:];/* calculate the mean of all elements of the matrix */

sumsq2=y2[##];/* calculate the sum of squares of all elements of the matrix */

var2=(1/(numobs-1))*(sumsq2-numobs*(mean2##2));

stdev2=sqrt(var2);

FGlognor=sqrt(exp((var1/var2)*(sigmasq))-1);

r[k,1]=FGlognor;

end;/*end of k loop*/

e=r;

r[rank(r),]=e;

low=r[125,];

up=r[4875,];

wid=up-low;

s[t,]=wid;

*Check if toi in the constructed FGCIs;

if low<=toi then if toi<=up then count=count+1;

end;/*end of t loop*/

p=count/5000;

print p;

* q is the coverage proability;

q=1-p;

print q;

*meanwid is the mean width of the constructed confidence intervals;

meanwid=s[:,];

print meanwid;

quit;

run;

Rick_SAS
SAS Super FREQ

I don't understand your program well enough to be able to tell you if it is correct, nut I do want to make you aware of some SAS/IML functions that might simplify your program.

  1. SAS/IML supports the CV function for computing the sample coeffiient of variation.
  2. You seem to be sorting every row of random variates, maybe because you later compute quantiles? SAS/IML supports the QNTL function for computing sample quantiles.

 

Two other comments:

 

It looks like you are doing two simulations, one to generate the confidence intervals and another to assess the coverage probability. Because you are using the same algorithm to construct the intervals and to evaluate the coverage probability, the second simulation will always give you a number close to 95%.  Thus the second simulation is unnecessary.

 

I'm not sure what you are trying to accomplish at the end of the program. For each set of parameters, you have 5000 intervals of the form [L, U].  What question are you trying to answer about these intervals?  It looks like you are computing the mean width of these intervals, but I'm not sure why this is relevant.  I might want to look at the distribution of the widths, or the distribution of the L's and the U's.

timtimalo
Obsidian | Level 7
Thank you for your valuable comments. The second simulation for 5000 data sets from normal dist. is important for the approach I used to calculate independent copy of the statistics (means and variances) for these simulated data. I used these variances to calculate “FGlognor” that is used to calculate L and U. It is a part of my work to calculate widths of these CIs. There is no problem that the coverage probability to be close to 95%. My question is: why for different values of sigma (sigma=0.04996, 0.09975, 0.19804, 0.38525, 0.47238), I got exactly the same coverage probability that equals 0.9496 without any change in the four decimals with different widths (width=0.0349, 0.0702, 0.1436, 0.3125, 0.4141 respectively)? Is this considered to be normal? Thank you
Rick_SAS
SAS Super FREQ

No, that doesn't sound correct. I would generate 5 samples instead of 5000 and see if it still happens.  If so, you can print out some of the intermediate computations to determine what might be happening.

 

While you're at it, you might try to vectorize some of the program so that it runs much faster.  One thing that might help is using the QNTL function for each row instead of sorting and selecting particualr indices. If you put each sample in a column rather than row, then a single call to the QNTL function can compute the 0.025 and 0.975 quantiles of all the columns. No need to loop, sort, and extract.

timtimalo
Obsidian | Level 7

I generated 10 samples instead of 5000 and still have the same problem.  I printed out the intermediate computations and they are correct. So, I could not determine what is wrong. Attached the purpose of the program, the steps I followed for creating the program, and the program.

Thank you

Rick_SAS
SAS Super FREQ

I'm sorry, but I cannot figure out what you are trying to accomplish.  Maybe part of my problem is that I don't know what parameter scale you are working on, normal or lognormal?  With respect to the Wikipedia article on the lognormal distribution (and this article, too), are you computing statistics in terms of (mu, sigma) or in terms of (m, v)?

 

Here are a few things I noticed:

1) In the first section, the definition of the lognormal CV is CV=sqrt(exp(sigma##2)-1);  This is independent of mu.

2) You partially vectorized your code, but you can still do better.  The following SAS/IML code sections are equivalent, but the second is much faster and easier to read/understand:

 

n = 15;
numSamples = 5000;
call randseed(12345);

/* the long way: Loop over number of samples */
v=j(numSamples,1,0);
do t=1 to numSamples;
   x1=j(n,1,0);/* allocate (n x 1) matrix*/
   call randgen(x1, "Normal", mu, sigma); /* fill it*/
   mean1=x1[:];/* calculate the mean of all elements of the matrix */
   sumsq1=x1[##];/* calculate the sum of squares of all elements of the matrix */
   var1=(1/(n-1))*(sumsq1-n*(mean1**2));
   v[t,1]=var1;
end;

/* the short way */
x = j(n, numSamples);   /* each column is sample */
call randgen(x, "normal", mu, sigma);  /* generate all samples */
v2 = var(x);            /* compute variance of each column */
timtimalo
Obsidian | Level 7

Thanks for your advice about the code that helps getting faster runs.

I computed coefficient of variation of lognormal distribution in terms of mu and sigma. Since the coefficient of variation of lognormal distribution depends on sigma-square only, mu is assumed to be one (without loss of generality). I simulated data from normal distribution since the statistics (FGlognor in the program) I used for constructing the CI for the coefficient of variation based on independent copies of sample variances that are calculating from the data generated from independent normal distribution.

The problem is that:

I generated data from normal distribution and used simple random sample and ranked set sampling methods to build CIs for the lognormal coefficient of variation. The coverage probability is the same for each sample size and different values assumed for sigma (as I mentioned in the previous messages). I checked the procedure I used and check SAS calculations for 10 simulations but I could not discover mistakes.

Are the results of coverage probabilities are correct in this case and in terms of the idea of the procedure I followed (which I explained in the previous message)?

Thank you.

Rick_SAS
SAS Super FREQ

 

You are computing FGlognor as sqrt(exp(Q) -1)  where

Q = (variance from a sample)/(variance from another sample) * (population variance)

Why is this the sample CV? Where did the formula come from?

 

I would have thought that the correct formula is 

CVEst = sqrt(exp( varEst ) - 1);

where varEst is the estimated variance of the simulated sample.

 

What am I missing?

timtimalo
Obsidian | Level 7

I used this estimate in previous work for unknown variance and it proved its correction. Assuming that this formula is correct, is there any mistakes in the constructed program that I sent in previous messages?  If it is O.K, can I accept the results of same coverage probabilities for each sample size if interval widths increase as the values of sigma increase? On the other hand, these coverages probabilities are different for different sample sizes (n=15, 50).

Rick_SAS
SAS Super FREQ

Notice that your estimate contains the population parameter. In practice, the population CV is unknown. Thus I am trying to find out where this formula came from and why you are not using something like sigma_hat / x_bar. If the formula is some sort of standardization, I'd like to see a reference so that I can understand what you are attempting.  What formula are you using for the CI?

 

It might be helpful to think about how you would do a simulation study for the coverage probability of the standard CI for the mean:

x_bar +/- t(1-alpha/2, n-1) * sigma_hat / sqrt(n)

1) You would choose a parameter mu, a distribution F, and a sample size. You would generate 5000 samples from that distribution.

2) For each sample, you compute the CI by applying the formula using the value of x_bar, sigma_hat, and n

3) The coverage probability is p = (number of samples for which the true parameter mu is inside the CI) / (total number of samples).

 

4) You would conclude that "when the true distribution is F, then the formula has coverage probability p."

 

Notice that the formula should give wide intervals for small samples and shrink towards zero width as n --> infinity.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 17 replies
  • 3590 views
  • 1 like
  • 4 in conversation