BookmarkSubscribeRSS Feed
Asockalypse
Calcite | Level 5
Hi,

I'm very new to SAS and don't have a basis in stats so struggling to get my head around a lot of the language in the various documentation, so I apologise if this is obvious or I'm not giving enough information. I'll describe the experiment I've done just so I'm not making incorrect assumptions.

I have 100 images which have been watermarked using 1 or 6 different methods, at 1 of 5 different levels (0.1,0.2,0.3,0.4,0.5bpp), so a total of 100*6*5 images. Each of these has then been analysed at 4 different thresholds to produce a value(%) indicating the similarity of the watermarked image to the original. So my dataset has 12000 observations and 5 variables: image(cat) watermarker(cat) bpp(interval) threshold(ordinal) PCM_value(interval)

I have found that the PCM_values are non-normally distributed for a given watermarker/bpp/threshold (actually in some cases, normal, in other cases very non-normal). As the same set image of images are watermarked for each set these are correlated. This correlation and non-normality is thwarting my attempts at understanding from other forum posts and blogs!

I was trying to perform regression using GENMOD with bpp as the IV, PCM as the DV, classes image watermarker and threshold, with repeated on image to account for the fact they are correlated, but I'm not convinced this is valid given non-normality.

Ultimately I'm looking to get values for regression coefficients that allow me to compare between watermarkers and thresholds i.e. does the choice of watermarker have a significant effect on the PCM values, similarly for threshold, as well as the general effect of bpp on PCM. But I'm really not sure how to go about this, whether I'm correct in using GENMOD, or should be using MIXED or if I'm way off base.

Will be grateful for any help anyone can give!

12 REPLIES 12
Paige
Quartz | Level 8
I guess I don't see a problem here. Regression does NOT require normally distributed data. Only the F-tests and t-tests have a distributional requirement, and the requirement is NOT normally distributed data. The requirement for F-tests and t-tests is that the errors are normally distributed. One way you can tell is by fitting the model and plotting the residuals and seeing if they appear to be approximately normally distributed.

Now, depending on your data, linear regression, or whatever you are getting from GENMOD, may not be the right fit. You may need a non-linear regression. I can't tell without seeing your data.
Dale
Pyrite | Level 9
From the information which you provide, I might guess that the response follows a beta distribution. The beta distribution is frequently employed when the response is given as a proportion (scaled on 0 to 1). The beta distribution can take on a wide variety of shapes. Take a look at the wikipedia entry on the beta distribution for examples of the variety of shapes that the beta distribution can model.

There are variations of the beta distribution which allow the response to be modeled on a range other than 0 to 1, but the only SAS procedure that I am aware of that fits a beta regression only has support for a response on the range from 0 to 1. Note that the beta density does not allow observations which have a 0 or 1 value. All values have to be between 0 and 1. If you have values which include 0 or 1, then you have to fudge the data a little bit by adding a small constant (say, 0.0005) and dividing by a value just slightly larger than the maximum (say, 1.001 if you added a constant to eliminate 0 values).

The GLIMMIX procedure (in SAS version 9.2) can be used to fit a regression model where the response has a beta distribution. The GLIMMIX procedure can also deal with issues surrounding correlated response values. The GLIMMIX procedure is the only SAS procedure that I know of with support both for the beta distribution and which has capabilities for dealing with correlated data.

You indicate that you don't have a solid stats background. I would suggest finding a good statistician to work with who can review your data and code to make sure that you are modeling the data appropriately. The following code should get you started:

proc glimmix data=mydata;
  class image watermarker threshold;
  model PCV_value = watermarker bpp threshold / dist=beta s;
  random intercept / subject=image;
run;

The above code assumes that all effects are additive. You might need to include interactions between some of the predictor variables.

HTH
Asockalypse
Calcite | Level 5
@Paige
Thanks, I see where I've misread that, previously I was just looking at statistical comparisons that did require normality and I see I've got my wires crossed. I ran the UNIVARIATE normality tests on the residuals that I obtained using GENMOD and all come out as being significantly non-normal, so presumably my first guess as you say is that linear regression is not the right fit? I'm not sure if I'd be overstepping my bounds but is there anyway to show you my data?

@Dale
Thanks for your reply. As the values are all 0 to 1 (though many points that are actually 1) and looking at the beta distribution it seems that you could well be right as many of the example shapes are similar to various patterns i've seen for given watermarker/bpp/threshold cominations. Unfortunately I'm using SAS at the moment by way of a friend of a friend and only have 9.0 which does not appear to have the GLIMMIX procedure, so I think I could be out of luck. I'll either have to find another prog that can model beta distributions or hope that it doesn't follow a beta and hit on what it does follow!;) Thanks for your help and the code though, i'll give it a go if I get the chance.
Dale
Pyrite | Level 9
About 6 months ago, I was working on a macro to fit beta regression models. I can offer up a macro that uses the NLMIXED procedure to maximize the likelihood of the beta distribution and which is, in some respects, better than what the GLIMMIX procedure has to offer. My macro can fit a 4-parameter beta regression model. In the 4-parameter model, the upper and lower bounds of the distribution can differ from 0 and 1, respectively. You can either specify upper and lower bounds that you believe would be more appropriate than 0 and 1, or you can allow the procedure to estimate upper and lower bounds.

The macro does also allow for random subject effects. Currently, only a random intercept model is allowed for the random effects. But by turning on a feature of the macro that shows the NLMIXED code which is executed, you can with minimal effort modify the code to fit more difficult random effect models.

If you would like to try this macro, post your e-mail address back to the forums and I can send it to you.
Asockalypse
Calcite | Level 5
That sounds awesome, thanks very much. My email is andrewlockphd at g mail dot com.

Much appreciated.
Susan
Calcite | Level 5
Dale, if your offer is open to people other than the OP, I'm interested in trying your macro. GLIMMIX is quirky with the beta; it would be swell if NLMIXED was better behaved, and the 4-parameter option is appealing.

Thank you,
Susan
susan.durham@usu.edu
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
I don't have a macro, but here is some code for the beta. The first program works with the mu/phi parameterization of the beta. It assumes the proportion data is in a variable x. The parameter estimates of mu and phi (if ep=0 and lam=1) are equivalent to the intercept and scale parameters obtained from GLIMMIX.
MY CODE GOT MESSED UP WHEN PASTING. DON'T USE THIS CODE. SEE MY SUBSEQUENT
MESSAGES...
proc nlmixed data=sbdata ;
title2 'MLE of beta (NLMIXED), MU version, epsilon=0, lambda=1 ';
parms mu = .3, phi = 15;
ep=0; lam=1 ; *<--ep is the lower bound parameter, and lam is upper bound param.;
bounds mu > 0; bounds phi > 0;

be=gamma(((mu-ep)/lam)*phi)*gamma(((lam+ep-mu)/lam)*phi)/gamma(phi);
be_dens=( ((x-ep)**(((mu-ep)/lam)*phi -1)) * ((lam+ep-x)**(((lam+ep-mu)/lam)*phi -1) )) /
(be*lam**(phi-1));
loglik=log(be_dens);

model x~general(loglik);

estimate 'alpha' (mu-ep)*phi/lam; *<--to get the conventional alpha/beta parameters;
estimate 'beta'(lam+ep-mu)*phi/lam;
estimate 'logit' log( (mu-ep)/(lam+ep - mu) ); *--Estimated logit assuming general ep & lam;
estimate 'variance' lam*(mu-ep)*(lam+ep-mu)/(lam*(1+phi));
estimate 'median' betainv(0.5,(mu-ep)*phi/lam, (lam+ep-mu)*phi/lam)*lam + ep;
run;

*--Use of GLIMMIX to fit. To have epsilon=0 and lambda=1, must use x variable (x);
proc glimmix data=sbdata;
title2 'Beta, using pseudo-likelihood(GLIMMIX), epsilon=0 & lambda=1, IDentity link';
model x = / dist=beta link=id solution;
run;

Now for a logit link and two predictor variables (x1 and x2).

proc nlmixed data=two ;
title2 'MLE of Beta, using NLMIXED, done in terms of logit link';
title3 'Used epsilon=0 and lambda=1. This is a regression-type problem';
parms a=-2.1 b1=.011 b2 = .1
phi = 2 to 60 by 2;
ep=0.0; lam=1.0 ; *<--For these data, use epsilon=0 and lambda=1;
bounds phi > 0;

m = a + b1*x1 + b2*x2; *<--This is logit;
mu=ep + lam*(1/(1+exp(-m))); *<--Inverse link to get expected (mu);

lbe=lgamma(((mu-ep)/lam)*phi) + lgamma(((lam+ep-mu)/lam)*phi) - lgamma(phi);
lbed1 = (((mu-ep)/lam)*phi -1) * log(x-ep) ;
lbed2 = (((lam+ep-mu)/lam)*phi -1) * log(lam+ep-x);
lbed3= (phi-1)*log(lam) + lbe;
*be_dens= ( ((x-ep)**(((mu-ep)/lam)*phi -1)) * ((lam+ep-x)**(((lam+ep-mu)/lam)*phi -1) )) / (be*lam**(phi-1));
loglik=lbed1 + lbed2 - lbed3;

model x~general(loglik);

predict m out=outp; *<--Put logits in a file;
run;

proc glimmix data=two;
title2 'GLMM fitting of beta to Griffiths data (GLIMMIX), logits';
model x = x1 x2 / dist=beta link=logit solution ; *<--Note the logit link here;
run;

MY CODE GOT MESSED UP WHEN PASTED. SEE MY SUBSEQUENT MESSAGES FOR PROPER CODE.


Message was edited by: lvm

Message was edited by: lvm Message was edited by: lvm
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
My code got cut off. I will try again.

proc nlmixed data=sbdata ;
title2 'MLE of beta (NLMIXED), MU version, epsilon=0, lambda=1 ';
parms mu = .3, phi = 15;
ep=0; lam=1 ; *<--upper and lower bounds;
bounds mu > 0; bounds phi > 0;

be=gamma(((mu-ep)/lam)*phi)*gamma(((lam+ep-mu)/lam)*phi)/gamma(phi);
be_dens=( ((x-ep)**(((mu-ep)/lam)*phi -1)) * ((lam+ep-x)**(((lam+ep-mu)/lam)*phi -1) )) /
(be*lam**(phi-1));
loglik=log(be_dens);

model x~general(loglik);

estimate 'alpha' (mu-ep)*phi/lam;
estimate 'beta'(lam+ep-mu)*phi/lam;
estimate 'logit' log( (mu-ep)/(lam+ep - mu) ); *--Estimated logit assuming general ep & lam;
estimate 'variance' lam*(mu-ep)*(lam+ep-mu)/(lam*(1+phi));
estimate 'median' betainv(0.5,(mu-ep)*phi/lam, (lam+ep-mu)*phi/lam)*lam + ep;

run;

*--Use of GLIMMIX to fit. To have epsilon=0 and lambda=1, must use x variable (x);
proc glimmix data=sbdata;
title2 'Beta, using pseudo-likelihood(GLIMMIX), epsilon=0 & lambda=1, IDentity link';
model x = / dist=beta link=id solution;
run;
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
I keep trying to paste a longer example of the beta and NLMIXED, but only the first third or so shows up on the preview (or posting). Not sure why I can't paste the entire example. I want to show the code for a logit link and two predictor variables, with NLMIXED and GLIMMIX.
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
Here is the code for the beta, with a logit link, and two predictors. x is continuous (0-1) response. Because I set the lower (ep) and upper (lam) to 0 and 1, respectively, I can do the same analysis in GLIMMIX. But with NLMIXED, one can choose other values for ep and lam.
proc nlmixed data=two ;
title2 'MLE of Beta, using NLMIXED, done in terms of logit link';
title3 'Used epsilon=0 and lambda=1. This is a regression-type problem';
parms a=-2.1 b1=.011 b2 = .1
phi = 2 to 60 by 2;
ep=0.0; lam=1.0 ;
bounds phi > 0;

m = a + b1*x1 + b2*x2; *-logit link scale;
mu=ep + lam*(1/(1+exp(-m))); *-inverse link;
lbe=lgamma(((mu-ep)/lam)*phi) + lgamma(((lam+ep-mu)/lam)*phi) - lgamma(phi);
lbed1 = (((mu-ep)/lam)*phi -1) * log(x-ep) ;
lbed2 = (((lam+ep-mu)/lam)*phi -1) * log(lam+ep-x);
lbed3= (phi-1)*log(lam) + lbe;
loglik=lbed1 + lbed2 - lbed3;

model x~general(loglik);
predict m out=outp;
run;

proc glimmix data=two;
title2 'GLMM fitting of beta to Griffiths data (GLIMMIX), logits';
model x = x1 x2 / dist=beta link=logit solution ;
run;
lvm
Rhodochrosite | Level 12 lvm
Rhodochrosite | Level 12
I realize my examples do not have a random effect. But this can be added, by using a random statement and adding a random term to the right-hand side of the equation for m. This would be a random intercept model.
Bhupinder
Calcite | Level 5
Hi Dale,

If your are still offering the macro to use then could I also get it to try for my use? My email address is bhupi80singh@yahoo.co.in

Thanks
Bhupinder

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 2496 views
  • 0 likes
  • 6 in conversation