Dear SAS Community,
I would greatly appreciate it if you could give me some feedback on the best approach to analyze data skewed to the left.
Context: I would like to know if there are significant differences in the dependent continuous variable firmness (AvgFirm) between two avocado varieties (BL516 vs Hass) after 0, 1, 3, and 6 weeks of storage for each harvest month as you can see it in these graphs. I made these graphs using the mean but I guess that is not a good idea.
When I plot the data using the following code I get the following distributions for each week, where you can see that for week 6 the data is skewed to the left.
proc univariate data=one normal plot;
where Variety in('BL516', 'Hass');
var AvgFirm;
histogram AvgFirm/normal (color=red);
class Wks;
run;
quit;
Parameters for Normal Distribution | ||
---|---|---|
Parameter | Symbol | Estimate |
Mean | Mu | 9.49237 |
Std Dev | Sigma | 9.399897 |
Goodness-of-Fit Tests for Normal Distribution | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Kolmogorov-Smirnov | D | 0.2351068 | Pr > D | <0.010 |
Cramer-von Mises | W-Sq | 3.3803286 | Pr > W-Sq | <0.005 |
Anderson-Darling | A-Sq | 17.9411189 | Pr > A-Sq | <0.005 |
Quantiles for Normal Distribution | ||
---|---|---|
Percent | Quantile | |
Observed | Estimated | |
1.0 | 1.90000 | -12.37506 |
5.0 | 2.60000 | -5.96908 |
10.0 | 3.03000 | -2.55408 |
25.0 | 4.20000 | 3.15224 |
50.0 | 6.23000 | 9.49237 |
75.0 | 10.10000 | 15.83250 |
90.0 | 22.75000 | 21.53882 |
95.0 | 28.00000 | 24.95382 |
99.0 | 51.20000 | 31.35980 |
For this reason I would analyze the data for each week separately using genmod. What dist= and link= function do you think it would be a good approach to model left skewed data effectively?
proc genmod data=one;
where Wks=6 ;
class Variety Harvest;
model AvgFirm=Variety|Harvest/type3 dist= link=;
slice Variety*Harvest/sliceby=Harvest/ adjust=simulate(seed=1);
run;
I would greatly appreciate it your help!
Thanks
Without having the data and clarity on all of the variables involved in the study (season is new), it's hard to say. But you might be able to obtain estimable results by fitting the model in PROC GLM and using the E option in the MODEL statement to see the general form of the estimable functions. With that info, you should be able to write ESTIMATE or CONTRAST statements that are estimable. It might help to treat Week as a continuous variable by not including it in the CLASS statement.
Your model defines the subpopulations as combinations of Variety and Harvest, but it seems that the histograms you are using to evaluate normality doesn't look at the distributions of the individual subpopulations. It appears that each histogram uses the data across Varieties and Harvests, so the skewness you see could be caused by shifts among the distributions of the individual subpopulations even if all of those individual distributions are quite symmetrical. So, I think you need to look at the distributions that your model defines before you decide if you have skewed data.
Of course, if you expect that your response is approximately normal, you could remove the subpopulation shifts by fitting the model and plotting the residuals to see if they are approximately normal. You can get that in the diagnostic plots from PROC GLM.
proc glm plots=diagnostics;
class harvest variety ;
model avgfirm = harvest|variety;
quit;
Thank you very much for your reply StatDave.
I got histograms for each Variety (BL516 on the left and Hass on the right) for week 3 and week 6:
Moments | |||
---|---|---|---|
N | 67 | Sum Weights | 67 |
Mean | 9.68910448 | Sum Observations | 649.17 |
Std Deviation | 7.74860002 | Variance | 60.0408022 |
Skewness | 2.95974499 | Kurtosis | 12.0875772 |
Uncorrected SS | 10252.5689 | Corrected SS | 3962.69295 |
Coeff Variation | 79.9723033 | Std Error Mean | 0.94664216 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | 9.689104 | Std Deviation | 7.74860 |
Median | 7.160000 | Variance | 60.04080 |
Mode | 6.000000 | Range | 48.86000 |
Interquartile Range | 5.43000 |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | 10.23523 | Pr > |t| | <.0001 |
Sign | M | 33.5 | Pr >= |M| | <.0001 |
Signed Rank | S | 1139 | Pr >= |S| | <.0001 |
Tests for Normality | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Shapiro-Wilk | W | 0.704547 | Pr < W | <0.0001 |
Kolmogorov-Smirnov | D | 0.220242 | Pr > D | <0.0100 |
Cramer-von Mises | W-Sq | 1.035726 | Pr > W-Sq | <0.0050 |
Anderson-Darling | A-Sq | 5.48558 | Pr > A-Sq | <0.0050 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 51.20 |
99% | 51.20 |
95% | 24.50 |
90% | 18.70 |
75% Q3 | 10.84 |
50% Median | 7.16 |
25% Q1 | 5.41 |
10% | 3.63 |
5% | 3.05 |
1% | 2.34 |
0% Min | 2.34 |
Moments | |||
---|---|---|---|
N | 106 | Sum Weights | 106 |
Mean | 9.36801887 | Sum Observations | 993.01 |
Std Deviation | 10.3420827 | Variance | 106.958675 |
Skewness | 2.4219283 | Kurtosis | 5.72092607 |
Uncorrected SS | 20533.1973 | Corrected SS | 11230.6609 |
Coeff Variation | 110.397757 | Std Error Mean | 1.00451187 |
Basic Statistical Measures | |||
---|---|---|---|
Location | Variability | ||
Mean | 9.368019 | Std Deviation | 10.34208 |
Median | 5.165000 | Variance | 106.95868 |
Mode | 3.030000 | Range | 51.35000 |
Interquartile Range | 5.97000 |
Note: The mode displayed is the smallest of 7 modes with a count of 2. |
Tests for Location: Mu0=0 | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Student's t | t | 9.325941 | Pr > |t| | <.0001 |
Sign | M | 53 | Pr >= |M| | <.0001 |
Signed Rank | S | 2835.5 | Pr >= |S| | <.0001 |
Tests for Normality | ||||
---|---|---|---|---|
Test | Statistic | p Value | ||
Shapiro-Wilk | W | 0.656111 | Pr < W | <0.0001 |
Kolmogorov-Smirnov | D | 0.258368 | Pr > D | <0.0100 |
Cramer-von Mises | W-Sq | 2.534216 | Pr > W-Sq | <0.0050 |
Anderson-Darling | A-Sq | 13.24827 | Pr > A-Sq | <0.0050 |
Quantiles (Definition 5) | |
---|---|
Level | Quantile |
100% Max | 52.600 |
99% | 47.550 |
95% | 35.950 |
90% | 24.850 |
75% Q3 | 9.650 |
50% Median | 5.165 |
25% Q1 | 3.680 |
10% | 2.860 |
5% | 2.460 |
1% | 1.900 |
0% Min | 1.250 |
When trying the diagnostics for BL516 and Hass
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 16 | 5603.8279 | 350.2392 | 1.53 | 0.0819 |
Error | 1069 | 244698.1694 | 228.9038 | ||
Corrected Total | 1085 | 250301.9973 |
R-Square | Coeff Var | Root MSE | AvgFirm Mean |
---|---|---|---|
0.022388 | 37.56906 | 15.12957 | 40.27135 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 9 | 3235.319563 | 359.479951 | 1.57 | 0.1193 |
Variety | 1 | 428.484524 | 428.484524 | 1.87 | 0.1715 |
Harvest*Variety | 6 | 1940.023838 | 323.337306 | 1.41 | 0.2065 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 9 | 3118.480398 | 346.497822 | 1.51 | 0.1380 |
Variety | 1 | 347.962624 | 347.962624 | 1.52 | 0.2179 |
Harvest*Variety | 6 | 1940.023838 | 323.337306 | 1.41 | 0.2065 |
The residuals need to be (approximately) normally distributed, not the raw data. The raw data can have any distribution, and if the residuals are normally distributed, then GLM is appropriate.
Thank you very much for your clarification Paige, I will keep that in mind for the future!
You've said that the data were collected at combinations of three variables - Variety, Harvest, and Week. So, a better evaluation would be to include Week in the GLM analysis to see the diagnostics:
proc glm plots=diagnostics;
class harvest variety week;
model avgfirm = harvest|variety|week;
quit;
Thank you again StatDave for your help on this. Ok, so I run the diagnostics for the two avocado varieties separately. For BL516 I got this:
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 27 | 73180.93565 | 2710.40502 | 129.12 | <.0001 |
Error | 399 | 8375.23275 | 20.99056 | ||
Corrected Total | 426 | 81556.16840 |
R-Square | Coeff Var | Root MSE | AvgFirm Mean |
---|---|---|---|
0.897307 | 11.64900 | 4.581545 | 39.32995 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 6 | 1791.94777 | 298.65796 | 14.23 | <.0001 |
Variety | 0 | 0.00000 | . | . | . |
Harvest*Variety | 0 | 0.00000 | . | . | . |
Wks | 3 | 69881.08555 | 23293.69518 | 1109.72 | <.0001 |
Harvest*Wks | 18 | 1507.90233 | 83.77235 | 3.99 | <.0001 |
Variety*Wks | 0 | 0.00000 | . | . | . |
Harvest*Variety*Wks | 0 | 0.00000 | . | . | . |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 6 | 1128.53650 | 188.08942 | 8.96 | <.0001 |
Variety | 0 | 0.00000 | . | . | . |
Harvest*Variety | 0 | 0.00000 | . | . | . |
Wks | 3 | 60694.23347 | 20231.41116 | 963.83 | <.0001 |
Harvest*Wks | 18 | 1507.90233 | 83.77235 | 3.99 | <.0001 |
Variety*Wks | 0 | 0.00000 | . | . | . |
Harvest*Variety*Wks | 0 | 0.00000 | . | . | . |
For Hass:
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 39 | 142654.5078 | 3657.8079 | 88.90 | <.0001 |
Error | 619 | 25467.6996 | 41.1433 | ||
Corrected Total | 658 | 168122.2074 |
R-Square | Coeff Var | Root MSE | AvgFirm Mean |
---|---|---|---|
0.848517 | 15.69005 | 6.414304 | 40.88134 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 9 | 3188.2587 | 354.2510 | 8.61 | <.0001 |
Variety | 0 | 0.0000 | . | . | . |
Harvest*Variety | 0 | 0.0000 | . | . | . |
Wks | 3 | 129725.2727 | 43241.7576 | 1051.00 | <.0001 |
Harvest*Wks | 27 | 9740.9764 | 360.7769 | 8.77 | <.0001 |
Variety*Wks | 0 | 0.0000 | . | . | . |
Harvest*Variety*Wks | 0 | 0.0000 | . | . | . |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 9 | 1509.1338 | 167.6815 | 4.08 | <.0001 |
Variety | 0 | 0.0000 | . | . | . |
Harvest*Variety | 0 | 0.0000 | . | . | . |
Wks | 3 | 107736.0103 | 35912.0034 | 872.85 | <.0001 |
Harvest*Wks | 27 | 9740.9764 | 360.7769 | 8.77 | <.0001 |
Variety*Wks | 0 | 0.0000 | . | . | . |
Harvest*Variety*Wks | 0 | 0.0000 | . | . | . |
Oh ok, thank you for the clarification. These are the results when including both varieties at once:
Class Level Information | ||
---|---|---|
Class | Levels | Values |
Harvest | 10 | 1 2 3 4 5 6 8 10 11 12 |
Variety | 2 | BL516 Hass |
Wks | 4 | 0 1 3 6 |
Number of Observations Read | 1185 |
---|---|
Number of Observations Used | 1086 |
Postharvest all years-Firmness_month |
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 67 | 216459.0649 | 3230.7323 | 97.18 | <.0001 |
Error | 1018 | 33842.9324 | 33.2445 | ||
Corrected Total | 1085 | 250301.9973 |
R-Square | Coeff Var | Root MSE | AvgFirm Mean |
---|---|---|---|
0.864792 | 14.31739 | 5.765807 | 40.27135 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 9 | 3235.3196 | 359.4800 | 10.81 | <.0001 |
Variety | 1 | 428.4845 | 428.4845 | 12.89 | 0.0003 |
Harvest*Variety | 6 | 1940.0238 | 323.3373 | 9.73 | <.0001 |
Wks | 3 | 198566.1555 | 66188.7185 | 1990.97 | <.0001 |
Harvest*Wks | 27 | 7589.2734 | 281.0842 | 8.46 | <.0001 |
Variety*Wks | 3 | 1808.7425 | 602.9142 | 18.14 | <.0001 |
Harvest*Variety*Wks | 18 | 2891.0656 | 160.6148 | 4.83 | <.0001 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Harvest | 9 | 1697.6082 | 188.6231 | 5.67 | <.0001 |
Variety | 1 | 37.4341 | 37.4341 | 1.13 | 0.2889 |
Harvest*Variety | 6 | 1006.8005 | 167.8001 | 5.05 | <.0001 |
Wks | 3 | 133392.5502 | 44464.1834 | 1337.49 | <.0001 |
Harvest*Wks | 27 | 7703.4927 | 285.3145 | 8.58 | <.0001 |
Variety*Wks | 3 | 2210.6355 | 736.8785 | 22.17 | <.0001 |
Harvest*Variety*Wks | 18 | 2891.0656 | 160.6148 | 4.83 | <.0001 |
It looks like the histogram for the residuals is normally distributed, so I guess I can use the following model to analyze this data, what do you think?
proc genmod data=one;
class Variety Harvest Wks;
model AvgFirm=Variety|Harvest|Wks/type3 dist=normal link=identity;
slice Variety*Harvest*Wks/sliceby=Harvest*Wks / adjust=simulate(seed=1);
run;
Thank you so much for your comments. The missing values are probably due to the fact that the variety BL516 was not harvested at every harvest month like Hass.
If I include Season, Harvest, and Variety in my model I get what it seems to me like a bimodal distribution of residuals
proc glm data=one plots=diagnostics;
class Season Harvest Variety;
model AvgFirm = Season|Harvest|Variety;
quit;
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 144 | 111514.4612 | 774.4060 | 3.24 | <.0001 |
Error | 3374 | 805328.1297 | 238.6865 | ||
Corrected Total | 3518 | 916842.5909 |
R-Square | Coeff Var | Root MSE | AvgFirm Mean |
---|---|---|---|
0.121629 | 38.92753 | 15.44948 | 39.68780 |
Source | DF | Type I SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Season | 3 | 886.56917 | 295.52306 | 1.24 | 0.2942 |
Harvest | 9 | 22270.81680 | 2474.53520 | 10.37 | <.0001 |
Season*Harvest | 11 | 7399.62936 | 672.69358 | 2.82 | 0.0011 |
Variety | 15 | 53362.39688 | 3557.49313 | 14.90 | <.0001 |
Season*Variety | 21 | 11133.99185 | 530.19009 | 2.22 | 0.0011 |
Harvest*Variety | 57 | 11054.28780 | 193.93487 | 0.81 | 0.8417 |
Season*Harves*Variet | 28 | 5406.76936 | 193.09891 | 0.81 | 0.7497 |
Source | DF | Type III SS | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Season | 3 | 4648.63915 | 1549.54638 | 6.49 | 0.0002 |
Harvest | 9 | 20269.82403 | 2252.20267 | 9.44 | <.0001 |
Season*Harvest | 11 | 7779.96955 | 707.26996 | 2.96 | 0.0006 |
Variety | 15 | 48976.59968 | 3265.10665 | 13.68 | <.0001 |
Season*Variety | 20 | 11534.33573 | 576.71679 | 2.42 | 0.0004 |
Harvest*Variety | 57 | 12750.65775 | 223.69575 | 0.94 | 0.6097 |
Season*Harves*Variet | 28 | 5406.76936 | 193.09891 | 0.81 | 0.7497 |
Which dist= and link= option would be more appropriate in this case if using genmod?
I left out weeks because I wanted to compare varieties within Harvest for each season across week like in this graph
title2 'AvgFirm 2021: Comparing varieties within Harvest across Wks';
proc genmod data=one;
where Season=2021;
class Harvest Variety;
model AvgFirm=Harvest|Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest / adjust=simulate(seed=1);
run;
But you are right, if I include weeks then I get a normal distribution
You don't have to leave variables out of the model to produce a plot like that. It looks like you need Week to be in the model to get approximately normal residuals - hopefully still true with the addition of Season. (but the diagnostics still present questions about some outliers and high influence points that might need addressing). You can store the fitted model and then use the EFFECTPLOT statement in PROC PLM to produce a plot like you want. Treating Week as continuous and producing the plots at the mean of Weeks:
proc glm plots(only)=diagnostics;
class variety harvest season;
model avgfirm = variety|harvest|season|week;
store mod;
quit;
proc plm source=mod;
effectplot interaction(x=harvest sliceby=variety plotby=season)/nrows=1 ncols=4;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.