BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
palolix
Pyrite | Level 9

Dear SAS Community,

 

I would greatly appreciate it if you could give me some feedback on the best approach to analyze data skewed to the left.

Context: I would like to know if there are significant differences in the dependent continuous variable firmness (AvgFirm) between two avocado varieties (BL516 vs Hass) after 0, 1, 3, and 6 weeks of storage for each harvest month as you can see it in these graphs. I made these graphs using  the mean but I guess that is not a good idea.

 

palolix_0-1759347741263.png

 When I plot the data using the following code I get the following distributions for each week, where you can see that for week 6 the data is skewed to the left.

 

proc univariate data=one normal plot;
where Variety in('BL516', 'Hass');
var AvgFirm;
histogram AvgFirm/normal (color=red);
class Wks;
run;
quit;

 

The UNIVARIATE Procedure
Wks = 6
Fitted Normal Distribution for AvgFirm

 

Parameters for Normal Distribution
Parameter Symbol Estimate
Mean Mu 9.49237
Std Dev Sigma 9.399897

Goodness-of-Fit Tests for Normal Distribution
Test Statistic p Value
Kolmogorov-Smirnov D 0.2351068 Pr > D <0.010
Cramer-von Mises W-Sq 3.3803286 Pr > W-Sq <0.005
Anderson-Darling A-Sq 17.9411189 Pr > A-Sq <0.005

Quantiles for Normal Distribution
Percent Quantile
Observed Estimated
1.0 1.90000 -12.37506
5.0 2.60000 -5.96908
10.0 3.03000 -2.55408
25.0 4.20000 3.15224
50.0 6.23000 9.49237
75.0 10.10000 15.83250
90.0 22.75000 21.53882
95.0 28.00000 24.95382
99.0 51.20000 31.35980

 

palolix_1-1759348461421.png

palolix_3-1759348499566.png

For this reason I would analyze the data for each week separately using genmod. What dist= and link= function do you think it would be a good approach to model left skewed data effectively?

 

proc genmod data=one;
where Wks=6 ;
class Variety Harvest;
model AvgFirm=Variety|Harvest/type3 dist=  link=;
slice Variety*Harvest/sliceby=Harvest/ adjust=simulate(seed=1);
run;

 

I would greatly appreciate it your help!

 

Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
StatDave
SAS Super FREQ

Without having the data and clarity on all of the variables involved in the study (season is new), it's hard to say. But you might be able to obtain estimable results by fitting the model in PROC GLM and using the E option in the MODEL statement to see the general form of the estimable functions. With that info, you should be able to write ESTIMATE or CONTRAST statements that are estimable. It might help to treat Week as a continuous variable by not including it in the CLASS statement. 

View solution in original post

19 REPLIES 19
StatDave
SAS Super FREQ

Your model defines the subpopulations as combinations of Variety and Harvest, but it seems that the histograms you are using to evaluate normality doesn't look at the distributions of the individual subpopulations. It appears that each histogram uses the data across Varieties and Harvests, so the skewness you see could be caused by shifts among the distributions of the individual subpopulations even if all of those individual distributions are quite symmetrical. So, I think you need to look at the distributions that your model defines before you decide if you have skewed data. 

 

Of course, if you expect that your response is approximately normal, you could remove the subpopulation shifts by fitting the model and plotting the residuals to see if they are approximately normal. You can get that in the diagnostic plots from PROC GLM. 

 

proc glm plots=diagnostics; 
   class harvest variety ;
   model avgfirm = harvest|variety;
quit;
palolix
Pyrite | Level 9

Thank you very much for your reply StatDave.

I got histograms for each Variety (BL516 on the left and Hass on the right) for week 3 and week 6:

palolix_0-1759355748467.pngpalolix_1-1759355966548.png

The UNIVARIATE Procedure for 'BL516'
Variable: AvgFirm
Wks = 6

 

Moments
N 67 Sum Weights 67
Mean 9.68910448 Sum Observations 649.17
Std Deviation 7.74860002 Variance 60.0408022
Skewness 2.95974499 Kurtosis 12.0875772
Uncorrected SS 10252.5689 Corrected SS 3962.69295
Coeff Variation 79.9723033 Std Error Mean 0.94664216

Basic Statistical Measures
Location Variability
Mean 9.689104 Std Deviation 7.74860
Median 7.160000 Variance 60.04080
Mode 6.000000 Range 48.86000
    Interquartile Range 5.43000

Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 10.23523 Pr > |t| <.0001
Sign M 33.5 Pr >= |M| <.0001
Signed Rank S 1139 Pr >= |S| <.0001

Tests for Normality
Test Statistic p Value
Shapiro-Wilk W 0.704547 Pr < W <0.0001
Kolmogorov-Smirnov D 0.220242 Pr > D <0.0100
Cramer-von Mises W-Sq 1.035726 Pr > W-Sq <0.0050
Anderson-Darling A-Sq 5.48558 Pr > A-Sq <0.0050

Quantiles (Definition 5)
Level Quantile
100% Max 51.20
99% 51.20
95% 24.50
90% 18.70
75% Q3 10.84
50% Median 7.16
25% Q1 5.41
10% 3.63
5% 3.05
1% 2.34
0% Min 2.34

 

The UNIVARIATE Procedure for 'Hass'
Variable: AvgFirm
Wks = 6

 

Moments
N 106 Sum Weights 106
Mean 9.36801887 Sum Observations 993.01
Std Deviation 10.3420827 Variance 106.958675
Skewness 2.4219283 Kurtosis 5.72092607
Uncorrected SS 20533.1973 Corrected SS 11230.6609
Coeff Variation 110.397757 Std Error Mean 1.00451187

Basic Statistical Measures
Location Variability
Mean 9.368019 Std Deviation 10.34208
Median 5.165000 Variance 106.95868
Mode 3.030000 Range 51.35000
    Interquartile Range 5.97000

 

Note: The mode displayed is the smallest of 7 modes with a count of 2.

 


Tests for Location: Mu0=0
Test Statistic p Value
Student's t t 9.325941 Pr > |t| <.0001
Sign M 53 Pr >= |M| <.0001
Signed Rank S 2835.5 Pr >= |S| <.0001

Tests for Normality
Test Statistic p Value
Shapiro-Wilk W 0.656111 Pr < W <0.0001
Kolmogorov-Smirnov D 0.258368 Pr > D <0.0100
Cramer-von Mises W-Sq 2.534216 Pr > W-Sq <0.0050
Anderson-Darling A-Sq 13.24827 Pr > A-Sq <0.0050

Quantiles (Definition 5)
Level Quantile
100% Max 52.600
99% 47.550
95% 35.950
90% 24.850
75% Q3 9.650
50% Median 5.165
25% Q1 3.680
10% 2.860
5% 2.460
1% 1.900
0% Min 1.250

 

When trying the diagnostics for BL516 and Hass

The GLM Procedure
 
Dependent Variable: AvgFirm

 

Source DF Sum of Squares Mean Square F Value Pr > F
Model 16 5603.8279 350.2392 1.53 0.0819
Error 1069 244698.1694 228.9038    
Corrected Total 1085 250301.9973      

R-Square Coeff Var Root MSE AvgFirm Mean
0.022388 37.56906 15.12957 40.27135

Source DF Type I SS Mean Square F Value Pr > F
Harvest 9 3235.319563 359.479951 1.57 0.1193
Variety 1 428.484524 428.484524 1.87 0.1715
Harvest*Variety 6 1940.023838 323.337306 1.41 0.2065

Source DF Type III SS Mean Square F Value Pr > F
Harvest 9 3118.480398 346.497822 1.51 0.1380
Variety 1 347.962624 347.962624 1.52 0.2179
Harvest*Variety 6 1940.023838 323.337306 1.41 0.2065

palolix_8-1759357469991.png

 

 
palolix_9-1759357485550.png

 

 


 

 

PaigeMiller
Diamond | Level 26

The residuals need to be (approximately) normally distributed, not the raw data. The raw data can have any distribution, and if the residuals are normally distributed, then GLM is appropriate.

 

--
Paige Miller
palolix
Pyrite | Level 9

Thank you very much for your clarification Paige, I will keep that in mind for the future!

StatDave
SAS Super FREQ

You've said that the data were collected at combinations of three variables - Variety, Harvest, and Week. So, a better evaluation would be to include Week in the GLM analysis to see the diagnostics:

proc glm plots=diagnostics; 
   class harvest variety week;
   model avgfirm = harvest|variety|week;
quit;
palolix
Pyrite | Level 9

Thank you again StatDave for your help on this. Ok, so I run the diagnostics for the two avocado varieties separately. For BL516 I got this:

Dependent Variable: AvgFirm

 

Source DF Sum of Squares Mean Square F Value Pr > F
Model 27 73180.93565 2710.40502 129.12 <.0001
Error 399 8375.23275 20.99056    
Corrected Total 426 81556.16840      

R-Square Coeff Var Root MSE AvgFirm Mean
0.897307 11.64900 4.581545 39.32995

Source DF Type I SS Mean Square F Value Pr > F
Harvest 6 1791.94777 298.65796 14.23 <.0001
Variety 0 0.00000 . . .
Harvest*Variety 0 0.00000 . . .
Wks 3 69881.08555 23293.69518 1109.72 <.0001
Harvest*Wks 18 1507.90233 83.77235 3.99 <.0001
Variety*Wks 0 0.00000 . . .
Harvest*Variety*Wks 0 0.00000 . . .

Source DF Type III SS Mean Square F Value Pr > F
Harvest 6 1128.53650 188.08942 8.96 <.0001
Variety 0 0.00000 . . .
Harvest*Variety 0 0.00000 . . .
Wks 3 60694.23347 20231.41116 963.83 <.0001
Harvest*Wks 18 1507.90233 83.77235 3.99 <.0001
Variety*Wks 0 0.00000 . . .
Harvest*Variety*Wks 0 0.00000 . . .
 
 
palolix_3-1759441445991.png

 For Hass:

 

 

The GLM Procedure
 
Dependent Variable: AvgFirm

 

Source DF Sum of Squares Mean Square F Value Pr > F
Model 39 142654.5078 3657.8079 88.90 <.0001
Error 619 25467.6996 41.1433    
Corrected Total 658 168122.2074      

R-Square Coeff Var Root MSE AvgFirm Mean
0.848517 15.69005 6.414304 40.88134

Source DF Type I SS Mean Square F Value Pr > F
Harvest 9 3188.2587 354.2510 8.61 <.0001
Variety 0 0.0000 . . .
Harvest*Variety 0 0.0000 . . .
Wks 3 129725.2727 43241.7576 1051.00 <.0001
Harvest*Wks 27 9740.9764 360.7769 8.77 <.0001
Variety*Wks 0 0.0000 . . .
Harvest*Variety*Wks 0 0.0000 . . .

Source DF Type III SS Mean Square F Value Pr > F
Harvest 9 1509.1338 167.6815 4.08 <.0001
Variety 0 0.0000 . . .
Harvest*Variety 0 0.0000 . . .
Wks 3 107736.0103 35912.0034 872.85 <.0001
Harvest*Wks 27 9740.9764 360.7769 8.77 <.0001
Variety*Wks 0 0.0000 . . .
Harvest*Variety*Wks 0 0.0000 . . .

 

 
palolix_18-1759441611993.png

 

 

 

 

 

 
 
 

 

 

 

 
 
 
 

 

 

 

 
 

 
 

 

 
 

 

 
 

 

StatDave
SAS Super FREQ
You only need to run GLM once and for all of the data including both varieties, all harvest levels, and all weeks. Doing it for the separate varieties is why you see the zero DF in both analyses. I suspect that the results of the single analysis on all the data will probably be about like either of the ones you show - if so, the histogram of the residuals looks pretty symmetrical.
palolix
Pyrite | Level 9

Oh ok, thank you for the clarification. These are the results when including both varieties at once:

 

The GLM Procedure

 

Class Level Information
Class Levels Values
Harvest 10 1 2 3 4 5 6 8 10 11 12
Variety 2 BL516 Hass
Wks 4 0 1 3 6

Number of Observations Read 1185
Number of Observations Used 1086

 


Postharvest all years-Firmness_month

The GLM Procedure
 
Dependent Variable: AvgFirm

 

Source DF Sum of Squares Mean Square F Value Pr > F
Model 67 216459.0649 3230.7323 97.18 <.0001
Error 1018 33842.9324 33.2445    
Corrected Total 1085 250301.9973      

R-Square Coeff Var Root MSE AvgFirm Mean
0.864792 14.31739 5.765807 40.27135

Source DF Type I SS Mean Square F Value Pr > F
Harvest 9 3235.3196 359.4800 10.81 <.0001
Variety 1 428.4845 428.4845 12.89 0.0003
Harvest*Variety 6 1940.0238 323.3373 9.73 <.0001
Wks 3 198566.1555 66188.7185 1990.97 <.0001
Harvest*Wks 27 7589.2734 281.0842 8.46 <.0001
Variety*Wks 3 1808.7425 602.9142 18.14 <.0001
Harvest*Variety*Wks 18 2891.0656 160.6148 4.83 <.0001

Source DF Type III SS Mean Square F Value Pr > F
Harvest 9 1697.6082 188.6231 5.67 <.0001
Variety 1 37.4341 37.4341 1.13 0.2889
Harvest*Variety 6 1006.8005 167.8001 5.05 <.0001
Wks 3 133392.5502 44464.1834 1337.49 <.0001
Harvest*Wks 27 7703.4927 285.3145 8.58 <.0001
Variety*Wks 3 2210.6355 736.8785 22.17 <.0001
Harvest*Variety*Wks 18 2891.0656 160.6148 4.83 <.0001

palolix_2-1759444271016.png

 It looks like the histogram for the residuals is normally distributed, so I guess I can use the following model to analyze this data, what do you think?

 

proc genmod data=one;
class Variety Harvest Wks;
model AvgFirm=Variety|Harvest|Wks/type3 dist=normal link=identity;
slice Variety*Harvest*Wks/sliceby=Harvest*Wks / adjust=simulate(seed=1);
run;

 

 

 

 

 

StatDave
SAS Super FREQ
Could be fairly reasonable. You might want to look into why about 100 observations were not used (maybe missing values), and if some of the observations with high leverage or influence are problematic or are errors. Also a bit odd that there are no predicted values in a big gap in the middle of the range.
palolix
Pyrite | Level 9

Thank you so much for your comments. The missing values are probably due to the fact that the variety BL516 was not harvested at every harvest month like Hass. 

palolix
Pyrite | Level 9

If I include Season, Harvest, and Variety in my model I get what it seems to me like a bimodal distribution of residuals

 

proc glm data=one plots=diagnostics;
class Season Harvest Variety;
model AvgFirm = Season|Harvest|Variety;
quit;

The GLM Procedure
 
Dependent Variable: AvgFirm

 

Source DF Sum of Squares Mean Square F Value Pr > F
Model 144 111514.4612 774.4060 3.24 <.0001
Error 3374 805328.1297 238.6865    
Corrected Total 3518 916842.5909      

R-Square Coeff Var Root MSE AvgFirm Mean
0.121629 38.92753 15.44948 39.68780

Source DF Type I SS Mean Square F Value Pr > F
Season 3 886.56917 295.52306 1.24 0.2942
Harvest 9 22270.81680 2474.53520 10.37 <.0001
Season*Harvest 11 7399.62936 672.69358 2.82 0.0011
Variety 15 53362.39688 3557.49313 14.90 <.0001
Season*Variety 21 11133.99185 530.19009 2.22 0.0011
Harvest*Variety 57 11054.28780 193.93487 0.81 0.8417
Season*Harves*Variet 28 5406.76936 193.09891 0.81 0.7497

Source DF Type III SS Mean Square F Value Pr > F
Season 3 4648.63915 1549.54638 6.49 0.0002
Harvest 9 20269.82403 2252.20267 9.44 <.0001
Season*Harvest 11 7779.96955 707.26996 2.96 0.0006
Variety 15 48976.59968 3265.10665 13.68 <.0001
Season*Variety 20 11534.33573 576.71679 2.42 0.0004
Harvest*Variety 57 12750.65775 223.69575 0.94 0.6097
Season*Harves*Variet 28 5406.76936 193.09891 0.81 0.7497

 

 
palolix_1-1759513572840.png

Which dist= and link= option would be more appropriate in this case if using genmod?

 

StatDave
SAS Super FREQ
You've left out Week again.
palolix
Pyrite | Level 9

I left out weeks because I wanted to compare varieties within Harvest for each season across week like in this graph

palolix_0-1759528229165.png

 

title2 'AvgFirm 2021: Comparing varieties within Harvest across Wks';

proc genmod data=one;
where Season=2021;
class Harvest Variety;
model AvgFirm=Harvest|Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest / adjust=simulate(seed=1);
run;

 

 

But you are right, if I include weeks then I get a normal distribution

palolix_1-1759528269919.png

 

 

StatDave
SAS Super FREQ

You don't have to leave variables out of the model to produce a plot like that. It looks like you need Week to be in the model to get approximately normal residuals - hopefully still true with the addition of Season. (but the diagnostics still present questions about some outliers and high influence points that might need addressing). You can store the fitted model and then use the EFFECTPLOT statement in PROC PLM to produce a plot like you want. Treating Week as continuous and producing the plots at the mean of Weeks:

proc glm plots(only)=diagnostics;
class variety harvest season;
model avgfirm = variety|harvest|season|week;
store mod;
quit;
proc plm source=mod;
effectplot interaction(x=harvest sliceby=variety plotby=season)/nrows=1 ncols=4;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 19 replies
  • 638 views
  • 7 likes
  • 4 in conversation