Solved: Best approach to analyze data skewed to the left with GLM

palolix

Dear SAS Community,

I would greatly appreciate it if you could give me some feedback on the best approach to analyze data skewed to the left.

Context: I would like to know if there are significant differences in the dependent continuous variable firmness (AvgFirm) between two avocado varieties (BL516 vs Hass) after 0, 1, 3, and 6 weeks of storage for each harvest month as you can see it in these graphs. I made these graphs using the mean but I guess that is not a good idea.

When I plot the data using the following code I get the following distributions for each week, where you can see that for week 6 the data is skewed to the left.

proc univariate data=one normal plot;
where Variety in('BL516', 'Hass');
var AvgFirm;
histogram AvgFirm/normal (color=red);
class Wks;
run;
quit;

The UNIVARIATE Procedure

Wks = 6

Fitted Normal Distribution for AvgFirm

Parameters for Normal Distribution
Parameter	Symbol	Estimate
Mean	Mu	9.49237
Std Dev	Sigma	9.399897

Goodness-of-Fit Tests for Normal Distribution
Test	Statistic		p Value
Kolmogorov-Smirnov	D	0.2351068	Pr > D	<0.010
Cramer-von Mises	W-Sq	3.3803286	Pr > W-Sq	<0.005
Anderson-Darling	A-Sq	17.9411189	Pr > A-Sq	<0.005

Quantiles for Normal Distribution
Percent	Quantile
Percent	Observed	Estimated
1.0	1.90000	-12.37506
5.0	2.60000	-5.96908
10.0	3.03000	-2.55408
25.0	4.20000	3.15224
50.0	6.23000	9.49237
75.0	10.10000	15.83250
90.0	22.75000	21.53882
95.0	28.00000	24.95382
99.0	51.20000	31.35980

For this reason I would analyze the data for each week separately using genmod. What dist= and link= function do you think it would be a good approach to model left skewed data effectively?

proc genmod data=one;
where Wks=6 ;
class Variety Harvest;
model AvgFirm=Variety|Harvest/type3 dist= link=;
slice Variety*Harvest/sliceby=Harvest/ adjust=simulate(seed=1);
run;

I would greatly appreciate it your help!

Thanks

StatDave

Without having the data and clarity on all of the variables involved in the study (season is new), it's hard to say. But you might be able to obtain estimable results by fitting the model in PROC GLM and using the E option in the MODEL statement to see the general form of the estimable functions. With that info, you should be able to write ESTIMATE or CONTRAST statements that are estimable. It might help to treat Week as a continuous variable by not including it in the CLASS statement.

View solution in original post

StatDave

Your model defines the subpopulations as combinations of Variety and Harvest, but it seems that the histograms you are using to evaluate normality doesn't look at the distributions of the individual subpopulations. It appears that each histogram uses the data across Varieties and Harvests, so the skewness you see could be caused by shifts among the distributions of the individual subpopulations even if all of those individual distributions are quite symmetrical. So, I think you need to look at the distributions that your model defines before you decide if you have skewed data.

Of course, if you expect that your response is approximately normal, you could remove the subpopulation shifts by fitting the model and plotting the residuals to see if they are approximately normal. You can get that in the diagnostic plots from PROC GLM.

proc glm plots=diagnostics; 
   class harvest variety ;
   model avgfirm = harvest|variety;
quit;

palolix

Thank you very much for your reply StatDave.

I got histograms for each Variety (BL516 on the left and Hass on the right) for week 3 and week 6:

The UNIVARIATE Procedure for 'BL516'

Variable: AvgFirm

Wks = 6

Moments
N	67	Sum Weights	67
Mean	9.68910448	Sum Observations	649.17
Std Deviation	7.74860002	Variance	60.0408022
Skewness	2.95974499	Kurtosis	12.0875772
Uncorrected SS	10252.5689	Corrected SS	3962.69295
Coeff Variation	79.9723033	Std Error Mean	0.94664216

Basic Statistical Measures
Location		Variability
Mean	9.689104	Std Deviation	7.74860
Median	7.160000	Variance	60.04080
Mode	6.000000	Range	48.86000
		Interquartile Range	5.43000

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	10.23523	Pr > \|t\|	<.0001
Sign	M	33.5	Pr >= \|M\|	<.0001
Signed Rank	S	1139	Pr >= \|S\|	<.0001

Tests for Normality
Test	Statistic		p Value
Shapiro-Wilk	W	0.704547	Pr < W	<0.0001
Kolmogorov-Smirnov	D	0.220242	Pr > D	<0.0100
Cramer-von Mises	W-Sq	1.035726	Pr > W-Sq	<0.0050
Anderson-Darling	A-Sq	5.48558	Pr > A-Sq	<0.0050

Quantiles (Definition 5)
Level	Quantile
100% Max	51.20
99%	51.20
95%	24.50
90%	18.70
75% Q3	10.84
50% Median	7.16
25% Q1	5.41
10%	3.63
5%	3.05
1%	2.34
0% Min	2.34

The UNIVARIATE Procedure for 'Hass'

Variable: AvgFirm

Wks = 6

Moments
N	106	Sum Weights	106
Mean	9.36801887	Sum Observations	993.01
Std Deviation	10.3420827	Variance	106.958675
Skewness	2.4219283	Kurtosis	5.72092607
Uncorrected SS	20533.1973	Corrected SS	11230.6609
Coeff Variation	110.397757	Std Error Mean	1.00451187

Basic Statistical Measures
Location		Variability
Mean	9.368019	Std Deviation	10.34208
Median	5.165000	Variance	106.95868
Mode	3.030000	Range	51.35000
		Interquartile Range	5.97000

Note: The mode displayed is the smallest of 7 modes with a count of 2.

Tests for Location: Mu0=0
Test	Statistic		p Value
Student's t	t	9.325941	Pr > \|t\|	<.0001
Sign	M	53	Pr >= \|M\|	<.0001
Signed Rank	S	2835.5	Pr >= \|S\|	<.0001

Tests for Normality
Test	Statistic		p Value
Shapiro-Wilk	W	0.656111	Pr < W	<0.0001
Kolmogorov-Smirnov	D	0.258368	Pr > D	<0.0100
Cramer-von Mises	W-Sq	2.534216	Pr > W-Sq	<0.0050
Anderson-Darling	A-Sq	13.24827	Pr > A-Sq	<0.0050

Quantiles (Definition 5)
Level	Quantile
100% Max	52.600
99%	47.550
95%	35.950
90%	24.850
75% Q3	9.650
50% Median	5.165
25% Q1	3.680
10%	2.860
5%	2.460
1%	1.900
0% Min	1.250

When trying the diagnostics for BL516 and Hass

The GLM Procedure

Dependent Variable: AvgFirm

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	16	5603.8279	350.2392	1.53	0.0819
Error	1069	244698.1694	228.9038
Corrected Total	1085	250301.9973

R-Square	Coeff Var	Root MSE	AvgFirm Mean
0.022388	37.56906	15.12957	40.27135

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Harvest	9	3235.319563	359.479951	1.57	0.1193
Variety	1	428.484524	428.484524	1.87	0.1715
Harvest*Variety	6	1940.023838	323.337306	1.41	0.2065

Source	DF	Type III SS	Mean Square	F Value	Pr > F
Harvest	9	3118.480398	346.497822	1.51	0.1380
Variety	1	347.962624	347.962624	1.52	0.2179
Harvest*Variety	6	1940.023838	323.337306	1.41	0.2065

PaigeMiller

The residuals need to be (approximately) normally distributed, not the raw data. The raw data can have any distribution, and if the residuals are normally distributed, then GLM is appropriate.

--
Paige Miller

palolix

Thank you very much for your clarification Paige, I will keep that in mind for the future!

StatDave

You've said that the data were collected at combinations of three variables - Variety, Harvest, and Week. So, a better evaluation would be to include Week in the GLM analysis to see the diagnostics:

proc glm plots=diagnostics; 
   class harvest variety week;
   model avgfirm = harvest|variety|week;
quit;

palolix

Thank you again StatDave for your help on this. Ok, so I run the diagnostics for the two avocado varieties separately. For BL516 I got this:

Dependent Variable: AvgFirm

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	27	73180.93565	2710.40502	129.12	<.0001
Error	399	8375.23275	20.99056
Corrected Total	426	81556.16840

R-Square	Coeff Var	Root MSE	AvgFirm Mean
0.897307	11.64900	4.581545	39.32995

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Harvest	6	1791.94777	298.65796	14.23	<.0001
Variety	0	0.00000	.	.	.
Harvest*Variety	0	0.00000	.	.	.
Wks	3	69881.08555	23293.69518	1109.72	<.0001
Harvest*Wks	18	1507.90233	83.77235	3.99	<.0001
Variety*Wks	0	0.00000	.	.	.
HarvestVarietyWks	0	0.00000	.	.	.

Source	DF	Type III SS	Mean Square	F Value	Pr > F
Harvest	6	1128.53650	188.08942	8.96	<.0001
Variety	0	0.00000	.	.	.
Harvest*Variety	0	0.00000	.	.	.
Wks	3	60694.23347	20231.41116	963.83	<.0001
Harvest*Wks	18	1507.90233	83.77235	3.99	<.0001
Variety*Wks	0	0.00000	.	.	.
HarvestVarietyWks	0	0.00000	.	.	.

For Hass:

The GLM Procedure

Dependent Variable: AvgFirm

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	39	142654.5078	3657.8079	88.90	<.0001
Error	619	25467.6996	41.1433
Corrected Total	658	168122.2074

R-Square	Coeff Var	Root MSE	AvgFirm Mean
0.848517	15.69005	6.414304	40.88134

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Harvest	9	3188.2587	354.2510	8.61	<.0001
Variety	0	0.0000	.	.	.
Harvest*Variety	0	0.0000	.	.	.
Wks	3	129725.2727	43241.7576	1051.00	<.0001
Harvest*Wks	27	9740.9764	360.7769	8.77	<.0001
Variety*Wks	0	0.0000	.	.	.
HarvestVarietyWks	0	0.0000	.	.	.

Source	DF	Type III SS	Mean Square	F Value	Pr > F
Harvest	9	1509.1338	167.6815	4.08	<.0001
Variety	0	0.0000	.	.	.
Harvest*Variety	0	0.0000	.	.	.
Wks	3	107736.0103	35912.0034	872.85	<.0001
Harvest*Wks	27	9740.9764	360.7769	8.77	<.0001
Variety*Wks	0	0.0000	.	.	.
HarvestVarietyWks	0	0.0000	.	.	.

StatDave

You only need to run GLM once and for all of the data including both varieties, all harvest levels, and all weeks. Doing it for the separate varieties is why you see the zero DF in both analyses. I suspect that the results of the single analysis on all the data will probably be about like either of the ones you show - if so, the histogram of the residuals looks pretty symmetrical.

palolix

Oh ok, thank you for the clarification. These are the results when including both varieties at once:

The GLM Procedure

Class Level Information
Class	Levels	Values
Harvest	10	1 2 3 4 5 6 8 10 11 12
Variety	2	BL516 Hass
Wks	4	0 1 3 6

Number of Observations Read	1185
Number of Observations Used	1086

Postharvest all years-Firmness_month

The GLM Procedure

Dependent Variable: AvgFirm

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	67	216459.0649	3230.7323	97.18	<.0001
Error	1018	33842.9324	33.2445
Corrected Total	1085	250301.9973

R-Square	Coeff Var	Root MSE	AvgFirm Mean
0.864792	14.31739	5.765807	40.27135

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Harvest	9	3235.3196	359.4800	10.81	<.0001
Variety	1	428.4845	428.4845	12.89	0.0003
Harvest*Variety	6	1940.0238	323.3373	9.73	<.0001
Wks	3	198566.1555	66188.7185	1990.97	<.0001
Harvest*Wks	27	7589.2734	281.0842	8.46	<.0001
Variety*Wks	3	1808.7425	602.9142	18.14	<.0001
HarvestVarietyWks	18	2891.0656	160.6148	4.83	<.0001

Source	DF	Type III SS	Mean Square	F Value	Pr > F
Harvest	9	1697.6082	188.6231	5.67	<.0001
Variety	1	37.4341	37.4341	1.13	0.2889
Harvest*Variety	6	1006.8005	167.8001	5.05	<.0001
Wks	3	133392.5502	44464.1834	1337.49	<.0001
Harvest*Wks	27	7703.4927	285.3145	8.58	<.0001
Variety*Wks	3	2210.6355	736.8785	22.17	<.0001
HarvestVarietyWks	18	2891.0656	160.6148	4.83	<.0001

It looks like the histogram for the residuals is normally distributed, so I guess I can use the following model to analyze this data, what do you think?

proc genmod data=one;
class Variety Harvest Wks;
model AvgFirm=Variety|Harvest|Wks/type3 dist=normal link=identity;
slice Variety*Harvest*Wks/sliceby=Harvest*Wks / adjust=simulate(seed=1);
run;

StatDave

Could be fairly reasonable. You might want to look into why about 100 observations were not used (maybe missing values), and if some of the observations with high leverage or influence are problematic or are errors. Also a bit odd that there are no predicted values in a big gap in the middle of the range.

palolix

Thank you so much for your comments. The missing values are probably due to the fact that the variety BL516 was not harvested at every harvest month like Hass.

palolix

If I include Season, Harvest, and Variety in my model I get what it seems to me like a bimodal distribution of residuals

proc glm data=one plots=diagnostics;
class Season Harvest Variety;
model AvgFirm = Season|Harvest|Variety;
quit;

The GLM Procedure

Dependent Variable: AvgFirm

Source	DF	Sum of Squares	Mean Square	F Value	Pr > F
Model	144	111514.4612	774.4060	3.24	<.0001
Error	3374	805328.1297	238.6865
Corrected Total	3518	916842.5909

R-Square	Coeff Var	Root MSE	AvgFirm Mean
0.121629	38.92753	15.44948	39.68780

Source	DF	Type I SS	Mean Square	F Value	Pr > F
Season	3	886.56917	295.52306	1.24	0.2942
Harvest	9	22270.81680	2474.53520	10.37	<.0001
Season*Harvest	11	7399.62936	672.69358	2.82	0.0011
Variety	15	53362.39688	3557.49313	14.90	<.0001
Season*Variety	21	11133.99185	530.19009	2.22	0.0011
Harvest*Variety	57	11054.28780	193.93487	0.81	0.8417
SeasonHarvesVariet	28	5406.76936	193.09891	0.81	0.7497

Source	DF	Type III SS	Mean Square	F Value	Pr > F
Season	3	4648.63915	1549.54638	6.49	0.0002
Harvest	9	20269.82403	2252.20267	9.44	<.0001
Season*Harvest	11	7779.96955	707.26996	2.96	0.0006
Variety	15	48976.59968	3265.10665	13.68	<.0001
Season*Variety	20	11534.33573	576.71679	2.42	0.0004
Harvest*Variety	57	12750.65775	223.69575	0.94	0.6097
SeasonHarvesVariet	28	5406.76936	193.09891	0.81	0.7497

Which dist= and link= option would be more appropriate in this case if using genmod?

StatDave

You've left out Week again.

palolix

I left out weeks because I wanted to compare varieties within Harvest for each season across week like in this graph

title2 'AvgFirm 2021: Comparing varieties within Harvest across Wks';

proc genmod data=one;
where Season=2021;
class Harvest Variety;
model AvgFirm=Harvest|Variety/type3 dist=normal link=identity;
slice Harvest*Variety/sliceby=Harvest / adjust=simulate(seed=1);
run;

But you are right, if I include weeks then I get a normal distribution

StatDave

You don't have to leave variables out of the model to produce a plot like that. It looks like you need Week to be in the model to get approximately normal residuals - hopefully still true with the addition of Season. (but the diagnostics still present questions about some outliers and high influence points that might need addressing). You can store the fitted model and then use the EFFECTPLOT statement in PROC PLM to produce a plot like you want. Treating Week as continuous and producing the plots at the mean of Weeks:

proc glm plots(only)=diagnostics;
class variety harvest season;
model avgfirm = variety|harvest|season|week;
store mod;
quit;
proc plm source=mod;
effectplot interaction(x=harvest sliceby=variety plotby=season)/nrows=1 ncols=4;
run;

Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

Re: Best approach to analyze data skewed to the left with GLM

The 2025 SAS Hackathon has begun!