Re: Has the distribution of values in categorical variable been stable or changed over time series

Rodcjones · Posted 01-26-2023 11:00 PM

This is probably more of a statistics question than a programming one, but hopefully that's OK. I use SAS Enterprise Guide 7.15 and am looking for the best method(s) for conducting two hypothesis tests about the distribution of a categorical variable in a population over time.

Using a simulated dataset (named SIM, provided at the bottom), my first question is: Is the distribution of the categorical variable “stable” over the first 12 time points? Is there a way to answer that question statistically?

Though I’m using the same simulated dataset for the second question, it can be considered independently of the first.

Let's say that in the study from which this dataset was drawn, a change was applied on Jan. 1, 2022, that is, after 12 time points, halfway through the time series.

The question is: In what way(s), if any, is the distribution of the categorical variable different, or changing over time, in 2022?

By eyeball, we would maybe guess that the distribution of the four values is stable in the pre-intervention period, but changes over time in the post-intervention period – perhaps A and/or B decrease and C and/or D increase (in terms of proportions of total).

For this, I am not enthusiastic about a single chi-square test of homogeneity in which I aggregate 2021 and 2022 and analyze the result as a 2x4 contingency table. I have it in my head that interrupted time series could yield what we want, but I’m unsure, because I’m most interested in detecting or describing a change in the whole distribution rather than a change in one individual category alone.

data SIM;
input Month $ CategoricalVar $ Frequency;
datalines;
2021-01 A 152
2021-01 B 289
2021-01 C 193
2021-01 D 103
2021-02 A 145
2021-02 B 250
2021-02 C 193
2021-02 D 101
2021-03 A 178
2021-03 B 312
2021-03 C 248
2021-03 D 117
2021-04 A 174
2021-04 B 309
2021-04 C 238
2021-04 D 135
2021-05 A 184
2021-05 B 339
2021-05 C 234
2021-05 D 116
2021-06 A 180
2021-06 B 340
2021-06 C 241
2021-06 D 113
2021-07 A 203
2021-07 B 370
2021-07 C 241
2021-07 D 109
2021-08 A 185
2021-08 B 345
2021-08 C 252
2021-08 D 134
2021-09 A 198
2021-09 B 333
2021-09 C 252
2021-09 D 130
2021-10 A 207
2021-10 B 378
2021-10 C 233
2021-10 D 127
2021-11 A 168
2021-11 B 298
2021-11 C 223
2021-11 D 127
2021-12 A 172
2021-12 B 308
2021-12 C 260
2021-12 D 127
2022-01 A 122
2022-01 B 290
2022-01 C 247
2022-01 D 144
2022-02 A 151
2022-02 B 287
2022-02 C 218
2022-02 D 107
2022-03 A 170
2022-03 B 316
2022-03 C 276
2022-03 D 162
2022-04 A 150
2022-04 B 325
2022-04 C 277
2022-04 D 119
2022-05 A 148
2022-05 B 289
2022-05 C 287
2022-05 D 134
2022-06 A 148
2022-06 B 238
2022-06 C 252
2022-06 D 154
2022-07 A 130
2022-07 B 258
2022-07 C 241
2022-07 D 153
2022-08 A 135
2022-08 B 235
2022-08 C 300
2022-08 D 140
2022-09 A 152
2022-09 B 229
2022-09 C 280
2022-09 D 172
2022-10 A 154
2022-10 B 330
2022-10 C 315
2022-10 D 187
2022-11 A 130
2022-11 B 278
2022-11 C 312
2022-11 D 179
2022-12 A 135
2022-12 B 267
2022-12 C 299
2022-12 D 175
;

acordes · Posted 01-27-2023 03:04 AM


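/* Convert the character Month (e.g. 2021-01) to a SAS date for sorting and display */
data sim;
set sim;
format yym monyy.;
yym=input(cats(month, "-01"), yymmdd10.);
run;

proc sort data=sim;
by yym;
run;

/* Percent of each category within each month, weighted by the counts */
proc freq data=sim noprint;
by yym;
tables CategoricalVar / out=sim2;
weight frequency;
run;

/* Stacked bars: one bar per month, segments are the category percentages */
ods graphics on;
proc sgplot data=sim2 PCTLEVEL=GROUP;
vbarbasic yym / response=percent group=CategoricalVar stat=sum groupdisplay=stack;
run;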

Rick_SAS · Posted 01-27-2023 06:43 AM

I think you can build on Arne's visualization. If your main interest is whether there is a linear trend for a certain time period, you can analyze the trend of the proportions. For example, if you fit an OLS line to the proportions in each category over time, does any line have a slope that is significantly different from zero? The test for unequal slopes across linear models (ANCOVA) is explained in SAS Usage Note 24177, "Comparing parameters (slopes) from a model fit to two or more groups."

/* https://support.sas.com/kb/24/177.html */
 proc glm data=sim2;
   class CategoricalVar;
   model Percent = CategoricalVar yym CategoricalVar*yym / noint solution;
quit;

You can look at the Type3 tests and the parameter estimates to conclude whether the slopes of the lines are different. Note that since these are proportions, they can't all increase! If one proportion goes up, at least one other must go down.
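
To look at one period at a time, the same model can be restricted with a WHERE statement. This is only a sketch: it assumes the sim2 dataset (yym, CategoricalVar, Percent) created by the PROC FREQ step earlier in the thread, and the 2022 cutoff is just the change date mentioned in the question.

```sas
/* Assumes sim2 from the PROC FREQ step earlier in the thread */
proc glm data=sim2;
   where yym >= '01JAN2022'd;     /* post-change period only */
   class CategoricalVar;
   model Percent = CategoricalVar yym CategoricalVar*yym / noint solution;
run;
quit;

/* Visual check: one fitted OLS line per category over the same period */
proc sgplot data=sim2;
   where yym >= '01JAN2022'd;
   reg x=yym y=Percent / group=CategoricalVar;
run;
```

Because yym is a SAS date value, the estimated slopes are in percentage points per day. Changing the WHERE condition to yym < '01JAN2022'd gives the corresponding fit for the first 12 months.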

Ksharp · Posted 01-27-2023 06:10 AM

You could try confidence intervals for the multinomial proportions.
Check @Rick_SAS's blog:

https://blogs.sas.com/content/iml/2017/02/15/confidence-intervals-multinomial-proportions.html

sbxkoenk · Posted 01-27-2023 06:52 AM

POPULATION STABILITY INDEX (PSI)

Examining Distributional Shifts by Using Population Stability Index (PSI) for Model Validation and Diagnosis
Alec Zhixiao Lin, LoanDepot, Foothill Ranch, CA
https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf

Koen
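
For a rough idea of what PSI looks like on this thread's data, here is a minimal sketch that treats 2021 as the baseline and 2022 as the comparison period. It assumes the original SIM dataset from the question (character Month such as 2021-01). The table names yearcat and props are made up for illustration. PSI is computed as the sum over categories of (p2022 - p2021) * log(p2022 / p2021).

```sas
proc sql;
   /* Total count per year and category; the year is the first 4 characters of Month */
   create table yearcat as
   select substr(Month, 1, 4) as Year, CategoricalVar, sum(Frequency) as n
   from sim
   group by calculated Year, CategoricalVar;

   /* Proportion of each category within its year (SAS remerges the yearly total) */
   create table props as
   select Year, CategoricalVar, n / sum(n) as p
   from yearcat
   group by Year;

   /* PSI: baseline = 2021 (alias a), comparison = 2022 (alias b) */
   select sum((b.p - a.p) * log(b.p / a.p)) as PSI
   from props as a inner join props as b
     on a.CategoricalVar = b.CategoricalVar
   where a.Year = '2021' and b.Year = '2022';
quit;
```

A commonly cited rule of thumb reads PSI below 0.1 as little or no shift and above 0.25 as a substantial shift, but those cutoffs are conventions rather than a formal hypothesis test.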

StatDave · Posted 01-27-2023 11:47 AM

You can use a generalized logistic model to test those hypotheses. For the first, the model below tests the effect of MONTH in the first-year (2021) data. The Type3 test of MONTH is not significant (p=.86), suggesting "stability" in the sense of no change in the proportions over the months. The LSMEANS statement provides the proportions and plots them.

data SIM;
input year 1-4 Month 6-7 CategoricalVar $ Frequency;
datalines;
...
;
proc logistic data=sim;
where year=2021;
freq frequency;
class month / param=glm;
model categoricalvar=month / link=glogit;
lsmeans month / ilink plots=meanplot(ilink);
run;

This next model assesses the change between years. The Type3 test for YEAR is significant (p<.0001). The LSMEANS statement shows the probabilities for each category in each year and confirms your eyeball conclusion.

proc logistic data=sim;
freq frequency;
class year month / param=glm;
model categoricalvar=year month year*month / link=glogit;
lsmeans year / ilink plots=meanplot(ilink);
run;

But both could also be done with a nonmodeling approach using simple chi-square tests. This approach might be necessary with sparser data, which could cause numerical problems in the model-based approach.

proc freq data=sim;
weight frequency;
table year*categoricalvar/chisq;
run;
proc freq data=sim;
where year=2021;
weight frequency;
table month*categoricalvar/chisq;
run;
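
The same two tools could also be pointed at the 2022 months alone, to ask whether the distribution changes within the post-change year. This is only a sketch under the SIM layout used in this reply (numeric year and month). Treating month as a continuous predictor is an assumption made here to get a trend test. It is not part of the models above.

```sas
/* Chi-square test of homogeneity across the 12 months of 2022 */
proc freq data=sim;
where year=2022;
weight frequency;
table month*categoricalvar / chisq;
run;

/* Month entered as continuous (no CLASS statement): */
/* tests a linear trend in the generalized logits within 2022 */
proc logistic data=sim;
where year=2022;
freq frequency;
model categoricalvar=month / link=glogit;
run;
```

With four categories the trend model estimates three slopes, one per non-reference category, so the test of MONTH has 3 degrees of freedom.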

Rodcjones · Posted 01-31-2023 08:23 PM

@acordes @Rick_SAS @Ksharp @sbxkoenk @StatDave

Thank you all for these prompt and thoughtful responses! I've begun investigating each and plan to report back on the results of my learning/experimentation.
