Solved: Re: paired categorical data

lone0708 · Posted 06-24-2021 06:57 AM

Hi all,
I am working on a dataset where I have to test whether there is a significant difference in categorical variables from one time point to another within the same person.

My variables contain 2 or 3 categories (0, 1, 2).

Which test will be suitable to use - especially for the 3 category variable?

Thanks

ballardw · Posted 06-24-2021 11:18 AM

@lone0708 wrote:

I am looking to test for a significant difference between the categorical variables themselves. My dataset looks like this:

Patient  Time1  Time2  position_time1  position_time2  light_time1  light_time2
A        13:14  14:00  0               1               1            0
B        12:00  12:15  2               2               1            1
C        12:13  14:45  3               1               0            1

I want to test whether position and light are generally the same at the two time points, or whether, for example, position 3 is overrepresented at time 1. I hope that makes sense.

PROC FREQ with the EXPECTED option on the TABLES statement sounds like what you might be looking for.

Here's a brief example creating a data set with two variables to "compare". The RAND('integer', n) function returns random integers from 1 to n.

You can see the counts in each cell of the cross-tabulation and compare them with the "expected" counts based on the row and column distributions.

data example;
   /*should produce relatively similar distributions*/
   do i=1 to 50;
      x= rand('integer',3);
      y= rand('integer',3);
      output;
   end;
   /* now add some to bias a variable, y won't have any 3*/
   do i=51 to 100;
      x= rand('integer',3);
      y= rand('integer',2);
      output;
   end;
run;

proc freq data=example;
   tables x*y /expected chisq;
run;

Throw in a chi-square test (the CHISQ option) and you have a statistic for how far the observed counts depart from the expected ones.

Kurt_Bremser · Posted 06-24-2021 07:04 AM

With categorical values, any difference is significant. So I would simply count the distinct values:

data have;
input person $ cat_var;
datalines;
A 0
A 1
B 0
B 0
;

proc sql;
create table want as
  select
    person
  from have
  group by person
  having count(distinct cat_var) > 1
;
quit;
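
If the data are stored with one row per patient and one column per time point (the layout posted later in this thread), the same idea can be sketched with a plain comparison in a DATA step. This is only an illustration: the data set names wide and changes and the *_changed variables are made up here, and the three rows are the ones from the original poster's example.

```sas
/* Hypothetical wide-layout data: the three example rows from the thread */
data wide;
   input patient $ position_time1 position_time2 light_time1 light_time2;
   datalines;
A 0 1 1 0
B 2 2 1 1
C 3 1 0 1
;

/* A comparison in SAS evaluates to 1 (true) or 0 (false), */
/* so each flag is 1 when the category changed between the time points */
data changes;
   set wide;
   position_changed = (position_time1 ne position_time2);
   light_changed    = (light_time1 ne light_time2);
run;

proc print data=changes noobs;
run;
```

This only lists who changed; it does not test anything.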

PaigeMiller · Posted 06-24-2021 07:23 AM

@lone0708 wrote:

Hi all,
I am working on a dataset, where I have to test if there is a significant difference in categorical variables ...

Do you mean a significant difference between some statistic for the categorical variables (if so, which statistic?), or a significant difference between the categorical variables themselves (if so, please explain in a lot more detail)?

--
Paige Miller

lone0708 · Posted 06-24-2021 08:20 AM

I am looking to test for a significant difference between the categorical variables themselves. My dataset looks like this:

Patient  Time1  Time2  position_time1  position_time2  light_time1  light_time2
A        13:14  14:00  0               1               1            0
B        12:00  12:15  2               2               1            1
C        12:13  14:45  3               1               0            1

I want to test whether position and light are generally the same at the two time points, or whether, for example, position 3 is overrepresented at time 1. I hope that makes sense.

PaigeMiller · Posted 06-24-2021 09:13 AM

I am looking to test for a significant difference between the categorical variables themselves.

I am very confused. As far as I know, this can't be done. Testing categorical variables themselves is not a statistical concept (or, in the trivial sense, they are always different). The only statistical concept is to test statistics for each categorical variable to see if the statistics are different in the different categories, and you seem to be saying that's not what you want.

Using the data set you show, describe in words the steps you would take to answer the question.

--
Paige Miller

Kurt_Bremser · Posted 06-24-2021 09:47 AM

Wouldn't a simple PROC FREQ show you an imbalance in these values?
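
A minimal sketch of that idea, assuming the wide layout shown above. The data set name wide is hypothetical and the rows are just the three from the example, so the output is only meant to show the shape of the result: one-way tables of each variable, so the category counts at time 1 and time 2 can be compared by eye.

```sas
/* Hypothetical wide-layout data: the three example rows from the thread */
data wide;
   input patient $ position_time1 position_time2 light_time1 light_time2;
   datalines;
A 0 1 1 0
B 2 2 1 1
C 3 1 0 1
;

/* One frequency table per variable: counts and percentages of each category */
proc freq data=wide;
   tables position_time1 position_time2 light_time1 light_time2;
run;
```

This is descriptive only; it shows an imbalance but does not account for the pairing within patients.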

FreelanceReinh · Posted 06-24-2021 10:40 AM

Hi @lone0708,

Do you mean a test for marginal homogeneity (i.e., whether the distribution of "position" has changed from time 1 to time 2, and similar for "light")?

If so, the test "equivalent to Bhapkar’s test" presented in Example 35.7 Repeated Measures, 4 Response Levels, 1 Population of the PROC CATMOD documentation might be appropriate, especially in the case of more than two categories. (See also https://support.sas.com/kb/39/243.html.) For dichotomous variables (e.g., if "light" is either 0 or 1) McNemar's test should be applicable, see the Tests and Measures of Agreement available in PROC FREQ.

Example:

/* Create sample data for demonstration */

data have;
call streaminit(27182818);
do patient=1 to 250;
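  /* random clock times, rounded to whole minutes (time values are in seconds) */
  time1=round(rand('integer','8:00't,'14:00't),60);
  time2=time1+round(rand('integer','0:15't,'6:00't),60);
  /* the listed probabilities sum to less than 1, so RAND('table') returns */
  /* a 4th level with the remaining probability: position takes values 0-3 */
  position_time1=rand('table',0.2, 0.3, 0.4)-1;
  position_time2=rand('table',0.25,0.35,0.25)-1;
  /* dichotomous variable: 1 with the given probability, otherwise 0 */
  light_time1=rand('bern',0.6);
  light_time2=rand('bern',0.5);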
  output;
end;
format time: time5.;
run;

/* Perform tests for marginal homogeneity */

proc catmod data=have namelen=29;
response marginals;
model position_time1*position_time2=_response_ / freq design;
repeated time 2;
quit;

proc freq data=have;
tables light_time1*light_time2 / agree;
run;

Edit: Note that the difference between time 1 and time 2, be it 15 minutes or 6 hours, is disregarded in these tests.
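
A possible supplement for the case with more than two categories: on a square table larger than 2x2, the AGREE option of PROC FREQ reports Bowker's test of symmetry, and the EXACT statement with MCNEM requests an exact p-value for McNemar's test on a 2x2 table. The sketch below uses its own simulated data (the data set name demo, the seed and the probabilities are arbitrary assumptions, not taken from the original poster's data). Keep in mind that symmetry is a stronger hypothesis than marginal homogeneity, so this complements rather than replaces the CATMOD analysis above.

```sas
/* Simulated paired data; all names and parameters here are illustrative */
data demo;
   call streaminit(12345);
   do patient = 1 to 250;
      /* two probabilities given, the remainder goes to a 3rd level: values 0, 1, 2 */
      position_time1 = rand('table', 0.3, 0.3) - 1;
      position_time2 = rand('table', 0.4, 0.3) - 1;
      light_time1 = rand('bern', 0.6);
      light_time2 = rand('bern', 0.5);
      output;
   end;
run;

/* 3x3 table: AGREE reports Bowker's test of symmetry (plus kappa statistics) */
proc freq data=demo;
   tables position_time1*position_time2 / agree;
run;

/* 2x2 table: McNemar's test, with an exact p-value requested */
proc freq data=demo;
   tables light_time1*light_time2 / agree;
   exact mcnem;
run;
```

AGREE needs a square table, so every category has to occur at both time points.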
