BookmarkSubscribeRSS Feed
rebeccachau
Obsidian | Level 7

I just want to perform a simple chi-square test between two categorical variables, gender and depression. However, I am having trouble formatting my raw data so that I can conduct my analysis. I believe I need to have a COUNT variable included in the data set.

 

My data is structured like this:

ID       DEPRESSION     GENDER

1234      1                        0

2345     0                         1

4567     1                         1

 

Can anyone point me in the right direction? I have tried reading the documentation but don't understand the formatting step.

 

Thanks

5 REPLIES 5
FreelanceReinh
Jade | Level 19

@rebeccachau wrote:

I just want to perform a simple chi-square test between two categorical variables, gender and depression. (...) I believe I need to have a COUNT variable included in the data set.


No, you don't need a COUNT variable. Assuming that each subject corresponds to one observation in your input dataset, you can use this dataset as it is. It's only an optional alternative to start with an aggregated input dataset (if you had one already). See the example below and note that the results (both frequency tables and statistical tests) are identical.

/* Create test data for demonstration */

data have;
call streaminit(27182818);
do id=1234 to 1500;
  depression=rand('bern',0.1);
  gender=rand('bern',0.5);
  output;
end;
run; /* 267 obs. */

/* Perform chi-square test */

proc freq data=have;
tables depression*gender / chisq;
run;


/* Alternative: Use an aggregated analysis dataset */

/* Create the aggregated analysis dataset */

proc freq data=have noprint;
tables depression*gender / out=aggreg;
run; /* 4 obs. (including a variable COUNT) */

/* Specify the variable containing the counts in the WEIGHT statement */

proc freq data=aggreg;
tables depression*gender / chisq;
weight count;
run;
rebeccachau
Obsidian | Level 7

Thank you. I seem to be able to get this code to work. However, the output tells me that frequency missing is 1048281 (sample size=145) and there is a WARNING: 100% of the data are missing. (Output attached.) What does this mean??

FreelanceReinh
Jade | Level 19

@rebeccachau wrote:

However, the output tells me that frequency missing is 1048281 (sample size=145) and there is a WARNING: 100% of the data are missing. (Output attached.) What does this mean??


It means that 1048283 of the 1048428 observations (=99.986%, SAS rounded this up to "100%") in dataset WORK.IMPORT have missing values for variable GDS or variable GENDER or both. If this is unexpected, it indicates a serious issue with your data, most likely resulting from an inappropriate raw data import (perhaps with PROC IMPORT). This must be investigated thoroughly and all data issues resolved before any reliable statistical analysis can be done.

 

The first step would be to identify the analysis variable(s) with almost all values missing: Run PROC FREQ steps with only one variable in the TABLES statement (e.g. tables GDS;). Then review the step which created WORK.IMPORT (with a focus on the affected variables) and compare some raw data records (from the original Excel/text/database/... file) to the corresponding observations in WORK.IMPORT so as to make sure that they have been imported correctly.

 

If your raw data look correct and you don't see what's wrong with your data import, don't hesitate to open a new thread and describe the issue. Many people in this forum will be able to help you with that.

rebeccachau
Obsidian | Level 7

Thanks. I investigated my Excel file further and there was a problem there. Now the output correctly reports only two items as missing. 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1395 views
  • 5 likes
  • 3 in conversation