- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I just want to perform a simple chi-square test between two categorical variables, gender and depression. However, I am having trouble formatting my raw data so that I can conduct my analysis. I believe I need to have a COUNT variable included in the data set.
My data is structured like this:
ID DEPRESSION GENDER
1234 1 0
2345 0 1
4567 1 1
Can anyone point me in the right direction? I have tried reading the documentation but don't understand the formatting step.
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How about
proc sql;
create table want as
select
gender,
count(*) as count,
sum(depression as depression
from have
group by gender;
quit;
?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@rebeccachau wrote:
I just want to perform a simple chi-square test between two categorical variables, gender and depression. (...) I believe I need to have a COUNT variable included in the data set.
No, you don't need a COUNT variable. Assuming that each subject corresponds to one observation in your input dataset, you can use this dataset as it is. It's only an optional alternative to start with an aggregated input dataset (if you had one already). See the example below and note that the results (both frequency tables and statistical tests) are identical.
/* Create test data for demonstration */
data have;
call streaminit(27182818);
do id=1234 to 1500;
depression=rand('bern',0.1);
gender=rand('bern',0.5);
output;
end;
run; /* 267 obs. */
/* Perform chi-square test */
proc freq data=have;
tables depression*gender / chisq;
run;
/* Alternative: Use an aggregated analysis dataset */
/* Create the aggregated analysis dataset */
proc freq data=have noprint;
tables depression*gender / out=aggreg;
run; /* 4 obs. (including a variable COUNT) */
/* Specify the variable containing the counts in the WEIGHT statement */
proc freq data=aggreg;
tables depression*gender / chisq;
weight count;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you. I seem to be able to get this code to work. However, the output tells me that frequency missing is 1048281 (sample size=145) and there is a WARNING: 100% of the data are missing. (Output attached.) What does this mean??
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@rebeccachau wrote:
However, the output tells me that frequency missing is 1048281 (sample size=145) and there is a WARNING: 100% of the data are missing. (Output attached.) What does this mean??
It means that 1048283 of the 1048428 observations (=99.986%, SAS rounded this up to "100%") in dataset WORK.IMPORT have missing values for variable GDS or variable GENDER or both. If this is unexpected, it indicates a serious issue with your data, most likely resulting from an inappropriate raw data import (perhaps with PROC IMPORT). This must be investigated thoroughly and all data issues resolved before any reliable statistical analysis can be done.
The first step would be to identify the analysis variable(s) with almost all values missing: Run PROC FREQ steps with only one variable in the TABLES statement (e.g. tables GDS;). Then review the step which created WORK.IMPORT (with a focus on the affected variables) and compare some raw data records (from the original Excel/text/database/... file) to the corresponding observations in WORK.IMPORT so as to make sure that they have been imported correctly.
If your raw data look correct and you don't see what's wrong with your data import, don't hesitate to open a new thread and describe the issue. Many people in this forum will be able to help you with that.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks. I investigated my Excel file further and there was a problem there. Now the output correctly reports only two items as missing.