Hi,
I have a dataset that contains the full sample (2,500 participants), which includes a subsample that I'm using for analysis (1,000 participants). I am trying to see if there are statistically significant differences between the whole sample and my subsample. I am using t tests for my interval variables, but I have a lot of categorical variables, so I am trying to use chi-square tests to see if there are significant differences for those.
To differentiate between the whole sample and the subsample I have a new variable called "sample" and if the value in the "sample" column is 1 then that participant is part of the large sample but not the subsample, and if the value in the "sample" column is 2 then the participant is part of the subsample.
For t tests I used the class statement saying "class sample; var age" which seems to have worked.
Is there a similar way to do this for Chi Square? I want to compare employment (employed or not) for sample 1 and sample 2.
Thanks!
Hi @klongway,
So you always compare the 1000 observations with sample=2 to the remaining 1500 with sample=1. The example below shows how to do this comparison for a categorical variable:
/* Create test data for demonstration */
data have;
set sashelp.heart(obs=2500 rename=(status=employment));
sample=1+(_n_<=1000);
run;
/* Perform chi-square test */
proc freq data=have;
tables sample*employment / chisq;
run;
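Since you mention having a lot of categorical variables, here is a minimal sketch of how several of them could be tested in one step by grouping them in parentheses in the TABLES statement. It repeats the test data step so that it runs on its own. The variables sex and bp_status are just other categorical columns from sashelp.heart, used here as stand-ins for your own variables.

```sas
/* Sketch only: same demonstration data as above */
data have;
   set sashelp.heart(obs=2500 rename=(status=employment));
   sample = 1 + (_n_ <= 1000);   /* first 1000 obs get sample=2, the rest sample=1 */
run;

/* One two-way table and chi-square test per variable in the parentheses */
proc freq data=have;
   tables sample*(employment sex bp_status) / chisq;
run;
```

This produces one crosstabulation with its own chi-square statistics for each of sample*employment, sample*sex and sample*bp_status.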
This is so helpful!!! Thank you!!! I'm running into problems with some of my variables when doing this- some are working, others are not.
I have
data file2;
set file1;
if dep=. then sample=2;
if dep=1 then sample=1;
if dep=2 then sample=1;
run;
I then run a table
Proc freq data=file2;
tables sample;
run;
And the table shows all 2,500 of my participants assigned to either sample 1 or sample 2.
When I try to run the chi square for employment, though, sample 2 comes up blank.
I did:
proc freq data=file2;
tables sample*emp / chisq;
run;
And the table comes up with sample 2 empty.
I checked the file 2 and I have plenty of people who answered the employment question in sample 2, so it isn't that there isn't any data.
Any ideas?
Thanks!!!
To get a quick overview of several categorical variables I often use PROC FREQ with the MISSING and LIST options in the TABLES statement.
So I would run
proc freq data=file2;
tables dep*sample*emp / missing list;
run;
and examine the resulting output. What does it look like for your file2?
Thank you so much! When I run that I get: a chart with
Sample 2-full time-400
Sample 2-parttime-400
Sample 2- not working-700
Sample 1-missing=1500
So it is pulling all of sample 2 but only has the "missing" in sample 1. But the "missing" in sample 2 adds up to the total number in the dataset in sample 2....!!! Is it possible everyone in Sample 1 did not answer this question?! Ahh!
@klongway wrote:
... When I run that I get: a chart with
Sample 2-full time-400
Sample 2-parttime-400
Sample 2- not working-700
Sample 1-missing=1500
Assuming that "Sample 2" in the above PROC FREQ output refers to your "subsample" consisting of 1000 participants, I would be wondering why the corresponding frequencies, 400, 400 and 700, add up to 1500, not 1000.
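To check both the group sizes and whether emp is really missing for an entire group, something like the following sketch might help. It assumes file2 contains sample and emp as in your posts. The dataset name check and the variable emp_missing are hypothetical names introduced only for this illustration. Keep in mind that, by default, PROC FREQ leaves observations with a missing value in a table variable out of the crosstabulation and the chi-square computation, so a group in which emp is always missing would not contribute any counts to the test.

```sas
/* Sketch only: flag missing emp values (CMISS works for numeric and character variables) */
data check;
   set file2;
   emp_missing = cmiss(emp);   /* 1 if emp is missing, 0 otherwise */
run;

/* Count missing vs. non-missing emp within each sample */
proc freq data=check;
   tables sample*emp_missing / missing norow nocol nopercent;
run;
```

If one row of this table has all of its observations under emp_missing=1, then nobody in that group has an employment value in file2, and it would be worth checking how emp was read or merged into the dataset.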