Hi,
I have a dataset containing the full sample (2,500 participants), within which there is a subsample (1,000 participants) that I'm using for analysis. I am trying to see whether there are statistically significant differences between the whole sample and my subsample. I am using t tests for my interval variables, but I also have a lot of categorical variables, so I am trying to use chi-square tests for those.
To differentiate between the two, I created a new variable called "sample": if its value is 1, the participant is part of the large sample but not the subsample; if its value is 2, the participant is part of the subsample.
For the t tests I used the CLASS statement, "class sample; var age", which seems to have worked.
Is there a similar way to do this for chi-square? I want to compare employment (employed or not) for sample 1 and sample 2.
Thanks!
Hi @klongway,
So you always compare the 1000 observations with sample=2 to the remaining 1500 with sample=1. The example below shows how to do this comparison for a categorical variable:
/* Create test data for demonstration */
data have;
set sashelp.heart(obs=2500 rename=(status=employment));
sample=1+(_n_<=1000);
run;
/* Perform chi-square test */
proc freq data=have;
tables sample*employment / chisq;
run;
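If you have many categorical variables, the same pattern can be applied to several of them in one TABLES statement. This is only a sketch using the test data above: "sex" is simply another variable that happens to exist in sashelp.heart and stands in for your own variables, and the NOCOL and NOPERCENT options are optional choices to shorten the output.

```sas
/* Chi-square tests of sample against several categorical variables */
/* (employment and sex are stand-ins for your own variable names)   */
proc freq data=have;
  tables sample*(employment sex) / chisq nocol nopercent;
run;
```

This requests one crosstabulation and one chi-square test per variable listed in the parentheses.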
This is so helpful!!! Thank you!!! I'm running into problems with some of my variables when doing this: some are working, others are not.
I have
data file2;
set file1;
if dep=. then sample=2;
if dep=1 then sample=1;
if dep=2 then sample=1;
run;
I then run a table
Proc freq data=file2;
tables sample;
run;
And the table shows all 2,500 of my participants assigned to either sample 1 or sample 2.
When I try to run the chi square for employment, though, sample 2 comes up blank.
I did:
proc freq data=file2;
tables sample*emp/chisq;
run;
And the table comes up with sample 2 empty.
I checked file2 and there are plenty of people in sample 2 who answered the employment question, so it isn't that the data are missing.
Any ideas?
Thanks!!!
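By default, PROC FREQ leaves observations with a missing value in any of the table variables out of the table, so a sample whose employment values are all missing will not show up in the crosstabulation. To get a quick overview of several categorical variables, including their missing values, I often use PROC FREQ with the MISSING and LIST options in the TABLES statement.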
So I would run
proc freq data=file2;
tables dep*sample*emp / missing list;
run;
and examine the resulting output. What does it look like for your file2?
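If the listing turns out to be long, a more compact check is to count the non-missing employment values per sample. This is a minimal sketch that assumes the variable names "sample" and "emp" from your posted code; the column names n_total and n_emp_nonmissing are just labels I made up for the output.

```sas
/* Per sample: total observations vs. observations with non-missing emp */
proc sql;
  select sample,
         count(*)   as n_total,
         count(emp) as n_emp_nonmissing
  from file2
  group by sample;
quit;
```

A sample with n_emp_nonmissing equal to 0 would be dropped entirely from a crosstabulation that does not use the MISSING option.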
Thank you so much! When I run that I get: a chart with
Sample 2-full time-400
Sample 2-parttime-400
Sample 2- not working-700
Sample 1-missing=1500
So it is pulling all of sample 2 but only has "missing" in sample 1. But the "missing" in sample 2 adds up to the total number in the dataset in sample 2...!!! Is it possible that everyone in sample 1 did not answer this question?! Ahh!
@klongway wrote:
... When I run that I get: a chart with
Sample 2-full time-400
Sample 2-parttime-400
Sample 2- not working-700
Sample 1-missing=1500
Assuming that "Sample 2" in the above PROC FREQ output refers to your "subsample" consisting of 1000 participants, I would be wondering why the corresponding frequencies, 400, 400 and 700, add up to 1500, not 1000.