Posted 06-07-2021 02:32 PM
Hi,

I have a dataset that contains the full sample (2,500 participants), within which I have a subsample that I'm using for analysis (1,000 participants). I am trying to see if there are statistically significant differences between the whole sample and my subsample. I am using t tests for my interval variables, but I also have a lot of categorical variables, so I am trying to use chi-square tests for those.

To differentiate between the whole sample and the subsample I have a new variable called "sample". If the value in the "sample" column is 1, the participant is part of the large sample but not the subsample; if the value is 2, the participant is part of the subsample.

For t tests I used the class statement saying "class sample; var age" which seems to have worked.
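
For reference, a complete step along those lines might look like the sketch below. The data set name `mydata` is a placeholder (the post does not name the data set), and the use of PROC TTEST is an assumption based on the CLASS and VAR statements quoted above.

```sas
/* Hypothetical data set name: replace mydata with the real one */
proc ttest data=mydata;
  class sample;  /* grouping variable with exactly two levels (1 and 2) */
  var age;       /* interval variable compared between the two groups */
run;
```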

Is there a similar way to do this for a chi-square test? I want to compare employment (employed or not) for sample 1 and sample 2.

Thanks!

Hi @klongway,

So you always compare the 1000 observations with sample=2 to the remaining 1500 with sample=1. The example below shows how to do this comparison for a categorical variable:

```
/* Create test data for demonstration */
data have;
set sashelp.heart(obs=2500 rename=(status=employment));
sample=1+(_n_<=1000);
run;
/* Perform chi-square test */
proc freq data=have;
tables sample*employment / chisq;
run;
```
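
Since the question mentions many categorical variables, the same approach can be extended to several variables in one TABLES statement. The following is only a sketch on the demonstration data: `sex` and `bp_status` are other variables in `sashelp.heart` standing in for real survey variables, and `chisq_results` is an arbitrary name for the output data set. It assumes the ODS table produced by the CHISQ option is named `ChiSq`.

```sas
/* Same demonstration data as above */
data have;
  set sashelp.heart(obs=2500 rename=(status=employment));
  sample = 1 + (_n_ <= 1000);  /* first 1000 obs get sample=2, the rest sample=1 */
run;

/* Save the chi-square statistics of all requested tables in one data set */
ods output ChiSq=chisq_results;
proc freq data=have;
  tables sample*(employment sex bp_status) / chisq;
run;

/* One row per table: the Pearson chi-square statistic with DF and p-value */
proc print data=chisq_results;
  where Statistic = 'Chi-Square';
run;
```

Listing the variables in parentheses requests one two-way table per variable, each crossed with `sample`.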

This is so helpful!!! Thank you!!! I'm running into problems with some of my variables when doing this: some are working, others are not.

I have

```
data file2;
set file1;
if dep=. then sample=2;
if dep=1 then sample=1;
if dep=2 then sample=1;
run;
```

I then run a table

```
proc freq data=file2;
tables sample;
run;
```

And the table shows all 2,500 participants assigned to either sample 1 or sample 2.

When I try to run the chi square for employment, though, sample 2 comes up blank.

I did:

```
proc freq data=file2;
tables sample*emp / chisq;
run;
```

And the table comes up with sample 2 empty.

I checked the file 2 and I have plenty of people who answered the employment question in sample 2, so it isn't that there isn't any data.

Any ideas?

Thanks!!!

To get a quick overview of several categorical variables I often use PROC FREQ with the MISSING and LIST options in the TABLES statement.

So I would run

```
proc freq data=file2;
tables dep*sample*emp / missing list;
run;
```

and examine the resulting output. What does it look like for your file2?

Thank you so much! When I run that I get: a chart with

Sample 2-full time-400

Sample 2-parttime-400

Sample 2- not working-700

Sample 1-missing=1500

So it is pulling all of sample 2, but sample 1 only has "missing". But the "missing" in sample 2 adds up to the total number in the dataset in sample 2! Is it possible that everyone in sample 1 did not answer this question?! Ahh!

@klongway wrote:

... When I run that I get: a chart with

Sample 2-full time-400

Sample 2-parttime-400

Sample 2- not working-700

Sample 1-missing=1500

Assuming that "Sample 2" in the above PROC FREQ output refers to your "subsample" consisting of 1000 participants, I would be wondering why the corresponding frequencies, 400, 400 and 700, add up to 1500, not 1000.
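
One way to check these counts is to tabulate `emp` within each value of `sample` while keeping missing values in the table. The step below is a minimal sketch that reuses the names `file2`, `sample` and `emp` from the posts above. By default PROC FREQ leaves observations with a missing value in any table variable out of the crosstabulation, so a group whose `emp` values are all missing contributes no cells unless the MISSING option is specified.

```sas
/* Counts only: MISSING keeps missing emp values as their own column */
proc freq data=file2;
  tables sample*emp / missing norow nocol nopercent;
run;
```

The row totals of this table should match the one-way frequencies of `sample`, which makes it easier to see whether the 1,000 and 1,500 participants ended up under the intended codes.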

