Calcite | Level 5



I have a dataset that contains the full sample (2,500 participants), within which I have a subsample that I'm using for analysis (1,000 participants). I am trying to see if there are statistically significant differences between the whole sample and my subsample. I am using t tests for my interval variables, but I also have a lot of categorical variables, so I am trying to use chi-square tests to see if there are significant differences in those.


To differentiate between the whole sample and the subsample, I have a new variable called "sample". If the value in the "sample" column is 1, that participant is part of the large sample but not the subsample; if the value is 2, the participant is part of the subsample.


For the t tests I used a CLASS statement ("class sample; var age"), which seems to have worked.


Is there a similar way to do this for chi-square? I want to compare employment (employed or not) for sample 1 and sample 2.





Jade | Level 19

Hi @klongway,


So you always compare the 1000 observations with sample=2 to the remaining 1500 with sample=1. The example below shows how to do this comparison for a categorical variable:

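/* Create test data for demonstration */

data have;
set sashelp.heart(obs=2500 rename=(status=employment));
/* Assumed for the demo: treat the first 1000 observations as the subsample */
sample = ifn(_n_ <= 1000, 2, 1);
run;

/* Perform chi-square test */

proc freq data=have;
tables sample*employment / chisq;
run;
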
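Since you mention having many categorical variables, here is a minimal sketch of how several of them could be tested in one PROC FREQ step. The dataset name `demo`, the choice of `sashelp.heart` variables, and the way `sample` is assigned are assumptions for illustration only; substitute your own dataset and variable names. The `ods output` line collects the chi-square results into a dataset (here called `chisq_results`, a hypothetical name) so the p-values can be reviewed together.

```sas
/* Demo data (assumption): the first 1000 rows stand in for the subsample */
data demo;
  set sashelp.heart(obs=2500);
  sample = ifn(_n_ <= 1000, 2, 1);
run;

/* Capture the chi-square tables so all p-values end up in one dataset */
ods output ChiSq=chisq_results;

proc freq data=demo;
  /* sample*(a b c) is shorthand for sample*a sample*b sample*c */
  tables sample*(status sex bp_status) / chisq;
run;

/* Keep only the Pearson chi-square row for each table */
proc print data=chisq_results noobs;
  where Statistic = 'Chi-Square';
  var Table Statistic DF Value Prob;
run;
```

For a 2x2 table such as sample by employed/not employed, the CHISQ option also reports Fisher's exact test, which can be useful when cell counts are small.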
Calcite | Level 5

This is so helpful!!! Thank you!!! I'm running into problems with some of my variables when doing this: some are working, others are not.


I have 

data file2;
set file1;
if dep=. then sample=2;
if dep=1 then sample=1;
if dep=2 then sample=1;
run;


I then run a table

proc freq data=file2;
tables sample;
run;


And the table shows all 2,500 of my participants assigned to either sample 1 or sample 2.

When I try to run the chi square for employment, though, sample 2 comes up blank.


I did:

proc freq data=file2;
tables sample*emp / chisq;
run;



And the table comes up with sample 2 empty.


I checked file2 and I have plenty of people in sample 2 who answered the employment question, so it isn't that there isn't any data.


Any ideas?



Jade | Level 19

To get a quick overview of several categorical variables I often use PROC FREQ with the MISSING and LIST options in the TABLES statement.

So I would run

proc freq data=file2;
tables dep*sample*emp / missing list;
run;

and examine the resulting output. What does it look like for your file2?
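As background on why an entire row can disappear from a crosstab: by default PROC FREQ leaves out any observation that has a missing value for one of the table variables and only reports the count as "Frequency Missing" beneath the table. The sketch below uses a small hypothetical dataset (`demo`, with made-up values for `emp`) to show the difference between the default, MISSPRINT, and MISSING; it is an illustration under those assumptions, not a reconstruction of your file2.

```sas
/* Hypothetical toy data: emp is missing for every sample=1 row */
data demo;
  input sample emp $;
  datalines;
1 .
1 .
1 .
2 FT
2 PT
2 NW
;
run;

proc freq data=demo;
  /* Default: observations with missing emp are excluded, so sample=1 drops out */
  tables sample*emp;
  /* MISSPRINT: show the missing frequencies, but keep them out of percentages and statistics */
  tables sample*emp / missprint;
  /* MISSING: treat missing as a regular level, counted in percentages and statistics */
  tables sample*emp / missing;
run;
```

If one group shows up only with missing values in a listing like this, a chi-square test comparing the two groups on that variable has nothing to compare, which would explain an empty row in the CHISQ output.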

Calcite | Level 5

Thank you so much! When I run that I get: a chart with 

Sample 2-full time-400

Sample 2-parttime-400

Sample 2- not working-700

Sample 1-missing=1500


So it is pulling all of sample 2 but only has the "missing" in sample 1. But the "missing" in sample 2 adds up to the total number in the dataset in sample 2....!!! Is it possible everyone in sample 1 did not answer this question?! Ahh!


Jade | Level 19

@klongway wrote:

... When I run that I get: a chart with 

Sample 2-full time-400

Sample 2-parttime-400

Sample 2- not working-700

Sample 1-missing=1500

Assuming that "Sample 2" in the above PROC FREQ output refers to your "subsample" consisting of 1000 participants, I would be wondering why the corresponding frequencies, 400, 400 and 700, add up to 1500, not 1000.
