Re: How do I prepare this data for a chisq test in SAS?

bazingarollcall · Posted 10-20-2021 10:59 AM

Hello,

I have the following dataset:

Adult_Survey_Results	agec	MEAN	N	Frequency	Percent	Cumulative
1.Access	Adult	1.8631301731	2215	1	4.17	1
1.Access	Older Adult	2.0261437908	306	1	4.17	2
1.Access	Young Adult	1.8697916667	128	1	4.17	3
2.Quality and Appropriateness	Adult	1.9121645347	2215	1	4.17	4
2.Quality and Appropriateness	Older Adult	2.107480029	306	1	4.17	5
2.Quality and Appropriateness	Young Adult	1.8784722222	128	1	4.17	6

Where agec is broken into 3 categories ("Young Adult", "Adult", and "Older Adult") and Adult_Survey_Results is broken into 2 domains ("Access" and "Quality and Appropriateness").

I want to answer the question "What is the amount of interaction between the age groups and the survey domain?" Essentially, does age group affect the client's answer to the survey?

I've tried the simple

proc freq data=b2_table;
table Adult_Survey_Results*gender / chisq;
run;

But it prints results based on the row frequency of Adult_Survey_Results, whereas I think I need them based on the N column instead.

How would you go about this?

Thank you.

Reeza · Posted 10-20-2021 11:25 AM

That looks like output from PROC MEANS. What does your raw data look like? Which data set are you using in PROC FREQ: the output from PROC MEANS or the raw data?

bazingarollcall · Posted 10-20-2021 11:42 AM

You're right, this is a proc means output. I've output this table to its own dataset called b2_table. I am using this same dataset for the proc freq table.

Reeza · Posted 10-20-2021 11:47 AM

@bazingarollcall wrote:
You're right, this is a proc means output. I've output this table to its own dataset called b2_table. I am using this same dataset for the proc freq table.

Assuming 'this same dataset' is b2_table, you need to add a WEIGHT N statement to your PROC FREQ, or use the raw data instead.

proc freq data=b2_table;
table Adult_Survey_Results*gender / chisq; 
weight N;
run;
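
To illustrate what the WEIGHT statement does, here is a minimal sketch using made-up counts. The data set name `counts`, its variables, and every number in it are hypothetical and are not taken from this thread. Each row is one cell of the table, and WEIGHT tells PROC FREQ to treat `count` as the number of observations in that cell instead of counting each row once.

```sas
/* Hypothetical pre-aggregated data: one row per table cell */
data counts;
   input group $ answer $ count;
   datalines;
A Yes 10
A No 20
B Yes 15
B No 5
;
run;

/* Without WEIGHT every cell would have frequency 1;        */
/* with WEIGHT the cell frequencies are 10, 20, 15 and 5.   */
proc freq data=counts;
   tables group*answer / chisq;
   weight count;
run;
```

This only works as intended if each cell appears exactly once in the input and `count` really is a count of observations for that cell.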

ballardw · Posted 10-20-2021 11:55 AM

It appears that you would use the data set that you used for INPUT to Proc Means as the data for Proc freq.
The tables statement would look like:
Tables Adult_Survey_Results * agec / chisq;

bazingarollcall · Posted 10-20-2021 12:47 PM

I think my question needs more back-up information.

The code I used to create the b2_stats database is as follows:

proc summary data=formatting mean std lclm uclm n noprint;
class agec gender;
var mean_func ;
var mean_sc ;
var mean_acc ;
var mean_qa ;
var mean_out ;
var mean_part ;
var mean_sat ;
var mean_qol;
output out=b1_b2_stats;

output out=mean1 mean=;
*output out=uclm1 uclm=;
*output out=lclm1 lclm=;

run;

data b1_b2_stats2;
format _STAT_ $30.;
set b1_b2_stats
mean1(in=in2) ;
if in2 then _STAT_ = 'Mean';
run;

proc sort data=b1_b2_stats2
out=b1_stats;
by agec _TYPE_ _stat_;
run;

proc transpose data=b1_stats out=b1_han;
by agec _TYPE_;
id _stat_;
run;

data b1_table /*(keep=agec _TYPE_ _STAT_ Adult_Survey_Results Responses Number_Positive Percent_Positive Confidence_Interval)*/;
format Adult_Survey_Results $40.;

set b1_han;
if _NAME_ = 'mean_acc' then Adult_Survey_Results= '1.Access';
else if _NAME_= 'mean_qa' then Adult_Survey_Results= '2.Quality and Appropriateness';
else if _NAME_= 'mean_func' then Adult_Survey_Results= '7.Functioning';
else if _NAME_= 'mean_sat' then Adult_Survey_Results= '5.General Satisfaction';
else if _NAME_= 'mean_out' then Adult_Survey_Results= '3.Outcomes';
else if _NAME_= 'mean_sc' then Adult_Survey_Results= '6.Social Connectedness';
else if _NAME_= 'mean_part' then Adult_Survey_Results= '4.Participation In Treatment Planning';
else if _NAME_= 'mean_qol' then Adult_Survey_Results= '8.Quality of Life Assessment';
else if _NAME_='_FREQ_' and agec='Adult' then Adult_Survey_Results='Adult Overall';
else if _NAME_='_FREQ_' and agec='Older Adult' then Adult_Survey_Results='O.A. Overall';
else if _NAME_='_FREQ_' and agec='Young Adult' then Adult_Survey_Results='Y.A. Overall';

Responses=N;
STD1=STD;
MEAN1=MEAN*100;
*P951=P95;
*LCLM1=round(LCLM*N);
*UCLM1=round(UCLM*N);
*Percent_Positive=(Number_Positive /Responses);
*Confidence_Interval=cats(trim(LCLM1),'-',trim(UCLM1));
run;

The chisq option produced the attached table, which is very close to what I need but still not quite right. The WEIGHT statement pulled from N, but the program summed the totals of all the rows (which already add up to the total of the entire dataset, 2757) and added all of those together, which is incorrect. I need the overall total to remain that of the entire dataset (2757).

Proc report of b2_table is also attached.

proc freq data=b2_table;
table Adult_Survey_Results*agec / chisq;
weight N;
run;

I'm thinking that I need to somehow "pluck" the N of each Adult_Survey_Results , along with the agec categories and their frequencies, into another table and use this for chisq. Would this work?

ballardw · Posted 10-20-2021 01:53 PM

@bazingarollcall wrote:

The chisq option produced the attached table, which is very close to what I need but still not quite right. The WEIGHT statement pulled from N, but the program summed the totals of all the rows (which already add up to the total of the entire dataset, 2757) and added all of those together, which is incorrect. I need the overall total to remain that of the entire dataset (2757).

Look at your output data sets from Proc Summary, such as b1_b2_stats. You will see a variable named _type_ that indicates the combinations of the Class variables. Since you have two class variables, you will have 4 levels of _type_: 0, 1, 2 and 3. 0 is the summary over all records, 1 is each level of one of the class variables, and 2 is each level of the other class variable. It is very likely that you want to use the NWAY option on Proc Summary to include only the _type_ = 3 rows, which are the actual combinations of the levels of both class variables.

Otherwise N is going to be about 4 times the number of original records.
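
A minimal sketch of the NWAY suggestion, assuming the input data set `formatting` and the variables shown in the code earlier in this thread. The output names `nway_stats` and `type3_only` are hypothetical, and only two of the analysis variables are listed to keep it short.

```sas
/* NWAY keeps only the highest _TYPE_ value (here 3),        */
/* i.e. the rows for actual agec*gender combinations.        */
proc summary data=formatting nway;
   class agec gender;
   var mean_acc mean_qa;
   /* AUTONAME yields names such as mean_acc_Mean, mean_acc_N */
   output out=nway_stats mean= n= / autoname;
run;

/* Alternative: leave the original step as is and filter afterwards */
data type3_only;
   set b1_b2_stats;
   where _type_ = 3;
run;
```

Either way, the overall and one-way summary rows no longer get added into the totals.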

I am very confused about doing Chisq with the means of multiple variables as categories though. What is the exact question this chisq is supposed to answer? "Amount of interaction" is not what a chisq tests for. It checks for similarity of the distribution of values between two variables. In other words, given the counts, are they close to the counts you would expect if the rows and columns were distributed the same? It is more of a yes/no test than a "how much" test.
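
For reference, the standard Pearson chi-square statistic for a two-way table can be written out as below. The symbols are notation introduced here, not from the thread: $O_{ij}$ is the observed count in row $i$, column $j$; $R_i$ and $C_j$ are the row and column totals; $n$ is the grand total.

```latex
E_{ij} = \frac{R_i \, C_j}{n},
\qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

The inputs are counts of observations in each cell, which is why means of survey scores do not fit naturally into this test.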

bazingarollcall · Posted 10-21-2021 09:12 AM

I thought about your response overnight and can't thank you enough for it.

I no longer think chisq is appropriate in this situation; I will need something like a T-Test to determine if there is a significant difference between the means of the 3 groups of agec.
