BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kweg
Fluorite | Level 6

Hi all,

 

I have a data set with 100 observations; each observation represents an organization that submitted a survey response to us that includes their demographic distributions as proportions. For example, an organization reported members aged 20-29 at 10% of total, 30-39 at 30% of total, 40-49 at 50% of total, and 50-59 at 10% of total, whereas another organization reported 30%, 40%, 30%, and 0% respectively.

 

I would like to create a variable that indicates whether or not each of these age distributions significantly differ from the known age distribution in the area (for this hypothetical, lets say 30%, 30%, 25%, and 15%.

 

As I'd like to avoid typing these manually into a chi-sq on OpenEpi, is there a way to code this to generate that "goodness of fit" variable automatically?

 

Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

@kweg wrote:
Thank you for responding.

Here is a sample of the age group values we were provided for 25
observations. The variables are age groups 10-19, 20-29, 30-39, 40-49, and
50-59. The comparison values would be 10000, 20000, 25000, 40000. and 30000

0 0 0 1 0
0 0 17 42 7
0 0 1 3 0
0 1 152 184 45
0 2 24 9 0
0 0 2 7 3
0 0 30 20 10
0 1 4 2 0
0 0 1 0 0
0 1 54 31 8
0 3 31 9 1
0 0 11 15 2
0 7 187 132 22
0 4 144 64 6
0 7 107 54 10
0 5 13 10 2
0 0 20 25 5
0 6 222 461 84
0 10 55 56 5
0 25 475 600 100
0 3 88 128 18
0 0 20 20 1
0 0 2 16 1
22 50 329 333 70
0 11 464 706 186

To answer your questions:

I am looking to determine if each organization is different from the
comparison population. I would then go back to see which variables may
account for said differences.

Thank you for reminding me that the population size will affect the
outcome, and I really appreciate your help!



Right off hand I can tell that since most of your organizations have no one in the first group that is likely to be an issue.

Any of the groups with 2 or more zeroes almost certainly will not get a reliable estimate. The chi-square tests get pretty flaky when one of the cells has a count < 5 much less all of them.

 

Here is one way. The first step reads the data into a structure that is needed for a chi-square test assigning an organization based on the order of the values in the data. So we have an organization, group (your age groups) and count for each level of the group. Similar for the Reference data which has an arbitrary largish value for the org variable. Then we combine the data so that the reference and organizations data are available. The last data set used the organization data to create one call to proc freq for each organization with the two values of org: 999 for the reference and 1 to 25 (the number of organizations). Call execute stacks instructions that execute after the data step completes.

Creating custom formats to display the range of the ages for the group variable and the names for the organizations would make the tables easier to read.

 

data have;
   org=_n_;
   do group= 1 to 5;
      input count @;
      output;
   end; 
datalines;
0 0 0 1 0
0 0 17 42 7
0 0 1 3 0
0 1 152 184 45
0 2 24 9 0
0 0 2 7 3
0 0 30 20 10
0 1 4 2 0
0 0 1 0 0
0 1 54 31 8
0 3 31 9 1
0 0 11 15 2
0 7 187 132 22
0 4 144 64 6
0 7 107 54 10
0 5 13 10 2
0 0 20 25 5
0 6 222 461 84
0 10 55 56 5
0 25 475 600 100
0 3 88 128 18
0 0 20 20 1
0 0 2 16 1
22 50 329 333 70
0 11 464 706 186
run;

data reference;
   org=999;
   do group= 1 to 5;
      input count @;
      output;
   end; 
datalines;
10000 20000 25000 40000 30000
;
run;

data analysis;
   set have
       reference
   ;
run;

data _null_;
   set have ;
   by org;
   if first.org then do;
      call execute ('proc freq data=analysis;');
      length str $100;
      str = catx(' ','where org in (999',org,');');
      call execute (str);
      call execute ('   table group*org/chisq;
         weight count;
         run;');
   end;
run;

View solution in original post

5 REPLIES 5
ballardw
Super User

@kweg wrote:

Hi all,

 

I have a data set with 100 observations; each observation represents an organization that submitted a survey response to us that includes their demographic distributions as proportions. For example, an organization reported members aged 20-29 at 10% of total, 30-39 at 30% of total, 40-49 at 50% of total, and 50-59 at 10% of total, whereas another organization reported 30%, 40%, 30%, and 0% respectively.

 

I would like to create a variable that indicates whether or not each of these age distributions significantly differ from the known age distribution in the area (for this hypothetical, lets say 30%, 30%, 25%, and 15%.

 

As I'd like to avoid typing these manually into a chi-sq on OpenEpi, is there a way to code this to generate that "goodness of fit" variable automatically?

 

Thanks in advance!


 

If you have this as a SAS data set I might help to provide an example of what your actual data looks like.

 

Are you looking to determine if each organization is different than your comparison population or the combination of all of the organizations?

Do you have any idea what population size (total number of members) for each organization? Without a base count per organization the results have the potential to be quite misleading.

kweg
Fluorite | Level 6
Thank you for responding.

Here is a sample of the age group values we were provided for 25
observations. The variables are age groups 10-19, 20-29, 30-39, 40-49, and
50-59. The comparison values would be 10000, 20000, 25000, 40000. and 30000

0 0 0 1 0
0 0 17 42 7
0 0 1 3 0
0 1 152 184 45
0 2 24 9 0
0 0 2 7 3
0 0 30 20 10
0 1 4 2 0
0 0 1 0 0
0 1 54 31 8
0 3 31 9 1
0 0 11 15 2
0 7 187 132 22
0 4 144 64 6
0 7 107 54 10
0 5 13 10 2
0 0 20 25 5
0 6 222 461 84
0 10 55 56 5
0 25 475 600 100
0 3 88 128 18
0 0 20 20 1
0 0 2 16 1
22 50 329 333 70
0 11 464 706 186

To answer your questions:

I am looking to determine if each organization is different from the
comparison population. I would then go back to see which variables may
account for said differences.

Thank you for reminding me that the population size will affect the
outcome, and I really appreciate your help!


ballardw
Super User

@kweg wrote:
Thank you for responding.

Here is a sample of the age group values we were provided for 25
observations. The variables are age groups 10-19, 20-29, 30-39, 40-49, and
50-59. The comparison values would be 10000, 20000, 25000, 40000. and 30000

0 0 0 1 0
0 0 17 42 7
0 0 1 3 0
0 1 152 184 45
0 2 24 9 0
0 0 2 7 3
0 0 30 20 10
0 1 4 2 0
0 0 1 0 0
0 1 54 31 8
0 3 31 9 1
0 0 11 15 2
0 7 187 132 22
0 4 144 64 6
0 7 107 54 10
0 5 13 10 2
0 0 20 25 5
0 6 222 461 84
0 10 55 56 5
0 25 475 600 100
0 3 88 128 18
0 0 20 20 1
0 0 2 16 1
22 50 329 333 70
0 11 464 706 186

To answer your questions:

I am looking to determine if each organization is different from the
comparison population. I would then go back to see which variables may
account for said differences.

Thank you for reminding me that the population size will affect the
outcome, and I really appreciate your help!



Right off hand I can tell that since most of your organizations have no one in the first group that is likely to be an issue.

Any of the groups with 2 or more zeroes almost certainly will not get a reliable estimate. The chi-square tests get pretty flaky when one of the cells has a count < 5 much less all of them.

 

Here is one way. The first step reads the data into a structure that is needed for a chi-square test assigning an organization based on the order of the values in the data. So we have an organization, group (your age groups) and count for each level of the group. Similar for the Reference data which has an arbitrary largish value for the org variable. Then we combine the data so that the reference and organizations data are available. The last data set used the organization data to create one call to proc freq for each organization with the two values of org: 999 for the reference and 1 to 25 (the number of organizations). Call execute stacks instructions that execute after the data step completes.

Creating custom formats to display the range of the ages for the group variable and the names for the organizations would make the tables easier to read.

 

data have;
   org=_n_;
   do group= 1 to 5;
      input count @;
      output;
   end; 
datalines;
0 0 0 1 0
0 0 17 42 7
0 0 1 3 0
0 1 152 184 45
0 2 24 9 0
0 0 2 7 3
0 0 30 20 10
0 1 4 2 0
0 0 1 0 0
0 1 54 31 8
0 3 31 9 1
0 0 11 15 2
0 7 187 132 22
0 4 144 64 6
0 7 107 54 10
0 5 13 10 2
0 0 20 25 5
0 6 222 461 84
0 10 55 56 5
0 25 475 600 100
0 3 88 128 18
0 0 20 20 1
0 0 2 16 1
22 50 329 333 70
0 11 464 706 186
run;

data reference;
   org=999;
   do group= 1 to 5;
      input count @;
      output;
   end; 
datalines;
10000 20000 25000 40000 30000
;
run;

data analysis;
   set have
       reference
   ;
run;

data _null_;
   set have ;
   by org;
   if first.org then do;
      call execute ('proc freq data=analysis;');
      length str $100;
      str = catx(' ','where org in (999',org,');');
      call execute (str);
      call execute ('   table group*org/chisq;
         weight count;
         run;');
   end;
run;
kweg
Fluorite | Level 6

Thank you! This worked beautifully and was a huge time saver!

 

I merged the first two groups since they were both had lower counts to help reduce the frequency of cells <5. Now to review the results...

ballardw
Super User

@kweg wrote:

Thank you! This worked beautifully and was a huge time saver!

 

I merged the first two groups since they were both had lower counts to help reduce the frequency of cells <5. Now to review the results...


If you are familiar with Proc Freq syntax you could likely capture the output results you want by adding an OUTPUT statement and naming the output data set based on the ORG variable. Then perhaps append all of those and create an easier to read summary report.

 

Such as:

data _null_;
   set have ;
   by org;
   if first.org then do;
      call execute ('proc freq data=analysis;');
      length str $100;
      str = catx(' ','where org in (999',org,');');
      call execute (str);
      str = catx(' ','output out=',cats('work.chisq',org),'Pchi mhchi lrchi ;');
      call execute (str);
      call execute ('   table group*org/chisq;
         weight count;
         run;');
   end;
run;

data work.allchi;
   length source $40.;
   set work.chi: indsname=dsname;
   source = dsname;
run;

The last data set combines all of the chi-square (or other statistics if the proper tables statement and stats are added to the OUTPUT statement) plus the source data set name. I might place the result in a more permanent library though.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1366 views
  • 0 likes
  • 2 in conversation