BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LEINAARE
Obsidian | Level 7

Hello,

 

I have a dataset that contains frequencies of patients within specified age categories for multiple clinics as shown below.

 

Site            Year           Age1       Age2      Age3      Age4       Age5      Age6       Age7

 

Clinic A       2017             17         163        288          911        876         617         398

Clinic B       2017              6            56        145          557        561         325         196

Clinic C       2017             12           84         216         764        714         581         224

........

 

 

I need to create an age variable 'AgeCat' that dichotomizes frequencies of clinic patients as either 'Younger' or 'Older'.  The 'Younger' category needs to contain the sum of variables 'Age1' + 'Age2' + 'Age3' and the 'Older' category needs to contain the sum of variables 'Age4' + 'Age5' + 'Age6'.  So far, I have been able to sum the values of age variables I would like to categorize.  I just don't know how to create a dichotomous variable by assigning the values of 'Younger' and 'Older'.  Below is my code so far.

 

data P2.BaselineAge2017 (keep=site year clinic age: MeanAge young old);
    set P1.FPAR_Age_Gender (rename=(FemAge15_17=Age1 FemAge18_19=Age2 FemAge20_24=Age3 
	FemAge25_29=Age4 FemAge30_34=Age5 FemAge35_39=Age6 FemAge40_44=Age7)); 
    Clinic = propcase(scan(site,1));
    if Clinic in (&clinics) and (year = '2017');
    young = sum(of Age1 - Age3);
    old = sum (of Age4 - Age7);
    MeanAge = mean (of Age1 - Age7);
    format MeanAge 8.2;
    AgeCat = /*I want '1' to equal 'young' and '2' to equal 'old'*/;
run;

I would greatly appreciate any insight.

 

Thanks,

 

Ted

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

One thing you may consider later is that you don't always need to create a separate variable to create disjoint value groups (categories that don't overlap), and often multiple variables, as in this case adds to complexity.

 

Custom formats can be used to create groups that are honored by almost all the analysis, reporting and graphing procedures. Example using the SASHELP.CLASS data set that should be installed in your system:

proc format library=work;
value classage
11-13 = 'Young'
14-16 = 'Older'
;
run;


proc freq data=sashelp.class;
   tables age;
   format age classage.;
   tables age*sex/chisq;
run;

Note that the format applied to the age variable works both to create the simple frequency table and to create a group used in the chisq test with sex. This is not a good set for testing chisq as the cell counts are small but for a syntax example it is small enough to run quickly and not produce a lot of output.

 

A big advantage of not having to add variables is that the exact same data set can be used for somewhat different analysis just by changing the format. I have data sets that use as many as 10 different age grouping formats: teen/not teen; 5 year age categories, 10 year age categories, categories that may correspond to other important ages such as minimum recommended age for a medical test or program service model.

 

By having the data in a one record per person per clinic you have lots of granularity for summaries. If you had summarized the data by age (not group) to get count per clinic and age then an actual mean age would be pretty trivial for each clinic:

Proc means data=clinicagesummary mean std;
   class clinic;
   var age;
   freq count; /* count of a single age at the clinic*/
run;

A separate distribution could be done by using age in the class statement not var and applying an age grouping format. Or in proc freq with the chisq;

 

View solution in original post

8 REPLIES 8
PaigeMiller
Diamond | Level 26

I have scratched my head at this description, and it's just not clear to me what you want the output to be.

 

So, please show us the output you want given this data. (Which, by the way, is something you should do on all similar questions ... which will lead to you getting faster answers.)

--
Paige Miller
LEINAARE
Obsidian | Level 7

Hi @PaigeMiller

 

Thanks for taking a look at this.  I need to think about how to display or describe this better.

 

LEINAARE
Obsidian | Level 7

For my current experiment, I am using two clinics.  The dataset 'P2.BaselineAge2017 ' created in my original post looks like this.

 

Site         Year    Age1   Age2   Age3   Age4  Age5   Age6  Age7    Clinic             young    old   MeanAge

Clinic A    2017     163     288     911     876     617     398    195    ClinicName1   1362    2086   492.57

Clinic B    2017      56      145     557     561     325     196      96    ClinicName2     758    1178   276.57

 

 

Basically, my research group is recruiting participants from different clinics into a longitudinal study.  Over time, we want to monitor different demographic characteristics of participants enrolled from each clinic to ensure they are representative of the clinic population they are recruited from.  My strategy to accomplish this is to use annual summary data for each clinic from the year before to estimate a null hypothesized clinic population parameter.  I have a raw data file that contains N observations for recruited participants from two clinics. 

 

I am trying to use cross-sectional data from the previous year to create estimates for use as null values.  For a t-test of average age, I want to do a one-sample t-test, using the value of 'MeanAge' (above) as the null value for each clinic.  However, I am trying to do chi-square tests for each clinic comparing frequencies of 'young' and 'old' from our actual survey sample data to dichotomized summary values 'young' and 'old' obtained from the cross-sectional data from the previous year (shown above).  I am trying to obtain a contingency table that looks like this (below). 

 

           | Baseline |  Current  |

Young   1362   |(N Young)|

  Old      2086   |  (N Old)   |

           |                |                |

 

I'm pretty well stuck right now.  I don't know if this helps, or if I'm way off base. 

LEINAARE
Obsidian | Level 7

Hi @PaigeMiller

 

I just ran what I was thinking about doing, and the logic doesn't work at all.  I'm completely off base.  I need to go back to the drawing board.  Thanks for looking at this, but I think I need to scrap the whole strategy. 

 

Thanks,

 

Ted

ballardw
Super User

@LEINAARE wrote:

Hello,

 

I have a dataset that contains frequencies of patients within specified age categories for multiple clinics as shown below.

 

Site            Year           Age1       Age2      Age3      Age4       Age5      Age6       Age7

 

Clinic A       2017             17         163        288          911        876         617         398

Clinic B       2017              6            56        145          557        561         325         196

Clinic C       2017             12           84         216         764        714         581         224

........

 

 

I need to create an age variable 'AgeCat' that dichotomizes frequencies of clinic patients as either 'Younger' or 'Older'.  The 'Younger' category needs to contain the sum of variables 'Age1' + 'Age2' + 'Age3' and the 'Older' category needs to contain the sum of variables 'Age4' + 'Age5' + 'Age6'.  So far, I have been able to sum the values of age variables I would like to categorize.  I just don't know how to create a dichotomous variable by assigning the values of 'Younger' and 'Older'.  Below is my code so far.

 

data P2.BaselineAge2017 (keep=site year clinic age: MeanAge young old);
    set P1.FPAR_Age_Gender (rename=(FemAge15_17=Age1 FemAge18_19=Age2 FemAge20_24=Age3 
	FemAge25_29=Age4 FemAge30_34=Age5 FemAge35_39=Age6 FemAge40_44=Age7)); 
    Clinic = propcase(scan(site,1));
    if Clinic in (&clinics) and (year = '2017');
    young = sum(of Age1 - Age3);
    old = sum (of Age4 - Age7);
    MeanAge = mean (of Age1 - Age7);
    format MeanAge 8.2;
    AgeCat = /*I want '1' to equal 'young' and '2' to equal 'old'*/;
run;

I would greatly appreciate any insight.

 

Thanks,

 

Ted

 


If instead of age1, age2 etc you had assigned the middle of your age categories such as Age16 you might have had a chance but the femage18_19 throws that approach off. Was the data that generated the FemAge18_19 and related read that way or created by summarizing another data set. If the result was from summarization then you might want to consider using the individual records if available.

 

Do you actually want the mean count of the age groups? Or an approximate actual mean age of the persons involved?

For a large number of purposes you may be better off with a single variable "age" and "count", often referred to as "long" data instead of variables that contain data in their names across or "wide" data.

 

LEINAARE
Obsidian | Level 7

Hi @ballardw,

 

The input data came from a wide dataset created for a different purpose.  I used the actual age group values in the variable names because I created the original dataset with the intention of sharing it with a lot of people, and figured that would make it easier for unfamiliar users to understand what each variable represents.  At first, I though I would use the 'mean' count to get a baseline average.  But when I sat down and looked at it, I remembered it was frequencies for age categories, so I wouldn't be able to get a true average age.  I've decided to use chi-square instead of t-test to get around that.

 

I ended up using means and transpose procedures to create a dataset that looks like this:

 

Clinic       AgeCat     Time      Count

 

clinic1           1             1          1364

clinic1           1             2              33

clinic1           2             1          2086

clinic1           2             2              28

clinic2           1             1            758

clinic2           1             2              36

clinic2           2             1          1178

clinic2           2             2              42

 

Now I am doing chi-square analyses as below.

 

proc freq data=agecats;
     weight Count;
     table AgeCat*Time / chisq;
     by Clinic;
run;
ballardw
Super User

One thing you may consider later is that you don't always need to create a separate variable to create disjoint value groups (categories that don't overlap), and often multiple variables, as in this case adds to complexity.

 

Custom formats can be used to create groups that are honored by almost all the analysis, reporting and graphing procedures. Example using the SASHELP.CLASS data set that should be installed in your system:

proc format library=work;
value classage
11-13 = 'Young'
14-16 = 'Older'
;
run;


proc freq data=sashelp.class;
   tables age;
   format age classage.;
   tables age*sex/chisq;
run;

Note that the format applied to the age variable works both to create the simple frequency table and to create a group used in the chisq test with sex. This is not a good set for testing chisq as the cell counts are small but for a syntax example it is small enough to run quickly and not produce a lot of output.

 

A big advantage of not having to add variables is that the exact same data set can be used for somewhat different analysis just by changing the format. I have data sets that use as many as 10 different age grouping formats: teen/not teen; 5 year age categories, 10 year age categories, categories that may correspond to other important ages such as minimum recommended age for a medical test or program service model.

 

By having the data in a one record per person per clinic you have lots of granularity for summaries. If you had summarized the data by age (not group) to get count per clinic and age then an actual mean age would be pretty trivial for each clinic:

Proc means data=clinicagesummary mean std;
   class clinic;
   var age;
   freq count; /* count of a single age at the clinic*/
run;

A separate distribution could be done by using age in the class statement not var and applying an age grouping format. Or in proc freq with the chisq;

 

LEINAARE
Obsidian | Level 7

Hi @ballardw,

 

Wow!  I didn't know that proc formats could be used to group observations for analyses.  I though it only modified the appearance of variables.  This will definitely come in handy in the future.

 

Thank you,

 

Ted 

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 4671 views
  • 0 likes
  • 3 in conversation