Solved: Re: Creating single variable from multiple variables

LEINAARE · Posted 01-12-2019 06:17 PM

Hello,

I have a dataset that contains frequencies of patients within specified age categories for multiple clinics as shown below.

Site Year Age1 Age2 Age3 Age4 Age5 Age6 Age7

Clinic A 2017 17 163 288 911 876 617 398

Clinic B 2017 6 56 145 557 561 325 196

Clinic C 2017 12 84 216 764 714 581 224

........

I need to create an age variable 'AgeCat' that dichotomizes frequencies of clinic patients as either 'Younger' or 'Older'. The 'Younger' category needs to contain the sum of variables 'Age1' + 'Age2' + 'Age3' and the 'Older' category needs to contain the sum of variables 'Age4' + 'Age5' + 'Age6'. So far, I have been able to sum the values of age variables I would like to categorize. I just don't know how to create a dichotomous variable by assigning the values of 'Younger' and 'Older'. Below is my code so far.

data P2.BaselineAge2017 (keep=site year clinic age: MeanAge young old);
    set P1.FPAR_Age_Gender (rename=(FemAge15_17=Age1 FemAge18_19=Age2 FemAge20_24=Age3 
	FemAge25_29=Age4 FemAge30_34=Age5 FemAge35_39=Age6 FemAge40_44=Age7)); 
    Clinic = propcase(scan(site,1));
    if Clinic in (&clinics) and (year = '2017');
    young = sum(of Age1 - Age3);
    old = sum (of Age4 - Age7);
    MeanAge = mean (of Age1 - Age7);
    format MeanAge 8.2;
    AgeCat = /*I want '1' to equal 'young' and '2' to equal 'old'*/;
run;

I would greatly appreciate any insight.

Thanks,

Ted

ballardw · Posted 01-14-2019 03:49 PM

One thing you may consider later is that you don't always need to create a separate variable to create disjoint value groups (categories that don't overlap), and often multiple variables, as in this case adds to complexity.

Custom formats can be used to create groups that are honored by almost all the analysis, reporting and graphing procedures. Example using the SASHELP.CLASS data set that should be installed in your system:

proc format library=work;
value classage
11-13 = 'Young'
14-16 = 'Older'
;
run;


proc freq data=sashelp.class;
   tables age;
   format age classage.;
   tables age*sex/chisq;
run;

Note that the format applied to the age variable works both to create the simple frequency table and to create a group used in the chisq test with sex. This is not a good set for testing chisq as the cell counts are small but for a syntax example it is small enough to run quickly and not produce a lot of output.

A big advantage of not having to add variables is that the exact same data set can be used for somewhat different analysis just by changing the format. I have data sets that use as many as 10 different age grouping formats: teen/not teen; 5 year age categories, 10 year age categories, categories that may correspond to other important ages such as minimum recommended age for a medical test or program service model.

By having the data in a one record per person per clinic you have lots of granularity for summaries. If you had summarized the data by age (not group) to get count per clinic and age then an actual mean age would be pretty trivial for each clinic:

Proc means data=clinicagesummary mean std;
   class clinic;
   var age;
   freq count; /* count of a single age at the clinic*/
run;

A separate distribution could be done by using age in the class statement not var and applying an age grouping format. Or in proc freq with the chisq;

View solution in original post

PaigeMiller · Posted 01-12-2019 08:26 PM

I have scratched my head at this description, and it's just not clear to me what you want the output to be.

So, please show us the output you want given this data. (Which, by the way, is something you should do on all similar questions ... which will lead to you getting faster answers.)

--
Paige Miller

LEINAARE · Posted 01-12-2019 08:58 PM

Hi @PaigeMiller

Thanks for taking a look at this. I need to think about how to display or describe this better.

LEINAARE · Posted 01-12-2019 09:55 PM

For my current experiment, I am using two clinics. The dataset 'P2.BaselineAge2017 ' created in my original post looks like this.

Site Year Age1 Age2 Age3 Age4 Age5 Age6 Age7 Clinic young old MeanAge

Clinic A 2017 163 288 911 876 617 398 195 ClinicName1 1362 2086 492.57

Clinic B 2017 56 145 557 561 325 196 96 ClinicName2 758 1178 276.57

Basically, my research group is recruiting participants from different clinics into a longitudinal study. Over time, we want to monitor different demographic characteristics of participants enrolled from each clinic to ensure they are representative of the clinic population they are recruited from. My strategy to accomplish this is to use annual summary data for each clinic from the year before to estimate a null hypothesized clinic population parameter. I have a raw data file that contains N observations for recruited participants from two clinics.

I am trying to use cross-sectional data from the previous year to create estimates for use as null values. For a t-test of average age, I want to do a one-sample t-test, using the value of 'MeanAge' (above) as the null value for each clinic. However, I am trying to do chi-square tests for each clinic comparing frequencies of 'young' and 'old' from our actual survey sample data to dichotomized summary values 'young' and 'old' obtained from the cross-sectional data from the previous year (shown above). I am trying to obtain a contingency table that looks like this (below).

| Baseline | Current |

Young | 1362 |(N Young)|

Old | 2086 | (N Old) |

| | |

I'm pretty well stuck right now. I don't know if this helps, or if I'm way off base.

LEINAARE · Posted 01-12-2019 10:27 PM

Hi @PaigeMiller

I just ran what I was thinking about doing, and the logic doesn't work at all. I'm completely off base. I need to go back to the drawing board. Thanks for looking at this, but I think I need to scrap the whole strategy.

Thanks,

Ted

ballardw · Posted 01-14-2019 11:38 AM

@LEINAARE wrote:

Hello,

I have a dataset that contains frequencies of patients within specified age categories for multiple clinics as shown below.

Site            Year           Age1       Age2      Age3      Age4       Age5      Age6       Age7

Clinic A      2017             17       163        288          911        876         617         398

Clinic B       2017              6            56        145          557        561         325         196

Clinic C       2017             12           84         216         764        714         581         224

........

I need to create an age variable 'AgeCat' that dichotomizes frequencies of clinic patients as either 'Younger' or 'Older'. The 'Younger' category needs to contain the sum of variables 'Age1' + 'Age2' + 'Age3' and the 'Older' category needs to contain the sum of variables 'Age4' + 'Age5' + 'Age6'. So far, I have been able to sum the values of age variables I would like to categorize. I just don't know how to create a dichotomous variable by assigning the values of 'Younger' and 'Older'. Below is my code so far.
data P2.BaselineAge2017 (keep=site year clinic age: MeanAge young old);
    set P1.FPAR_Age_Gender (rename=(FemAge15_17=Age1 FemAge18_19=Age2 FemAge20_24=Age3 
	FemAge25_29=Age4 FemAge30_34=Age5 FemAge35_39=Age6 FemAge40_44=Age7)); 
    Clinic = propcase(scan(site,1));
    if Clinic in (&clinics) and (year = '2017');
    young = sum(of Age1 - Age3);
    old = sum (of Age4 - Age7);
    MeanAge = mean (of Age1 - Age7);
    format MeanAge 8.2;
    AgeCat = /*I want '1' to equal 'young' and '2' to equal 'old'*/;
run;
I would greatly appreciate any insight.

Thanks,

Ted

If instead of age1, age2 etc you had assigned the middle of your age categories such as Age16 you might have had a chance but the femage18_19 throws that approach off. Was the data that generated the FemAge18_19 and related read that way or created by summarizing another data set. If the result was from summarization then you might want to consider using the individual records if available.

Do you actually want the mean count of the age groups? Or an approximate actual mean age of the persons involved?

For a large number of purposes you may be better off with a single variable "age" and "count", often referred to as "long" data instead of variables that contain data in their names across or "wide" data.

LEINAARE · Posted 01-14-2019 01:44 PM

Hi @ballardw,

The input data came from a wide dataset created for a different purpose. I used the actual age group values in the variable names because I created the original dataset with the intention of sharing it with a lot of people, and figured that would make it easier for unfamiliar users to understand what each variable represents. At first, I though I would use the 'mean' count to get a baseline average. But when I sat down and looked at it, I remembered it was frequencies for age categories, so I wouldn't be able to get a true average age. I've decided to use chi-square instead of t-test to get around that.

I ended up using means and transpose procedures to create a dataset that looks like this:

Clinic AgeCat Time Count

clinic1 1 1 1364

clinic1 1 2 33

clinic1 2 1 2086

clinic1 2 2 28

clinic2 1 1 758

clinic2 1 2 36

clinic2 2 1 1178

clinic2 2 2 42

Now I am doing chi-square analyses as below.

proc freq data=agecats;
     weight Count;
     table AgeCat*Time / chisq;
     by Clinic;
run;

ballardw · Posted 01-14-2019 03:49 PM

One thing you may consider later is that you don't always need to create a separate variable to create disjoint value groups (categories that don't overlap), and often multiple variables, as in this case adds to complexity.

Custom formats can be used to create groups that are honored by almost all the analysis, reporting and graphing procedures. Example using the SASHELP.CLASS data set that should be installed in your system:

proc format library=work;
value classage
11-13 = 'Young'
14-16 = 'Older'
;
run;


proc freq data=sashelp.class;
   tables age;
   format age classage.;
   tables age*sex/chisq;
run;

Note that the format applied to the age variable works both to create the simple frequency table and to create a group used in the chisq test with sex. This is not a good set for testing chisq as the cell counts are small but for a syntax example it is small enough to run quickly and not produce a lot of output.

A big advantage of not having to add variables is that the exact same data set can be used for somewhat different analysis just by changing the format. I have data sets that use as many as 10 different age grouping formats: teen/not teen; 5 year age categories, 10 year age categories, categories that may correspond to other important ages such as minimum recommended age for a medical test or program service model.

By having the data in a one record per person per clinic you have lots of granularity for summaries. If you had summarized the data by age (not group) to get count per clinic and age then an actual mean age would be pretty trivial for each clinic:

Proc means data=clinicagesummary mean std;
   class clinic;
   var age;
   freq count; /* count of a single age at the clinic*/
run;

A separate distribution could be done by using age in the class statement not var and applying an age grouping format. Or in proc freq with the chisq;

LEINAARE · Posted 01-14-2019 04:49 PM

Hi @ballardw,

Wow! I didn't know that proc formats could be used to group observations for analyses. I though it only modified the appearance of variables. This will definitely come in handy in the future.

Thank you,

Ted

Ready to join fellow brilliant minds for the SAS Hackathon?