Re: computing means of a categorical variable

pammers · Posted 07-30-2018 03:54 PM

I am still a novice ... so easily confused! I was told to compute the means for all explanatory variables (SAS 9.4). But, the variables are all categorical variables. For example, I tried to get the means for the two health reasons (physical and mental health), coded as 1 and 2. I used the following code:

proc means data=onespell mean std;
class healthreason;
var spell_length;
run;

This is not giving me what I want. Apparently, because the data is stacked, it includes all observations on a person. I just want a file with one observation per person. My co-investigator responded thus:

"Are you looking for the average for persons in your sample? I am not sure what data is going into the procedure. It appears that the data you are inputting is all the observations on the first spell, so many of those people have multiple observations because they have many months in that spell. If that is the case, then the average will be biased by people who have longer first spells and more observation months. Also, the sample count for the mean will not be people but observation months (many more observation months than people). Lastly, I thought you coded health reason as a categorical variable with three categories (0, 1, and 2 or something like that). So that would mean that you are better off looking at the proportion of the sample in each category rather than a mean value. An average of 0, 1, and 2 values would not give you a meaningful mean value."

Serving only to confuse me more.

Can someone help me unravel this?

PaigeMiller · Posted 07-30-2018 04:01 PM

Since you want to compute a mean of a categorical variable, would you please explain in your own words what that means? Don't even use SAS or other computer language in your explanation, just try to articulate what the mean of a categorical variable really means.

--
Paige Miller

pammers · Posted 07-30-2018 04:20 PM

To be clear, I don't see the point in computing means ... However, since I must ...

The means is an average of all counts within an observation. So let's take health reason. I want to know how many people in my sample have physical illness vs. those with mental illness. A proc freq provides a descriptive table and tells me that (making up a number here) 122,000 have PI and 132,000 have MI; the percentage is provided in the SAS output, so I know my sample consists of 46% (again, only for example's sake). However, I need to also do this for the regression model in which I have 10 explanatory variables. The means and STD do not get automatically generated in a proc gen. How do I get the same information with the model as I did with the descriptive stat?

PaigeMiller · Posted 07-30-2018 04:27 PM

Okay, good, we agree that it doesn't make sense to compute a mean of a categorical variable. The only descriptive statistic that we can compute is the percent in each category.

However, I need to also do this for the regression model in which I have 10 explanatory variables.

I don't know what this means, or why you need to do this for a regression model.

The means and STD do not get automatically generated in a proc gen. How do I get the same information with the model as I did with the descriptive stat?

Do you mean proc reg? Why do you need means and standard deviations here, if we have already established that these can't be computed for categorical variables? Please explain.

--
Paige Miller

pammers · Posted 07-30-2018 04:34 PM

Oh good. So I am not a complete idiot! This is the answer I received when I asked that very question:

"Generally, for explanatory variable descriptives one would do means and standard deviations for continuous variables. For categorical variables it is best to present the proportion of the sample in each category. You will need to also make a distinction between a person, versus observation months on a person within a spell, versus multiple spells of a person in the program. The data is currently structured so that observation months are stacked. If you simply do a proc means (or other stat) for a variable, you will get a mean value for all the observation months across multiple spells within and across persons. To have a person file, you need to keep only one observation per person and then do a mean (or other stat). To have a spell file, you need to keep only one observation per spell and then do a mean (or other stat). What observation to keep for a person or a spell is also critical. For example, only the last observation in a spell has info about exit"

Again. I am using the proc gen procedure with a clog-log model.
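The "person file" idea in the quoted advice could be sketched roughly as below. This is only an illustration, not the co-investigator's actual method: person_id and month are hypothetical names for the person identifier and the observation-month counter, and it assumes onespell holds a single spell per person, so that one row per person is also one row per spell.

```sas
/* Sketch only: person_id and month are hypothetical variable names. */
proc sort data=onespell out=onespell_sorted;
  by person_id month;
run;

/* Keep one row per person: the last observation month,
   which is the row that carries the exit information. */
data person_file;
  set onespell_sorted;
  by person_id;
  if last.person_id;
run;

/* Categorical variable: proportion of persons in each category. */
proc freq data=person_file;
  tables healthreason;
run;

/* Continuous variable: mean and standard deviation across persons. */
proc means data=person_file n mean std;
  var spell_length;
run;
```

If the data held several spells per person, a spell file would need a spell identifier as a second BY variable (say a hypothetical spell_id) with `if last.spell_id;` as the subsetting condition.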

PaigeMiller · Posted 07-30-2018 05:50 PM

@pammers wrote:

"Generally, for explanatory variable descriptives one would do means and standard deviations for continuous variables. For categorical variables it is best to present the proportion of the sample in each category. You will need to also make a distinction between a person, versus observation months on a person within a spell, versus multiple spells of a person in the program. The data is currently structured so that observation months are stacked. If you simply do a proc means (or other stat) for a variable, you will get a mean value for all the observation months across multiple spells within and across persons. To have a person file, you need to keep only one observation per person and then do a mean (or other stat). To have a spell file, you need to keep only one observation per spell and then do a mean (or other stat). What observation to keep for a person or a spell is also critical. For example, only the last observation in a spell has info about exit"

Without knowing your data, a lot of the above is gibberish. I thought only witches and warlocks had multiple spells. But in general, I have not come across the noun "spell" in any particular field of endeavor other than writing and casting spells over people. Most of the above quote from your professor (?) is meaningless to me in the statistical context of fitting a model.

--
Paige Miller

Reeza · Posted 07-30-2018 08:56 PM

Your data has repeated measures, so you need to account for that as well. You cannot run PROC MEANS or PROC FREQ without accounting for it; otherwise you're counting the duplicates.
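One quick way to see how much the repeated rows matter is to compare the number of rows with the number of distinct persons. A minimal sketch, assuming a person identifier that I am calling person_id here (a hypothetical name; use whatever your data actually has):

```sas
/* person_id is a hypothetical name for the person identifier. */
proc sql;
  select count(*)                  as n_rows,
         count(distinct person_id) as n_persons
  from onespell;
quit;
```

If n_rows is much larger than n_persons, any PROC MEANS or PROC FREQ run directly on onespell is summarizing observation months rather than people.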

pammers · Posted 07-31-2018 08:36 AM

Hi Paige,

Thank you. I realize that, but I am not sure how to do it. I used by first.X (X being my variable). Anyway, I was exhausted last night, so maybe I will figure it out today. I used proc genmod to generate the model. Sorry I confused you on that. So to get the means of each variable I have been using this code:

Proc means Data=onespell (the constructed variable)
Class X
Var (all my list of dependent variables)

Thank you for your patience and kindness.

Reeza · Posted 07-30-2018 04:36 PM

@pammers wrote:

To be clear, I don't see the point in computing means ... However, since I must ...

You would never compute the mean of a categorical variable so don't bother.

If you recode them as 0/1 and then calculate the mean, you'll get the EXACT same information as the percentage. Check it. If they're off you likely haven't accounted for missing the same in both procedures.
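Here is a small made-up example of that check. The data set demo and the variable mental are invented for illustration; healthreason follows the 1/2 coding described earlier in the thread. Note the guard on the recode: a bare `mental = (healthreason = 2);` would turn a missing healthreason into 0, so PROC MEANS would count a row that PROC FREQ leaves out of its percentages by default.

```sas
/* Made-up data: 1 = physical, 2 = mental, . = missing. */
data demo;
  input healthreason @@;
  /* Recode to 0/1 but leave missing values missing. */
  if not missing(healthreason) then mental = (healthreason = 2);
  datalines;
1 2 2 1 2 . 2 1
;
run;

/* Percent in each category (missing excluded by default). */
proc freq data=demo;
  tables healthreason;
run;

/* Mean of the 0/1 indicator = proportion with healthreason 2. */
proc means data=demo n nmiss mean;
  var mental;
run;
```

With these eight rows there are seven non-missing values, four of them coded 2, so both procedures should report the same share for category 2, roughly 57%.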

pammers · Posted 07-30-2018 04:40 PM

Sorry, I don't understand what you mean by "you likely haven't accounted for missing the same in both procedures." At any rate, what code do I use to get this?

ballardw · Posted 07-30-2018 05:15 PM

Provide a worked small example of what you want. A small input set and the desired result.

Otherwise you are talking around a bunch of next to nonsense.

IF you have an order to your categorical variable, which you have not stated or shown in any way, the concept of "median" as the middle value might apply.

If I have values of a, b, c, a, c, d, p, d, q where the "order" is normal alphabetical order then the data could be reordered to

a,a,b,c,c,d,d,p,q.

Of the 9 elements shown, the "median" would be the second c, as 4 values come before it and 4 come after.

If you have an even number such as 18 elements, then the 9th and 10th would have to be considered and tie breaking becomes an issue if both values are different. If the 9th were "m" and the 10th were "s" what value to pick as the middle might be more problematic.
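For what it's worth, one way to find such a "median category" in SAS is to take the first category whose cumulative percent reaches 50. The sketch below uses the nine letters from the example above; the data set and variable names are made up, and it assumes the sort order of the values is the order you intend. When the cumulative percent hits exactly 50 with an even count, this rule simply picks the lower of the two middle categories, which is one arbitrary way to break the tie.

```sas
/* The nine example values, in their original order. */
data letters;
  input letter $ @@;
  datalines;
a b c a c d p d q
;
run;

/* OUTCUM adds cumulative frequency and percent to the output data set. */
proc freq data=letters noprint;
  tables letter / out=freqs outcum;
run;

/* Keep the first category whose cumulative percent reaches 50. */
data median_category;
  set freqs;
  if cum_pct >= 50 then do;
    output;
    stop;
  end;
run;

proc print data=median_category;
run;
```

For these values the cumulative percent first passes 50 at c, matching the hand count.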

And your apparent quote helps not as it appears to be a response to something without context.
