Hello,
In my CEO compensation data set I have a variable "gender" which indicates if a CEO is a man or a woman. I would like to find the mean for each of these two groups. How can I untangle these values so that I can observe the two different averages?
This is what the data look like:
56 508.007 0 883.694 175.511 FEMALE
57 530.478 0 0 0 MALE
58 522 0 0 0 MALE
59 618.135 0 0 0 MALE
60 540.385 0 0 0 FEMALE
61 569.8 0 0 0 MALE
62 541.087 0 4803.789 0 MALE
63 591.911 0 4199.761 0 MALE
64 551.193 0 4185.64 0 FEMALE
65 581.196 0 4803.789 MALE
66 33.622 0 6606.551 MALE
67 35.724 0 6606.551 MALE
68 42.308 0 15415.27 MALE
69 42.868 0 11010.911 MALE
70 37.826 0 7707.623 MALE
71 563.86 0 2500 0 MALE
72 566.067 0 2500 0 MALE
73 687.884 0 7000 0 MALE
74 642.512 0 4500 0 MALE
75 591.254 0 3000 0 MALE
76 584.178 0 2920 0 MALE
77 584.178 0 2920 0 MALE
78 231.538 0 10330 0 MALE
79 660.375 0 5250 0 MALE
80 609.577 0 3500 0 MALE
81 600.936 0 2575 0 MALE
Thanks for the help!
@sastuck wrote:
The MEANS Procedure
Analysis Variable : SALARY
N        N Miss
74300    0
Looks like I don't have missing or 0 observations for salary. There are plenty for the other forms of compensation though, but perhaps these should be reflected in the mean?
If this relates to my comment about values of 0 included in your means this does not refute it. My concern is that you have values of zero.
Consider 5 CEOs with salaries of 10000, 10000, 10000, 0, and 0. Do you want 10000 or 6000 as the "mean"? You have not shown us any way to determine the names of the variables in your example data, but 3 of your 4 likely numeric variables have multiple 0 values (two of the data columns basically contain only 0 or missing). The concern is interpretation if 0 is included in the salary. But, as I say again, we do not know which column of your data represents salary.
Data is best presented in the form of a data step, so there is no problem identifying which variables are numeric or character: we have the names of the variables and a data set we can test code against to verify that the results are correct for an example that is small enough to calculate by hand. Such as:
data have;
   input id salary sex $;
   datalines;
1 50000 male
2 25000 male
3 44444 female
4 0 female
;
run;

proc means data=have;
   class sex;
   var salary;
run;
So in the above, since there are only two females, would you want the "mean" salary to be 44444 or 22222? If 22222 seems wrong and it should be 44444, then you need to exclude the values of 0 from the mean.
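If the zeros should be left out, one possible way is a WHERE statement inside PROC MEANS. This is only a sketch on the same four-row toy data, and it assumes that a salary of 0 means "not reported" rather than a true zero; whether that holds for the real data is not something this thread settles.

```sas
/* Same toy data as in the post above */
data have;
   input id salary sex $;
   datalines;
1 50000 male
2 25000 male
3 44444 female
4 0 female
;
run;

/* WHERE filters rows before any statistics are computed.      */
/* salary > 0 also drops missing values, since SAS sorts a     */
/* missing numeric value below every number.                   */
proc means data=have n mean;
   where salary > 0;
   class sex;
   var salary;
run;
```

With the zero row filtered out, the female group is left with one observation (mean 44444) and the male group with two (mean 37500).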
@PaigeMiller I basically just want to know how to find the average of both genders separately even though the data is in one variable.
You put your grouping variable in your CLASS statement in PROC MEANS/UNIVARIATE/SUMMARY.
See the example here - you can run it yourself to see how it works:
https://github.com/statgeek/SAS-Tutorials/blob/master/proc_means_basic.sas
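As a quick illustration of the CLASS statement, here is a minimal sketch that runs against SASHELP.CLASS, a sample table shipped with SAS, so it does not depend on the poster's data. The variable names SEX and HEIGHT belong to that sample table, not to the CEO data set.

```sas
/* One row of statistics is produced for each value of SEX */
proc means data=sashelp.class n mean;
   class sex;      /* grouping variable */
   var height;     /* numeric variable to summarize */
run;

/* For the CEO data the shape would be the same, for example    */
/* "class gender; var salary;" where SALARY is an assumed name. */
```

No sorting is needed for CLASS; a BY statement would give similar groups but requires the data to be sorted by the grouping variable first.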
What is going on in the var line?
here's an example from the page you linked:
*Create summary data;
proc means data=have noprint;
by id;
var feature1-feature3;
output out=want median= var= mean= /autoname;
run;
feature1-feature3 ? What is this?
You have not mentioned the names of your data variables.
Assuming names :
ID VAR1 VAR2 VAR3 VAR4 GENDER
56 508.007 0 883.694 175.511 FEMALE
57 530.478 0 0 0 MALE
58 522 0 0 0 MALE
Your code should be like:
proc means data=have noprint;
class gender;
var var1 var3 ; /* or var1 - var4 << for all 4 numeric variables */
output out=want <statistics> / <options> ;
run;
Pay attention: VAR1-VAR4 is equivalent to VAR1 VAR2 VAR3 VAR4.
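To make the template concrete, here is a sketch with the placeholders filled in. It uses the assumed names from this post (ID, VAR1-VAR4, GENDER) and only the three sample rows listed above; the choice of MEAN and N as statistics, and the NWAY option, are illustrative picks rather than something the original poster asked for.

```sas
/* Three sample rows, read with the assumed variable names */
data have;
   input id var1 var2 var3 var4 gender $;
   datalines;
56 508.007 0 883.694 175.511 FEMALE
57 530.478 0 0 0 MALE
58 522 0 0 0 MALE
;
run;

/* NWAY keeps only the per-gender rows in WANT and drops the  */
/* overall (_TYPE_=0) row. AUTONAME builds output names such  */
/* as var1_Mean and var1_N.                                   */
proc means data=have noprint nway;
   class gender;
   var var1-var4;          /* same as: var var1 var2 var3 var4 */
   output out=want mean= n= / autoname;
run;

proc print data=want;
run;
```

Because NOPRINT suppresses the usual report, the PROC PRINT step is what displays the one-row-per-gender result.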
The sample data set earlier in that program has variables named feature1, feature2, and feature3.
A shortcut to reference all three at once is feature1-feature3.
Warning Will Robinson:
Use of zero for "missing" data will artificially lower your mean salary analysis.
Since I really doubt that any of those CEOs work for zero salary, you probably should either 1) take a pass through a data step to set zero-valued salary fields to missing, or 2) make sure that you subset the data to records with salary greater than zero in PROC MEANS/SUMMARY or any other analysis.
If you have multiple salary fields with some 0 and others not then the first approach is likely the better one.
A second consideration might be time if the data has the salary for multiple years for a single individual.
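The first approach above (recoding zeros to missing in a data step) could look roughly like this. The data set, the variable names SALARY and BONUS, and the values are all made up for illustration; only fields where 0 really stands for "not reported" should go in the array.

```sas
/* Toy data with invented values */
data have;
   input id salary bonus gender $;
   datalines;
1 50000 0 MALE
2 25000 500 MALE
3 44444 0 FEMALE
4 0 300 FEMALE
;
run;

/* Recode 0 to missing for every variable listed in the array */
data clean;
   set have;
   array pay{*} salary bonus;
   do i = 1 to dim(pay);
      if pay{i} = 0 then pay{i} = .;
   end;
   drop i;
run;

/* Missing values are excluded from the mean and are counted */
/* under N Miss instead                                      */
proc means data=clean n nmiss mean;
   class gender;
   var salary bonus;
run;
```

The recode happens once, so every later procedure sees the same treatment of zeros without repeating a WHERE clause for each field.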
@ballardw Someone's been watching Lost in Space....:D
The MEANS Procedure
Analysis Variable : SALARY
N        N Miss
74300    0
Looks like I don't have missing or 0 observations for salary. There are plenty for the other forms of compensation though, but perhaps these should be reflected in the mean?
Your sample data shows 0. Are those zeros missing values, or truly zero? If they are really missing, then using zero will deflate your numbers, and they will not show up under N Miss in the PROC MEANS output because SAS does not treat 0 as missing.
Yes, those are from the bonus, stock options, etc. variables (sorry I didn't include headers). I couldn't find any helpful documentation for this, but since there is a salary reported and a (true) zero bonus is possible, I might just assume that they don't represent missing values. Regardless, I am still unsure how to find the mean for men and women separately. Thanks for the heads up though.
Regardless of the missing 0's, could you show me how to show the difference in means given the way my data look? I will consider the 0's later; right now I would just like to see how I can show the difference in means between men and women. Since they are in the same column, it's not as simple as the PROC MEANS I am used to. If you can help, thanks!
@sastuck did you run the code included in @ballardw post from MONDAY? It is exactly what you asked for.