sastuck
Pyrite | Level 9

Hello,

 

In my CEO compensation data set I have a variable "gender" that indicates whether a CEO is a man or a woman. I would like to find the mean for each of these two groups. How can I separate the values so that I can see the two different averages?

 

This is what the data look like:

	

56	508.007	0	883.694	175.511	FEMALE
57	530.478	0	0	0	MALE
58	522	0	0	0	MALE
59	618.135	0	0	0	MALE
60	540.385	0	0	0	FEMALE
61	569.8	0	0	0	MALE
62	541.087	0	4803.789	0	MALE
63	591.911	0	4199.761	0	MALE
64	551.193	0	4185.64	0	FEMALE
65	581.196	0	4803.789	MALE
66	33.622	0	6606.551	MALE
67	35.724	0	6606.551	MALE
68	42.308	0	15415.27	MALE
69	42.868	0	11010.911	MALE
70	37.826	0	7707.623	MALE
71	563.86	0	2500	0	MALE
72	566.067	0	2500	0	MALE
73	687.884	0	7000	0	MALE
74	642.512	0	4500	0	MALE
75	591.254	0	3000	0	MALE
76	584.178	0	2920	0	MALE
77	584.178	0	2920	0	MALE
78	231.538	0	10330	0	MALE
79	660.375	0	5250	0	MALE
80	609.577	0	3500	0	MALE
81	600.936	0	2575	0	MALE

Thanks for the help!

1 ACCEPTED SOLUTION

ballardw
Super User

@sastuck wrote:

The MEANS Procedure

 Analysis Variable : SALARYN

N        N Miss
74300    0

 

Looks like I don't have missing or 0 observations for salary. There are plenty for the other forms of compensation though, but perhaps these should be reflected in the mean?


If this relates to my comment about values of 0 included in your means this does not refute it. My concern is that you have values of zero.

 

Consider 5 CEOs with salaries of 10000, 10000, 10000, 0, and 0. Do you want 10000 or 6000 as the "mean"? You have not shown us any way to determine the names of the variables in your "example" data, but 3 of your 4 likely numeric variables have multiple 0 values (two of the data columns contain basically nothing but 0 or missing). The concern is interpretation if 0 is included in the salary. But as I say again, we do not know which column of your data represents salary.

Data is best presented in the form of a data step. Then there is no problem identifying which variables are numeric or character, we have the names of the variables, and we have a data set we can test code against to verify that the results are correct, using an example small enough to calculate by hand. Such as:

data have;
   input id salary sex $;
datalines;
1  50000 male
2  25000 male
3  44444 female
4  0     female
;
run;

proc means data=have;
   class sex;
   var salary;
run;

So in the above, since there are only two females, would you want the "mean" salary to be 44444 or 22222? If 22222 seems wrong and it should be 44444, then you need to exclude the values of 0 from the mean.
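
As a minimal sketch of that exclusion, assuming the zeros really do stand for unreported salaries: a WHERE statement inside PROC MEANS filters the rows before any statistic is computed. The data step repeats the toy data so the block runs on its own.

```sas
/* Same toy data as the example, repeated so this block is self-contained */
data have;
   input id salary sex $;
datalines;
1  50000 male
2  25000 male
3  44444 female
4  0     female
;
run;

/* Zeros included: the female mean is (44444 + 0) / 2 = 22222 */
proc means data=have n mean;
   class sex;
   var salary;
run;

/* Zeros excluded: only id 3 remains for female, so the mean is 44444 */
proc means data=have n mean;
   where salary > 0;
   class sex;
   var salary;
run;
```

The male mean is 37500 in both runs, since neither male salary is 0. Note that the WHERE statement drops the whole observation, which matters once several analysis variables share the same VAR statement.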

 


30 REPLIES
sastuck
Pyrite | Level 9

@PaigeMiller I basically just want to know how to find the average of both genders separately even though the data is in one variable. 

Reeza
Super User

You put your grouping variable in your CLASS statement in PROC MEANS/UNIVARIATE/SUMMARY. 
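
A quick sketch of the CLASS statement using SASHELP.CLASS, a sample data set that ships with SAS (choosing it here is only for illustration; it is not the CEO data):

```sas
/* SASHELP.CLASS holds 19 students with sex, age, height and weight */
proc means data=sashelp.class n mean std;
   class sex;            /* one row of statistics per value of sex (F, M) */
   var height weight;    /* numeric analysis variables */
run;
```

The grouping variable stays in a single column; PROC MEANS splits the observations by its values, so no reshaping of the data is needed.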

 

See the example here - you can run it yourself to see how it works:

https://github.com/statgeek/SAS-Tutorials/blob/master/proc_means_basic.sas

sastuck
Pyrite | Level 9

What is going on in the var line?

 

here's an example from the page you linked:

 

*Create summary data;
proc means data=have noprint;
	by id;
	var feature1-feature3;
	output out=want median= var= mean= /autoname;
run;

feature1-feature3 ? What is this?

Shmuel
Garnet | Level 18

You have not mentioned the names of your data variables.

Assuming names :

ID   VAR1     VAR2  VAR3     VAR4     GENDER
56   508.007  0     883.694  175.511  FEMALE
57   530.478  0     0        0        MALE
58   522      0     0        0        MALE

Your code should be like:

proc means data=have noprint;
   class gender;
    var var1 var3 ;   /* or var1 - var4 << for all 4 numeric variables */
    output out=want <statistics> / <options> ;
run;

Pay attention: VAR1-VAR4 is equivalent to VAR1 VAR2 VAR3 VAR4.
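
A runnable sketch of that template, under the same assumed variable names (VAR1-VAR4 and GENDER are guesses, not the real names), with MEAN= and AUTONAME filled in as one possible choice of statistics and options:

```sas
/* Three rows from the posted sample, read with the assumed names */
data have;
   input id var1 var2 var3 var4 gender $;
datalines;
56 508.007 0 883.694 175.511 FEMALE
57 530.478 0 0 0 MALE
58 522 0 0 0 MALE
;
run;

proc means data=have noprint;
   class gender;
   var var1-var4;                      /* same as: var var1 var2 var3 var4 */
   output out=want mean= / autoname;   /* creates var1_Mean ... var4_Mean */
run;

/* WANT holds an overall row (_TYPE_=0) plus one row per gender (_TYPE_=1) */
proc print data=want;
run;
```

Adding the NWAY option to the PROC MEANS statement would keep only the per-gender rows.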

Reeza
Super User

The data step earlier in that program creates sample data with variables named feature1, feature2, and feature3.

A shortcut for referencing all three at once is feature1-feature3.

 



ballardw
Super User

Warning Will Robinson:

Use of zero for "missing" data will artificially lower your mean salary analysis.

Since I really doubt that any of those CEOs work for zero salary, you probably should either 1) take a pass through a data step to set zero-valued salary fields to missing, or 2) make sure that you subset the data to records with salary greater than zero in PROC MEANS/SUMMARY or any other analysis.

If you have multiple salary fields with some 0 and others not then the first approach is likely the better one.
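
A rough sketch of both approaches. The data set, the variable names (salary, bonus, gender), and the values are hypothetical and only serve to show the mechanics:

```sas
/* Hypothetical compensation data; 0 stands in for "not reported" */
data have;
   input id salary bonus gender $;
datalines;
1 50000 1000 MALE
2 0     500  MALE
3 44444 0    FEMALE
4 60000 0    FEMALE
;
run;

/* Approach 1: recode 0 to missing in every affected field */
data clean;
   set have;
   array comp{*} salary bonus;
   do i = 1 to dim(comp);
      if comp{i} = 0 then comp{i} = .;
   end;
   drop i;
run;

proc means data=clean n nmiss mean;
   class gender;
   var salary bonus;
run;

/* Approach 2: leave the data alone and subset on a single field */
proc means data=have n mean;
   where salary > 0;
   class gender;
   var salary;
run;
```

Approach 2 drops the entire observation, so id 2's bonus of 500 would be lost if bonus were analyzed in the same step. Approach 1 blanks out only the individual zero cells, which is why it suits data with several compensation fields.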

 

A second consideration might be time if the data has the salary for multiple years for a single individual.

Reeza
Super User

@ballardw Someone's been watching Lost in Space....:D

ballardw
Super User


I used to have to stay up late to watch it when it was first broadcast ...

sastuck
Pyrite | Level 9

The MEANS Procedure

 Analysis Variable : SALARYN

N        N Miss
74300    0

 

Looks like I don't have missing or 0 observations for salary. There are plenty for the other forms of compensation though, but perhaps these should be reflected in the mean?

Reeza
Super User

Your sample data shows 0s. Are those 0s missing values, or truly zero? If they are really missing, then using zero will deflate your means, and they will not show up in the N Miss count of the PROC MEANS output, because 0 is not a missing value according to SAS.
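
A tiny illustration of that point, using a made-up bonus column: the two zeros count toward N and pull the mean down, while only the period (SAS numeric missing) is counted under N Miss.

```sas
/* Made-up values: one real bonus, two zeros, one true missing */
data demo;
   input id bonus;
datalines;
1 500
2 0
3 .
4 0
;
run;

/* Reports N = 3 and N Miss = 1; the mean is 500 / 3, about 166.67 */
proc means data=demo n nmiss mean;
   var bonus;
run;
```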

sastuck
Pyrite | Level 9

Yes, those are from the bonus, stock options, etc. variables (sorry I didn't include headers). I couldn't find any helpful documentation for this, but since there is a salary reported and a (true) zero bonus is possible, I might just assume that they don't represent missing values. Regardless, I am still unsure how to find the mean for men and women separately. Thanks for the heads up though.


sastuck
Pyrite | Level 9

@ballardw

 

Regardless of the missing 0's, could you show me how to show the difference in means given the way my data look? I will consider the 0's later; right now I would just like to see how I can show the difference in means between men and women. Since they are in the same column, it's not as simple as the PROC MEANS I am used to. If you can help, thanks!

Reeza
Super User

@sastuck did you run the code included in @ballardw's post from MONDAY? It is exactly what you asked for.
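
If the difference itself is wanted rather than the two means side by side, one further option, not suggested elsewhere in this thread and assuming SAS/STAT is licensed, is PROC TTEST. Its CLASS statement takes the same single grouping column, which must have exactly two levels. A sketch on the same toy data:

```sas
/* Toy data from @ballardw's example, repeated so the block runs alone */
data have;
   input id salary sex $;
datalines;
1  50000 male
2  25000 male
3  44444 female
4  0     female
;
run;

/* Reports each group's mean plus the difference between the two groups */
proc ttest data=have;
   class sex;      /* grouping variable with exactly two levels */
   var salary;
run;
```

With only two observations per group the test itself means little here; the point is only that the two-level column goes into CLASS unchanged.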

 


