dunga
Obsidian | Level 7

/* Hi Forum,

I have a dataset like the one below. It records the income of 2 households, observed on different dates.

Q: I want to get the min, max and mean household income for the entire sample.

I used Approaches I and II below, which give two very different answers.

Could you please tell me which approach is correct for getting the mean income of households in the sample?

*/

data data1;
input HOUSE_ID Date Income;
cards;
111 20170101 25
111 20170208 30
111 20170617 .
333 20170623 400
333 20170705 -0.001
333 20170718 4000
;
run;

/* Approach I: */

proc means data=data1;
var Income;
run;

/* Answer: Min = -0.001
Max = 4000
Mean = 891 (890.9998 before rounding) */

/* Approach II: */

proc means data=data1 noprint nway; /* nway keeps only the per-household rows, not the overall row */
class House_ID;
var Income;
output out=data2 mean=Income_mean;
run;

proc means data=data2;
var Income_mean;
run;

/* Answer: Min = 27.5
Max = 1466.67
Mean = 747 */

/* Thanks */

 

Accepted Solutions
ballardw
Super User

Actually, I'm going to throw a wrench in first: perhaps you only want to include the "latest" income for each household, or perhaps only household incomes within a specified time frame.

 

The "correct" one depends on what kind of question you want to answer. If you want to discuss differences across households, first reduce the data to one record per household (latest, earliest, or mean within a time frame) and then summarize, similar to your Approach II. If the question is just about the income records in the sample, use the first approach. I would also want N and the standard deviation, just to let me know if there's something unexpected about the data.

 

And I would be very tempted to discard records with negative income.
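
The household-first route described above could look roughly like the sketch below. It assumes the data1 dataset from the question; the dataset names clean and latest are made up for illustration, and dropping negative incomes is a judgment call rather than a rule. Because Date is stored as a plain yyyymmdd number, sorting by it puts each household's records in chronological order.

```sas
/* Keep only non-negative incomes. In SAS a missing value compares as
   less than any number, so this WHERE also drops missing incomes. */
data clean;
set data1;
where Income >= 0;
run;

proc sort data=clean;
by House_ID Date;
run;

/* One record per household: the latest remaining income */
data latest;
set clean;
by House_ID;
if last.House_ID;
run;

/* Summarize across households, including N and the standard deviation */
proc means data=latest n mean std min max;
var Income;
run;
```

With the sample data this leaves one row per household (30 for household 111 and 4000 for household 333). Replacing the last.House_ID step with a per-household mean gives the "mean within time frame" variant instead.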


2 REPLIES
WarrenKuhfeld
Ammonite | Level 13

The second approach is wrong. In general, you can't take the mean of a data set of means and get anything meaningful.
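
One way to see where the two answers diverge: the mean of the household means gives every household equal weight, no matter how many income records it contributes. The sketch below assumes the data1 dataset from the question; hh_means and Income_n are names chosen here for illustration. Weighting each household mean by its count of non-missing incomes reproduces the record-level mean from Approach I.

```sas
/* Household means plus the number of non-missing incomes behind each one */
proc means data=data1 noprint nway;
class House_ID;
var Income;
output out=hh_means mean=Income_mean n=Income_n;
run;

/* Weighted mean of the household means:
   (2*27.5 + 3*1466.6663) / 5 = 890.9998, the same as Approach I */
proc means data=hh_means mean;
var Income_mean;
weight Income_n;
run;
```

Without the WEIGHT statement the same step returns the unweighted figure from Approach II (about 747), which treats household 111's two records and household 333's three records as equally informative.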
