dunga
Obsidian | Level 7

/* Hi Forum,

I have a dataset like the one below. It gives the income of 2 households, recorded on different dates.

Q: I want to get the min, max and mean household income of the entire sample.

I have used Approaches I and II below, which give two very different answers.

Could you please tell me which approach is correct to get the mean income of households in the sample?

*/

data data1;
input HOUSE_ID Date Income;
cards;
111 20170101 25
111 20170208 30
111 20170617 .
333 20170623 400
333 20170705 -0.001
333 20170718 4000
;
run;

/* Approach I: */

proc means data=data1;
  var Income;
run;

/* Answer: Min = -0.001
   Max = 4000
   Mean = 891 */

/* Approach II: */

proc means data=data1 noprint nway; /* nway keeps only the per-household rows and drops the overall row */
  class House_ID;
  var Income;
  output out=data2 mean=Income_mean;
run;

proc means data=data2;
  var Income_mean;
run;

/* Answer: Min = 27.5
   Max = 1466.67
   Mean = 747 */

/* Thanks */

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Actually I'm going to first throw a wrench: Perhaps you only want to include the "latest" income for each household, or perhaps household incomes within a specified time frame.

 

The "correct" one depends on what kind of question you want to answer. If you want to discuss differences across households, first reduce the data to one record per household (latest, earliest, or mean within a time frame) and then summarize, similar to your Approach II. If the question is just about the records in the sample, use the first approach. I would also tend to want N and standard deviations, just to let me know if there's something unexpected about the data.
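One possible way to do the "reduce to household first" step is sketched below, keeping the latest record per household and then summarizing with N and standard deviation included. The data set names `sorted` and `latest` are made up for illustration, and dropping missing incomes before picking the latest record is an assumption (in the sample data the most recent record for household 111 has a missing income); whether that is appropriate depends on the question being asked.

```sas
/* Sort by household and date; the WHERE= option drops missing incomes
   first (an assumption, so "latest" means latest non-missing income) */
proc sort data=data1(where=(Income ne .)) out=sorted;
  by House_ID Date;
run;

/* Keep only the last (most recent) record for each household */
data latest;
  set sorted;
  by House_ID;
  if last.House_ID;
run;

/* Summarize across households, one record each */
proc means data=latest n mean std min max;
  var Income;
run;
```

Replacing `last.House_ID` with `first.House_ID` would keep the earliest record instead.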

 

And I would be very tempted to discard records with negative income.
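A minimal sketch of dropping the negative-income records, assuming a cutoff of zero is what is wanted, could use a WHERE statement inside the procedure so the original data set stays untouched:

```sas
/* Summarize only records with non-negative income.
   SAS treats a missing numeric value as smaller than any number,
   so this condition also excludes the missing income. */
proc means data=data1 n mean std min max;
  where Income >= 0;
  var Income;
run;
```

With the sample data this leaves four records (25, 30, 400 and 4000), so the reported N should be 4.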


2 REPLIES 2
WarrenKuhfeld
Rhodochrosite | Level 12

The second approach is wrong.  You can't take means of a data set of means and in general get anything meaningful.
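To illustrate why the plain mean of household means differs from the overall mean, here is a sketch that also saves the number of non-missing records per household and uses it as a weight. The names `hh_means` and `Income_n` are hypothetical. Weighting each household mean by its record count should reproduce the record-level mean from Approach I (about 891), whereas the unweighted version gives every household the same influence regardless of how many records it has.

```sas
/* Per-household mean plus the count of non-missing incomes */
proc means data=data1 noprint nway;
  class House_ID;
  var Income;
  output out=hh_means n=Income_n mean=Income_mean;
run;

/* Weighted mean of the household means:
   sum(Income_n * Income_mean) / sum(Income_n) */
proc means data=hh_means mean;
  var Income_mean;
  weight Income_n;
run;
```

Only the mean is requested here, because the min and max in this step would describe the household means rather than the original records.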

