/* Hi Forum,
I have a dataset like the one below. It contains income records for 2 households, observed on different dates.
Q: I want to get the min, max and mean household income of the entire sample.
I have used Approaches I and II below, which give two very different answers.
Could you please tell me which approach is correct for getting the mean income of households in the sample?
*/
data data1;
input HOUSE_ID Date Income;
cards;
111 20170101 25
111 20170208 30
111 20170617 .
333 20170623 400
333 20170705 -0.001
333 20170718 4000
;
run;
/* Approach I: */
Proc means data = data1;
Var income;
Run;
/* Answer: Min = -0.001
Max = 4000
Mean = 891 (890.9998 before rounding) */
/*Approach II*/
proc means data=data1 noprint nway; /* nway is necessary: it keeps only the per-household rows and drops the overall summary row */
class House_ID;
var Income;
output out=data2 mean=Income_mean;
run;
proc means data=data2;
var Income_mean;
run;
/* Answer: Min = 27.5
Max = 1466.67
Mean = 747 */
/* Thanks */
Let me first throw a wrench into this: perhaps you only want to include the "latest" income for each household, or only household incomes within a specified time frame.
Which approach is "correct" depends on the question you want to answer. If you want to discuss differences across households, first reduce the data to one value per household (latest, earliest, or mean within a time frame) and then summarize, similar to your Approach II. If the question is just about the records in the sample, use the first approach. I would also request N and the standard deviation, just to see whether there is something unexpected in the data.
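As a rough sketch of the "reduce to household first" idea, assuming the data1 set from the question and taking "latest" to mean the most recent record with a non-missing income: the data set names sorted and latest are made up for this example, and sorting on Date works only because the yyyymmdd numbers happen to order chronologically.

```sas
/* Sort by household and date, dropping records with missing income
   so that "latest" means the latest record that has a value */
proc sort data=data1(where=(not missing(Income))) out=sorted;
  by HOUSE_ID Date;
run;

/* Keep only the last (most recent) record of each household */
data latest;
  set sorted;
  by HOUSE_ID;
  if last.HOUSE_ID;
run;

/* Summarize across households, asking for N and the standard deviation as well */
proc means data=latest n mean std min max;
  var Income;
run;
```

With the sample data this keeps the 20170208 record for household 111 and the 20170718 record for household 333. Swapping last.HOUSE_ID for first.HOUSE_ID would give the earliest record instead.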
I would also be very tempted to discard records with negative income.
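One possible way to do that without modifying data1 is a WHERE statement inside PROC MEANS. This is only a sketch, and whether a negative income is a data error or a legitimate value is a judgment call for your data. Note that SAS treats missing values as smaller than any number, so the condition below also filters out the missing record (which PROC MEANS would skip anyway).

```sas
/* Approach I restricted to non-negative incomes */
proc means data=data1 n mean std min max;
  where Income >= 0;   /* drops the -0.001 record and the missing one */
  var Income;
run;
```

With the sample data, four records remain (25, 30, 400 and 4000).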
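The second approach is wrong. In general, you can't take the mean of a data set of means and get anything meaningful: each household mean is based on a different number of records (2 non-missing for household 111, 3 for household 333), yet the second step weights them equally.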
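To illustrate where the difference comes from, here is a sketch assuming the data1 set from the question: if each household mean is weighted by its number of non-missing records, the result matches the pooled mean from Approach I. The names data2w and Income_n are made up for this example.

```sas
/* Per-household mean plus the number of non-missing incomes behind it.
   n= is used rather than the automatic _FREQ_ variable, because _FREQ_
   also counts the record with missing income. */
proc means data=data1 noprint nway;
  class HOUSE_ID;
  var Income;
  output out=data2w mean=Income_mean n=Income_n;
run;

/* Weighted mean of the household means:
   (2*27.5 + 3*1466.67) / 5, which equals the Approach I mean */
proc means data=data2w mean;
  var Income_mean;
  weight Income_n;
run;
```

The unweighted version in Approach II instead answers a different question: the average of the household-level averages, with every household counting once.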