Solved: Re: Which "min", "max" and "average" are correct?

dunga · Posted 08-10-2017 12:03 PM

/* Hi Forum,

I have a dataset like the one below. It gives the income of 2 households recorded at different points in time.

Q: I want to get the min, max and mean household income for the entire sample.

I have used Approaches I and II below, which give two very different answers.

Could you please tell me which approach is correct for getting the mean income of households in the sample?

*/

data data1;
input HOUSE_ID Date Income;
cards;
111 20170101 25
111 20170208 30
111 20170617 .
333 20170623 400
333 20170705 -0.001
333 20170718 4000
;
run;

/* Approach I */

proc means data=data1;
var Income;
run;

/* Answer: Min = -0.001
Max = 4000
Mean = 891.00 */

/* Approach II */

proc means data=data1 noprint nway; /* nway is necessary: it keeps only the per-household rows and drops the overall row */
class HOUSE_ID;
var Income;
output out=data2 mean=Income_mean;
run;

proc means data=data2;
var Income_mean;
run;

/* Answer: Min = 27.5
Max = 1466.67
Mean = 747.08 */

/* Thanks */

ballardw · Posted 08-10-2017 12:16 PM

Actually I'm going to first throw a wrench: Perhaps you only want to include the "latest" income for each household, or perhaps household incomes within a specified time frame.

The "correct" one depends on what kind of question you want to answer. If you want to discuss differences across households, first reduce the data to one value per household (latest, earliest, or mean within a time frame) and then summarize, similar to your Approach II. If the question is just about the records in the sample, use Approach I. I would also tend to want N and the standard deviation, just to let me know if there's something unexpected about the data.
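As a rough sketch of the "reduce to household first" idea, assuming the "latest" value means the most recent record with a non-missing income: the code below keeps one row per household and then summarizes across households, asking for N and the standard deviation as well. The data set names sorted and latest_income are made up for illustration; data1 is the data set from the question.

```sas
/* Sketch only: sorted and latest_income are hypothetical names. */

/* Drop records with missing income, then order each household by date. */
/* Date is a plain yyyymmdd number here, so numeric order is date order. */
proc sort data=data1 out=sorted;
  where not missing(Income);
  by HOUSE_ID Date;
run;

/* Keep the last (most recent) row for each household. */
data latest_income;
  set sorted;
  by HOUSE_ID;
  if last.HOUSE_ID;
run;

/* Summarize across households: one income value per household. */
proc means data=latest_income n mean std min max;
  var Income;
run;
```

With the sample data this keeps 30 for household 111 and 4000 for household 333, so the summary describes two households rather than five records.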

And I would be very tempted to discard records with negative income.
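One possible way to leave out the negative records without changing data1 is a WHERE statement inside PROC MEANS. This is only a sketch, and it assumes that zero income is valid and should stay in.

```sas
/* Sketch: summarize only records with income of zero or more.        */
/* SAS treats a missing numeric value as smaller than any number,     */
/* so this condition also excludes the missing income for house 111.  */
proc means data=data1 n mean std min max;
  where Income >= 0;
  var Income;
run;
```

On the sample data, four records remain (25, 30, 400 and 4000).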


WarrenKuhfeld · Posted 08-10-2017 12:11 PM

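The second approach is wrong. In general, you can't take means of a data set of means and get anything meaningful: the mean of the household means weights every household equally, so it matches the overall mean only when each household has the same number of non-missing records.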
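To see where the two means part ways, here is a minimal sketch that stores each household's non-missing record count next to its mean and then averages the household means twice, once unweighted and once weighted by that count. The names house_stats and Income_n are made up for illustration.

```sas
/* Sketch only: house_stats and Income_n are hypothetical names. */

/* One row per household: mean income and count of non-missing incomes. */
proc means data=data1 noprint nway;
  class HOUSE_ID;
  var Income;
  output out=house_stats mean=Income_mean n=Income_n;
run;

/* Unweighted: every household counts once (this is Approach II). */
proc means data=house_stats mean;
  var Income_mean;
run;

/* Weighted by record count: sum(n * mean) / sum(n), */
/* which is the overall mean from Approach I.        */
proc means data=house_stats mean;
  var Income_mean;
  weight Income_n;
run;
```

The weighted version recovers the Approach I mean because household 111 contributes 2 records and household 333 contributes 3. The min and max of the household means cannot be turned back into the record-level min and max in this way.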

