04-03-2014 06:58 AM
If you want to calculate the average value of a variable (which contains a few missing values ) in a proc SQL query, would it be enough and correct?
Cause i know that the function avg or mean doesn't take into consideration the rows with missing values for the calculation, is it a way to bypass that in a sql query?
So in general of i had 5 observatios with 2 missing what would be the correct assumption ? divide by 5 or 3?
thank you in advance
04-03-2014 07:18 AM
Divide by 3. Imagine this:
6 6 6 . 6 6
What would you want to have as average?
30 / 5 = 6 or 30 / 6 = 5 ?
04-03-2014 07:30 AM
Im unclear what your need is ... the natural behavior is to exclude missing values from both the numerator and the denominator in the average. If you wanted to treat them as Zeros you could just easily calculate the average through code and not use the function:
select var1, (sum(num) / count(*)) as average
group by var1
This assumes you want to group by a variable instead of average across the entire dataset.
Hope this helps!
04-03-2014 08:05 AM
I can't image a situation where you want to replace missings with zeros. Doesn't make sense to me. Maybe there are very narrow, very rare, very specific cases where it makes sense (in which case you would have to justify the reasons for doing so) — but in general, do not replace missings with zeros.
04-03-2014 07:25 PM
This is really more of a stats problem than a programming question, but then I'm more of a statistician than a programmer...
Without context, there's no universal answer to this question of how to calculate the average. There are scenarios in which missing values should be excluded from calculation, others where they should be treated as zeroes, and yet others where missingness makes it impossible to get a sound answer. You need to think about what missing data means in your context, and how the average will be used.
One situation where I commonly encounter missing-as-zero is when merging counts created by PROC SUMMARY/MERGE. For instance, I might have a list of 'flu cases reported, each with a date attached. I can get day-by-day counts:
proc summary data=flu_reports nway;
I can then merge onto a list of other illnesses reported (assumed to cover all days of the year):
merge illnesses_except_flu flu_reports_by_date;
If there were no flu cases reported on April 1, then flu_reports_by_date will have no entry for April 1. When I merge it to create all_illnesses_by_date it will have a missing value for flu - but that should be treated as zero. Ideally it'd be changed to zero before calculating averages, but if not, then the missing entry needs to be included in the count when calculating an average.
04-04-2014 02:01 AM
I like that, but in such cases where missing is "legally zero" I prefer to avoid missing values by doing it like that:
merge illnesses_except_flu flu_reports_by_date (in=in_days);
if not in_days then flu_cases = 0;
04-04-2014 02:14 AM
I agree that it's best to convert missing values to zeroes where that's what they represent, but that code won't give the result you're expecting.
RENAME statements are executed after the body of the DATA step is run. So your code will create a new variable named "flu_cases" (which will be zero for cases which are missing from flu_reports_by_date, and missing for all others) before attempting to rename _freq_ to flu_cases. This will generate a warning ("Variable flu_cases already exists") and as far as I can tell, you end up with the same values that were in _freq_, complete with missingness.
Probably safer to do the renaming up in the MERGE statement so you can use the new name through the rest of the data step - I just couldn't remember the exact syntax and didn't have time to look it up.
04-04-2014 02:28 AM
I am used to do the rename as data set option, where it is clear if it happens before or after. Good catch.
04-07-2014 07:37 PM
Yeah, renaming as a data set option is usually a more intuitive way to go, I just got lazy in my demo code :-)
(When I first worked in SAS, nobody mentioned to me that data step commands aren't all executed in order of appearance, which made the learning process... interesting.)
04-06-2014 05:11 AM
It depends on what you want. SQL defaultly will neglect the missing value when using avg() or mean(). if that is not you suppose, you can change these missing value into zero .
data have; input x @@; cards; 3 4 5 . . 4 5 4 ; run; proc stdize data=have out=want reponly missing=0; var x; run;