Help using Base SAS procedures

The correct average please help

Reply
Frequent Contributor
Posts: 126

The correct average please help

Hi all,

If you want to calculate the average value of a variable (which contains a few missing values ) in a proc SQL query, would it be enough and correct?

Cause i know that the function avg or mean doesn't take into consideration the rows with  missing values for the calculation, is it a way to bypass that in a sql query?

So in general of i had 5 observatios with 2 missing what would be the correct assumption ? divide by 5 or 3?

thank you in advance

Super User
Posts: 7,857

Re: The correct average please help

Posted in reply to chemicalab

Divide by 3. Imagine this:

6 6 6 . 6 6

What would you want to have as average?

30 / 5 = 6 or 30 / 6 = 5 ?

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 334

Re: The correct average please help

Posted in reply to chemicalab

Im unclear what your need is ... the natural behavior is to exclude missing values from both the numerator and the denominator in the average. If you wanted to treat them as Zeros you could just easily calculate the average through code and not use the function:

proc sql;

select var1, (sum(num) / count(*)) as average

from data

group by var1

;

quit;

This assumes you want to group by a variable instead of average across the entire dataset.

Hope this helps!

EJ

Frequent Contributor
Posts: 126

Re: The correct average please help

Posted in reply to chemicalab

Thank you for the replies, what I am unclear about is what is correct statistically to do?

Trusted Advisor
Posts: 1,932

Re: The correct average please help

Posted in reply to chemicalab

I can't image a situation where you want to replace missings with zeros. Doesn't make sense to me. Maybe there are very narrow, very rare, very specific cases where it makes sense (in which case you would have to justify the reasons for doing so)  — but in general, do not replace missings with zeros.

Frequent Contributor
Posts: 126

Re: The correct average please help

Posted in reply to PaigeMiller

thank you for he clarification

Contributor
Posts: 30

Re: The correct average please help

Posted in reply to chemicalab

This is really more of a stats problem than a programming question, but then I'm more of a statistician than a programmer...

Without context, there's no universal answer to this question of how to calculate the average. There are scenarios in which missing values should be excluded from calculation, others where they should be treated as zeroes, and yet others where missingness makes it impossible to get a sound answer. You need to think about what missing data means in your context, and how the average will be used.

One situation where I commonly encounter missing-as-zero is when merging counts created by PROC SUMMARY/MERGE. For instance, I might have a list of 'flu cases reported, each with a date attached. I can get day-by-day counts:

proc summary data=flu_reports nway;

class day_reported;

output out=flu_reports_by_date;

quit;

I can then merge onto a list of other illnesses reported (assumed to cover all days of the year):

data all_illnesses_by_date;

merge illnesses_except_flu flu_reports_by_date;

by day_reported;

rename _freq_=flu_cases;

run;

If there were no flu cases reported on April 1, then flu_reports_by_date will have no entry for April 1. When I merge it to create all_illnesses_by_date it will have a missing value for flu - but that should be treated as zero. Ideally it'd be changed to zero before calculating averages, but if not, then the missing entry needs to be included in the count when calculating an average.

Super User
Posts: 19,869

Re: The correct average please help

Posted in reply to GeoffreyBrent

I agree with GeoffreyBent, it depends on the context of the missing variables.

Super User
Posts: 7,857

Re: The correct average please help

Posted in reply to GeoffreyBrent

I like that, but in such cases where missing is "legally zero" I prefer to avoid missing values by doing it like that:

data all_illnesses_by_date;

merge illnesses_except_flu flu_reports_by_date (in=in_days);

by day_reported;

rename _freq_=flu_cases;

if not in_days then flu_cases = 0;

run;

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 30

Re: The correct average please help

Posted in reply to KurtBremser

I agree that it's best to convert missing values to zeroes where that's what they represent, but that code won't give the result you're expecting.

RENAME statements are executed after the body of the DATA step is run. So your code will create a new variable named "flu_cases" (which will be zero for cases which are missing from flu_reports_by_date, and missing for all others) before attempting to rename _freq_ to flu_cases. This will generate a warning ("Variable flu_cases already exists") and as far as I can tell, you end up with the same values that were in _freq_, complete with missingness.

Probably safer to do the renaming up in the MERGE statement so you can use the new name through the rest of the data step - I just couldn't remember the exact syntax and didn't have time to look it up.

Super User
Posts: 7,857

Re: The correct average please help

Posted in reply to GeoffreyBrent

I am used to do the rename as data set option, where it is clear if it happens before or after. Good catch.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 30

Re: The correct average please help

Posted in reply to KurtBremser

Yeah, renaming as a data set option is usually a more intuitive way to go, I just got lazy in my demo code :-)

(When I first worked in SAS, nobody mentioned to me that data step commands aren't all executed in order of appearance, which made the learning process... interesting.)

Super User
Posts: 10,044

Re: The correct average please help

Posted in reply to chemicalab

It depends on what you want. SQL defaultly will neglect the missing value when using avg() or mean(). if that is not you suppose, you can change these missing value into zero .

data have;
 input x @@;
cards;
3 4 5 . . 4 5 4
;
run;
proc stdize data=have out=want reponly missing=0;
var x;
run;

Ksharp

Ask a Question
Discussion stats
  • 12 replies
  • 445 views
  • 4 likes
  • 7 in conversation