BookmarkSubscribeRSS Feed
chemicalab
Fluorite | Level 6

Hi all,

If you want to calculate the average value of a variable (which contains a few missing values ) in a proc SQL query, would it be enough and correct?

Cause i know that the function avg or mean doesn't take into consideration the rows with  missing values for the calculation, is it a way to bypass that in a sql query?

So in general of i had 5 observatios with 2 missing what would be the correct assumption ? divide by 5 or 3?

thank you in advance

12 REPLIES 12
esjackso
Quartz | Level 8

Im unclear what your need is ... the natural behavior is to exclude missing values from both the numerator and the denominator in the average. If you wanted to treat them as Zeros you could just easily calculate the average through code and not use the function:

proc sql;

select var1, (sum(num) / count(*)) as average

from data

group by var1

;

quit;

This assumes you want to group by a variable instead of average across the entire dataset.

Hope this helps!

EJ

chemicalab
Fluorite | Level 6

Thank you for the replies, what I am unclear about is what is correct statistically to do?

PaigeMiller
Diamond | Level 26

I can't image a situation where you want to replace missings with zeros. Doesn't make sense to me. Maybe there are very narrow, very rare, very specific cases where it makes sense (in which case you would have to justify the reasons for doing so)  — but in general, do not replace missings with zeros.

--
Paige Miller
chemicalab
Fluorite | Level 6

thank you for he clarification

GeoffreyBrent
Calcite | Level 5

This is really more of a stats problem than a programming question, but then I'm more of a statistician than a programmer...

Without context, there's no universal answer to this question of how to calculate the average. There are scenarios in which missing values should be excluded from calculation, others where they should be treated as zeroes, and yet others where missingness makes it impossible to get a sound answer. You need to think about what missing data means in your context, and how the average will be used.

One situation where I commonly encounter missing-as-zero is when merging counts created by PROC SUMMARY/MERGE. For instance, I might have a list of 'flu cases reported, each with a date attached. I can get day-by-day counts:

proc summary data=flu_reports nway;

class day_reported;

output out=flu_reports_by_date;

quit;

I can then merge onto a list of other illnesses reported (assumed to cover all days of the year):

data all_illnesses_by_date;

merge illnesses_except_flu flu_reports_by_date;

by day_reported;

rename _freq_=flu_cases;

run;

If there were no flu cases reported on April 1, then flu_reports_by_date will have no entry for April 1. When I merge it to create all_illnesses_by_date it will have a missing value for flu - but that should be treated as zero. Ideally it'd be changed to zero before calculating averages, but if not, then the missing entry needs to be included in the count when calculating an average.

Reeza
Super User

I agree with GeoffreyBent, it depends on the context of the missing variables.

Kurt_Bremser
Super User

I like that, but in such cases where missing is "legally zero" I prefer to avoid missing values by doing it like that:

data all_illnesses_by_date;

merge illnesses_except_flu flu_reports_by_date (in=in_days);

by day_reported;

rename _freq_=flu_cases;

if not in_days then flu_cases = 0;

run;

GeoffreyBrent
Calcite | Level 5

I agree that it's best to convert missing values to zeroes where that's what they represent, but that code won't give the result you're expecting.

RENAME statements are executed after the body of the DATA step is run. So your code will create a new variable named "flu_cases" (which will be zero for cases which are missing from flu_reports_by_date, and missing for all others) before attempting to rename _freq_ to flu_cases. This will generate a warning ("Variable flu_cases already exists") and as far as I can tell, you end up with the same values that were in _freq_, complete with missingness.

Probably safer to do the renaming up in the MERGE statement so you can use the new name through the rest of the data step - I just couldn't remember the exact syntax and didn't have time to look it up.

GeoffreyBrent
Calcite | Level 5

Yeah, renaming as a data set option is usually a more intuitive way to go, I just got lazy in my demo code 🙂

(When I first worked in SAS, nobody mentioned to me that data step commands aren't all executed in order of appearance, which made the learning process... interesting.)

Ksharp
Super User

It depends on what you want. SQL defaultly will neglect the missing value when using avg() or mean(). if that is not you suppose, you can change these missing value into zero .

data have;
 input x @@;
cards;
3 4 5 . . 4 5 4
;
run;
proc stdize data=have out=want reponly missing=0;
var x;
run;

Ksharp

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1628 views
  • 4 likes
  • 7 in conversation