Re: How to exclude the blank OBS while calculating avg values

knveraraju91 · Posted 02-14-2017 11:21 AM

Dear,

I am using the following code to get the avg values. I got the out put I need but please suggest any other function I can use.

Without this statement "if bp ^=. then n+1" I am getting the output "avg=55.33" . With this statement added I got the out put I need avg=83. Please suggest any other soloution. Thank you

data one;
input id bp;
datalines;
1 80
1 86
1 .
;

data two;
set one;
by id;

vsum+bp;
if bp ^=. then n+1;
if first.id then do;
n=1; vsum=bp;
end;
output;
if last.id then do;
test='AVG';
bp=vsum/n;

output;
end;
run;

PeterClemmensen · Posted 02-14-2017 11:31 AM

So basically you want the easiest way to get the mean value of the values in BP with missing values ignored right? 🙂

If this is the case, simply do

proc means data = one mean;
   var bp;
run;

The DATA to DATA Step Macro
Blog: SASnrd

knveraraju91 · Posted 02-14-2017 11:32 AM

yes

PeterClemmensen · Posted 02-14-2017 11:33 AM

Look at my edited reply 🙂 That will give you avg = 83. Do you want a data step solution or does that work for you?

The DATA to DATA Step Macro
Blog: SASnrd

knveraraju91 · Posted 02-14-2017 11:43 AM

Thanks for the code. But I need to use my code as there are sveral other variables to be populated for this avg obs . Please suggest in my code.

eg: for my data output needed;

id bp test

1 86 db

1 80 db

1 . db

1 83 dbavg

PeterClemmensen · Posted 02-14-2017 11:47 AM

Ah ok. In that case, @mkeintzs code is pretty solid 🙂

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 02-14-2017 11:57 AM

@knveraraju91 wrote:

Thanks for the code. But I need to use my code as there are sveral other variables to be populated for this avg obs . Please suggest in my code.

eg: for my data output needed;

id bp test

1 86 db

1 80 db

1 . db

1 83 dbavg

If the sole purpose of this is for a report you may find that this can be done in a procedure such as Proc report.

If you create data sets where you mix values with mean/total values in the same column then you should 1) have a flag on the record somewhere to indicate the row is a summary or not and 2) make sure there is some sort of warning about the data set. Otherwise that average/total or other summary value may be accidentaly used when not intended. Consider the case of you are developing your process and are using:

Data have;

set have;

run;

and then decide you need to do this with another variable. So you run:

Data have;

set have;

run;

Your accumulated value from the first pass has now been used to create a new accumulated value.

And yes, I have seen this exact thing happen in real world data and had to reconcile why final results didn't match intermediate results.

mkeintz · Posted 02-14-2017 11:42 AM

Of course, this is what proc means/proc summary is design to do for you.

But if you want to do it in a data step, just don't read in cases with missing bp values. Then you can automatically count all the other vars:

data two;
   set one;
   by id;
   where bp^=.;

   if first.id then vsum=0;
   vsum+bp;

   if last.id then do;
      bp=vsum/ ifn(lag(_N_)=.,_n_,dif(_n_));
      test='avg';
      output;
   end;
run;

Notes:

There is no need to actually generate an N variables for the divisor. Instead, at the end of each ID group, you can use the function
ifn(lag(_N_)=.,_N_,dif(_N_))
which says if the first argument is true, return the 2nd arg, otherwsie return the 3rd arg.

The 1st arg "lag(_N_)=." asks whether this is effectively the first time the lag queue has been updated (in which case the lagged value of _N_ is missing.
If this is the end of the first id group, the second argument returns the current observation number (counting only those that passed the WHERE test).
But if this is the end of later id groups, the third argument returns the difference between the current _N_ and the last time the dif function was executed. Since these are executed only at the end of each id group, the dif function is the number of observations between the end of the last id group and the curent id group.

Editted addition: I didn't realize you wanted to keep the individual observations as well as the average, in which case:

data two (drop=vsum);

set one;

by id;

where bp^=.;

output;

if first.id then vsum=0;

vsum+bp;

if last.id then do;

bp=vsum/ifn(lag(_N_)=.,_n_,dif(_n_));

test=cats('AVG',test); *generates "AVGDB" if the test var is long enough and test="DB"*;

output;

end;

run;

No this is not particularly good if your are getting the average of multiple variables, which might have differing patterns of missing values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

knveraraju91 · Posted 02-14-2017 11:59 AM

Hi,

Your code working but it gave output with last OBS only. I added output statement in the code, but it missed the OBS with missing value.

data two;
set one;
by id;
where bp^=.;

if first.id then vsum=0;
vsum+bp;
output;

if last.id then do;
bp=vsum/ ifn(lag(_N_)=.,_n_,dif(_n_));
test='avg';
output;
end;
run;

output getting:

id bp test

1 86 db

1 80 db

1 83 dbavg

output needed;

id bp test

1 86 db

1 80 db

1 . db

1 83 dbavg

mkeintz · Posted 02-14-2017 12:08 PM

Ah you want to keep the missing observations too. Then you'll probably have to revert to explicit tracking of the N value.

data two (drop=vsum n);
  do until (last.id);
    set one;
    by id;
    output;
    vsum=sum(vsum,bp);
    n=sum(n,n(bp));
 end;

 test=cats('AVG',test);
 bp=vsum/n;
 output;
run;

The N function, and is N(bp) returns a zero if bp is missing, a one otherwise. It counts the number of non-missing values in the argument - as in N(x,y,z).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 02-14-2017 09:47 PM

Holding skill would be help.

data one;
input id bp;
datalines;
1 80
1 86
1 .
2 80
2 86
2 .
;
 
data two;
set one;
by id;
array x{99999} _temporary_;
if first.id then do; n=0; call missing(of x{*});end;
n+1;
x{n}=bp;
output;
if last.id then do;
 test='AVG';
 bp=mean(of x{*});
 output;
end;
run;

mkeintz · Posted 02-14-2017 10:06 PM

Yes, ... but ...

Every calculation requires clearing 99999 values ("call missing of x{*}"), and later scanning 99999 values ("mean(of x{*}") - even if 98999 of them are missing. With a large dataset, where the 99999 is used as protection against a large id group, it can cost in comparison to keeping a running sum.

I think the use the temporary arrays like this is probably more advisable when complex manipulation of the values would be needed, say like finding quantiles.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away