DATA Step, Macro, Functions and more

How to exclude the blank OBS while calculating avg values

Reply
Super Contributor
Posts: 272

How to exclude the blank OBS while calculating avg values

Dear,

 

I am using the following code to get the avg values. I got the out put I need but please suggest any other function I can use.

 

Without this statement   "if bp ^=. then n+1" I am getting the output "avg=55.33" . With this statement added I got the out put I need avg=83. Please suggest any other soloution. Thank you

 

data one;
input id bp;
datalines;
1 80
1 86
1 .
;

 

data two;
set one;
by id;

vsum+bp;
if bp ^=. then n+1;
if first.id then do;
n=1; vsum=bp;
end;
output;
if last.id then do;
test='AVG';
bp=vsum/n;

output;
end;
run;

PROC Star
Posts: 552

Re: How to exclude the blank OBS while calculating avg values

[ Edited ]

So basically you want the easiest way to get the mean value of the values in BP with missing values ignored right? Smiley Happy

 

If this is the case, simply do

 

proc means data = one mean;
   var bp;
run;
Super Contributor
Posts: 272

Re: How to exclude the blank OBS while calculating avg values

yes

PROC Star
Posts: 552

Re: How to exclude the blank OBS while calculating avg values

Look at my edited reply Smiley Happy That will give you avg = 83. Do you want a data step solution or does that work for you?

Super Contributor
Posts: 272

Re: How to exclude the blank OBS while calculating avg values

Thanks for the code. But I need to use my code as there are sveral other variables to be populated for this avg obs . Please suggest in my code.

 

eg: for my data output needed;

 

id    bp          test

1     86           db

1     80          db

1      .            db

1      83          dbavg

PROC Star
Posts: 552

Re: How to exclude the blank OBS while calculating avg values

Ah ok. In that case, @mkeintzs code is pretty solid Smiley Happy

Super User
Posts: 10,500

Re: How to exclude the blank OBS while calculating avg values


knveraraju91 wrote:

Thanks for the code. But I need to use my code as there are sveral other variables to be populated for this avg obs . Please suggest in my code.

 

eg: for my data output needed;

 

id    bp          test

1     86           db

1     80          db

1      .            db

1      83          dbavg


If the sole purpose of this is for a report you may find that this can be done in a procedure such as Proc report.

If you create data sets where you mix values with mean/total values in the same column then you should 1) have a flag on the record somewhere to indicate the row is a summary or not and 2) make sure there is some sort of warning about the data set. Otherwise that average/total or other summary value may be accidentaly used when not intended. Consider the case of you are developing your process and are using:

 

Data have;

    set have;

<your accumulator>

run;

and then decide you need to do this with another variable. So you run:

Data have;

    set have;

<your accumulator>

<new accumulator code>

run;

 

Your accumulated value from the first pass has now been used to create a new accumulated value.

And yes, I have seen this exact thing happen in real world data and had to reconcile why final results didn't match intermediate results.

Valued Guide
Posts: 797

Re: How to exclude the blank OBS while calculating avg values

[ Edited ]

Of course, this is what proc means/proc summary is design to do for you.

 

But if you want to do it in a data step, just don't read in cases with missing bp values.  Then you can automatically count all the other vars:

 

data two;
   set one;
   by id;
   where bp^=.;

   if first.id then vsum=0;
   vsum+bp;

   if last.id then do;
      bp=vsum/ ifn(lag(_N_)=.,_n_,dif(_n_));
      test='avg';
      output;
   end;
run;

 

Notes:

  1. There is no need to actually generate an N variables for the divisor.  Instead, at the end of each ID group, you can use the function
        ifn(lag(_N_)=.,_N_,dif(_N_))
    which says if the first argument is true, return the 2nd arg, otherwsie return the 3rd arg. 

    1. The 1st arg    "lag(_N_)=."  asks whether this is effectively the first time the lag queue has been updated (in which case the lagged value of _N_ is missing.
    2. If this is the end of the first id group, the second argument returns the current observation number (counting only those that passed the WHERE test).
    3. But if this is the end of later id groups, the third argument returns the difference between the current _N_ and the last time the dif function was executed.  Since these are executed only at the end of each id group, the dif function is the number of observations between the end of the last id group and the curent id group.

Editted addition:  I didn't realize you wanted to keep the individual observations as well as the average, in which case:

 

data two (drop=vsum);

  set one;

  by id;

  where bp^=.;

  output;

 

  if first.id then vsum=0;

  vsum+bp;

 

  if last.id then do;

    bp=vsum/ifn(lag(_N_)=.,_n_,dif(_n_));

    test=cats('AVG',test);  *generates "AVGDB" if the test var is long enough and test="DB"*;

    output;

 end;

run;

 

No this is not particularly good if your are getting the average of multiple variables, which might have differing patterns of missing values.

Super Contributor
Posts: 272

Re: How to exclude the blank OBS while calculating avg values

Hi,

 

Your code working but it gave output with  last OBS only. I added output statement in the code, but it missed the OBS with missing value.

 

data two;
set one;
by id;
where bp^=.;

if first.id then vsum=0;
vsum+bp;
output;

if last.id then do;
bp=vsum/ ifn(lag(_N_)=.,_n_,dif(_n_));
test='avg';
output;
end;
run;

 

output getting:

 

id    bp          test

1     86           db

1     80          db

1      83          dbavg

 

output needed;

 

id    bp          test

1     86           db

1     80          db

1      .            db

1      83          dbavg

 

 

Valued Guide
Posts: 797

Re: How to exclude the blank OBS while calculating avg values

Ah you want to keep the missing observations too.  Then you'll probably have to revert to explicit tracking of the N value.

 

data two (drop=vsum n);
  do until (last.id);
    set one;
    by id;
    output;
    vsum=sum(vsum,bp);
    n=sum(n,n(bp));
 end;

 test=cats('AVG',test);
 bp=vsum/n;
 output;
run;

 

 

The N function, and is N(bp) returns a zero if bp is missing, a one otherwise.  It counts the number of non-missing values in the argument - as in    N(x,y,z).

Super User
Posts: 9,681

Re: How to exclude the blank OBS while calculating avg values

Holding skill would be help.

 

data one;
input id bp;
datalines;
1 80
1 86
1 .
2 80
2 86
2 .
;
 
data two;
set one;
by id;
array x{99999} _temporary_;
if first.id then do; n=0; call missing(of x{*});end;
n+1;
x{n}=bp;
output;
if last.id then do;
 test='AVG';
 bp=mean(of x{*});
 output;
end;
run;
Valued Guide
Posts: 797

Re: How to exclude the blank OBS while calculating avg values

Yes, ... but ...

 

Every calculation requires clearing 99999 values ("call missing of x{*}"), and later scanning 99999 values ("mean(of x{*}") - even if 98999 of them are missing.  With a large dataset, where the 99999 is used as protection against a large id group, it can cost in comparison to keeping a running sum.

 

I think the use the temporary arrays like this is probably more advisable when complex manipulation of the values would be needed, say like finding quantiles.

Ask a Question
Discussion stats
  • 11 replies
  • 216 views
  • 4 likes
  • 5 in conversation