Solved: Replacing Missing values with average

georgel · Posted 01-03-2021 04:45 AM

Hi to all , Wish you Happy New Year 2021!

I would like to replace missing values with average and the rest that have missing till the end to leave them as are.

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;

data want;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   7.15
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   3.65
10343453 Astra 05/04/2011 main   3.65
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  6.35
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .

;
run;

Many thanks in advance

mkeintz · Posted 01-03-2021 01:04 PM

You need code to capture the most recent non-missing and the upcoming non-missing. You can do this by passing through all the records for a given cusip_id and storing them in an array. Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.

Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0    /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
run;


data want (drop=_:);
  array rets {30} _temporary_;
  call missing(of rets{*});
  do _i=1 by 1 until (last.cusip_id);
    set have;
    by cusip_id;
    if ret^=. then do;
      _mean_ret=mean(lag(ret),ret);
      if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
	    rets{_j}=_mean_ret;
	  end;
    end;
  end;

  do _i=1 by 1 until (last.cusip_id);
    set have;
	by cusip_id;
    if missing(ret) then ret=rets{_i};
    output;
  end;
run;

A note:

The mean(lag(ret),ret) only retrieves the lagged value when the current ret is not missing. This means that the lag(ret) will also always be non-missing (except at the beginning). That's because the LAG function is actually a fifo queue, which is only updated when ret is not missing.

Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

novinosrin · Posted 01-03-2021 12:06 PM

Hi @georgel Can you please explain average of what?

1. Is it average of prior and after non missing value?

2. If yes, how did you get

10343453 Astra 04/04/2011 main   5.15
10343453 Astra 05/04/2011 main   5.15

georgel · Posted 01-03-2021 12:23 PM

Hi novinosrin,

1. Yes . 2. (4.7+2.6)/2=3.65 I made a mistake in my initial table.

Many thanks in advance

Regards

George

mkeintz · Posted 01-03-2021 12:32 PM

Let's clarify what you are asking for.

You apparently want to replace only missing values that fall into a gap between non-missing values of ret.

At first, I thought you wanted to merely assign the mid-point between the closest preceding and closest following non-missing values. That would work for rows 4 (gets ret=7.15 from 6.6 and 7.7) and 15 (6.35 from 4.7 and 8). But it doesn't explain rows 12 and 13 (they get 5.15 - but are preceded by 2.6 and followed by 4.7).

Please explain.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

georgel · Posted 01-03-2021 12:36 PM

Hi Keintz

I have corrected my initial post to be the average 3.65

Thanks

LeonidBatkhan · Posted 01-03-2021 12:49 PM

Hi georgel,

Here is one possible solution (I am not sure if it is the most efficient one):

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;

proc sort data=HAVE;
   by cusip_id;
run;

data PASS1;
   set HAVE;
   by cusip_id;
   retain R1;
   Order = _n_;
   if first.cusip_id then R1 = .;
   if Ret ne . then R1 = Ret;
run;

proc sort data=PASS1 out=PASS2;
   by descending Order;
run;

data PASS3;
   set PASS2;
   by descending cusip_id;
   retain R2;
   if first.cusip_id then R2 = .;
   if Ret ne . then R2 = Ret;
run;

proc sort data=PASS3 out=PASS4;
   by Order;
run;

data WANT (drop=R1 R2 Order);
   set PASS4;
   if (Ret=.) and (R1 ne .) and (R2 ne .) then Ret = (R1+R2)/2;
run;

Please note, that I changed one line in your raw data to be

10343453 Astra 08/04/2017 main 8.0

If you have 8 instead of 8.0 your data will be read in as 0.8 .

Hope this helps.

➤ Leonid's SAS blog

novinosrin · Posted 01-03-2021 12:57 PM


data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;


data want;
 if _n_ = 1 then do ;
   dcl hash h() ;
   h.definekey ("_n") ;
   h.definedata ("ret" ) ;
   h.definedone () ;
 end ; 
 do __n=1 by 1 until(last.name);
  set have;
  by cusip_id name notsorted;
  if ret=. and not _k then  _k=__n;
  if ret then do;
   if _k>. then if __n-_k>=1 then do;
     _ret=mean(ret,_ret);
     do _n=_k to __n-1;
	  h.add(key:_n,data:_ret);
	 end;
	 _k=.;
   end;
   _ret=ret;
  end;
 end;
 do __n=1 to __n;
  set have;
  _n_=h.find(key:__n);
  output;
 end;
 h.clear();
 drop _:;
run;

georgel · Posted 01-03-2021 01:13 PM

Many thanks novinosrin. I have tried it but one observation has been removed

novinosrin · Posted 01-03-2021 01:18 PM

Oh sorry, which one did it drop or remove?

georgel · Posted 01-03-2021 01:26 PM

novinosrin, Accept my apologies I rerun and it works perfectly!.

I admit that your solution is more efficient since the accepted solution I have to resize the matrix

Perfect work! Thanks a million novinosrin

Regards

George

novinosrin · Posted 01-03-2021 01:28 PM

No my solution is not efficient than @mkeintz 's solution. Temp ARRAY is much faster. As a matter of fact, I don't think any of my solutions can come close to Mark's and in fact he was the one who taught me many stuff in SAS over the last 5 years. Oh gosh, Mark must have gotten much older lol hahaha

georgel · Posted 01-03-2021 01:15 PM

Thanks Leonid

mkeintz · Posted 01-03-2021 01:04 PM

You need code to capture the most recent non-missing and the upcoming non-missing. You can do this by passing through all the records for a given cusip_id and storing them in an array. Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.

Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0    /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
run;


data want (drop=_:);
  array rets {30} _temporary_;
  call missing(of rets{*});
  do _i=1 by 1 until (last.cusip_id);
    set have;
    by cusip_id;
    if ret^=. then do;
      _mean_ret=mean(lag(ret),ret);
      if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
	    rets{_j}=_mean_ret;
	  end;
    end;
  end;

  do _i=1 by 1 until (last.cusip_id);
    set have;
	by cusip_id;
    if missing(ret) then ret=rets{_i};
    output;
  end;
run;

A note:

The mean(lag(ret),ret) only retrieves the lagged value when the current ret is not missing. This means that the lag(ret) will also always be non-missing (except at the beginning). That's because the LAG function is actually a fifo queue, which is only updated when ret is not missing.

Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

georgel · Posted 01-03-2021 01:11 PM

Impressive! Thanks a million Keintz

mkeintz · Posted 01-03-2021 02:10 PM

Question: why are your data not ordered chronologically? For instance the 3rd-5th rows are as below (note dates are dd/mm/yyyy):

10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7

So you are estimating data for 4/4/2011 (04apr2011) from 03apr2012 and 05april2017. Really? 2011 from 2012 and 2017?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Re: Replacing Missing values with average

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away