SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
georgel
Quartz | Level 8

Hi to all , Wish you Happy New Year 2021!

I would like to replace missing values with average and the rest that have missing till the end to leave them as are. 

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;

data want;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   7.15
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   3.65
10343453 Astra 05/04/2011 main   3.65
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  6.35
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .

;
run;

Many thanks in advance 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You need code to capture the most recent non-missing and the upcoming non-missing.  You can do this by passing through all the records for a given cusip_id and storing them in an array.  Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.

 

Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:

 

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0    /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
run;


data want (drop=_:);
  array rets {30} _temporary_;
  call missing(of rets{*});
  do _i=1 by 1 until (last.cusip_id);
    set have;
    by cusip_id;
    if ret^=. then do;
      _mean_ret=mean(lag(ret),ret);
      if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
	    rets{_j}=_mean_ret;
	  end;
    end;
  end;

  do _i=1 by 1 until (last.cusip_id);
    set have;
	by cusip_id;
    if missing(ret) then ret=rets{_i};
    output;
  end;
run;

A note:

  1. The mean(lag(ret),ret) only retrieves the lagged value when the current ret is not missing.  This means that the lag(ret) will also always be non-missing (except at the beginning).  That's because the LAG function is actually a fifo queue, which is only updated when ret is not missing.

Also, make the array size at least as large as the most populated CUSIP_ID.  I used 30 above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

21 REPLIES 21
novinosrin
Tourmaline | Level 20

Hi @georgel  Can you please explain average of what?

 

1. Is it average of prior and after non missing value?

2. If yes, how did you get 

10343453 Astra 04/04/2011 main   5.15
10343453 Astra 05/04/2011 main   5.15
georgel
Quartz | Level 8

Hi novinosrin, 

 

1.  Yes . 2. (4.7+2.6)/2=3.65 I made a mistake in my initial table. 

 

Many thanks in advance

 

Regards

George

mkeintz
PROC Star

Let's clarify what you are asking for.

 

You apparently want to replace only missing values that fall into a gap between non-missing values of ret.

 

At first, I thought you wanted to merely assign the mid-point between the closest preceding and closest following non-missing values.  That would work for rows 4 (gets ret=7.15 from 6.6 and 7.7)  and 15 (6.35 from 4.7 and 8).  But it doesn't explain rows 12 and 13 (they get 5.15 - but are preceded by 2.6 and followed by 4.7).

 

Please explain.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
georgel
Quartz | Level 8

Hi Keintz

 

I have corrected my initial post to be the average 3.65

 

Thanks

 

LeonidBatkhan
Lapis Lazuli | Level 10

Hi georgel,

Here is one possible solution (I am not sure if it is the most efficient one):

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;

proc sort data=HAVE;
   by cusip_id;
run;

data PASS1;
   set HAVE;
   by cusip_id;
   retain R1;
   Order = _n_;
   if first.cusip_id then R1 = .;
   if Ret ne . then R1 = Ret;
run;

proc sort data=PASS1 out=PASS2;
   by descending Order;
run;

data PASS3;
   set PASS2;
   by descending cusip_id;
   retain R2;
   if first.cusip_id then R2 = .;
   if Ret ne . then R2 = Ret;
run;

proc sort data=PASS3 out=PASS4;
   by Order;
run;

data WANT (drop=R1 R2 Order);
   set PASS4;
   if (Ret=.) and (R1 ne .) and (R2 ne .) then Ret = (R1+R2)/2;
run;

Please note, that I changed one line in your raw data to be

10343453 Astra 08/04/2017 main 8.0

If you have 8 instead of 8.0 your data will be read in as 0.8 .

Hope this helps.

novinosrin
Tourmaline | Level 20

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;


data want;
 if _n_ = 1 then do ;
   dcl hash h() ;
   h.definekey ("_n") ;
   h.definedata ("ret" ) ;
   h.definedone () ;
 end ; 
 do __n=1 by 1 until(last.name);
  set have;
  by cusip_id name notsorted;
  if ret=. and not _k then  _k=__n;
  if ret then do;
   if _k>. then if __n-_k>=1 then do;
     _ret=mean(ret,_ret);
     do _n=_k to __n-1;
	  h.add(key:_n,data:_ret);
	 end;
	 _k=.;
   end;
   _ret=ret;
  end;
 end;
 do __n=1 to __n;
  set have;
  _n_=h.find(key:__n);
  output;
 end;
 h.clear();
 drop _:;
run;
georgel
Quartz | Level 8

Many thanks novinosrin. I have tried it but one observation has been removed

novinosrin
Tourmaline | Level 20

Oh sorry, which one did it drop or remove?

georgel
Quartz | Level 8

novinosrin, Accept my apologies I rerun and it works perfectly!. 

I admit that your solution is more efficient since  the accepted solution I have to resize the matrix

Perfect work! Thanks a million novinosrin

 

Regards

George

 

 

novinosrin
Tourmaline | Level 20

No my solution is not efficient than @mkeintz  's solution. Temp ARRAY is much faster. As a matter of fact, I don't think any of my solutions can come close to Mark's and in fact he was the one who taught me many stuff in SAS over the last 5 years. Oh gosh, Mark must have gotten much older lol hahaha

georgel
Quartz | Level 8

Thanks Leonid

mkeintz
PROC Star

You need code to capture the most recent non-missing and the upcoming non-missing.  You can do this by passing through all the records for a given cusip_id and storing them in an array.  Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.

 

Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:

 

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0    /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
run;


data want (drop=_:);
  array rets {30} _temporary_;
  call missing(of rets{*});
  do _i=1 by 1 until (last.cusip_id);
    set have;
    by cusip_id;
    if ret^=. then do;
      _mean_ret=mean(lag(ret),ret);
      if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
	    rets{_j}=_mean_ret;
	  end;
    end;
  end;

  do _i=1 by 1 until (last.cusip_id);
    set have;
	by cusip_id;
    if missing(ret) then ret=rets{_i};
    output;
  end;
run;

A note:

  1. The mean(lag(ret),ret) only retrieves the lagged value when the current ret is not missing.  This means that the lag(ret) will also always be non-missing (except at the beginning).  That's because the LAG function is actually a fifo queue, which is only updated when ret is not missing.

Also, make the array size at least as large as the most populated CUSIP_ID.  I used 30 above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
georgel
Quartz | Level 8

Impressive! Thanks a million Keintz

mkeintz
PROC Star

Question:  why are your data not ordered chronologically?  For instance the 3rd-5th rows are as below (note dates are dd/mm/yyyy):

 

10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7

 

So you are estimating data for 4/4/2011  (04apr2011) from   03apr2012  and 05april2017.     Really?   2011 from 2012 and 2017?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 2631 views
  • 10 likes
  • 6 in conversation