Hi to all , Wish you Happy New Year 2021!
I would like to replace missing values with average and the rest that have missing till the end to leave them as are.
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
data want;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main 7.15
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main 3.65
10343453 Astra 05/04/2011 main 3.65
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main 6.35
10343453 Astra 08/04/2017 main 8
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
Many thanks in advance
You need code to capture the most recent non-missing and the upcoming non-missing. You can do this by passing through all the records for a given cusip_id and storing them in an array. Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.
Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8.0 /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
run;
data want (drop=_:);
array rets {30} _temporary_;
call missing(of rets{*});
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if ret^=. then do;
_mean_ret=mean(lag(ret),ret);
if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
rets{_j}=_mean_ret;
end;
end;
end;
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if missing(ret) then ret=rets{_i};
output;
end;
run;
A note:
Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.
Hi @georgel Can you please explain average of what?
1. Is it average of prior and after non missing value?
2. If yes, how did you get
10343453 Astra 04/04/2011 main 5.15
10343453 Astra 05/04/2011 main 5.15
Hi novinosrin,
1. Yes . 2. (4.7+2.6)/2=3.65 I made a mistake in my initial table.
Many thanks in advance
Regards
George
Let's clarify what you are asking for.
You apparently want to replace only missing values that fall into a gap between non-missing values of ret.
At first, I thought you wanted to merely assign the mid-point between the closest preceding and closest following non-missing values. That would work for rows 4 (gets ret=7.15 from 6.6 and 7.7) and 15 (6.35 from 4.7 and 8). But it doesn't explain rows 12 and 13 (they get 5.15 - but are preceded by 2.6 and followed by 4.7).
Please explain.
Hi Keintz
I have corrected my initial post to be the average 3.65
Thanks
Hi georgel,
Here is one possible solution (I am not sure if it is the most efficient one):
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8.0
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
proc sort data=HAVE;
by cusip_id;
run;
data PASS1;
set HAVE;
by cusip_id;
retain R1;
Order = _n_;
if first.cusip_id then R1 = .;
if Ret ne . then R1 = Ret;
run;
proc sort data=PASS1 out=PASS2;
by descending Order;
run;
data PASS3;
set PASS2;
by descending cusip_id;
retain R2;
if first.cusip_id then R2 = .;
if Ret ne . then R2 = Ret;
run;
proc sort data=PASS3 out=PASS4;
by Order;
run;
data WANT (drop=R1 R2 Order);
set PASS4;
if (Ret=.) and (R1 ne .) and (R2 ne .) then Ret = (R1+R2)/2;
run;
Please note, that I changed one line in your raw data to be
10343453 Astra 08/04/2017 main 8.0
If you have 8 instead of 8.0 your data will be read in as 0.8 .
Hope this helps.
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
data want;
if _n_ = 1 then do ;
dcl hash h() ;
h.definekey ("_n") ;
h.definedata ("ret" ) ;
h.definedone () ;
end ;
do __n=1 by 1 until(last.name);
set have;
by cusip_id name notsorted;
if ret=. and not _k then _k=__n;
if ret then do;
if _k>. then if __n-_k>=1 then do;
_ret=mean(ret,_ret);
do _n=_k to __n-1;
h.add(key:_n,data:_ret);
end;
_k=.;
end;
_ret=ret;
end;
end;
do __n=1 to __n;
set have;
_n_=h.find(key:__n);
output;
end;
h.clear();
drop _:;
run;
Many thanks novinosrin. I have tried it but one observation has been removed
Oh sorry, which one did it drop or remove?
novinosrin, Accept my apologies I rerun and it works perfectly!.
I admit that your solution is more efficient since the accepted solution I have to resize the matrix
Perfect work! Thanks a million novinosrin
Regards
George
No my solution is not efficient than @mkeintz 's solution. Temp ARRAY is much faster. As a matter of fact, I don't think any of my solutions can come close to Mark's and in fact he was the one who taught me many stuff in SAS over the last 5 years. Oh gosh, Mark must have gotten much older lol hahaha
Thanks Leonid
You need code to capture the most recent non-missing and the upcoming non-missing. You can do this by passing through all the records for a given cusip_id and storing them in an array. Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.
Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8.0 /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
run;
data want (drop=_:);
array rets {30} _temporary_;
call missing(of rets{*});
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if ret^=. then do;
_mean_ret=mean(lag(ret),ret);
if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
rets{_j}=_mean_ret;
end;
end;
end;
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if missing(ret) then ret=rets{_i};
output;
end;
run;
A note:
Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.
Impressive! Thanks a million Keintz
Question: why are your data not ordered chronologically? For instance the 3rd-5th rows are as below (note dates are dd/mm/yyyy):
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
So you are estimating data for 4/4/2011 (04apr2011) from 03apr2012 and 05april2017. Really? 2011 from 2012 and 2017?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.