- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi to all , Wish you Happy New Year 2021!
I would like to replace missing values with average and the rest that have missing till the end to leave them as are.
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
data want;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main 7.15
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main 3.65
10343453 Astra 05/04/2011 main 3.65
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main 6.35
10343453 Astra 08/04/2017 main 8
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
Many thanks in advance
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You need code to capture the most recent non-missing and the upcoming non-missing. You can do this by passing through all the records for a given cusip_id and storing them in an array. Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.
Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8.0 /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
run;
data want (drop=_:);
array rets {30} _temporary_;
call missing(of rets{*});
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if ret^=. then do;
_mean_ret=mean(lag(ret),ret);
if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
rets{_j}=_mean_ret;
end;
end;
end;
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if missing(ret) then ret=rets{_i};
output;
end;
run;
A note:
- The mean(lag(ret),ret) only retrieves the lagged value when the current ret is not missing. This means that the lag(ret) will also always be non-missing (except at the beginning). That's because the LAG function is actually a fifo queue, which is only updated when ret is not missing.
Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @georgel Can you please explain average of what?
1. Is it average of prior and after non missing value?
2. If yes, how did you get
10343453 Astra 04/04/2011 main 5.15
10343453 Astra 05/04/2011 main 5.15
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi novinosrin,
1. Yes . 2. (4.7+2.6)/2=3.65 I made a mistake in my initial table.
Many thanks in advance
Regards
George
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Let's clarify what you are asking for.
You apparently want to replace only missing values that fall into a gap between non-missing values of ret.
At first, I thought you wanted to merely assign the mid-point between the closest preceding and closest following non-missing values. That would work for rows 4 (gets ret=7.15 from 6.6 and 7.7) and 15 (6.35 from 4.7 and 8). But it doesn't explain rows 12 and 13 (they get 5.15 - but are preceded by 2.6 and followed by 4.7).
Please explain.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Keintz
I have corrected my initial post to be the average 3.65
Thanks
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi georgel,
Here is one possible solution (I am not sure if it is the most efficient one):
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8.0
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
proc sort data=HAVE;
by cusip_id;
run;
data PASS1;
set HAVE;
by cusip_id;
retain R1;
Order = _n_;
if first.cusip_id then R1 = .;
if Ret ne . then R1 = Ret;
run;
proc sort data=PASS1 out=PASS2;
by descending Order;
run;
data PASS3;
set PASS2;
by descending cusip_id;
retain R2;
if first.cusip_id then R2 = .;
if Ret ne . then R2 = Ret;
run;
proc sort data=PASS3 out=PASS4;
by Order;
run;
data WANT (drop=R1 R2 Order);
set PASS4;
if (Ret=.) and (R1 ne .) and (R2 ne .) then Ret = (R1+R2)/2;
run;
Please note, that I changed one line in your raw data to be
10343453 Astra 08/04/2017 main 8.0
If you have 8 instead of 8.0 your data will be read in as 0.8 .
Hope this helps.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
;
run;
data want;
if _n_ = 1 then do ;
dcl hash h() ;
h.definekey ("_n") ;
h.definedata ("ret" ) ;
h.definedone () ;
end ;
do __n=1 by 1 until(last.name);
set have;
by cusip_id name notsorted;
if ret=. and not _k then _k=__n;
if ret then do;
if _k>. then if __n-_k>=1 then do;
_ret=mean(ret,_ret);
do _n=_k to __n-1;
h.add(key:_n,data:_ret);
end;
_k=.;
end;
_ret=ret;
end;
end;
do __n=1 to __n;
set have;
_n_=h.find(key:__n);
output;
end;
h.clear();
drop _:;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Many thanks novinosrin. I have tried it but one observation has been removed
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Oh sorry, which one did it drop or remove?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
novinosrin, Accept my apologies I rerun and it works perfectly!.
I admit that your solution is more efficient since the accepted solution I have to resize the matrix
Perfect work! Thanks a million novinosrin
Regards
George
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
No my solution is not efficient than @mkeintz 's solution. Temp ARRAY is much faster. As a matter of fact, I don't think any of my solutions can come close to Mark's and in fact he was the one who taught me many stuff in SAS over the last 5 years. Oh gosh, Mark must have gotten much older lol hahaha
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Leonid
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You need code to capture the most recent non-missing and the upcoming non-missing. You can do this by passing through all the records for a given cusip_id and storing them in an array. Whenever a non-missing is encountered, fill in any preceding element of the array that are missing using the average of the current and most recent non-missing.
Then re-read the same cusip_id. whenever a missing ret if found, retrieve values from the array:
data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main 3.2
10343452 Zilas 02/04/2014 main 4.5
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
10343452 Zilas 06/04/2017 main .
10343452 Zilas 07/04/2017 main .
10343452 Zilas 08/04/2017 main .
10343453 Astra 01/04/2016 main 6.2
10343453 Astra 02/04/2014 main 5.5
10343453 Astra 03/04/2012 main 2.6
10343453 Astra 04/04/2011 main .
10343453 Astra 05/04/2011 main .
10343453 Astra 06/04/2017 main 4.7
10343453 Astra 07/04/2017 main .
10343453 Astra 08/04/2017 main 8.0 /*changed 8 to 8.0 to accomodate dollar8.1 format*/
10343453 Astra 09/04/2017 main .
10343453 Astra 10/04/2017 main .
10343453 Astra 11/04/2017 main .
10343453 Astra 12/04/2017 main .
10343453 Astra 13/04/2017 main .
run;
data want (drop=_:);
array rets {30} _temporary_;
call missing(of rets{*});
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if ret^=. then do;
_mean_ret=mean(lag(ret),ret);
if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
rets{_j}=_mean_ret;
end;
end;
end;
do _i=1 by 1 until (last.cusip_id);
set have;
by cusip_id;
if missing(ret) then ret=rets{_i};
output;
end;
run;
A note:
- The mean(lag(ret),ret) only retrieves the lagged value when the current ret is not missing. This means that the lag(ret) will also always be non-missing (except at the beginning). That's because the LAG function is actually a fifo queue, which is only updated when ret is not missing.
Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Impressive! Thanks a million Keintz
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Question: why are your data not ordered chronologically? For instance the 3rd-5th rows are as below (note dates are dd/mm/yyyy):
10343452 Zilas 03/04/2012 main 6.6
10343452 Zilas 04/04/2011 main .
10343452 Zilas 05/04/2017 main 7.7
So you are estimating data for 4/4/2011 (04apr2011) from 03apr2012 and 05april2017. Really? 2011 from 2012 and 2017?
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------