Solved: Re: Replacing Missing values with average - Page 2

georgel · Posted 01-03-2021 02:54 PM

Dear Mark,

Thank you very much for paying attention to details!. This is a great support from you and I highly appreciated it.

No, this is just a random sample non-real one!! I have my real data. I havent notice this!. Basically to telll the story. I want to have benchmarks for each of my own set of companies all of them from US but I find that some of my benchmarks have only one missing return. The algortihm that I created it finds another benchmark-company only for the missing day and I dont want this. (real example from CRSP: WRDS:20030815 CUSIP: 88800230 Missing Return )

Regards

George

mkeintz · Posted 01-03-2021 06:16 PM

If your data are already sorted, then there is a somewhat cleaner way to replace internal missing values:

data retmeans (keep=cusip_id date_issue _midpoint) / view=retmeans;
  merge have (where=(ret^=.)                 keep=cusip_id ret date_issue)
        have (where=(_nxt_ret^=.) firstobs=2 keep=cusip_id ret rename=(cusip_id=_nxt_cusip ret=_nxt_ret));
  if cusip_id=_nxt_cusip then _midpoint=mean(ret,_nxt_ret);
run;

data want (drop=_:);
  merge have  retmeans (keep=cusip_id date_issue in=inmid);
  by cusip_id date_issue;
  if inmid then set retmeans ; /* Get _MIDPOINT */
  if ret=. then ret=_midpoint;
run;

Note the data set RETMEANS is a data set view, not a data set file. That means it will not be written to disk, and that it will only be implemented when called for in a later data step. Think of it as a synchronized process that is run only when used in the DATA WANT step. So while the program is structed in two steps, there is only one step that actually causes data processing - i.e. it's efficient.

The retmeans data set will have one record per non-missing RET. It will contain the id, date and the _MIDPOINT, the mean of the current RET and the following RET (note the FIRSTOBS=2). Except for the last non-missing RET of each ID where _MIDPOINT will be left missing.

The second data step refers to RETMEANS once in a MERGE statement (excluding the _MIDPOINT variable) and once the SET statement. Any variable in the (unconditional) MERGE statement will be replaced with ever iteration of the data step. But any variable associated only with the (conditional) SET statement will not be replaced/reset until the next execution of the set statement. So _MIDPOINT will be retained and available for all dates between two non-missing RET's.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

georgel · Posted 01-04-2021 12:39 PM

Dear Mark,

Thank you very much for your great support and explanations.

Regards

George

Ksharp · Posted 01-04-2021 07:04 AM

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret ;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;
data want;
if 0 then set have;
retain lag_ret .;
do until(last.cusip_id or not missing(ret));
 set have;
 by cusip_id;
 if first.cusip_id then call missing(lag_ret);
end;
mean=(lag_ret+ret)/2;
do until(last.cusip_id or not missing(ret));
 set have;
 by cusip_id;
 new_ret=coalesce(ret,mean);
 output;
end;
lag_ret=ret;
drop mean lag_ret ret;
run;

Rick_SAS · Posted 01-04-2021 01:37 PM

Just use PROC STDIZE, as described in this article. But also be aware that there are disadvantages to replacing missing values by the mean of the nonmissing values.

mkeintz · Posted 01-04-2021 02:35 PM

@Rick_SAS

I think the OP is not really asking for a within-id-group mean. The request is to get the mean of the nearest-preceding and nearest-following valid values. I don't think PROC STDIZE can handle that.

But I do like the comments on the advisability of mean-imputation.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Rick_SAS · Posted 01-04-2021 02:41 PM

Thanks. I did not read the thread very carefully, so I appreciate the correction. My apologies. Best wishes.

SAS Innovate 2025: Call for Content

Classroom Training Available!