SAS Programming: Replacing Missing values with average

georgel · Posted 01-03-2021 04:45 AM

Hi all, I wish you a Happy New Year 2021!

I would like to replace each missing value of Ret with the average of the closest non-missing values before and after it, and to leave unchanged the missing values that run to the end of each cusip_id.

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;

data want;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   7.15
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   3.65
10343453 Astra 05/04/2011 main   3.65
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  6.35
10343453 Astra 08/04/2017 main  8
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .

;
run;

Many thanks in advance

mkeintz · Posted 01-03-2021 01:04 PM

You need code to capture the most recent non-missing value and the upcoming non-missing value. You can do this by passing through all the records for a given cusip_id and storing the results in an array. Whenever a non-missing ret is encountered, fill in any preceding elements of the array that are still missing, using the average of the current and most recent non-missing values.

Then re-read the same cusip_id. Whenever a missing ret is found, retrieve its value from the array:

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0    /*changed 8 to 8.0 to accommodate dollar8.1 informat*/
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
run;


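data want (drop=_:);
  /* One slot per row of a cusip_id; size must cover the largest group */
  array rets {30} _temporary_;
  call missing(of rets{*});

  /* Pass 1: store candidate fill values for the current cusip_id */
  do _i=1 by 1 until (last.cusip_id);
    set have;
    by cusip_id;
    if ret^=. then do;
      /* LAG runs only on non-missing rows, so it returns the previous non-missing ret */
      _mean_ret=mean(lag(ret),ret);
      /* Walk back over the slots not yet filled and store the average */
      if _i> 1 then do _j=_i-1 to 1 by -1 while(rets{_j}=.);
        rets{_j}=_mean_ret;
      end;
    end;
  end;

  /* Pass 2: re-read the same cusip_id and fill the missing rets */
  do _i=1 by 1 until (last.cusip_id);
    set have;
    by cusip_id;
    if missing(ret) then ret=rets{_i};
    output;
  end;
run;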

A note:

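The mean(lag(ret),ret) expression calls LAG only when the current ret is not missing. This means that lag(ret) also always returns a non-missing value (except on the very first call, when the queue is still empty). That's because the LAG function is actually a FIFO queue, which is updated only when the function executes, and here it executes only when ret is not missing. Note that the queue is not reset at the start of a new cusip_id, so a group that begins with missing rets would have them filled using the last non-missing ret of the previous group. That does not happen in the sample data, where every cusip_id starts with a non-missing ret.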
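
To see the queue behaviour in isolation, here is a minimal sketch. The data set lag_demo and the variables x, prev_cond and prev_uncond are made up for illustration and are not part of the solution above.

```sas
data lag_demo;
  input x;
  /* Conditional call: the queue is pushed only when x is non-missing, */
  /* so prev_cond holds the previous non-missing x.                    */
  if x ne . then prev_cond = lag(x);
  /* Unconditional call (it has its own queue): plain previous row.    */
  prev_uncond = lag(x);
  datalines;
1
.
3
.
5
;
run;

proc print data=lag_demo;
run;
```

On the third row (x=3) prev_cond should be 1, while prev_uncond is missing because the row before it is missing.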

Also, make the array size at least as large as the most populated CUSIP_ID. I used 30 above.
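
If you would rather not guess the array size, one possible approach is to compute the largest group size first and use it as the array dimension. This is only a sketch: the macro variable name maxn is an assumption, and it relies on the TRIMMED option of PROC SQL (available in SAS 9.3 and later, as far as I know).

```sas
/* Count rows per cusip_id and keep the largest count in &maxn */
proc sql noprint;
  select max(n) into :maxn trimmed
  from (select count(*) as n
        from have
        group by cusip_id);
quit;

%put NOTE: largest cusip_id group has &maxn rows;
```

The array declaration in the DATA step could then read array rets {&maxn} _temporary_; instead of the hard-coded 30.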

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------


novinosrin · Posted 01-03-2021 12:06 PM

Hi @georgel, can you please explain: the average of what?

1. Is it the average of the prior and the following non-missing values?

2. If yes, how did you get

10343453 Astra 04/04/2011 main   5.15
10343453 Astra 05/04/2011 main   5.15

georgel · Posted 01-03-2021 12:23 PM

Hi novinosrin,

1. Yes. 2. (4.7+2.6)/2=3.65. I made a mistake in my initial table.

Many thanks in advance

Regards

George

mkeintz · Posted 01-03-2021 12:32 PM

Let's clarify what you are asking for.

You apparently want to replace only missing values that fall into a gap between non-missing values of ret.

At first, I thought you wanted to merely assign the mid-point between the closest preceding and closest following non-missing values. That would work for rows 4 (gets ret=7.15 from 6.6 and 7.7) and 15 (6.35 from 4.7 and 8). But it doesn't explain rows 12 and 13 (they get 5.15 - but are preceded by 2.6 and followed by 4.7).

Please explain.

georgel · Posted 01-03-2021 12:36 PM

Hi Keintz

I have corrected my initial post to be the average 3.65

Thanks

LeonidBatkhan · Posted 01-03-2021 12:49 PM

Hi georgel,

Here is one possible solution (I am not sure if it is the most efficient one):

data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;

proc sort data=HAVE;
   by cusip_id;
run;

data PASS1;
   set HAVE;
   by cusip_id;
   retain R1;
   Order = _n_;
   if first.cusip_id then R1 = .;
   if Ret ne . then R1 = Ret;
run;

proc sort data=PASS1 out=PASS2;
   by descending Order;
run;

data PASS3;
   set PASS2;
   by descending cusip_id;
   retain R2;
   if first.cusip_id then R2 = .;
   if Ret ne . then R2 = Ret;
run;

proc sort data=PASS3 out=PASS4;
   by Order;
run;

data WANT (drop=R1 R2 Order);
   set PASS4;
   if (Ret=.) and (R1 ne .) and (R2 ne .) then Ret = (R1+R2)/2;
run;

Please note that I changed one line in your raw data to be

10343453 Astra 08/04/2017 main 8.0

If you have 8 instead of 8.0, the value will be read in as 0.8, because the dollar8.1 informat assumes one decimal place when the data contain no decimal point.
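
A quick way to check the informat behaviour without the full data set is a sketch like the following (the variable names no_point and with_point are made up for illustration):

```sas
data _null_;
  /* No decimal point in the text: the .1 divides the value by 10 */
  no_point   = input('8',   dollar8.1);
  /* Explicit decimal point in the text: the .1 is ignored */
  with_point = input('8.0', dollar8.1);
  put no_point= with_point=;
run;
```

The log should show no_point=0.8 and with_point=8.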

Hope this helps.


novinosrin · Posted 01-03-2021 12:57 PM


data have;
input cusip_id $8. Name $6. date_issue :ddmmyy10. type $5. Ret  dollar8.1;
format date_issue ddmmyy10. ;
datalines;
10343452 Zilas 01/04/2016 main  3.2
10343452 Zilas 02/04/2014 main     4.5
10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7
10343452 Zilas 06/04/2017 main  .
10343452 Zilas 07/04/2017 main  .
10343452 Zilas 08/04/2017 main  .
10343453 Astra 01/04/2016 main  6.2
10343453 Astra 02/04/2014 main     5.5
10343453 Astra 03/04/2012 main   2.6
10343453 Astra 04/04/2011 main   .
10343453 Astra 05/04/2011 main   .
10343453 Astra 06/04/2017 main  4.7
10343453 Astra 07/04/2017 main  .
10343453 Astra 08/04/2017 main  8.0
10343453 Astra 09/04/2017 main  .
10343453 Astra 10/04/2017 main  .
10343453 Astra 11/04/2017 main  .
10343453 Astra 12/04/2017 main  .
10343453 Astra 13/04/2017 main  .
;
run;


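data want;
 /* Hash keyed by row position within the group; holds the fill value */
 if _n_ = 1 then do ;
   dcl hash h() ;
   h.definekey ("_n") ;
   h.definedata ("ret" ) ;
   h.definedone () ;
 end ;
 /* Pass 1: locate each gap and store the average of its neighbours */
 do __n=1 by 1 until(last.name);
  set have;
  by cusip_id name notsorted;
  if ret=. and not _k then  _k=__n;   /* _k = first row of an open gap */
  if not missing(ret) then do;        /* a non-missing ret closes the gap */
   if _k>. then if __n-_k>=1 then do;
     /* average with the previous non-missing ret held in _ret */
     _ret=mean(ret,_ret);
     do _n=_k to __n-1;
      h.add(key:_n,data:_ret);
     end;
     _k=.;
   end;
   _ret=ret;
  end;
 end;
 /* Pass 2: re-read the group; find() overwrites ret only for stored rows */
 do __n=1 to __n;
  set have;
  _n_=h.find(key:__n);
  output;
 end;
 h.clear();
 drop _:;
run;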

georgel · Posted 01-03-2021 01:13 PM

Many thanks novinosrin. I have tried it but one observation has been removed

novinosrin · Posted 01-03-2021 01:18 PM

Oh sorry, which one did it drop or remove?

georgel · Posted 01-03-2021 01:26 PM

novinosrin, please accept my apologies. I reran it and it works perfectly!

I admit that your solution is more efficient, since with the accepted solution I have to resize the array.

Perfect work! Thanks a million novinosrin

Regards

George

novinosrin · Posted 01-03-2021 01:28 PM

No, my solution is not more efficient than @mkeintz's solution. A temp ARRAY is much faster. As a matter of fact, I don't think any of my solutions can come close to Mark's, and in fact he was the one who taught me a lot of SAS over the last 5 years. Oh gosh, Mark must have gotten much older lol hahaha

georgel · Posted 01-03-2021 01:15 PM

Thanks Leonid

georgel · Posted 01-03-2021 01:11 PM

Impressive! Thanks a million Keintz

mkeintz · Posted 01-03-2021 02:10 PM

Question: why are your data not ordered chronologically? For instance the 3rd-5th rows are as below (note dates are dd/mm/yyyy):

10343452 Zilas 03/04/2012 main   6.6
10343452 Zilas 04/04/2011 main   .
10343452 Zilas 05/04/2017 main  7.7

So you are estimating data for 4/4/2011 (04apr2011) from 03apr2012 and 05april2017. Really? 2011 from 2012 and 2017?
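
If chronological order within each cusip_id is what is actually intended (an assumption; the original poster has not confirmed it), the data could be sorted before any of the gap-filling steps in this thread are run. The output data set name have_sorted is made up for illustration.

```sas
/* Order each cusip_id by issue date so that "previous" and "next" */
/* non-missing values are neighbours in time, not just in the file */
proc sort data=have out=have_sorted;
  by cusip_id date_issue;
run;
```

The gap-filling DATA steps would then read have_sorted instead of have.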
