Need Logic in Imputing records using Average of Prior and After records?

Accepted Solution Solved
Reply
Contributor
Posts: 59
Accepted Solution

Need Logic in Imputing records using Average of Prior and After records?

Dear Folks,

I need a SAS datastep logic in imputing records using average of prior and after records.

I have a large dataset totalling 20 million records for different ids the following columns-

         ID                       DATE (mmddyyyy)                      VALUE

        S12                     07/21/2011                                    -30

        S12                     07/23/2011                                    -50

        S13                     07/15/2011                                    -40

        S12                     07/16/2011                                     60 is positive

        S12                     07/17/2011                                    -100

       -so on-                 -so on-        

DESIRED OUTPUT DATASET- 

ID                       DATE (mmddyyyy)                      VALUE 

        S12                     07/21/2011                                    -30

        S12                     07/23/2011                                    -50

        S13                     07/15/2011                                    -40

        S12                     07/16/2011                                    -70 is negative obtained using average of -40 above and -100 below-/*correct wanted output*/

        S12                     07/17/2011                                    -100

       -so on-                 -so on-        

The dataset is supposed to have only negative values, and I want to identify where the records have positive values and impute by averaging the previous negative record value and following(after) negative value for all ids. Thanks

Have a Nice day,

Andy


Accepted Solutions
Solution
‎10-26-2012 08:29 AM
Regular Contributor
Posts: 151

Re: Need Logic in Imputing records using Average of Prior and After records?

Here's a way of dealing with consecutive positive values, as well as positive values in the first or last record of the dataset (it will use the first or last negative value in these instances).  It requires a full pass of the data to identify the row numbers with positive values, then uses the direct access method (POINT=) to calculate the mean of negative values either side, then finally modifies the original dataset (again using POINT=) to replace the incorrect values. There may be a more efficient method, but try it if you wish.

data have;

input id $ date :mmddyy10. value;

format date mmddyy10.;

cards;

S12 07/21/2011 -20

S12 07/23/2011 -50

S13 07/15/2011 -40

S13 07/16/2011  60

S13 07/17/2011 -100

S14 07/24/2011  10

S14 07/24/2011  20

S14 07/24/2011 -30

S14 07/24/2011 -20

;

run;

/* Identify row numbers with positive values */

data invalid (keep=recno);

set have (keep=value);

if value>=0 then do;

  recno=_n_;

  output;

end;

run;

/* Directly access source data to calculate mean of values either side of positives */

data mean_value;

set invalid;

do i=1 to nobs until (value1<0 or recno_prev=1);

  recno_prev=max(1,recno-i);

  set have (keep=value rename=(value=value1)) point=recno_prev nobs=nobs;

end;

value1=ifn(value1>=0,.,value1); /* if first value in dataset is positive, set it to missing */

do j=1 to nobs until (value2<0 or recno_next=nobs);

  recno_next=min(nobs,recno+j);

  set have (keep=value rename=(value=value2)) point=recno_next nobs=nobs;

end;

value2=ifn(value2>=0,.,value2); /* if last value in dataset is positive, set it to missing */

mean_value=mean(value1,value2);

run;

/* Replace missing values with calculated means */

/* Direct access method used */

data have;

set mean_value;

modify have point=recno;

value=mean_value;

run;

View solution in original post


All Replies
Regular Contributor
Posts: 151

Re: Need Logic in Imputing records using Average of Prior and After records?

Just a quick question, is ID relevant in your question?  You mention it a couple of times, yet your example seems to imply that you want to take the previous and next values irrespective of whether they are the same ID or not.  Also your sample data does not follow a logical sort order, is this intended or should it be sorted by ID and/or date?

Super Contributor
Posts: 644

Re: Need Logic in Imputing records using Average of Prior and After records?

Please clarify:

Is there a typo in your data?  Shouldn't the last 2 records have ID = S13 ?

What to do if the first or last records of a group have invalid values? Or 2 invalid values in sequence?

Is 0 a valid value?

Do you want the imputed value to be the mean of all the valid values for a group, or just the adjacent previous and next valid values?

Richard in Oz

Contributor
Posts: 59

Re: Need Logic in Imputing records using Average of Prior and After records?

Posted in reply to RichardinOz

Hi Richard,

Oops jeez!, I am extremely sorry about the typo. Yes the id in the last 3 records is S13. Well, anywhere a record has positive values anything greater than zero, it should be considered invalid. Thats when we need to replace with taking the average of the previous negative value and the following negative value. For example, if 15 follows -10 5 times(5 subsequent records), I'd need to take the average of -10 and the next negative record and pluck them in all those records where there is 15.

To clarify keith, values are corresponding values of ids. I guess they are only relevant if you wanna sort them by id and date.

Super Contributor
Posts: 644

Re: Need Logic in Imputing records using Average of Prior and After records?

OK

What do you want to happen for the following hypothetical group

S99     1/1/2011     20

S99     1/2/2011     -20

S99     1/3/2011     -30

S99     1/4/2011     -40

S99     1/5/2011     50

Richard in Oz

Super Contributor
Posts: 474

Re: Need Logic in Imputing records using Average of Prior and After records?

And also, is this possible?

S12     7/21/2011     -10

S12     7/21/2011     10

S12     7/21/2011     20

S12     7/21/2011     -30


If yes what is to be expected?


Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Contributor
Posts: 59

Re: Need Logic in Imputing records using Average of Prior and After records?

Posted in reply to DanielSantos


Hi Santos, Yes certainly possible but there is only one value for one date meaning there are no repeats of dates. The desired result would be -20 being the average of -10 and -30 replacing the 10 and 20 in the dataset.

So the desired dataset from your input dataset assuming different continuous dates should be like:

s12        7/21/2011        -10

s12        7/22/2011         -20

s12        7/23/2011         -20

s12        7/24/2011         -30

I hope that makes some senseSmiley Happy. Thanks

Solution
‎10-26-2012 08:29 AM
Regular Contributor
Posts: 151

Re: Need Logic in Imputing records using Average of Prior and After records?

Here's a way of dealing with consecutive positive values, as well as positive values in the first or last record of the dataset (it will use the first or last negative value in these instances).  It requires a full pass of the data to identify the row numbers with positive values, then uses the direct access method (POINT=) to calculate the mean of negative values either side, then finally modifies the original dataset (again using POINT=) to replace the incorrect values. There may be a more efficient method, but try it if you wish.

data have;

input id $ date :mmddyy10. value;

format date mmddyy10.;

cards;

S12 07/21/2011 -20

S12 07/23/2011 -50

S13 07/15/2011 -40

S13 07/16/2011  60

S13 07/17/2011 -100

S14 07/24/2011  10

S14 07/24/2011  20

S14 07/24/2011 -30

S14 07/24/2011 -20

;

run;

/* Identify row numbers with positive values */

data invalid (keep=recno);

set have (keep=value);

if value>=0 then do;

  recno=_n_;

  output;

end;

run;

/* Directly access source data to calculate mean of values either side of positives */

data mean_value;

set invalid;

do i=1 to nobs until (value1<0 or recno_prev=1);

  recno_prev=max(1,recno-i);

  set have (keep=value rename=(value=value1)) point=recno_prev nobs=nobs;

end;

value1=ifn(value1>=0,.,value1); /* if first value in dataset is positive, set it to missing */

do j=1 to nobs until (value2<0 or recno_next=nobs);

  recno_next=min(nobs,recno+j);

  set have (keep=value rename=(value=value2)) point=recno_next nobs=nobs;

end;

value2=ifn(value2>=0,.,value2); /* if last value in dataset is positive, set it to missing */

mean_value=mean(value1,value2);

run;

/* Replace missing values with calculated means */

/* Direct access method used */

data have;

set mean_value;

modify have point=recno;

value=mean_value;

run;

Contributor
Posts: 59

Re: Need Logic in Imputing records using Average of Prior and After records?

Hi Keith,

Thanks very much indeed. Worked Perfectly well.

Cheers,

Andy

Super Contributor
Posts: 644

Re: Need Logic in Imputing records using Average of Prior and After records?

I have to go now but here is my solution

There is probably a statistical proc that will do this for you, provided that you set all invalid values to missing

(If Value >= 0 then call missing (value)).

However, if you do not want to or cannot go down that track, here is a solution, assuming

•    Consecutive non valid responses can be encountered

•    Optionally, if first or last response for a group is invalid then value will be set to missing

It's a bit long winded but you should be able to see what is happening.

I'd suggest a solution transposing the data but I have no idea how many reponses per id group you could have and this might limit the number of rows transposed.

Delete the commented line if first nonvalid should be set to the next valid response encountered, and last nonvalid should be set to the last previous valid response (-20 and -40 respectively in my hypothetical group)

Richard in Oz

Data nonvalid ;

    Set    have ;

    By    Id    Date ;

    Where    Value > 0 ;

    Call missing (value) ;

Run ;

Data    Prevs

        Posts

        ;

    Merge    nonvalid    (in = non

                        rename    =    (date = datex)

                        )

            have        (where    = (Value <= 0))

            ;

        By    Id ;

        If    non ;

        If     date    <    datex

          Then

              Output Prevs ;

          Else

              Output Posts ;

    Drop    Date ;

Run ;

Data    Prevs ;

    Set    Prevs ;

    By    Id

        Datex

        ;

    If    Last.Id ;

    Rename    Datex    =    Date

            Value    =    Prev

            ;

    ;

Run ;

Data    Posts ;

    Set    Posts ;

    By    Id

        Datex

        ;

    If    First.Id ;

    Rename    Datex    =    Date

            Value    =    Post

            ;

    ;

Run ;

Data    Means ;

    Merge    Prevs

            Posts

            ;

    By        Id

            Date

            ;

    Length     Value    8 ;

    *    Assuming first or last value should not be interpolated if invalid ;

    If         Prev    =    .

        Or    Post    =    .

      Then    Call Missing (value) ;

      Else    Value    =    Mean (Prev, Post) ;

    Drop    Prev

            Post

            ;

Run ;

   

Data want ;

    Set    Have     (Where = (Value <= 0))

        Means

        ;

    By    Id

        Date

        ;

Run ;

Respected Advisor
Posts: 3,156

Re: Need Logic in Imputing records using Average of Prior and After records?

The key is to get the values of upper and lower "wrappers". To do that, the simplest solution I could think of is to introduce a 'sign' variable. After having that, everything is mechanical, 2XDOW will do:

data have;

input id $ date :mmddyy10. value;

format date mmddyy10.;

sign=ifn(value<0,1,0);

cards;

S12 07/21/2011 -20

S12 07/21/2011  20

S12 07/21/2011  20

S12 07/21/2011  20

S12 07/23/2011 -50

S13 07/15/2011 -40

S13 07/16/2011  60

S13 07/17/2011 -100

S14 07/24/2011  10

S14 07/24/2011  20

S14 07/24/2011 -30

S14 07/24/2011 -20

;

data want;

  do until (last.sign);

     set have;

       by id sign notsorted;

      set have (firstobs=2 keep=value rename=value=_v) have (obs=1 drop=_all_);

      if last.id then call missing(_v);

      _up=ifn(first.sign,lag(value),.);

      _up=ifn(first.id,.,_up);

      _low=ifn(last.sign,_v,.);

end;

  do until (last.sign);

     set have;

       by id sign notsorted;

      set have (firstobs=2 keep=value rename=value=_v) have (obs=1 drop=_all_);

      value=ifn(value<0,value, mean(_up,_low));

      output;

end;

drop _: sign;

run;

proc print;run;

Haikuo

Super Contributor
Posts: 474

Re: Need Logic in Imputing records using Average of Prior and After records?

Question as been already answered (nice and simple btw), still I'll post here my own suggestion which is aimed for performance as it was said that the original dataset was 20 millions large. To be fast I would avoid any approach which is IO intensive (DOW, index, or anything that involves re-merging).

Knowing that the logic of the problem is essentially a mix of looking ahead and memorizing a set of records, this can be achieved with hashing. So, it's just a matter of reading a record ahead and storing the sequence of record in which first and last are negative. Then output the whole sequence averaging VALUE for the "middle" records.

data want;

set have end=_EOF;

    retain _N 1 _VALUE 0;

    drop _:;

    if _N_ eq 1 then do;

       declare Hash HT ();

       HT.defineKey('_N');

       HT.defineData('ID','DATE','VALUE'); * dataset variables to keep;

       HT.defineDone();

    end; * init hash;

    _RC=HT.replace(); * store into hash;

    if (_N gt 1 and VALUE le 0) or _EOF then do; * look ahead negative or end of file;

       _VALUE=VALUE; * store last negative;

       if HT.find(key: 1) then stop; * retrieve first negative;   

       _VALUE=mean(ifn(VALUE>0,.,VALUE),ifn(_VALUE>0,.,_VALUE));

       if VALUE gt 0 then VALUE=_VALUE;

       output; * output first in sequence;

       do _I=2 to _N-1; * output looks ahead;      

          if HT.find(key: _I) then stop; if HT.remove(key: _I) then stop;

          if VALUE gt 0 then VALUE=_VALUE; * if positive set negative average;

          output; * output sequence;

       end;

       if not HT.find(key: 1) then _VALUE=VALUE; else stop; * retrieve first negative;

       if HT.find()then stop; if HT.remove() then stop; * retrieve last negative;

       _N=1;      

       if _EOF then do; * end of file?;

          if VALUE gt 0 then VALUE=_VALUE; * last record is positive?;

          output; * output last in sequence;

       end;       

       if HT.replace() then stop; * store as first negative for next look ahead;

    end;

    _N+1; * count looks ahead;

     

run;

Positive value in the first or last record will be set the next or last negative value in sequence.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Contributor
Posts: 59

Re: Need Logic in Imputing records using Average of Prior and After records?

Posted in reply to DanielSantos

Hi Dearest list of Gentleman-- Danielsantos, Hai kuo, Keith, RichardinOZ.

I can't thank you enough for the amazingly overwhelming response. Splendid and Great approach. Simply Fantastic. Super awesome

Cheers,

Andy

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 459 views
  • 5 likes
  • 5 in conversation