Andygray
Quartz | Level 8

Dear Folks,

I need SAS DATA step logic to impute records using the average of the prior and following records.

I have a large dataset totalling 20 million records for different IDs, with the following columns:

ID      DATE (mmddyyyy)     VALUE
S12     07/21/2011            -30
S12     07/23/2011            -50
S13     07/15/2011            -40
S12     07/16/2011             60   <- positive
S12     07/17/2011           -100
-so on- -so on-

DESIRED OUTPUT DATASET:

ID      DATE (mmddyyyy)     VALUE
S12     07/21/2011            -30
S12     07/23/2011            -50
S13     07/15/2011            -40
S12     07/16/2011            -70   <- negative, obtained as the average of -40 above and -100 below /* correct wanted output */
S12     07/17/2011           -100
-so on- -so on-

The dataset is supposed to contain only negative values; I want to identify the records that have positive values and impute them by averaging the previous negative value and the following negative value, for all IDs. Thanks.

Have a Nice day,

Andy


12 REPLIES
Keith
Obsidian | Level 7

Just a quick question: is ID relevant in your question? You mention it a couple of times, yet your example seems to imply that you want to take the previous and next values irrespective of whether they belong to the same ID or not. Also, your sample data does not follow a logical sort order; is this intended, or should it be sorted by ID and/or date?

RichardinOz
Quartz | Level 8

Please clarify:

Is there a typo in your data? Shouldn't the last two records have ID = S13?

What should happen if the first or last record of a group has an invalid value, or if there are two invalid values in sequence?

Is 0 a valid value?

Do you want the imputed value to be the mean of all the valid values for a group, or just the adjacent previous and next valid values?

Richard in Oz

Andygray
Quartz | Level 8

Hi Richard,

Oops, jeez! I am extremely sorry about the typo. Yes, the ID in the last three records is S13. Anywhere a record has a positive value (anything greater than zero), it should be considered invalid. That's when we need to replace it with the average of the previous negative value and the following negative value. For example, if 15 follows -10 five times (5 subsequent records), I'd need to take the average of -10 and the next negative record and plug it into all of those records where there is 15.

To clarify, Keith: the values correspond to the IDs. I guess the IDs are only relevant if you want to sort by ID and date.
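(If sorting is needed first, I assume a plain sort by ID and date would do; dataset name HAVE assumed here:)

proc sort data=have;
   by id date;   /* group by ID, order by date within each ID */
run;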

RichardinOz
Quartz | Level 8

OK

What do you want to happen for the following hypothetical group?

S99     1/1/2011      20
S99     1/2/2011     -20
S99     1/3/2011     -30
S99     1/4/2011     -40
S99     1/5/2011      50

Richard in Oz

DanielSantos
Barite | Level 11

And also, is this possible?

S12     7/21/2011     -10
S12     7/21/2011      10
S12     7/21/2011      20
S12     7/21/2011     -30

If yes, what result is expected?


Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Andygray
Quartz | Level 8


Hi Santos, yes, that is certainly possible, but there is only one value per date, meaning dates do not repeat. The desired result would be -20 (the average of -10 and -30) replacing the 10 and 20 in the dataset.

So the desired dataset from your input, assuming distinct consecutive dates, should look like:

s12     7/21/2011     -10
s12     7/22/2011     -20
s12     7/23/2011     -20
s12     7/24/2011     -30

I hope that makes some sense :). Thanks

Keith
Obsidian | Level 7

Here's a way of dealing with consecutive positive values, as well as positive values in the first or last record of the dataset (it will use the first or last negative value in those cases). It requires a full pass of the data to identify the row numbers with positive values, then uses the direct access method (POINT=) to calculate the mean of the negative values on either side, and finally modifies the original dataset (again using POINT=) to replace the incorrect values. There may be a more efficient method, but try it if you wish.

data have;
   input id $ date :mmddyy10. value;
   format date mmddyy10.;
cards;
S12 07/21/2011 -20
S12 07/23/2011 -50
S13 07/15/2011 -40
S13 07/16/2011  60
S13 07/17/2011 -100
S14 07/24/2011  10
S14 07/24/2011  20
S14 07/24/2011 -30
S14 07/24/2011 -20
;
run;

/* Identify the row numbers with positive values */
data invalid (keep=recno);
   set have (keep=value);
   if value>=0 then do;
      recno=_n_;
      output;
   end;
run;

/* Directly access the source data to calculate the mean of the negative values on either side of each positive */
data mean_value;
   set invalid;
   do i=1 to nobs until (value1<0 or recno_prev=1);
      recno_prev=max(1,recno-i);
      set have (keep=value rename=(value=value1)) point=recno_prev nobs=nobs;
   end;
   value1=ifn(value1>=0,.,value1); /* if the first value in the dataset is positive, set it to missing */
   do j=1 to nobs until (value2<0 or recno_next=nobs);
      recno_next=min(nobs,recno+j);
      set have (keep=value rename=(value=value2)) point=recno_next nobs=nobs;
   end;
   value2=ifn(value2>=0,.,value2); /* if the last value in the dataset is positive, set it to missing */
   mean_value=mean(value1,value2);
run;

/* Replace the invalid (positive) values with the calculated means, again via direct access (POINT=) */
data have;
   set mean_value;
   modify have point=recno;
   value=mean_value;
run;
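To eyeball the result (the corrected values are written back into HAVE by the MODIFY step), a quick print will do:

proc print data=have;
run;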

Andygray
Quartz | Level 8

Hi Keith,

Thanks very much indeed. It worked perfectly.

Cheers,

Andy

RichardinOz
Quartz | Level 8

I have to go now, but here is my solution.

There is probably a statistical proc that will do this for you, provided that you set all invalid values to missing

(If Value >= 0 then call missing (value)).
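For instance, if SAS/ETS is licensed, PROC EXPAND will interpolate the missing values without changing the observation frequency. Untested sketch only; it assumes the dates are unique and ascending within each ID, and note that METHOD=JOIN interpolates linearly in DATE, so it matches the simple average of the two neighbouring negatives only when the positive record sits midway between them:

data have_miss;
   set have;
   if value >= 0 then call missing(value);  /* flag invalid (positive) values as missing */
run;

proc expand data=have_miss out=want method=join;  /* requires SAS/ETS; JOIN = piecewise linear interpolation */
   by id;
   id date;        /* dates must be unique and increasing within each ID */
   convert value;  /* missing VALUEs are filled from the neighbouring non-missing ones */
run;

The endpoint behaviour (a positive value in the very first or last record of an ID) would need checking, as EXPAND extrapolates rather than copying the nearest negative.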

However, if you do not want to or cannot go down that track, here is a solution, assuming:

•    consecutive invalid responses can be encountered;

•    optionally, if the first or last response of a group is invalid, its value will be set to missing.

It's a bit long-winded, but you should be able to see what is happening.

I would have suggested a solution that transposes the data, but I have no idea how many responses per ID group you could have, and that might limit the transposition.

Delete the commented line if the first invalid value should instead be set to the next valid response encountered, and the last invalid value to the previous valid response (-20 and -40 respectively in my hypothetical group).

Richard in Oz

Data nonvalid ;
    Set have ;
    By Id Date ;
    Where Value > 0 ;
    Call missing (value) ;
Run ;

Data Prevs Posts ;
    Merge nonvalid (in = non rename = (date = datex))
          have     (where = (Value <= 0)) ;
    By Id ;
    If non ;
    If date < datex
      Then Output Prevs ;
      Else Output Posts ;
    Drop Date ;
Run ;

Data Prevs ;
    Set Prevs ;
    By Id Datex ;
    If Last.Id ;
    Rename Datex = Date
           Value = Prev ;
Run ;

Data Posts ;
    Set Posts ;
    By Id Datex ;
    If First.Id ;
    Rename Datex = Date
           Value = Post ;
Run ;

Data Means ;
    Merge Prevs Posts ;
    By Id Date ;
    Length Value 8 ;
    * Assuming first or last value should not be interpolated if invalid ;
    If Prev = . Or Post = .
      Then Call Missing (value) ;
      Else Value = Mean (Prev, Post) ;
    Drop Prev Post ;
Run ;

Data want ;
    Set Have (Where = (Value <= 0))
        Means ;
    By Id Date ;
Run ;

Haikuo
Onyx | Level 15

The key is to get the values of the upper and lower "wrappers". To do that, the simplest solution I could think of is to introduce a 'sign' variable. After that, everything is mechanical; a double DOW loop (2xDOW) will do:

data have;
   input id $ date :mmddyy10. value;
   format date mmddyy10.;
   sign=ifn(value<0,1,0); /* 1 = negative (valid), 0 = positive (invalid) */
cards;
S12 07/21/2011 -20
S12 07/21/2011  20
S12 07/21/2011  20
S12 07/21/2011  20
S12 07/23/2011 -50
S13 07/15/2011 -40
S13 07/16/2011  60
S13 07/17/2011 -100
S14 07/24/2011  10
S14 07/24/2011  20
S14 07/24/2011 -30
S14 07/24/2011 -20
;

data want;
   /* first DOW loop: walk one run of identical SIGN values and pick up the
      negative value just above (_up) and just below (_low) the run */
   do until (last.sign);
      set have;
      by id sign notsorted;
      /* look-ahead: HAVE shifted up by one obs plus one empty obs at the end, so _v holds the next record's VALUE */
      set have (firstobs=2 keep=value rename=(value=_v)) have (obs=1 drop=_all_);
      if last.id then call missing(_v);
      _up=ifn(first.sign,lag(value),.);
      _up=ifn(first.id,.,_up);
      _low=ifn(last.sign,_v,.);
   end;
   /* second DOW loop: reread the same run and replace the positives with the mean of the wrappers */
   do until (last.sign);
      set have;
      by id sign notsorted;
      set have (firstobs=2 keep=value rename=(value=_v)) have (obs=1 drop=_all_);
      value=ifn(value<0,value,mean(_up,_low));
      output;
   end;
   drop _: sign;
run;

proc print;run;

Haikuo

DanielSantos
Barite | Level 11

The question has already been answered (nice and simple, btw), but I'll still post my own suggestion, which is aimed at performance, since it was said that the original dataset is 20 million records. To be fast, I would avoid any approach that is I/O intensive (DOW, index, or anything that involves re-merging).

Knowing that the logic of the problem is essentially a mix of looking ahead and memorizing a set of records, this can be achieved with hashing. So it's just a matter of reading a record ahead and storing the sequence of records in which the first and last are negative, then outputting the whole sequence, averaging VALUE for the "middle" records.

data want;
   set have end=_EOF;
   retain _N 1 _VALUE 0;
   drop _:;
   if _N_ eq 1 then do;
      declare Hash HT ();
      HT.defineKey('_N');
      HT.defineData('ID','DATE','VALUE'); * dataset variables to keep;
      HT.defineDone();
   end; * init hash;
   _RC=HT.replace(); * store into hash;
   if (_N gt 1 and VALUE le 0) or _EOF then do; * look ahead negative or end of file;
      _VALUE=VALUE; * store last negative;
      if HT.find(key: 1) then stop; * retrieve first negative;
      _VALUE=mean(ifn(VALUE>0,.,VALUE),ifn(_VALUE>0,.,_VALUE));
      if VALUE gt 0 then VALUE=_VALUE;
      output; * output first in sequence;
      do _I=2 to _N-1; * output look-aheads;
         if HT.find(key: _I) then stop;
         if HT.remove(key: _I) then stop;
         if VALUE gt 0 then VALUE=_VALUE; * if positive set negative average;
         output; * output sequence;
      end;
      if not HT.find(key: 1) then _VALUE=VALUE; else stop; * retrieve first negative;
      if HT.find() then stop;
      if HT.remove() then stop; * retrieve last negative;
      _N=1;
      if _EOF then do; * end of file?;
         if VALUE gt 0 then VALUE=_VALUE; * last record is positive?;
         output; * output last in sequence;
      end;
      if HT.replace() then stop; * store as first negative for next look ahead;
   end;
   _N+1; * count look-aheads;
run;

A positive value in the first or last record will be set to the next or the last negative value in the sequence.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Andygray
Quartz | Level 8

Hi, dearest list of gentlemen: DanielSantos, Haikuo, Keith, RichardinOz.

I can't thank you enough for the amazingly overwhelming response. Splendid and great approaches. Simply fantastic. Super awesome!

Cheers,

Andy
