Dear Folks,
I need a SAS datastep logic in imputing records using average of prior and after records.
I have a large dataset totalling 20 million records for different ids the following columns-
ID DATE (mmddyyyy) VALUE
S12 07/21/2011 -30
S12 07/23/2011 -50
S13 07/15/2011 -40
S12 07/16/2011 60 is positive
S12 07/17/2011 -100
-so on- -so on-
DESIRED OUTPUT DATASET-
ID DATE (mmddyyyy) VALUE
S12 07/21/2011 -30
S12 07/23/2011 -50
S13 07/15/2011 -40
S12 07/16/2011 -70 is negative obtained using average of -40 above and -100 below-/*correct wanted output*/
S12 07/17/2011 -100
-so on- -so on-
The dataset is supposed to have only negative values, and I want to identify where the records have positive values and impute by averaging the previous negative record value and following(after) negative value for all ids. Thanks
Have a Nice day,
Andy
Here's a way of dealing with consecutive positive values, as well as positive values in the first or last record of the dataset (it will use the first or last negative value in these instances). It requires a full pass of the data to identify the row numbers with positive values, then uses the direct access method (POINT=) to calculate the mean of negative values either side, then finally modifies the original dataset (again using POINT=) to replace the incorrect values. There may be a more efficient method, but try it if you wish.
data have;
input id $ date :mmddyy10. value;
format date mmddyy10.;
cards;
S12 07/21/2011 -20
S12 07/23/2011 -50
S13 07/15/2011 -40
S13 07/16/2011 60
S13 07/17/2011 -100
S14 07/24/2011 10
S14 07/24/2011 20
S14 07/24/2011 -30
S14 07/24/2011 -20
;
run;
/* Identify row numbers with positive values */
data invalid (keep=recno);
set have (keep=value);
if value>=0 then do;
recno=_n_;
output;
end;
run;
/* Directly access source data to calculate mean of values either side of positives */
data mean_value;
set invalid;
do i=1 to nobs until (value1<0 or recno_prev=1);
recno_prev=max(1,recno-i);
set have (keep=value rename=(value=value1)) point=recno_prev nobs=nobs;
end;
value1=ifn(value1>=0,.,value1); /* if first value in dataset is positive, set it to missing */
do j=1 to nobs until (value2<0 or recno_next=nobs);
recno_next=min(nobs,recno+j);
set have (keep=value rename=(value=value2)) point=recno_next nobs=nobs;
end;
value2=ifn(value2>=0,.,value2); /* if last value in dataset is positive, set it to missing */
mean_value=mean(value1,value2);
run;
/* Replace missing values with calculated means */
/* Direct access method used */
data have;
set mean_value;
modify have point=recno;
value=mean_value;
run;
Just a quick question, is ID relevant in your question? You mention it a couple of times, yet your example seems to imply that you want to take the previous and next values irrespective of whether they are the same ID or not. Also your sample data does not follow a logical sort order, is this intended or should it be sorted by ID and/or date?
Please clarify:
Is there a typo in your data? Shouldn't the last 2 records have ID = S13 ?
What to do if the first or last records of a group have invalid values? Or 2 invalid values in sequence?
Is 0 a valid value?
Do you want the imputed value to be the mean of all the valid values for a group, or just the adjacent previous and next valid values?
Richard in Oz
Hi Richard,
Oops jeez!, I am extremely sorry about the typo. Yes the id in the last 3 records is S13. Well, anywhere a record has positive values anything greater than zero, it should be considered invalid. Thats when we need to replace with taking the average of the previous negative value and the following negative value. For example, if 15 follows -10 5 times(5 subsequent records), I'd need to take the average of -10 and the next negative record and pluck them in all those records where there is 15.
To clarify keith, values are corresponding values of ids. I guess they are only relevant if you wanna sort them by id and date.
OK
What do you want to happen for the following hypothetical group
S99 1/1/2011 20
S99 1/2/2011 -20
S99 1/3/2011 -30
S99 1/4/2011 -40
S99 1/5/2011 50
Richard in Oz
And also, is this possible?
S12 7/21/2011 -10
S12 7/21/2011 10
S12 7/21/2011 20
S12 7/21/2011 -30
If yes what is to be expected?
Cheers from Portugal.
Daniel Santos @ www.cgd.pt
Hi Santos, Yes certainly possible but there is only one value for one date meaning there are no repeats of dates. The desired result would be -20 being the average of -10 and -30 replacing the 10 and 20 in the dataset.
So the desired dataset from your input dataset assuming different continuous dates should be like:
s12 7/21/2011 -10
s12 7/22/2011 -20
s12 7/23/2011 -20
s12 7/24/2011 -30
I hope that makes some sense:). Thanks
Here's a way of dealing with consecutive positive values, as well as positive values in the first or last record of the dataset (it will use the first or last negative value in these instances). It requires a full pass of the data to identify the row numbers with positive values, then uses the direct access method (POINT=) to calculate the mean of negative values either side, then finally modifies the original dataset (again using POINT=) to replace the incorrect values. There may be a more efficient method, but try it if you wish.
data have;
input id $ date :mmddyy10. value;
format date mmddyy10.;
cards;
S12 07/21/2011 -20
S12 07/23/2011 -50
S13 07/15/2011 -40
S13 07/16/2011 60
S13 07/17/2011 -100
S14 07/24/2011 10
S14 07/24/2011 20
S14 07/24/2011 -30
S14 07/24/2011 -20
;
run;
/* Identify row numbers with positive values */
data invalid (keep=recno);
set have (keep=value);
if value>=0 then do;
recno=_n_;
output;
end;
run;
/* Directly access source data to calculate mean of values either side of positives */
data mean_value;
set invalid;
do i=1 to nobs until (value1<0 or recno_prev=1);
recno_prev=max(1,recno-i);
set have (keep=value rename=(value=value1)) point=recno_prev nobs=nobs;
end;
value1=ifn(value1>=0,.,value1); /* if first value in dataset is positive, set it to missing */
do j=1 to nobs until (value2<0 or recno_next=nobs);
recno_next=min(nobs,recno+j);
set have (keep=value rename=(value=value2)) point=recno_next nobs=nobs;
end;
value2=ifn(value2>=0,.,value2); /* if last value in dataset is positive, set it to missing */
mean_value=mean(value1,value2);
run;
/* Replace missing values with calculated means */
/* Direct access method used */
data have;
set mean_value;
modify have point=recno;
value=mean_value;
run;
Hi Keith,
Thanks very much indeed. Worked Perfectly well.
Cheers,
Andy
I have to go now but here is my solution
There is probably a statistical proc that will do this for you, provided that you set all invalid values to missing
(If Value >= 0 then call missing (value)).
However, if you do not want to or cannot go down that track, here is a solution, assuming
• Consecutive non valid responses can be encountered
• Optionally, if first or last response for a group is invalid then value will be set to missing
It's a bit long winded but you should be able to see what is happening.
I'd suggest a solution transposing the data but I have no idea how many reponses per id group you could have and this might limit the number of rows transposed.
Delete the commented line if first nonvalid should be set to the next valid response encountered, and last nonvalid should be set to the last previous valid response (-20 and -40 respectively in my hypothetical group)
Richard in Oz
Data nonvalid ;
Set have ;
By Id Date ;
Where Value > 0 ;
Call missing (value) ;
Run ;
Data Prevs
Posts
;
Merge nonvalid (in = non
rename = (date = datex)
)
have (where = (Value <= 0))
;
By Id ;
If non ;
If date < datex
Then
Output Prevs ;
Else
Output Posts ;
Drop Date ;
Run ;
Data Prevs ;
Set Prevs ;
By Id
Datex
;
If Last.Id ;
Rename Datex = Date
Value = Prev
;
;
Run ;
Data Posts ;
Set Posts ;
By Id
Datex
;
If First.Id ;
Rename Datex = Date
Value = Post
;
;
Run ;
Data Means ;
Merge Prevs
Posts
;
By Id
Date
;
Length Value 8 ;
* Assuming first or last value should not be interpolated if invalid ;
If Prev = .
Or Post = .
Then Call Missing (value) ;
Else Value = Mean (Prev, Post) ;
Drop Prev
Post
;
Run ;
Data want ;
Set Have (Where = (Value <= 0))
Means
;
By Id
Date
;
Run ;
The key is to get the values of upper and lower "wrappers". To do that, the simplest solution I could think of is to introduce a 'sign' variable. After having that, everything is mechanical, 2XDOW will do:
data have;
input id $ date :mmddyy10. value;
format date mmddyy10.;
sign=ifn(value<0,1,0);
cards;
S12 07/21/2011 -20
S12 07/21/2011 20
S12 07/21/2011 20
S12 07/21/2011 20
S12 07/23/2011 -50
S13 07/15/2011 -40
S13 07/16/2011 60
S13 07/17/2011 -100
S14 07/24/2011 10
S14 07/24/2011 20
S14 07/24/2011 -30
S14 07/24/2011 -20
;
data want;
do until (last.sign);
set have;
by id sign notsorted;
set have (firstobs=2 keep=value rename=value=_v) have (obs=1 drop=_all_);
if last.id then call missing(_v);
_up=ifn(first.sign,lag(value),.);
_up=ifn(first.id,.,_up);
_low=ifn(last.sign,_v,.);
end;
do until (last.sign);
set have;
by id sign notsorted;
set have (firstobs=2 keep=value rename=value=_v) have (obs=1 drop=_all_);
value=ifn(value<0,value, mean(_up,_low));
output;
end;
drop _: sign;
run;
proc print;run;
Haikuo
Question as been already answered (nice and simple btw), still I'll post here my own suggestion which is aimed for performance as it was said that the original dataset was 20 millions large. To be fast I would avoid any approach which is IO intensive (DOW, index, or anything that involves re-merging).
Knowing that the logic of the problem is essentially a mix of looking ahead and memorizing a set of records, this can be achieved with hashing. So, it's just a matter of reading a record ahead and storing the sequence of record in which first and last are negative. Then output the whole sequence averaging VALUE for the "middle" records.
data want;
set have end=_EOF;
retain _N 1 _VALUE 0;
drop _:;
if _N_ eq 1 then do;
declare Hash HT ();
HT.defineKey('_N');
HT.defineData('ID','DATE','VALUE'); * dataset variables to keep;
HT.defineDone();
end; * init hash;
_RC=HT.replace(); * store into hash;
if (_N gt 1 and VALUE le 0) or _EOF then do; * look ahead negative or end of file;
_VALUE=VALUE; * store last negative;
if HT.find(key: 1) then stop; * retrieve first negative;
_VALUE=mean(ifn(VALUE>0,.,VALUE),ifn(_VALUE>0,.,_VALUE));
if VALUE gt 0 then VALUE=_VALUE;
output; * output first in sequence;
do _I=2 to _N-1; * output looks ahead;
if HT.find(key: _I) then stop; if HT.remove(key: _I) then stop;
if VALUE gt 0 then VALUE=_VALUE; * if positive set negative average;
output; * output sequence;
end;
if not HT.find(key: 1) then _VALUE=VALUE; else stop; * retrieve first negative;
if HT.find()then stop; if HT.remove() then stop; * retrieve last negative;
_N=1;
if _EOF then do; * end of file?;
if VALUE gt 0 then VALUE=_VALUE; * last record is positive?;
output; * output last in sequence;
end;
if HT.replace() then stop; * store as first negative for next look ahead;
end;
_N+1; * count looks ahead;
run;
Positive value in the first or last record will be set the next or last negative value in sequence.
Cheers from Portugal.
Daniel Santos @ www.cgd.pt
Hi Dearest list of Gentleman-- Danielsantos, Hai kuo, Keith, RichardinOZ.
I can't thank you enough for the amazingly overwhelming response. Splendid and Great approach. Simply Fantastic. Super awesome
Cheers,
Andy
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.