Hi,
Assume this is my data. I would like to delete rows with missing duration and add the score for that observation to the previous one. Anything else stays the same for the previous observation.
Up to now, I was able only to delete the row if duration is missing. I have tried using array without success. Any help is appreciated.
DATA kids;
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 9 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 6 20
Alic 3 3 12
Alic 4 . 17
;
run;
data first;
set kids;
if duration=. then nmiss=1; else nmiss=0;
run;
data second;
set first;
if nmiss=1 then delete;
run;
The final desired data would be
Beth 1 12 30
Beth 2 7 10
Beth 3 9 100
Barb 1 3 20
Barb 2 5 90
Alic 1 6 50
Alic 2 6 20
Alic 3 3 29
Thanks.
As stated, if this is the only possible scenario in your real life data, a simple 'look ahead' will be sufficient. However, look at the following modified raw data:
DATA kids;
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 . 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 . 20
Alic 3 3 12
Alic 4 . 17
;
So Beth has 2 missing in a row and Alic has 2 missing apart. A possible Proc SQL solution may take care of Beth (let me know if you are interested in the code) but fail on Alic, while 'Look Ahead' is able to tackle Alic but fail on Beth. Array, which OP happened to mention, becomes a strong candidate to reconcile both scenario at the same time, in 1X pass data step:
data want;
do until (last.kidname);
set kids;
by kidname notsorted;
array in(3) _temporary_;
array out(3) _temporary_;
array t(3) time--score;
if missing(duration) then in(3)+score;
if not missing (duration) or last.kidname then
do;
do i=1 to 3;
out(i)=in(i);
in(i)=t(i);
t(i)=out(i);
end;
if not missing(duration) then
output;
end;
end;
call missing (of in(*), of out(*));
drop i;
run;
Haikuo
Sounds like a look-ahead problem to me.
Here is one approach.
218 options mergenoby=nowarn; 219 data want; 220 merge kids 221 kids (firstobs=2 222 keep=duration score kidname 223 rename=(duration=nextduration score=nextscore kidname=nextkidname) 224 ) 225 ; 226 if missing(nextduration) then do; 227 score=sum(score,nextscore); 228 if kidname ne nextkidname and not missing(nextkidname) then put "ER" "ROR: crossed name 228! boundary cuz missing first duration " kidname=; 229 end; 230 if missing(duration) then delete; 231 232 drop next:; 233 put KidName Time Duration Score; 234 run; Beth 1 12 30 Beth 2 7 10 Beth 3 9 100 Barb 1 3 20 Barb 2 5 90 Alic 1 6 50 Alic 2 6 20 Alic 3 3 29 NOTE: There were 11 observations read from the data set WORK.KIDS. NOTE: There were 10 observations read from the data set WORK.KIDS. NOTE: The data set WORK.WANT has 8 observations and 4 variables. 235 options mergenoby=error;
Investigate an approach that uses PROC SUMMARY to aggregate important variables where you have BY/CLASS variable list (kidname and maybe time?), executed after a DATA step that does the interrogation and sets appropriate SAS variables to a missing value; also, use LAG(...) to keep track of most-recent 'keeper' row, if consider you might have two consecutive (or more) obs with a missing value for DURATION.
Scott Barry
SBBWorks, Inc.
A solution is the following:
data want(keep=kidname time2 dur2 sc2 rename=(time2=time dur2=duration sc2=score));
set kids;
retain ix2 time2 dur2 sc2;
by kidname notsorted;
if first.kidname then do;
ix2=0;
time2=0;
dur2=0;
sc2=0;
end;
ix2+1;
if not(last.kidname) then do;
if (ix2>1) then output;
time2=time;
dur2= duration;
sc2=score;
end;
if last.kidname then do;
sc2+score;
output;
end;
run;
As stated, if this is the only possible scenario in your real life data, a simple 'look ahead' will be sufficient. However, look at the following modified raw data:
DATA kids;
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 . 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 . 20
Alic 3 3 12
Alic 4 . 17
;
So Beth has 2 missing in a row and Alic has 2 missing apart. A possible Proc SQL solution may take care of Beth (let me know if you are interested in the code) but fail on Alic, while 'Look Ahead' is able to tackle Alic but fail on Beth. Array, which OP happened to mention, becomes a strong candidate to reconcile both scenario at the same time, in 1X pass data step:
data want;
do until (last.kidname);
set kids;
by kidname notsorted;
array in(3) _temporary_;
array out(3) _temporary_;
array t(3) time--score;
if missing(duration) then in(3)+score;
if not missing (duration) or last.kidname then
do;
do i=1 to 3;
out(i)=in(i);
in(i)=t(i);
t(i)=out(i);
end;
if not missing(duration) then
output;
end;
end;
call missing (of in(*), of out(*));
drop i;
run;
Haikuo
DATA kids;
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 9 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 6 20
Alic 3 3 12
Alic 4 . 17
;
run;
data want(drop=score1 duration1 time1);
set kids;
by kidname notsorted;
retain score1 duration1 time1;
if not last.kidname then do;
score1=score;
duration1=duration;
time1=time;
end;
if last.kidname and duration=. then do;
score=sum(score,score1);
duration=duration1;
time=time1;
end;
run;
data final;
set want;
by time notsorted;
if not last.time then delete;
run;
Assuming there were multi-missing value followed it .
DATA kids; LENGTH kidname $ 4; INPUT kidname time duration score; CARDS; Beth 1 12 30 Beth 2 7 10 Beth 3 . 60 Beth 4 . 40 Barb 1 3 20 Barb 2 5 80 Barb 3 . 10 Alic 1 6 50 Alic 2 . 20 Alic 3 3 12 Alic 4 . 17 ; run; data temp; set kids; by kidname notsorted; if not missing(duration) or first.kidname then group+1; run; proc sql; create table want as select group,kidname, time ,duration,sum(score) as score from temp group by group having duration is not missing; quit;
Xia Keshan
DATA kids;/*haikuo's raw data*/
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 . 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 . 20
Alic 3 3 12
Alic 4 . 17
;
data want1;
set kids;
by kidname notsorted;
retain score1 duration1 time1 ;
if first.kidname then call missing(score1,duration1,time1);
if duration ne . then do;
score1=score;
duration1=duration;
time1=time;
end;
if duration eq . then do;
score1+score ;
score=score1;
duration=duration1;
time=time1;
end;
run;
data final(drop=score1 duration1 time1);
set want1;
by time notsorted;
if not last.time then delete;
run;
As long as in your real data there are also no consecutive times with a missing duration then below SQL should work.
proc sql;
create table want as
select
kidname,
case(duration) when(.) then time-1 else time end as time,
sum(duration) as duration,
sum(score) as score
from kids
group by kidname, calculated time
;
quit;
Trying to keep it simple and also handle unexpected cases properly :
DATA kids;
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 9 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 6 20
Alic 3 3 12
Alic 4 . 17
Tric 1 . 11
Tric 2 5 13
Null 1 . 7
Null 2 . 19
;
data want;
do until(last.kidname);
set kids; by kidname notsorted;
if missing(duration) then do;
s = sum(s, score);
t = coalesce(t, time);
end;
else do;
if not first.kidname then output;
t = time;
d = duration;
s = score;
end;
end;
output;
keep kidname t d s;
rename t=time d=duration s=score;
run;
proc print data=want noobs; run;
PG
Thanks PG.
Thanks to PG, It turns out my previous code still has some glitches. The following one seems to be more robust:
DATA kids;
LENGTH kidname $ 4;
INPUT kidname time duration score;
CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 9 60
Beth 4 5 40
Barb 1 3 20
Barb 2 . 80
Barb 3 . 10
Alic 1 6 50
Alic 2 . 20
Alic 3 3 12
Alic 4 . 17
Tric 1 . 11
Tric 2 5 13
Null 1 . 7
Null 2 . 19
;
data want;
do until (last.kidname);
set kids;
by kidname notsorted;
array in(3) _temporary_;
array out(3) _temporary_;
array t(3) time--score;
if missing(duration) then
in(3)+score;
if not missing (duration) or (last.kidname) then
do;
do i=1 to 3;
out(i)=in(i);
in(i)=t(i);
t(i)=out(i);
end;
if not missing(duration) then
output;
if last.kidname then
do;
do i=1 to 3;
t(i)=in(i);
end;
if not missing(duration) then
output;
end;
end;
end;
call missing (of in(*), of out(*));
drop i;
run;
Thanks Hai.kuo.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.