BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Marcusliat
Calcite | Level 5

Hi,

Assume this is my data. I would like to delete rows with missing duration and add the score for that observation to the previous one. Anything else stays the same for the previous observation.

Up to now, I was able only to delete the row if duration is missing. I have tried using array without success. Any help is appreciated.

DATA kids;

  LENGTH kidname $ 4;

  INPUT kidname time duration score;

CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 9 60

Beth 4 . 40

Barb 1 3 20

Barb 2 5 80

Barb 3 . 10

Alic 1 6 50

Alic 2 6 20

Alic 3 3 12

Alic 4 . 17

;

run;

data first;

set kids;

if duration=. then nmiss=1; else nmiss=0;

run;

data second;

  set first;

  if nmiss=1 then delete;

run;

The final desired data would be

Beth 1 12 30

Beth 2 7 10

Beth 3 9 100

Barb 1 3 20

Barb 2 5 90

Alic 1 6 50

Alic 2 6 20

Alic 3 3 29

Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

As stated, if this is the only possible scenario in your real life data, a simple 'look ahead' will be sufficient. However, look at the following modified raw data:

DATA kids;

     LENGTH kidname $ 4;

     INPUT kidname time duration score;

     CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 . 60

Beth 4 . 40

Barb 1 3 20

Barb 2 5 80

Barb 3 . 10

Alic 1 6 50

Alic 2 . 20

Alic 3 3 12

Alic 4 . 17

;

So Beth has 2 missing in a row and Alic has 2 missing apart. A possible Proc SQL solution may take care of Beth (let me know if you are interested in the code) but fail on Alic, while 'Look Ahead' is able to tackle Alic but fail on Beth. Array, which OP happened to mention, becomes a strong candidate to reconcile both scenario at the same time, in 1X pass data step:

data want;

     do until (last.kidname);

           set kids;

           by kidname notsorted;

           array in(3) _temporary_;

           array out(3) _temporary_;

           array t(3) time--score;

       if missing(duration) then in(3)+score;

           if not missing (duration) or last.kidname then

                do;

                     do i=1 to 3;

                           out(i)=in(i);

                           in(i)=t(i);

                           t(i)=out(i);

                     end;

                     if not missing(duration) then

                           output;

                end;

          

     end;

     call missing (of in(*), of out(*));

     drop i;

run;


Haikuo

View solution in original post

12 REPLIES 12
Quentin
Super User

Sounds like a look-ahead problem to me.

Here is one approach.

218  options mergenoby=nowarn;
219  data want;
220    merge kids
221          kids (firstobs=2
222                keep=duration score kidname
223                rename=(duration=nextduration score=nextscore kidname=nextkidname)
224                )
225    ;
226    if missing(nextduration) then do;
227      score=sum(score,nextscore);
228      if kidname ne nextkidname and not missing(nextkidname) then put "ER" "ROR: crossed name
228! boundary cuz missing first duration " kidname=;
229    end;
230    if missing(duration) then delete;
231
232    drop next:;
233    put KidName Time Duration Score;
234  run;

Beth 1 12 30
Beth 2 7 10
Beth 3 9 100
Barb 1 3 20
Barb 2 5 90
Alic 1 6 50
Alic 2 6 20
Alic 3 3 29
NOTE: There were 11 observations read from the data set WORK.KIDS.
NOTE: There were 10 observations read from the data set WORK.KIDS.
NOTE: The data set WORK.WANT has 8 observations and 4 variables.

235  options mergenoby=error;


BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10

Investigate an approach that uses PROC SUMMARY to aggregate important variables where you have BY/CLASS variable list (kidname and maybe time?), executed after a DATA step that does the interrogation and sets appropriate SAS variables to a missing value; also, use LAG(...) to keep track of most-recent 'keeper' row, if consider you might have two consecutive (or more) obs with a missing value for DURATION.

Scott Barry
SBBWorks, Inc.

billfish
Quartz | Level 8

A solution is the following:

data want(keep=kidname time2 dur2 sc2 rename=(time2=time dur2=duration sc2=score));
  set kids;
  retain ix2 time2 dur2 sc2;
  by kidname notsorted;

  if first.kidname then do;
     ix2=0;
     time2=0;
     dur2=0;
     sc2=0;
  end;
  ix2+1;
  if not(last.kidname) then do;
      if (ix2>1) then output;
      time2=time;
      dur2= duration;
      sc2=score;
  end;

  if last.kidname then do;
     sc2+score;
     output;
  end;
run;

Haikuo
Onyx | Level 15

As stated, if this is the only possible scenario in your real life data, a simple 'look ahead' will be sufficient. However, look at the following modified raw data:

DATA kids;

     LENGTH kidname $ 4;

     INPUT kidname time duration score;

     CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 . 60

Beth 4 . 40

Barb 1 3 20

Barb 2 5 80

Barb 3 . 10

Alic 1 6 50

Alic 2 . 20

Alic 3 3 12

Alic 4 . 17

;

So Beth has 2 missing in a row and Alic has 2 missing apart. A possible Proc SQL solution may take care of Beth (let me know if you are interested in the code) but fail on Alic, while 'Look Ahead' is able to tackle Alic but fail on Beth. Array, which OP happened to mention, becomes a strong candidate to reconcile both scenario at the same time, in 1X pass data step:

data want;

     do until (last.kidname);

           set kids;

           by kidname notsorted;

           array in(3) _temporary_;

           array out(3) _temporary_;

           array t(3) time--score;

       if missing(duration) then in(3)+score;

           if not missing (duration) or last.kidname then

                do;

                     do i=1 to 3;

                           out(i)=in(i);

                           in(i)=t(i);

                           t(i)=out(i);

                     end;

                     if not missing(duration) then

                           output;

                end;

          

     end;

     call missing (of in(*), of out(*));

     drop i;

run;


Haikuo

naveen_srini
Quartz | Level 8

DATA kids;

  LENGTH kidname $ 4;

  INPUT kidname time duration score;

CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 9 60

Beth 4 . 40

Barb 1 3 20

Barb 2 5 80

Barb 3 . 10

Alic 1 6 50

Alic 2 6 20

Alic 3 3 12

Alic 4 . 17

;

run;

data want(drop=score1 duration1 time1);

set kids;

by kidname notsorted;

retain score1 duration1 time1;

if not last.kidname then do;

score1=score;

duration1=duration;

time1=time;

end;

if last.kidname and duration=. then do;

score=sum(score,score1);

duration=duration1;

time=time1;

end;

run;

data final;

set want;

by time notsorted;

if not last.time then delete;

run;

Ksharp
Super User

Assuming  there were multi-missing value followed it .

DATA kids;
     LENGTH kidname $ 4;
     INPUT kidname time duration score;
     CARDS;
Beth 1 12 30
Beth 2 7 10
Beth 3 . 60
Beth 4 . 40
Barb 1 3 20
Barb 2 5 80
Barb 3 . 10
Alic 1 6 50
Alic 2 . 20
Alic 3 3 12
Alic 4 . 17
;
run;
data temp;
  set kids;
  by kidname notsorted;
  if not missing(duration) or first.kidname then group+1;
run;
proc sql;
 create table want as
  select group,kidname, time ,duration,sum(score) as score
   from temp 
    group by group
      having duration is not missing;
quit;


Xia Keshan

naveen_srini
Quartz | Level 8

DATA kids;/*haikuo's raw data*/

     LENGTH kidname $ 4;

     INPUT kidname time duration score;

     CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 . 60

Beth 4 . 40

Barb 1 3 20

Barb 2 5 80

Barb 3 . 10

Alic 1 6 50

Alic 2 . 20

Alic 3 3 12

Alic 4 . 17

;

data want1;

set kids;

by kidname notsorted;

retain score1 duration1 time1 ;

if first.kidname then call missing(score1,duration1,time1);

if duration ne . then do;

score1=score;

duration1=duration;

time1=time;

end;

if duration eq . then do;

score1+score ;

score=score1;

duration=duration1;

time=time1;

end;

run;

data final(drop=score1 duration1 time1);

set want1;

by time notsorted;

if not last.time then delete;

run;

Patrick
Opal | Level 21

As long as in your real data there are also no consecutive times with a missing duration then below SQL should work.

proc sql;

  create table want as

  select

    kidname,

    case(duration) when(.) then time-1 else time end as time,

    sum(duration) as duration,

    sum(score) as score

  from kids

  group by kidname, calculated time

  ;

quit;

PGStats
Opal | Level 21

Trying to keep it simple and also handle unexpected cases properly :

DATA kids;

  LENGTH kidname $ 4;

  INPUT kidname time duration score;

CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 9 60

Beth 4 . 40

Barb 1 3 20

Barb 2 5 80

Barb 3 . 10

Alic 1 6 50

Alic 2 6 20

Alic 3 3 12

Alic 4 . 17

Tric 1 . 11

Tric 2 5 13

Null 1 . 7

Null 2 . 19

;

data want;

do until(last.kidname);

    set kids; by kidname notsorted;

    if missing(duration) then do;

        s = sum(s, score);

        t = coalesce(t, time);

        end;

    else do;

        if not first.kidname then output;

        t = time;

        d = duration;

        s = score;

        end;

    end;

output;

keep kidname t d s;

rename t=time d=duration s=score;

run;

proc print data=want noobs; run;

PG

PG
Marcusliat
Calcite | Level 5

Thanks PG.

Haikuo
Onyx | Level 15

Thanks to PG, It turns out my previous code still has some glitches. The following one seems to be more robust:

DATA kids;

     LENGTH kidname $ 4;

     INPUT kidname time duration score;

     CARDS;

Beth 1 12 30

Beth 2 7 10

Beth 3 9 60

Beth 4 5 40

Barb 1 3 20

Barb 2 . 80

Barb 3 . 10

Alic 1 6 50

Alic 2 . 20

Alic 3 3 12

Alic 4 . 17

Tric 1 . 11

Tric 2 5 13

Null 1 . 7

Null 2 . 19

;

data want;

     do until (last.kidname);

           set kids;

           by kidname notsorted;

           array in(3) _temporary_;

           array out(3) _temporary_;

           array t(3) time--score;

           if missing(duration) then

                in(3)+score;

           if not missing (duration) or (last.kidname) then

                do;

                     do i=1 to 3;

                          out(i)=in(i);

                           in(i)=t(i);

                           t(i)=out(i);

                     end;

                     if not missing(duration) then

                           output;

                     if last.kidname then

                           do;

                                do i=1 to 3;

                                     t(i)=in(i);

                                end;

                                if not missing(duration) then

                                     output;

                           end;

                end;

     end;

     call missing (of in(*), of out(*));

     drop i;

run;

Marcusliat
Calcite | Level 5

Thanks Hai.kuo.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1697 views
  • 15 likes
  • 9 in conversation