Solved: Replacing values based on others rows value

ASHISH2525 · Posted 06-19-2019 04:28 AM

Hi All,

I need to update one columns value based on other rows specific value.

in below data set i want to update Mvalue as "Progress" for all the rows of a type if any single row contains "Progress" in Mvalue.

eg: if type C contains Mvalue as "Progress" then for all the rows of type "C" should be tagged as "Progress" in Mvalue.

data one;

input type $  Mvalue $12.;


datalines;

A closed

B Progress

C Closed

C Progress

C Closed

E closed

E Closed

E Progress

;

run;

Regards,

Ashish

FreelanceReinh · Posted 06-19-2019 05:06 AM

Hi @ASHISH2525,

One way to achieve this is a double "DOW loop":

data want(drop=flag);
do until(last.type);
  set one;
  by type;
  if mvalue='Progress' then flag=1;
end;
do until(last.type);
  set one;
  by type;
  if flag then mvalue='Progress';
  output;
end;
run;

View solution in original post

FreelanceReinh · Posted 06-19-2019 05:06 AM

Hi @ASHISH2525,

One way to achieve this is a double "DOW loop":

data want(drop=flag);
do until(last.type);
  set one;
  by type;
  if mvalue='Progress' then flag=1;
end;
do until(last.type);
  set one;
  by type;
  if flag then mvalue='Progress';
  output;
end;
run;

PeterClemmensen · Posted 06-19-2019 05:07 AM

@FreelanceReinh beat me by 30 sec and a few lines 😉

The DATA to DATA Step Macro
Blog: SASnrd

FreelanceReinh · Posted 06-19-2019 05:24 AM

@PeterClemmensen wrote:

@FreelanceReinh beat me by 30 sec and a few lines 😉

It's not easy to beat someone as fast as you. 🙂

@ASHISH2525: So, this solution has been confirmed independently. Other possible approaches include PROC SQL (using group by type and max(mvalue='Progress'), but with the drawback that the order of observations within a TYPE value is not maintained) and a "self merge" DATA step (with the drawback that the ugly note "MERGE statement has more than one data set with repeats of BY values" may appear in the log, which could be avoided with a preliminary step, though).

PeterClemmensen · Posted 06-19-2019 05:06 AM

One way

data one;
input type $  Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
run;

data two;
   flag=0;
   do until (last.type);
      set one;
      by type;
      if Mvalue='Progress' then flag=1;
   end;
   do until (last.type);
      set one;
      by type;
      if flag=1 then Mvalue='Progress';
      output;
   end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Ksharp · Posted 06-19-2019 08:48 AM

data one;
input type $  Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
data want;
 merge one one(keep=type  Mvalue where=(_Mvalue='Progress') rename=(Mvalue=_Mvalue));
 by type;
 want_Mvalue=coalescec(_Mvalue,Mvalue);
 drop _Mvalue Mvalue;
run;

FreelanceReinh · Posted 06-19-2019 09:02 AM

Yes, this is the "self merge" approach, my version of which was this:

data want(drop=_m);
merge one(rename=(mvalue=_m))
      one(where=(mvalue='Progress') in=p);
by type;
if ~p then mvalue=_m;
run;

As mentioned in my previous post, it has the disadvantage of potentially producing that unwanted note about "more than one data set with repeats of BY values" if there is a BY group containing more than one observation with mvalue='Progress' in dataset ONE. But it works properly in spite of that note.

Ksharp · Posted 06-19-2019 09:09 AM

OK. @FreelanceReinh ,

I just want put it further for fun.

data one;
input type $  Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
proc sql;
create table want as
select 	*,case when sum(Mvalue='Progress') then  'Progress' else Mvalue end as Want_Mvalue
 from one 
  group by type
;
quit;

FreelanceReinh · Posted 06-19-2019 10:17 AM

Now that we have "all" three major approaches on the table, it would be interesting to see how they perform on a larger input dataset. So, let's create one:

/* Create a larger test dataset */

data one(drop=i j);
length type $8 mvalue $12;
call streaminit(27182818);
do i=1 to 1e7;
  type=put(i, hex8.);
  do j=1 to rand('integer',10);
    mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other');
    output;
  end;
end;
run; /* 54995649 obs. */

The log under the spoiler below shows the run times ("real time") on my workstation:

DOW loop approach: 8.19 s
Self merge approach: 9.83 s
PROC SQL approach: 17.15 s

The above ranking was confirmed in a second run (8.57 s, 9.82 s, 16.47 s) and in a third run with the order of the three steps reversed (8.65 s, 9.73 s, 16.74 s).

To be fair, it should be noted that the PROC SQL approach does not require a sorted (or indexed) input dataset (by type), unlike the other two approaches. With a randomly sorted version of input dataset ONE it took about 32 - 33 s.

Spoiler

1    data one(drop=i j);
2    length type $8 mvalue $12;
3    call streaminit(27182818);
4    do i=1 to 1e7;
5      type=put(i, hex8.);
6      do j=1 to rand('integer',10);
7        mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other');
8        output;
9      end;
10   end;
11   run;

NOTE: The data set WORK.ONE has 54995649 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           5.29 seconds
      cpu time            5.28 seconds


12
13   /* DOW loop approach */
14
15   data want1(drop=flag);
16   do until(last.type);
17     set one;
18     by type;
19     if mvalue='Progress' then flag=1;
20   end;
21   do until(last.type);
22     set one;
23     by type;
24     if flag then mvalue='Progress';
25     output;
26   end;
27   run;

NOTE: There were 54995649 observations read from the data set WORK.ONE.
NOTE: There were 54995649 observations read from the data set WORK.ONE.
NOTE: The data set WORK.WANT1 has 54995649 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           8.19 seconds
      cpu time            8.20 seconds


28
29   /* Self merge approach */
30
31   data want2(drop=_m);
32   merge one(rename=(mvalue=_m))
33         one(where=(mvalue='Progress') in=p);
34   by type;
35   if ~p then mvalue=_m;
36   run;

NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 54995649 observations read from the data set WORK.ONE.
NOTE: There were 13752908 observations read from the data set WORK.ONE.
      WHERE mvalue='Progress';
NOTE: The data set WORK.WANT2 has 54995649 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           9.83 seconds
      cpu time            9.82 seconds


37
38   /* PROC SQL approach */
39
40   proc sql;
41   create table want3 as
42   select type, case when max(mvalue='Progress')
43                  then 'Progress'
44                  else mvalue
45                end as mvalue
46   from one
47   group by type;
NOTE: The query requires remerging summary statistics back with the original data.
NOTE: SAS threaded sort was used.
NOTE: Table WORK.WANT3 created, with 54995649 rows and 2 columns.

48   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           17.15 seconds
      cpu time            28.00 seconds

1 data one(drop=i j); 2 length type $8 mvalue $12; 3 call streaminit(27182818); 4 do i=1 to 1e7; 5 type=put(i, hex8.); 6 do j=1 to rand('integer',10); 7 mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other'); 8 output; 9 end; 10 end; 11 run; NOTE: The data set WORK.ONE has 54995649 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 5.29 seconds cpu time 5.28 seconds 12 13 /* DOW loop approach */ 14 15 data want1(drop=flag); 16 do until(last.type); 17 set one; 18 by type; 19 if mvalue='Progress' then flag=1; 20 end; 21 do until(last.type); 22 set one; 23 by type; 24 if flag then mvalue='Progress'; 25 output; 26 end; 27 run; NOTE: There were 54995649 observations read from the data set WORK.ONE. NOTE: There were 54995649 observations read from the data set WORK.ONE. NOTE: The data set WORK.WANT1 has 54995649 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 8.19 seconds cpu time 8.20 seconds 28 29 /* Self merge approach */ 30 31 data want2(drop=_m); 32 merge one(rename=(mvalue=_m)) 33 one(where=(mvalue='Progress') in=p); 34 by type; 35 if ~p then mvalue=_m; 36 run; NOTE: MERGE statement has more than one data set with repeats of BY values. NOTE: There were 54995649 observations read from the data set WORK.ONE. NOTE: There were 13752908 observations read from the data set WORK.ONE. WHERE mvalue='Progress'; NOTE: The data set WORK.WANT2 has 54995649 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 9.83 seconds cpu time 9.82 seconds 37 38 /* PROC SQL approach */ 39 40 proc sql; 41 create table want3 as 42 select type, case when max(mvalue='Progress') 43 then 'Progress' 44 else mvalue 45 end as mvalue 46 from one 47 group by type; NOTE: The query requires remerging summary statistics back with the original data. NOTE: SAS threaded sort was used. NOTE: Table WORK.WANT3 created, with 54995649 rows and 2 columns. 48 quit; NOTE: PROCEDURE SQL used (Total process time): real time 17.15 seconds cpu time 28.00 seconds

PeterClemmensen · Posted 06-20-2019 01:52 AM

@FreelanceReinh cool stuff 🙂

Here are my results on the large data. Just for fun I added a Hash Object approach

data one(drop=i j);
length type $8 mvalue $12;
call streaminit(27182818);
do i=1 to 1e7;
  type=put(i, hex8.);
  do j=1 to rand('integer',10);
    mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other');
    output;
  end;
end;
run; /* 54995649 obs. */

/* 10.72 sec */
data want1(drop=flag);
do until(last.type);
  set one;
  by type;
  if mvalue='Progress' then flag=1;
end;
do until(last.type);
  set one;
  by type;
  if flag then mvalue='Progress';
  output;
end;
run;

/* 11.8 sec */
data want2(drop=_m);
merge one(rename=(mvalue=_m))
      one(where=(mvalue='Progress') in=p);
by type;
if ~p then mvalue=_m;
run;

/* 37.3 sec */
proc sql;
create table want3 as
select type, case when max(mvalue='Progress')
               then 'Progress'
               else mvalue
             end as mvalue
from one
group by type;
quit;

/* 14.2 sec */
data want4;
   if _N_ = 1 then do;
      declare hash h(dataset:"one(where=(Mvalue='Progress'))", hashexp:20);
      h.defineKey('type');
      h.defineDone();
   end;

   set one;
   if h.check()=0 then Mvalue='Progress';
run;

The DATA to DATA Step Macro
Blog: SASnrd

Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Re: Replacing values based on others rows value

Classroom Training Available!