Hi All,
I need to update one columns value based on other rows specific value.
in below data set i want to update Mvalue as "Progress" for all the rows of a type if any single row contains "Progress" in Mvalue.
eg: if type C contains Mvalue as "Progress
" then for all the rows of type "C" should be tagged as "Progress
" in Mvalue.
data one;
input type $ Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
run;
Regards,
Ashish
Hi @ASHISH2525,
One way to achieve this is a double "DOW loop":
data want(drop=flag);
do until(last.type);
set one;
by type;
if mvalue='Progress' then flag=1;
end;
do until(last.type);
set one;
by type;
if flag then mvalue='Progress';
output;
end;
run;
Hi @ASHISH2525,
One way to achieve this is a double "DOW loop":
data want(drop=flag);
do until(last.type);
set one;
by type;
if mvalue='Progress' then flag=1;
end;
do until(last.type);
set one;
by type;
if flag then mvalue='Progress';
output;
end;
run;
@FreelanceReinh beat me by 30 sec and a few lines 😉
@PeterClemmensen wrote:
@FreelanceReinh beat me by 30 sec and a few lines 😉
It's not easy to beat someone as fast as you. 🙂
@ASHISH2525: So, this solution has been confirmed independently. Other possible approaches include PROC SQL (using group by type and max(mvalue='Progress'), but with the drawback that the order of observations within a TYPE value is not maintained) and a "self merge" DATA step (with the drawback that the ugly note "MERGE statement has more than one data set with repeats of BY values" may appear in the log, which could be avoided with a preliminary step, though).
One way
data one;
input type $ Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
run;
data two;
flag=0;
do until (last.type);
set one;
by type;
if Mvalue='Progress' then flag=1;
end;
do until (last.type);
set one;
by type;
if flag=1 then Mvalue='Progress';
output;
end;
run;
data one;
input type $ Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
data want;
merge one one(keep=type Mvalue where=(_Mvalue='Progress') rename=(Mvalue=_Mvalue));
by type;
want_Mvalue=coalescec(_Mvalue,Mvalue);
drop _Mvalue Mvalue;
run;
Yes, this is the "self merge" approach, my version of which was this:
data want(drop=_m);
merge one(rename=(mvalue=_m))
one(where=(mvalue='Progress') in=p);
by type;
if ~p then mvalue=_m;
run;
As mentioned in my previous post, it has the disadvantage of potentially producing that unwanted note about "more than one data set with repeats of BY values" if there is a BY group containing more than one observation with mvalue='Progress' in dataset ONE. But it works properly in spite of that note.
OK. @FreelanceReinh ,
I just want put it further for fun.
data one;
input type $ Mvalue $12.;
datalines;
A closed
B Progress
C Closed
C Progress
C Closed
E closed
E Closed
E Progress
;
proc sql;
create table want as
select *,case when sum(Mvalue='Progress') then 'Progress' else Mvalue end as Want_Mvalue
from one
group by type
;
quit;
Now that we have "all" three major approaches on the table, it would be interesting to see how they perform on a larger input dataset. So, let's create one:
/* Create a larger test dataset */
data one(drop=i j);
length type $8 mvalue $12;
call streaminit(27182818);
do i=1 to 1e7;
type=put(i, hex8.);
do j=1 to rand('integer',10);
mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other');
output;
end;
end;
run; /* 54995649 obs. */
The log under the spoiler below shows the run times ("real time") on my workstation:
The above ranking was confirmed in a second run (8.57 s, 9.82 s, 16.47 s) and in a third run with the order of the three steps reversed (8.65 s, 9.73 s, 16.74 s).
To be fair, it should be noted that the PROC SQL approach does not require a sorted (or indexed) input dataset (by type), unlike the other two approaches. With a randomly sorted version of input dataset ONE it took about 32 - 33 s.
1 data one(drop=i j); 2 length type $8 mvalue $12; 3 call streaminit(27182818); 4 do i=1 to 1e7; 5 type=put(i, hex8.); 6 do j=1 to rand('integer',10); 7 mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other'); 8 output; 9 end; 10 end; 11 run; NOTE: The data set WORK.ONE has 54995649 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 5.29 seconds cpu time 5.28 seconds 12 13 /* DOW loop approach */ 14 15 data want1(drop=flag); 16 do until(last.type); 17 set one; 18 by type; 19 if mvalue='Progress' then flag=1; 20 end; 21 do until(last.type); 22 set one; 23 by type; 24 if flag then mvalue='Progress'; 25 output; 26 end; 27 run; NOTE: There were 54995649 observations read from the data set WORK.ONE. NOTE: There were 54995649 observations read from the data set WORK.ONE. NOTE: The data set WORK.WANT1 has 54995649 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 8.19 seconds cpu time 8.20 seconds 28 29 /* Self merge approach */ 30 31 data want2(drop=_m); 32 merge one(rename=(mvalue=_m)) 33 one(where=(mvalue='Progress') in=p); 34 by type; 35 if ~p then mvalue=_m; 36 run; NOTE: MERGE statement has more than one data set with repeats of BY values. NOTE: There were 54995649 observations read from the data set WORK.ONE. NOTE: There were 13752908 observations read from the data set WORK.ONE. WHERE mvalue='Progress'; NOTE: The data set WORK.WANT2 has 54995649 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 9.83 seconds cpu time 9.82 seconds 37 38 /* PROC SQL approach */ 39 40 proc sql; 41 create table want3 as 42 select type, case when max(mvalue='Progress') 43 then 'Progress' 44 else mvalue 45 end as mvalue 46 from one 47 group by type; NOTE: The query requires remerging summary statistics back with the original data. NOTE: SAS threaded sort was used. NOTE: Table WORK.WANT3 created, with 54995649 rows and 2 columns. 48 quit; NOTE: PROCEDURE SQL used (Total process time): real time 17.15 seconds cpu time 28.00 seconds
@FreelanceReinh cool stuff 🙂
Here are my results on the large data. Just for fun I added a Hash Object approach
data one(drop=i j);
length type $8 mvalue $12;
call streaminit(27182818);
do i=1 to 1e7;
type=put(i, hex8.);
do j=1 to rand('integer',10);
mvalue=choosec(rand('integer',4),'closed','Progress','Closed','other');
output;
end;
end;
run; /* 54995649 obs. */
/* 10.72 sec */
data want1(drop=flag);
do until(last.type);
set one;
by type;
if mvalue='Progress' then flag=1;
end;
do until(last.type);
set one;
by type;
if flag then mvalue='Progress';
output;
end;
run;
/* 11.8 sec */
data want2(drop=_m);
merge one(rename=(mvalue=_m))
one(where=(mvalue='Progress') in=p);
by type;
if ~p then mvalue=_m;
run;
/* 37.3 sec */
proc sql;
create table want3 as
select type, case when max(mvalue='Progress')
then 'Progress'
else mvalue
end as mvalue
from one
group by type;
quit;
/* 14.2 sec */
data want4;
if _N_ = 1 then do;
declare hash h(dataset:"one(where=(Mvalue='Progress'))", hashexp:20);
h.defineKey('type');
h.defineDone();
end;
set one;
if h.check()=0 then Mvalue='Progress';
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.