I have been tasked with taking the following data and creating two permanent data sets from it. One of these permanent data sets is supposed to contain the average of the "value" column for each group (meaning there should only be four rows in the end, with a new column that represents the average of respective values for A, B, C, and D). Averages should exclude missing values, meaning that if category A has a missing value, it should be divided by 3, not 4. The second permanent data set needs to be the one row with the highest overall value in the "value" column (in this case, the row with D 09JUL2021 951 should be the only row exported). I am having a tough time extracting that single row for the second data set. If you know of a way to perform these operations simultaneously, please let me know. Thank you for your time!
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
Here is what I tried:
data mylib.data1(keep=type date value value_avg) mylib.data2;
set work.have;
by type;
if value ne . then NotMissing=1; else NotMissing=0;
if first.type then call missing(of value_avg);
value_avg+value;
if first.type then call missing(of num_per_cat);
num_per_cat+NotMissing;
Avg=divide((value_avg+value),(num_per_cat+NotMissing));
if last.type then output mylib.data1;
run;
This was successful for me with calculating averages, but I have no idea how to extract the row with the highest value in the "value" column to a second data set.
Show what you expect as the result.
Do you want one maximum observation? Or one per BY group? One overall is the easiest. Keep track of where you found the maximum value and then you can use POINT= option on an additional SET statement to get back to it when you reach the end of the data.
data want1(keep=type value_avg num_per_cat)
want2(drop=value_avg num_per_cat)
;
set work.have end=eof;
by type;
retain value_avg num_per_cat max max_n_ ;
if first.type then call missing(value_avg,num_per_cat);
value_avg+value;
num_per_cat+Not missing(value);
if last.type then do;
value_avg=divide(value_avg,num_per_cat);
output want1;
end;
if value > max then do;
max=value;
max_n_=_n_;
end;
if eof then do;
max=max_n_;
set have point=max;
output want2;
end;
run;
value_ num_per_ Obs type avg cat 1 A 25.667 3 2 B 107.250 4 3 C 109.000 4 4 D 359.333 3 Obs type date value max_n_ 1 D 18JUL2021 951 13
Maxim 7: There Is a Procedure for It.
In this case, PROC MEANS:
proc means data=have;
class type;
var value;
output out=want1 (drop=_type_ _freq_) mean()=;
run;
For the max, two methods:
proc sort data=have;
by descending value;
run;
data want2;
set have (obs=1);
run;
or
proc sql;
create table want2 as
select *
from have
having value = max(value)
;
quit;
The second method will output several observations if there is a tie. You should also look at the RANK Procedure.
Hi @Kurt_Bremser ,
Thank you very much for the detailed reply. The methods you described worked very well for me. I am now realizing I should have been more specific in my description, so I apologize in advance. I was specifically asked to do the above using a single data step. Do you see this as being possible? I appreciate any thoughts you might have.
In a single data step:
data
want1 (keep=type _value rename=(_value=value))
want2 (keep=type date value)
;
set have end=done;
by type;
retain _mval _mpoint;
if first.type
then do;
_count = 0;
_value = 0;
end;
if value ne .
then do;
_value + value;
_count + 1;
end;
if last.type
then do;
_value = _value / _count;
output want1;
end;
if value > _mval
then do;
_mval = value;
_mpoint = _n_;
end;
if done
then do;
set have point=_mpoint;
output want2;
stop;
end;
run;
Untested, posted from my tablet.
data work.have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
proc summary data = have nway;
class type;
var value;
output out = want_mean(drop = _:) mean = ;
run;
proc summary data = have nway;
class type;
var value;
output out = want_max(drop = _:) max = ;
run;
Hi @PeterClemmensen ,
Thank you as well for your reply. It is always interesting to see how different people approach the same problem! As I mentioned to Kurt above, I should have been slightly more specific in my description. I am attempting to solve the problem I posted using a single data step. If you have any ideas on how to do that, it would be greatly appreciated. Thanks again.
Is this for a class? The main reason I would think to approach this with a forced DATA step solution is as a learning exercise.
Do they want you to do this with one data step, or would you be ok with a two-step solution?
I would assume you have learned about the RETAIN statement, correct?
Have you ever seen a SET statement that is placed inside a do-loop? I wouldn't expect that approach (unofficially named 'DOW-loop') to be introduced in an introductory class.
If you're familiar with retain, you should be able to use retain to calculated the maximum value for each by-group. The tricky part is you won't know the maximum values until the end of the by-group. So typically you would retain MaxValue and then output the records for last.type. But in this case, you would also want to retain DateOfMaxValue.
Show what you expect as the result.
Do you want one maximum observation? Or one per BY group? One overall is the easiest. Keep track of where you found the maximum value and then you can use POINT= option on an additional SET statement to get back to it when you reach the end of the data.
data want1(keep=type value_avg num_per_cat)
want2(drop=value_avg num_per_cat)
;
set work.have end=eof;
by type;
retain value_avg num_per_cat max max_n_ ;
if first.type then call missing(value_avg,num_per_cat);
value_avg+value;
num_per_cat+Not missing(value);
if last.type then do;
value_avg=divide(value_avg,num_per_cat);
output want1;
end;
if value > max then do;
max=value;
max_n_=_n_;
end;
if eof then do;
max=max_n_;
set have point=max;
output want2;
end;
run;
value_ num_per_ Obs type avg cat 1 A 25.667 3 2 B 107.250 4 3 C 109.000 4 4 D 359.333 3 Obs type date value max_n_ 1 D 18JUL2021 951 13
Thank you @Tom , this is very helpful. I should have included a code segment showing what the desired output is but I appreciate that you were able to interpret it from my long-winded description above. I owe you one, thanks for the time!
My 2 cents
data have;
input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;
data _null_;
dcl hash h(ordered : 'Y');
h.definekey('type');
h.definedata('type', 'sum', 'n', 'mean');
h.definedone();
do until (z);
set have end = z;
if h.find() ne 0 then do;
sum = 0;
n = 0;
end;
sum = sum(sum, value);
if value then n = sum(n, 1);
mean = divide(sum, n);
h.replace();
end;
h.output(dataset : 'want(drop = n)');
run;
Also, you could argue that this is a single data step 🙂
data _null_;
rc=dosubl("
proc summary data = have nway;
class type;
var value;
output out = want_mean(drop = _:) mean = ;
run;
proc summary data = have nway;
class type;
var value;
output out = want_max(drop = _:) max = ;
run;
");
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.