Solved: Extracting row with highest value in a column while also calculating a...

ajd555 · Posted 11-20-2022 10:40 PM

I have been tasked with taking the following data and creating two permanent data sets from it. One of these permanent data sets is supposed to contain the average of the "value" column for each group (meaning there should only be four rows in the end, with a new column that represents the average of respective values for A, B, C, and D). Averages should exclude missing values, meaning that if category A has a missing value, it should be divided by 3, not 4. The second permanent data set needs to be the one row with the highest overall value in the "value" column (in this case, the row with D 09JUL2021 951 should be the only row exported). I am having a tough time extracting that single row for the second data set. If you know of a way to perform these operations simultaneously, please let me know. Thank you for your time!

data work.have;
  input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;

Here is what I tried:

data mylib.data1(keep=type date value value_avg) mylib.data2;
    set work.have;
    by type;
    if value ne . then NotMissing=1; else NotMissing=0;
    if first.type then call missing(of value_avg);
        value_avg+value;
    if first.type then call missing(of num_per_cat);
        num_per_cat+NotMissing;
    Avg=divide((value_avg+value),(num_per_cat+NotMissing));
    if last.type then output mylib.data1;
run;

This was successful for me with calculating averages, but I have no idea how to extract the row with the highest value in the "value" column to a second data set.

Tom · Posted 11-21-2022 10:47 AM

Show what you expect as the result.

Do you want one maximum observation? Or one per BY group? One overall is the easiest. Keep track of where you found the maximum value and then you can use POINT= option on an additional SET statement to get back to it when you reach the end of the data.

data want1(keep=type value_avg num_per_cat)
     want2(drop=value_avg num_per_cat)
;
    set work.have end=eof;
    by type;
    retain value_avg num_per_cat max max_n_ ;
    if first.type then call missing(value_avg,num_per_cat);
    value_avg+value;
    num_per_cat+Not missing(value);
    if last.type then do;
       value_avg=divide(value_avg,num_per_cat);
       output want1;
    end;
    if value > max then do;
       max=value;
       max_n_=_n_;
    end;
    if eof then do;
      max=max_n_;
      set have point=max;
      output want2;
    end;
run;

                value_    num_per_
Obs    type      avg         cat

 1      A       25.667        3
 2      B      107.250        4
 3      C      109.000        4
 4      D      359.333        3

Obs    type         date    value    max_n_

 1      D      18JUL2021     951       13

View solution in original post

Kurt_Bremser · Posted 11-21-2022 12:37 AM

Maxim 7: There Is a Procedure for It.

In this case, PROC MEANS:

proc means data=have;
class type;
var value;
output out=want1 (drop=_type_ _freq_) mean()=;
run;

For the max, two methods:

proc sort data=have;
by descending value;
run;

data want2;
set have (obs=1);
run;

or

proc sql;
create table want2 as
  select *
  from have
  having value = max(value)
;
quit;

The second method will output several observations if there is a tie. You should also look at the RANK Procedure.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ajd555 · Posted 11-21-2022 10:26 AM

Hi @Kurt_Bremser ,

Thank you very much for the detailed reply. The methods you described worked very well for me. I am now realizing I should have been more specific in my description, so I apologize in advance. I was specifically asked to do the above using a single data step. Do you see this as being possible? I appreciate any thoughts you might have.

Kurt_Bremser · Posted 11-21-2022 11:12 AM

In a single data step:

data
  want1 (keep=type _value rename=(_value=value))
  want2 (keep=type date value)
;
set have end=done;
by type;
retain _mval _mpoint;
if first.type
then do;
  _count = 0;
  _value = 0;
end;
if value ne .
then do;
  _value + value;
  _count + 1;
end;
if last.type
then do;
  _value = _value / _count;
  output want1;
end;
if value > _mval
then do;
  _mval = value;
  _mpoint = _n_;
end;
if done
then do;
  set have point=_mpoint;
  output want2;
  stop;
end;
run;

Untested, posted from my tablet.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 11-21-2022 03:19 AM

data work.have;
  input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;

proc summary data = have nway;
   class type;
   var value;
   output out = want_mean(drop = _:) mean = ;
run;

proc summary data = have nway;
   class type;
   var value;
   output out = want_max(drop = _:) max = ;
run;

The DATA to DATA Step Macro
Blog: SASnrd

ajd555 · Posted 11-21-2022 10:27 AM

Hi @PeterClemmensen ,

Thank you as well for your reply. It is always interesting to see how different people approach the same problem! As I mentioned to Kurt above, I should have been slightly more specific in my description. I am attempting to solve the problem I posted using a single data step. If you have any ideas on how to do that, it would be greatly appreciated. Thanks again.

Quentin · Posted 11-21-2022 10:41 AM

Is this for a class? The main reason I would think to approach this with a forced DATA step solution is as a learning exercise.

Do they want you to do this with one data step, or would you be ok with a two-step solution?

I would assume you have learned about the RETAIN statement, correct?

Have you ever seen a SET statement that is placed inside a do-loop? I wouldn't expect that approach (unofficially named 'DOW-loop') to be introduced in an introductory class.

If you're familiar with retain, you should be able to use retain to calculated the maximum value for each by-group. The tricky part is you won't know the maximum values until the end of the by-group. So typically you would retain MaxValue and then output the records for last.type. But in this case, you would also want to retain DateOfMaxValue.

The Boston Area SAS Users Group (BASUG) is hosting our in person SAS Blowout on Oct 18!
This full-day event in Cambridge, Mass features four presenters from SAS, presenting on a range of SAS 9 programming topics. Pre-registration by Oct 15 is required.
Full details and registration info at https://www.basug.org/events.

Tom · Posted 11-21-2022 10:47 AM

Show what you expect as the result.

Do you want one maximum observation? Or one per BY group? One overall is the easiest. Keep track of where you found the maximum value and then you can use POINT= option on an additional SET statement to get back to it when you reach the end of the data.

data want1(keep=type value_avg num_per_cat)
     want2(drop=value_avg num_per_cat)
;
    set work.have end=eof;
    by type;
    retain value_avg num_per_cat max max_n_ ;
    if first.type then call missing(value_avg,num_per_cat);
    value_avg+value;
    num_per_cat+Not missing(value);
    if last.type then do;
       value_avg=divide(value_avg,num_per_cat);
       output want1;
    end;
    if value > max then do;
       max=value;
       max_n_=_n_;
    end;
    if eof then do;
      max=max_n_;
      set have point=max;
      output want2;
    end;
run;

                value_    num_per_
Obs    type      avg         cat

 1      A       25.667        3
 2      B      107.250        4
 3      C      109.000        4
 4      D      359.333        3

Obs    type         date    value    max_n_

 1      D      18JUL2021     951       13

ajd555 · Posted 11-21-2022 10:56 AM

Thank you @Tom , this is very helpful. I should have included a code segment showing what the desired output is but I appreciate that you were able to interpret it from my long-winded description above. I owe you one, thanks for the time!

PeterClemmensen · Posted 11-21-2022 01:58 PM

My 2 cents

data have;
  input type $ date DATE9. value;
datalines;
A 08JUL2021 .   
A 09JUL2021 20  
A 20JUL2021 55  
A 20JUL2021 2   
B 02JUL2021 9   
B 22JUL2021 6   
B 04JUL2021 8   
B 07JUL2021 406 
C 01JUL2021 215 
C 28JUL2021 63  
C 30JUL2021 78  
C 21JUL2021 80  
D 18JUL2021 951 
D 09JUL2021 .   
D 14JUL2021 54  
D 08JUL2021 73  
;


data _null_;
   dcl hash h(ordered : 'Y');
   h.definekey('type');
   h.definedata('type', 'sum', 'n', 'mean');
   h.definedone();
 
   do until (z);
      set have end = z;

      if h.find() ne 0 then do;
         sum = 0;
         n   = 0;
      end;

      sum = sum(sum, value);
      if value then n = sum(n, 1);
      mean = divide(sum, n);
      h.replace();
   end;
 
   h.output(dataset : 'want(drop = n)');
run;

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 11-21-2022 02:02 PM

Also, you could argue that this is a single data step 🙂

data _null_;
rc=dosubl("
proc summary data = have nway;
   class type;
   var value;
   output out = want_mean(drop = _:) mean = ;
run;

proc summary data = have nway;
   class type;
   var value;
   output out = want_max(drop = _:) max = ;
run;
");
run;

The DATA to DATA Step Macro
Blog: SASnrd

Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

Re: Extracting row with highest value in a column while also calculating averages by group

SAS Innovate 2025: Call for Content

Classroom Training Available!