BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ajd555
Fluorite | Level 6

I have been tasked with taking the following data and creating two permanent data sets from it. One of these permanent data sets is supposed to contain the average of the "value" column for each group (meaning there should only be four rows in the end, with a new column that represents the average of respective values for A, B, C, and D). Averages should exclude missing values, meaning that if category A has a missing value, it should be divided by 3, not 4. The second permanent data set needs to be the one row with the highest overall value in the "value" column (in this case, the row with D 09JUL2021 951 should be the only row exported). I am having a tough time extracting that single row for the second data set. If you know of a way to perform these operations simultaneously, please let me know. Thank you for your time!

 

data work.have;
  input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;

 

Here is what I tried:

 

data mylib.data1(keep=type date value value_avg) mylib.data2;
    set work.have;
    by type;
    if value ne . then NotMissing=1; else NotMissing=0;
    if first.type then call missing(of value_avg);
        value_avg+value;
    if first.type then call missing(of num_per_cat);
        num_per_cat+NotMissing;
    Avg=divide((value_avg+value),(num_per_cat+NotMissing));
    if last.type then output mylib.data1;
run;

 

This was successful for me with calculating averages, but I have no idea how to extract the row with the highest value in the "value" column to a second data set.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Show what you expect as the result. 

 

Do you want one maximum observation? Or one per BY group?  One overall is the easiest. Keep track of where you found the maximum value and then you can use POINT= option on an additional SET statement to get back to it when you reach the end of the data.

data want1(keep=type value_avg num_per_cat)
     want2(drop=value_avg num_per_cat)
;
    set work.have end=eof;
    by type;
    retain value_avg num_per_cat max max_n_ ;
    if first.type then call missing(value_avg,num_per_cat);
    value_avg+value;
    num_per_cat+Not missing(value);
    if last.type then do;
       value_avg=divide(value_avg,num_per_cat);
       output want1;
    end;
    if value > max then do;
       max=value;
       max_n_=_n_;
    end;
    if eof then do;
      max=max_n_;
      set have point=max;
      output want2;
    end;
run;
                value_    num_per_
Obs    type      avg         cat

 1      A       25.667        3
 2      B      107.250        4
 3      C      109.000        4
 4      D      359.333        3

Obs    type         date    value    max_n_

 1      D      18JUL2021     951       13

 

View solution in original post

10 REPLIES 10
Kurt_Bremser
Super User

Maxim 7: There Is a Procedure for It.

In this case, PROC MEANS:

proc means data=have;
class type;
var value;
output out=want1 (drop=_type_ _freq_) mean()=;
run;

For the max, two methods:

proc sort data=have;
by descending value;
run;

data want2;
set have (obs=1);
run;

or

proc sql;
create table want2 as
  select *
  from have
  having value = max(value)
;
quit;

The second method will output several observations if there is a tie. You should also look at the RANK Procedure.

ajd555
Fluorite | Level 6

Hi @Kurt_Bremser ,

 

Thank you very much for the detailed reply. The methods you described worked very well for me. I am now realizing I should have been more specific in my description, so I apologize in advance. I was specifically asked to do the above using a single data step. Do you see this as being possible? I appreciate any thoughts you might have.

 

 

Kurt_Bremser
Super User

In a single data step:

data
  want1 (keep=type _value rename=(_value=value))
  want2 (keep=type date value)
;
set have end=done;
by type;
retain _mval _mpoint;
if first.type
then do;
  _count = 0;
  _value = 0;
end;
if value ne .
then do;
  _value + value;
  _count + 1;
end;
if last.type
then do;
  _value = _value / _count;
  output want1;
end;
if value > _mval
then do;
  _mval = value;
  _mpoint = _n_;
end;
if done
then do;
  set have point=_mpoint;
  output want2;
  stop;
end;
run;

Untested, posted from my tablet.

PeterClemmensen
Tourmaline | Level 20
data work.have;
  input type $ date DATE9. value;
datalines;
A 08JUL2021 .
A 09JUL2021 20
A 20JUL2021 55
A 20JUL2021 2
B 02JUL2021 9
B 22JUL2021 6
B 04JUL2021 8
B 07JUL2021 406
C 01JUL2021 215
C 28JUL2021 63
C 30JUL2021 78
C 21JUL2021 80
D 18JUL2021 951
D 09JUL2021 .
D 14JUL2021 54
D 08JUL2021 73
;

proc summary data = have nway;
   class type;
   var value;
   output out = want_mean(drop = _:) mean = ;
run;

proc summary data = have nway;
   class type;
   var value;
   output out = want_max(drop = _:) max = ;
run;
ajd555
Fluorite | Level 6

Hi @PeterClemmensen ,

 

Thank you as well for your reply. It is always interesting to see how different people approach the same problem! As I mentioned to Kurt above, I should have been slightly more specific in my description. I am attempting to solve the problem I posted using a single data step. If you have any ideas on how to do that, it would be greatly appreciated. Thanks again.

Quentin
Super User

Is this for a class?  The main reason I would think to approach this with a forced DATA step solution is as a learning exercise.

 

Do they want you to do this with one data step, or would you be ok with a two-step solution?

 

I would assume you have learned about the RETAIN statement, correct?

 

Have you ever seen a SET statement that is placed inside a do-loop?  I wouldn't expect that approach (unofficially named 'DOW-loop') to be introduced in an introductory class.

 

If you're familiar with retain, you should be able to use retain to calculated the maximum value for each by-group.  The tricky part is you won't know the maximum values until the end of the by-group.  So typically you would retain MaxValue and then output the records for last.type.  But in this case, you would also want to retain DateOfMaxValue.

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Tom
Super User Tom
Super User

Show what you expect as the result. 

 

Do you want one maximum observation? Or one per BY group?  One overall is the easiest. Keep track of where you found the maximum value and then you can use POINT= option on an additional SET statement to get back to it when you reach the end of the data.

data want1(keep=type value_avg num_per_cat)
     want2(drop=value_avg num_per_cat)
;
    set work.have end=eof;
    by type;
    retain value_avg num_per_cat max max_n_ ;
    if first.type then call missing(value_avg,num_per_cat);
    value_avg+value;
    num_per_cat+Not missing(value);
    if last.type then do;
       value_avg=divide(value_avg,num_per_cat);
       output want1;
    end;
    if value > max then do;
       max=value;
       max_n_=_n_;
    end;
    if eof then do;
      max=max_n_;
      set have point=max;
      output want2;
    end;
run;
                value_    num_per_
Obs    type      avg         cat

 1      A       25.667        3
 2      B      107.250        4
 3      C      109.000        4
 4      D      359.333        3

Obs    type         date    value    max_n_

 1      D      18JUL2021     951       13

 

ajd555
Fluorite | Level 6

Thank you @Tom , this is very helpful. I should have included a code segment showing what the desired output is but I appreciate that you were able to interpret it from my long-winded description above. I owe you one, thanks for the time!

PeterClemmensen
Tourmaline | Level 20

 My 2 cents

 

data have;
  input type $ date DATE9. value;
datalines;
A 08JUL2021 .   
A 09JUL2021 20  
A 20JUL2021 55  
A 20JUL2021 2   
B 02JUL2021 9   
B 22JUL2021 6   
B 04JUL2021 8   
B 07JUL2021 406 
C 01JUL2021 215 
C 28JUL2021 63  
C 30JUL2021 78  
C 21JUL2021 80  
D 18JUL2021 951 
D 09JUL2021 .   
D 14JUL2021 54  
D 08JUL2021 73  
;


data _null_;
   dcl hash h(ordered : 'Y');
   h.definekey('type');
   h.definedata('type', 'sum', 'n', 'mean');
   h.definedone();
 
   do until (z);
      set have end = z;

      if h.find() ne 0 then do;
         sum = 0;
         n   = 0;
      end;

      sum = sum(sum, value);
      if value then n = sum(n, 1);
      mean = divide(sum, n);
      h.replace();
   end;
 
   h.output(dataset : 'want(drop = n)');
run;
PeterClemmensen
Tourmaline | Level 20

Also, you could argue that this is a single data step 🙂

 

data _null_;
rc=dosubl("
proc summary data = have nway;
   class type;
   var value;
   output out = want_mean(drop = _:) mean = ;
run;

proc summary data = have nway;
   class type;
   var value;
   output out = want_max(drop = _:) max = ;
run;
");
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2622 views
  • 12 likes
  • 5 in conversation