Solved: SAS equivalent to SPSS Aggregate command

emaguin · Posted 07-17-2019 10:03 AM

I'm looking for how to implement the SPSS AGGREGATE command in SAS. I'm sure that some of you use more than SAS, have used or still use SPSS, and may be familiar with this command. Let me give an example that applies to my specific problem.

Aggregate outfile=*/break=var1 var2/varA varB varC=first(varA varB varC)/varJ varK=last(varJ varK).

Let me explain: Imagine a dataset with multiple records per combination of values of var1 and var2. We want a new dataset with one record per var1*var2 combination such that the value of varA, varB, and varC is the first non-missing value of the target variable for the var1*var2 combination in the original dataset. Likewise, the value of varJ and varK is the last non-missing value of the target variable.

Now, command details. "outfile=*" means that the to-be-created dataset replaces the currently open dataset (alternatively, a file name may be given for the new dataset). "break=var1 var2" specifies the variables that define records in the new dataset (I think the SAS syntax word "by" is the counterpart). I've shown the same variable names being reused, e.g., varA. Of course, a new variable name may be used, as in v27=first(varA).

Here's a very simple example.

Suppose this input dataset.

x y a b j k
2 1 2 . 3 4
2 1 2 5 8 1
2 1 1 9 7 8
2 2 8 . 1 1
3 3 . 3 5 2
3 3 . 4 2 6

aggregate outfile=*/break=x y/a b=first(a b)/j k=last(j k).

The resulting dataset

x y a b j k
2 1 2 . 7 8
2 2 8 . 1 1
3 3 . 3 2 6
3 3 . 4 2 6

Although "first" and "last" are the two functions I'm interested in right now, there is a much larger set of functions, and the command has more functionality that I've ignored for this question.

I'm a very new SAS user, and although I did the standard Google search, I found nothing. I may not have used the best search string; just as in SPSS or Stata, if you don't know what the thing you are searching for is called in SAS, you'll never find it.

Thank you,

Gene Maguin

Tom · Posted 07-17-2019 10:15 AM

Note: For more normal types of aggregations look at PROC MEANS (also known as PROC SUMMARY). Easy enough to calculate min/max/mean and other statistics per group. Perhaps the IDGROUP option could even do some of these exotic combinations you are looking for.

Your result does not seem to match your description. Your posted results seem to be taking the value from the first observation for the group and not the first non missing value within the group. Also your result has two observations for the last group, but only one for the first two groups.

If the goal is to find the first and last OBSERVATION for the group then use BY group processing.

data have ;
 input x y a b j k;
cards;
2 1 2 . 3 4
2 1 2 5 8 1
2 1 1 9 7 8
2 2 8 . 1 1
3 3 . 3 5 2
3 3 . 4 2 6 
;
data first last;
  set have;
  by x y;
  if first.y then output first;
  if last.y then output last;
run;

data want;
  merge first(keep=x y a b) last(keep=x y j k);
  by x y ;
run;

Results:

Obs    x    y    a    b    j    k

 1     2    1    2    .    7    8
 2     2    2    8    .    1    1
 3     3    3    .    3    2    6

If you really want the LAST non-missing value you can use the UPDATE statement.

If you really want the FIRST non-missing value you could pull in non missing values for each variable separately and take the first observation for each group.

data last ;
  update have(obs=0) have;
  by x y;
run;

data first;
  merge
   have(keep=x y a where=(not missing(a)))
   have(keep=x y b where=(not missing(b)))
   have(keep=x y j where=(not missing(j)))
   have(keep=x y k where=(not missing(k)))
  ;
  by x y;
  if first.y;
run;

data want;
  merge first(keep=x y a b) last(keep=x y j k);
  by x y;
run;

Results:

Obs    x    y    a    b    j    k

 1     2    1    2    5    7    8
 2     2    2    8    .    1    1
 3     3    3    .    3    2    6
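
For the ordinary per-group statistics mentioned in the note at the top of this post, a minimal sketch might look like the following. The output dataset name stats and the choice of MIN, MAX, and MEAN are only illustrations, not something the original question asked for. CLASS does not need sorted input, NWAY keeps only the full x*y combinations, and AUTONAME generates names such as a_Min and a_Mean so the statistics do not collide.

```sas
/* same sample data as above, repeated so the sketch runs on its own */
data have;
  input x y a b j k;
datalines;
2 1 2 . 3 4
2 1 2 5 8 1
2 1 1 9 7 8
2 2 8 . 1 1
3 3 . 3 5 2
3 3 . 4 2 6
;
run;

/* one output row per x*y combination */
proc summary data=have nway;
  class x y;
  var a b j k;
  /* "stats" is a hypothetical name; missing values are ignored by the statistics */
  output out=stats(drop=_type_ _freq_) min= max= mean= / autoname;
run;

proc print data=stats;
run;
```

This covers the usual statistics only; it does not reproduce the first/last non-missing logic, which is what the DATA step approaches above are for.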

Kurt_Bremser · Posted 07-17-2019 11:03 AM

Use the coalesce() function:

data have;
input x y a b j k;
datalines;
2 1 2 . 3 4
2 1 2 5 8 1
2 1 1 9 7 8
2 2 8 . 1 1
3 3 . 3 5 2
3 3 . 4 2 6
;

data want;
set have (rename=(a=_a b=_b j=_j k=_k));
by x y;
retain a b j k;
if first.y then call missing(a,b,j,k);
a = coalesce(a,_a);
b = coalesce(b,_b);
j = coalesce(_j,j);
k = coalesce(_k,k);
if last.y;
drop _:;
run;


emaguin · Posted 07-17-2019 02:21 PM

Thank you for your reply.

I accept that the code you provided does what I asked. However, I consider the answer incomplete because I cannot explain, either to myself or to anybody else, how or why it works. Specifics follow.

1) Now that I know the name, I can find it in the documentation. I see that, given a set of variables, coalesce returns the first non-missing value. What about the last non-missing value?

2) I need a walk-through because this is extremely complex to me. The first thing you do is a rename: a to _a, etc. What does that do? The variable name "a" is gone; it's now _a (and I assume that "_a" is just a convenient name, and "xx" would have worked as well). Is that true?

Next, retain the original variable names. I've looked at the documentation for RETAIN and there's something going on that I don't understand: "Causes a variable that is created by an INPUT or assignment statement to retain its value from one iteration of the DATA step to the next." "Retain" has a common meaning, but I'll bet that common meaning is not correct here. It may be that "iteration" is the key word.

Next line: what is 'first.y'? That is not in the documentation, and 'first.' is not either; however, looking at Ron Cody's book, I see what it means. I see that 'call missing' is a routine, and I read what it does. Why is it necessary?

Next, the coalesce functions. Let's take the first one, a=coalesce(a,_a). The first (x,y) combination is (2,1); what are the values of a and _a for (2,1)? The coalesce example in the documentation shows a set of variables. Here, there's a set of values, not variables.

Last (and please note that I haven't run this code): I notice that the statements for j and k put _j/_k first, whereas the statements for a and b put _a/_b last. I can guess at something, but the functional meaning is not clear. Please clarify this.

Kurt_Bremser · Posted 07-17-2019 02:53 PM

data want;
set have (rename=(a=_a b=_b j=_j k=_k));
/* I rename the incoming variables, so I can use their names for new ones */
by x y;
/* sets up by-group processing */
retain a b j k;
/* create new variables, and have them keep their values across data step iterations */
if first.y then call missing(a,b,j,k);
/* initialize all new variables to missing at a group change */
a = coalesce(a,_a);
/* as long as the incoming value is missing, a will stay missing */
/* the first non-missing value will be kept */
b = coalesce(b,_b);
j = coalesce(_j,j);
/* will stay missing until non-missing value */
/* subsequent non-missing values will overwrite, until only missing values
  come or the next group change is encountered */
k = coalesce(_k,k);
if last.y; /* output only when group ends */
drop _:; /* throw away the old variables */
run;
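
To see the values that coalesce() works with on each row, one option is to log them with a PUT statement. The sketch below is the same step with that one line added (the PUT is an illustrative addition, not part of the original solution), followed by a hand trace of the first group in a comment.

```sas
data have;
  input x y a b j k;
datalines;
2 1 2 . 3 4
2 1 2 5 8 1
2 1 1 9 7 8
2 2 8 . 1 1
3 3 . 3 5 2
3 3 . 4 2 6
;
run;

data want;
  set have (rename=(a=_a b=_b j=_j k=_k));
  by x y;
  retain a b j k;
  if first.y then call missing(a,b,j,k);
  a = coalesce(a,_a);   /* keeps a once it is non-missing: first non-missing */
  b = coalesce(b,_b);
  j = coalesce(_j,j);   /* prefers the incoming value: last non-missing */
  k = coalesce(_k,k);
  /* illustrative addition: write incoming and accumulated values to the log */
  put x= y= _a= a= _b= b= _j= j= _k= k=;
  if last.y;
  drop _:;
run;

/* Values after the four assignments for the group x=2 y=1:
   row  _a  a   _b  b   _j  j   _k  k
   1    2   2   .   .   3   3   4   4
   2    2   2   5   5   8   8   1   1
   3    1   2   9   5   7   7   8   8    <- last.y, so this row is output
*/
```

The argument order is the whole difference: with the retained variable first, an existing value wins and later rows cannot change it; with the incoming variable first, every non-missing incoming value replaces what was kept.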

If you have further questions, feel free to ask.
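
On the RETAIN question raised earlier in the thread, a small side-by-side sketch may help. The dataset demo and the variable names with_retain and without_retain are made up for illustration. A variable created by an assignment is reset to missing at the start of every DATA step iteration unless it is named in a RETAIN statement.

```sas
data demo;
  input v;
  retain with_retain;   /* keeps its value from one iteration to the next */
  with_retain    = coalesce(with_retain, v);
  without_retain = coalesce(without_retain, v);  /* reset to missing each iteration */
datalines;
.
5
.
9
;
run;

proc print data=demo;
run;

/* Expected values:
   v   with_retain   without_retain
   .        .              .
   5        5              5
   .        5              .
   9        5              9
*/
```

This is also why the call missing() at first.y is needed in the solution: without it, a value retained from the previous group would leak into the next one.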
