03-01-2012 10:16 AM
We were having an (admittedly academic) discussion on the differences between using CLASS versus BY in PROC MEANS. Performance issues aside, are there any differences? Some of my colleagues vaguely recalled something about missing values being treated differently, but we couldn't reproduce this. Are there differences (again, performance aside), or did we remember incorrectly? Thanks.
03-01-2012 10:33 AM
I don't think there are any numerical/statistical differences. I find the CLASS statement convenient when I want to see all of the output in a single table; the BY group approach puts each BY group's statistics on a separate page. Also, you need to SORT the data to use the BY statement, but not to use the CLASS statement. The BY group approach is more efficient when the data are already sorted, and it requires less memory. The output data sets also look different for the two approaches.
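To make the comparison concrete, here is a minimal sketch of the two approaches side by side. It assumes the SASHELP.CLASS sample data set that ships with SAS; the output data set name WORK.CLASS_SORTED and the choice of statistics are only illustrative.

```sas
/* BY processing needs the input sorted (or indexed) by the BY variable. */
proc sort data=sashelp.class out=work.class_sorted;
  by sex;
run;

/* BY: a separate table of statistics for each value of SEX. */
proc means data=work.class_sorted n mean std;
  by sex;
  var height weight;
run;

/* CLASS: no sort needed; all groups appear in a single table. */
proc means data=sashelp.class n mean std;
  class sex;
  var height weight;
run;
```

The statistics for each group should agree between the two steps; only the layout of the printed output differs.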
03-01-2012 11:49 AM
Rick, you have it exactly right. I just wanted to expound upon one of your points.
Comparing CLASS STATE COUNTY; vs. BY STATE COUNTY;
In the output data set using BY, there is one observation for each STATE/COUNTY combination.
In the output data set using CLASS, you get those same observations, plus: one observation holding a summary for the entire data set, one set of observations holding a summary for each STATE, and another set of observations holding a summary for each COUNTY. The variable _TYPE_ in the output data sets tells you what the level of summarization is for that observation.
The printed reports give you summaries at the most detailed level only, even if the output data sets would be different. And, as Rick noted, the format of the reports would change.
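A small sketch of the output data set difference, again assuming the SASHELP.CLASS sample data (with SEX and AGE standing in for STATE and COUNTY). The data set names WORK.CLASS_OUT and WORK.NWAY_ONLY and the variable name MEAN_HEIGHT are made up for the example.

```sas
/* CLASS: the output data set holds every level of summarization. */
proc means data=sashelp.class noprint;
  class sex age;
  var height;
  output out=work.class_out mean=mean_height;
run;

/* _TYPE_ = 0: whole data set                */
/* _TYPE_ = 1: one observation per AGE       */
/* _TYPE_ = 2: one observation per SEX       */
/* _TYPE_ = 3: one observation per SEX/AGE   */
proc print data=work.class_out;
run;

/* NWAY keeps only the most detailed level (_TYPE_ = 3 here). */
proc means data=sashelp.class noprint nway;
  class sex age;
  var height;
  output out=work.nway_only mean=mean_height;
run;
```

With NWAY, the CLASS output contains the same set of combinations that a BY statement would produce, which is handy if the rollup observations are not wanted.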
Finally, your colleagues' recollection is correct. Any observation where a CLASS variable is missing will be thrown out of the analysis. The MISSING option changes that, treating missing values like any other value for a CLASS variable.
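If you want to reproduce the missing-value behavior, something like the following should do it. This is just a sketch: the data set WORK.DEMO and its variables GROUP and X are invented for the test. For contrast, the last step uses BY, which treats a missing value as a BY group of its own.

```sas
/* Tiny test data: the third observation has a missing GROUP. */
data work.demo;
  input group $ x;
  datalines;
A 1
A 2
. 3
B 4
;
run;

/* CLASS without MISSING: the observation with missing GROUP is dropped. */
proc means data=work.demo n mean;
  class group;
  var x;
run;

/* CLASS with MISSING: the missing value is kept as its own level. */
proc means data=work.demo n mean missing;
  class group;
  var x;
run;

/* BY: the missing value forms its own BY group. */
proc sort data=work.demo out=work.demo_sorted;
  by group;
run;

proc means data=work.demo_sorted n mean;
  by group;
  var x;
run;
```

Note that the difference only shows up when the grouping variable actually contains missing values, which may be why it was hard to reproduce.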
03-01-2012 10:41 AM
The SAS doc has a comparison of the two methods:
03-01-2012 02:16 PM
As a general rule, when you are dealing with large data sets (> 1 GB) and there are many distinct values of the class variables, I have often found that SAS processes faster using BY rather than CLASS, even with the SORT time added in. If your data are already sorted in the right order, the benefit is even greater.