Solved: Outputting by sum with proc freq

js5 · Posted 10-18-2024 10:39 AM

Hello,

I have the following code:

proc freq data=inputds noprint;
	by notsorted a b c;
	tables x*y*z/out=freq;
	weight count/zeros;
run;

proc means data=inputds completetypes noprint nway;
	by notsorted a b c;
	var count;
	output out=means(drop=_:) sum=count_sum_c;
run;

data final;
	merge freq means;
	by notsorted a b c;
run;

Is it possible to produce the dataset in one single step? having a notsorted means that the merge is not possible, and due to the fact that the input and freq datasets contain over 90 million rows, I would prefer not to have to sort anything unless absolutely unavoidable. Thank you for your tips in advance.

Tom · Posted 10-21-2024 08:48 AM

So with example data we can begin to get an idea what you are trying to do. Looks like you have grouping variables (in this case 3 of them: Make Drivetrain Type) that uniquely identify the groups. And you have a CLASS variable (in this case: Model) to subdivide the groups. And an analysis variable (in this case MSRP) that you want to SUM.

One obvious savings is to use the output of the first summary (at the BY GROUP* CLASS VARIABLE level) to calculate the BY GROUP sums.

In your example that would be changing the step that makes CARS_MEANS to use CARS_FREQ as the input.

proc summary data=cars_freq ;
  by make drivetrain type;
  var count ;
  output out=cars_means(drop=_:) sum=MSRP_sum_carstate;
run;

It just doesn't attach the DOLLAR format to the variable.

Here is how you can use one PROC SUMMARY (MEANS and SUMMARY are the same procedure, they just use different defaults for the PRINT/NOPRINT option) and one data step to calculate your COUNT, PERCENT and MSRP_sum_carstate variables.

proc summary data=sashelp.cars chartype ;
  class Make Drivetrain Type Model ;
  types Make*Drivetrain*Type Make*Drivetrain*Type*Model;
  output out=step1 sum(MSRP)=COUNT ;
run;

data want;
  merge step1(where=(not indexc(_type_,'0')))
        step1(where=(indexc(_type_,'0')) drop=Model rename=(COUNT=MSRP_sum_carstate))
  ;
  by Make Drivetrain Type;
  PERCENT=100*(COUNT/MSRP_sum_carstate);
  drop _type_ _freq_ ;
run;

Let's test and see if they make the same values (within 6 decimal places).

proc compare data=want compare=final method=absolute criterion=10E-6;
run;

Results

Variables Summary

Number of Variables in Common: 7.
Number of Variables with Differing Attributes: 2.


Listing of Common Variables with Differing Attributes

Variable  Dataset     Type  Length  Format    Label

COUNT     WORK.WANT   Num        8  DOLLAR8.
          WORK.FINAL  Num        8            Frequency Count
PERCENT   WORK.WANT   Num        8
          WORK.FINAL  Num        8            Percent of Total Frequency


Observation Summary

Observation      Base  Compare

First Obs           1        1
Last  Obs         428      428

Number of Observations in Common: 428.
Total Number of Observations Read from WORK.WANT: 428.
Total Number of Observations Read from WORK.FINAL: 428.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 428.


Values Comparison Summary

Number of Variables Compared with All Observations Equal: 7.
Number of Variables Compared with Some Observations Unequal: 0.
Total Number of Values which Compare Unequal: 0.
Total Number of Values not EXACTLY Equal: 141.
Maximum Difference: 1.4211E-14.

View solution in original post

Tom · Posted 10-18-2024 11:10 AM

Without data or even a description of what you are trying to do it is going to be hard.

Why not just let PROC MEANS add up the WEIGHT variable for you (especially since you want to include the zero weight observations)?

proc means data=inputds completetypes noprint nway;
  class a b c;
  output out=means(drop=_:) n=nobs sum(count)=count_sum_c;
run;

js5 · Posted 10-21-2024 02:42 AM

This lacks the frequencies per a*b combination unfortunately

Tom · Posted 10-21-2024 08:12 AM

@js5 wrote:

This lacks the frequencies per a*b combination unfortunately

Again, without actual data or an explanation I have no idea what that means.

js5 · Posted 10-21-2024 08:16 AM

Does https://communities.sas.com/t5/SAS-Procedures/Outputting-by-sum-with-proc-freq/m-p/948331#M83716 help?

ballardw · Posted 10-18-2024 11:16 AM

You might describe what you are wanting as output and do you actually need a data set? If the data set is not used for anything other than reporting it may be that a report procedure is the way to go.

And providing data in the form of a working data step usually helps.

Provide a subset and what you expect as a final result for that subset of data.

I am a bit concerned that you may not be getting what you want as output to begin with by using not sorted.

IF two data sets have the same number of observations you might try as without a BY then SAS does a line by line merge regardless of values. The output of the two data sets may be in the same order because of the way SAS process notsorted data.

data final;
	merge freq means;
run;

js5 · Posted 10-21-2024 02:38 AM

I do need a dataset as output. The two datasets do not have the same number of observations unfortunately. Here is a simplified example using sashelp.cars:

proc sort data=sashelp.cars out=cars_sorted;
by Make Drivetrain Type;
run;

proc freq data=cars_sorted noprint;
	by Make Drivetrain Type;
	tables Model/out=cars_freq;
	weight MSRP/zeros;
run;

proc means data=cars_sorted completetypes noprint nway;
	by Make Drivetrain Type;
	var MSRP;
	output out=cars_means(drop=_:) sum=MSRP_sum_carstate;
run;

data final;
	merge cars_freq cars_means;
	by Make Drivetrain Type;
run;

I am in the end interested in frequencies per by group (COUNT) as well as the denominator used to calculate them (MSRP_sum_carstate). Sorting before turned out not to be that resource-intensive after all, but using one step instead of two would still be cleaner.

Tom · Posted 10-21-2024 08:48 AM

So with example data we can begin to get an idea what you are trying to do. Looks like you have grouping variables (in this case 3 of them: Make Drivetrain Type) that uniquely identify the groups. And you have a CLASS variable (in this case: Model) to subdivide the groups. And an analysis variable (in this case MSRP) that you want to SUM.

One obvious savings is to use the output of the first summary (at the BY GROUP* CLASS VARIABLE level) to calculate the BY GROUP sums.

In your example that would be changing the step that makes CARS_MEANS to use CARS_FREQ as the input.

proc summary data=cars_freq ;
  by make drivetrain type;
  var count ;
  output out=cars_means(drop=_:) sum=MSRP_sum_carstate;
run;

It just doesn't attach the DOLLAR format to the variable.

Here is how you can use one PROC SUMMARY (MEANS and SUMMARY are the same procedure, they just use different defaults for the PRINT/NOPRINT option) and one data step to calculate your COUNT, PERCENT and MSRP_sum_carstate variables.

proc summary data=sashelp.cars chartype ;
  class Make Drivetrain Type Model ;
  types Make*Drivetrain*Type Make*Drivetrain*Type*Model;
  output out=step1 sum(MSRP)=COUNT ;
run;

data want;
  merge step1(where=(not indexc(_type_,'0')))
        step1(where=(indexc(_type_,'0')) drop=Model rename=(COUNT=MSRP_sum_carstate))
  ;
  by Make Drivetrain Type;
  PERCENT=100*(COUNT/MSRP_sum_carstate);
  drop _type_ _freq_ ;
run;

Let's test and see if they make the same values (within 6 decimal places).

proc compare data=want compare=final method=absolute criterion=10E-6;
run;

Results

Variables Summary

Number of Variables in Common: 7.
Number of Variables with Differing Attributes: 2.


Listing of Common Variables with Differing Attributes

Variable  Dataset     Type  Length  Format    Label

COUNT     WORK.WANT   Num        8  DOLLAR8.
          WORK.FINAL  Num        8            Frequency Count
PERCENT   WORK.WANT   Num        8
          WORK.FINAL  Num        8            Percent of Total Frequency


Observation Summary

Observation      Base  Compare

First Obs           1        1
Last  Obs         428      428

Number of Observations in Common: 428.
Total Number of Observations Read from WORK.WANT: 428.
Total Number of Observations Read from WORK.FINAL: 428.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 428.


Values Comparison Summary

Number of Variables Compared with All Observations Equal: 7.
Number of Variables Compared with Some Observations Unequal: 0.
Total Number of Values which Compare Unequal: 0.
Total Number of Values not EXACTLY Equal: 141.
Maximum Difference: 1.4211E-14.

Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Re: Outputting by sum with proc freq

Registration is open

SAS Training: Just a Click Away