Hello,
I have the following code:
proc freq data=inputds noprint;
by notsorted a b c;
tables x*y*z/out=freq;
weight count/zeros;
run;
proc means data=inputds completetypes noprint nway;
by notsorted a b c;
var count;
output out=means(drop=_:) sum=count_sum_c;
run;
data final;
merge freq means;
by notsorted a b c;
run;
Is it possible to produce the dataset in one single step? having a notsorted means that the merge is not possible, and due to the fact that the input and freq datasets contain over 90 million rows, I would prefer not to have to sort anything unless absolutely unavoidable. Thank you for your tips in advance.
So with example data we can begin to get an idea what you are trying to do. Looks like you have grouping variables (in this case 3 of them: Make Drivetrain Type) that uniquely identify the groups. And you have a CLASS variable (in this case: Model) to subdivide the groups. And an analysis variable (in this case MSRP) that you want to SUM.
One obvious savings is to use the output of the first summary (at the BY GROUP* CLASS VARIABLE level) to calculate the BY GROUP sums.
In your example that would be changing the step that makes CARS_MEANS to use CARS_FREQ as the input.
proc summary data=cars_freq ;
by make drivetrain type;
var count ;
output out=cars_means(drop=_:) sum=MSRP_sum_carstate;
run;
It just doesn't attach the DOLLAR format to the variable.
Here is how you can use one PROC SUMMARY (MEANS and SUMMARY are the same procedure, they just use different defaults for the PRINT/NOPRINT option) and one data step to calculate your COUNT, PERCENT and MSRP_sum_carstate variables.
proc summary data=sashelp.cars chartype ;
class Make Drivetrain Type Model ;
types Make*Drivetrain*Type Make*Drivetrain*Type*Model;
output out=step1 sum(MSRP)=COUNT ;
run;
data want;
merge step1(where=(not indexc(_type_,'0')))
step1(where=(indexc(_type_,'0')) drop=Model rename=(COUNT=MSRP_sum_carstate))
;
by Make Drivetrain Type;
PERCENT=100*(COUNT/MSRP_sum_carstate);
drop _type_ _freq_ ;
run;
Let's test and see if they make the same values (within 6 decimal places).
proc compare data=want compare=final method=absolute criterion=10E-6;
run;
Results
Variables Summary Number of Variables in Common: 7. Number of Variables with Differing Attributes: 2. Listing of Common Variables with Differing Attributes Variable Dataset Type Length Format Label COUNT WORK.WANT Num 8 DOLLAR8. WORK.FINAL Num 8 Frequency Count PERCENT WORK.WANT Num 8 WORK.FINAL Num 8 Percent of Total Frequency Observation Summary Observation Base Compare First Obs 1 1 Last Obs 428 428 Number of Observations in Common: 428. Total Number of Observations Read from WORK.WANT: 428. Total Number of Observations Read from WORK.FINAL: 428. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 428. Values Comparison Summary Number of Variables Compared with All Observations Equal: 7. Number of Variables Compared with Some Observations Unequal: 0. Total Number of Values which Compare Unequal: 0. Total Number of Values not EXACTLY Equal: 141. Maximum Difference: 1.4211E-14.
Without data or even a description of what you are trying to do it is going to be hard.
Why not just let PROC MEANS add up the WEIGHT variable for you (especially since you want to include the zero weight observations)?
proc means data=inputds completetypes noprint nway;
class a b c;
output out=means(drop=_:) n=nobs sum(count)=count_sum_c;
run;
This lacks the frequencies per a*b combination unfortunately
@js5 wrote:
This lacks the frequencies per a*b combination unfortunately
Again, without actual data or an explanation I have no idea what that means.
You might describe what you are wanting as output and do you actually need a data set? If the data set is not used for anything other than reporting it may be that a report procedure is the way to go.
And providing data in the form of a working data step usually helps.
Provide a subset and what you expect as a final result for that subset of data.
I am a bit concerned that you may not be getting what you want as output to begin with by using not sorted.
IF two data sets have the same number of observations you might try as without a BY then SAS does a line by line merge regardless of values. The output of the two data sets may be in the same order because of the way SAS process notsorted data.
data final;
merge freq means;
run;
I do need a dataset as output. The two datasets do not have the same number of observations unfortunately. Here is a simplified example using sashelp.cars:
proc sort data=sashelp.cars out=cars_sorted;
by Make Drivetrain Type;
run;
proc freq data=cars_sorted noprint;
by Make Drivetrain Type;
tables Model/out=cars_freq;
weight MSRP/zeros;
run;
proc means data=cars_sorted completetypes noprint nway;
by Make Drivetrain Type;
var MSRP;
output out=cars_means(drop=_:) sum=MSRP_sum_carstate;
run;
data final;
merge cars_freq cars_means;
by Make Drivetrain Type;
run;
I am in the end interested in frequencies per by group (COUNT) as well as the denominator used to calculate them (MSRP_sum_carstate). Sorting before turned out not to be that resource-intensive after all, but using one step instead of two would still be cleaner.
So with example data we can begin to get an idea what you are trying to do. Looks like you have grouping variables (in this case 3 of them: Make Drivetrain Type) that uniquely identify the groups. And you have a CLASS variable (in this case: Model) to subdivide the groups. And an analysis variable (in this case MSRP) that you want to SUM.
One obvious savings is to use the output of the first summary (at the BY GROUP* CLASS VARIABLE level) to calculate the BY GROUP sums.
In your example that would be changing the step that makes CARS_MEANS to use CARS_FREQ as the input.
proc summary data=cars_freq ;
by make drivetrain type;
var count ;
output out=cars_means(drop=_:) sum=MSRP_sum_carstate;
run;
It just doesn't attach the DOLLAR format to the variable.
Here is how you can use one PROC SUMMARY (MEANS and SUMMARY are the same procedure, they just use different defaults for the PRINT/NOPRINT option) and one data step to calculate your COUNT, PERCENT and MSRP_sum_carstate variables.
proc summary data=sashelp.cars chartype ;
class Make Drivetrain Type Model ;
types Make*Drivetrain*Type Make*Drivetrain*Type*Model;
output out=step1 sum(MSRP)=COUNT ;
run;
data want;
merge step1(where=(not indexc(_type_,'0')))
step1(where=(indexc(_type_,'0')) drop=Model rename=(COUNT=MSRP_sum_carstate))
;
by Make Drivetrain Type;
PERCENT=100*(COUNT/MSRP_sum_carstate);
drop _type_ _freq_ ;
run;
Let's test and see if they make the same values (within 6 decimal places).
proc compare data=want compare=final method=absolute criterion=10E-6;
run;
Results
Variables Summary Number of Variables in Common: 7. Number of Variables with Differing Attributes: 2. Listing of Common Variables with Differing Attributes Variable Dataset Type Length Format Label COUNT WORK.WANT Num 8 DOLLAR8. WORK.FINAL Num 8 Frequency Count PERCENT WORK.WANT Num 8 WORK.FINAL Num 8 Percent of Total Frequency Observation Summary Observation Base Compare First Obs 1 1 Last Obs 428 428 Number of Observations in Common: 428. Total Number of Observations Read from WORK.WANT: 428. Total Number of Observations Read from WORK.FINAL: 428. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 428. Values Comparison Summary Number of Variables Compared with All Observations Equal: 7. Number of Variables Compared with Some Observations Unequal: 0. Total Number of Values which Compare Unequal: 0. Total Number of Values not EXACTLY Equal: 141. Maximum Difference: 1.4211E-14.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.