Hello All,
I have a input SAS dataset say insasds with 6 Million records and have 16 variables (1 variable in them has dollars and is summed up), say var1 to var16 (where var16 has the dollars).
I have two summarizations running in my SAS program as shown below.
PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is sorted by the class variables (var1 to var10)
BY var1 var2 var3 var4 var5 var6...... var10;
VAR var16;
OUTPUT OUT=TEMP1 SUM=;
and another summarization
PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is again sorted here by the class variables (var11 to var15)
BY var11 var12 var13 var14 var15;
VAR var16;
OUTPUT OUT=TEMP2 SUM=;
Now my issue is when I use the PROC TABULATE procedure to generate the total dollars value, what I am seeing is the total dollars value in TEMP1 and TEMP2 is not equal.
PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;
TITLE "Total Dollars"
VAR var16;
TABLE SUM*var16;
RUN; TITLE;
My understanding of the SUMMARY procedure is it makes the arrangement based on the BY/CLASS variables but the final dollar value would be same. The program has been running fine now for almost 3 years now and the two summarizations used to yield the same total dollars. From the last couple of runs, I am seeing that tabulate procedure for TEMP1 produces some $X and TEMP2 gives $Y where $X and $Y are not same.
Could anyone please advise from where I need to start looking at the problem?
Thanks in Advance!!
Regards,
HK
Can you post a sample dataset that results with the discrepancy you mentioned? Also, post the code you used to sort the data for each analysis, as that could explain the discrepancy.
Hi:
And what you've posted for code doesn't make sense, to me. You call your variables VAR1-VAR10 and VAR11-VAR15 CLASS variables, but you ONLY list them in a BY statement, not a CLASS statement. That doesn't make sense. PROC MEANS would give you different results with a BY versus a CLASS. Consider the following code and the differences between using just the BY and using a CLASS statement. With the BY statement, we just get 1 observation in the output. With the CLASS statement, you get 1 obs for every unique combination of the CLASS variables.
Since you say you have millions of obs, I'm not sure why you are taking this approach anyway (doing a MEANS and then following it with a TABULATE). Especially since TABULATE is capable of generating a summary for your CLASS variables directly without the MEANS step. I'd recommend that you work with Tech Support since you probably can't post millions of records.
cynthia
proc means data=sashelp.class nway missing;
by age sex;
var weight;
output out=work.mns1 sum=;
run;
proc print data=work.mns1;
title 'after proc means 1';
run;
proc means data=sashelp.class nway missing;
class age sex;
var weight;
output out=work.mns2 sum=;
run;
proc print data=work.mns2;
title 'after proc means 2';
run;
@Cynthia: Either it is me, or the fact that it is Friday, but I disagree. You can't run your first set of code without error since sashelp.class isn't sorted according to the by variables.
If one first uses:
proc sort data=sashelp.class out=class;
by age sex;
run;
and then uses class as the input data for the two means runs, the only difference (I think) will be in the value assigned to _type_.
Ah, Art:
You are absolutely correct, _TYPE_ is the only difference. Rookie mistake, I did not check my log & I did not sort and so my print was displaying the results of an entirely different MEANS step (which is a reason for not using the same name over and over!).
I still don't understand the need to have 2 separate PROC MEANS steps prior to TABULATE. I think of TABULATE as being the "GINSU" steak knife of SAS and being able to slice and dice data with great efficiency.
cynthia
@Cynthia: Let's agree to blame it on being a Friday .. you are definitely NOT a rookie! I'd still like to know why the OP got different results. Many, including me, use means and summary, rather than tabulate, as we find it easier to understand. Someday I really should learn the procedure (along with the other gazillion procedures I don't currently use because they weren't instinctively and almost immediately comprehensible).
Hi Arthur,
PROC SORT DATA=insasds out=insasds.sorted;
BY var1 var2 var3 var4 var5 var6.... var15;
PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is sorted by the class variables (var1 to var10)
BY var1 var2 var3 var4 var5 var6...... var10;
VAR var16;
OUTPUT OUT=TEMP1 SUM=;
PROC SUMMARY NWAY MISSING DATA=insasds.sorted;
BY var11 var12 var13 var14 var15;
VAR var16;
OUTPUT OUT=TEMP2 SUM=;
PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;
TITLE "Total Dollars"
VAR var16;
TABLE SUM*var16;
RUN; TITLE;
PROC TABULATE NOSEPS MISSING DATA=TEMP2 FORMAT=COMMA14.0;
TITLE "Total Dollars"
VAR var16;
TABLE SUM*var16;
RUN; TITLE;
This is my code.
Today, I have done an extensive testing by varying the count of observations and here are my observations:
1. Until 3 Million records (out of total universe of 6138735 records), the total var16 dollar value is same in TEMP1 and TEMP2.
2. With 4 Million records, the first difference is seen. the total var16 dollar value is not consisent across TEMP1, TEMP2 SAS Datasets.
3. With the whole universe of 6 Million+ records, the difference increases
I am trying to replace PROC SUMMARY with PROC SQL and use group by and then sum the variable to see if that validates.
Another thing I observed is when I produced a tabular report on insasds.sorted, TEMP1 and TEMP2 for var16. here is what I see.
On insasds.sorted ==> Sum of var16 ==> 77,663,588
On TEMP1 ==> Sum of var16 ==> 77,650,245
On TEMP2 ==> Sum of var16 ==> 77,645,951
Again for 3 Million record obs,
On insasds.sorted ==> Sum of var16 ==> 52,005,676
On TEMP1 ==> Sum of var16 ==> 52,005,676
On TEMP2 ==> Sum of var16 ==> 52,005,676
Please let me know what else details you would like me to elaborate here? I am unable to post a sample dataset because the universe is huge and for small count of observations till 3 Million, the PROC SUMMARY seems to work fine and the total dollars are consistent across input, TEMP1, TEMP2 output files.
Could it indicate rounding off the dollar values (dollars with decimal values) that is creating the difference?
Thank You!!
HK
HK: Your problem is in your proc sort! The one proc sort won't work for both proc summaries.
Try it by either substituting class instead of by, thus not requiring a sorted file or, on the other hand, separate proc sorts. In either case, you can reduce the code by specifying variable lists. e.g.:
PROC SORT DATA=insasds out=insasds.sorted;
BY var1-var10;
run;
PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is sorted by the class variables (var1 to var10)
BY var1-var10;
VAR var16;
OUTPUT OUT=TEMP1 SUM=;
run;
PROC SORT DATA=insasds out=insasds.sorted;
BY var11-var15;
run;
PROC SUMMARY NWAY MISSING DATA=insasds.sorted;
BY var11-var15;
VAR var16;
OUTPUT OUT=TEMP2 SUM=;
run;
Let us know if that fixes the discrepancy.
Art
Hi Arthur,
Thanks for the suggestion, I replaced BY with CLASS and removed sort. The result is even more weird now :-(... I am beginning to suspect that it could be a case of rounding issue.
PROC SUMMARY NWAY MISSING DATA=insasds;
CLASS var1 var2 var3 var4.....var10;
VAR var16;
OUTPUT OUT=TEMP1 SUM=;
run;
PROC SUMMARY NWAY MISSING DATA=insasds;
CLASS var11 var12 var13 var14 var15;
VAR var16;
OUTPUT OUT=TEMP2 SUM=;
run;
Is my understanding correct? Will the grand total dollars of var16 in insasds, TEMP1 and TEMP2 be same irrespective of the way they are summarized?
Current Results:
PROC TABULATE NOSEPS MISSING DATA=insasds FORMAT=COMMA14.0;
TITLE "Total Dollars"
VAR var16;
TABLE SUM*var16;
RUN; TITLE;
PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;
TITLE "Total Dollars"
VAR var16;
TABLE SUM*var16;
RUN; TITLE;
PROC TABULATE NOSEPS MISSING DATA=TEMP2 FORMAT=COMMA14.0;
TITLE "Total Dollars"
VAR var16;
TABLE SUM*var16;
RUN; TITLE;
Tabular proc for insasds: 77,757,256
Tabular proc for TEMP1: 77,648,952
Tabular proc for TEMP2: 77,676,511
On a couple of occasions, now, you have listed the following as your code:
CLASS var1 var2 var3 var4.....var10;
Is that really how the statement looks? That wouldn't run. You either have to spell the variables out, or use them in a variable list (like var1-var10).
In any case, yes, all of your results should have the same number. I doubt if it is rounding error causing that large of a discrepancy.
Can you post copies of temp1 and temp2?
Hi Arthur,
Apologies for creating a confusion! Yes, I am spelling out the VAR1 to VAR10 variables separately.
CLASS var1 var2 var3 var4 var5 var6 var7 var8 var9 var10;
and actually in my program these variables have more meaningful names (not as var1 to var15).
I don't know if I can post the data here. The summary gives 35933 observations in TEMP1 and 23093 observations in TEMP2.
Hi Tom,
Can you please give an example? I am a beginner with the PROC UNIVARIATE procedure. I have used the var statement as shown below.
PROC UNIVARIATE DATA=TEMP1;
VAR var16;
WEIGHT var16;
RUN;
and please note the program has been giving consistent amount summary on TEMP1 and TEMP2 till last 3 years with around 5 Million observations.
Results of PROC UNIVARIATE.
on insasds
Weight: var16 | ||||
Weighted Moments | ||||
N | 6138735 | Sum Weights | 1.0468E22 | |
Mean | 2.21819E21 | Sum Observations | 2.322E43 | |
Std Deviation | 3.78917E28 | Variance | 1.43578E57 | |
Skewness | 199.575324 | Kurtosis | 1537338.08 | |
Uncorrected SS | 6.03203E64 | Corrected SS | 8.81388E63 | |
Coeff Variation | 1708229227 | Std Error Mean | 3.70349E17 |
On TEMP1.
N | 35933 | Sum Weights | 80644396.8 |
Mean | 576248.022 | Sum Observations | 4.64712E13 |
Std Deviation | 50713325.6 | Variance | 2.57184E15 |
Skewness | 113.995947 | Kurtosis | 17152.4143 |
Uncorrected SS | 1.1919E20 | Corrected SS | 9.24114E19 |
Coeff Variation | 8800.60732 | Std Error Mean | 5647.2237 |
ON TEMP2
N | 23093 | Sum Weights | 84646161.4 |
Mean | 637189.816 | Sum Observations | 5.39357E13 |
Std Deviation | 48992853 | Variance | 2.4003E15 |
Skewness | 42.4383756 | Kurtosis | 2982.60987 |
Uncorrected SS | 8.9795E19 | Corrected SS | 5.54277E19 |
Coeff Variation | 7688.89455 | Std Error Mean | 5325.11629 |
I have to wonder whether your sum is a larger number than your computer can handle. Obviously, not knowing what your data look like, it is difficult to say and (I would think) highly unusual given the scope of the numbers you have reported. However, if even just one of your newer records had an absurdly high number, that could be the problem.
Can you re-run the analyses, but first reducing var 16 by dividing it by 1,000?
Also, why did you use a weight statement in your proc univariate? It would have been more useful if you hadn't done that.
I, for one, would be most interested in seeing your extreme data points. Can you run something like the following and post the results:
ods select ExtremeObs;
proc univariate data=yourfilename;
var var16;
run;
maybe he used weight because it is a sampled dataset.
you have to correct for that.
why else would you use it.
He hasn't used it before, which is why I asked. Why else would one use it? One possibility is that they see it in the documentation, don't understand its ramifications, and simply include it as part of their code .. however wrong that may be!
Hi Arthur,
As I read again and again, I am now getting to understand what the WEIGHT statement is used for! That was completely not needed in my case.
Thanks for putting me in the right direction!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.