Re: proc summary procedure giving difference in sum when summarized by...

HK_EndeavourForever · Posted 08-02-2012 11:12 PM

Hello All,

I have a input SAS dataset say insasds with 6 Million records and have 16 variables (1 variable in them has dollars and is summed up), say var1 to var16 (where var16 has the dollars).

I have two summarizations running in my SAS program as shown below.

PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is sorted by the class variables (var1 to var10)

BY var1 var2 var3 var4 var5 var6...... var10;

VAR var16;

OUTPUT OUT=TEMP1 SUM=;

and another summarization

PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is again sorted here by the class variables (var11 to var15)

BY var11 var12 var13 var14 var15;

VAR var16;

OUTPUT OUT=TEMP2 SUM=;

Now my issue is when I use the PROC TABULATE procedure to generate the total dollars value, what I am seeing is the total dollars value in TEMP1 and TEMP2 is not equal.

PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;

TITLE "Total Dollars"

VAR var16;

TABLE SUM*var16;

RUN; TITLE;

My understanding of the SUMMARY procedure is it makes the arrangement based on the BY/CLASS variables but the final dollar value would be same. The program has been running fine now for almost 3 years now and the two summarizations used to yield the same total dollars. From the last couple of runs, I am seeing that tabulate procedure for TEMP1 produces some $X and TEMP2 gives $Y where $X and $Y are not same.

Could anyone please advise from where I need to start looking at the problem?

Thanks in Advance!!

Regards,

HK

art297 · Posted 08-02-2012 11:36 PM

Can you post a sample dataset that results with the discrepancy you mentioned? Also, post the code you used to sort the data for each analysis, as that could explain the discrepancy.

Cynthia_sas · Posted 08-03-2012 10:12 AM

Hi:

And what you've posted for code doesn't make sense, to me. You call your variables VAR1-VAR10 and VAR11-VAR15 CLASS variables, but you ONLY list them in a BY statement, not a CLASS statement. That doesn't make sense. PROC MEANS would give you different results with a BY versus a CLASS. Consider the following code and the differences between using just the BY and using a CLASS statement. With the BY statement, we just get 1 observation in the output. With the CLASS statement, you get 1 obs for every unique combination of the CLASS variables.

Since you say you have millions of obs, I'm not sure why you are taking this approach anyway (doing a MEANS and then following it with a TABULATE). Especially since TABULATE is capable of generating a summary for your CLASS variables directly without the MEANS step. I'd recommend that you work with Tech Support since you probably can't post millions of records.

cynthia

proc means data=sashelp.class nway missing;

by age sex;

var weight;

output out=work.mns1 sum=;

run;

proc print data=work.mns1;

title 'after proc means 1';

run;

proc means data=sashelp.class nway missing;

class age sex;

var weight;

output out=work.mns2 sum=;

run;

proc print data=work.mns2;

title 'after proc means 2';

run;

art297 · Posted 08-03-2012 11:01 AM

@Cynthia: Either it is me, or the fact that it is Friday, but I disagree. You can't run your first set of code without error since sashelp.class isn't sorted according to the by variables.

If one first uses:

proc sort data=sashelp.class out=class;

by age sex;

run;

and then uses class as the input data for the two means runs, the only difference (I think) will be in the value assigned to _type_.

Cynthia_sas · Posted 08-03-2012 11:41 AM

Ah, Art:

You are absolutely correct, _TYPE_ is the only difference. Rookie mistake, I did not check my log & I did not sort and so my print was displaying the results of an entirely different MEANS step (which is a reason for not using the same name over and over!).

I still don't understand the need to have 2 separate PROC MEANS steps prior to TABULATE. I think of TABULATE as being the "GINSU" steak knife of SAS and being able to slice and dice data with great efficiency.

cynthia

art297 · Posted 08-03-2012 11:50 AM

@Cynthia: Let's agree to blame it on being a Friday .. you are definitely NOT a rookie! I'd still like to know why the OP got different results. Many, including me, use means and summary, rather than tabulate, as we find it easier to understand. Someday I really should learn the procedure (along with the other gazillion procedures I don't currently use because they weren't instinctively and almost immediately comprehensible).

HK_EndeavourForever · Posted 08-03-2012 06:24 PM

Hi Arthur,

PROC SORT DATA=insasds out=insasds.sorted;

BY var1 var2 var3 var4 var5 var6.... var15;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is sorted by the class variables (var1 to var10)

BY var1 var2 var3 var4 var5 var6...... var10;

VAR var16;

OUTPUT OUT=TEMP1 SUM=;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted;

BY var11 var12 var13 var14 var15;

VAR var16;

OUTPUT OUT=TEMP2 SUM=;

PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;

TITLE "Total Dollars"

VAR var16;

TABLE SUM*var16;

RUN; TITLE;

PROC TABULATE NOSEPS MISSING DATA=TEMP2 FORMAT=COMMA14.0;

TITLE "Total Dollars"

VAR var16;

TABLE SUM*var16;

RUN; TITLE;

This is my code.

Today, I have done an extensive testing by varying the count of observations and here are my observations:

1. Until 3 Million records (out of total universe of 6138735 records), the total var16 dollar value is same in TEMP1 and TEMP2.

2. With 4 Million records, the first difference is seen. the total var16 dollar value is not consisent across TEMP1, TEMP2 SAS Datasets.

3. With the whole universe of 6 Million+ records, the difference increases

I am trying to replace PROC SUMMARY with PROC SQL and use group by and then sum the variable to see if that validates.

Another thing I observed is when I produced a tabular report on insasds.sorted, TEMP1 and TEMP2 for var16. here is what I see.

On insasds.sorted ==> Sum of var16 ==> 77,663,588

On TEMP1 ==> Sum of var16 ==> 77,650,245

On TEMP2 ==> Sum of var16 ==> 77,645,951

Again for 3 Million record obs,

On insasds.sorted ==> Sum of var16 ==> 52,005,676

On TEMP1 ==> Sum of var16 ==> 52,005,676

On TEMP2 ==> Sum of var16 ==> 52,005,676

Please let me know what else details you would like me to elaborate here? I am unable to post a sample dataset because the universe is huge and for small count of observations till 3 Million, the PROC SUMMARY seems to work fine and the total dollars are consistent across input, TEMP1, TEMP2 output files.

Could it indicate rounding off the dollar values (dollars with decimal values) that is creating the difference?

Thank You!!

HK

art297 · Posted 08-03-2012 06:37 PM

HK: Your problem is in your proc sort! The one proc sort won't work for both proc summaries.

Try it by either substituting class instead of by, thus not requiring a sorted file or, on the other hand, separate proc sorts. In either case, you can reduce the code by specifying variable lists. e.g.:

PROC SORT DATA=insasds out=insasds.sorted;

BY var1-var10;

run;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted; ==> The insasds is sorted by the class variables (var1 to var10)

BY var1-var10;

VAR var16;

OUTPUT OUT=TEMP1 SUM=;

run;

PROC SORT DATA=insasds out=insasds.sorted;

BY var11-var15;

run;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted;

BY var11-var15;

VAR var16;

OUTPUT OUT=TEMP2 SUM=;

run;

Let us know if that fixes the discrepancy.

Art

HK_EndeavourForever · Posted 08-04-2012 12:53 AM

Hi Arthur,

Thanks for the suggestion, I replaced BY with CLASS and removed sort. The result is even more weird now :-(... I am beginning to suspect that it could be a case of rounding issue.

PROC SUMMARY NWAY MISSING DATA=insasds;

CLASS var1 var2 var3 var4.....var10;

VAR var16;

OUTPUT OUT=TEMP1 SUM=;

run;

PROC SUMMARY NWAY MISSING DATA=insasds;

CLASS var11 var12 var13 var14 var15;

VAR var16;

OUTPUT OUT=TEMP2 SUM=;

run;

Is my understanding correct? Will the grand total dollars of var16 in insasds, TEMP1 and TEMP2 be same irrespective of the way they are summarized?

Current Results:

PROC TABULATE NOSEPS MISSING DATA=insasds FORMAT=COMMA14.0;

TITLE "Total Dollars"

VAR var16;

TABLE SUM*var16;

RUN; TITLE;

PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;

TITLE "Total Dollars"

VAR var16;

TABLE SUM*var16;

RUN; TITLE;

PROC TABULATE NOSEPS MISSING DATA=TEMP2 FORMAT=COMMA14.0;

TITLE "Total Dollars"

VAR var16;

TABLE SUM*var16;

RUN; TITLE;

Tabular proc for insasds: 77,757,256

Tabular proc for TEMP1: 77,648,952

Tabular proc for TEMP2: 77,676,511

art297 · Posted 08-04-2012 11:34 AM

On a couple of occasions, now, you have listed the following as your code:

CLASS var1 var2 var3 var4.....var10;

Is that really how the statement looks? That wouldn't run. You either have to spell the variables out, or use them in a variable list (like var1-var10).

In any case, yes, all of your results should have the same number. I doubt if it is rounding error causing that large of a discrepancy.

Can you post copies of temp1 and temp2?

HK_EndeavourForever · Posted 08-04-2012 05:20 PM

Hi Arthur,

Apologies for creating a confusion! Yes, I am spelling out the VAR1 to VAR10 variables separately.

CLASS var1 var2 var3 var4 var5 var6 var7 var8 var9 var10;

and actually in my program these variables have more meaningful names (not as var1 to var15).

I don't know if I can post the data here. The summary gives 35933 observations in TEMP1 and 23093 observations in TEMP2.

Hi Tom,

Can you please give an example? I am a beginner with the PROC UNIVARIATE procedure. I have used the var statement as shown below.

PROC UNIVARIATE DATA=TEMP1;

VAR var16;

WEIGHT var16;

RUN;

and please note the program has been giving consistent amount summary on TEMP1 and TEMP2 till last 3 years with around 5 Million observations.

Results of PROC UNIVARIATE.

on insasds

	Weight: var16
	Weighted Moments
N	6138735	Sum Weights	1.0468E22
Mean	2.21819E21	Sum Observations	2.322E43
Std Deviation	3.78917E28	Variance	1.43578E57
Skewness	199.575324	Kurtosis	1537338.08
Uncorrected SS	6.03203E64	Corrected SS	8.81388E63
Coeff Variation	1708229227	Std Error Mean	3.70349E17

On TEMP1.

N	35933	Sum Weights	80644396.8
Mean	576248.022	Sum Observations	4.64712E13
Std Deviation	50713325.6	Variance	2.57184E15
Skewness	113.995947	Kurtosis	17152.4143
Uncorrected SS	1.1919E20	Corrected SS	9.24114E19
Coeff Variation	8800.60732	Std Error Mean	5647.2237

ON TEMP2

N	23093	Sum Weights	84646161.4
Mean	637189.816	Sum Observations	5.39357E13
Std Deviation	48992853	Variance	2.4003E15
Skewness	42.4383756	Kurtosis	2982.60987
Uncorrected SS	8.9795E19	Corrected SS	5.54277E19
Coeff Variation	7688.89455	Std Error Mean	5325.11629

art297 · Posted 08-04-2012 07:19 PM

I have to wonder whether your sum is a larger number than your computer can handle. Obviously, not knowing what your data look like, it is difficult to say and (I would think) highly unusual given the scope of the numbers you have reported. However, if even just one of your newer records had an absurdly high number, that could be the problem.

Can you re-run the analyses, but first reducing var 16 by dividing it by 1,000?

Also, why did you use a weight statement in your proc univariate? It would have been more useful if you hadn't done that.

I, for one, would be most interested in seeing your extreme data points. Can you run something like the following and post the results:

ods select ExtremeObs;

proc univariate data=yourfilename;

var var16;

run;

econ · Posted 08-04-2012 07:37 PM

maybe he used weight because it is a sampled dataset.

you have to correct for that.

why else would you use it.

art297 · Posted 08-04-2012 08:03 PM

He hasn't used it before, which is why I asked. Why else would one use it? One possibility is that they see it in the documentation, don't understand its ramifications, and simply include it as part of their code .. however wrong that may be!

HK_EndeavourForever · Posted 08-06-2012 12:14 AM

Hi Arthur,

As I read again and again, I am now getting to understand what the WEIGHT statement is used for! That was completely not needed in my case.

Thanks for putting me in the right direction!

proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Re: proc summary procedure giving difference in sum when summarized by different classes

Registration is open

SAS Training: Just a Click Away