BookmarkSubscribeRSS Feed
HK_EndeavourForever
Fluorite | Level 6

Hello All,

I have a input SAS dataset say insasds with 6 Million records and have 16 variables (1 variable in them has dollars and is summed up), say var1 to var16 (where var16 has the dollars).

I have two summarizations running in my SAS program as shown below.

PROC SUMMARY NWAY MISSING DATA=insasds.sorted;  ==> The insasds is sorted by the class variables (var1 to var10)

BY var1 var2 var3 var4 var5 var6...... var10;

VAR var16;

OUTPUT OUT=TEMP1 SUM=;

and another summarization

PROC SUMMARY NWAY MISSING DATA=insasds.sorted;  ==> The insasds is again sorted here by the class variables (var11 to var15)

BY var11 var12 var13 var14 var15;

VAR var16;

OUTPUT OUT=TEMP2 SUM=;

Now my issue is when I use the PROC TABULATE procedure to generate the total dollars value, what I am seeing is the total dollars value in TEMP1 and TEMP2 is not equal.

PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;

  TITLE "Total Dollars"

   VAR var16;     

   TABLE SUM*var16;

   RUN; TITLE;      

My understanding of the SUMMARY procedure is it makes the arrangement based on the BY/CLASS variables but the final dollar value would be same. The program has been running fine now for almost 3 years now and the two summarizations used to yield the same total dollars. From the last couple of runs, I am seeing that tabulate procedure for TEMP1 produces some $X and TEMP2 gives $Y where $X and $Y are not same.

Could anyone please advise from where I need to start looking at the problem?

Thanks in Advance!!

Regards,

HK

25 REPLIES 25
art297
Opal | Level 21

Can you post a sample dataset that results with the discrepancy you mentioned?  Also, post the code you used to sort the data for each analysis, as that could explain the discrepancy.

Cynthia_sas
SAS Super FREQ

Hi:

  And what you've posted for code doesn't make sense, to me. You call your variables VAR1-VAR10 and VAR11-VAR15 CLASS variables, but you ONLY list them in a BY statement, not a CLASS statement. That doesn't make sense. PROC MEANS would give you different results with a BY versus a CLASS. Consider the following code and the differences between using just the BY and using a CLASS statement. With the BY statement, we just get 1 observation in the output. With the CLASS statement, you get 1 obs for every unique combination of the CLASS variables.

  Since you say you have millions of obs, I'm not sure why you are taking this approach anyway (doing a MEANS and then following it with a TABULATE).  Especially since TABULATE is capable of generating a summary for your CLASS variables directly without the MEANS step. I'd recommend that you work with Tech Support since you probably can't post millions of records.
 

cynthia

  

proc means data=sashelp.class nway missing;

  by age sex;

  var weight;

  output out=work.mns1 sum=;

run;

 

proc print data=work.mns1;

  title 'after proc means 1';

run;

 

proc means data=sashelp.class nway missing;

  class age sex;

  var weight;

  output out=work.mns2 sum=;

run;

 

proc print data=work.mns2;

  title 'after proc means 2';

run;

art297
Opal | Level 21

@Cynthia: Either it is me, or the fact that it is Friday, but I disagree.  You can't run your first set of code without error since sashelp.class isn't sorted according to the by variables.

If one first uses:

proc sort data=sashelp.class out=class;

  by age sex;

run;

and then uses class as the input data for the two means runs, the only difference (I think) will be in the value assigned to _type_.

Cynthia_sas
SAS Super FREQ

Ah, Art:

  You are absolutely correct, _TYPE_ is the only difference. Rookie mistake, I did not check my log & I did not sort and so my print was displaying the results of an entirely different MEANS step (which is a reason for not using the same name over and over!).

  I still don't understand the need to have 2 separate PROC MEANS steps prior to TABULATE. I think of TABULATE as being the "GINSU" steak knife of SAS and being able to slice and dice data with great efficiency.

cynthia

art297
Opal | Level 21

@Cynthia: Let's agree to blame it on being a Friday .. you are definitely NOT a rookie!  I'd still like to know why the OP got different results.  Many, including me, use means and summary, rather than tabulate, as we find it easier to understand.  Someday I really should learn the procedure (along with the other gazillion procedures I don't currently use because they weren't instinctively and almost immediately comprehensible).

HK_EndeavourForever
Fluorite | Level 6

Hi Arthur,

PROC SORT DATA=insasds out=insasds.sorted;

BY var1 var2 var3 var4 var5 var6.... var15;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted;  ==> The insasds is sorted by the class variables (var1 to var10)

BY var1 var2 var3 var4 var5 var6...... var10;

VAR var16;

OUTPUT OUT=TEMP1 SUM=;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted; 

BY var11 var12 var13 var14 var15;

VAR var16;

OUTPUT OUT=TEMP2 SUM=;

PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;

  TITLE "Total Dollars"

   VAR var16;     

   TABLE SUM*var16;

   RUN; TITLE;   

PROC TABULATE NOSEPS MISSING DATA=TEMP2 FORMAT=COMMA14.0;

  TITLE "Total Dollars"

   VAR var16;     

   TABLE SUM*var16;

   RUN; TITLE;   

This is my code.

Today, I have done an extensive testing by varying the count of observations and here are my observations:

1.  Until 3 Million records (out of total universe of 6138735 records), the total var16 dollar value is same in TEMP1 and TEMP2.

2.  With 4 Million records, the first difference is seen. the total var16 dollar value is not consisent across TEMP1, TEMP2 SAS Datasets.

3.  With the whole universe of 6 Million+ records, the difference increases

I am trying to replace PROC SUMMARY with PROC SQL and use group by and then sum the variable to see if that validates.

Another thing I observed is when I produced a tabular report on insasds.sorted, TEMP1 and TEMP2 for var16. here is what I see.

On insasds.sorted ==> Sum of var16 ==> 77,663,588

On TEMP1            ==> Sum of var16 ==> 77,650,245

On TEMP2            ==> Sum of var16 ==> 77,645,951

Again for 3 Million record obs,

On insasds.sorted ==> Sum of var16 ==> 52,005,676

On TEMP1            ==> Sum of var16 ==> 52,005,676

On TEMP2            ==> Sum of var16 ==> 52,005,676

Please let me know what else details you would like me to elaborate here? I am unable to post a sample dataset because the universe is huge and for small count of observations till 3 Million, the PROC SUMMARY seems to work fine and the total dollars are consistent across input, TEMP1, TEMP2 output files.

Could it indicate rounding off the dollar values (dollars with decimal values) that is creating the difference?

Thank You!!

HK

art297
Opal | Level 21

HK: Your problem is in your proc sort!  The one proc sort won't work for both proc summaries.

Try it by either substituting class instead of by, thus not requiring a sorted file or, on the other hand, separate proc sorts.  In either case, you can reduce the code by specifying variable lists.  e.g.:

PROC SORT DATA=insasds out=insasds.sorted;

  BY var1-var10;

run;


PROC SUMMARY NWAY MISSING DATA=insasds.sorted;  ==> The insasds is sorted by the class variables (var1 to var10)

  BY var1-var10;

  VAR var16;

  OUTPUT OUT=TEMP1 SUM=;

run;

PROC SORT DATA=insasds out=insasds.sorted;

  BY var11-var15;

run;

PROC SUMMARY NWAY MISSING DATA=insasds.sorted;

  BY var11-var15;

  VAR var16;

  OUTPUT OUT=TEMP2 SUM=;

run;

Let us know if that fixes the discrepancy.

Art

HK_EndeavourForever
Fluorite | Level 6

Hi Arthur,

Thanks for the suggestion, I replaced BY with CLASS and removed sort. The result is even more weird now :-(... I am beginning to suspect that it could be a case of rounding issue.

PROC SUMMARY NWAY MISSING DATA=insasds;

  CLASS var1 var2 var3 var4.....var10;

  VAR var16;

  OUTPUT OUT=TEMP1 SUM=;

run;

PROC SUMMARY NWAY MISSING DATA=insasds;

  CLASS var11 var12 var13 var14 var15;

  VAR var16;

  OUTPUT OUT=TEMP2 SUM=;

run;

Is my understanding correct? Will the grand total dollars of var16 in insasds, TEMP1 and TEMP2 be same irrespective of the way they are summarized?

Current Results:

PROC TABULATE NOSEPS MISSING DATA=insasds FORMAT=COMMA14.0;

  TITLE "Total Dollars"

   VAR var16;    

   TABLE SUM*var16;

   RUN; TITLE;     

PROC TABULATE NOSEPS MISSING DATA=TEMP1 FORMAT=COMMA14.0;

  TITLE "Total Dollars"

   VAR var16;    

   TABLE SUM*var16;

   RUN; TITLE;     

PROC TABULATE NOSEPS MISSING DATA=TEMP2 FORMAT=COMMA14.0;

  TITLE "Total Dollars"

   VAR var16;    

   TABLE SUM*var16;

   RUN; TITLE;     

Tabular proc for insasds: 77,757,256

Tabular proc for TEMP1:  77,648,952

Tabular proc for TEMP2:  77,676,511

art297
Opal | Level 21

On a couple of occasions, now, you have listed the following as your code:

CLASS var1 var2 var3 var4.....var10;


Is that really how the statement looks?  That wouldn't run.  You either have to spell the variables out, or use them in a variable list (like var1-var10).


In any case, yes, all of your results should have the same number.  I doubt if it is rounding error causing that large of a discrepancy.


Can you post copies of temp1 and temp2?


HK_EndeavourForever
Fluorite | Level 6

Hi Arthur,

Apologies for creating a  confusion! Yes, I am spelling out the VAR1 to VAR10 variables separately.

CLASS var1 var2 var3 var4 var5 var6 var7 var8 var9 var10;

and actually in my program these variables have more meaningful names (not as var1 to var15).


I don't know if I can post the data here. The summary gives 35933 observations in TEMP1 and 23093 observations in TEMP2.

Hi Tom,

Can you please give an example? I am a beginner with the PROC UNIVARIATE procedure. I have used the var statement as shown below.


PROC UNIVARIATE DATA=TEMP1;

VAR var16;       

WEIGHT var16;

RUN;                      

and please note the program has been giving consistent amount summary on TEMP1 and TEMP2 till last 3 years with around 5 Million observations.

Results of PROC UNIVARIATE.

on insasds

                    Weight:  var16                     
                   Weighted Moments                         
N                 6138735Sum Weights      1.0468E22 
Mean           2.21819E21Sum Observations  2.322E43 
Std Deviation  3.78917E28Variance        1.43578E57 
Skewness       199.575324Kurtosis        1537338.08 
Uncorrected SS 6.03203E64Corrected SS    8.81388E63 
Coeff Variation1708229227Std Error Mean  3.70349E17 

On TEMP1.

N                   35933Sum Weights     80644396.8
Mean           576248.022Sum Observations4.64712E13
Std Deviation  50713325.6Variance        2.57184E15
Skewness       113.995947Kurtosis        17152.4143
Uncorrected SS  1.1919E20Corrected SS    9.24114E19
Coeff Variation8800.60732Std Error Mean   5647.2237

ON TEMP2

N                   23093Sum Weights     84646161.4
Mean           637189.816Sum Observations5.39357E13
Std Deviation    48992853Variance         2.4003E15
Skewness       42.4383756Kurtosis        2982.60987
Uncorrected SS  8.9795E19Corrected SS    5.54277E19
Coeff Variation7688.89455Std Error Mean  5325.11629
art297
Opal | Level 21

I have to wonder whether your sum is a larger number than your computer can handle.  Obviously, not knowing what your data look like, it is difficult to say and (I would think) highly unusual given the scope of the numbers you have reported.  However, if even just one of your newer records had an absurdly high number, that could be the problem.

Can you re-run the analyses, but first reducing var 16 by dividing it by 1,000?

Also, why did you use a weight statement in your proc univariate?  It would have been more useful if you hadn't done that.

I, for one, would be most interested in seeing your extreme data points.  Can you run something like the following and post the results:

ods select ExtremeObs;

proc univariate data=yourfilename;

  var var16;

run;

econ
Quartz | Level 8

maybe he used weight because it is a sampled dataset.

you have to correct for that.

why else would you use it.

art297
Opal | Level 21

He hasn't used it before, which is why I asked.  Why else would one use it?  One possibility is that they see it in the documentation, don't understand its ramifications, and simply include it as part of their code .. however wrong that may be!

HK_EndeavourForever
Fluorite | Level 6

Hi Arthur,

As I read again and again, I am now getting to understand what the WEIGHT statement is used for! That was completely not needed in my case.

Thanks for putting me in the right direction!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 25 replies
  • 3988 views
  • 0 likes
  • 8 in conversation