HK_EndeavourForever
Fluorite | Level 6

Hello Arthur,

The system on which the program is running is MVS (Mainframes).

I am a beginner with PROC UNIVARIATE, so when Tom suggested using UNIVARIATE on the sums, I used the WEIGHT statement to see how the total dollars were being shown.

I tried your suggestion of dividing var16 by 1000 in the input data set and then using this new input (with the var16 values reduced) for the summarizations. The dollar difference is now 186.

SASDS          No. of Observations    Total $

insasds.new    6138735                78,277

TEMP1          35933                  78,200

TEMP2          23093                  78,014

And yes, with the VAR statement and your example UNIVARIATE procedure, I now see that the input SAS data set has dollar values so large that I think they fall outside the range a SAS numeric can represent exactly (when a variable is declared numeric, SAS defaults it to a length of 8 bytes, correct?).

Actual run with UNIVARIATE procedure:

SASDS          No. of Observations    Total $

insasds.new    6138735                77,757,256

TEMP1          35933                  77,648,952

TEMP2          23093                  77,676,511

insasds:

             The UNIVARIATE Procedure             

                Variable:  var16                                           

               Extreme Observations               

--------Lowest-------                -------Highest-------

       Value         Obs               Value             Obs

-1.640145E21   3.47E6        1.0926427E21   3.47E6

-1.640145E21   3.47E6        1.2878341E21   3.47E6

-1.637649E21   3.47E6        1.6401453E21   3.47E6

-1.287872E21   3.47E6        2.5757056E21   3.47E6

-1.287834E21   3.47E6        3.2777946E21   3.47E6

                                                  

TEMP1:                                                  

             The UNIVARIATE Procedure             

                Variable:  var16                

               Extreme Observations               

    -----Lowest-----        -----Highest-----     

     Value      Obs           Value          Obs     

    -165766     1663         1269577     6439     

    -129850     1661         1779938       66     

    -128195    35750         1905396    11274     

    -128195    35749         2386700      346     

    -116770     1692         4376822       402     

                                                  

TEMP2:                                                  

           The UNIVARIATE Procedure             

              Variable:  var16                

             Extreme Observations               

  -----Lowest-----        -----Highest-----     

    Value      Obs           Value         Obs     

  -892673    22993         1819878    15431     

  -516203    11224         1866471    15875     

  -485833    11225         2171612     4093     

  -388422    11226         2343286    15303     

  -371947    22994         3030318    14326     

Tom
Super User Tom
Super User

If you are trying to store dollars and cents as decimal fractions, you will get rounding issues, since many decimal numbers cannot be represented exactly as binary floating-point numbers.

Try converting your stored value from dollars to pennies (and hence to integers) and you might solve your problem.

cents= round(100 * var16,1);

Remember to divide your resulting sums by 100 to convert from pennies back to dollars.
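Tom's pennies trick is not SAS-specific; here is a hypothetical Python sketch of the same idea (illustrative only — IEEE doubles behave the same way as SAS 8-byte numerics on most platforms):

```python
# Many decimal fractions, like 0.10, have no exact binary
# representation, so repeated addition of dollar amounts drifts.
total = sum(0.10 for _ in range(100))   # one hundred dimes
print(total == 10.0)                    # False: the sum has drifted

# Storing integer pennies makes every addition exact.
cents = sum(round(100 * 0.10) for _ in range(100))
print(cents / 100)                      # 10.0 exactly
```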

HK_EndeavourForever
Fluorite | Level 6

Tom,

  Thanks for the suggestion! But it did not work!!

Hi Arthur,

These unusually high values seem to have exceeded the precision that SAS numerics can represent exactly on mainframes. Since the two summaries group by different combinations, the order in which the dollars are summed also differs, resulting in inconsistent totals.

I am approaching my customer with these observations.


Regards.

mkeintz
PROC Star

HK:

For a series of numbers that add up to the scale you report, I think it's quite possible to get different sums from these two approaches:

  1. Sort data from smallest (absolute value) to largest, and iteratively add each element to the sum, vs.
  2. Sort from largest to smallest, and add each successive element.

This is because, in the second scenario, the last numbers added amount to no more than rounding error at the 16th or 17th significant digit (or whatever precision your system offers). In the first scenario, they are accumulated while they are still significant relative to the running total.

This is not a SAS issue, it's a computing issue, and would be true for any software that uses double precision floating point computing.

Now, I don't think you've proven this is the case, but you probably could by sorting your data these two ways and then having PROC MEANS generate global totals each way. If that doesn't show differences, then you might have to drop the numeric-precision theory. But if it does, you might have found the explanation for your differences. BTW, if there are differences attributable to precision, I wouldn't expect them to be large as a proportion of the total.
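The two-orderings test above can be sketched outside SAS as well; here is a hypothetical Python illustration (the effect is the same for any IEEE double-precision sum, though the exact magnitudes differ from the mainframe's floating-point format):

```python
# Near 1E16 the gap between adjacent doubles is 2, so adding 1.0
# to a running total of 1E16 changes nothing. Accumulating the
# small values first preserves them.
values = [1e16] + [1.0] * 10

big_first = 0.0
for v in sorted(values, key=abs, reverse=True):
    big_first += v          # each 1.0 is rounded away

small_first = 0.0
for v in sorted(values, key=abs):
    small_first += v        # the ten 1.0s accumulate to 10 first

print(big_first)            # 1e+16
print(small_first)          # 1.000000000000001e+16
```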

GeoffreyBrent
Calcite | Level 5

Yeah, looks like roundoff errors to me. A length-8 SAS numeric variable is stored with 52-56 bits in the mantissa (depending on your system), which translates to around 15-17 significant figures in base-10. So when you store a number of magnitude ~ 1E21, you can expect rounding errors of magnitude ~ 100,000.
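Those magnitudes can be checked directly in an IEEE-754 environment (a Python sketch; its 53-bit mantissa falls inside the 52-56 bit range above, so mainframe figures will differ slightly):

```python
import math

# The gap between adjacent representable doubles near 1E21:
# anything much smaller than half this gap simply vanishes
# when added to a value of that size.
print(math.ulp(1e21))        # 131072.0, i.e. roughly 1E5
print(1e21 + 1e4 == 1e21)    # True: the 1E4 is lost entirely
```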

Some demo code:

data test;
  input x;
cards;
1E4
1E21
-1E21
;
run;

proc summary data=test nway;
  var x;
  output out=test2 sum=;
run;

On my system, this returns zero: when 1E4 is added to 1E21, the 1E4 is rounded away, and the two big values then cancel exactly. However, if I reorder the data so that the "1E4" appears last in the data step, the two big values cancel out first, and we get the correct sum.

Some options:

- accept the rounding error
- find a way to implement calculations in higher precision (see the 2008 SGF paper "Ludicrously Large Numbers" for some ideas)
- fine-tune the SAS calculation to minimise rounding error.

If the values you are adding are mostly positive, I'd go with the first strategy suggested by mkeintz: pick up the small values early so they can add up to something that won't be wiped out by roundoff error.

However, looking at the values you've listed above, it seems you have several very large positives and negatives that almost exactly cancel one another out. For instance, your two largest negatives almost exactly cancel the largest positive; the fourth- and fifth-largest negatives cancel the second-largest positive; and the third-largest on both lists almost cancel out.

Is there some reason why this might be the case? And if so, is there some way of matching up values that are likely to cancel out?

Take the following example:

data test;
  length group $1;
  input group x;
cards;
A 1E4
B 1E21
B 1E21
B -2E21
C 1E10
C 1E10
C -2E10
;
run;

proc summary data=test;
  var x;
  output out=test2 sum=;
run;

This returns zero, because the original 1E4 is wiped out by rounding error. But add "class group;" to the PROC SUMMARY and you get the right answer: it forces large values to cancel before combining them with small ones.
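The grouping effect can be reproduced in any double-precision environment; here is a hypothetical Python sketch of the same data (not SAS code, just the same arithmetic):

```python
# Summing in raw order loses the 1E4: it is absorbed by 1E21
# before the large values cancel. Summing each group first lets
# the huge values cancel exactly, preserving the small value.
data = [("A", 1e4), ("B", 1e21), ("B", 1e21), ("B", -2e21),
        ("C", 1e10), ("C", 1e10), ("C", -2e10)]

flat = 0.0
for _, x in data:
    flat += x
print(flat)                       # 0.0: the 1E4 has vanished

group_sums = {}
for g, x in data:
    group_sums[g] = group_sums.get(g, 0.0) + x
print(sum(group_sums.values()))   # 10000.0: the correct total
```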

art297
Opal | Level 21

Are such extreme positive and negative values, at the individual record level, realistic? Might they be data entry errors?

HK_EndeavourForever
Fluorite | Level 6

Hi Arthur,

These dollar amounts are the results of some calculations and yes, agreed that these are unrealistic. I have written back the observations to the business.

Thank You a lot for your suggestions!!

HK_EndeavourForever
Fluorite | Level 6

Hello Cynthia,

The two summarizations exist because they form the feeds to two different systems, and the tabular report is for dollar validation.

Based on your example, I changed PROC MEANS to PROC SUMMARY, and here is my output. My understanding is that the BY and CLASS statements in PROC SUMMARY generate the same result; the difference is that the BY statement expects the input data set to be sorted, while with CLASS it need not be.

Please correct me if I am wrong. This is based on my observations so far.

proc sort data=sashelp.class out=sash;
  by age sex;
run;

proc summary data=sash nway missing;
  by age sex;
  var weight;
  output out=work.mns1 sum=;
run;

proc print data=work.mns1;
  title 'after proc means 1';
run;

proc summary data=sashelp.class nway missing;
  class age sex;
  var weight;
  output out=work.mns2 sum=;
run;

proc print data=work.mns2;
  title 'after proc means 2';
run;

after proc means 1

 Obs    Age    Sex    _TYPE_    _FREQ_    Weight
   1     11     F          0         1      50.5
   2     11     M          0         1      85.0
   3     12     F          0         2     161.5
   4     12     M          0         3     310.5
   5     13     F          0         2     182.0
   6     13     M          0         1      84.0
   7     14     F          0         2     192.5
   8     14     M          0         2     215.0
   9     15     F          0         2     224.5
  10     15     M          0         2     245.0
  11     16     M          0         1     150.0

after proc means 2

 Obs    Age    Sex    _TYPE_    _FREQ_    Weight
   1     11     F          3         1      50.5
   2     11     M          3         1      85.0
   3     12     F          3         2     161.5
   4     12     M          3         3     310.5
   5     13     F          3         2     182.0
   6     13     M          3         1      84.0
   7     14     F          3         2     192.5
   8     14     M          3         2     215.0
   9     15     F          3         2     224.5
  10     15     M          3         2     245.0
  11     16     M          3         1     150.0
Tom
Super User Tom
Super User

Sounds like you might be having trouble with order-of-magnitude issues. If you try to add a small number to a really large number, the floating-point representation causes the small number to get lost. When the numbers are added in a different order, different numbers are eliminated by this process.

How many observations do you get when you sum by VAR1 - VAR10?  What about using VAR11-VAR15?

Run PROC UNIVARIATE on the sums generated into TEMP1 and TEMP2.   Do the sums vary widely?

ballardw
Super User

Also, are there any records that have missing values for any of your class variables? By default, PROC SUMMARY does not create combinations of the CLASS variables that include missing values, so the totals from two different CLASS statements over the same VAR variable are likely to differ. The larger the difference in totals, the more records with missing class-variable values are likely involved.

Try using a dummy variable with a value of 1 for each record and compare the output.

Use both proc summary and tabulate to compare the sums of the output data sets.

A brief example of what I mean:

data junk;
  input c1 c2 c3 var1;
datalines;
1 . 1 1
1 2 1 2
1 2 2 3
. 2 . 3
2 2 2 4
;
run;

proc summary data=junk nway;
  class c1 c2 c3;
  var var1;
  output out=junk2 sum=;
run;

proc summary data=junk nway;
  class c1 c3;
  var var1;
  output out=junk3 sum=;
run;

proc means data=junk2 sum;
  title 'First group of class variables';
  var var1;
run;

proc means data=junk3 sum;
  title 'Second group of class variables';
  var var1;
run;
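The arithmetic behind this can be mocked outside SAS too; here is a hypothetical Python sketch of the default CLASS behavior described above (observations with a missing class value are excluded from the summary unless you ask to keep them):

```python
# Each row is (class_value, analysis_value); None plays the role
# of a SAS missing value.
rows = [(1, 1), (1, 2), (1, 3), (None, 3), (2, 4)]

def class_sum(rows, keep_missing=False):
    """Group-and-sum, dropping missing-class rows by default."""
    totals = {}
    for key, val in rows:
        if key is None and not keep_missing:
            continue          # default: a missing class value drops the row
        totals[key] = totals.get(key, 0) + val
    return totals

print(sum(class_sum(rows).values()))                     # 10: one row dropped
print(sum(class_sum(rows, keep_missing=True).values()))  # 13: like MISSING
```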

art297
Opal | Level 21

@ballardw: The OP had included the MISSING option in the code. I think you will find that having done so produces output that does indeed include all combinations, including those that contain missing values. E.g., with the code you provided:

proc summary data=junk nway missing;
  class c1 c2 c3;
  var var1;
  output out=junk2 sum=;
run;

proc summary data=junk nway missing;
  class c1 c3;
  var var1;
  output out=junk3 sum=;
run;

proc means data=junk2 sum missing;
  title 'First group of class variables';
  var var1;
run;

proc means data=junk3 sum missing;
  title 'Second group of class variables';
  var var1;
run;

Art


Discussion stats
  • 25 replies
  • 4429 views
  • 0 likes
  • 8 in conversation