HK_EndeavourForever
Fluorite | Level 6

Hello Arthur,

The system on which the program is running is MVS (Mainframes).

I am a beginner with PROC UNIVARIATE, so when Tom suggested using UNIVARIATE on the sums, I used the WEIGHT statement to see how the total dollars were being shown.

I tried your suggestion of dividing var16 by 1000 in the input data set and then using this new input (with the var16 values reduced) for the summarizations. The dollar difference is now 186.

SASDS          No. of Observations    Total $

insasds.new    6138735                78,277

TEMP1          35933                  78,200

TEMP2          23093                  78,014

And yes, with the VAR statement and your example UNIVARIATE procedure, I now see that the input SAS data set has dollar values so large that I think they fall outside the range a SAS numeric can represent exactly (when a variable is declared numeric, SAS defaults it to a length of 8 bytes, correct?).

Actual run with UNIVARIATE procedure:

SASDS          No. of Observations    Total $

insasds.new    6138735                77,757,256

TEMP1          35933                  77,648,952

TEMP2          23093                  77,676,511

insasds:

             The UNIVARIATE Procedure             

                Variable:  var16                                           

               Extreme Observations               

--------Lowest-------                -------Highest-------

       Value         Obs               Value             Obs

-1.640145E21   3.47E6        1.0926427E21   3.47E6

-1.640145E21   3.47E6        1.2878341E21   3.47E6

-1.637649E21   3.47E6        1.6401453E21   3.47E6

-1.287872E21   3.47E6        2.5757056E21   3.47E6

-1.287834E21   3.47E6        3.2777946E21   3.47E6

                                                  

TEMP1:                                                  

             The UNIVARIATE Procedure             

                Variable:  var16                

               Extreme Observations               

    -----Lowest-----        -----Highest-----     

     Value      Obs           Value          Obs     

    -165766     1663         1269577     6439     

    -129850     1661         1779938       66     

    -128195    35750         1905396    11274     

    -128195    35749         2386700      346     

    -116770     1692         4376822       402     

                                                  

TEMP2:                                                  

           The UNIVARIATE Procedure             

              Variable:  var16                

             Extreme Observations               

  -----Lowest-----        -----Highest-----     

    Value      Obs           Value         Obs     

  -892673    22993         1819878    15431     

  -516203    11224         1866471    15875     

  -485833    11225         2171612     4093     

  -388422    11226         2343286    15303     

  -371947    22994         3030318    14326     

Tom
Super User Tom
Super User

If you are trying to store dollars and cents as decimal fractions, you will get rounding issues, since many decimal numbers cannot be represented exactly as binary floating-point numbers.

Try converting your stored value from dollars to pennies (and hence to integers) and you might solve your problem.

cents= round(100 * var16,1);

Remember to divide your resulting sums by 100 to convert from pennies back to dollars.
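Tom's pennies trick is not SAS-specific; here is a hypothetical Python sketch of the same idea (illustrative only — IEEE doubles behave the same way as SAS 8-byte numerics on most platforms):

```python
# Many decimal fractions, like 0.10, have no exact binary
# representation, so repeated addition of dollar amounts drifts.
total = sum(0.10 for _ in range(100))   # one hundred dimes
print(total == 10.0)                    # False: the sum has drifted

# Storing integer pennies makes every addition exact.
cents = sum(round(100 * 0.10) for _ in range(100))
print(cents / 100)                      # 10.0 exactly
```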

HK_EndeavourForever
Fluorite | Level 6

Tom,

  Thanks for the suggestion! But it did not work!!

Hi Arthur,

These unusually high values seem to have exceeded the precision that SAS numerics can represent exactly on mainframes. Since the two summaries group by different combinations, the order in which the dollars are summed also differs, resulting in inconsistent totals.

I am approaching my customer with these observations.


Regards.

mkeintz
PROC Star

HK:

For a series of numbers that add up to the scale you report, I think it's quite possible to get different sums from these two approaches:

  1. Sort data from smallest (absolute value) to largest, and iteratively add each element to the sum, vs.
  2. Sort from largest to smallest, and add each successive element.

This is because, in the second scenario, the last numbers added amount to no more than rounding error at the 16th or 17th significant digit (or whatever precision your system offers). In the first scenario, they are accumulated while they are still significant relative to the running total.

This is not a SAS issue, it's a computing issue, and would be true for any software that uses double precision floating point computing.

Now, I don't think you've proven this is the case, but you probably could by sorting your data these two ways and then having PROC MEANS generate global totals each way. If that doesn't show differences, then you might have to drop the numeric-precision theory. But if it does, you might have found the explanation for your differences. BTW, if there are differences attributable to precision, I wouldn't expect them to be large as a proportion of the total.
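The two-orderings test above can be sketched outside SAS as well; here is a hypothetical Python illustration (the effect is the same for any IEEE double-precision sum, though the exact magnitudes differ from the mainframe's floating-point format):

```python
# Near 1E16 the gap between adjacent doubles is 2, so adding 1.0
# to a running total of 1E16 changes nothing. Accumulating the
# small values first preserves them.
values = [1e16] + [1.0] * 10

big_first = 0.0
for v in sorted(values, key=abs, reverse=True):
    big_first += v          # each 1.0 is rounded away

small_first = 0.0
for v in sorted(values, key=abs):
    small_first += v        # the ten 1.0s accumulate to 10 first

print(big_first)            # 1e+16
print(small_first)          # 1.000000000000001e+16
```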

GeoffreyBrent
Calcite | Level 5

Yeah, looks like roundoff errors to me. A length-8 SAS numeric variable is stored with 52-56 bits in the mantissa (depending on your system), which translates to around 15-17 significant figures in base-10. So when you store a number of magnitude ~ 1E21, you can expect rounding errors of magnitude ~ 100,000.
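Those magnitudes can be checked directly in an IEEE-754 environment (a Python sketch; its 53-bit mantissa falls inside the 52-56 bit range above, so mainframe figures will differ slightly):

```python
import math

# The gap between adjacent representable doubles near 1E21:
# anything much smaller than half this gap simply vanishes
# when added to a value of that size.
print(math.ulp(1e21))        # 131072.0, i.e. roughly 1E5
print(1e21 + 1e4 == 1e21)    # True: the 1E4 is lost entirely
```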

Some demo code:

data test;
  input x;
cards;
1E4
1E21
-1E21
;
run;

proc summary data=test nway;
  var x;
  output out=test2 sum=;
run;

On my system, this returns zero: when 1E4 is added to 1E21, the 1E4 is rounded away, and the two big values then cancel exactly. However, if I reorder the data so that the "1E4" appears last in the data step, the two big values cancel out first, and we get the correct sum.

Some options:

- accept the rounding error
- find a way to implement calculations in higher precision (see the 2008 SGF paper "Ludicrously Large Numbers" for some ideas)
- fine-tune the SAS calculation to minimise rounding error.

If the values you are adding are mostly positive, I'd go with the first strategy suggested by mkeintz: pick up the small values early so they can add up to something that won't be wiped out by roundoff error.

However, looking at the values you've listed above, it seems you have several very large positives and negatives that almost exactly cancel one another out. For instance, your two largest negatives almost exactly cancel the largest positive; the fourth- and fifth-largest negatives cancel the second-largest positive; and the third-largest on both lists almost cancel out.

Is there some reason why this might be the case? And if so, is there some way of matching up values that are likely to cancel out?

Take the following example:

data test;
  length group $1;
  input group x;
cards;
A 1E4
B 1E21
B 1E21
B -2E21
C 1E10
C 1E10
C -2E10
;
run;

proc summary data=test;
  var x;
  output out=test2 sum=;
run;

This returns zero, because the original 1E4 is wiped out by rounding error. But add "class group;" to the PROC SUMMARY and you get the right answer: it forces large values to cancel before combining them with small ones.
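The grouping effect can be reproduced in any double-precision environment; here is a hypothetical Python sketch of the same data (not SAS code, just the same arithmetic):

```python
# Summing in raw order loses the 1E4: it is absorbed by 1E21
# before the large values cancel. Summing each group first lets
# the huge values cancel exactly, preserving the small value.
data = [("A", 1e4), ("B", 1e21), ("B", 1e21), ("B", -2e21),
        ("C", 1e10), ("C", 1e10), ("C", -2e10)]

flat = 0.0
for _, x in data:
    flat += x
print(flat)                       # 0.0: the 1E4 has vanished

group_sums = {}
for g, x in data:
    group_sums[g] = group_sums.get(g, 0.0) + x
print(sum(group_sums.values()))   # 10000.0: the correct total
```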

art297
Opal | Level 21

Are such extreme positive and negative values, at the individual record level, realistic? Might they be data entry errors?

HK_EndeavourForever
Fluorite | Level 6

Hi Arthur,

These dollar amounts are the results of some calculations and yes, agreed that these are unrealistic. I have written back the observations to the business.

Thank You a lot for your suggestions!!

HK_EndeavourForever
Fluorite | Level 6

Hello Cynthia,

The two summarizations exist because they form the feeds to two different systems, and the tabular report is for dollar validation.

Based on your example, I changed PROC MEANS to PROC SUMMARY, and here is my output. My understanding is that the BY and CLASS statements in PROC SUMMARY generate the same result; the difference is that the BY statement expects the input data set to be sorted, while with CLASS it need not be.

Please correct me if I am wrong. This is based on my observations so far.

proc sort data=sashelp.class out=sash;
  by age sex;
run;

proc summary data=sash nway missing;
  by age sex;
  var weight;
  output out=work.mns1 sum=;
run;

proc print data=work.mns1;
  title 'after proc means 1';
run;

proc summary data=sashelp.class nway missing;
  class age sex;
  var weight;
  output out=work.mns2 sum=;
run;

proc print data=work.mns2;
  title 'after proc means 2';
run;

after proc means 1

 Obs    Age    Sex    _TYPE_    _FREQ_    Weight
   1     11     F          0         1      50.5
   2     11     M          0         1      85.0
   3     12     F          0         2     161.5
   4     12     M          0         3     310.5
   5     13     F          0         2     182.0
   6     13     M          0         1      84.0
   7     14     F          0         2     192.5
   8     14     M          0         2     215.0
   9     15     F          0         2     224.5
  10     15     M          0         2     245.0
  11     16     M          0         1     150.0

after proc means 2

 Obs    Age    Sex    _TYPE_    _FREQ_    Weight
   1     11     F          3         1      50.5
   2     11     M          3         1      85.0
   3     12     F          3         2     161.5
   4     12     M          3         3     310.5
   5     13     F          3         2     182.0
   6     13     M          3         1      84.0
   7     14     F          3         2     192.5
   8     14     M          3         2     215.0
   9     15     F          3         2     224.5
  10     15     M          3         2     245.0
  11     16     M          3         1     150.0
Tom
Super User Tom
Super User

Sounds like you might be having trouble with order-of-magnitude issues. If you try to add a small number to a really large number, the floating-point representation causes the small number to get lost. When the numbers are added in a different order, different numbers are eliminated by this process.

How many observations do you get when you sum by VAR1 - VAR10?  What about using VAR11-VAR15?

Run PROC UNIVARIATE on the sums generated into TEMP1 and TEMP2.   Do the sums vary widely?

ballardw
Super User

Also, are there any records that have missing values for any of your class variables? By default, PROC SUMMARY does not create combinations of the CLASS variables that include missing values, so the totals from two different CLASS statements over the same VAR variable are likely to differ. The larger the difference in totals, the more records with missing class-variable values are likely involved.

Try using a dummy variable with a value of 1 for each record and compare the output.

Use both proc summary and tabulate to compare the sums of the output data sets.

A brief example of what I mean:

data junk;
  input c1 c2 c3 var1;
datalines;
1 . 1 1
1 2 1 2
1 2 2 3
. 2 . 3
2 2 2 4
;
run;

proc summary data=junk nway;
  class c1 c2 c3;
  var var1;
  output out=junk2 sum=;
run;

proc summary data=junk nway;
  class c1 c3;
  var var1;
  output out=junk3 sum=;
run;

proc means data=junk2 sum;
  title 'First group of class variables';
  var var1;
run;

proc means data=junk3 sum;
  title 'Second group of class variables';
  var var1;
run;
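The arithmetic behind this can be mocked outside SAS too; here is a hypothetical Python sketch of the default CLASS behavior described above (observations with a missing class value are excluded from the summary unless you ask to keep them):

```python
# Each row is (class_value, analysis_value); None plays the role
# of a SAS missing value.
rows = [(1, 1), (1, 2), (1, 3), (None, 3), (2, 4)]

def class_sum(rows, keep_missing=False):
    """Group-and-sum, dropping missing-class rows by default."""
    totals = {}
    for key, val in rows:
        if key is None and not keep_missing:
            continue          # default: a missing class value drops the row
        totals[key] = totals.get(key, 0) + val
    return totals

print(sum(class_sum(rows).values()))                     # 10: one row dropped
print(sum(class_sum(rows, keep_missing=True).values()))  # 13: like MISSING
```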

art297
Opal | Level 21

@ballardw: The OP had included the MISSING option in the code. I think you will find that having done so produces output that does indeed include all combinations, including those that contain missing values. E.g., with the code you provided:

proc summary data=junk nway missing;
  class c1 c2 c3;
  var var1;
  output out=junk2 sum=;
run;

proc summary data=junk nway missing;
  class c1 c3;
  var var1;
  output out=junk3 sum=;
run;

proc means data=junk2 sum missing;
  title 'First group of class variables';
  var var1;
run;

proc means data=junk3 sum missing;
  title 'Second group of class variables';
  var var1;
run;

Art


Discussion stats
  • 25 replies
  • 4429 views
  • 0 likes
  • 8 in conversation