proc sort - SUM function on group by returns diffe...


06-10-2015 01:49 AM

I am hoping somebody can explain why - and how to work around - the following.

This comes from sales data - essentially a few sales and then all of the items were returned for a refund - which is why I have positive and the same negative values.

The expected net result of all of the transactions is zero.

Run the code as is, then uncomment the SORT procedure and run it again: the **proc sql group by** will produce two different results,

either:

num=.0000000000000568434188608080000

or

num=.0000000000000000000000000000000

If num > 0 a sale is counted; if num = 0, no sale is counted.

So depending on the sort order of the input data set I can either have a sale counted or not - this is obviously inaccurate!!!

I know I can round and format to get a value closer to zero - that's not the question. At the office there are many examples of this type of coding, and not enough time to go and recode all of the analytic code around the enterprise.

Preferably somebody has come across this and there is a SAS Option I can use or a hotfix that can be installed.

currently tested on:

aix 6.1 with SAS 9.2

and

aix 7.1 with SAS 9.4

```sas
data work.Input_Data;
   category = 'a';
   length num 8;   /* note: "length num 8." (with a period) is a typo; LENGTH takes a byte count */
   input num;
cards;
299
-299
-283.8
299
283.8
-299
283.8
283.8
-283.8
-283.8
-218.31
218.31
218.31
-218.31
;
run;

* UNCOMMENT THE SORT BELOW TO SEE CORRECT RESULTS IN WORK.OUTPUT1 *;
/*proc sort data=work.Input_Data; by num; run;*/

proc sql;
   * this (group by) sometimes produces incorrect results *;
   create table WORK.output1 as
      select category, sum(num) as num
      from work.Input_Data
      group by category;

   * this works *;
   create table WORK.output2 as
      select sum(num) as num
      from work.Input_Data;
quit;

* this works *;
proc summary data=work.Input_Data nway n;
   class category;
   var num;
   output out=WORK.output3 sum=;
run;

data _null_;
   set WORK.output1;
   * num should be equal to zero - strangely, it depends on the order of rows in work.Input_Data above *;
   format num tst 32.31;
   tst = round(num, 0.001);
   put num=;
   if num > 0 then put "Greater";
   if num = 0 then put "Equal";
   if num < 0 then put "Less";
   put tst=;
   if tst > 0 then put "Greater";
   if tst = 0 then put "Equal";
   if tst < 0 then put "Less";
run;

data _null_;
   set WORK.output2;
   * num equal to zero *;
   format num 32.31;
   put num=;
   if num > 0 then put "Greater";
   if num = 0 then put "Equal";
   if num < 0 then put "Less";
run;

data _null_;
   set WORK.output3;
   * num equal to zero *;
   format num 32.31;
   put num=;
   if num > 0 then put "Greater";
   if num = 0 then put "Equal";
   if num < 0 then put "Less";
run;
```
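The behaviour the program above exposes is generic IEEE 754 floating-point behaviour rather than anything SAS-specific. A minimal Python sketch (an illustration of the same double-precision arithmetic, not SAS code) using the same test values:

```python
import math

# Same test values as the SAS data step above.
values = [299, -299, -283.8, 299, 283.8, -299, 283.8, 283.8,
          -283.8, -283.8, -218.31, 218.31, 218.31, -218.31]

# A naive left-to-right sum is order-dependent.  Near 1e16 the gap
# between adjacent doubles is 2, so a lone +1 can be absorbed:
a = (1e16 + 1) + 1   # each 1 is absorbed one at a time
b = 1e16 + (1 + 1)   # 2 is representable at this magnitude
assert a != b        # same addends, different grouping, different result

# The running sum of the sales data may come out as a tiny nonzero
# value (e.g. 5.68e-14) or exactly 0.0, depending on summation order.
naive = sum(values)
assert abs(naive) < 1e-9

# An exactly rounded sum is 0.0 here, because every stored double is
# paired with its exact negation (negation is exact in IEEE 754).
assert math.fsum(values) == 0.0
```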


06-10-2015 02:45 AM

Did you read all those things about numeric precision?

As soon as you move into non-whole numbers, or numbers with more than 12 digits, you are in the area where 1 might not be equal to 1 because of that precision. Remember the slider.

It is not a SAS problem; it is a global binary-numeric one.

---->-- ja karman --<-----


06-10-2015 02:54 AM

Thanks Jaap - I am aware of what you are discussing but that is not the question.

This is not about precision or scale.

My question:

Why does the **sort order of the records** in the input data set produce different total values when summed in the proc sql - when using the SAME individual records and values?

Which indicates a SAS proc sql bug.


06-10-2015 02:57 AM

Also - look at the example code.

The CORRECT result is returned by PROC SUMMARY (which implies the data gets sorted internally), whereas PROC SQL .. SUM ... GROUP BY returns a different result from the proc summary on the SAME records when the input is not sorted - but the correct result when the input data set is sorted.


06-10-2015 03:54 AM

#327 on my constantly growing list of reasons why to avoid PROC SQL.

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers


06-10-2015 04:18 AM

Unfortunately, SAS DI Studio generates a lot of proc sql code - so how reliable are the results?

Must I go to 20-plus analysts and senior management and explain that SAS proc sql is not 100% accurate - that the scoring models used for profitability are most likely producing inaccurate results?

I.e.:

Depending on the record SORT ORDER, a customer might be scored as profitable, while the same customer with the same records (sales behavior) but a different SORT ORDER will NOT be scored as profitable?


06-10-2015 04:26 AM

Well, the inaccuracies are very small.

Note that when working with data in "real" (floating-point) storage format, it is prudent to round before doing an = comparison, as you may always have artifacts from previous computations.
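That advice can be sketched in a few lines. This is a Python illustration rather than SAS, and the helper name is hypothetical:

```python
# Hypothetical helper: round a computed amount before comparing it
# against an exact value, so tiny binary artifacts do not flip tests.
def is_zero(amount, decimals=3):
    return round(amount, decimals) == 0.0

artifact = 5.684341886080802e-14   # the kind of residue reported in this thread
assert artifact != 0.0             # an exact comparison is fooled
assert is_zero(artifact)           # a rounded comparison is not
assert not is_zero(0.01)           # real nonzero amounts still register
```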

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers


06-10-2015 04:41 AM

Well - for statistics, any inaccuracy is unacceptable.

0 (zero) is either 0 (zero) or it is greater than or less than 0 (zero)

My challenge is that there are perhaps a hundred different SAS programs, each dealing with more than 5 SAS data sets, and each SAS data set has many hundreds of variables.

So going back to recode (round) each of those variables in all of that code is a huge undertaking - and impracticable right now.

I am in the process of migrating from SAS 9.2 to a new SAS 9.4 environment and have to compare the 2 environments - so this might need to be done in both environments to get equal results.

Which goes back to the original question(s):

- is there a SAS option that can be used to force a SAS data set to be sorted before a proc sql (since this seems to solve the issue but I don't know of such an option)

- is there a SAS option that globally will "round" numeric variables - i.e. evaluate using the first 3 decimals only?

- why does proc summary produce the expected "correct" result but the proc sql does not?

I am just trying to ensure that there is no quick fix solution before going back to the client and informing them that their current code is inaccurate.


06-10-2015 08:04 AM

This goes back to the earlier statement about it being a binary-numeric problem - summing is a sequential operation, and when you exceed the numeric precision, you start accumulating rounding "errors". It seems logical, at least to me, that the sort order is going to affect this dramatically: the difference comes into the sum at different points in the sequence, and propagates. I see this happen a fair amount in iterative statistical procedures (maximum likelihood estimation, especially), so when duplicating analyses on separate machines it is critical to have the sort order prior to analysis identical on the two machines. At least that way, the propagated errors should be identical.

Can you shift everything to integer arithmetic without overflowing? That seems the most straightforward way. The other is to compare using fuzzy logic that accounts for accumulated error.
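The integer-arithmetic suggestion can be sketched like this (a Python illustration, not SAS; exact as long as the scaled values stay below 2**53):

```python
values = [299, -299, -283.8, 299, 283.8, -299, 283.8, 283.8,
          -283.8, -283.8, -218.31, 218.31, 218.31, -218.31]

# Carry money as integer cents: integer addition is exact, so the
# total no longer depends on the order of the records.
cents = [round(v * 100) for v in values]   # e.g. -283.8 -> -28380
total = sum(cents)

assert total == 0                # exactly zero, in any order
assert sum(sorted(cents)) == 0   # reordering cannot change it
```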

Steve Denham


06-10-2015 08:37 AM

Thanks Steven -

if you look at the test data, a SAS numeric variable of length 8 should be able to store these test values without accumulating rounding "errors":

299, -299, -283.8, 299, 283.8, -299, 283.8, 283.8, -283.8, -283.8, -218.31, 218.31, 218.31, -218.31

It also doesn't seem to explain why the GROUP BY - with only one group in this test case - produces different totals from the same records.

I would expect the same "accumulating rounding errors" to occur, since the single group total is the same as the 'grand' total.

With testing (changing the order of records) I can produce different 'errors' (value differences) - so I agree it has something to do with the binary-numeric problem - I just don't see why it should occur in this test case.

A SAS length 8 numeric variable should be able to store these values without rounding concerns.

Many thanks for all the replies


06-10-2015 08:44 AM

None of those values will translate exactly into binary form. The 8-byte length has nothing to do with the physical length of the string - that only applies to character variables (as I found out the other day when I tried to format a numeric variable as 2.1). So the order, and the representation of decimal fractions in binary, are having an effect. Data_null's solution seems the easiest to me.

Steve Denham


06-10-2015 08:54 AM

Hi Steven

taken from - SAS(R) 9.2 Companion for UNIX Environments

====

Numeric Variable Length and Precision in UNIX Environments

The default length of numeric variables in SAS data sets is 8 bytes.

The issue of numeric precision affects the return values of almost all SAS math functions and many numeric values returned from SAS procedures. Numeric values in SAS for UNIX are represented as IEEE double-precision floating-point numbers. **The decimal precision of a full 8-byte number is effectively 15 decimal digits.**

====

So - once again - all of those test numbers should easily fit into a length 8 numeric SAS variable, which has precision to 15 decimal digits.

Just to clarify - I am using a length 8 numeric variable - the HIGHEST precision that SAS offers.

I am not formatting the numeric variable and getting 'errors' or mismatches.
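The "15 decimal digits" figure means a double faithfully round-trips about 15 significant decimal digits; it does not mean 283.8 is stored exactly, because 283.8 has no finite binary expansion. A Python sketch of the same IEEE 754 doubles (math.ulp requires Python 3.9+):

```python
import math
from decimal import Decimal

# A double prints back as 283.8, but the stored value is the
# nearest binary fraction, not the decimal number 283.8 itself.
stored = Decimal(283.8)            # exact value of the stored double
assert stored != Decimal('283.8')  # not exactly 283.8...
assert float(stored) == 283.8      # ...yet it round-trips fine

# The gap between adjacent doubles (ulp) near 283.8 is 2**-44,
# i.e. about 5.684e-14 -- the magnitude of the residue reported above.
assert math.ulp(283.8) == 2**-44
```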


06-10-2015 09:55 AM

It is not the VARIABLES, it is the OPERATIONS that are most affected in this case. I guess you could even work with 256-bit variables; as soon as you have fractions that do not translate easily into binary, you will end up with non-zero digits at the end of the mantissa, which will cause problems after mathematical operations. Sometimes those cancel each other out, thanks to the sequence; sometimes not.

---------------------------------------------------------------------------------------------

Maxims of Maximally Efficient SAS Programmers


06-10-2015 10:48 AM

Hi Kurt - thanks for the reply.

I am starting to understand this a bit more:

==

The IEEE 754 standard is a method of storing floating-point numbers in a compact way that is easy to manipulate. This standard is used by Intel coprocessors and most PC-based programs that implement floating-point math.

IEEE 754 specifies that numbers be stored in binary format to reduce storage requirements and allow the built-in binary arithmetic instructions that are available on all microprocessors to process the data in a relatively rapid fashion. However, some numbers that are simple, nonrepeating decimal numbers are converted into repeating binary numbers that cannot be stored with perfect accuracy.

For example, the number 1/10 can be represented in a decimal number system with a simple decimal:

.1

However, the same number in binary format becomes the repeating binary decimal:

.000110011001100110011 (and so on)

This number cannot be represented in a finite amount of space. Therefore, this number is rounded down by approximately -2.78E-17 when it is stored.

If several arithmetic operations are performed to obtain a given result, these rounding errors may be cumulative.
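The quoted behaviour is easy to reproduce. A short Python illustration of the same IEEE 754 doubles (not SAS code):

```python
from decimal import Decimal

# Decimal(0.1) shows the exact value of the double nearest to 1/10:
# a long, non-terminating decimal, because 1/10 repeats in binary.
assert Decimal(0.1) != Decimal('0.1')

# The familiar consequence once such values enter arithmetic:
assert 0.1 + 0.2 != 0.3
```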


06-11-2015 01:24 AM

THIS is exactly what happens. The only way out of this would be additional numeric storage formats, like packed decimal or integer, where the programmer has to provide the logic for fractions.

SAS obviously does not want to do that, for reasons of keeping everything simple.

Every time you need to make comparisons with exact discrete values, have to deal with fractions, and need to do computations first, it is very wise to use the round() function on the results of the computations.
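Applied to this thread's data, the round()-before-comparing pattern looks like this (a Python illustration of the idea, not SAS):

```python
values = [299, -299, -283.8, 299, 283.8, -299, 283.8, 283.8,
          -283.8, -283.8, -218.31, 218.31, 218.31, -218.31]

total = sum(values)             # may carry a tiny binary artifact
assert abs(total) < 1e-9        # whatever the order, it is near zero
assert round(total, 3) == 0.0   # rounding makes the = comparison safe
```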

Maxims of Maximally Efficient SAS Programmers