11-22-2017 10:06 AM - edited 01-08-2018 04:41 AM
Selecting a numeric value from a dataset using a (scalar) PROC SQL subquery can lead to grossly incorrect results if the numeric variable has been defined with a length less than 8 bytes in the dataset. I observed this using SAS 9.4 (TS1M2) on the Windows (X64_7PRO) platform.
/* Simplified example demonstrating the issue */ data a; length x 7; x=1; run; proc sql; select (select x from a) from a; quit; /* returns -311E229 (=-2**769, in HEX16. format: F000000000000000, as opposed to 3FF0000000000000, which would represent 1) */ /* The following two steps reveal what seems to happen internally: */ data a; length x 8; /* lengths 7, 6, 5 and 4 yield different results */ x=631472; /* In HEX16. format (i.e. internal representation) this is 4123456000000000. */ run; proc sql; select (select x from a) format=hex16. from a; quit; /* Results for various lengths of x Length Result 8 4123456000000000 -- correct 7 2345600000000000 6 4560000000000000 5 6000000000000000 4 0000000000000000 */
Apparently, the truncation of the (8 minus length) bytes occurs on the left side of the internal binary representation rather than the right (where it should occur). Thus, the results are not just less precise (lacking some of the least significant bits), but completely wrong (having lost some of the most significant bits).
This actually ruined the results of a large program of mine where SAS date values of length 4 bytes were "truncated" to zero by a subquery which was part of an ON clause.
On 24 October 2017, I reported the issue to SAS Technical Support. They could reproduce it and suggested to add y=round(x); in the data step and then select y rather than x in the subquery. Moreover, they provided a link to the paper "Dealing with Numeric Representation Error in SAS® Applications" (which I knew already).
I replied that rounding is unnecessary and suggested the following:
/* Tentative workaround */ proc sql; select (select x+0 from a) from a; quit; /* It seems that x*1 works as well as x+0, however, --x does not. */
Since then (4 weeks ago) I haven't heard anything from them. The status of the support ticket is still "SAS researching."
I will update this post once I receive more information from Technical Support.
Update 2017-11-27: In the few days since I wrote the above post the status of my support ticket has been changed to "SAS answered." Thanks for the likes and the reply btw.
Update 2018-01-08: In the meantime, the support ticket has been closed without notice.
11-22-2017 12:36 PM
01-08-2018 04:52 AM - edited 01-08-2018 07:03 AM
I just ran this test on SAS 9.4 TS1M2 on AIX:
data a; length x 7; x=1; run; proc sql; select (select x from a) from a; quit; %macro lengthtest; %do length = 8 %to 4 %by -1; data a; length x &length; /* lengths 7, 6, 5 and 4 yield different results */ x=631472; /* In HEX16. format (i.e. internal representation) this is 4123456000000000. */ run; proc sql; select (select x from a) format=hex16. from a; quit; %end; %mend; %lengthtest
-------- 1 ---------------- 4123456000000000 ---------------- 4123456000000000 ---------------- 4123456000000000 ---------------- 4123456000000000 ---------------- 4123456000000000
So this is a Windows-specific problem. And from what I've read here on communities, this is not the first real-format problem that SAS has on Windows. IMO the Windows numeric libraries suffer from the execrable SW quality typical for the Microsoft garbage. And since catastrophic bugs are usually called a "feature" in Redmond, this won't be fixed.
Your best solution is to switch to a solid computing environment for your SAS server (UNIX).
01-08-2018 06:26 AM
Thanks for pointing out that this is a Windows-specific problem. This is interesting.
In my single-user environment I will rather increase my validation efforts than migrate from Windows to Unix. I remember that working directly on Unix SAS (lacking the Enhanced Editor) used to be very painful in the past. (Not sure if the GUI has changed since then.) A combination of both operating systems via SAS/CONNECT would certainly involve substantial additional licensing costs.
01-08-2018 07:02 AM
Since we do all our front-end work with Enterprise Guide and the Java-based Apps anyway, the DM GUI without the EE is not a problem. I use the direct login to AIX (with CDE) only for doing maintenance and running SAS batch jobs. Very rarely will I start SAS in interactive mode.
01-08-2018 02:30 PM
I can reproduce the problem using SAS 9.4M2 on Windows Server 2012 R2. Curious to know why you are shortening lengths of numeric variables. I used to do this in the bad old days for dates when disk space was limited and expensive but haven't bothered with it for many years now. IMO shortening numeric variables to save space shouldn't really be necessary. Do you have a different use case?
01-08-2018 04:58 PM
Thanks, @SASKiwi, for checking another OS.
When I set up my current project, I didn't know how large my datasets would get. In fact, the project is still developing, some datasets already had several hundred million observations and I once encountered a disk-full error (with an SSD!) when I ran two SAS sessions in parallel.
Indeed, I started using SAS in those "bad old days" (1997) in a marketing database setting where shortening numeric variables was a must due to disk space limitations (and it never caused any problems). That's probably why doing so still seemed quite natural to me.
01-08-2018 06:42 PM
Have you tried SAS dataset compression? We have this turned on by default in our SAS environment and it can reduce the size of datasets by up to 80% for very wide ones. To me this is far more useful than shortening numerics. In our experience the CPU overhead of compression is more than offset by the IO reduction.
01-09-2018 05:42 AM - edited 01-09-2018 05:43 AM
Thank you for the suggestion.
I always thought that compression would primarily reduce the space occupied by long character variables (especially those with many trailing blanks), so I hadn't considered it for my current project where most of the variables are numeric and the few character variables are short.
But after rereading the documentation I've just given it a try with a typical dataset which is frequently used and which contains only numeric variables (five of length 8, one of length 6 and four of length 4). It has 141 million observations. Without compression it is 8.431 GB large.
Results (dataset sizes in GB):
COMPRESS | NO YES
YES | 9.118 9.114
BINARY | 9.731 9.727
The SAS log confirms in each case: "Compressing data set WORK.xxxx increased size by nn.nn percent" (percentages varying from 8.10 to 15.42).
I'm sure dataset compression makes more sense with data containing, say, free-text comments from questionnaires etc. and I have seen it in use in clinical research projects.
01-10-2018 03:36 PM
OK, looks like your tables are mostly numeric so compression isn't going to help much. Ours have quite a lot of character variables where compression makes a huge difference.