Re: Interesting result when where condition is applied for dataset

thummala · Posted 10-02-2023 09:30 AM

Hello community,

Can you please help me understand the puzzle/mystery here. I have created a dataset (chk) using the below code. When I do apply a subset condition for the dataset, it results in 0 records when theta value is >= 1.6. The same 0 records is the outcome when a where condition is applied to the dataset manually in sas window.

data chk;
do miu=2 to 3.5 by 0.5;
do sd=0.7 to 1.3 by 0.3;
do theta=0.8 to 2.8 by 0.2;
do n=6 to 30 by 4;
output;
end;
end;
end;
end;
run;

data chk1;
set chk;
where theta = 1.6;
run;

Thank You.

Astounding · Posted 10-02-2023 10:09 AM

Computers can have trouble getting exact numbers. It's just a function of the storage system for numeric values. Try it this way:

where round(theta, 0.000001) = 1.6;

Some detail on the issue: https://documentation.sas.com/doc/en/lrcon/9.4/p0ji1unv6thm0dn1gp4t01a1u0g6.htm

vfarmak · Posted 10-02-2023 11:00 AM

Hello!

I faced the same situation in the past and what I did was the following:

1. specified the length of the fields

2. also the format of the fields

3. the character representation of the numeric and then filtered it out

data chk;
	length miu 8;
	length sd 8;
	length theta 8;
	length n 8;

	do miu=2 to 3.5 by 0.5;
		do sd=0.7 to 1.3 by 0.3;
			do theta=0.8 to 2.8 by 0.2;
				do n=6 to 30 by 4;
					output;
				end;
			end;
		end;
	end;

	format miu 8.;
	format sd 18.1;
	format theta 18.1;
	format n 8.;

run;

And then for the filtering part

data chk1;
	set chk;
	theta_c = input(theta, $20.);
run;

data chk2;
set chk1;
where theta_c ="1.6";
run;

FreelanceReinh · Posted 10-02-2023 11:08 AM

Hello @thummala,

Alternatively, use an integer index variable and then compute the non-integer values as needed. This avoids the tiny rounding errors (which cause the sort of problem you have encountered) right from the start.

Example:

do _t=8 to 28 by 2;
  theta=_t/10;
...
drop _:;

In a loop over very few values (such as your sd loop, but that happened to produce "clean" values) you can list them like this:

do sd=0.7, 1, 1.3;

yabwon · Posted 10-02-2023 11:37 AM

Computer is just a very fast abacus, it adds, multiplies, and subtracts integers fairly good, but floating point numbers is totally different suit.

And it is not only SAS case, in R Python, Excel, C, Java, etc. you will be able to reproduce example similar to below.

You would expect to get 1 observation form:

data test;
  do i = -1 to 1 by 0.1;
    if 0=i then output;
  end;
run;

but you get 0, as the log states:

1    data test;
2      do i = -1 to 1 by 0.1;
3        if 0=i then output;
4      end;
5    run;

NOTE: The data set WORK.TEST has 0 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

In this case it's fault of infinite extension of one-tenth (1/10) in binary system, similar way one-third (1/3) has 0.333333... in the decimal system.

Lecture of the link from @Astounding will help you to understand the situation.

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

s_lassen · Posted 10-02-2023 03:25 PM

You already had some good suggestions on how to solve this - the problem you are encountering is the following:

Numbers in SAS are internally represented as base 2 floating points (unless you are on the mainframe, then they are base 16 floating points). This means that some decimal numbers can be represented exactly, for instance your MIU values:

do miu=2 to 3.5 by 0.5;

- here you will have 2E0, 5E-1, 3E0 etc. up to 7E-1, as the numbers are all possible to calculate as integers, some of them divided by 2.

But your thetas are not like that - the number 0.2 cannot be represented exactly as an integer divided by a power of 2. So when you keep adding 0.2 to a decimal number, the result will eventually become inexact. In the case you show, this happens at 1.6.

One other possibility, apart from rounding or using integers instead of decimals is to spell out the numbers you want in a list:

do theta=0.8,1.0,1.2...2.8;

which will give the "exact" decimal numbers. One way to do it is like this - put the list of values you want in a macro variable:

data _null_;
  length thetas $200;
  do theta=0.8 to 2.8 by 0.2;
    call catx(',',thetas,put(theta,3.1));
    end;
  call symputx('thetas',thetas);
run;
  
data chk;
do miu=2 to 3.5 by 0.5;
         do sd=0.7 to 1.3 by 0.3;
                   do theta=&thetas;
                              do n=6 to 30 by 4;
                              output;
                              end;
                   end;
           end;
end;
run;

- of course, you may want to do the same for the SD values:

data _null_;
  length thetas $200;
  do theta=0.8 to 2.8 by 0.2;
    call catx(',',thetas,put(theta,3.1));
    end;
  call symputx('thetas',thetas);
  length sds $200;
    do sd=0.7 to 1.3 by 0.3;
    call catx(',',sds,put(sd,3.1));
    end;
  call symputx('sds',sds);
run;
  
data chk;
do miu=2 to 3.5 by 0.5;
         do sd=&sds;
                   do theta=&thetas;
                              do n=6 to 30 by 4;
                              output;
                              end;
                   end;
           end;
end;
run;

Is this worth the trouble? Maybe, if you are doing a lot of comparisons on the same data - then it probably will be faster than rounding every time you try to look up a value. And if you have a lot of inexperienced users trying to find the different values, this solution may save a lot of time - not just CPU time, but actual working hours, which are a lot more expensive.

Interesting result when where condition is applied for dataset