PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NO...

SAS_Doctor · Posted 06-02-2011 03:24 PM

Why do I get conflicting results? I tested using an ODBC database - both Oracle and SQL Server. This works fine if I use SAS data.

Why are the "missing values" counted in the Sum(Case(Where NOT IN))) but not in the "Where NOT IN"?

Here is my code:
libname awdb2 meta library='awdb2'; /*Any ODBC connected data base.*/
Title1 'Base Data';
proc sql;
select ServiceLevelThreshold Label='SLT'
, count(1) as SLT_Count
from awdb2.t_call_type
group by 1
order by 1
;
quit;

Title1 'Where Not In (10)';
proc sql;
select ServiceLevelThreshold Label='SLT'
, count(1) as SLT_Count
from awdb2.t_Call_Type
where ServiceLevelThreshold not in (10)
group by ServiceLevelThreshold
order by ServiceLevelThreshold
;
quit;

Title1 'Sum(Case( Where Not In (10)))';
proc sql;
select sum(case when (ServiceLevelThreshold not in (10)) then 1 else 0 end) as SLT_Count
from awdb2.t_call_type
;
quit;

/* Test Data:
data Test;
infile datalines missover;
input ServiceLevelThreshold $ Status;
datalines;
h0 92
h0 51
h0 10
h0 51
h0 10
h0 66
h0 66
h0 10
h0 66
h1 10
h1 51
h1 92
h1
h1 10
h1
h2 51
h2
h2 66
h2 66
h2 10
h2 51
h2 66
h2 51
h2 10
;
run;*/

Ksharp · Posted 06-02-2011 10:35 PM

I tested it.There is no problem
[pre]

data Test;
infile datalines truncover;
input ServiceLevelThreshold $ Status;
datalines;
h0 92
h0 51
h0 10
h0 51
h0 10
h0 66
h0 66
h0 10
h0 66
h1 10
h1 51
h1 92
h1
h1 10
h1
h2 51
h2
h2 66
h2 66
h2 10
h2 51
h2 66
h2 51
h2 10
;
run;
Title1 'Base Data';
proc sql;
select ServiceLevelThreshold Label='SLT'
, count(1) as SLT_Count
from test
group by 1
order by 1
;
quit;

Title1 'Where Not In (10)';
proc sql;
select ServiceLevelThreshold Label='SLT'
, count(1) as SLT_Count
from test
where status not in (10)
group by ServiceLevelThreshold
order by ServiceLevelThreshold
;
quit;

Title1 'Sum(Case( Where Not In (10)))';
proc sql;
select sum(case when (status not in (10)) then 1 else 0 end) as SLT_Count
from test
;
quit;

[/pre]

They all equal 17.

Ksharp

SAS_Doctor · Posted 06-03-2011 10:33 AM

There is a known difference with how databases handle NULLs. Here is the wonderful explination I got from Michele Austin
@ SAS Tech Support. Thank You Michele!

James,

It sounds like you are hitting a known issue with databases. A NULL is the absence of a value so a database doesn't consider the value when doing equality testing. Any rows that are NULLs will not show up when doing an NOT IN if the DBMS is processing the WHERE. In SAS we don't have the concept of NULLs so the rows would be kept. If you want to keep the NULLs then you need to explicitly code for them using an OR condition

where var not in (.....) or var is null

The reason that the SUM CASE included the NULLs is I suspect SAS was doing the processing as the query is not valid in most databases so SAS would retrieve the rows and do the processing in SAS.

Let me know if you have any questions. I'll plan on keeping the track open for 2 weeks unless I hear otherwise. After that I'll assume that it's okay to resolve.

Michele Austin
SAS Technical Support

Below is some information that I copied from the online documentation that discusses differences in results when processing NULL data

Potential Result Set Differences When Processing Null Data When your data contains null values or when internal processing generates intermediate data sets that contain null values, you might get different result sets depending on whether the processing is done by SAS or by the DBMS. Although in many cases this does not present a problem, it is important to understand how these differences occur.

Most relational database systems have a special value called null, which means an absence of information and is analogous to a SAS missing value. SAS/ACCESS translates SAS missing values to DBMS null values when creating DBMS tables from within SAS. Conversely, SAS/ACCESS translates DBMS null values to SAS missing values when reading DBMS data into SAS.

There is, however, an important difference in the behavior of DBMS null values and SAS missing values:

A DBMS null value is interpreted as the absence of data, so you cannot sort a DBMS null value or evaluate it with standard comparison operators.

A SAS missing value is interpreted as its internal floating-point representation because SAS supports 28 missing values (where a period (.) is the most common missing value). Because SAS supports multiple missing values, you can sort a SAS missing value and evaluate it with standard comparison operators.

This means that SAS and the DBMS interpret null values differently, which has significant implications when SAS/ACCESS passes queries to a DBMS for processing. This can be an issue in the following situations:
when filtering data (for example, in a WHERE clause, a HAVING clause, or an outer join ON clause). SAS interprets null values as missing; many DBMS exclude null values from consideration. For example, if you have null values in a DBMS column that is used in a WHERE clause, your results might differ depending on whether the WHERE clause is processed in SAS or is passed to the DBMS for processing. This is because the DBMS removes null values from consideration in a WHERE clause, but SAS does not.

when using certain functions. For example, if you use the MIN aggregate function on a DBMS column that contains null values, the DBMS does not consider the null values, but SAS interprets the null values as missing. This interpretation affects the result.

when submitting outer joins where internal processing generates nulls for intermediate result sets.

when sorting data. SAS sorts null values low; most DBMSs sort null values high. (See Sorting DBMS Data for more information.)

For example, create a simple data set that consists of one observation and one variable.

libname myoralib oracle user=testuser password=testpass; data myoralib.table;
x=.; /*create a missing value */
run;
Then, print the data set using a WHERE clause, which SAS/ACCESS passes to the DBMS for processing.

proc print data=myoralib.table;
where x<0;
run;
The log indicates that no observations were selected by the WHERE clause, because Oracle interprets the missing value as the absence of data, and does not evaluate it with the less-than (<) comparison operator.

When there is the potential for inconsistency, consider using one of these strategies.

Use the LIBNAME option DIRECT_SQL= to control whether SAS or the DBMS handles processing.

Use the SQL pass-through facility to ensure that the DBMS handles processing.

Add the "is not null" expression to WHERE clauses and ON clauses to ensure that you get the same result regardless of whether SAS or the DBMS does the processing.

Note: Use the NULLCHAR= data set option to specify how the DBMS interprets missing SAS character values when updating DBMS data or inserting rows into a DBMS table.

You can use the first of these strategies to force SAS to process the data in this example.

libname myoralib oracle user=testuser password=testpass direct_sql=nowhere; /* forces SAS to process WHERE clauses */
data myoralib.table;
x=.; /*create a missing value */
run;

You can then print the data set using a WHERE clause:

proc print data=myoralib.table;
where x<0;
run;

This time the log indicates that one observation was read from the data set because SAS evaluates the missing value as satisfying the less-than-zero condition in the WHERE clause.

SAS_Doctor · Posted 06-03-2011 10:38 AM

Michelle also passed this information along:

James,

Just be aware that this isn't really anything dealing with SAS as far as the NOT IN and NOT EQUALS not including missing values. That would occur outside of SAS. That is a database thing. This is from Oracle's website

Nulls in Conditions A condition that evaluates to UNKNOWN acts almost like FALSE. For example, a SELECT statement with a condition in the WHERE clause that evaluates to UNKNOWN returns no rows. However, a condition evaluating to UNKNOWN differs from FALSE in that further operations on an UNKNOWN condition evaluation will evaluate to UNKNOWN. Thus, NOT FALSE evaluates to TRUE, but NOT UNKNOWN evaluates to UNKNOWN.

Table 3-20 shows examples of various evaluations involving nulls in conditions. If the conditions evaluating to UNKNOWN were used in a WHERE clause of a SELECT statement, then no rows would be returned for that query.

Table 3-20 Conditions Containing Nulls

Condition Value of A Evaluation
a IS NULL 10 FALSE

a IS NOT NULL 10 TRUE

a IS NULL NULL TRUE

a IS NOT NULL NULL FALSE

a = NULL 10 UNKNOWN

a != NULL 10 UNKNOWN

a = NULL NULL UNKNOWN

a != NULL NULL UNKNOWN

a = 10 NULL UNKNOWN

a != 10 NULL UNKNOWN

Let me know if you have any questions.

Michele Austin
SAS Technical Support

PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Re: PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Re: PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Re: PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Re: PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Re: PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Re: PROC SQL "Where NOT IN" gives different results than Sum(Case(Where NOT IN)

Registration is open

SAS Training: Just a Click Away