Hello,
While testing a macro to return email addresses from a metadata dataset, I get (seemingly) inconsistent results from the FINDW and INDEXW functions. When running the code given below, I am really curious why the datasets A1 and A2 are returning different results, and why the datasets B1 and B2 are returning different results. It seems like the order in which I check things influences the result.
My feeling is that A1 and A2 should be same, and that B1 and B2 should be same. Can anybody give me a hint why they are not?
I get:
NOTE: Table WORK.TEST_A1 created, with 1 rows and 2 columns.
NOTE: Table WORK.TEST_A2 created, with 0 rows and 2 columns.
NOTE: Table WORK.TEST_B1 created, with 0 rows and 2 columns.
NOTE: Table WORK.TEST_B2 created, with 1 rows and 2 columns.
I am running in SAS EG 7.1 on UNIX, and in SAS 9.4 on WINDOWS.
data T_USER_EMAIL;
length userid $6 email $100;
userid='ABC090'; email='john.smith@xyz.com'; output;
userid='DEF090'; email='joanna.smythe@xyz.com'; output;
userid='HIJ090'; email=' '; output;
run;
* test with Name;
proc sql noprint;
create table test_A1 as
select *
from T_USER_EMAIL
where index(upcase(email),'@XYZ.COM')
and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) or
(FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) ) ;
quit;
proc sql noprint;
create table test_A2 as
select *
from T_USER_EMAIL
where index(upcase(email),'@XYZ.COM')
and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) or
(FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) ) ;
quit;
* test with UserId;
proc sql noprint;
create table test_B1 as
select *
from T_USER_EMAIL
where index(upcase(email),'@XYZ.COM')
and ( (FINDW(" DEF090 xxx",strip(upcase(scan(email,1,'@'))),' ')) or
(FINDW(" DEF090 xxx",strip(upcase(userid)))) ) ;
quit;
proc sql noprint;
create table test_B2 as
select *
from T_USER_EMAIL
where index(upcase(email),'@XYZ.COM')
and ( (FINDW(" DEF090 xxx",strip(upcase(userid)))) or
(FINDW(" DEF090 xxx",strip(upcase(scan(email,1,'@'))),' ')) ) ;
quit;
(Just fyi, I want Users to be able to provide UserIds and/or Names to the macro ... that is why the code generated by the macro is checking in both the userid and email columns)
I think you're on to something here. Since the condition works when the first part of the "or" returns a non-zero value, but fails when that is zero, I have a suspicion that the condition optimizer does not honor the brackets as it should and breaks off too early.
Clearly a case for SAS TS, IMO.
I agree, this seems strange. And seems to limit itself to the WHERE Clause. The two data sets created with the following code are identical
data T_USER_EMAIL;
length userid $6 email $100;
userid='ABC090'; email='john.smith@xyz.com'; output;
userid='DEF090'; email='joanna.smythe@xyz.com'; output;
userid='HIJ090'; email=' '; output;
run;
* test with Name;
proc sql noprint;
create table test_A1 as
select *
,(index(upcase(email),'@XYZ.COM')
and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) or
(FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) )) as testvar
from T_USER_EMAIL;
quit;
proc sql noprint;
create table test_A2 as
select *
,(index(upcase(email),'@XYZ.COM')
and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) or
(FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) )) as testvar
from T_USER_EMAIL;
quit;
proc compare base=test_A1 compare=test_A2;
run;
Hello @aknight1,
I fully agree with @Kurt_Bremser and @PeterClemmensen and strongly suspect a SAS bug, a pretty substantial one indeed, which should be reported to Technical Support. Here is a boiled down example:
data have;
c=' ';
d='x';
run;
data want;
set have;
where find('x',c) or find('x',d);
run;
data want;
set have;
where find('y',c) or find('x',d);
run;
data want;
set have;
where find('x',c) | find(lowcase(left(' X')),d);
run;
data want;
set have;
where find('x',c) | find('x',d) | length(d)=2;
run;
Log (from SAS 9.4 TS1M2 on the X64_7PRO platform):
1 data have; 2 c=' '; 3 d='x'; 4 run; NOTE: The data set WORK.HAVE has 1 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 5 6 data want; 7 set have; 8 where find('x',c) or find('x',d); 9 run; NOTE: There were 0 observations read from the data set WORK.HAVE. WHERE not (not FIND('x', c)); NOTE: The data set WORK.WANT has 0 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 10 11 data want; 12 set have; 13 where find('y',c) or find('x',d); 14 run; NOTE: There were 1 observations read from the data set WORK.HAVE. WHERE FIND('y', c) or FIND('x', d); NOTE: The data set WORK.WANT has 1 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 15 16 data want; 17 set have; 18 where find('x',c) | find(lowcase(left(' X')),d); 19 run; NOTE: There were 0 observations read from the data set WORK.HAVE. WHERE not (not FIND('x', c)); NOTE: The data set WORK.WANT has 0 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 20 21 data want; 22 set have; 23 where find('x',c) | find('x',d) | length(d)=2; 24 run; NOTE: There were 0 observations read from the data set WORK.HAVE. WHERE FIND('x', c) or (LENGTH(d)=2); NOTE: The data set WORK.WANT has 0 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Obviously, dataset WANT should contain 1 observation in all four cases.
Preliminary conclusions:
Hi ...
Thanks for the replies so far ... for your information, I have now reported this to SAS Technical Support. In the ticket I have also notified them of this thread, so that they can gain any insight from the Users here.
Thanks all
Antony
Wow, this could ruin my Friday. : )
Here's an example of the bug. The compiler seems to have munged the where clause, completely ignoring the find('x',x);
1 data have;
2 x='x' ;
3 y='y' ;
4 run;
NOTE: The data set WORK.HAVE has 1 observations and 2 variables.
5
6 *bug: returns 0 and where clause is reported in log as ;
7 *WHERE not (not FIND('x', y));
8 *so its ignorning second part of clause ;
9 data want;
10 set have;
11 where find('x',y) or find('x',x);
12 run;
NOTE: There were 0 observations read from the data set WORK.HAVE.
WHERE not (not FIND('x', y));
NOTE: The data set WORK.WANT has 0 observations and 2 variables.
Changing the expression slightly makes it work:
14 *works as expected just by adding to expression; 15 data want; 16 set have; 17 where find('x',y) or (find('x',x)=1); 18 run; NOTE: There were 1 observations read from the data set WORK.HAVE. WHERE FIND('x', y) or (FIND('x', x)=1);
Adding a blank to the literal string makes it work
20 *works as expected ; 21 data want; 22 set have; 23 where find('x',y) or find(' x',x); 24 run; NOTE: There were 1 observations read from the data set WORK.HAVE. WHERE FIND('x', y) or FIND(' x', x); NOTE: The data set WORK.WANT has 1 observations and 2 variables.
If the second half of the where clause is find('y',y) you do not get the same bug as find('x',x).
26 *doesnt happen with variable named y? ; 27 data want; 28 set have; 29 where find('x',y) or find('y',y); 30 run; NOTE: There were 1 observations read from the data set WORK.HAVE. WHERE FIND('x', y) or FIND('y', y);
So it's somehow related to a pesky 'x' being mis-compiled? My head is hurting...
@Quentin wrote:
If the second half of the where clause is find('y',y) you do not get the same bug as find('x',x).
26 *doesnt happen with variable named y? ; 27 data want; 28 set have; 29 where find('x',y) or find('y',y); 30 run; NOTE: There were 1 observations read from the data set WORK.HAVE. WHERE FIND('x', y) or FIND('y', y);
So it's somehow related to a pesky 'x' being mis-compiled? My head is hurting...
I don't think it has to do with variable names, y is affected in the same way (using your dataset HAVE):
88 data want; 89 set have; 90 where find('y',' ') or find('y',y); 91 run; NOTE: There were 0 observations read from the data set WORK.HAVE. WHERE 0 /* an obviously FALSE WHERE clause */ ; NOTE: The data set WORK.WANT has 0 observations and 2 variables.
It still seems to me that the equality of the first arguments (after evaluation, if any) is necessary for the bug to occur (but not sufficient, as your example with "...=1" shows).
So if the conditions in the WHERE statement has multiple conditions that treat a function call (so far FIND,INDEX,FINDW,INDEX) as a boolean result AND where the first argument in all of them is the same constant value (even if the later arguments are different) then it is being optimized incorrectly.
data bad;
set sashelp.class ;
where index('Alice M',strip(name)) or index('Alice M',strip(sex));
run;
data good;
set sashelp.class ;
if index('Alice M',strip(name)) or index('Alice M',strip(sex));
run;
One fix is to explicitly convert the function call into a boolean expression.
data good2;
set sashelp.class ;
where 0^=index('Alice M',strip(name)) or 0^=index('Alice M',strip(sex));
run;
UPDATE
The where statement works properly in SAS 9.3, but not in SAS 9.4M3 or SAS 9.4M5.
@Tom wrote:
UPDATE
The where statement works properly in SAS 9.3, but not in SAS 9.4M3 or SAS 9.4M5.
This is very interesting (and good news for older projects).
Here's another example, still following the pattern @Tom has described, but now using a function that produces a character result (i.e. with non-blank values being evaluated as TRUE) and without any involvement of variables:
105 data have; 106 run; NOTE: The data set WORK.HAVE has 1 observations and 0 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 107 108 data want; 109 set have; 110 where cat('x','y') | cat('x'); 111 run; NOTE: There were 1 observations read from the data set WORK.HAVE. WHERE not (not 'xy');
Thanks @FreelanceReinh . I see your point. Glad it wasn't variable names.
I agree with you (and @Tom) looks like it happens when the first argument is a literal value, and the same same literal value is used in multiple calls to the function, e.g.:
78 data want; 79 set have; 80 where find('FOO',x) or find('FOO',y); 81 run; NOTE: There were 0 observations read from the data set WORK.HAVE. WHERE not (not FIND('FOO', x));
That's a scary bug.
Hi all ...
Just for your information ...
Today I was notified by SAS Tech Support that the Hot-Fix for this bug is now available.
Product: Base SAS 9.4_M3
Hot Fix: V01099
This hot fix is now available to customers on our external download site:
http://ftp.sas.com/techsup/download/hotfix/HF2/V01.html#V01099
V01099 is currently available on the following platform(s):
Mainframe: z/OS, z/OS 64-bit
PC: Windows, Windows for x64
UNIX: 64-bit Enabled Solaris, 64-bit Enabled AIX, HP-UX IPF, Linux for x64, Solaris for x64
Thanks to all who responded to my question, and encouraged me to contact SAS Tech Support.
Special thanks to @FreelanceReinh ... your concise examples were very helpful in convincing SAS Tech Support that this is a genuine problem that needs to be fixed.
Regards,
Antony
Good Morning Mr. Knight,
good news - we got the hot fix for SAS 9.4 M4.
Please find the links to the download area of the fix below.
Product: Base SAS 9.4_M4 Hot Fix: A3Z053 This hot fix is now available to customers on our external download site: http://ftp.sas.com/techsup/download/hotfix/HF2/A3Z.html#A3Z053
http://support.sas.com/kb/62/724.html
A3Z053 is currently available on the following platform(s):
Mainframe: z/OS
PC: Windows, Windows for x64
UNIX: 64-bit Enabled Solaris, 64-bit Enabled AIX, HP-UX IPF, Linux for x64, Solaris for x64
SAS Notes have been updated. A "Hot Fix tab" will appear on SAS Note(s) after the overnight update job has run to push the tab beyond the firewall. The availability of this hot fix will be announced within 48 hours on the SAS Hot Fix Communities Site.
The fix for SAS 9.4 M5 should follow soon.
I find the title and description of the problem note really confusing.
The example in the note is:
data test;
var1="abc001"; var2="LOOK FOR ME(HERE)"; output;
var1="def001"; var2="I AM NOT HERE "; output;
run;
data ex1;
set test;
where (findw("LOOK FOR ME",strip(upcase(var1)))) or
(findw("LOOK FOR ME",strip(upcase(scan(var2,1,'('))),' '));
run;
Minor problems: OR is an operator, not a condition. Operators do not have arguments. Clearly in the example, the two expressions on the WHERE statement are not identical.
What would be a better description? Something like "When a character function is used multiple times in WHERE statement expressions joined by an OR operator, with identical literal values as the first argument, incorrect results may be returned"? I don't love that either. Can someone write a better version of the support note?
Or did I misunderstand the scope of this bug?
Ouch. This goes deeper than I initially suspected, and can turn out to be the first really SERIOUS bug I've seen with respect to SAS. It's much more serious insofar as it can go undetected for quite some time and produce false results. Can hurt much more than a simple Segmentation Violation that causes a very noticeable crash (I've found one of those quite recently in the 9.4M5 code for SAS/SHARE, see http://support.sas.com/kb/62/346.html).
Not going to go into more testing but you could try simplifying the FINDW code by using the I and R (or T) options and drop the strip(upcase()) code.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.