DATA Step, Macro, Functions and more

Curious about FINDW / INDEXW results

Reply
Occasional Contributor
Posts: 10

Curious about FINDW / INDEXW results

Hello,

 

While testing a macro to return email addresses from a metadata dataset, I get (seemingly) inconsistent results from the FINDW and INDEXW functions. When running the code given below, I am really curious why the datasets A1 and A2 are returning different results, and why the datasets B1 and B2 are returning different results. It seems like the order in which I check things influences the result.

 

My feeling is that A1 and A2 should be same, and that B1 and B2 should be same. Can anybody give me a hint why they are not? 

 

I get:

NOTE: Table WORK.TEST_A1 created, with 1 rows and 2 columns.

NOTE: Table WORK.TEST_A2 created, with 0 rows and 2 columns.

NOTE: Table WORK.TEST_B1 created, with 0 rows and 2 columns.

NOTE: Table WORK.TEST_B2 created, with 1 rows and 2 columns.

 

I am running in SAS EG 7.1 on UNIX, and in SAS 9.4 on WINDOWS.

 

data T_USER_EMAIL;
   length userid $6 email $100;
   userid='ABC090'; email='john.smith@xyz.com';    output;
   userid='DEF090'; email='joanna.smythe@xyz.com'; output;
   userid='HIJ090'; email=' ';                     output;
run;

* test with Name;
proc sql noprint;
   create table test_A1 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) or
            (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) ) ;
quit;

proc sql noprint;
   create table test_A2 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) or 
            (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) ) ;
quit;


* test with UserId;
proc sql noprint;
   create table test_B1 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and ( (FINDW(" DEF090 xxx",strip(upcase(scan(email,1,'@'))),' ')) or
            (FINDW(" DEF090 xxx",strip(upcase(userid)))) ) ;
quit;

proc sql noprint;
   create table test_B2 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and ( (FINDW(" DEF090 xxx",strip(upcase(userid)))) or 
            (FINDW(" DEF090 xxx",strip(upcase(scan(email,1,'@'))),' ')) ) ;
quit;

 

 

 

(Just fyi, I want Users to be able to provide UserIds and/or Names to the macro ... that is why the code generated by the macro is checking in both the userid and email columns) 

Super User
Posts: 10,283

Re: Curious about FINDW / INDEXW results

I think you're on to something here. Since the condition works when the first part of the "or" returns a non-zero value, but fails when that is zero, I have a suspicion that the condition optimizer does not honor the brackets as it should and breaks off too early.

Clearly a case for SAS TS, IMO.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
PROC Star
Posts: 1,286

Re: Curious about FINDW / INDEXW results

I agree, this seems strange. And seems to limit itself to the WHERE Clause. The two data sets created with the following code are identical

 

data T_USER_EMAIL;
   length userid $6 email $100;
   userid='ABC090'; email='john.smith@xyz.com';    output;
   userid='DEF090'; email='joanna.smythe@xyz.com'; output;
   userid='HIJ090'; email=' ';                     output;
run;

* test with Name;
proc sql noprint;
   create table test_A1 as
   select *
         ,(index(upcase(email),'@XYZ.COM')
      and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) or
            (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) )) as testvar
     from T_USER_EMAIL;
quit;

proc sql noprint;
   create table test_A2 as
   select *
         ,(index(upcase(email),'@XYZ.COM')
      and ( (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) or 
            (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) )) as testvar
     from T_USER_EMAIL;
quit;

proc compare base=test_A1 compare=test_A2;
run;
Trusted Advisor
Posts: 1,259

Re: Curious about FINDW / INDEXW results

Hello @aknight1,

 

I fully agree with @KurtBremser and @draycut and strongly suspect a SAS bug, a pretty substantial one indeed, which should be reported to Technical Support. Here is a boiled down example:

data have;
c=' ';
d='x';
run;

data want;
set have;
where find('x',c) or find('x',d);
run;

data want;
set have;
where find('y',c) or find('x',d);
run;

data want;
set have;
where find('x',c) | find(lowcase(left(' X')),d);
run;

data want;
set have;
where find('x',c) | find('x',d) | length(d)=2;
run;

Log (from SAS 9.4 TS1M2 on the X64_7PRO platform):

1    data have;
2    c=' ';
3    d='x';
4    run;

NOTE: The data set WORK.HAVE has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


5
6    data want;
7    set have;
8    where find('x',c) or find('x',d);
9    run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
      WHERE not (not FIND('x', c));
NOTE: The data set WORK.WANT has 0 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


10
11   data want;
12   set have;
13   where find('y',c) or find('x',d);
14   run;

NOTE: There were 1 observations read from the data set WORK.HAVE.
      WHERE FIND('y', c) or FIND('x', d);
NOTE: The data set WORK.WANT has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


15
16   data want;
17   set have;
18   where find('x',c) | find(lowcase(left(' X')),d);
19   run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
      WHERE not (not FIND('x', c));
NOTE: The data set WORK.WANT has 0 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


20
21   data want;
22   set have;
23   where find('x',c) | find('x',d) | length(d)=2;
24   run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
      WHERE FIND('x', c) or (LENGTH(d)=2);
NOTE: The data set WORK.WANT has 0 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Obviously, dataset WANT should contain 1 observation in all four cases.

 

Preliminary conclusions:

 

  • The bug is not limited to FINDW and INDEXW.
  • It seems unrelated to PROC SQL.
  • The WHERE condition, as shown in the NOTEs, is inappropriately modified ("optimized"?).
  • The same modification occurs if "or" is replaced by "and" (not shown above).
  • Different types of incorrect modifications can take place (see first vs. last example).
  • The bug seems to occur as soon as the first arguments of the function calls (with the same function) result in the same string.

 

Occasional Contributor
Posts: 10

Re: Curious about FINDW / INDEXW results

Posted in reply to FreelanceReinhard

Hi ...

 

Thanks for the replies so far ... for your information, I have now reported this to SAS Technical Support. In the ticket I have also notified them of this thread, so that they can gain any insight from the Users here.

 

Thanks all Smiley Happy

Antony 

PROC Star
Posts: 1,460

Re: Curious about FINDW / INDEXW results

Wow, this could ruin my Friday. : )

 

Here's an example of the bug.  The compiler seems to have munged the where clause, completely ignoring the find('x',x);

 

1    data have;
2    x='x' ;
3    y='y' ;
4    run;

NOTE: The data set WORK.HAVE has 1 observations and 2 variables.

5
6    *bug: returns 0 and where clause is reported in log as  ;
7    *WHERE not (not FIND('x', y));
8    *so its ignorning second part of clause ;
9    data want;
10   set have;
11   where find('x',y) or find('x',x);
12   run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
      WHERE not (not FIND('x', y));
NOTE: The data set WORK.WANT has 0 observations and 2 variables.

Changing the expression slightly makes it work:

14   *works as expected just by adding to expression;
15   data want;
16   set have;
17   where find('x',y) or (find('x',x)=1);
18   run;

NOTE: There were 1 observations read from the data set WORK.HAVE.
      WHERE FIND('x', y) or (FIND('x', x)=1);

Adding a blank to the literal string makes it work

20   *works as expected ;
21   data want;
22   set have;
23   where find('x',y) or find(' x',x);
24   run;

NOTE: There were 1 observations read from the data set WORK.HAVE.
      WHERE FIND('x', y) or FIND(' x', x);
NOTE: The data set WORK.WANT has 1 observations and 2 variables.

If the second half of the where clause is find('y',y) you do not get the same bug as find('x',x). 

26   *doesnt happen with variable named y?  ;
27   data want;
28   set have;
29   where find('x',y) or find('y',y);
30   run;

NOTE: There were 1 observations read from the data set WORK.HAVE.
      WHERE FIND('x', y) or FIND('y', y);

 

So it's somehow related to a pesky  'x' being mis-compiled?  My  head is hurting... 

Trusted Advisor
Posts: 1,259

Re: Curious about FINDW / INDEXW results


@Quentin wrote:

 

If the second half of the where clause is find('y',y) you do not get the same bug as find('x',x). 

26   *doesnt happen with variable named y?  ;
27   data want;
28   set have;
29   where find('x',y) or find('y',y);
30   run;

NOTE: There were 1 observations read from the data set WORK.HAVE.
      WHERE FIND('x', y) or FIND('y', y);

 

So it's somehow related to a pesky  'x' being mis-compiled?  My  head is hurting... 


I don't think it has to do with variable names, y is affected in the same way (using your dataset HAVE):

 

88   data want;
89   set have;
90   where find('y',' ') or find('y',y);
91   run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
      WHERE 0 /* an obviously FALSE WHERE clause */ ;
NOTE: The data set WORK.WANT has 0 observations and 2 variables.

It still seems to me that the equality of the first arguments (after evaluation, if any) is necessary for the bug to occur (but not sufficient, as your example with "...=1" shows).

 

Super User
Super User
Posts: 8,127

Re: Curious about FINDW / INDEXW results

[ Edited ]
Posted in reply to FreelanceReinhard

So if the conditions in the WHERE statement has multiple conditions that treat a function call (so far FIND,INDEX,FINDW,INDEX) as a boolean result AND where the first argument in all of them is the same constant value (even if the later arguments are different) then it is being optimized incorrectly.

 

data bad;
  set sashelp.class ;
  where index('Alice M',strip(name)) or index('Alice M',strip(sex));
run;

data good;
  set sashelp.class ;
  if index('Alice M',strip(name)) or index('Alice M',strip(sex));
run;

One fix is to explicitly convert the function call into a boolean expression.

data good2;
  set sashelp.class ;
  where 0^=index('Alice M',strip(name)) or 0^=index('Alice M',strip(sex));
run;

UPDATE

 The where statement works properly in SAS 9.3, but not in SAS 9.4M3 or SAS 9.4M5.

Trusted Advisor
Posts: 1,259

Re: Curious about FINDW / INDEXW results


@Tom wrote:

 

UPDATE

 The where statement works properly in SAS 9.3, but not in SAS 9.4M3 or SAS 9.4M5.


This is very interesting (and good news for older projects).

 

Here's another example, still following the pattern @Tom has described, but now using a function that produces a character result (i.e. with non-blank values being evaluated as TRUE) and without any involvement of variables:

105  data have;
106  run;

NOTE: The data set WORK.HAVE has 1 observations and 0 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


107
108  data want;
109  set have;
110  where cat('x','y') | cat('x');
111  run;

NOTE: There were 1 observations read from the data set WORK.HAVE.
      WHERE not (not 'xy');
PROC Star
Posts: 1,460

Re: Curious about FINDW / INDEXW results

Posted in reply to FreelanceReinhard

Thanks @FreelanceReinhard .  I see your point.  Glad it wasn't variable names.

 

I agree with you (and @Tom) looks like it happens when the first argument is a literal value, and the same same literal value is used in  multiple calls to the function, e.g.:

 

78   data want;
79   set have;
80   where find('FOO',x) or find('FOO',y);
81   run;

NOTE: There were 0 observations read from the data set WORK.HAVE.
      WHERE not (not FIND('FOO', x));

That's a scary bug.

Super User
Posts: 10,283

Re: Curious about FINDW / INDEXW results

Posted in reply to FreelanceReinhard

Ouch. This goes deeper than I initially suspected, and can turn out to be the first really SERIOUS bug I've seen with respect to SAS. It's much more serious insofar as it can go undetected for quite some time and produce false results. Can hurt much more than a simple Segmentation Violation that causes a very noticeable crash (I've found one of those quite recently in the 9.4M5 code for SAS/SHARE, see http://support.sas.com/kb/62/346.html).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Super User
Posts: 13,583

Re: Curious about FINDW / INDEXW results

Not going to go into more testing but you could try simplifying the FINDW code by using the I and R (or T) options and drop the strip(upcase()) code.

Regular Contributor
Posts: 247

Re: Curious about FINDW / INDEXW results

Try removing the parentheses around the two findw conditions....it seems to work for me.

 

 

data T_USER_EMAIL;
   length userid $6 email $100;
   userid='ABC090'; email='john.smith@xyz.com';    output;
   userid='DEF090'; email='joanna.smythe@xyz.com'; output;
   userid='HIJ090'; email=' ';                     output;
run;

* test with Name;
proc sql noprint;
   create table test_A1 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and  (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' ')) or
            (FINDW(" JOHN.SMITH xxx",strip(upcase(userid))))  ;
quit;

proc sql noprint;
   create table test_A2 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and  (FINDW(" JOHN.SMITH xxx",strip(upcase(userid)))) or 
            (FINDW(" JOHN.SMITH xxx",strip(upcase(scan(email,1,'@'))),' '))  ;
quit;


* test with UserId;
proc sql noprint;
   create table test_B1 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and  (FINDW("DEF090 xxx",strip(upcase(scan(email,1,'@'))),' ')) or
            (FINDW("DEF090 xxx",strip(upcase(userid))))  ;
quit;

proc sql noprint;
   create table test_B2 as
   select *
     from T_USER_EMAIL
    where index(upcase(email),'@XYZ.COM')
      and  (FINDW("DEF090 xxx",strip(upcase(userid)))) or 
            (FINDW("DEF090 xxx",strip(upcase(scan(email,1,'@'))),' '))  ;
quit;
Ask a Question
Discussion stats
  • 12 replies
  • 203 views
  • 11 likes
  • 8 in conversation