quickbluefish
Obsidian | Level 7

Hi all - 

I am joining two tables, one of which has diagnostic events for patients and the other being a lookup table categorizing diagnosis codes as inclusions or exclusions.  The first is potentially quite large and the 2nd is generally going to be fairly small and contains some complete diagnosis codes and some truncated ones that might have one or more '%' signs in them to make them compatible with the SQL LIKE operator.  

Here's an example:

data claims;
infile cards dlm=',';
length ptid mpid 5 dx $10;
input ptid mpid dx;
cards;
1, 12, H41198
1, 15, C7781
2, 294, H677119
2, 300, G55181
;
run;

data codelist;
infile cards dlm=',';
length code_type $8 code $10 condition $15;
input code_type code condition;
cards;
DX_INCL,C%,CANCER
DX_EXCL,G55181,STUFF
;
run;

proc sql;
select a.ptid, a.mpid, a.dx, b.code_type, b.condition, b.code as lookup_code
from
	claims A
	inner join
	codelist B
	on a.dx LIKE trim(b.code);       /* <<-- I would like to remove the TRIM function */
quit;

The above works fine and produces output as expected:

ptid  mpid  dx      code_type  condition  lookup_code
1     15    C7781   DX_INCL    CANCER     C%
2     300   G55181  DX_EXCL    STUFF      G55181

...however, given that 'LIKE' is already expensive, processor-wise, when working with large tables, I would really like to at least get rid of the TRIM( ) function in the join.  But removing it from the join, even if I have pre-processed the CODELIST table like this:

data codelist;
set codelist;
code=trim(code);
run;

causes the SELECT to return only the exact match (code G55181) and not the C% code.  Maybe TRIM/STRIP/COMPRESS and similar don't really add much processor time and I should just ignore it, but I am really puzzled by this behavior.  Any ideas would be appreciated!

Thanks.

1 ACCEPTED SOLUTION

Tom
Super User

Not sure there is much of a performance hit from having the TRIM() function call there. 

If you prefer, you could just add trailing spaces to DX to make sure it has enough trailing spaces to match what you might have in your LIKE pattern.  A little testing shows you need to pad 2 spaces or increase the length by 3 to get the same results as using TRIM().  I think the value of 2 or 3 spaces is related to the relative lengths of the DX and CODE values, since 'C7781' is three bytes longer than 'C%'.

data claims;
  infile cards dlm=',';
  length ptid mpid 5 dx $10;
  input ptid mpid dx;
cards;
1, 15, C7781
;

data codelist;
  infile cards dlm=',';
  length code_type $8 code $10 condition $15;
  input code_type code condition;
cards;
DX_INCL,C%,CANCER
DX_EXCL,G55181,STUFF
;
%macro test(npad);

data claims2;
  length dx $%eval(10+&npad);
  set claims;
run;

proc sql;
select a.ptid, a.mpid, a.dx, b.code_type, b.condition, b.code as lookup_code
     , &npad as NPAD
     , a.dx like b.code as NOTRIM
     , a.dx like trim(b.code) as TRIM
     , a.dx||"%sysfunc(repeat(%str( ),&npad-1))" like b.code as PAD
from claims2 A
	 , codelist B
;      
quit;

%mend ;

%test(0);
%test(1);
%test(2);
%test(3);


11 REPLIES
biopharma
Quartz | Level 8
The length of the variable CODE is fixed. TRIM does remove the trailing blanks, but when the result is stored back into CODE it gets padded with blanks again, out to the pre-set length.

I don't have a solution for replacing the LIKE operator with anything less invasive.
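A minimal sketch of the point (the literal values here are just made-up examples, not the thread's data): assigning TRIM(CODE) back into CODE re-pads the value to its declared length, and the trailing blanks in a LIKE pattern have to be matched literally at the end of the value. An exact code like G55181 still matches because the value and the pattern carry the same trailing blanks, but the C% pattern has more trailing blanks than the value can supply, so only the trimmed pattern matches.

data demo;
  length dx $10 code $10;
  dx   = 'C7781';
  code = 'C%';
  code = trim(code);   /* re-padded to 10 bytes when stored back into CODE */
run;

proc sql;
  select dx like code       as padded_pattern    /* 0: pattern still ends in 8 blanks   */
       , dx like trim(code) as trimmed_pattern   /* 1: only the trimmed pattern matches */
  from demo;
quit;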
quickbluefish
Obsidian | Level 7
Interesting... though I don't know why the behavior is different for codes that end with a '%' vs. not - all of the codes are less than length $10, but the TRIM( ) is only necessary in the join for those that end with %. Thanks for your answer!
ballardw
Super User

@quickbluefish wrote:

Hi all - 

I am joining two tables, one of which has diagnostic events for patients and the other being a lookup table categorizing diagnosis codes as inclusions or exclusions.  The first is potentially quite large and the 2nd is generally going to be fairly small and contains some complete diagnosis codes and some truncated ones that might have one or more '%' signs in them to make them compatible with the SQL LIKE operator.  

 


I would question the " '%' signs in them to make them compatible".

If the code list values are relatively static I would seriously consider having custom formats. Then you get the value you need with put(codevariable, codeformat.) as newvariable, if the additional variable is even needed.
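A hedged sketch of that idea (the format name, data set names, and WHERE handling here are made up for illustration, not from this thread): build the format from the code list with a CNTLIN data set, then use PUT() instead of a join. This only covers the exact codes; the '%' wildcard patterns would still need separate handling.

data cntlin;
  set codelist(where=(index(code,'%')=0));   /* exact codes only - wildcards handled elsewhere */
  retain fmtname '$dxcond' type 'C';
  start = code;        /* the diagnosis code            */
  label = condition;   /* the value PUT() should return */
run;

proc format cntlin=cntlin;
run;

data claims_cat;
  length condition $15;
  set claims;
  condition = put(dx, $dxcond.);   /* returns DX itself when no code matches */
run;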

 

And for some things a multilabel format is hard to beat.

 

 

quickbluefish
Obsidian | Level 7
Thanks, yes, I would prefer a format as well, maybe even using the regex options, but the codelist is both messy and dynamic... and potentially too large for a format as well. Appreciate the reply!
ballardw
Super User

@quickbluefish wrote:
Thanks, yes, I would prefer a format as well, maybe even using the regex options, but the codelist is both messy and dynamic... and potentially too large for a format as well. Appreciate the reply!

Dynamic, if that means the values change often, is more of an issue and relates to maintenance. Though if the "new" values appear in an organized manner, it might be as simple as appending some data to a CNTLIN data set for Proc Format to use.
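A hedged sketch of that maintenance step, reusing the made-up names from the CNTLIN sketch above (NEW_CODES is an assumed data set with the same FMTNAME/START/LABEL layout): append the new rows and re-run PROC FORMAT, which replaces the existing format.

proc append base=cntlin data=new_codes;   /* duplicate START values would need to be removed first */
run;

proc format cntlin=cntlin;   /* rebuilds $dxcond. including the appended rows */
run;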

 

I am not sure how "messy" would be an issue unless you mean that a single value potentially gets mapped to multiple values, which I would think would make SQL joining questionable as well. A one-value-to-one-result mapping doesn't matter much for formats. And how large is "potentially too large"? SAS has multiple formats for date values that accept every date from years 1581 to 9999; that's over 3 million values (though I suspect some interesting things are done behind the scenes for these), and datetimes have potentially many more.

 

I suspect a large enough custom format might have performance issues but if it avoids having to go into odd gyrations with "like" and similar it might be worth the performance hit.


ChrisNZ
Tourmaline | Level 20

The like operator is expensive, as you stated, and I doubt trim() adds much processing time, if any.

What it does, though, is shorten the like operator's match-string.

It would not surprise me if a shorter like string actually saved time.

Expensive string matching operators, such as like, or functions, such as regular expressions, benefit greatly from shorter match-strings.

quickbluefish
Obsidian | Level 7
Thank you! Yeah, I am mostly wondering _why_ trim is necessary for codes that end with % and not for others (even when the length is the same, e.g., C5% vs. C56, to give a fake example), and why the problem cannot be resolved by pre-processing the CODELIST file -- trimming it there, either in a DATA step or a subquery -- rather than in the ON clause. I am wondering if it's got something to do with the fact that % also serves as an escape character, and it's somehow causing SAS to hang on to an extra bit of whitespace for, e.g., C5% and not C56 ('C5% ' vs. 'C56'), forcing you to explicitly trim it off.

Unrelated - I saw your comment about working with SAS EG being like walking on glass shards (in your post about file locks) - could not agree more. It inspires such atrocious programming habits in virtually every new hire we've had.
mkeintz
PROC Star

If all the % signs appear as the rightmost non-blank character of CODE, then I think you'll be better off using a hash lookup, with a new variable TRIM_CODE (the code stripped of its % sign) added to the CODELIST dataset.  Then for each incoming CLAIMS record do a full-length lookup of DX (using L for the length of the DX value) against a hash object built from CODELIST, using TRIM_CODE as the lookup key.  If not found, shorten L one character at a time until either a lookup succeeds (i.e. the hash FIND method returns zero) or L=0 (meaning no matching TRIM_CODE was found):

 

data claims;
infile cards dlm=',';
length ptid mpid 5 dx $10;
input ptid mpid dx;
cards;
1, 12, H41198
1, 15, C7781
2, 294, H677119
2, 300, G55181
;
run;

data codelist;
infile cards dlm=',';
length code_type $8 code $10 condition $15;
input code_type code condition;
  trim_code=tranwrd(code,'% ',' ');   /* drop the trailing % so 'C%' becomes 'C' */
cards;
DX_INCL,C%,CANCER
DX_EXCL,G55181,STUFF
;
run;

data want (drop=L);
  set claims codelist(obs=0);   /* obs=0 just adds the CODELIST variables to the PDV */
  if _n_=1 then do;
    declare hash h (dataset:'codelist');
      h.definekey('trim_code');
      h.definedata(all:'Y');
      h.definedone();
  end;
  /* try the full DX first, then progressively shorter leading substrings */
  do L=length(dx) to 1 by -1 until(h.find(key:substr(dx,1,L))=0);
  end;
  if L>0;   /* keep only records where some leading substring matched a code */
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
quickbluefish
Obsidian | Level 7
That is really interesting - I will give it a try and compare processing times. I also like the 'CALL ZERO' idea - I have had this thought many times and never realized that this site had anything like a ballot to submit ideas. Thanks!
ChrisNZ
Tourmaline | Level 20

You have chosen a solution where the like string is longer.

Not too sure how you expect the run time to be faster.

Rather, I'd expect the fastest time might be reached by trimming both sides of the like operator.
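For what it's worth, a sketch of that variant against the original tables (only the ON clause changes from the query in the question):

proc sql;
select a.ptid, a.mpid, a.dx, b.code_type, b.condition, b.code as lookup_code
from
	claims A
	inner join
	codelist B
	on trim(a.dx) LIKE trim(b.code);   /* both the value and the pattern are trimmed */
quit;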

Please report the results. I am curious now.

 
