quickbluefish
Obsidian | Level 7

Hi all - 

I am joining two tables, one of which has diagnostic events for patients and the other being a lookup table categorizing diagnosis codes as inclusions or exclusions.  The first is potentially quite large and the 2nd is generally going to be fairly small and contains some complete diagnosis codes and some truncated ones that might have one or more '%' signs in them to make them compatible with the SQL LIKE operator.  

Here's an example:

data claims;
infile cards dlm=',';
length ptid mpid 5 dx $10;
input ptid mpid dx;
cards;
1, 12, H41198
1, 15, C7781
2, 294, H677119
2, 300, G55181
;
run;

data codelist;
infile cards dlm=',';
length code_type $8 code $10 condition $15;
input code_type code condition;
cards;
DX_INCL,C%,CANCER
DX_EXCL,G55181,STUFF
;
run;

proc sql;
select a.ptid, a.mpid, a.dx, b.code_type, b.condition, b.code as lookup_code
from
	claims A
	inner join
	codelist B
	on a.dx LIKE trim(b.code);       /* <<-- I would like to remove the TRIM function */
quit;

The above works fine and produces output as expected:

ptid  mpid  dx      code_type  condition  lookup_code
1     15    C7781   DX_INCL    CANCER     C%
2     300   G55181  DX_EXCL    STUFF      G55181

...however, given that 'LIKE' is already expensive, processor-wise, when working with large tables, I would really like to at least get rid of the TRIM( ) function in the join.  But removing it from the join, even if I have pre-processed the CODELIST table like this:

data codelist;
set codelist;
code=trim(code);
run;

causes the SELECT to return only the exact match (code G55181) and not the C% code.  Maybe TRIM/STRIP/COMPRESS and similar don't really add much processor time and I should just ignore it, but I am really puzzled by this behavior.  Any ideas would be appreciated!

Thanks.

1 ACCEPTED SOLUTION

Tom
Super User

Not sure there is much of a performance hit from having the TRIM() function call there. 

If you prefer, you could just add trailing spaces to DX to make sure it has enough trailing spaces to match what you might have in your LIKE pattern.  A little testing shows you need to pad 2 spaces or increase the length by 3 to get the same results as using TRIM().  I think the value of 2 or 3 spaces is related to the relative lengths of the DX and CODE values, since 'C7781' is three bytes longer than 'C%'.

data claims;
  infile cards dlm=',';
  length ptid mpid 5 dx $10;
  input ptid mpid dx;
cards;
1, 15, C7781
;

data codelist;
  infile cards dlm=',';
  length code_type $8 code $10 condition $15;
  input code_type code condition;
cards;
DX_INCL,C%,CANCER
DX_EXCL,G55181,STUFF
;
%macro test(npad);

data claims2;
  length dx $%eval(10+&npad);
  set claims;
run;

proc sql;
select a.ptid, a.mpid, a.dx, b.code_type, b.condition, b.code as lookup_code
     , &npad as NPAD
     , a.dx like b.code as NOTRIM
     , a.dx like trim(b.code) as TRIM
     , a.dx||"%sysfunc(repeat(%str( ),&npad-1))" like b.code as PAD
from claims2 A
	 , codelist B
;      
quit;

%mend ;

%test(0);
%test(1);
%test(2);
%test(3);


11 REPLIES
biopharma
Quartz | Level 8
The length of the variable CODE is fixed. TRIM does remove the trailing blanks, but when the result is stored back into CODE it gets padded with blanks again, out to the pre-set length.

I don't have a solution for replacing the LIKE operator with anything less invasive.
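A minimal sketch of the point (the literal values here are just made-up examples, not the thread's data): assigning TRIM(CODE) back into CODE re-pads the value to its declared length, and the trailing blanks in a LIKE pattern have to be matched literally at the end of the value. An exact code like G55181 still matches because the value and the pattern carry the same trailing blanks, but the C% pattern has more trailing blanks than the value can supply, so only the trimmed pattern matches.

data demo;
  length dx $10 code $10;
  dx   = 'C7781';
  code = 'C%';
  code = trim(code);   /* re-padded to 10 bytes when stored back into CODE */
run;

proc sql;
  select dx like code       as padded_pattern    /* 0: pattern still ends in 8 blanks   */
       , dx like trim(code) as trimmed_pattern   /* 1: only the trimmed pattern matches */
  from demo;
quit;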
quickbluefish
Obsidian | Level 7
Interesting... though I don't know why the behavior is different for codes that end with a '%' vs. not - all of the codes are less than length $10, but the TRIM( ) is only necessary in the join for those that end with %. Thanks for your answer!
ballardw
Super User

@quickbluefish wrote:

Hi all - 

I am joining two tables, one of which has diagnostic events for patients and the other being a lookup table categorizing diagnosis codes as inclusions or exclusions.  The first is potentially quite large and the 2nd is generally going to be fairly small and contains some complete diagnosis codes and some truncated ones that might have one or more '%' signs in them to make them compatible with the SQL LIKE operator.  

 


I would question the " '%' signs in them to make them compatible".

If the code list values are relatively static I would seriously consider having custom formats. Then you get the value you need with put(codevariable, codeformat.) as newvariable, if the additional variable is even needed.
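A hedged sketch of that idea (the format name, data set names, and WHERE handling here are made up for illustration, not from this thread): build the format from the code list with a CNTLIN data set, then use PUT() instead of a join. This only covers the exact codes; the '%' wildcard patterns would still need separate handling.

data cntlin;
  set codelist(where=(index(code,'%')=0));   /* exact codes only - wildcards handled elsewhere */
  retain fmtname '$dxcond' type 'C';
  start = code;        /* the diagnosis code            */
  label = condition;   /* the value PUT() should return */
run;

proc format cntlin=cntlin;
run;

data claims_cat;
  length condition $15;
  set claims;
  condition = put(dx, $dxcond.);   /* returns DX itself when no code matches */
run;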

 

And for some things a multilabel format is hard to beat.

 

 

quickbluefish
Obsidian | Level 7
Thanks, yes, I would prefer a format as well, maybe even using the regex options, but the codelist is both messy and dynamic... and potentially too large for a format as well. Appreciate the reply!
ballardw
Super User

@quickbluefish wrote:
Thanks, yes, I would prefer a format as well, maybe even using the regex options, but the codelist is both messy and dynamic... and potentially too large for a format as well. Appreciate the reply!

Dynamic, if that means the values change often, is more of an issue and relates to maintenance. Though if the "new" values appear in an organized manner, it might be as simple as appending some data to a CNTLIN data set for Proc Format to use.
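A hedged sketch of that maintenance step, reusing the made-up names from the CNTLIN sketch above (NEW_CODES is an assumed data set with the same FMTNAME/START/LABEL layout): append the new rows and re-run PROC FORMAT, which replaces the existing format.

proc append base=cntlin data=new_codes;   /* duplicate START values would need to be removed first */
run;

proc format cntlin=cntlin;   /* rebuilds $dxcond. including the appended rows */
run;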

 

I am not sure how "messy" would be an issue unless you mean that a single value potentially gets mapped to multiple values, which I would think would make SQL joining questionable as well. A one-value-to-one-result mapping doesn't matter much for formats. And how large is "potentially too large"? SAS has multiple formats for date values that accept every date from years 1581 to 9999; that's over 3 million values (though I suspect some interesting things are done behind the scenes for these), and datetimes have potentially many more.

 

I suspect a large enough custom format might have performance issues but if it avoids having to go into odd gyrations with "like" and similar it might be worth the performance hit.


ChrisNZ
Tourmaline | Level 20

The like operator is expensive, as you stated, and I doubt trim() adds much processing time, if any.

What it does, though, is shorten the like operator's match-string.

It would not surprise me if a shorter like string actually saved time.

Expensive string matching operators, such as like, or functions, such as regular expressions, benefit greatly from shorter match-strings.

quickbluefish
Obsidian | Level 7
Thank you! Yeah, I am mostly wondering _why_ trim is necessary for codes that end with % and not for others (even when the length is the same, e.g., C5% vs. C56, to give a fake example), and why the problem cannot be resolved by pre-processing the CODELIST file -- trimming it there, either in a DATA step or a subquery -- rather than in the ON clause. I am wondering if it's got something to do with the fact that % also serves as an escape character, and it's somehow causing SAS to hang on to an extra bit of whitespace for, e.g., C5% and not C56 ('C5% ' vs. 'C56'), forcing you to explicitly trim it off.

Unrelated - I saw your comment about working with SAS EG being like walking on glass shards (in your post about file locks) - could not agree more. It inspires such atrocious programming habits in virtually every new hire we've had.
mkeintz
PROC Star

If all the % signs appear as the rightmost non-blank character of CODE, then I think you'll be better off using a hash lookup, with a new variable TRIM_CODE (the code stripped of its % sign) added to the CODELIST dataset.  Then for each incoming CLAIMS record do a full-length lookup of DX (using L for the length of the DX value) against a hash object built from CODELIST, using TRIM_CODE as the lookup key.  If not found, shorten L one character at a time until either a lookup succeeds (i.e. the hash FIND method returns zero) or L=0 (meaning no matching TRIM_CODE was found):

 

data claims;
infile cards dlm=',';
length ptid mpid 5 dx $10;
input ptid mpid dx;
cards;
1, 12, H41198
1, 15, C7781
2, 294, H677119
2, 300, G55181
;
run;

data codelist;
infile cards dlm=',';
length code_type $8 code $10 condition $15;
input code_type code condition;
  trim_code=tranwrd(code,'% ',' ');   /* drop the trailing % so 'C%' becomes 'C' */
cards;
DX_INCL,C%,CANCER
DX_EXCL,G55181,STUFF
;
run;

data want (drop=L);
  set claims codelist(obs=0);   /* obs=0 just adds the CODELIST variables to the PDV */
  if _n_=1 then do;
    declare hash h (dataset:'codelist');
      h.definekey('trim_code');
      h.definedata(all:'Y');
      h.definedone();
  end;
  /* try the full DX first, then progressively shorter leading substrings */
  do L=length(dx) to 1 by -1 until(h.find(key:substr(dx,1,L))=0);
  end;
  if L>0;   /* keep only records where some leading substring matched a code */
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
quickbluefish
Obsidian | Level 7
That is really interesting - I will give it a try and compare processing times. I also like the 'CALL ZERO' idea - I have had this thought many times and never realized that this site had anything like a ballot to submit ideas. Thanks!
ChrisNZ
Tourmaline | Level 20

You have chosen a solution where the like string is longer.

Not too sure how you expect the run time to be faster.

Rather, I'd expect the fastest time might be reached by trimming both sides of the like operator.
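For what it's worth, a sketch of that variant against the original tables (only the ON clause changes from the query in the question):

proc sql;
select a.ptid, a.mpid, a.dx, b.code_type, b.condition, b.code as lookup_code
from
	claims A
	inner join
	codelist B
	on trim(a.dx) LIKE trim(b.code);   /* both the value and the pattern are trimmed */
quit;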

Please report the results. I am curious now.

 
