Solved: Handling CASE sensitivity dynamically

keen_sas · Posted 05-21-2020 04:17 AM

As part of the standardization some codes are provided by user which are not standard to use and these have to be converted in to proper SAS algorthim to avoid case sensitivity without manual intervention. HAVE variables are provided by user and these have to be converted into WANT variables to avoid case sensitivity. In the below code for example DECOD have values COMPLETED in different cases , this has to be standardization.

data have;
length have1 have2 have3 have4 $1000;
have1='if strip(DECOD)="COMPLETED"  then COMPL = "Yes";	' ;
want1='IF UPCASE(STRIP(DECOD))="COMPLETED"  then COMPL = "Yes";	' ;
have2='else if strip(DECOD) in ("Not Completed" "PartIAL") and strip(EXPOSED)="N"  then COMPL = "Not applicable";';
want2='ELSE IF UPCASE(STRIP(DECOD)) IN ("NOT COMPLETED" "PARTIAL") AND STRIP(EXPOSED)="N"  then COMPL = "Not applicable";';
have3='else if strip(DECOD) in ("NOT completed" " ") and strip(EXPOSED)="y"  then COMPL = "Ongoing";';
want3='ELSE IF UPCASE(STRIP(DECOD)) IN ("NOT COMPLETED" " ") AND UPCASE(STRIP(EXPOSED))="Y"  then COMPL = "Ongoing";';
have4='else COMPL =  "No ("||strip(DECOD)||")";';
want4='else COMPL =  "No ("||strip(DECOD)||")";';
run ;


data want;
 set have ; 
 array var_{4} $ have1 have2 have3 have4  ;
 do i = 1 to 4 ;
 	if find(var_{i},"then","i")>0 then do ;
	 if find(compress(var_{i}),')="')>0 then do;
	 string1=upcase(substr(var_{i},1,find(var_{i},"then","i")-1));
	 string2=substr(var_{i},find(var_{i},"then","i"));
	 substr1=tranwrd(tranwrd(scan(string1,1,'='),'STRIP','UPCASE(STRIP'),')','))'); 
	 *temporarily replacing due to above issue;
	 substr1=tranwrd(substr1,'))))',')))');
	 substr2=scan(string1,2,'=');
	 var_{i}=strip(substr1)||' = '||strip(substr2)||' '||strip(string2); end;
	end ;end;
run;

From the above code I am splitting the code in to two parts based on the keyword THEN and converting left side completely into upcase and then applying upcase function for variables and converting closed parenthesis into double closed parenthesis. But the issue occurs when the IN operator is present and close parenthesis of IN operator is also getting converted into double )) which is wrong. How to
avoid such instance or any other alternative to get the desired result.

The issue is when IN operator is present

ELSE IF UPCASE(STRIP(DECOD)) IN ("NOT COMPLETED" " ")) AND UPCASE(STRIP(EXPOSED)) = "Y" then COMPL = "Ongoing";

RichardDeVen · Posted 05-22-2020 08:07 AM

If you rerun the data through same program the doubling will occur because on the second run through the once processed data that contains an UPCASE(STRIP(DECODE)) will become UPCASE(UPCASE(STRIP(DECODE)))

As for the meanings in the regular expression patterns

lexpr = prxchange ( 's/(\bstrip\b)\s*($.+?$)/UPCASE(\U$1$2)/i', -1, lexpr);

arg1, regex search pattern and replace instructions
- s - substitute, as in s/ find / replace /
arg2, -1, globally, as in everywhere in the arg3 string
arg3, string to process
(\bstrip\b)\s*($.+?$), search pattern. find the word strip followed by 0 or more spaces followed by an open parenthesis, one or more characters and a close parenthesis. The pattern does NOT check for nested parentheses.
- \b word boundary
- (, open parenthesis starting capture group #1. the contents of a capture group can be used in the replacement
- strip, literally, the letters of the word "strip"
- ), close parenthesis closing capture group #1.
- \s*, zero or more whitespace characters
- (, open parenthesis starting capture group #2. the contents of a capture group can be used in the replacement
- $, an escaped open parenthesis means a literal ( character
- .+?, . any character, + one or more times, ? the 'or more' is non-greedy and will stop at the first next anchor point located by the remaining part of the pattern
- $, an escaped close parenthesis means a literal ) character, as would be expected for STRIP(decOd). Will not detect properly nestings such as STRIP(substr(decode,3,5))
- ), close parenthesis closing capture group #2.
UPCASE(\U$1$2), how to replace what was found
- UPCASE, literally the letters UPCASE
- (, literally an open parenthesis (as would be need for an UPCASE function call in source code)
- \U, start instruction to upper case whatever follows, ends at either \E or end of replacement instructions
- $1, the first capture group, which would be literally STRIP
- $2, the second capture group, which will be what is inside the original STRIP()
- ), literally a close parenthesis (as would be needed to close an UPCASE function call in source code)
i, ignore case during search

lexpr = prxchange ( 's/(["''])(.*?)\1/$1\U$2\E$1/', -1, lexpr);, find a literally quoted string and replace it with the uppercase version.

Search
- (["'']), Capture group 1, starts with either " or '
  - [], list of characters, any which will match
- (.*?), Capture group 2, any number of any characters, non-greedy, stops at first next found
- \1, next find is the contents of the first capture group, which is the closing " or '
Replace
- $1\U$2\E$1, the original string, uppercased

What does all the complication mean in either your tranwrd or my regexes? Probably that they are a bit of a Rube Goldberg machine that can be simplified using better processes? Perhaps you need to look alternatives per @ballardw . Better data means less code. Does program source code necessarily have to be in data set variables ?

View solution in original post

RichardDeVen · Posted 05-21-2020 06:37 AM

Use regular expressions.

Example:

data have;
length have1 have2 have3 have4 $1000;
have1='if strip(DECOD)="COMPLETED"  then COMPL = "Yes";	' ;
want1='IF UPCASE(STRIP(DECOD))="COMPLETED"  then COMPL = "Yes";	' ;
have2='else if strip(DECOD) in ("Not Completed" "PartIAL") and strip(EXPOSED)="N"  then COMPL = "Not applicable";';
want2='ELSE IF UPCASE(STRIP(DECOD)) IN ("NOT COMPLETED" "PARTIAL") AND STRIP(EXPOSED)="N"  then COMPL = "Not applicable";';
have3='else if strip(DECOD) in ("NOT completed" " ") and strip(EXPOSED)="y"  then COMPL = "Ongoing";';
want3='ELSE IF UPCASE(STRIP(DECOD)) IN ("NOT COMPLETED" " ") AND UPCASE(STRIP(EXPOSED))="Y"  then COMPL = "Ongoing";';
have4='else COMPL =  "No ("||strip(DECOD)||")";';
want4='else COMPL =  "No ("||strip(DECOD)||")";';
run ;

data want;
  set have ; 
  array stmt have1-have4;

  do i = 1 to dim(stmt);
    putlog stmt(i)=;

    rxid = prxparse ( '/(.+)\bthen\b(.*)/i' ); * constant patterns compiled only once;

    if prxmatch ( rxid, stmt(i) ) then do;
      lexpr = prxposn (rxid, 1, stmt(i) );  * l is for Logic!;
      lstmt = prxposn (rxid, 2, stmt(i) );

      * presume no nested expr in strip();
      lexpr = prxchange ( 's/(\bstrip\b)\s*(\(.+?\))/UPCASE(\U$1$2)/i', -1, lexpr);
      lexpr = prxchange ( 's/(["''])(.*?)\1/$1\U$2\E$1/', -1, lexpr);

      if LENGTH ( catx(' then ', lexpr, lstmt) ) <= LENGTHC (stmt(i)) then 
        stmt(i) = catx(' then ', lexpr, lstmt);
      else do;
        put 'ERROR: Modified statement exceeds available space. ' _n_= i=;
        ABORT CANCEL;
      end;
   end;
   putlog stmt(i)=;
  
 end;

 drop i rxid lexpr lstmt;
run;

Log

have1=if strip(DECOD)="COMPLETED"  then COMPL = "Yes";
have1=if UPCASE(STRIP(DECOD))="COMPLETED" then COMPL = "Yes";

have2=else if strip(DECOD) in ("Not Completed" "PartIAL") and strip(EXPOSED)="N"  then COMPL = "Not applicable";
have2=else if UPCASE(STRIP(DECOD)) in ("NOT COMPLETED" "PARTIAL") and UPCASE(STRIP(EXPOSED))="N" then COMPL = "Not applicable";

have3=else if strip(DECOD) in ("NOT completed" " ") and strip(EXPOSED)="y"  then COMPL = "Ongoing";
have3=else if UPCASE(STRIP(DECOD)) in ("NOT COMPLETED" " ") and UPCASE(STRIP(EXPOSED))="Y" then COMPL = "Ongoing";

have4=else COMPL =  "No ("||strip(DECOD)||")";
have4=else COMPL =  "No ("||strip(DECOD)||")";

keen_sas · Posted 05-22-2020 06:33 AM

Thank you both for your responses. @Richard .It works perfectly.Since i am not a regular user of regular expressions , not clear about these expressions.
lexpr = prxchange ( 's/(\bstrip\b)\s*($.+?$)/UPCASE(\U$1$2)/i', -1, lexpr);
lexpr = prxchange ( 's/(["''])(.*?)\1/$1\U$2\E$1/', -1, lexpr);
Could you please give me a glimpse of these as how these are working, since the UPCASEis populating twice in my code, i want to get rid of the second one.

RichardDeVen · Posted 05-22-2020 08:07 AM

If you rerun the data through same program the doubling will occur because on the second run through the once processed data that contains an UPCASE(STRIP(DECODE)) will become UPCASE(UPCASE(STRIP(DECODE)))

As for the meanings in the regular expression patterns

lexpr = prxchange ( 's/(\bstrip\b)\s*($.+?$)/UPCASE(\U$1$2)/i', -1, lexpr);

arg1, regex search pattern and replace instructions
- s - substitute, as in s/ find / replace /
arg2, -1, globally, as in everywhere in the arg3 string
arg3, string to process
(\bstrip\b)\s*($.+?$), search pattern. find the word strip followed by 0 or more spaces followed by an open parenthesis, one or more characters and a close parenthesis. The pattern does NOT check for nested parentheses.
- \b word boundary
- (, open parenthesis starting capture group #1. the contents of a capture group can be used in the replacement
- strip, literally, the letters of the word "strip"
- ), close parenthesis closing capture group #1.
- \s*, zero or more whitespace characters
- (, open parenthesis starting capture group #2. the contents of a capture group can be used in the replacement
- $, an escaped open parenthesis means a literal ( character
- .+?, . any character, + one or more times, ? the 'or more' is non-greedy and will stop at the first next anchor point located by the remaining part of the pattern
- $, an escaped close parenthesis means a literal ) character, as would be expected for STRIP(decOd). Will not detect properly nestings such as STRIP(substr(decode,3,5))
- ), close parenthesis closing capture group #2.
UPCASE(\U$1$2), how to replace what was found
- UPCASE, literally the letters UPCASE
- (, literally an open parenthesis (as would be need for an UPCASE function call in source code)
- \U, start instruction to upper case whatever follows, ends at either \E or end of replacement instructions
- $1, the first capture group, which would be literally STRIP
- $2, the second capture group, which will be what is inside the original STRIP()
- ), literally a close parenthesis (as would be needed to close an UPCASE function call in source code)
i, ignore case during search

lexpr = prxchange ( 's/(["''])(.*?)\1/$1\U$2\E$1/', -1, lexpr);, find a literally quoted string and replace it with the uppercase version.

Search
- (["'']), Capture group 1, starts with either " or '
  - [], list of characters, any which will match
- (.*?), Capture group 2, any number of any characters, non-greedy, stops at first next found
- \1, next find is the contents of the first capture group, which is the closing " or '
Replace
- $1\U$2\E$1, the original string, uppercased

What does all the complication mean in either your tranwrd or my regexes? Probably that they are a bit of a Rube Goldberg machine that can be simplified using better processes? Perhaps you need to look alternatives per @ballardw . Better data means less code. Does program source code necessarily have to be in data set variables ?

ballardw · Posted 05-21-2020 12:23 PM

If you have poor data entry then sometimes how your READ the data can fix many problems before they start. One of the tools are custom informats.

A rough example:

Proc format;
invalue $decode (upcase)
"COMPLETED"    ='COMPLETED'
"NOT COMPLETED"="NOT COMPLETED"
"PARTIAL"      ="PARTIAL"
;
run;

data example;
   input x :$decode.;
datalines;
COMPLETED
Completed
compLETED
COMPLeted
compLETED
completed
;

The UPCASE option on the INVALUE statement says to take what ever value is encountered when reading the data and convert it to uppercase before comparing to the list of values (the Left side of the = in the Proc format code). Then after the comparison the value returned is what appears on the right hand side.

Note that the example code has many different combinations of case to READ but the result is the same.

And if you already have crappy values in the data you can use a data step with the informat fix them:

data example2;
   input x :$15.;
datalines;
COMPLETED
Completed
compLETED
COMPLeted
compLETED
completed
;

data want;
   set example2;
   x = input(x,$decode.);
run;

I use this approach with several projects where the data is manually entered and the users a kind of sloppy. And additional piece you can add is data validation. If the variable is not ever supposed to have blank values after you have all of the expected list of values you can use an: other = _error_ value. That means that if something is misspelled on top of the case issues such as the value "Cmplte" then you will get an error message when reading or converting the data. So you can pick your favorite approach to cleaning/correcting the values.

So with cleaner data you should not have to have multiple cases for handing the values.

There are many more things that can be done along these lines including use of Value or Invalue to map single variables to additional values.

Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

Re: Handling CASE sensitivity dynamically

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!