Re: Select numeric values embedded in loooooong string

Jumboshrimps · Posted 06-15-2021 05:17 PM

Inherited a bunch of case statements as below (& and this is the SANITIZED version!). Upper value is 200, lowest value is 1. Person before me started with '%200 U%' / ' %200 u%' / '%200U%' / '%200 u%' all the way down to '%1 U%' - 200 lines of case statements like this:

when meds (data_var like '%200 U%') then 200
when meds (data_var like '%199 U%') then 199
when meds (data_var like '%198 U%') then 198
when meds (data_var like '%197 U%' ) then 197

Intent is to identify a dosage string from a free-text prescription that can be 400 to 500 characters long, with explanations for evening/morning after eating, etc.

Looking for numeric value(s) PRECEEDING the letter "U", (I did an upcase on the whole string, which my predecessor did not have the foresight to do, so looking for upper case "U" only), as in "10 Units". Can be found anywhere in the string.

Am NOT looking for "by MOUTH", "USE", "SUBCUTANEOUSLY", "UNDER", etc. The "U" MUST HAVE a numeric value(s) preceding the string as in 10-15 Units.

When the string is found and assigned to new variable "UNITS", (as an example), I would like to do a loop over the data set. If UNITS is "99 Units", or "99Units" then dosage var is 99, if UNITS is "98 Units" or "98Units" then dosage var is 98, etc.

There may be a way to loop over the entire data set, extract the dosage from the string, and assign the numeric dosage value all in one step. But I don't know how.

Below is what I've seen similar (changing the var names to protect the innocent), and seems like a good starting point:

DATA want;
SET have;
ARRAY UNITS UNITS_1-UNITS_200;
count=0;
DO i = 1 to dim(UNITS) ;
IF '1 U' >=: UNITS[i]
then count= count+1
;
END;

CASE = (count> 1);
run;

Another option is index as in:

data UNITS;
set patients;
U = index(scripts, "Units");
dosage = some number of characters immediately to the LEFT of the position where string "Units", ("Units" is not always spelled correctly. We're talking Doctors here) was found.
run;

ballardw · Posted 06-15-2021 06:08 PM

It might be a good idea to post 10 or so of the looooong string values you are searching and what the expected result from those would be.

Does the value you are interested in always have the word "Units"? It would be better to search for Units than U, or even " U".

Since you mention a value like 10-15 Units is of interest, what would the "numeric" value from 10-15 be in the result?

FINDW would likely be a better choice than INDEX.

And since your doctors can't spell you likely want to provide a likely list of things that you will consider to be equivalent to "Unit" to search with.

Are they semi-consistent with providing a space between the numerals and "unit"? or do you need to find "98unts" where units is misspelled.

Plus a question about the kind of number you are looking for: Is the number of units always an INTEGER?

I suspect you are going to get a few really slick and not obvious to me PRX solutions for at least part of this.

Jumboshrimps · Posted 06-16-2021 10:32 AM

Thanx for all your help.

There are TWO HUNDRED lines of case statements as below (1 to 200), assigning a numeric value, (in the example below, 8), to any records where the prescription string contains 8 U or 8U or 8 u or 8u:

(var1 like '%8 U%' or VAR1 like '%8U%' or VAR1 like '%8 u%' or VAR1 like '%8u%') then 8.

Next line of code is a case statement for 9u or 9U or 9 U or 9 u, then 10, then 11, and so on.

The code does not select ranges (5 - 15 Units, for example), but I thought I could improve on it. (That may not be possible). I will bring the range issue to the Medical Director. It appears my predecessor simply threw up his hands and excluded them, as I see ranges in the original set of data, but not in the final data set.

Final result of the existing code is a variable with a numeric value ranging from 0 to 200. Each record (patient) has one value. Doing a group by on the final data set of the numeric value results in 114 unique rows, ranging from zero (0 represents 80% of all records in the final data set - so ideally, I would like to exclude any records that return 0), to 200. 65% of the all the values are 10 or 15 or 20 or 30 or 40 or 50, so distribution is centered around a dozen or so values.

A fellow programmer suggested something like this:

Units=var1 in (‘1-200’) ||’ U’

Units=var1 in (‘1-200’) ||’U’

Can't get that code by itself to work.

Again, thank you for your efforts.

ballardw · Posted 06-16-2021 03:47 PM

@Jumboshrimps wrote:

Thanx for all your help.

There are TWO HUNDRED lines of case statements as below (1 to 200), assigning a numeric value, (in the example below, 8), to any records where the prescription string contains 8 U or 8U or 8 u or 8u:

Which with UPCASE of the values would be reduced to 8 U or 8U

Regular expressions, which in SAS use functions whose names start with PRX or call routines with similar names are used to search strings for patterns. Patterns can be any digit followed by one or more specific characters.

PRXMATCH finds the position in a string for example. Which position could be fed to one of the other functions to extract the "word" before that position. But it will help to have some examples to test what specific code works best with your data.

I am not a guru with the regular expression syntax but this is what would likely best replace 200 lines of (garbage-like) code. And likely run in a data step instead of SQL and may be quicker.