I have some messy prescription data that looks like this:
Dose
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml
10-10-20 30 mg / ml
10-20- 500 mg/mL
2-30-7-10-30 mgMy goal is to extract the first number, or if available, subsequent numbers after the hyphen. So I want to get rid of the characters, including /, as well as the number corresponding to the mg dosage (e.g, 5 mg), which follows a space, not a hyphen. I tried using prxchange, but couldn't come up with an expression which a) encompassed all the different cases and, b) understood that the 500 after 10-20- is part of 10-20- 500, not 500 mg/mL (it deleted 500 mg/mL, leaving 10-20-). The desired output is:
Dose
15
10-18-27
20-8
12
130
12 - 10 - 12
5-10-6
10
12-10
10-10-20
10-20- 500
2-30-7-10-30
I appreciate any advice!
This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.
.
data have;
infile cards truncover;
input Dose $char100.;
cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;
data want (drop= Dose i); set have;
length Dose_Clean $50;
do i = 1 to count(Dose,'/')+1;
* Split string in blocks separated by slash;
Dose_Clean = scan(Dose,i,'/');
* remove everything in parantheses before further tests and replacements;
Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);
* IF the block is number one in a line OR contains the abbreviation "mg"...;
if i = 1 or index(lowcase(Dose_Clean),'mg') then do;
* remove all letters;
Dose_Clean = compbl(compress(Dose_Clean,,'UL'));
* remove spaces before and after a hyphen;
Dose_Clean = tranwrd(Dose_Clean,' -','-');
Dose_Clean = tranwrd(Dose_Clean,'- ','-');
* keep only first "word" in block;
Dose_Clean = scan(Dose_Clean,1,' ');
if Dose_Clean ne '' then output;
end;
end;
run;
Just to be sure to understand well, could you please show the expected output?
Best,
This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.
.
data have;
infile cards truncover;
input Dose $char100.;
cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;
data want (drop= Dose i); set have;
length Dose_Clean $50;
do i = 1 to count(Dose,'/')+1;
* Split string in blocks separated by slash;
Dose_Clean = scan(Dose,i,'/');
* remove everything in parantheses before further tests and replacements;
Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);
* IF the block is number one in a line OR contains the abbreviation "mg"...;
if i = 1 or index(lowcase(Dose_Clean),'mg') then do;
* remove all letters;
Dose_Clean = compbl(compress(Dose_Clean,,'UL'));
* remove spaces before and after a hyphen;
Dose_Clean = tranwrd(Dose_Clean,' -','-');
Dose_Clean = tranwrd(Dose_Clean,'- ','-');
* keep only first "word" in block;
Dose_Clean = scan(Dose_Clean,1,' ');
if Dose_Clean ne '' then output;
end;
end;
run;
That worked for my data, thanks!
I discovered another way using prx expressions:
data new;
set old;
*get rid of the word mg;
dose_new=compbl(compress(dose, "mg"));
*delete space between hyphens;
dose_new=tranwrd(dose_new, " -", "-";
dose_new=tranwrd(dose_new, "- ", "-";
*prx expression to locate digit string. ? means character or digit are optional, \ before special characters, such as period. Reads as: optional decimal, followed by digit of any length, followed by optional comma,
followed by a digit of any length followed by an optional hyphen; repeat sequence;
digit_string=prxparse("/\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*/");
*extract digit string based on prx expression using substring function;
*create two new variables, position and length of digit variable;
call prxsubstr(digit_string, dose_new, position, length);
*extract digit string based on position and length;
new = substrn(dose_new, position, length);
length dose_new $32.;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.