I have some messy prescription data that looks like this:
Dose
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml
10-10-20 30 mg / ml
10-20- 500 mg/mL
2-30-7-10-30 mgMy goal is to extract the first number, or if available, subsequent numbers after the hyphen. So I want to get rid of the characters, including /, as well as the number corresponding to the mg dosage (e.g, 5 mg), which follows a space, not a hyphen. I tried using prxchange, but couldn't come up with an expression which a) encompassed all the different cases and, b) understood that the 500 after 10-20- is part of 10-20- 500, not 500 mg/mL (it deleted 500 mg/mL, leaving 10-20-). The desired output is:
Dose
15
10-18-27
20-8
12
130
12 - 10 - 12
5-10-6
10
12-10
10-10-20
10-20- 500
2-30-7-10-30
I appreciate any advice!
This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.
.
data have;
infile cards truncover;
input Dose $char100.;
cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;
data want (drop= Dose i); set have;
length Dose_Clean $50;
do i = 1 to count(Dose,'/')+1;
* Split string in blocks separated by slash;
Dose_Clean = scan(Dose,i,'/');
* remove everything in parantheses before further tests and replacements;
Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);
* IF the block is number one in a line OR contains the abbreviation "mg"...;
if i = 1 or index(lowcase(Dose_Clean),'mg') then do;
* remove all letters;
Dose_Clean = compbl(compress(Dose_Clean,,'UL'));
* remove spaces before and after a hyphen;
Dose_Clean = tranwrd(Dose_Clean,' -','-');
Dose_Clean = tranwrd(Dose_Clean,'- ','-');
* keep only first "word" in block;
Dose_Clean = scan(Dose_Clean,1,' ');
if Dose_Clean ne '' then output;
end;
end;
run;
Just to be sure to understand well, could you please show the expected output?
Best,
This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.
.
data have;
infile cards truncover;
input Dose $char100.;
cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;
data want (drop= Dose i); set have;
length Dose_Clean $50;
do i = 1 to count(Dose,'/')+1;
* Split string in blocks separated by slash;
Dose_Clean = scan(Dose,i,'/');
* remove everything in parantheses before further tests and replacements;
Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);
* IF the block is number one in a line OR contains the abbreviation "mg"...;
if i = 1 or index(lowcase(Dose_Clean),'mg') then do;
* remove all letters;
Dose_Clean = compbl(compress(Dose_Clean,,'UL'));
* remove spaces before and after a hyphen;
Dose_Clean = tranwrd(Dose_Clean,' -','-');
Dose_Clean = tranwrd(Dose_Clean,'- ','-');
* keep only first "word" in block;
Dose_Clean = scan(Dose_Clean,1,' ');
if Dose_Clean ne '' then output;
end;
end;
run;
That worked for my data, thanks!
I discovered another way using prx expressions:
data new;
set old;
*get rid of the word mg;
dose_new=compbl(compress(dose, "mg"));
*delete space between hyphens;
dose_new=tranwrd(dose_new, " -", "-";
dose_new=tranwrd(dose_new, "- ", "-";
*prx expression to locate digit string. ? means character or digit are optional, \ before special characters, such as period. Reads as: optional decimal, followed by digit of any length, followed by optional comma,
followed by a digit of any length followed by an optional hyphen; repeat sequence;
digit_string=prxparse("/\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*/");
*extract digit string based on prx expression using substring function;
*create two new variables, position and length of digit variable;
call prxsubstr(digit_string, dose_new, position, length);
*extract digit string based on position and length;
new = substrn(dose_new, position, length);
length dose_new $32.;
run;
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.