Solved: Extracting digits with hyphens from character text

Caetreviop543 · Posted 03-08-2020 03:51 AM

I have some messy prescription data that looks like this:

Dose
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml
10-10-20 30 mg / ml
10-20- 500 mg/mL
2-30-7-10-30 mg

My goal is to extract the first number, or if available, subsequent numbers after the hyphen. So I want to get rid of the characters, including /, as well as the number corresponding to the mg dosage (e.g, 5 mg), which follows a space, not a hyphen. I tried using prxchange, but couldn't come up with an expression which a) encompassed all the different cases and, b) understood that the 500 after 10-20- is part of 10-20- 500, not 500 mg/mL (it deleted 500 mg/mL, leaving 10-20-). The desired output is:

Dose
15
10-18-27
20-8
12
130
12 - 10 - 12
5-10-6
10
12-10
10-10-20
10-20- 500
2-30-7-10-30

I appreciate any advice!

ErikLund_Jensen · Posted 03-08-2020 12:23 PM

Hi @Caetreviop543

This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.

.

data have;
	infile cards truncover;
	input Dose $char100.;
	cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;

data want (drop= Dose i); set have;
	length Dose_Clean $50;

	do i = 1 to count(Dose,'/')+1;

		* Split string in blocks separated by slash;
		Dose_Clean = scan(Dose,i,'/');

		* remove everything in parantheses before further tests and replacements;
		Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);

		* IF the block is number one in a line OR contains the abbreviation "mg"...;
		if i = 1 or index(lowcase(Dose_Clean),'mg') then do;

			* remove all letters;
			Dose_Clean = compbl(compress(Dose_Clean,,'UL'));

			* remove spaces before and after a hyphen;
			Dose_Clean = tranwrd(Dose_Clean,' -','-');
			Dose_Clean = tranwrd(Dose_Clean,'- ','-');

			* keep only first "word" in block;
			Dose_Clean = scan(Dose_Clean,1,' ');
			if Dose_Clean ne '' then output;
		end;
	end;
run;

View solution in original post

ed_sas_member · Posted 03-08-2020 04:21 AM

Hi @Caetreviop543

Just to be sure to understand well, could you please show the expected output?

Best,

ErikLund_Jensen · Posted 03-08-2020 12:23 PM

Hi @Caetreviop543

This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.

.

data have;
	infile cards truncover;
	input Dose $char100.;
	cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;

data want (drop= Dose i); set have;
	length Dose_Clean $50;

	do i = 1 to count(Dose,'/')+1;

		* Split string in blocks separated by slash;
		Dose_Clean = scan(Dose,i,'/');

		* remove everything in parantheses before further tests and replacements;
		Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);

		* IF the block is number one in a line OR contains the abbreviation "mg"...;
		if i = 1 or index(lowcase(Dose_Clean),'mg') then do;

			* remove all letters;
			Dose_Clean = compbl(compress(Dose_Clean,,'UL'));

			* remove spaces before and after a hyphen;
			Dose_Clean = tranwrd(Dose_Clean,' -','-');
			Dose_Clean = tranwrd(Dose_Clean,'- ','-');

			* keep only first "word" in block;
			Dose_Clean = scan(Dose_Clean,1,' ');
			if Dose_Clean ne '' then output;
		end;
	end;
run;

Caetreviop543 · Posted 03-08-2020 01:13 PM

That worked for my data, thanks!

Caetreviop543 · Posted 03-09-2020 12:11 PM

I discovered another way using prx expressions:

data new;
set old;

*get rid of the word mg;
dose_new=compbl(compress(dose, "mg"));

*delete space between hyphens;
dose_new=tranwrd(dose_new, " -", "-";
dose_new=tranwrd(dose_new, "- ", "-";

*prx expression to locate digit string. ? means character or digit are optional, \ before special characters, such as period. Reads as: optional decimal, followed by digit of any length, followed by optional comma,
followed by a digit of any length followed by an optional hyphen; repeat sequence;

digit_string=prxparse("/\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*/");

*extract digit string based on prx expression using substring function;
*create two new variables, position and length of digit variable;

call prxsubstr(digit_string, dose_new, position, length);

*extract digit string based on position and length;
new = substrn(dose_new, position, length);
length dose_new $32.;
run;

Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Registration is open

Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Re: Extracting digits with hyphens from character text

Registration is open

SAS Training: Just a Click Away