BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Caetreviop543
Obsidian | Level 7

I have some messy prescription data that looks like this:

Dose
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml
10-10-20 30 mg / ml
10-20- 500 mg/mL
2-30-7-10-30 mg

My goal is to extract the first number, or if available, subsequent numbers after the hyphen. So I want to get rid of the characters, including /, as well as the number corresponding to the mg dosage (e.g, 5 mg), which follows a space, not a hyphen. I tried using prxchange, but couldn't come up with an expression which a) encompassed all the different cases and, b) understood that the 500 after 10-20- is part of 10-20- 500, not 500 mg/mL (it deleted 500 mg/mL, leaving 10-20-).  The desired output is:

Dose
15
10-18-27
20-8
12
130
12 - 10 - 12
5-10-6
10
12-10
10-10-20
10-20- 500
2-30-7-10-30

 

I appreciate any advice!

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Caetreviop543 

 

This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.

 

.

data have;
	infile cards truncover;
	input Dose $char100.;
	cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;

data want (drop= Dose i); set have;
	length Dose_Clean $50;

	do i = 1 to count(Dose,'/')+1;

		* Split string in blocks separated by slash;
		Dose_Clean = scan(Dose,i,'/');

		* remove everything in parantheses before further tests and replacements;
		Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);

		* IF the block is number one in a line OR contains the abbreviation "mg"...;
		if i = 1 or index(lowcase(Dose_Clean),'mg') then do;

			* remove all letters;
			Dose_Clean = compbl(compress(Dose_Clean,,'UL'));

			* remove spaces before and after a hyphen;
			Dose_Clean = tranwrd(Dose_Clean,' -','-');
			Dose_Clean = tranwrd(Dose_Clean,'- ','-');

			* keep only first "word" in block;
			Dose_Clean = scan(Dose_Clean,1,' ');
			if Dose_Clean ne '' then output;
		end;
	end;
run;

 

View solution in original post

4 REPLIES 4
ed_sas_member
Meteorite | Level 14

Hi @Caetreviop543 

 

Just to be sure to understand well, could you please show the expected output?

 

Best,

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Caetreviop543 

 

This is diifficult because a number without a preceding or following hyphen can have a different meaning dependent on the context. I have tried to build a sequence of functions that handles your test input. But there might be other combinations in a full data set where it wouldn't work.

 

.

data have;
	infile cards truncover;
	input Dose $char100.;
	cards;
15 mg
10-18-27 mg / mL
20-8 5 mg / 300 ml
12 (600 mgc spray)
130 (drugname)
12mg - 10 mg - 12 mg
5mg-10mg-6mg
10 mg (nmn)
12-10 mg / ml10-10-20 30 mg / ml10-20- 500 mg/mL2-30-7-10-30 mg
;
run;

data want (drop= Dose i); set have;
	length Dose_Clean $50;

	do i = 1 to count(Dose,'/')+1;

		* Split string in blocks separated by slash;
		Dose_Clean = scan(Dose,i,'/');

		* remove everything in parantheses before further tests and replacements;
		Dose_Clean = prxchange('s/\(.{1,}\)//i', -1, Dose_Clean);

		* IF the block is number one in a line OR contains the abbreviation "mg"...;
		if i = 1 or index(lowcase(Dose_Clean),'mg') then do;

			* remove all letters;
			Dose_Clean = compbl(compress(Dose_Clean,,'UL'));

			* remove spaces before and after a hyphen;
			Dose_Clean = tranwrd(Dose_Clean,' -','-');
			Dose_Clean = tranwrd(Dose_Clean,'- ','-');

			* keep only first "word" in block;
			Dose_Clean = scan(Dose_Clean,1,' ');
			if Dose_Clean ne '' then output;
		end;
	end;
run;

 

Caetreviop543
Obsidian | Level 7

That worked for my data, thanks!

Caetreviop543
Obsidian | Level 7

I discovered another way using prx expressions:

 

 

data new;
set old;

*get rid of the word mg;
dose_new=compbl(compress(dose, "mg"));

*delete space between hyphens;
dose_new=tranwrd(dose_new, " -", "-";
dose_new=tranwrd(dose_new, "- ", "-";

*prx expression to locate digit string. ? means character or digit are optional, \ before special characters, such as period. Reads as: optional decimal, followed by digit of any length, followed by optional comma,
followed by a digit of any length followed by an optional hyphen; repeat sequence;
digit_string=prxparse("/\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*-?\.?\d*,?\d*/"); *extract digit string based on prx expression using substring function;
*create two new variables, position and length of digit variable;
call prxsubstr(digit_string, dose_new, position, length);

*extract digit string based on position and length; new = substrn(dose_new, position, length); length dose_new $32.; run;

 

 

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 2510 views
  • 0 likes
  • 3 in conversation