ANYALPHA and ANYDIGIT: Surprisingly useful SAS functions

5 Likes

Sometimes when working and prepping your data, it often can be a need to divide a text string into different parts or maybe just to extract a digit from the text. There is a lot of powerful functions that can be used in such cases. In order, to show these functions, I will use a text string (variable) with item description that often can appear on scanner data from grocery stores. Usually, this kind of data doesn’t contain separate variables with information about unit or quantity of the different items, but the information could be found in an text string connected to each EAN. So, in order to show how to work with a text string like this and give you some tips along the way, let's go through each step in a SAS program that I have developed.

In step 1, I made the dataset by using DATALINES/INPUT-statement:

/* Step 1: The data */

data food0;
  infile datalines dsd truncover;
  input Text:$60.;
  datalines4;
COCA COLA 6 LITER RINGNES 1.5*4L
4.5 KG FLOUR MOELLERENS 4500 GRAM
500 GRAM WHITE CHEESE NORWAY 0.5 KG 1 PACKAGE
0.5 KG BROWN CHEESE 500 GRAM TINE DAIRIES 500 GR.
POTATO GOLD CHIPS 200 GRAM 200 G POTATO GOLD
ORGANIC CARROTS HOLLAND WASHED. 1 KG
CREAM CHEESE 500 G TINE
JARLSBERG CHEESE 500 GRAM
PEPSI 6000ML RINGNES 6 LITER 600CL
200 GRAM MOLLER'S 200 GRAM SUGAR
;;;;

proc print data=food0;
run;

This gives the following output:

The dataset consist of only one variable called 'Text' which is a text string with both characters and digits. So, we want to extract all the different digits and the word coming after that, since that could be the quantity and unit of an item. But, keep in mind that the digit could represent something else than the quantity.

Using ANYALPHA, ANYDIGIT, and other string functions

In step 2, let’s take a look at different functions that could be useful when working with our text string:

/* Step 2: Useful functions to be aware of */

Data fil1;
	set food0;

	*Detecting the presence of alphabetic characters;
	t1=anyalpha(Text);

	*Detecting the presence of digital values;
	t2=anydigit(Text);

	*Return the length of a string;	
    t3=length(Text);

	*Extract a substring from a string;
	t4=substr(Text,5,4);

	*Take a part of a string;
	t5=scan(Text,2,'');

	*Replace a word in a string;
	t6=tranwrd(Text,'RINGNES','MACK');
	
run;

This gives the following results:

To better understand what the different functions does, let's look at the first observation ‘COCA COLA 6 LITER RINGNES 1.5*4L’:

t1=1: The function anyalpha gives the first position of a character in the text string, which is 1 in this string.
t2=11: The function anydigit gives the first position of a digital value, which is at the position 11
t3=32: The function length gives the total length of the string, hence this string has 32 positions.
t4=' COL': The function substr reads from the 5 position of the text string followed by 4 places after that.
t5='COLA ': The function scan extract the nth word from a string, and we asked for the second word.
t6='COCA COLA 6 LITER RINGNES 1.5*4L': The function tranwrd replace a word in a string, hence RINGNES with MACK.

Extracting the digits from unit measurements

In step 3, I try to extract the different digits in the text string using combinations of various SAS functions:

/* Step 3: Abstract Quantity and Unit from the string */

Data fil2;
	set fil1(drop=t1 t2 t3 t4 t5 t6);

	*FIRST PART;
	M1=substr(Text,anydigit(Text), length(Text)-anydigit(Text)+1);
	M2=scan(M1,1,'');
	M3=scan(M1,2,'');

	*NB: SUBSTRN= behave in the same way as SUBSTR, but will not issue a NOTE message if it returns a null result;
	R0=substrn(M1,length(M2)+length(M3)+3,length(M1)-(length(M2)+length(M3)+2));

	if anyalpha(M2)>0 then
		do;
			quantity1=substr(M2,anydigit(M2),anyalpha(M2)-1);
			unit1=substr(M2, anyalpha(M2), length(M2)-anyalpha(M2)+1);
		end;
	else
		do;
			quantity1=M2;
			unit1=M3;
		end;

	*SECOND PART;
	M6=substrn(R0,anydigit(R0),length(R0)-anydigit(R0)+1);
	M7=scan(M6,1,'');
	M8=scan(M6,2,'');

	if anyalpha(M7)>0 then
		do;
			quantity2=substrn(M7,anydigit(M7),anyalpha(M7)-1);
			unit2=substrn(M7, anyalpha(M7), length(M7)-anyalpha(M7)+1);
		end;
	else
		do;
			quantity2=M7;
			unit2=M8;
		end;
run;

This gives the following output:

So, let's go through the result of step 3 by looking at the first observation again ''COCA COLA 6 LITER RINGNES 1.5*4L'':

M1='6 LITER RINGNES 1.5*4L': I used the function substr to extract parts of the text string into a new string. Hence: I wanted to read from the first digit in the text by using the function anydigit to find that position and then defining to read the rest of the text by using the function length of the text string minus the position of the first anydigit plus one.
M2=6: I used the function scan to read the first part of the new text string M1 and assuming that it might be the quantity.
M3='Liter': I also read the second part of the text string M1, that's assumed to be the unit.
R0='RINGNES 1.5*4L': A step that makes a new string that contains the rest of the text after the first quantity and unit. So, then I need to use the substr function again, where I ask to read from the position after the function length of M2 and M3 plus 3. The reason for plus 3, is that there are two spaces and that we want to read the first position after the second space of the text string M1. I am also making sure that we read until the end of the text string.

The next step in the code consist of an if-statement to check if M2 consist of characters, because if it does then I need to divide it more with the same combination of the function substr with the functions anyalpha, anydigit and length. For the first observation, the M2 doesn't contain anything else than digits so it jumps down to the else statement and we get the following:

quantity1=6
unit1='Liter'

In the second part, I do the same once more on the remaining text R0='RINGNES 1.5*4L' to see if there are more quantities and units:

M6=1.5*4L: Reads from the first digit position of the text string R0 and then the rest of the text.
M7=1.5*4L: Ask for the first part of the remaining text string R0.
M8 is blank since there isn't any more text.

But now, the if-statement is true:

quantity2=1.5*4: The function substr reads from the first position of the digit in the string until the first position of a character minus 1.
unit2=L: The function substr reads from the first position of a character in the text string M7 to the end of the string.

I then use a proc print to ''clean'' the output so it looks like this:

From experience, I know that there can be at least 3 digits in the text string or maybe more. The step 3 shows that there is a pattern in the code, so I could benefit of making a macro. In step 4, I ask that the macro will be run four times so I can be sure to find all the quantities and units in the text string, hence the following code:

/* Step 4: Making this with a macro */


* Delete unnecessary variables;
	Data fil0(drop=t1 t2 t3 t4 t5 t6);
	set fil1;
	run;
	
 
%macro part(p0,p1);

	Data fil&p1.;
		set fil&p0.;

		if &p1.=1 then
			do;
				*FIRST PART;
				M1_&p1.=substr(Text,anydigit(Text), length(Text)-anydigit(Text)+1);
				M3_&p1.=scan(M1_&p1.,1,'');
				M4_&p1.=scan(M1_&p1.,2,'');

				*SUBSTRN= behave in the same way as SUBSTR, but will not issue a NOTE message if it returns a null result;
				R0_&p1.=substrn(M1_&p1.,length(M3_&p1.)+length(M4_&p1.)+3,length(M1_&p1.)-(length(M3_&p1.)+length(M4_&p1.)+2));

				if anyalpha(M3_&p1.)>0 then
					do;
						quantity&p1.=substr(M3_&p1.,anydigit(M3_&p1.),anyalpha(M3_&p1.)-1);
						unit&p1.=substr(M3_&p1., anyalpha(M3_&p1.),length(M3_&p1.)-anyalpha(M3_&p1.)+1);
					end;
				else
					do;
						quantity&p1.=M3_&p1.;
						unit&p1.=M4_&p1.;
					end;
			end;

		if ( &p1.>1 ) and ( length(R0_&p0.)>0 ) then
			do;
				*MORE PARTS;
				M1_&p1.=substrn(R0_&p0.,anydigit(R0_&p0.),length(R0_&p0.)-anydigit(R0_&p0.)+1);
				M3_&p1.=scan(M1_&p1.,1,'');
				M4_&p1.=scan(M1_&p1.,2,'');
				R0_&p1.= substrn(M1_&p1.,length(M3_&p1.)+length(M4_&p1.)+2,length(M1_&p1.)-(length(M3_&p1.)+length(M4_&p1.)+1));

				if anyalpha(M3_&p1.)>0 then
					do;
						quantity&p1.=substrn(M3_&p1.,anydigit(M3_&p1.),anyalpha(M3_&p1.)-1);
						unit&p1.=substrn(M3_&p1., anyalpha(M3_&p1.), length(M3_&p1.)-anyalpha(M3_&p1.)+1);
					end;
				else
					do;
						quantity&p1.=M3_&p1.;
						unit&p1.=M4_&p1.;
					end;
			end;
	run;

%mend;

%part(0,1);
%part(1,2);
%part(2,3);
%part(3,4);


Data output0(keep=Text quantity1 unit1 quantity2 unit2 quantity3 unit3 quantity4 unit4);
set fil4;
run;

The result looks like this:

Generalizing to a macro for repeatability

In step 5, I try to clean up a bit so that the items with the same unit is written in the same way and this is also done with a macro:

/* Step 5: Looks for a word that has the same meaning and replaces them with the same unit */

%macro unit(p0, p1);

	data output&p1.;
		set output&p0.;

		if Unit&p1. in ('K','KG','KG.','KILO') then
			Unit&p1.='KG';

		if Unit&p1. in ('GRAM','G','G.','GR','GR.','GRA','GRAM.') then
			Unit&p1.='GR';

		if Unit&p1. in ('MILIGRAM') then
			Unit&p1.='MGR';

		if Unit&p1. in ('C','CL.','CL') then
			Unit&p1.='CL';

		if Unit&p1. in ('LTR','L','L.','L..','LT','LIT','LITER','LITRE','LTR.','LITE') then
			Unit&p1.='LTR';

		if Unit&p1. in ('ML.','MLT','ML..','ML','MLITER','MILITER') then
			Unit&p1.='MLT';
	run;

%mend;

%unit(0,1);
%unit(1,2);
%unit(2,3);
%unit(3,4);

The output looks like this:

The work might not be completely finished, and further work would be to decide which quantity and unit to use for the different items. But I think the power of these functions might have been proven and hopefully you found this tip on how to use these functions useful!!!