SAS Functions can be very useful when working on a text string. This Juletip will show how you can extract digits out of a text string that contains both letters and digits.
In this example, I will use a variable (text string) with item description that often can appear on scanner data that comes from grocery stores. Usually, this kind of data doesn’t contain separate variables with information about unit or quantity of the different items, but I know that I might find them in the description variable connected to each item.
In step 1, I am using DATALINES/INPUT statement to read a variable with item description:
/* Step 1: The data */
data food0;
infile datalines dsd truncover;
input Text:$60.;
datalines4;
_COCA COLA 6 LITER RINGNES
SOLO RINGNES 6L. 4BOTTLEDS
500 GRAM FLOUR MØLLERENS 0.5 KG
4.5 KG FLOUR MOLLERENS 4500 GRAM
200 GRAM MOLLERENS 200 GRAM SUGAR
FARIS 1.5DL RINGNES 150 CL
CHEESE WHITE 200G PKN
0.5 KG CHEESE 500 GRAM TINE MEIERIER 500GR.
;;;;
proc print data=food0;
run;
The output looks like this:
In step 2, I show different functions that could be useful when working on the text string:
/* Step 2: Useful functions to be aware of when working with text variables */
Data food1;
set food0;
*Detecting the presence of alphabetic characters;
t1=anyalpha(Text);
*Detecting the presence of digital values;
t2=anydigit(Text);
*Return the length of a string;
t3=length(Text);
*Extract a substring from a string;
t4=substr(Text,5,4);
*Take a part of a string;
t5=scan(Text,2,'');
*Replace a word in a string;
t6=tranwrd(Text,'RINGNES','MACK');
run;
proc print data=food1;
run;
This gives the following output:
In step 3, I try to extract the different digits in the text string using combinations of various SAS functions. The function substr helps to extract parts of the text string into a new string, hence; I read from the first digit by using the function anydigit and then defining the length using the function length minus anydigit plus one. After that, I use the function scan to read the first part of the new text string that was made and assuming that’s the quantity. Then I repeat scan so I can read the second part, which I assume is the unit. Since there might be more digits in the text string, I also make a new string that contains the rest of the text after the first quantity and unit. In the second part, I do the same that I did in the first part on remaining text, so I can find more quantities and units.
/* Step 3: Abstract Quantity and Unit from the string */
Data food2;
set food0;
*First part;
M1=substr(Text,anydigit(Text), length(Text)-anydigit(Text)+1);
M3=scan(M1,1,'');
M4=scan(M1,2,'');
*NB: SUBSTRN= behave in the same way as SUBSTR, but will not issue a NOTE message if it returns a null result;
R0=substrn(M1,length(M3)+length(M4)+3,length(M1)-(length(M3)+length(M4)+2));
if anyalpha(M3)>0 then do;
quantity1=substr(M3,anydigit(M3),anyalpha(M3)-1);
unit1=substr(M3, anyalpha(M3), length(M3)-anyalpha(M3)+1);
end;
else do;
quantity1=M3;
unit1=M4;
end;
*Second part;
M7=substrn(R0,anydigit(R0),length(R0)-anydigit(R0)+1);
M8=scan(M7,1,'');
M9=scan(M7,2,'');
if anyalpha(M8)>0 then do;
quantity2=substrn(M8,anydigit(M8),anyalpha(M8)-1);
unit2=substrn(M8, anyalpha(M8), length(M8)-anyalpha(M8)+1);
end;
else do;
quantity2=M8;
unit2=M9;
end;
run;
proc print data=food2(keep=Text quantity1 unit1 quantity2 unit2 );
run;
The ‘cleaned’ output looks like this:
In step 4, I am using a macro since there is a pattern in step 3. Since I know that there can be at least 3 digits in the text string, I decide to run the macro four times so I can be sure to find all the quantities/units:
/* Step 4: Making this with a macro */
%macro part(p0,p1);
Data food&p1.;
set food&p0.;
if &p1.=1 then do;
*FIRST PART;
M1_&p1.=substr(Text,anydigit(Text), length(Text)-anydigit(Text)+1);
M3_&p1.=scan(M1_&p1.,1,'');
M4_&p1.=scan(M1_&p1.,2,'');
*SUBSTRN= behave in the same way as SUBSTR, but will not issue a NOTE message if it returns a null result;
R0_&p1.=substrn(M1_&p1.,length(M3_&p1.)+length(M4_&p1.)+3,length(M1_&p1.)-(length(M3_&p1.)+length(M4_&p1.)+2));
if anyalpha(M3_&p1.)>0 then do;
quantity&p1.=substr(M3_&p1.,anydigit(M3_&p1.),anyalpha(M3_&p1.)-1);
unit&p1.=substr(M3_&p1., anyalpha(M3_&p1.),length(M3_&p1.)-anyalpha(M3_&p1.)+1);
end;
else do;
quantity&p1.=M3_&p1.;
unit&p1.=M4_&p1.;
end;
end;
if ( &p1.>1 ) and ( length(R0_&p0.)>0 ) then do;
*MORE PARTS;
M1_&p1.=substrn(R0_&p0.,anydigit(R0_&p0.),length(R0_&p0.)-anydigit(R0_&p0.)+1);
M3_&p1.=scan(M1_&p1.,1,'');
M4_&p1.=scan(M1_&p1.,2,'');
R0_&p1.= substrn(M1_&p1.,length(M3_&p1.)+length(M4_&p1.)+2,length(M1_&p1.)-(length(M3_&p1.)+length(M4_&p1.)+1));
if anyalpha(M3_&p1.)>0 then do;
quantity&p1.=substrn(M3_&p1.,anydigit(M3_&p1.),anyalpha(M3_&p1.)-1);
unit&p1.=substrn(M3_&p1., anyalpha(M3_&p1.), length(M3_&p1.)-anyalpha(M3_&p1.)+1);
end;
else do;
quantity&p1.=M3_&p1.;
unit&p1.=M4_&p1.;
end;
end;
run;
%mend;
%part(0,1);
%part(1,2);
%part(2,3);
%part(3,4);
proc print data=food4(keep=Text quantity1 unit1 quantity2 unit2 quantity3 unit3 quantity4 unit4);
run;
The result looks like this:
In step 5, I try to clean up a bit, so that the items with the same unit is written in the same way. This is also done with a macro:
/* Step 5: Looks for a word that has the same meaning and replaces them with the same unit */
Data out0;
set food4(keep=Text quantity1 unit1 quantity2 unit2 quantity3 unit3 quantity4 unit4);
run;
%macro unit(p0, p1);
data out&p1.;
set out&p0.;
if Unit&p1. in ('K','KG','KG.','KILO') then Unit&p1.='KG';
if Unit&p1. in ('GRAM','G','G.','GR','GR.','GRA','GRAM.') then Unit&p1.='GR';
if Unit&p1. in ('MILIGRAM') then Unit&p1.='MGR';
if Unit&p1. in ('C','CL.','CL') then Unit&p1.='CL';
if Unit&p1. in ('LTR','L','L.','L..','LT','LIT','LITER','LITRE','LTR.','LITE') then Unit&p1.='LTR';
if Unit&p1. in ('ML.','MLT','ML..','ML','MLITER','MILITER') then Unit&p1.='MLT';
run;
%mend;
%unit(0,1);
%unit(1,2);
%unit(2,3);
%unit(3,4);
proc print data=out4(keep=Text quantity1 unit1 quantity2 unit2 quantity3 unit3 quantity4 unit4);
run;
The output looks like this:
The work might not be finished, but the power of functions might have been proven. In the last step, I want to show that almost the same result could be achieved by Perl Regular Expression’s in the following way:
/* Step 6: Using Perl regular expression */
data food6;
set food0;
*Create Perl regular expression's that reads one digit or more (+), the digit could contain dot(or comma);
Expression1='/([\d.]+)/';
*The PRXPARSE function returns a pattern identifier number that is used by other PRX functions and call routines;
PatternId1=prxparse(Expression1);
*Use the CALL PRXSUBSTR routine to create a MyStart and MyLength columns, looks for the expression (PatternID) in the Text column;
call prxsubstr(PatternID1,Text,MyStart1,MyLength1);
* Use the SUBSTR function to create a column that contain the unit starting on G and the quantity before, based on the MyStart and MyLength columns;
Unit1=substrn(Text,MyStart1,MyLength1);
extra=substrn(Text,MyStart1+MyLength1,length(text)+1);
Quantity1=scan(extra,1,'');
R0=substr(extra,length(Quantity1)+1,length(Text));
run;
proc print data=food6 (drop=extra);run;
This is the results:
Of course, you could also make a macro with the suggestion in step 6, so you are able to fetch several digits from the text string.
I will not recommend which methods that you should use, but please give comments of which method that you prefer.
I wish you all a Merry Christmas and Happy New Year!
Take a look at the autumn/winter 2024 program and events for FANS and other SAS users!
www.sas.com/fans | #SASFANS #sasnordicusers