Sometimes when working and prepping your data, it often can be a need to divide a text string into different parts or maybe just to extract a digit from the text. There is a lot of powerful functions that can be used in such cases. In order, to show these functions, I will use a text string (variable) with item description that often can appear on scanner data from grocery stores. Usually, this kind of data doesn’t contain separate variables with information about unit or quantity of the different items, but the information could be found in an text string connected to each EAN. So, in order to show how to work with a text string like this and give you some tips along the way, let's go through each step in a SAS program that I have developed.
In step 1, I made the dataset by using DATALINES/INPUT-statement:
/* Step 1: The data */
data food0;
infile datalines dsd truncover;
input Text:$60.;
datalines4;
COCA COLA 6 LITER RINGNES 1.5*4L
4.5 KG FLOUR MOELLERENS 4500 GRAM
500 GRAM WHITE CHEESE NORWAY 0.5 KG 1 PACKAGE
0.5 KG BROWN CHEESE 500 GRAM TINE DAIRIES 500 GR.
POTATO GOLD CHIPS 200 GRAM 200 G POTATO GOLD
ORGANIC CARROTS HOLLAND WASHED. 1 KG
CREAM CHEESE 500 G TINE
JARLSBERG CHEESE 500 GRAM
PEPSI 6000ML RINGNES 6 LITER 600CL
200 GRAM MOLLER'S 200 GRAM SUGAR
;;;;
proc print data=food0;
run;
This gives the following output:
The dataset consist of only one variable called 'Text' which is a text string with both characters and digits. So, we want to extract all the different digits and the word coming after that, since that could be the quantity and unit of an item. But, keep in mind that the digit could represent something else than the quantity.
In step 2, let’s take a look at different functions that could be useful when working with our text string:
/* Step 2: Useful functions to be aware of */
Data fil1;
set food0;
*Detecting the presence of alphabetic characters;
t1=anyalpha(Text);
*Detecting the presence of digital values;
t2=anydigit(Text);
*Return the length of a string;
t3=length(Text);
*Extract a substring from a string;
t4=substr(Text,5,4);
*Take a part of a string;
t5=scan(Text,2,'');
*Replace a word in a string;
t6=tranwrd(Text,'RINGNES','MACK');
run;
This gives the following results:
To better understand what the different functions does, let's look at the first observation ‘COCA COLA 6 LITER RINGNES 1.5*4L’:
In step 3, I try to extract the different digits in the text string using combinations of various SAS functions:
/* Step 3: Abstract Quantity and Unit from the string */
Data fil2;
set fil1(drop=t1 t2 t3 t4 t5 t6);
*FIRST PART;
M1=substr(Text,anydigit(Text), length(Text)-anydigit(Text)+1);
M2=scan(M1,1,'');
M3=scan(M1,2,'');
*NB: SUBSTRN= behave in the same way as SUBSTR, but will not issue a NOTE message if it returns a null result;
R0=substrn(M1,length(M2)+length(M3)+3,length(M1)-(length(M2)+length(M3)+2));
if anyalpha(M2)>0 then
do;
quantity1=substr(M2,anydigit(M2),anyalpha(M2)-1);
unit1=substr(M2, anyalpha(M2), length(M2)-anyalpha(M2)+1);
end;
else
do;
quantity1=M2;
unit1=M3;
end;
*SECOND PART;
M6=substrn(R0,anydigit(R0),length(R0)-anydigit(R0)+1);
M7=scan(M6,1,'');
M8=scan(M6,2,'');
if anyalpha(M7)>0 then
do;
quantity2=substrn(M7,anydigit(M7),anyalpha(M7)-1);
unit2=substrn(M7, anyalpha(M7), length(M7)-anyalpha(M7)+1);
end;
else
do;
quantity2=M7;
unit2=M8;
end;
run;
This gives the following output:So, let's go through the result of step 3 by looking at the first observation again ''COCA COLA 6 LITER RINGNES 1.5*4L'':
The next step in the code consist of an if-statement to check if M2 consist of characters, because if it does then I need to divide it more with the same combination of the function substr with the functions anyalpha, anydigit and length. For the first observation, the M2 doesn't contain anything else than digits so it jumps down to the else statement and we get the following:
In the second part, I do the same once more on the remaining text R0='RINGNES 1.5*4L' to see if there are more quantities and units:
But now, the if-statement is true:
I then use a proc print to ''clean'' the output so it looks like this:
From experience, I know that there can be at least 3 digits in the text string or maybe more. The step 3 shows that there is a pattern in the code, so I could benefit of making a macro. In step 4, I ask that the macro will be run four times so I can be sure to find all the quantities and units in the text string, hence the following code:
/* Step 4: Making this with a macro */
* Delete unnecessary variables;
Data fil0(drop=t1 t2 t3 t4 t5 t6);
set fil1;
run;
%macro part(p0,p1);
Data fil&p1.;
set fil&p0.;
if &p1.=1 then
do;
*FIRST PART;
M1_&p1.=substr(Text,anydigit(Text), length(Text)-anydigit(Text)+1);
M3_&p1.=scan(M1_&p1.,1,'');
M4_&p1.=scan(M1_&p1.,2,'');
*SUBSTRN= behave in the same way as SUBSTR, but will not issue a NOTE message if it returns a null result;
R0_&p1.=substrn(M1_&p1.,length(M3_&p1.)+length(M4_&p1.)+3,length(M1_&p1.)-(length(M3_&p1.)+length(M4_&p1.)+2));
if anyalpha(M3_&p1.)>0 then
do;
quantity&p1.=substr(M3_&p1.,anydigit(M3_&p1.),anyalpha(M3_&p1.)-1);
unit&p1.=substr(M3_&p1., anyalpha(M3_&p1.),length(M3_&p1.)-anyalpha(M3_&p1.)+1);
end;
else
do;
quantity&p1.=M3_&p1.;
unit&p1.=M4_&p1.;
end;
end;
if ( &p1.>1 ) and ( length(R0_&p0.)>0 ) then
do;
*MORE PARTS;
M1_&p1.=substrn(R0_&p0.,anydigit(R0_&p0.),length(R0_&p0.)-anydigit(R0_&p0.)+1);
M3_&p1.=scan(M1_&p1.,1,'');
M4_&p1.=scan(M1_&p1.,2,'');
R0_&p1.= substrn(M1_&p1.,length(M3_&p1.)+length(M4_&p1.)+2,length(M1_&p1.)-(length(M3_&p1.)+length(M4_&p1.)+1));
if anyalpha(M3_&p1.)>0 then
do;
quantity&p1.=substrn(M3_&p1.,anydigit(M3_&p1.),anyalpha(M3_&p1.)-1);
unit&p1.=substrn(M3_&p1., anyalpha(M3_&p1.), length(M3_&p1.)-anyalpha(M3_&p1.)+1);
end;
else
do;
quantity&p1.=M3_&p1.;
unit&p1.=M4_&p1.;
end;
end;
run;
%mend;
%part(0,1);
%part(1,2);
%part(2,3);
%part(3,4);
Data output0(keep=Text quantity1 unit1 quantity2 unit2 quantity3 unit3 quantity4 unit4);
set fil4;
run;
The result looks like this:
In step 5, I try to clean up a bit so that the items with the same unit is written in the same way and this is also done with a macro:
/* Step 5: Looks for a word that has the same meaning and replaces them with the same unit */
%macro unit(p0, p1);
data output&p1.;
set output&p0.;
if Unit&p1. in ('K','KG','KG.','KILO') then
Unit&p1.='KG';
if Unit&p1. in ('GRAM','G','G.','GR','GR.','GRA','GRAM.') then
Unit&p1.='GR';
if Unit&p1. in ('MILIGRAM') then
Unit&p1.='MGR';
if Unit&p1. in ('C','CL.','CL') then
Unit&p1.='CL';
if Unit&p1. in ('LTR','L','L.','L..','LT','LIT','LITER','LITRE','LTR.','LITE') then
Unit&p1.='LTR';
if Unit&p1. in ('ML.','MLT','ML..','ML','MLITER','MILITER') then
Unit&p1.='MLT';
run;
%mend;
%unit(0,1);
%unit(1,2);
%unit(2,3);
%unit(3,4);
The output looks like this:
The work might not be completely finished, and further work would be to decide which quantity and unit to use for the different items. But I think the power of these functions might have been proven and hopefully you found this tip on how to use these functions useful!!!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.