Hello all,
I have a dataset with 1000 observations and 6 variables (below is an example). The values of the variables bev_1-bev_5 are open ended text and contain different spelling variations. I'm creating additional variables which classify the responses in each bev_1-bev_5 column as 1=drink or 2=food and the type of drink. My goal is to set up a lookup table or essentially another dataset which includes all of the values of bev_1-bev_5 and the corresponding codes for my new variables. This lookup table will also have to include the spelling variations (e.g., Coke, Coca-cola, CocaCola, coke, etc. should all be coded as 1 for my product_Type variable, and should all be coded the same for any other variables I want to add).
Example dataset:
ID | bev_1 | bev_2 | bev_3 | bev_4 | bev_5 |
1 | Coke | Starbucks | Pepsi | CocaCola | fries |
2 | Coca-Cola | burger | pepsi | coke | redbull |
Question: I'm wondering what the best way would be to structure the lookup table so that for example, anytime SAS sees the word "Coke" or any of its variations (which I would include in my lookup table) in bev_1-bev_5, it will code it appropriately based on the new variables I'd like to add.
Lookup table that I want to link, essentially as a "dictionary" to my original dataset. Columns are the new variables I'll be adding:
Response and variations | product_type | type of drink |
Coke, Coca-cola, Cocacola | 1 | 1 |
burger | 2 | 0 |
redbull, Red Bull, RedBull | 1 | 3 |
As a pair for FORMATS (or INFORMATS) and not as a table at all.
Formats convert values to text. Informats convert text to values. If you want new PRODUCT_TYPE and TYPE_OF_DRINK values to be numbers then define INFORMATs and use the INPUT() function to convert the strings in BEV_1 to BEV_5 into numbers. But if you are ok with the results being strings you can just apply the format to BEV_1 to BEV_5 without having to modify the original table at all.
proc format ;
value $product
'Coke', 'Coca-cola', 'Cocacola'='1'
'burger'='2'
'redbull', 'Red Bull', 'RedBull'='1'
other = 'Unknown'
;
value $drink
'Coke', 'Coca-cola', 'Cocacola'='1'
'burger'='0'
'redbull', 'Red Bull', 'RedBull'='3'
other='Unknown'
;
run;
data have;
input ID (bev_1-bev_5) (:$10.) ;
cards;
1 Coke Starbucks Pepsi CocaCola fries
2 Coca-Cola burger pepsi coke redbull
;
proc print data=have;
format bev: $product.;
run;
Results:
Obs ID bev_1 bev_2 bev_3 bev_4 bev_5 1 1 1 Unknown Unknown Unknown Unknown 2 2 Unknown 2 Unknown Unknown 1
Thanks for your help Tom! As a follow up question, the reason why I was thinking of doing a table of some sort was because my dataset values are open text, with the different spelling variations I have thousands of unique responses. So i was trying to avoid having to manually type out each spelling variation as part of a SAS code but rather create a csv table which I could import into SAS and then link that csv to the following code that I was using to actually create the new variables I want.
data want;
set have;
array x{*} bev_1-bev_5;
array product_type {5};
if x{i} in ('Coke', 'Pepsi') then product_type{i}=1;
else if x{i}='burger' then product_type{i}=2;
end;
drop i;
run;
Thanks Reeza for your input. I've followed all of the steps as you described but I seem to still be stuck on linking the dataset I import from excel with my original dataset after using the cntlin function. This is the SAS code I've used:
proc transpose data=test out=bev_sample;
by id;
var bev_1-bev_5;
run;
*change all values to uppercase;
data bev_sample;
set bev_sample;
beverage=upcase(col1);
run;
*removing extra characters in values of COL1;
data bev_control;
set bev_sample;
beverage=compress(beverage);
run;
proc sort data=bev_control nodupkey;
by id;
run;
*setting fmtname, start, and label variables from bev_format dataset;
data bev_frmt;
rename id=start;
set bev_new;
label=map_code;
fmtname='$beverages';
run;
proc format cntlin=bev_frmt;
run;
*perform look up by reading in original dataset using the format I've created;
data bev_final;
set bev_control (where=(put(ID,$beverages.)='COKE'));
run;
Also just to clarify, this is an example of how I organized the excel file I read in as the format dataset:
ID | _NAME_ | COL1 | beverage | map_code | Response | product_type | drink_type |
03 | bev_1 | Coca cola | COCACOLA | COKE | 1 | 1 | 1 |
04 | bev_1 | Coke | COKE | COKE | 1 | 1 | 1 |
05 | bev_2 | Hamburger | HAMBURGER | HAMBURGER | 1 | 2 | 77 |
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.