Re: Use SAS to create a lookup table with key values

monsterpie · Posted 05-27-2021 04:25 PM

Hello all,

I have a dataset with 1000 observations and 6 variables (below is an example). The values of the variables bev_1-bev_5 are open ended text and contain different spelling variations. I'm creating additional variables which classify the responses in each bev_1-bev_5 column as 1=drink or 2=food and the type of drink. My goal is to set up a lookup table or essentially another dataset which includes all of the values of bev_1-bev_5 and the corresponding codes for my new variables. This lookup table will also have to include the spelling variations (e.g., Coke, Coca-cola, CocaCola, coke, etc. should all be coded as 1 for my product_Type variable, and should all be coded the same for any other variables I want to add).

Example dataset:

ID	bev_1	bev_2	bev_3	bev_4	bev_5
1	Coke	Starbucks	Pepsi	CocaCola	fries
2	Coca-Cola	burger	pepsi	coke	redbull

Question: I'm wondering what the best way would be to structure the lookup table so that for example, anytime SAS sees the word "Coke" or any of its variations (which I would include in my lookup table) in bev_1-bev_5, it will code it appropriately based on the new variables I'd like to add.

Lookup table that I want to link, essentially as a "dictionary" to my original dataset. Columns are the new variables I'll be adding:

Response and variations	product_type	type of drink
Coke, Coca-cola, Cocacola	1	1
burger	2	0
redbull, Red Bull, RedBull	1	3

Tom · Posted 05-27-2021 04:39 PM

As a pair for FORMATS (or INFORMATS) and not as a table at all.

Formats convert values to text. Informats convert text to values. If you want new PRODUCT_TYPE and TYPE_OF_DRINK values to be numbers then define INFORMATs and use the INPUT() function to convert the strings in BEV_1 to BEV_5 into numbers. But if you are ok with the results being strings you can just apply the format to BEV_1 to BEV_5 without having to modify the original table at all.

proc format ;
value $product 
  'Coke', 'Coca-cola', 'Cocacola'='1'
  'burger'='2'
  'redbull', 'Red Bull', 'RedBull'='1'
  other = 'Unknown'
;
value $drink 
  'Coke', 'Coca-cola', 'Cocacola'='1'
  'burger'='0'
  'redbull', 'Red Bull', 'RedBull'='3'
  other='Unknown'
;
run;

data have;
  input ID (bev_1-bev_5) (:$10.) ;
cards;
1 Coke Starbucks Pepsi CocaCola fries
2 Coca-Cola burger pepsi coke redbull
;

proc print data=have;
  format bev: $product.;
run;

Results:

Obs    ID    bev_1      bev_2       bev_3      bev_4     bev_5

 1      1    1          Unknown    Unknown    Unknown    Unknown
 2      2    Unknown    2          Unknown    Unknown    1

monsterpie · Posted 05-27-2021 04:51 PM

Thanks for your help Tom! As a follow up question, the reason why I was thinking of doing a table of some sort was because my dataset values are open text, with the different spelling variations I have thousands of unique responses. So i was trying to avoid having to manually type out each spelling variation as part of a SAS code but rather create a csv table which I could import into SAS and then link that csv to the following code that I was using to actually create the new variables I want.

data want;
set have;
	array x{*} bev_1-bev_5;
	array product_type {5};
		if x{i} in ('Coke', 'Pepsi') then product_type{i}=1;
		else if x{i}='burger' then product_type{i}=2;
	end;
	drop i;
run;

Tom · Posted 05-27-2021 04:58 PM

You can create format/informat from data. Look at the CNTLIN= option of PROC FORMAT.

Reeza · Posted 05-27-2021 05:12 PM

1. Transpose your data so that you have all values in one column (bev1-bev5 => beverage via PROC TRANSPOSE)
2. Upcase/lower case all values (upper/lower), remove extra characters with COMPRESS() -> this way COKE=coke=COke and Coca-Cola = cocacola.
3. Sort and remove duplicates (PROC SORT + NODUPKEY)
4. Export to Excel/CSV
5. Clean file or organize such that you have something as follows:

Response ProductType DrinkType
coke 1 1
cocacola 1 1

6. Reimport the file
7. Create a data set using CNTLIN - think there was an example in the last paper I linked you too - Title: PROC FORMAT: Not Just Another Pretty Face.

monsterpie · Posted 05-30-2021 04:43 PM

Thanks Reeza for your input. I've followed all of the steps as you described but I seem to still be stuck on linking the dataset I import from excel with my original dataset after using the cntlin function. This is the SAS code I've used:

proc transpose data=test out=bev_sample;
by id;
var bev_1-bev_5;
run;

*change all values to uppercase;
data bev_sample;
set bev_sample;
beverage=upcase(col1);
run;

*removing extra characters in values of COL1;
data bev_control;
set bev_sample;
beverage=compress(beverage);
run;

proc sort data=bev_control nodupkey;
by id;
run;

*setting fmtname, start, and label variables from bev_format dataset;
data bev_frmt;
rename id=start;
set bev_new;
label=map_code;
fmtname='$beverages';
run;

proc format cntlin=bev_frmt;
run;

*perform look up by reading in original dataset using the format I've created;
data bev_final;
set bev_control (where=(put(ID,$beverages.)='COKE'));
run;

Also just to clarify, this is an example of how I organized the excel file I read in as the format dataset:

ID	_NAME_	COL1	beverage	map_code	Response	product_type	drink_type
03	bev_1	Coca cola	COCACOLA	COKE	1	1	1
04	bev_1	Coke	COKE	COKE	1	1	1
05	bev_2	Hamburger	HAMBURGER	HAMBURGER	1	2	77

Reeza · Posted 05-31-2021 11:56 AM

You still need to use the array/loop from before but now instead of a thousand different IF/THEN statements your loop becomes a single one.

do i=1 to dim(bev);
_new_value(i) = put(bev(i), $beverages.);
end;

If you want to map multiple values for the same code, ie response, product_type, drink_type then a merge may be a better solution. What you would do in that case is take your transposed data set (bev_sample) and merge with the lookup table you created for the format. Then you can use PROC TRANSPOSE to reflip it back to a wider format, if desired. Hopefully you're realizing that data in a long format can be much easier to work with.

Use SAS to create a lookup table with key values