BookmarkSubscribeRSS Feed
monsterpie
Obsidian | Level 7

Hello all,

 

I have a dataset with 1000 observations and 6 variables (below is an example). The values of the variables bev_1-bev_5 are open ended text and contain different spelling variations. I'm creating additional variables which classify the responses in each bev_1-bev_5 column as 1=drink or 2=food and the type of drink. My goal is to set up a lookup table or essentially another dataset which includes all of the values of bev_1-bev_5 and the corresponding codes for my new variables. This lookup table will also have to include the spelling variations (e.g., Coke, Coca-cola, CocaCola, coke, etc. should all be coded as 1 for my product_Type variable, and should all be coded the same for any other variables I want to add).

Example dataset:

IDbev_1bev_2bev_3bev_4bev_5
1CokeStarbucksPepsiCocaColafries
2Coca-Colaburgerpepsicokeredbull

 

Question: I'm wondering what the best way would be to structure the lookup table so that for example, anytime SAS sees the word "Coke" or any of its variations (which I would include in my lookup table) in bev_1-bev_5, it will code it appropriately based on the new variables I'd like to add.

 

Lookup table that I want to link, essentially as a "dictionary" to my original dataset. Columns are the new variables I'll be adding:

Response and variationsproduct_typetype of drink
Coke, Coca-cola, Cocacola11
burger20
redbull, Red Bull, RedBull13
6 REPLIES 6
Tom
Super User Tom
Super User

As a pair for FORMATS (or INFORMATS) and not as a table at all.

Formats convert values to text.  Informats convert text to values.  If you want new PRODUCT_TYPE and TYPE_OF_DRINK values to be numbers then define INFORMATs and use the INPUT() function to convert the strings in BEV_1 to BEV_5 into numbers.   But if you are ok with the results being strings you can just apply the format to BEV_1 to BEV_5 without having to modify the original table at all.

proc format ;
value $product 
  'Coke', 'Coca-cola', 'Cocacola'='1'
  'burger'='2'
  'redbull', 'Red Bull', 'RedBull'='1'
  other = 'Unknown'
;
value $drink 
  'Coke', 'Coca-cola', 'Cocacola'='1'
  'burger'='0'
  'redbull', 'Red Bull', 'RedBull'='3'
  other='Unknown'
;
run;

data have;
  input ID (bev_1-bev_5) (:$10.) ;
cards;
1 Coke Starbucks Pepsi CocaCola fries
2 Coca-Cola burger pepsi coke redbull
;

proc print data=have;
  format bev: $product.;
run;

Results:

Obs    ID    bev_1      bev_2       bev_3      bev_4     bev_5

 1      1    1          Unknown    Unknown    Unknown    Unknown
 2      2    Unknown    2          Unknown    Unknown    1
monsterpie
Obsidian | Level 7

Thanks for your help Tom! As a follow up question, the reason why I was thinking of doing a table of some sort was because my dataset values are open text, with the different spelling variations I have thousands of unique responses. So i was trying to avoid having to manually type out each spelling variation as part of a SAS code but rather create a csv table which I could import into SAS and then link that csv to the following code that I was using to actually create the new variables I want.

data want;
set have;
	array x{*} bev_1-bev_5;
	array product_type {5};
		if x{i} in ('Coke', 'Pepsi') then product_type{i}=1;
		else if x{i}='burger' then product_type{i}=2;
	end;
	drop i;
run;

 

Tom
Super User Tom
Super User
You can create format/informat from data. Look at the CNTLIN= option of PROC FORMAT.
Reeza
Super User
1. Transpose your data so that you have all values in one column (bev1-bev5 => beverage via PROC TRANSPOSE)
2. Upcase/lower case all values (upper/lower), remove extra characters with COMPRESS() -> this way COKE=coke=COke and Coca-Cola = cocacola.
3. Sort and remove duplicates (PROC SORT + NODUPKEY)
4. Export to Excel/CSV
5. Clean file or organize such that you have something as follows:

Response ProductType DrinkType
coke 1 1
cocacola 1 1


6. Reimport the file
7. Create a data set using CNTLIN - think there was an example in the last paper I linked you too - Title: PROC FORMAT: Not Just Another Pretty Face.


monsterpie
Obsidian | Level 7

Thanks Reeza for your input. I've followed all of the steps as you described but I seem to still be stuck on linking the dataset I import from excel with my original dataset after using the cntlin function. This is the SAS code I've used:

 

proc transpose data=test out=bev_sample;
by id;
var bev_1-bev_5;
run;

*change all values to uppercase;
data bev_sample;
set bev_sample;
beverage=upcase(col1);
run;

*removing extra characters in values of COL1;
data bev_control;
set bev_sample;
beverage=compress(beverage);
run;

proc sort data=bev_control nodupkey;
by id;
run;

*setting fmtname, start, and label variables from bev_format dataset;
data bev_frmt;
rename id=start;
set bev_new;
label=map_code;
fmtname='$beverages';
run;

proc format cntlin=bev_frmt;
run;

*perform look up by reading in original dataset using the format I've created;
data bev_final;
set bev_control (where=(put(ID,$beverages.)='COKE'));
run;

Also just to clarify, this is an example of how I organized the excel file I read in as the format dataset:

ID_NAME_COL1beveragemap_codeResponseproduct_typedrink_type
03bev_1Coca colaCOCACOLACOKE111
04bev_1CokeCOKECOKE111
05bev_2HamburgerHAMBURGERHAMBURGER1277
Reeza
Super User
You still need to use the array/loop from before but now instead of a thousand different IF/THEN statements your loop becomes a single one.

do i=1 to dim(bev);
_new_value(i) = put(bev(i), $beverages.);
end;

If you want to map multiple values for the same code, ie response, product_type, drink_type then a merge may be a better solution. What you would do in that case is take your transposed data set (bev_sample) and merge with the lookup table you created for the format. Then you can use PROC TRANSPOSE to reflip it back to a wider format, if desired. Hopefully you're realizing that data in a long format can be much easier to work with.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1658 views
  • 0 likes
  • 3 in conversation