Is there a way to group nominal observations from a column to create a new variable?
Example:
Month Customer_ID Merchant Amount Count
Apr-18 12345 Walmart $20.00 5
Apr-18 23456 Exxon $35.00 1
May-18 67891 Home Depot $60.14 2
May-18 34567 Lowes $1200 5
Jun-18 45678 Shell $25.00 3
Jun-18 56789 McDonalds $10.00 1
In the above example, if I would like to create a new column 'Category' that has all gas charges as Gas (Exxon, Shell), all home improvement charges as Home Improvement (Lowes, Home Depot), Restaurants (McDonalds) from the above example.
The issue is that there are ~5000 merchants in several months of data and is there an optimum way to group these into specific categories for further analysis/reporting/analytics?
Thank you.
In my view, best way is to have a lookup table, something like below and then picking up category by doing join on Merchant
You can update or add to lookup on regular basis. This helps to maintain things easily.
Merchant Category
Exxon Gas
Shell Gas
Lowes Home Improvement
Home Depot Home Improvement
Start with putting your relationships merchant-category into a dataset. Create a format from that dataset, and use it to create the category variable when you need it.
Edit: see some example code:
data lookup;
infile cards dlm=',';
input merchant :$20. category :$20.;
cards;
Walmart,Retail
Exxon,Gas
Home Depot,Home Improvement
Lowes,Home Improvement
Shell,Gas
McDonalds,Restaurants
;
run;
data cntlin;
set lookup (rename=(merchant=start category=label));
fmtname = 'category';
type = 'C';
run;
proc format cntlin=cntlin;
run;
data have;
infile cards dlm=',';
input month $ customer_id $ merchant :$20. amount count;
category = put(merchant,$category.);
cards;
Apr-18,12345,Walmart,20,5
;
run;
Thank you, Mr. KurtBremser. I appreciate your quick response.
Just to clarify, did you mean to say I have to have a separate dataset that has 2 columns: 1. Category, and 2. Merchant?
As the number of merchants are ~5000 in multiple categories, if there is a way to optimize adding the category column that has multiple merchants? Also, the list of merchants per category could be dynamic as the customers could go to new merchants to do their shopping, so the new merchants need to be added if they do not exist in the current dataset.
Setting up the lookup dataset is of course initial work, but once you have it, you have it, and it only needs maintenance. And it keeps data out of code.
And how does a customer changing merchants influence that merchant's category? That's highly illogical.
Hi Mr. KurtBremser,
I finally used your idea on the lookup table. I appreciate your help.
Thank you,
In my view, best way is to have a lookup table, something like below and then picking up category by doing join on Merchant
You can update or add to lookup on regular basis. This helps to maintain things easily.
Merchant Category
Exxon Gas
Shell Gas
Lowes Home Improvement
Home Depot Home Improvement
Kiran V,
Thank you. I followed the lookup table advise to address the issue. I appreciate your help.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.