I have a dataset that looks like this: for every ID, there is a list of products and the features related to each product. I want to feed that into my model, so I have to encode the categorical variables. How would I go about encoding them as integer labels or one-hot vectors? I've seen macros for this, but they seem impractical: from what I've seen, I would have to run them on every column separately, and similar columns (e.g. prod_1 and prod_2) might end up with different encodings.
Is there an action set or anything similar that, given a dataset, does the encoding for you? The same question applies to normalizing continuous variables.
id | prod_1 | time_gap_1 | flag_1 | age_1 | salary_1 | gender_1 | nationality_1 | prod_2 | time_gap_2 | flag_2 | age_2 | salary_2 | gender_2 | nationality_2 | prod_3 | time_gap_3 | flag_3 | age_3 | salary_3 | gender_3 | nationality_3 |
1 | A | 19 | 1 | 37 | 51794.77 | Female | Local | D | 14 | 0 | 37 | 51794.77 | Female | Local | A | 1 | 0 | 37 | 51794.77 | Female | Local |
2 | C | 20 | 1 | 21 | 62124.27 | Male | Expat | B | 30 | 0 | 21 | 62124.27 | Male | Expat | D | 24 | 1 | 21 | 62124.27 | Male | Expat |
3 | C | 15 | 0 | 40 | 79727.85 | Female | Local | A | 23 | 1 | 40 | 79727.85 | Female | Local | A | 8 | 1 | 40 | 79727.85 | Female | Local |
4 | D | 19 | 1 | 38 | 26712.37 | Male | Expat | C | 21 | 0 | 38 | 26712.37 | Male | Expat | D | 12 | 0 | 38 | 26712.37 | Male | Expat |
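To make the concern about inconsistent encodings concrete, here is a minimal sketch in plain Python (not a SAS macro or action set) that builds one mapping from the values seen across all prod_* columns and reuses it for each of them. The function names (build_shared_mapping, encode_labels, one_hot) are hypothetical helpers written for this illustration, and the rows are a cut-down copy of the table above.

```python
from typing import Dict, List

# Toy rows mirroring only the prod_1..prod_3 columns of the table above.
rows = [
    {"id": 1, "prod_1": "A", "prod_2": "D", "prod_3": "A"},
    {"id": 2, "prod_1": "C", "prod_2": "B", "prod_3": "D"},
    {"id": 3, "prod_1": "C", "prod_2": "A", "prod_3": "A"},
    {"id": 4, "prod_1": "D", "prod_2": "C", "prod_3": "D"},
]


def build_shared_mapping(rows: List[dict], columns: List[str]) -> Dict[str, int]:
    """Assign one integer per distinct value seen in ANY of the columns."""
    levels = sorted({row[col] for row in rows for col in columns})
    return {level: index for index, level in enumerate(levels)}


def encode_labels(rows: List[dict], columns: List[str],
                  mapping: Dict[str, int]) -> List[dict]:
    """Return copies of the rows with the given columns replaced by codes."""
    return [
        {**row, **{col: mapping[row[col]] for col in columns}}
        for row in rows
    ]


def one_hot(code: int, n_levels: int) -> List[int]:
    """Expand an integer code into a one-hot list of length n_levels."""
    return [1 if i == code else 0 for i in range(n_levels)]


prod_columns = ["prod_1", "prod_2", "prod_3"]
mapping = build_shared_mapping(rows, prod_columns)
encoded = encode_labels(rows, prod_columns, mapping)
```

Because the mapping is built once from all three columns, product "A" receives the same code whether it appears in prod_1 or prod_3. The same idea would apply to gender_* and nationality_*.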
Why do you want to do this?
Most SAS procedures that support the CLASS statement do not require an integer representation of class variables.
"I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables."
I think this is incorrect in general. SAS has created a method to include categorical variables in a model, so you don't have to do this encoding. This method is the CLASS statement. See, for example, PROC GLM documentation, but this applies to all modeling procedures I know of.
Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be a categorical variable when I tried to train an LSTM model. Maybe I didn't add the target, or the inputs, to the nominals argument.
The target, which is also a categorical variable, has to be specified both in the nominals argument and as the target argument, right?
Great, I will test it out.
A simple follow-up to this: if customers can have a different number of products in the sequence, I would have to pad the sequences until they are all the same length (similar to pad_sequences from tensorflow). Should I do this before training, or is there a way to train on variable-length sequences? There's a missing argument, but it apparently applies only to regression models. There's also a forceEqualPadding argument, but is that only for convolutional layers?
There are some graphs on this documentation page showing how tokens are concatenated: https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm I think the padding would need to be done before training.
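As a rough illustration of padding before training, the following plain-Python sketch pads (or truncates) each customer's product sequence to a common length. The helper name pad_to_equal_length and the choice of 0 as the padding value are assumptions made for this example, not part of any SAS action. It also assumes the product codes start at 1 so that 0 is free to mean "no product".

```python
from typing import List, Optional, Sequence


def pad_to_equal_length(sequences: Sequence[Sequence[int]],
                        pad_value: int = 0,
                        max_len: Optional[int] = None) -> List[List[int]]:
    """Right-pad (or truncate) every sequence to the same length.

    If max_len is not given, the longest sequence sets the target length.
    """
    if max_len is None:
        max_len = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_len])  # drop anything past max_len
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded


# Hypothetical encoded product sequences: codes 1..4, with 0 reserved for padding.
customers = [[1, 4, 1], [3, 2], [3]]
padded = pad_to_equal_length(customers)
```

Each padded row can then be spread across the prod_1, prod_2, prod_3 columns before the table is loaded for training. Whether the model treats the padding value specially depends on the training action, so that part still needs to be checked against the documentation.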