I have a dataset that looks like this: for every ID, there is a list of products and the features related to each product. I want to feed this into my model, so I have to encode the categorical variables. How would I go about encoding them as integer labels or one-hot vectors? I've seen macros for this, but they seem impractical: from what I've seen, I would have to run them on every column separately, and similar columns (e.g. prod_1 and prod_2) might end up with different encodings.
Is there an action set or anything similar that, when given a dataset, does the encoding for you? The same question applies to normalizing continuous variables.
id | prod_1 | time_gap_1 | flag_1 | age_1 | salary_1 | gender_1 | nationality_1 | prod_2 | time_gap_2 | flag_2 | age_2 | salary_2 | gender_2 | nationality_2 | prod_3 | time_gap_3 | flag_3 | age_3 | salary_3 | gender_3 | nationality_3 |
1 | A | 19 | 1 | 37 | 51794.77 | Female | Local | D | 14 | 0 | 37 | 51794.77 | Female | Local | A | 1 | 0 | 37 | 51794.77 | Female | Local |
2 | C | 20 | 1 | 21 | 62124.27 | Male | Expat | B | 30 | 0 | 21 | 62124.27 | Male | Expat | D | 24 | 1 | 21 | 62124.27 | Male | Expat |
3 | C | 15 | 0 | 40 | 79727.85 | Female | Local | A | 23 | 1 | 40 | 79727.85 | Female | Local | A | 8 | 1 | 40 | 79727.85 | Female | Local |
4 | D | 19 | 1 | 38 | 26712.37 | Male | Expat | C | 21 | 0 | 38 | 26712.37 | Male | Expat | D | 12 | 0 | 38 | 26712.37 | Male | Expat |
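To make the "same encoding for similar columns" concern concrete, here is a minimal sketch in plain Python (not SAS, and not any particular action set). It builds a single vocabulary across prod_1, prod_2 and prod_3 so that a product gets the same integer code in every column. The names rows, code_of and one_hot are hypothetical, and the rows only copy the product columns from the sample above.

```python
# Product columns copied from the sample table above (other columns omitted).
rows = [
    {"id": 1, "prod_1": "A", "prod_2": "D", "prod_3": "A"},
    {"id": 2, "prod_1": "C", "prod_2": "B", "prod_3": "D"},
    {"id": 3, "prod_1": "C", "prod_2": "A", "prod_3": "A"},
    {"id": 4, "prod_1": "D", "prod_2": "C", "prod_3": "D"},
]
prod_cols = ["prod_1", "prod_2", "prod_3"]

# Build ONE vocabulary from all product columns, so "A" maps to the same
# code whether it appears in prod_1, prod_2 or prod_3.
levels = sorted({row[col] for row in rows for col in prod_cols})
code_of = {level: i for i, level in enumerate(levels)}


def one_hot(code, n_levels):
    """Return a one-hot list with a 1 at position `code`."""
    vec = [0] * n_levels
    vec[code] = 1
    return vec


# Integer-label encoding of every product column with the shared mapping.
encoded = [{col: code_of[row[col]] for col in prod_cols} for row in rows]
```

The point of the design is that the mapping is fitted once on the union of the columns and then reused, rather than being fitted per column.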
Why do you want to do this?
Most SAS procedures that support the CLASS statement do not require an integer representation of class variables.
I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables.
I think this is incorrect in general. SAS has created a method to include categorical variables in a model, so you don't have to do this encoding. This method is the CLASS statement. See, for example, PROC GLM documentation, but this applies to all modeling procedures I know of.
Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be a categorical variable when trying to train an LSTM model. Maybe I didn't add it (or the inputs) to the nominals.
The target, which is also a categorical variable, has to be specified both in the nominals argument and as the target argument, right?
Great, I will test it out.
A simple follow-up to this: if customers can have different numbers of products in the sequence, I would have to pad until the sequences are all the same length (similar to pad_sequences from tensorflow). Should I do this before training, or is there a way to train on variable-length sequences? There's a missing argument, but it apparently only applies to regression models. There's also a forceEqualPadding argument, but only for convolutional layers?
There are some graphs on this documentation page showing how the tokens are concatenated. https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm I think the padding would need to be done before training.
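As a rough illustration of padding before training, here is a minimal sketch in plain Python. It is not the tensorflow function and not a SAS action; pad_to_same_length and its parameters are hypothetical names. It assumes the product codes start at 1 so that 0 can be reserved as the padding value, and it pads on the right.

```python
def pad_to_same_length(sequences, pad_value=0, max_len=None):
    """Right-pad (and truncate, if needed) each sequence to a common length.

    If max_len is not given, the length of the longest sequence is used.
    """
    if max_len is None:
        max_len = max((len(seq) for seq in sequences), default=0)
    return [
        # A negative repeat count yields an empty list, so longer
        # sequences are simply truncated to max_len.
        list(seq[:max_len]) + [pad_value] * (max_len - len(seq))
        for seq in sequences
    ]


# Hypothetical customers with 3, 2 and 1 products (codes start at 1).
product_sequences = [[1, 4, 1], [3, 2], [3]]
padded = pad_to_same_length(product_sequences)
```

In a wide table like the one in the question, this corresponds to filling the unused prod_N columns (and their feature columns) with a reserved value for customers who have fewer products.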