I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables. How would I go about encoding them as integer labels or one-hot vectors? I've seen macros for this, but they seem impractical: from what I've seen, I would have to run them on every column separately, and similar columns (e.g. prod_1 and prod_2) might end up with different encodings.
Is there an action set or anything similar that, given a dataset, does the encoding for you? The same question applies to normalizing continuous variables.
id | prod_1 | time_gap_1 | flag_1 | age_1 | salary_1 | gender_1 | nationality_1 | prod_2 | time_gap_2 | flag_2 | age_2 | salary_2 | gender_2 | nationality_2 | prod_3 | time_gap_3 | flag_3 | age_3 | salary_3 | gender_3 | nationality_3 |
1 | A | 19 | 1 | 37 | 51794.77 | Female | Local | D | 14 | 0 | 37 | 51794.77 | Female | Local | A | 1 | 0 | 37 | 51794.77 | Female | Local |
2 | C | 20 | 1 | 21 | 62124.27 | Male | Expat | B | 30 | 0 | 21 | 62124.27 | Male | Expat | D | 24 | 1 | 21 | 62124.27 | Male | Expat |
3 | C | 15 | 0 | 40 | 79727.85 | Female | Local | A | 23 | 1 | 40 | 79727.85 | Female | Local | A | 8 | 1 | 40 | 79727.85 | Female | Local |
4 | D | 19 | 1 | 38 | 26712.37 | Male | Expat | C | 21 | 0 | 38 | 26712.37 | Male | Expat | D | 12 | 0 | 38 | 26712.37 | Male | Expat |
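To make the "same encoding for prod_1 and prod_2" requirement concrete, here is a minimal sketch in plain Python rather than SAS. It builds one shared vocabulary from all product columns and standardizes one continuous column. The helper names (`build_vocab`, `encode`, `standardize`) are hypothetical, the rows are copied from the sample above, and reserving code 0 is only an assumption, not something any SAS action requires.

```python
from statistics import mean, pstdev

# Product columns and one salary column taken from the sample table above.
rows = [
    {"id": 1, "prod_1": "A", "prod_2": "D", "prod_3": "A", "salary_1": 51794.77},
    {"id": 2, "prod_1": "C", "prod_2": "B", "prod_3": "D", "salary_1": 62124.27},
    {"id": 3, "prod_1": "C", "prod_2": "A", "prod_3": "A", "salary_1": 79727.85},
    {"id": 4, "prod_1": "D", "prod_2": "C", "prod_3": "D", "salary_1": 26712.37},
]


def build_vocab(rows, columns):
    """Map every distinct level seen in ANY of `columns` to one integer code."""
    levels = sorted({row[col] for row in rows for col in columns})
    # Codes start at 1 so that 0 stays free (assumption: 0 may serve as padding).
    return {level: code for code, level in enumerate(levels, start=1)}


def encode(rows, columns, vocab):
    """Return copies of the rows with `columns` replaced by their integer codes."""
    return [{**row, **{col: vocab[row[col]] for col in columns}} for row in rows]


def standardize(rows, column):
    """Return copies of the rows with `column` rescaled to zero mean, unit variance."""
    values = [row[column] for row in rows]
    mu, sigma = mean(values), pstdev(values)  # sigma must be non-zero
    return [{**row, column: (row[column] - mu) / sigma} for row in rows]


prod_cols = ["prod_1", "prod_2", "prod_3"]
prod_vocab = build_vocab(rows, prod_cols)
encoded = standardize(encode(rows, prod_cols, prod_vocab), "salary_1")
```

Because the vocabulary is built once from all three columns, product A gets the same code wherever it appears, which is the property the per-column macros do not guarantee.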
Why do you want to do this?
Most SAS procedures that support the CLASS statement do not require an integer representation of class variables.
This is the documentation I'm following: https://documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/cas-deeplearn-dltrain.htm
All my inputs are the columns (plus the sequence parameter for tokenSize). Can I just use the categorical features as-is as inputs? As far as I know, LSTM models require the features to be encoded. At least that is what I did in Python.
I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables.
I think this is incorrect in general. SAS has created a method to include categorical variables in a model, so you don't have to do this encoding. This method is the CLASS statement. See, for example, PROC GLM documentation, but this applies to all modeling procedures I know of.
Paige Miller
Can I just include categorical features in this as well? Still new to this one.
Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be a categorical variable when trying to train an LSTM model. Maybe I didn't add it, or the inputs, to the nominals argument.
The target, which is also a categorical variable, has to be specified in both the nominals argument and the target argument, right?
Great, I will test it out.
A simple follow-up to this: if customers can have different numbers of products in the sequence, I would have to pad the sequences until they are all the same length (similar to pad_sequences from TensorFlow). Should I do this before training, or is there a way to train on variable-length sequences? There is a missing argument, but apparently it only applies to regression models. There is also a forceEqualPadding argument, but that seems to be only for convolutional layers?
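For reference, the padding step itself is small. This is a minimal plain-Python sketch, not a SAS feature: `pad_to_equal_length` is a hypothetical helper, it pads (and truncates) at the end of each sequence, and using 0 as the padding value assumes that 0 is not already a real product code.

```python
def pad_to_equal_length(sequences, pad_value=0, max_len=None):
    """Pad each sequence at the end with `pad_value` so all have length `max_len`.

    If `max_len` is None, the length of the longest sequence is used.
    Sequences longer than `max_len` are truncated at the end.
    """
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        trimmed = list(seq[:max_len])
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded


# Three customers with 3, 1 and 2 products (integer-coded, codes start at 1).
customers = [[1, 4, 1], [3], [3, 2]]
padded = pad_to_equal_length(customers)
```

In the wide layout from the first post, this corresponds to filling the unused prod_k, time_gap_k, ... columns of shorter customers with the padding value before the table is loaded for training.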
There are some graphs on this documentation page showing how tokens are concatenated: https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm I think the padding would need to be done before training.
I wish there were a way to just input variable-length sequences.
@WendyCzika if you have any ideas on how to do this, please share. Thanks.
I tried inputting categorical variables into my LSTM model, and it errored out saying only target variables can be nominals. Encoding the products into integers gave me terrible results. Not sure what can be done in this case.
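One thing that might be worth trying, offered as a guess rather than a confirmed fix: integer codes imply an order (A < B < C < D) that the products do not have, whereas one-hot columns are numeric without implying any order. A minimal plain-Python sketch of the idea follows. The `one_hot` helper is hypothetical, the level list is taken from the sample data, and whether this improves the dlTrain results is untested.

```python
def one_hot(value, levels):
    """Return a 0/1 indicator list with a single 1 at the position of `value`."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels]


# One fixed level list shared by prod_1, prod_2 and prod_3.
levels = ["A", "B", "C", "D"]

# Product sequence of ID 1 from the sample table: A, D, A.
sequence = ["A", "D", "A"]
encoded_sequence = [one_hot(product, levels) for product in sequence]
```

Each product then becomes four numeric 0/1 inputs per time step instead of one nominal or integer-coded input.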