KJazem
Obsidian | Level 7

I have a dataset that looks like this: for every ID, there is a list of products and the features related to each product. I want to feed this into my model, so I have to encode the categorical variables. How would I go about encoding them as integer labels or one-hot vectors? I've seen macros for this, but they seem impractical: from what I've seen, I would have to run them on every column separately, and similar columns (e.g. prod_1 and prod_2) might end up with different encodings.

 

Is there an action set or anything similar that, given a dataset, does the encoding for you? The same question applies to normalizing continuous variables.

 

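| id | prod_1 | time_gap_1 | flag_1 | age_1 | salary_1 | gender_1 | nationality_1 | prod_2 | time_gap_2 | flag_2 | age_2 | salary_2 | gender_2 | nationality_2 | prod_3 | time_gap_3 | flag_3 | age_3 | salary_3 | gender_3 | nationality_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 19 | 1 | 37 | 51794.77 | Female | Local | D | 14 | 0 | 37 | 51794.77 | Female | Local | A | 1 | 0 | 37 | 51794.77 | Female | Local |
| 2 | C | 20 | 1 | 21 | 62124.27 | Male | Expat | B | 30 | 0 | 21 | 62124.27 | Male | Expat | D | 24 | 1 | 21 | 62124.27 | Male | Expat |
| 3 | C | 15 | 0 | 40 | 79727.85 | Female | Local | A | 23 | 1 | 40 | 79727.85 | Female | Local | A | 8 | 1 | 40 | 79727.85 | Female | Local |
| 4 | D | 19 | 1 | 38 | 26712.37 | Male | Expat | C | 21 | 0 | 38 | 26712.37 | Male | Expat | D | 12 | 0 | 38 | 26712.37 | Male | Expat |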
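
To illustrate the concern about prod_1 and prod_2 receiving different encodings, here is a minimal Python sketch (not SAS code, and not something proposed in this thread) that builds one shared vocabulary across all product columns. The helper names `build_vocab` and `encode` are hypothetical, and only the product columns from the sample rows are included. Codes start at 1 on the assumption that 0 may later be wanted as a padding value.

```python
# Sketch: one shared integer encoding for prod_1, prod_2 and prod_3.
# Only the product columns of the four sample rows are reproduced here.
rows = [
    {"id": 1, "prod_1": "A", "prod_2": "D", "prod_3": "A"},
    {"id": 2, "prod_1": "C", "prod_2": "B", "prod_3": "D"},
    {"id": 3, "prod_1": "C", "prod_2": "A", "prod_3": "A"},
    {"id": 4, "prod_1": "D", "prod_2": "C", "prod_3": "D"},
]
prod_cols = ["prod_1", "prod_2", "prod_3"]


def build_vocab(rows, cols, start=1):
    """Map every distinct level seen in ANY of `cols` to one integer.

    Codes start at `start` (default 1) so that 0 stays free for padding.
    """
    levels = sorted({row[c] for row in rows for c in cols})
    return {level: code for code, level in enumerate(levels, start=start)}


def encode(rows, cols, vocab):
    """Return copies of the rows with `cols` replaced by their shared codes."""
    return [{**row, **{c: vocab[row[c]] for c in cols}} for row in rows]


vocab = build_vocab(rows, prod_cols)
encoded = encode(rows, prod_cols, vocab)

print(vocab)       # {'A': 1, 'B': 2, 'C': 3, 'D': 4}
print(encoded[0])  # {'id': 1, 'prod_1': 1, 'prod_2': 4, 'prod_3': 1}
```

Because the vocabulary is built from the union of the columns, product D maps to the same code whether it appears in prod_1, prod_2, or prod_3.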
11 REPLIES
PeterClemmensen
Tourmaline | Level 20

Why do you want to do this?

 

Most SAS procedures that support the CLASS statement do not require an integer representation of class variables.

KJazem
Obsidian | Level 7
I'm using the deeplearn action set in PROC CAS. Does it also support class variables as input?

This is the documentation I'm following: https://documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/cas-deeplearn-dltrain.htm

All my inputs are the columns (plus the sequence parameter for tokenSize). Can I just use the categorical features as-is as inputs? LSTM models, as far as I know, require the features to be encoded. That's at least what I did in Python.
PaigeMiller
Diamond | Level 26

I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables.

 

I think this is incorrect in general. SAS has created a method to include categorical variables in a model, so you don't have to do this encoding. This method is the CLASS statement. See, for example, PROC GLM documentation, but this applies to all modeling procedures I know of.

--
Paige Miller
KJazem
Obsidian | Level 7
I'm using the deeplearn action set to train my model (LSTM), and I assumed you have to encode categorical variables. My inputs will be the columns (features), and a target variable (which I forgot to add but it's just one column of products).

Can I just include categorical features in this as well? Still new to this one.
WendyCzika
SAS Employee
Yes, I think you can use the nominals= argument in the dlTrain action to indicate which inputs/target are categorical.
KJazem
Obsidian | Level 7

Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be categorical when training an LSTM model. Maybe I didn't add it, or the inputs, to the nominals.

The target, which is also a categorical variable, has to be specified both in the nominals argument and as the target argument, right?

WendyCzika
SAS Employee
I think so - I know that's true for other actions.
KJazem
Obsidian | Level 7

Great, I will test it out.

A simple follow-up to this: if customers can have different numbers of products in the sequence, I would have to pad the sequences until they are all the same length (similar to pad_sequences from TensorFlow). Should I do this before training, or is there a way to train on variable-length sequences? There's a missing argument, but apparently it only applies to regression models, and a forceEqualPadding argument, but that seems to apply only to convolutional layers?

lipcai
SAS Employee

There are some graphs on this documentation page showing how tokens are concatenated: https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm. I think the padding would need to be done before training.
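
As a rough picture of what padding before training could look like, here is a small Python sketch (not SAS code). The helper `pad_right` is hypothetical and is not the TensorFlow implementation. It assumes the product codes start at 1, so that 0 can be reserved as the padding value.

```python
def pad_right(seqs, pad_value=0, maxlen=None):
    """Right-pad each sequence with `pad_value` to a common length.

    If `maxlen` is omitted, the longest sequence sets the length.
    Sequences longer than `maxlen` keep only their last `maxlen` items.
    """
    if maxlen is None:
        maxlen = max((len(s) for s in seqs), default=0)
    padded = []
    for s in seqs:
        # Guard against maxlen == 0, because s[-0:] would return all of s.
        kept = list(s)[-maxlen:] if maxlen else []
        padded.append(kept + [pad_value] * (maxlen - len(kept)))
    return padded


# Three customers with 3, 2 and 1 products (integer-coded, codes start at 1).
sequences = [[1, 4, 1], [3, 2], [4]]

print(pad_right(sequences))            # [[1, 4, 1], [3, 2, 0], [4, 0, 0]]
print(pad_right(sequences, maxlen=2))  # [[4, 1], [3, 2], [4, 0]]
```

In the wide layout from the original post, this would correspond to filling the unused prod_k (and related feature) columns with the padding value for customers who have fewer products.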

KJazem
Obsidian | Level 7
If I pad with zeros, for example, how would the model training know to 'ignore' the padding? In TensorFlow you have a masking layer, or I think the model would implicitly ignore zeros (not positive).

I wish there was a way to just input variable-length sequences.

@WendyCzika if you have any ideas on how to do this, please share. Thanks.
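
For reference, the idea behind masking can be sketched in plain Python (not SAS code): a 0/1 mask marks the real time steps, and padded steps are left out when per-step values such as losses are averaged. The helpers `padding_mask` and `masked_mean` are hypothetical, the loss numbers are invented for illustration, and whether dlTrain offers an equivalent mechanism is not established in this thread.

```python
def padding_mask(padded, pad_value=0):
    """Return 1 for real tokens and 0 for padding, per position."""
    return [[0 if tok == pad_value else 1 for tok in seq] for seq in padded]


def masked_mean(values, mask):
    """Average `values` over positions where mask == 1 only."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count if count else 0.0


# Padded product sequences; 0 is assumed to be reserved for padding.
padded = [[1, 4, 1], [3, 2, 0], [4, 0, 0]]
mask = padding_mask(padded)
print(mask)  # [[1, 1, 1], [1, 1, 0], [1, 0, 0]]

# Made-up per-step losses for the second customer; the last step is padding.
step_losses = [0.5, 1.5, 9.0]
print(masked_mean(step_losses, mask[1]))  # 1.0
```

Without the mask, the padded step would pull the average up to about 3.67, which is the kind of distortion masking is meant to avoid.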

KJazem
Obsidian | Level 7
A quick follow-up.

I tried inputting categorical variables into my LSTM model, and it errored out saying only target variables can be nominals. Encoding the products into integers gave me terrible results. Not sure what can be done in this case.
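
One possible explanation for the poor results, offered as an assumption since nothing in the thread confirms it: if integer codes are read as ordinary numeric inputs, product D = 4 looks "larger" than product A = 1, which implies an order the products do not have. One-hot indicator columns avoid that implied order. The Python sketch below (not SAS code) shows the expansion; the helper `one_hot` and the vocabulary are hypothetical, and code 0 is assumed to be reserved for padding.

```python
# Hypothetical shared vocabulary for the products; 0 is reserved for padding.
vocab = {"A": 1, "B": 2, "C": 3, "D": 4}


def one_hot(code, num_levels):
    """Turn a code in 1..num_levels into a 0/1 indicator list.

    Code 0 (padding) becomes an all-zero vector.
    """
    if not 0 <= code <= num_levels:
        raise ValueError(f"code {code} outside 0..{num_levels}")
    return [1 if code == k else 0 for k in range(1, num_levels + 1)]


print(one_hot(vocab["C"], len(vocab)))  # [0, 0, 1, 0]
print(one_hot(0, len(vocab)))           # [0, 0, 0, 0]

# One customer's product sequence D, C, D expanded step by step.
sequence = ["D", "C", "D"]
expanded = [one_hot(vocab[p], len(vocab)) for p in sequence]
print(expanded)  # [[0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Each product then contributes one numeric 0/1 column per level instead of a single ordered number, at the cost of a wider input.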


Discussion stats
  • 11 replies
  • 2731 views
  • 5 likes
  • 5 in conversation