Re: Encoding categorical features in dataset

KJazem · Posted 09-09-2022 09:21 AM

I have a dataset that looks like this: for every ID, there is a list of products and the features related to each product. I want to feed this into my model, so I have to encode the categorical variables. How would I go about encoding them as integer labels or one-hot vectors? I've seen macros for this, but they seem impractical: from what I've seen, I would have to run them on every column separately, and similar columns (e.g. prod_1 and prod_2) might end up with different encodings.

Is there an action set or anything similar that, given a dataset, does the encoding for you? The same question applies to normalizing continuous variables.

id	prod_1	time_gap_1	flag_1	age_1	salary_1	gender_1	nationality_1	prod_2	time_gap_2	flag_2	age_2	salary_2	gender_2	nationality_2	prod_3	time_gap_3	flag_3	age_3	salary_3	gender_3	nationality_3
1	A	19	1	37	51794.77	Female	Local	D	14	0	37	51794.77	Female	Local	A	1	0	37	51794.77	Female	Local
2	C	20	1	21	62124.27	Male	Expat	B	30	0	21	62124.27	Male	Expat	D	24	1	21	62124.27	Male	Expat
3	C	15	0	40	79727.85	Female	Local	A	23	1	40	79727.85	Female	Local	A	8	1	40	79727.85	Female	Local
4	D	19	1	38	26712.37	Male	Expat	C	21	0	38	26712.37	Male	Expat	D	12	0	38	26712.37	Male	Expat
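To illustrate the concern about prod_1 and prod_2 getting different encodings, here is a minimal Python sketch (not SAS, and not an existing macro or action set) that builds one shared label mapping across all the product columns and z-scores a continuous column. The helper names `build_vocab`, `encode`, and `standardize` are hypothetical, and the rows are a cut-down copy of the sample data above.

```python
from statistics import mean, pstdev

# Cut-down copy of the sample rows above (only the columns used here).
rows = [
    {"id": 1, "prod_1": "A", "prod_2": "D", "prod_3": "A", "salary_1": 51794.77},
    {"id": 2, "prod_1": "C", "prod_2": "B", "prod_3": "D", "salary_1": 62124.27},
    {"id": 3, "prod_1": "C", "prod_2": "A", "prod_3": "A", "salary_1": 79727.85},
    {"id": 4, "prod_1": "D", "prod_2": "C", "prod_3": "D", "salary_1": 26712.37},
]


def build_vocab(rows, columns):
    """Map every distinct value seen in ANY of the columns to one integer.

    Labels start at 1; 0 is left unused so it can later stand for padding
    (an assumption of this sketch, not a SAS convention).
    """
    values = sorted({row[col] for row in rows for col in columns})
    return {value: index for index, value in enumerate(values, start=1)}


def encode(rows, columns, vocab):
    """Return one list of integer labels per row, in column order."""
    return [[vocab[row[col]] for col in columns] for row in rows]


def standardize(values):
    """Z-score a continuous column using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


prod_cols = ["prod_1", "prod_2", "prod_3"]
vocab = build_vocab(rows, prod_cols)       # same mapping for all three columns
encoded = encode(rows, prod_cols, vocab)
salary_z = standardize([row["salary_1"] for row in rows])
```

Because the mapping is built from the union of all product columns, product "A" gets the same label whether it appears in prod_1 or prod_3.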

PeterClemmensen · Posted 09-09-2022 09:23 AM

Why do you want to do this?

Most SAS procedures that support the CLASS statement do not require an integer representation of class variables.


KJazem · Posted 09-09-2022 09:41 AM

I'm using the deeplearn action set in PROC CAS. Does it also support class variables as input?

This is the documentation I'm following: https://documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/cas-deeplearn-dltrain.htm

All my inputs are the columns (plus the sequence parameter for tokenSize). Can I just use the categorical features as-is as inputs? As far as I know, LSTM models require the features to be encoded. At least, that's what I did in Python.

PaigeMiller · Posted 09-09-2022 09:26 AM

I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables.

I think this is incorrect in general. SAS has created a method to include categorical variables in a model, so you don't have to do this encoding. This method is the CLASS statement. See, for example, PROC GLM documentation, but this applies to all modeling procedures I know of.

--
Paige Miller

KJazem · Posted 09-09-2022 09:37 AM

I'm using the deeplearn action set to train my model (LSTM), and I assumed you have to encode categorical variables. My inputs will be the columns (features), and a target variable (which I forgot to add but it's just one column of products).

Can I just include categorical features in this as well? Still new to this one.

WendyCzika · Posted 09-09-2022 11:29 AM

Yes, I think you can use the nominals= argument in the dlTrain action to indicate which inputs and targets are categorical.

KJazem · Posted 09-09-2022 12:04 PM

Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be a categorical variable when training an LSTM model. Maybe I didn't list the target or the inputs in nominals.

The target, which is also a categorical variable, has to be specified both in the nominals argument and in the target argument, right?

WendyCzika · Posted 09-09-2022 12:10 PM

I think so - I know that's true for other actions.

KJazem · Posted 09-09-2022 12:14 PM

Great, I will test it out.

A simple follow-up to this: if customers can have a different number of products in their sequence, I would have to pad until all sequences are the same length (similar to pad_sequences from tensorflow). Should I do this before training, or is there a way to train on variable-length sequences? There's a missing argument, but apparently it only applies to regression models. There's also a forceEqualPadding argument, but is that only for convolutional layers?

lipcai · Posted 09-09-2022 02:58 PM

There are some graphs on this documentation page showing how tokens are concatenated: https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm I think the padding would need to be done before training.
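As a rough illustration of padding before training, here is a minimal plain-Python sketch. It is not a SAS action; `pad_sequences` here is a hypothetical helper written for this example (it only borrows the name from the tensorflow function mentioned above), and using 0 as the padding label is an assumption that requires real labels to start at 1.

```python
PAD = 0  # assumed padding label; real labels must not use this value


def pad_sequences(sequences, pad_value=PAD):
    """Pad every sequence at the end to the length of the longest one.

    Returns the padded sequences plus the original lengths, so the real
    positions can still be told apart from padding later.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]
    return padded, lengths


# Three customers with 3, 2, and 1 products, already integer-encoded.
sequences = [[1, 4, 1], [3, 2], [4]]
padded, lengths = pad_sequences(sequences)
```

Keeping the original lengths alongside the padded data costs little and leaves the option of excluding padded positions in a later step.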

KJazem · Posted 09-10-2022 02:58 PM

If I pad with zeros, for example, how would model training know to "ignore" the padding? In tensorflow you have a masking layer, or I think the model would implicitly ignore zeros (I'm not positive about that).

I wish there was a way to just input variable-length sequences.

@WendyCzika if you have any ideas on how to do this, please share. Thanks.
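For reference, the general idea behind masking can be sketched in a few lines of plain Python. This only illustrates the concept; nothing in this thread establishes whether or how the deeplearn action set supports it. The helper names `padding_mask` and `masked_mean`, the padding value 0, and the per-step numbers are all made up for the example.

```python
def padding_mask(padded, pad_value=0):
    """1 for real positions, 0 for padded positions."""
    return [[0 if token == pad_value else 1 for token in seq] for seq in padded]


def masked_mean(per_step_values, mask):
    """Average per-step values over real positions only."""
    total = sum(
        value * keep
        for row_values, row_mask in zip(per_step_values, mask)
        for value, keep in zip(row_values, row_mask)
    )
    count = sum(sum(row_mask) for row_mask in mask)
    return total / count


padded = [[1, 4, 1], [3, 2, 0], [4, 0, 0]]
mask = padding_mask(padded)

# Made-up per-step losses; the 9.0 entries sit on padded positions.
losses = [[0.5, 0.5, 0.5], [1.0, 1.0, 9.0], [2.0, 9.0, 9.0]]
loss = masked_mean(losses, mask)  # the 9.0 values do not contribute
```

A masking layer does something along these lines internally: padded steps are excluded, so they do not influence the loss.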

KJazem · Posted 09-12-2022 02:18 PM

A quick follow-up.

I tried inputting categorical variables into my LSTM model, and it errored out saying only target variables can be nominals. Encoding the products into integers gave me terrible results. Not sure what can be done in this case.
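One possible explanation, which the thread does not confirm, is that integer labels fed in as numeric inputs imply an order and a distance between products (D = 4 looks "four times" A = 1). A common workaround is to expand each label into one-hot indicator columns before training. The sketch below is plain Python with hypothetical helper names, and it assumes labels run from 1 to `num_classes`, with 0 reserved for padding.

```python
def one_hot(label, num_classes):
    """Indicator vector with a single 1 at position label - 1.

    Label 0 (assumed to mean padding) yields an all-zero vector.
    """
    return [1 if label == i else 0 for i in range(1, num_classes + 1)]


def one_hot_sequence(labels, num_classes):
    """Expand a sequence of integer labels into a sequence of indicator vectors."""
    return [one_hot(label, num_classes) for label in labels]


# Products A, D, A encoded as 1, 4, 1 out of four possible products.
encoded_row = [1, 4, 1]
expanded = one_hot_sequence(encoded_row, num_classes=4)
```

Each indicator column is numeric, so this form avoids declaring the inputs as nominals, at the cost of one column per product level per time step.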
