<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Encoding categorical features in dataset in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832524#M10295</link>
    <description>&lt;P&gt;Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be a categorical variable when trying to train an LSTM model. Maybe I didn't list it (or the inputs) in the nominals argument.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;The target, which is also a categorical variable, has to be specified both in the nominals argument and as the target argument, right?&lt;/P&gt;</description>
    <pubDate>Fri, 09 Sep 2022 16:12:27 GMT</pubDate>
    <dc:creator>KJazem</dc:creator>
    <dc:date>2022-09-09T16:12:27Z</dc:date>
    <item>
      <title>Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832492#M10289</link>
      <description>&lt;P&gt;I have a dataset that looks like this: each ID has a list of products and features related to each product. I want to input that into my model, so I have to encode the categorical variables. How would I go about encoding them into integer labels or one-hot vectors? I've seen macros for this, but they seem impractical: from what I've seen, I'd have to run them on every column separately, and similar columns (e.g. prod_1 and prod_2) might end up with different encodings.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there an action set or anything similar that, given a dataset, does the encoding for you? The same question applies to normalizing continuous variables.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE border="0" cellspacing="0" cellpadding="0"&gt;&lt;TBODY&gt;
&lt;TR&gt;&lt;TD&gt;id&lt;/TD&gt;&lt;TD&gt;prod_1&lt;/TD&gt;&lt;TD&gt;time_gap_1&lt;/TD&gt;&lt;TD&gt;flag_1&lt;/TD&gt;&lt;TD&gt;age_1&lt;/TD&gt;&lt;TD&gt;salary_1&lt;/TD&gt;&lt;TD&gt;gender_1&lt;/TD&gt;&lt;TD&gt;nationality_1&lt;/TD&gt;&lt;TD&gt;prod_2&lt;/TD&gt;&lt;TD&gt;time_gap_2&lt;/TD&gt;&lt;TD&gt;flag_2&lt;/TD&gt;&lt;TD&gt;age_2&lt;/TD&gt;&lt;TD&gt;salary_2&lt;/TD&gt;&lt;TD&gt;gender_2&lt;/TD&gt;&lt;TD&gt;nationality_2&lt;/TD&gt;&lt;TD&gt;prod_3&lt;/TD&gt;&lt;TD&gt;time_gap_3&lt;/TD&gt;&lt;TD&gt;flag_3&lt;/TD&gt;&lt;TD&gt;age_3&lt;/TD&gt;&lt;TD&gt;salary_3&lt;/TD&gt;&lt;TD&gt;gender_3&lt;/TD&gt;&lt;TD&gt;nationality_3&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;19&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;37&lt;/TD&gt;&lt;TD&gt;51794.77&lt;/TD&gt;&lt;TD&gt;Female&lt;/TD&gt;&lt;TD&gt;Local&lt;/TD&gt;&lt;TD&gt;D&lt;/TD&gt;&lt;TD&gt;14&lt;/TD&gt;&lt;TD&gt;0&lt;/TD&gt;&lt;TD&gt;37&lt;/TD&gt;&lt;TD&gt;51794.77&lt;/TD&gt;&lt;TD&gt;Female&lt;/TD&gt;&lt;TD&gt;Local&lt;/TD&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;0&lt;/TD&gt;&lt;TD&gt;37&lt;/TD&gt;&lt;TD&gt;51794.77&lt;/TD&gt;&lt;TD&gt;Female&lt;/TD&gt;&lt;TD&gt;Local&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;2&lt;/TD&gt;&lt;TD&gt;C&lt;/TD&gt;&lt;TD&gt;20&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;21&lt;/TD&gt;&lt;TD&gt;62124.27&lt;/TD&gt;&lt;TD&gt;Male&lt;/TD&gt;&lt;TD&gt;Expat&lt;/TD&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;30&lt;/TD&gt;&lt;TD&gt;0&lt;/TD&gt;&lt;TD&gt;21&lt;/TD&gt;&lt;TD&gt;62124.27&lt;/TD&gt;&lt;TD&gt;Male&lt;/TD&gt;&lt;TD&gt;Expat&lt;/TD&gt;&lt;TD&gt;D&lt;/TD&gt;&lt;TD&gt;24&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;21&lt;/TD&gt;&lt;TD&gt;62124.27&lt;/TD&gt;&lt;TD&gt;Male&lt;/TD&gt;&lt;TD&gt;Expat&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;3&lt;/TD&gt;&lt;TD&gt;C&lt;/TD&gt;&lt;TD&gt;15&lt;/TD&gt;&lt;TD&gt;0&lt;/TD&gt;&lt;TD&gt;40&lt;/TD&gt;&lt;TD&gt;79727.85&lt;/TD&gt;&lt;TD&gt;Female&lt;/TD&gt;&lt;TD&gt;Local&lt;/TD&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;23&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;40&lt;/TD&gt;&lt;TD&gt;79727.85&lt;/TD&gt;&lt;TD&gt;Female&lt;/TD&gt;&lt;TD&gt;Local&lt;/TD&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;8&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;40&lt;/TD&gt;&lt;TD&gt;79727.85&lt;/TD&gt;&lt;TD&gt;Female&lt;/TD&gt;&lt;TD&gt;Local&lt;/TD&gt;&lt;/TR&gt;
&lt;TR&gt;&lt;TD&gt;4&lt;/TD&gt;&lt;TD&gt;D&lt;/TD&gt;&lt;TD&gt;19&lt;/TD&gt;&lt;TD&gt;1&lt;/TD&gt;&lt;TD&gt;38&lt;/TD&gt;&lt;TD&gt;26712.37&lt;/TD&gt;&lt;TD&gt;Male&lt;/TD&gt;&lt;TD&gt;Expat&lt;/TD&gt;&lt;TD&gt;C&lt;/TD&gt;&lt;TD&gt;21&lt;/TD&gt;&lt;TD&gt;0&lt;/TD&gt;&lt;TD&gt;38&lt;/TD&gt;&lt;TD&gt;26712.37&lt;/TD&gt;&lt;TD&gt;Male&lt;/TD&gt;&lt;TD&gt;Expat&lt;/TD&gt;&lt;TD&gt;D&lt;/TD&gt;&lt;TD&gt;12&lt;/TD&gt;&lt;TD&gt;0&lt;/TD&gt;&lt;TD&gt;38&lt;/TD&gt;&lt;TD&gt;26712.37&lt;/TD&gt;&lt;TD&gt;Male&lt;/TD&gt;&lt;TD&gt;Expat&lt;/TD&gt;&lt;/TR&gt;
&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
      <pubDate>Fri, 09 Sep 2022 13:22:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832492#M10289</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-09T13:22:21Z</dc:date>
    </item>
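One way to keep similar columns (prod_1, prod_2, prod_3) on the same encoding is to build a single vocabulary from all of them and reuse it everywhere. The following is only a rough sketch in plain Python, not a SAS action; the helper names build_vocab and one_hot are made up for illustration, and the rows are the product columns from the table in the question.

```python
# Hypothetical helpers for illustration only; this is not a SAS action set.
rows = [
    {"id": 1, "prod_1": "A", "prod_2": "D", "prod_3": "A"},
    {"id": 2, "prod_1": "C", "prod_2": "B", "prod_3": "D"},
    {"id": 3, "prod_1": "C", "prod_2": "A", "prod_3": "A"},
    {"id": 4, "prod_1": "D", "prod_2": "C", "prod_3": "D"},
]
prod_cols = ["prod_1", "prod_2", "prod_3"]


def build_vocab(rows, cols):
    """Map every level seen in ANY of the columns to one integer code."""
    levels = sorted({row[c] for row in rows for c in cols})
    return {level: code for code, level in enumerate(levels)}


def one_hot(code, size):
    """Return a list of zeros with a single 1 at position `code`."""
    return [1 if i == code else 0 for i in range(size)]


# One shared vocabulary, so "A" gets the same code in every prod_* column.
vocab = build_vocab(rows, prod_cols)

# Integer-label encoding, one list of codes per row.
encoded = [[vocab[row[c]] for c in prod_cols] for row in rows]

# One-hot encoding of the first row's product sequence.
first_row_one_hot = [one_hot(code, len(vocab)) for code in encoded[0]]
```

Because the vocabulary is built once from all three columns, a product can never end up with two different codes depending on which column it appears in.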
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832495#M10290</link>
      <description>&lt;P&gt;Why do you want to do this?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Most SAS procedures that support the CLASS statement do not require an integer representation of class variables.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 13:23:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832495#M10290</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2022-09-09T13:23:19Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832496#M10291</link>
      <description>&lt;DIV id="bodyDisplay_6e778f9e9decc2" class="lia-message-body lia-component-message-view-widget-body lia-component-body-signature-highlight-escalation lia-component-message-view-widget-body-signature-highlight-escalation"&gt;
&lt;DIV class="lia-message-body-content"&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;I have a dataset that looks like this: for every ID, they have a list of products and features related to that product. I want to input that into my model, so I have to encode the categorical variables.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I think this is incorrect in general. SAS has created a method to include categorical variables in a model, so you don't have to do this encoding. This method is the CLASS statement. See, for example, &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_glm_syntax04.htm" target="_self"&gt;PROC GLM documentation&lt;/A&gt;, but this applies to all modeling procedures I know of.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;</description>
      <pubDate>Fri, 09 Sep 2022 13:26:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832496#M10291</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2022-09-09T13:26:45Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832498#M10292</link>
      <description>I'm using the deeplearn action set to train my model (LSTM), and I assumed you have to encode categorical variables. My inputs will be the columns (features), and a target variable (which I forgot to add but it's just one column of products).&lt;BR /&gt;&lt;BR /&gt;Can I just include categorical features in this as well? Still new to this one.</description>
      <pubDate>Fri, 09 Sep 2022 13:37:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832498#M10292</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-09T13:37:07Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832500#M10293</link>
      <description>I'm using the deeplearn action set in PROC CAS. Does this also support class variables as input?&lt;BR /&gt;&lt;BR /&gt;This is the documentation I'm following: &lt;A href="https://documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/cas-deeplearn-dltrain.htm" target="_blank"&gt;https://documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/cas-deeplearn-dltrain.htm&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;All my inputs are the columns (plus the sequence parameter for tokenSize). Can I just use the categorical features as-is as inputs? LSTM models, as far as I know, require you to have your features encoded. That's at least what I did in Python.</description>
      <pubDate>Fri, 09 Sep 2022 13:41:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832500#M10293</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-09T13:41:19Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832520#M10294</link>
      <description>Yes, I think you can use the nominals= argument in the dlTrain action to indicate which inputs/target are categorical.</description>
      <pubDate>Fri, 09 Sep 2022 15:29:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832520#M10294</guid>
      <dc:creator>WendyCzika</dc:creator>
      <dc:date>2022-09-09T15:29:03Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832524#M10295</link>
      <description>&lt;P&gt;Thanks for the reply. I'll try it. I remember getting a warning that only the target variable (or was it the inputs?) can be a categorical variable when trying to train an LSTM model. Maybe I didn't list it (or the inputs) in the nominals argument.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;The target, which is also a categorical variable, has to be specified both in the nominals argument and as the target argument, right?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 16:12:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832524#M10295</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-09T16:12:27Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832527#M10297</link>
      <description>I think so - I know that's true for other actions.</description>
      <pubDate>Fri, 09 Sep 2022 16:10:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832527#M10297</guid>
      <dc:creator>WendyCzika</dc:creator>
      <dc:date>2022-09-09T16:10:43Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832528#M10298</link>
      <description>&lt;P&gt;Great, I will test it out.&lt;BR /&gt;&lt;BR /&gt;A simple follow-up to this: if customers can have a different number of products in the sequence, I would have to pad until the sequences are all the same length (similar to pad_sequences from tensorflow). Should I be doing this before training, or is there a way to have it train/learn from variable-length sequences? There's a &lt;EM&gt;&lt;STRONG&gt;missing&lt;/STRONG&gt; &lt;/EM&gt;argument, but it is apparently only applicable to regression models.&amp;nbsp; And there's a &lt;STRONG&gt;&lt;EM&gt;forceEqualPadding&lt;/EM&gt; &lt;/STRONG&gt;argument,&amp;nbsp;but only for convolutional layers?&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 16:26:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832528#M10298</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-09T16:26:09Z</dc:date>
    </item>
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832563#M10299</link>
      <description>&lt;P&gt;There are some graphs on this documentation page showing how the tokens are concatenated:&amp;nbsp;&lt;SPAN&gt;&lt;A href="https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm" target="_blank"&gt;https://go.documentation.sas.com/doc/en/pgmsascdc/v_007/casdlpg/p0fvrm760lp8i7n1ofx84a92wa9s.htm&lt;/A&gt;&amp;nbsp;&lt;/SPAN&gt;I think the padding would need to be done before training.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 18:58:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832563#M10299</guid>
      <dc:creator>lipcai</dc:creator>
      <dc:date>2022-09-09T18:58:34Z</dc:date>
    </item>
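If the padding does have to happen before training, it can be done as a plain preprocessing step. Here is a minimal sketch in plain Python; pad_to_length is a hypothetical helper (not part of SAS or tensorflow), and it assumes the real product codes start at 1 so that 0 is free to act as the padding value. Note that it pads at the end of each sequence, whereas tensorflow's pad_sequences pads at the front by default.

```python
def pad_to_length(seqs, max_len=None, pad_value=0):
    """Right-pad (and truncate, if needed) every sequence to max_len items.

    Assumption: pad_value is never used as a real code in the data.
    """
    if max_len is None:
        max_len = max(len(seq) for seq in seqs)
    padded = []
    for seq in seqs:
        kept = list(seq[:max_len])  # truncate sequences that are too long
        padded.append(kept + [pad_value] * (max_len - len(kept)))
    return padded


# Three customers with 3, 2 and 1 products (codes start at 1).
sequences = [[1, 4, 1], [3, 2], [3]]
padded = pad_to_length(sequences)
truncated = pad_to_length(sequences, max_len=2)
```

After this step every customer has the same number of columns, which is what a fixed-width input table needs.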
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832675#M10300</link>
      <description>If I pad with zeros, for example, how would the model training know to 'ignore' the padding? In tensorflow, you have a masking layer, or I think the model would implicitly ignore zeros (I'm not sure about that).&lt;BR /&gt;&lt;BR /&gt;I wish there was a way to just input variable-length sequences.&lt;BR /&gt;&lt;BR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/14512"&gt;@WendyCzika&lt;/a&gt; if you have any ideas on how to do this, please share. Thanks.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Sat, 10 Sep 2022 18:58:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832675#M10300</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-10T18:58:11Z</dc:date>
    </item>
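For reference, the idea behind a masking layer can be shown without any framework: build a 0/1 mask from the padded sequences and leave the padded steps out of whatever is averaged. This is only a conceptual sketch in plain Python with made-up helper names (padding_mask, masked_mean) and made-up per-step loss values; whether dlTrain offers an equivalent is not established in this thread.

```python
def padding_mask(padded, pad_value=0):
    """1 for real time steps, 0 for padded ones."""
    return [[0 if tok == pad_value else 1 for tok in seq] for seq in padded]


def masked_mean(values, mask):
    """Average the values over real time steps only, skipping padded steps."""
    total = sum(
        v * m
        for row_v, row_m in zip(values, mask)
        for v, m in zip(row_v, row_m)
    )
    count = sum(sum(row) for row in mask)
    return total / count if count else 0.0


padded = [[1, 4, 1], [3, 2, 0], [3, 0, 0]]
mask = padding_mask(padded)

# Illustrative per-step losses; the 9.0 entries sit on padded steps.
step_losses = [[0.5, 0.5, 1.0], [1.0, 1.0, 9.0], [2.0, 9.0, 9.0]]
loss = masked_mean(step_losses, mask)
```

The large values on the padded positions do not affect the result, which is the behaviour a masking layer provides inside a framework.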
    <item>
      <title>Re: Encoding categorical features in dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832967#M10305</link>
      <description>A quick follow-up.&lt;BR /&gt;&lt;BR /&gt;I tried inputting categorical variables into my LSTM model, and it errored out saying only target variables can be nominals. Encoding the products into integers gave me terrible results. Not sure what can be done in this case.</description>
      <pubDate>Mon, 12 Sep 2022 18:18:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Encoding-categorical-features-in-dataset/m-p/832967#M10305</guid>
      <dc:creator>KJazem</dc:creator>
      <dc:date>2022-09-12T18:18:53Z</dc:date>
    </item>
  </channel>
</rss>

