<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Missing values in a column in SAS Data Science</title>
    <link>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/375908#M5595</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I was wondering if someone could help me.&lt;/P&gt;&lt;P&gt;I am applying machine learning algorithms to my dataset using SAS Enterprise Miner.&amp;nbsp;&lt;/P&gt;&lt;P&gt;My dataset consists of three columns: file name, feature name&amp;nbsp;and feature type. Each feature name has a distinct feature type. A file may have multiple feature names and therefore multiple feature types as well. For example, file "A" may have 3 or more rows with different features (but never more than 15). In total there are &lt;STRONG&gt;70&lt;/STRONG&gt; unique feature names and &lt;STRONG&gt;24&lt;/STRONG&gt; feature types.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One person suggested that I handle this by inserting a row for every feature name a file does not have and marking its feature type as missing. But my concern is this: file "A" has only 3 rows, so 67 features would be missing, and file "B" has 11 rows, so 59 would be missing. If I insert 67 or 59 extra rows per file and declare them missing, I will have more missing values than original values, which may affect my results when I apply classifiers to them.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could anyone tell me whether it is right or wrong to add these kinds of missing values? 
Could you tell me why?&lt;/P&gt;&lt;P&gt;Here is a rough table showing what I am trying to figure out:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;File name&lt;/TD&gt;&lt;TD&gt;Feature name&lt;/TD&gt;&lt;TD&gt;Feature type&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F1&lt;/TD&gt;&lt;TD&gt;D1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F2&lt;/TD&gt;&lt;TD&gt;D15&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F3&lt;/TD&gt;&lt;TD&gt;D7&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F1&lt;/TD&gt;&lt;TD&gt;D1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F5&lt;/TD&gt;&lt;TD&gt;D18&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F35&lt;/TD&gt;&lt;TD&gt;D10&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F20&lt;/TD&gt;&lt;TD&gt;D13&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F45&lt;/TD&gt;&lt;TD&gt;D16&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F4&lt;/TD&gt;&lt;TD&gt;Missing&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F5&lt;/TD&gt;&lt;TD&gt;Missing&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F2&lt;/TD&gt;&lt;TD&gt;Missing&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 14 Jul 2017 02:36:34 GMT</pubDate>
    <dc:creator>geniusgenie</dc:creator>
    <dc:date>2017-07-14T02:36:34Z</dc:date>
    <item>
      <title>Missing values in a column</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/375908#M5595</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I was wondering if someone could help me.&lt;/P&gt;&lt;P&gt;I am applying machine learning algorithms to my dataset using SAS Enterprise Miner.&amp;nbsp;&lt;/P&gt;&lt;P&gt;My dataset consists of three columns: file name, feature name&amp;nbsp;and feature type. Each feature name has a distinct feature type. A file may have multiple feature names and therefore multiple feature types as well. For example, file "A" may have 3 or more rows with different features (but never more than 15). In total there are &lt;STRONG&gt;70&lt;/STRONG&gt; unique feature names and &lt;STRONG&gt;24&lt;/STRONG&gt; feature types.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;One person suggested that I handle this by inserting a row for every feature name a file does not have and marking its feature type as missing. But my concern is this: file "A" has only 3 rows, so 67 features would be missing, and file "B" has 11 rows, so 59 would be missing. If I insert 67 or 59 extra rows per file and declare them missing, I will have more missing values than original values, which may affect my results when I apply classifiers to them.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could anyone tell me whether it is right or wrong to add these kinds of missing values? 
Could you tell me why?&lt;/P&gt;&lt;P&gt;Here is a rough table showing what I am trying to figure out:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;File name&lt;/TD&gt;&lt;TD&gt;Feature name&lt;/TD&gt;&lt;TD&gt;Feature type&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F1&lt;/TD&gt;&lt;TD&gt;D1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F2&lt;/TD&gt;&lt;TD&gt;D15&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F3&lt;/TD&gt;&lt;TD&gt;D7&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F1&lt;/TD&gt;&lt;TD&gt;D1&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F5&lt;/TD&gt;&lt;TD&gt;D18&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F35&lt;/TD&gt;&lt;TD&gt;D10&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F20&lt;/TD&gt;&lt;TD&gt;D13&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F45&lt;/TD&gt;&lt;TD&gt;D16&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F4&lt;/TD&gt;&lt;TD&gt;Missing&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;A&lt;/TD&gt;&lt;TD&gt;F5&lt;/TD&gt;&lt;TD&gt;Missing&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;B&lt;/TD&gt;&lt;TD&gt;F2&lt;/TD&gt;&lt;TD&gt;Missing&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jul 2017 02:36:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/375908#M5595</guid>
      <dc:creator>geniusgenie</dc:creator>
      <dc:date>2017-07-14T02:36:34Z</dc:date>
    </item>
    <item>
      <title>Re: Missing values in a column</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/385770#M5686</link>
      <description>&lt;P&gt;geniusgenie,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It is not clear to me which data mining algorithms you wish to apply. Most of the algorithms in SAS Enterprise Miner need each observation on a single row, so adding multiple rows of data for the same observation is not likely to be helpful. Certain analyses, such as the Association node or the Market Basket Analysis node, expect the data to have multiple rows per ID, but most predictive modeling methods treat each row as a different observation, independent of the others. If your data has several rows for each observation, your analysis will be questionable since the rows are not independent of one another. If you could help me understand which analyses you wish to perform, I will try to make some suggestions on how to proceed.&lt;/P&gt;
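&lt;P&gt;As a rough illustration of the one-row-per-observation layout described above, here is a minimal Python sketch (not SAS Enterprise Miner code) that reshapes the long table from the original post into one row per file, with a 0/1 flag for each feature name. The sample rows, the helper name to_wide, and the choice to encode an absent feature as 0 rather than as a missing value are all assumptions made for illustration. The feature type column is dropped on the assumption that it is determined by the feature name.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import defaultdict

# Long-format rows copied from the rough table in the original post:
# (file name, feature name, feature type). The "Missing" rows are left out.
long_rows = [
    ("A", "F1", "D1"),
    ("A", "F2", "D15"),
    ("A", "F3", "D7"),
    ("B", "F1", "D1"),
    ("B", "F5", "D18"),
    ("B", "F35", "D10"),
    ("B", "F20", "D13"),
    ("B", "F45", "D16"),
]


def to_wide(rows):
    """Return (columns, table): one row per file, one 0/1 flag per feature name."""
    features_by_file = defaultdict(set)
    for file_name, feature_name, _feature_type in rows:
        features_by_file[file_name].add(feature_name)

    # Assumes names look like "F12"; sort by the numeric part so F5 precedes F20.
    columns = sorted({name for _, name, _ in rows}, key=lambda name: int(name[1:]))

    # Assumption: an absent feature is encoded as 0, not as a missing value.
    table = {
        file_name: [1 if column in present else 0 for column in columns]
        for file_name, present in sorted(features_by_file.items())
    }
    return columns, table


columns, table = to_wide(long_rows)
print("file", *columns)
for file_name, flags in table.items():
    print(file_name, *flags)
```
&lt;/PRE&gt;
&lt;P&gt;With the full data this layout would have one column per feature name and every cell filled in. Whether 0 is the right encoding for an absent feature depends on what absence means in the data.&lt;/P&gt;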
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Cordially,&lt;BR /&gt;Doug&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 04 Aug 2017 21:07:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/385770#M5686</guid>
      <dc:creator>DougWielenga</dc:creator>
      <dc:date>2017-08-04T21:07:18Z</dc:date>
    </item>
    <item>
      <title>Re: Missing values in a column</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/385895#M5691</link>
      <description>&lt;P&gt;Hi Doug,&lt;/P&gt;&lt;P&gt;I am using Neural Network, SVM and Linear Regression. To make this easier to understand: I got this data by reading static information from files. Each file contains multiple records called sections, and each section has a type, address, size etc.&amp;nbsp;&lt;/P&gt;&lt;P&gt;There are 70 unique sections in total, but no file contains more than 20 sections (records). Some files have 10 sections, some 15 and some only 3. In my view these are not missing values, but I am not sure whether this difference in the number of sections counts as missingness or is simply normal. For example, in real life every person has a different height, and people are not supposed to have the same height. A difference in height does not make height a missing value.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please correct me if I am wrong.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 06 Aug 2017 14:39:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/385895#M5691</guid>
      <dc:creator>geniusgenie</dc:creator>
      <dc:date>2017-08-06T14:39:31Z</dc:date>
    </item>
    <item>
      <title>Re: Missing values in a column</title>
      <link>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/386013#M5694</link>
      <description>&lt;P&gt;It really doesn't matter to an algorithm where the data came from or whether there should be 'missing' values. The data structure for the techniques you are describing anticipates that there is a distinct unit/observation/entity on each row (not spread across multiple rows) and that each column contains an attribute of the unit/observation/entity on the corresponding row. So if we were looking at cars, your rows might correspond to a particular make and model of car, and the columns might correspond to things like suggested retail price, city mpg, highway mpg, number of cylinders, drivetrain type (front/rear/all-wheel), Bluetooth enabled (yes/no), etc. It is possible that you don't have complete information even in simple situations like this, since Mazda doesn't have cylinders (it's a chamber) in its rotary engine, and some models might not post certain information. It is important to note that a neural network, a support vector machine, or a regression model will drop any observation with incomplete data, which simply means there is a missing value for one or more of the input variables. Decision Tree models are able to incorporate these observations, but you must impute/guess the missing value if you want the observation to be considered at all in your neural network or regression model. Adding rows with incomplete data will not help these latter model types, but even incomplete data can be used by a Decision Tree model. If the rows that have been 'added' are not really contributing any additional information to the model, it is possible that one of those methods requiring complete data might be helpful. From a method standpoint, however, it is important to understand how the methods are interpreting your data and to decide what will generate meaningful results. &lt;BR /&gt;&lt;BR /&gt;I hope this helps,&lt;/P&gt;
&lt;P&gt;Doug&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 07 Aug 2017 13:55:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Science/Missing-values-in-a-column/m-p/386013#M5694</guid>
      <dc:creator>DougWielenga</dc:creator>
      <dc:date>2017-08-07T13:55:13Z</dc:date>
    </item>
  </channel>
</rss>

