geniusgenie
Obsidian | Level 7

Hi,

I was wondering if someone could help me?

I am applying machine learning algorithms to my dataset using SAS Enterprise Miner.

My dataset consists of three columns: file name, feature name and feature type. Each feature name has a distinct feature type. A file may have multiple feature names and therefore multiple feature types as well. For example, file "A" sometimes has 3 or more rows with different features (but never more than 15). In total there are 70 unique feature names and 24 feature types.

 

One person suggested that I handle the missing values by inserting rows for the remaining feature names, with their types marked as missing. My concern is this: file "A" has only 3 rows, so 67 would be missing, and file "B" has 11 rows, so 59 would be missing. If I insert 67 or 59 more rows per file and declare them missing, I will have more missing values than original values, which may affect my results when I apply classifiers to them.

 

Could anyone tell me whether it is right or wrong to add these kinds of missing values, and why?

A rough table shows what I am trying to figure out

 

File name    Feature name    Feature type
A            F1              D1
A            F2              D15
A            F3              D7
B            F1              D1
B            F5              D18
B            F35             D10
B            F20             D13
B            F45             D16
A            F4              Missing
A            F5              Missing
B            F2              Missing

 

Regards

 

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

It really doesn't matter to an algorithm where the data came from, or whether or not there 'should' be missing values. The data structure for the techniques you are describing anticipates one distinct unit/observation/entity per row (not spread across multiple rows), with each column containing an attribute of the unit/observation/entity on that row. So if we were looking at cars, your rows might correspond to a particular make and model of car, and the columns might correspond to things like suggested retail price, city mpg, highway mpg, number of cylinders, drivetrain type (front/rear/all-wheel), Bluetooth enabled (yes/no), etc. You may not have complete information even in simple situations like this, since a Mazda rotary engine doesn't have cylinders (it's a chamber), and some models might not post certain information.

It is important to note that a neural network, a support vector machine, or a regression model will drop any observation with incomplete data, which simply means an observation with a missing value for one or more of the input variables. Decision Tree models are able to incorporate these observations, but you must impute/guess the missing value if you want the observation to be considered at all in your neural network or regression model. Adding rows with incomplete data will not help these latter model types, although even incomplete data can be used by a Decision Tree model. If the rows that have been 'added' are not really contributing any additional information to the model, it is possible that one of those methods requiring complete data might be helpful. From a method standpoint, however, it is important to understand how the methods are interpreting your data and to decide what will generate a meaningful result.
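
To illustrate the one-row-per-observation layout, here is a rough Python sketch (not SAS Enterprise Miner code). It takes the non-missing rows from the rough table in the question, reshapes them to one record per file, and then applies a complete-case filter of the kind described above. The helper names `to_wide` and `complete_cases` are hypothetical, and using `None` to stand for a missing value is an assumption made for the example.

```python
# Long layout from the rough table: one row per (file, feature) pair.
rows = [
    ("A", "F1", "D1"),
    ("A", "F2", "D15"),
    ("A", "F3", "D7"),
    ("B", "F1", "D1"),
    ("B", "F5", "D18"),
    ("B", "F35", "D10"),
    ("B", "F20", "D13"),
    ("B", "F45", "D16"),
]

# Column order: feature names in order of first appearance
# (7 names in this toy data, 70 in the real dataset).
feature_names = list(dict.fromkeys(name for _, name, _ in rows))


def to_wide(rows, feature_names):
    """Return {file_name: {feature_name: feature_type or None}}."""
    wide = {}
    for file_name, feature_name, feature_type in rows:
        # Every file starts with all features set to None (assumed "missing").
        record = wide.setdefault(file_name, dict.fromkeys(feature_names))
        record[feature_name] = feature_type
    return wide


def complete_cases(wide):
    """Keep only the records that have no None values."""
    return {
        file_name: record
        for file_name, record in wide.items()
        if all(value is not None for value in record.values())
    }


wide = to_wide(rows, feature_names)
kept = complete_cases(wide)
print(len(wide), "files,", len(kept), "complete")  # prints: 2 files, 0 complete
```

In this toy data no file has every feature, so a method that requires complete data would have nothing left to train on unless the gaps are imputed or encoded some other way.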

I hope this helps,

Doug 

3 REPLIES
DougWielenga
SAS Employee

geniusgenie,

 

It is not clear to me which data mining algorithms you wish to apply. Most of the algorithms in SAS Enterprise Miner need a single observation per row, so adding multiple rows of data for the same observation is not likely to be helpful. Certain analyses, such as those in the Association node or the Market Basket Analysis node, expect the data to have multiple rows per ID, but most predictive modeling methods treat each row as a different observation, independent of the others. If your data has several rows for each observation, your analysis will be questionable, since the rows are not independent of one another. If you could help me understand what analyses you wish to perform, I will try to make some suggestions on how to proceed.
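
The difference between "multiple rows per ID" and "one observation per row" can be sketched in a few lines of Python (an illustration only, not SAS Enterprise Miner code). The sample rows are the non-missing rows of the rough table in the question, and `group_by_file` is a hypothetical helper name.

```python
from collections import defaultdict

# Multiple rows per ID: each row is one (file, feature name, feature type) record.
rows = [
    ("A", "F1", "D1"),
    ("A", "F2", "D15"),
    ("A", "F3", "D7"),
    ("B", "F1", "D1"),
    ("B", "F5", "D18"),
    ("B", "F35", "D10"),
    ("B", "F20", "D13"),
    ("B", "F45", "D16"),
]


def group_by_file(rows):
    """Collect the feature names of each file under a single ID."""
    grouped = defaultdict(list)
    for file_name, feature_name, _feature_type in rows:
        grouped[file_name].append(feature_name)
    return dict(grouped)


grouped = group_by_file(rows)
for file_name, features in grouped.items():
    print(file_name, features)
```

The table has 8 rows but only 2 files. A method that treats every row as an independent observation would see 8 observations, whereas the unit being classified here is presumably the file.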

 

Cordially,
Doug

 

 

geniusgenie
Obsidian | Level 7

Hi Doug,

I am using Neural Network, SVM and Linear Regression. To make this easier to understand: I obtained this data by reading static information from files. Each file contains multiple records called sections, and each section has a type, address, size, etc.

There are 70 unique sections in total, but no single file contains more than 20 sections (records). Some files have 10 sections, some have 15 and some have 3. In my view these are not missing values, but I am not sure whether this difference in the number of sections counts as missingness or is simply normal. For example, every person in real life has a different height, and people are not expected to have the same height. A difference in height does not make height a missing value.
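
One possible way to express this "absent, not missing" view in data is a 0/1 presence indicator per section, so that every file becomes a single row with no missing cells. The Python sketch below is only an illustration under that assumption. It is not something proposed in this thread or a SAS Enterprise Miner feature, and `presence_indicators` is a hypothetical helper name.

```python
# Non-missing rows of the rough table: (file name, feature name, feature type).
rows = [
    ("A", "F1", "D1"),
    ("A", "F2", "D15"),
    ("A", "F3", "D7"),
    ("B", "F1", "D1"),
    ("B", "F5", "D18"),
    ("B", "F35", "D10"),
    ("B", "F20", "D13"),
    ("B", "F45", "D16"),
]

# Feature names in order of first appearance (7 here, 70 in the real data).
feature_names = list(dict.fromkeys(name for _, name, _ in rows))


def presence_indicators(rows, feature_names):
    """Return {file_name: [1 if the file has the feature else 0, ...]}."""
    present = {}
    for file_name, feature_name, _feature_type in rows:
        present.setdefault(file_name, set()).add(feature_name)
    return {
        file_name: [1 if name in names else 0 for name in feature_names]
        for file_name, names in present.items()
    }


indicators = presence_indicators(rows, feature_names)
print(feature_names)
for file_name, vector in indicators.items():
    print(file_name, vector)
```

With this encoding, a file that lacks a section gets a 0 rather than a missing value, so no row would be dropped for incomplete data. Whether absence is really informative for the classification task is a modelling judgment that the code cannot settle.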

 

Please correct me if I am wrong.

 

Regards

 

