
Missing values in a column

geniusgenie — Fri, 14 Jul 2017 02:36:34 GMT

Hi,

I was wondering if someone could help me?

I am applying machine learning algorithms to my dataset using SAS Enterprise Miner.

My dataset consists of three columns: file name, feature name and feature type. Each feature name has a distinct feature type. A file may have multiple feature names and therefore multiple feature types as well. For example, file "A" sometimes has 3 or more rows with different features (but never more than 15). In total there are 70 unique feature names and 24 feature types.

One person suggested that I handle the missing values by inserting rows for the remaining feature names and marking their types as missing. But my point is this: file "A" has only 3 rows, so 67 would be missing, and file "B" has 11 rows, so 59 would be missing. If I insert 67 or 59 more rows per file and declare them missing, I would have more missing values than original values, which may affect my results when I apply classifiers to them.
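To put numbers on that concern, here is a small sketch of the arithmetic. The 70 feature names and the row counts for files "A" and "B" come from the paragraph above; the function name `padding_summary` is just a hypothetical helper for illustration.

```python
# Share of padded ("Missing") rows per file if every file is filled up
# to the full list of feature names.
TOTAL_FEATURES = 70  # unique feature names, as stated in the post

observed_rows = {"A": 3, "B": 11}  # rows actually present per file


def padding_summary(observed, total=TOTAL_FEATURES):
    """Return (rows_to_add, fraction_missing) for one file."""
    missing = total - observed
    return missing, missing / total


for name, n in observed_rows.items():
    missing, frac = padding_summary(n)
    print(f"file {name}: {n} observed, {missing} padded, {frac:.1%} missing")
```

With these counts, roughly 96% of file "A" and 84% of file "B" would consist of padded rows.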

Could anyone tell me whether it is right or wrong to treat these as missing values, and why?

Here is a rough table showing what I am trying to figure out:

File name	Feature name	Feature type
A	F1	D1
A	F2	D15
A	F3	D7
B	F1	D1
B	F5	D18
B	F35	D10
B	F20	D13
B	F45	D16
A	F4	Missing
A	F5	Missing
B	F2	Missing

Regards

Re: Missing values in a column

DougWielenga — Fri, 04 Aug 2017 21:07:18 GMT

geniusgenie,

It is not clear to me which data mining algorithms you wish to apply. Most of the algorithms in SAS Enterprise Miner need each observation to occupy a single row, so adding multiple rows of data for the same observation is not likely to be helpful. Certain analyses, such as the Association node or the Market Basket Analysis node, expect the data to be in a form where there are multiple rows per ID, but most predictive modeling methods treat each row as a different observation independent of the others. If your data has several rows for each observation, your analysis will be questionable since the rows are not independent of one another. If you could help me understand what analyses you wish to perform, I will try to make some suggestions on how to proceed.
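As a rough illustration of the "one row per observation" layout, the sketch below pivots the sample rows from the first post into one record per file. It is plain Python rather than anything Enterprise Miner does internally; the function name `to_wide` and the choice of `None` as the marker for an absent feature are assumptions made for the example. Only the feature names that appear in the sample become columns here, whereas the real data would have 70.

```python
# Rows copied from the rough table in the first post (long format),
# leaving out the rows that were padded with "Missing".
long_rows = [
    ("A", "F1", "D1"),
    ("A", "F2", "D15"),
    ("A", "F3", "D7"),
    ("B", "F1", "D1"),
    ("B", "F5", "D18"),
    ("B", "F35", "D10"),
    ("B", "F20", "D13"),
    ("B", "F45", "D16"),
]


def to_wide(rows, absent=None):
    """Pivot (file, feature, type) rows into one dict per file."""
    # Column order: feature names sorted by their numeric suffix.
    features = sorted({feat for _, feat, _ in rows}, key=lambda f: int(f[1:]))
    wide = {}
    for file_name, feat, ftype in rows:
        # Start each file with every column set to the "absent" marker.
        record = wide.setdefault(file_name, dict.fromkeys(features, absent))
        record[feat] = ftype
    return features, wide


features, wide = to_wide(long_rows)
print(["file"] + features)
for file_name, record in wide.items():
    print([file_name] + [record[f] for f in features])
```

Each file now occupies exactly one row, and any feature the file does not have shows up as an empty cell in that row rather than as an extra row.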

Cordially,
Doug

Re: Missing values in a column

geniusgenie — Sun, 06 Aug 2017 14:39:31 GMT

Hi Doug,

I am using Neural Network, SVM and Linear Regression. To make this easier to understand: I got this data by reading static information from files. Each file contains multiple records called sections, and each section has a type, address, size, etc.

There are 70 unique sections in total, but no single file contains more than 20 sections (records). Some files have 10 sections, some 15, and some only 3. In my view these are not missing values, but I am not sure whether this difference in the number of sections counts as missingness or is just normal. For example, in real life every person has a different height, and people are not expected to share the same height; a difference in height does not make height a missing value.

Please correct me if I am wrong.

Regards

Re: Missing values in a column

DougWielenga — Mon, 07 Aug 2017 13:55:13 GMT

It really doesn't matter to an algorithm where the data came from or whether or not there should be 'missing' values. The data structure for the techniques you are describing anticipates that there is a distinct unit/observation/entity on each row (not spread across multiple rows) and that each column contains an attribute of the unit/observation/entity on the corresponding row. So if we were looking at cars, your rows might correspond to a particular make and model of car, and the columns might correspond to things like suggested retail price, city mpg, highway mpg, number of cylinders, drivetrain type (front/rear/all-wheel), Bluetooth enabled (yes/no), etc. It is possible that you don't have complete information even in simple situations like this, since Mazda's rotary engine doesn't have cylinders (it's a chamber), and some models might not post certain information.

It is important to note that a neural network, a support vector machine, or a regression model will drop any observation with incomplete data, which simply means there is a missing value for one or more of the input variables. Decision Tree models are able to incorporate these observations, but you must impute/guess the missing value if you want the observation to be considered at all in your neural network or regression model. Adding rows with incomplete data will not help these latter modeling types, but even incomplete data can be used by a Decision Tree model. If the rows that have been 'added' are not really contributing any additional information to the model, it is possible that one of those methods requiring complete data might be helpful.

From a method standpoint, however, it is important to understand how the methods are interpreting your data and to decide what will generate meaningful results.
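To make the "drop any observation with incomplete data" point concrete, here is a minimal sketch in plain Python. It only mimics complete-case behaviour and is not Enterprise Miner's implementation. The records (including file "C"), the function names, and the placeholder level "ABSENT" are all hypothetical; filling with a constant level is just one possible way to impute, not a recommendation.

```python
# Hypothetical wide-format records: one dict per file, None = missing input.
records = {
    "A": {"F1": "D1", "F2": "D15", "F5": None},
    "B": {"F1": "D1", "F2": None, "F5": "D18"},
    "C": {"F1": "D1", "F2": "D15", "F5": "D18"},
}


def complete_cases(data):
    """Keep only records with no missing inputs (listwise deletion)."""
    return {
        key: rec
        for key, rec in data.items()
        if all(value is not None for value in rec.values())
    }


def impute_constant(data, fill="ABSENT"):
    """Replace each missing input with a placeholder level."""
    return {
        key: {feat: (fill if value is None else value) for feat, value in rec.items()}
        for key, rec in data.items()
    }


# Without imputation only the fully populated record survives.
print(sorted(complete_cases(records)))
# After imputation every record is complete and would be kept.
print(sorted(complete_cases(impute_constant(records))))
```

In this toy example, two of the three files would be dropped by a complete-case method unless their missing inputs are filled in first.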

I hope this helps,

Doug