New Contributor

# Classification: K nearest neighbors (MBR)

Hi all, I'm a math student who has to pass a data mining exam in a week. I can't solve any part of this exercise in SAS Enterprise Miner; can someone help me?

Create an MBR classifier in SAS Enterprise Miner on the Intrusion Detection dataset (a .csv file I downloaded from my university website), then:
1. Perform an exploratory analysis of the data and present the main results.
2. Partition the dataset into training, test, and validation sets (50%-30%-20%).
3. Eliminate the non-numerical features and explain the reasons for their exclusion.
4. Choose a value of K (and indicate why you chose it).
5. Calculate the total prediction error on the test set.
6. Repeat steps 4 and 5 with a different value of K, for a total of 3 iterations.
7. Plot the curve of total error as K varies. For which value of K do you obtain the lowest error?
8. For this value of K, show the confusion matrix on the validation set.

I have problems mainly with step 3, but also with step 4 (how can I justify my choice of K?), step 5, step 7 (how can I plot the curve of total error?), and step 8 (how can I find the confusion matrix?).

I apologize for my poor English, and I hope you can help.

Teodoro

Accepted Solutions
07-07-2017 03:14 PM
Super Contributor

## Re: Classification: K nearest neighbors (MBR)


Hi Teodoro,

To find a suitable number of nearest neighbors, I would run several MBR nodes, each with a different number of neighbors, and then use a Model Comparison node to compare their fit statistics and score distributions. This is just my preference; I'm not sure whether there is a more theoretical way to do it.
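Conceptually, here is what a single MBR run with a given number of neighbors computes, sketched in plain Python rather than SAS, with made-up toy data, just to illustrate the majority-vote rule and the total-error calculation from the exercise:

```python
# Plain-Python sketch of k-nearest-neighbors (what the MBR node does
# conceptually). Not SAS code; the data below is a made-up toy example.
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Label x by majority vote among its k nearest training points."""
    dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def total_error(train_X, train_y, test_X, test_y, k):
    """Fraction of held-out points that are misclassified."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(test_X, test_y))
    return wrong / len(test_y)

# Two well-separated toy clusters standing in for "normal" vs "attack" traffic.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["normal"] * 3 + ["attack"] * 3
test_X  = [(0.5, 0.5), (5.5, 5.5)]
test_y  = ["normal", "attack"]
print(total_error(train_X, train_y, test_X, test_y, k=3))  # → 0.0
```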

Two options for seeing the classification matrix:

1. For any node from the Model tab, you can see the classification matrix in its results. Go to View->Assessment->Classification Chart. If you want to see the numbers, click the fourth icon (the table button).

2. Alternatively, you can connect your MBR node to a Model Comparison node. You will then see the classification matrix in the results of the Model Comparison node.
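If it helps to see what those numbers mean, the classification (confusion) matrix is nothing more than a table of counts of (actual, predicted) label pairs. A plain-Python illustration with made-up labels, not actual Enterprise Miner output:

```python
# The classification (confusion) matrix is just counts of
# (actual label, predicted label) pairs. Toy labels for illustration.
from collections import Counter

def confusion_matrix(actual, predicted):
    return Counter(zip(actual, predicted))

actual    = ["attack", "attack", "normal", "normal", "normal"]
predicted = ["attack", "normal", "normal", "normal", "attack"]
cm = confusion_matrix(actual, predicted)
print(cm[("attack", "attack")])  # correctly flagged attacks → 1
print(cm[("attack", "normal")])  # missed attacks → 1
print(cm[("normal", "attack")])  # false alarms → 1
```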

There are several ways to eliminate your non-numerical inputs. One of them is to click the Variables ellipsis (...) in the properties of your MBR node. A dialog will open; set the role to "Rejected" for every variable that is not your binary target or an interval input.

You can use the Filter node to filter out outliers and observations that could throw off your model. It's up to you whether to use it in your MBR flow. There is more information about the Filter node in the reference help.

I hope this helps,

Miguel

All Replies

New Contributor

## Re: Classification: K nearest neighbors (MBR)

Thank you very much, Miguel! Your suggestions are very helpful; now I can do at least half of the exercise.

Do you know how to eliminate the non-numerical features from my dataset? I thought of using a Filter node, but I'm not sure it's the right tool, since I must eliminate entire non-numerical features, not just some of their values. Without this step I cannot do the rest of the exercise, only explain the procedure to follow.

Super Contributor

## Re: Classification: K nearest neighbors (MBR)

I highly recommend taking the course Advanced Analytics Using SAS Enterprise Miner to build solid foundations in most Enterprise Miner analytics tasks.

In the meantime, you can read the Getting Started with SAS Enterprise Miner section in the reference help (Help->Contents menu, or press F1), and other sections as you need them.


Good luck!

Occasional Contributor

## Re: Classification: K nearest neighbors (MBR)

Hi, Miguel:

Do you know the difference between PROC DISCRIM and the MBR node in terms of KNN? I used both but got totally different results.

I don't know which one to use for scoring my new data now.

Super User

## Re: Classification: K nearest neighbors (MBR)

Discriminant analysis assumes you know the outcome when creating your model; K-nearest-neighbour methods assume you don't.

DA is a supervised learning algorithm while KNN is an unsupervised learning algorithm.

Your new data gets scored with the original method you used.

Occasional Contributor

## Re: Classification: K nearest neighbors (MBR)

Hi, Reeza:

I used PROC DISCRIM with METHOD=NPAR, which in turn gives me the KNN (k-nearest neighbors) algorithm.

KNN is NOT an unsupervised learning algorithm; it is a supervised learning algorithm.

Super User

## Re: Classification: K nearest neighbors (MBR)

I mixed up K-Means and KNN.

SAS Employee

## Re: Classification: K nearest neighbors (MBR)

You can use the Metadata node to drop variables in the middle of an EM workflow. You are right that the Filter node is for 'cutting values' of a variable; the Metadata node, as the name suggests, is about managing the metadata of your data sets.
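To picture what rejecting the non-numeric inputs amounts to, here is a hypothetical plain-Python sketch (the column names are invented, not taken from the real Intrusion Detection file): keep every column whose values all parse as numbers, plus the target.

```python
# Conceptual sketch of rejecting non-numeric inputs (what the Metadata
# node's role settings accomplish). Column names here are invented.
def numeric_columns(rows, target):
    """Keep columns whose every value parses as a number, plus the target."""
    def is_num(value):
        try:
            float(value)
            return True
        except ValueError:
            return False
    return [c for c in rows[0]
            if c == target or all(is_num(r[c]) for r in rows)]

rows = [
    {"duration": "0", "protocol": "tcp", "src_bytes": "181", "label": "normal"},
    {"duration": "2", "protocol": "udp", "src_bytes": "239", "label": "attack"},
]
print(numeric_columns(rows, target="label"))  # → ['duration', 'src_bytes', 'label']
```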

Jason Xin

New Contributor

## Re: Classification: K nearest neighbors (MBR)

Is there a way to program a grid search for K instead of having to manually set up different model nodes?
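Something like the following plain-Python sketch is what I mean (made-up toy data; one deliberately mislabeled training point makes k=1 overfit):

```python
# Hand-rolled grid search over k: evaluate each candidate on a held-out set
# and keep the k with the lowest total error. Plain Python, toy data.
import math
from collections import Counter

def knn_error(train_X, train_y, test_X, test_y, k):
    """Total misclassification rate of k-NN on the held-out set."""
    def predict(x):
        dists = sorted((math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y))
        return Counter(l for _, l in dists[:k]).most_common(1)[0][0]
    return sum(predict(x) != y for x, y in zip(test_X, test_y)) / len(test_y)

# One mislabeled "attack" point at (0.6, 0.6) sits inside the normal cluster.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.6, 0.6)]
train_y = ["normal"] * 3 + ["attack"] * 4
test_X  = [(0.5, 0.5), (5.5, 5.5)]
test_y  = ["normal", "attack"]

errors = {k: knn_error(train_X, train_y, test_X, test_y, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
print(best_k, errors)  # → 3 {1: 0.5, 3: 0.0, 5: 0.0}
```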