Solved: Classification: K nearest neighbors (MBR)

teodoro_stefanello · Posted 07-03-2014 03:57 AM

Hi all, I'm a math student who has to pass a data mining exam in a week. I cannot work out any part of this exercise using SAS Enterprise Miner. Can someone help me?

In SAS Enterprise Miner, create an MBR classifier on the Intrusion Detection dataset (a .csv file I downloaded from my university website), following these steps:
1. Perform an exploratory analysis of the data and present the main results.
2. Partition the dataset into training, test, and validation sets (50%, 30%, 20%).
3. Eliminate the non-numerical features and explain the reasons for excluding them.
4. Choose a value of K (indicate why you chose it).
5. Calculate the total prediction error on the test set.
6. Repeat steps 3 and 4, changing the chosen value of K, for a total of 3 iterations.
7. Plot the curve of total error as K varies. For which value of K do you obtain the smallest error?
8. For this value of K, show the confusion matrix on the validation set.

I have problems mainly with step 3, but also with steps 4 (how can I justify my choice of K?), 5, 7 (how can I see the curve of total error?) and 8 (how can I find the confusion matrix?).

I apologize for my poor English, and I hope for your help.

Teodoro

M_Maldonado · Posted 07-03-2014 11:37 AM

Hi Teodoro,

To find a suitable number of nearest neighbors, I would run several MBR nodes with different numbers of neighbors, and then use a Model Comparison node to compare their fit statistics and their score distributions. This is just my preference; I am not sure whether there is a more theoretical way to do it.
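Outside Enterprise Miner, the same comparison can be sketched in a few lines of Python. This is only an illustration under assumptions, not what the MBR node does internally: it uses plain Euclidean distance and a majority vote, and the function names (`knn_predict`, `error_rate`, `compare_k`) and the toy data are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest training rows."""
    # Rank training rows by Euclidean distance to x and keep the k closest.
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train_X, train_y, test_X, test_y, k):
    """Fraction of test rows whose predicted label differs from the true label."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y for x, y in zip(test_X, test_y))
    return wrong / len(test_y)

def compare_k(train_X, train_y, test_X, test_y, candidate_ks):
    """Return {k: test error} for every candidate k."""
    return {k: error_rate(train_X, train_y, test_X, test_y, k) for k in candidate_ks}

# Made-up toy data: two numeric inputs and a binary target.
# The last training row is deliberately mislabeled to act as noise.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9),
           (0.06, 0.1)]
train_y = ["normal", "normal", "normal",
           "attack", "attack", "attack",
           "attack"]
test_X = [(0.05, 0.1), (1.0, 0.95)]
test_y = ["normal", "attack"]

errors = compare_k(train_X, train_y, test_X, test_y, candidate_ks=[1, 3, 5])
best_k = min(errors, key=errors.get)  # on ties, the first candidate wins
for k, err in errors.items():
    print(f"k={k}: test error {err:.2f}")
print("smallest test error at k =", best_k)
```

Plotting `errors` against K gives the kind of error curve the exercise asks for. Using odd values of K with a binary target avoids tied votes.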

Two options to see the classification matrix:

1. For any node in the Model tab, you can see the classification matrix in your results. Go to View->Assessment->Classification Chart. If you want to see the numbers, click on the fourth icon (table button).

2. Alternatively, connect your MBR node to a Model Comparison node. You will see the classification matrix in the results of the Model Comparison node.
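If it helps to see what the classification (confusion) matrix contains, here is a minimal Python sketch that builds one from two lists of labels. The labels, the data, and the helper names `confusion_counts` and `format_matrix` are assumptions made for illustration; Enterprise Miner produces its own table.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def format_matrix(counts, labels):
    """Lay the counts out as text: rows are actual labels, columns are predicted."""
    width = max(len(label) for label in labels) + 2
    header = " " * width + "".join(label.rjust(width) for label in labels)
    rows = [
        label.ljust(width) + "".join(str(counts[(label, p)]).rjust(width) for p in labels)
        for label in labels
    ]
    return "\n".join([header] + rows)

# Made-up validation labels and model predictions.
actual = ["attack", "attack", "normal", "normal", "normal"]
predicted = ["attack", "normal", "normal", "normal", "attack"]

counts = confusion_counts(actual, predicted)
# Off-diagonal cells are the misclassified observations.
misclassified = sum(n for (a, p), n in counts.items() if a != p)
error = misclassified / len(actual)

print(format_matrix(counts, ["attack", "normal"]))
print("misclassification rate:", error)
```

The total prediction error is the sum of the off-diagonal cells divided by the number of observations.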

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of every variable that is neither your binary target nor one of your interval inputs.

The Filter node is for filtering out outliers and other observations that can throw off your model. It is up to you whether to use it in your MBR flow. More information about the Filter node is in the reference help.

I hope it helps,

Miguel

teodoro_stefanello · Posted 07-03-2014 01:21 PM

Thank you very much, Miguel! Your suggestions are very helpful; now I can do at least half of the exercise.

Do you know how to eliminate non-numerical features from my dataset analysis? I thought of using a Filter node, but I don't know if it's useful, since I must eliminate the non-numerical features entirely, not just some of their values. Without this I cannot do the rest of the exercise and can only explain the procedure to follow.

M_Maldonado · Posted 07-03-2014 01:37 PM

I highly recommend taking the course Advanced Analytics Using SAS Enterprise Miner to build a solid foundation in most Enterprise Miner analytics tasks.

In the meantime, you can read the Getting Started with SAS Enterprise Miner section in the reference help (Help->Contents menu, or press F1), and other sections as you need them.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of every variable that is neither your binary target nor one of your interval inputs.
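As a rough picture of what rejecting the non-numerical inputs amounts to, here is a small Python sketch that keeps only the columns whose values all parse as numbers. A nearest-neighbor method based on Euclidean distance needs numeric inputs, which is one way to justify the exclusion. The column names and sample rows are invented, and `numeric_columns` is a hypothetical helper, not an Enterprise Miner feature.

```python
import csv
import io

def is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def numeric_columns(rows, target):
    """Names of columns (other than the target) whose values all parse as numbers."""
    return [
        name for name in rows[0]
        if name != target and all(is_number(row[name]) for row in rows)
    ]

# Made-up sample standing in for the .csv file; column names are hypothetical.
sample = io.StringIO(
    "duration,protocol,src_bytes,label\n"
    "0,tcp,181,normal\n"
    "2,udp,239,attack\n"
)
rows = list(csv.DictReader(sample))

keep = numeric_columns(rows, target="label")
inputs = [[float(row[name]) for name in keep] for row in rows]
targets = [row["label"] for row in rows]
print("kept inputs:", keep)
```

Note that a nominal variable stored as numeric codes would pass this check, so the variable roles still need a manual look.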

The Filter node is for filtering out outliers and other observations that can throw off your model. It is up to you whether to use it in your MBR flow. More information about the Filter node is in the reference help.

Good luck!

EricTsai · Posted 06-08-2015 07:52 PM

Hi, Miguel:

Do you know the difference between PROC DISCRIM and the MBR node in terms of KNN? I used both but got totally different results.

I don't know which one to use for scoring my new data now.

Reeza · Posted 06-10-2015 12:23 PM

Discrimination analysis assumes you know the outcome to create your model, K nearest neighbour methods assume you don't.

DA is a supervised learning algorithm while KNN is an unsupervised learning algorithm.

Your new data gets scored with the original method you used.

Also, please start your own discussion in the future.

EricTsai · Posted 06-10-2015 01:48 PM

Hi, Reeza:

I used PROC DISCRIM with METHOD=NPAR, which in turn gives me the KNN (k-nearest neighbors) algorithm.

KNN is NOT an unsupervised learning algorithm; it is a supervised learning algorithm.

Reeza · Posted 06-10-2015 03:03 PM

I mixed up K-Means and KNN.

JasonXin · Posted 07-08-2014 10:39 AM

You can use the Metadata node to drop variables in the middle of an EM workflow. You are right that the Filter node is for 'cutting values' of a variable. The Metadata node, as the name suggests, is about managing data sets.

Jason Xin

IvanGV · Posted 06-10-2015 12:02 PM

Hi Miguel, thank you for your answer.

Is there a way to program a grid search for K instead of having to manually set different model nodes?

Thanks for your reply, best regards.

Ivan.
