Classification: K nearest neighbors (MBR)

07-03-2014 03:57 AM

Hi all, I'm a math student who must pass a data mining exam in a week. I cannot work out any part of this exercise using SAS Enterprise Miner. Can someone help me?

Create an MBR classifier in SAS Enterprise Miner on the Intrusion Detection dataset (a .csv file I downloaded from my university website), after having:

1. Performed an exploratory analysis of the data and presented the main results.

2. Partitioned the dataset into training, test, and validation sets (50%-30%-20%).

3. Eliminated the non-numerical features and explained the reasons for this exclusion.

4. Chosen a value of K (indicate why you chose it).

5. Calculate the total prediction error on the test set.

6. Repeat steps 3 and 4, changing the chosen value of K, for a total of 3 iterations.

7. View the curve of the total error as K varies. For which value of K do you obtain the smallest error?

8. For this value of K, show the confusion matrix on the validation set.

I have problems mainly with step 3, but also with steps 4 (how can I justify my choice of K?), 5, 7 (how can I see the curve of total error?) and 8 (how can I find the confusion matrix?).

I apologize for my poor English, and I hope for your help.

Teodoro

Accepted Solutions

Solution

07-07-2017
03:14 PM

Posted in reply to teodoro_stefanello

07-03-2014 11:37 AM - last edited on 07-07-2017 03:15 PM by DougWielenga

Hi Teodoro,

To find a suitable number of nearest neighbors, I would run several MBR nodes, each with a different number of neighbors, and then use a Model Comparison node to compare their fit statistics and their score distributions. This is just my preference; I am not sure whether there is a more theoretical way to do it.
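
For readers who want to see the same idea outside Enterprise Miner, here is a minimal Python sketch of comparing several values of K by their error rate on held-out data. It is not what the MBR node does internally; the toy data, the labels "normal"/"attack", and the helper names `knn_predict` and `error_rate` are all made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training rows closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def error_rate(train_X, train_y, eval_X, eval_y, k):
    """Fraction of evaluation rows whose predicted label differs from the true one."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(eval_X, eval_y))
    return wrong / len(eval_X)

# Toy, made-up data: two numeric inputs and a binary target (hypothetical labels).
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (1.0, 1.1), (0.9, 1.0)]
train_y = ["normal", "normal", "normal", "normal", "attack", "attack"]
test_X = [(0.1, 0.1), (1.0, 0.9), (0.2, 0.2), (0.8, 1.2)]
test_y = ["normal", "attack", "normal", "attack"]

# One "model" per candidate K, compared on the same held-out rows.
errors = {k: error_rate(train_X, train_y, test_X, test_y, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)  # on ties, the smallest K listed first wins

for k, err in errors.items():
    print(f"K={k}: error={err:.2f}")
print("lowest error at K =", best_k)
```

Plotting `errors` against K gives the "error curve" asked about in the exercise. In this toy set only two training rows are attacks, so K=5 can never vote "attack", which shows how an overly large K can raise the error.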

Two options to see the classification matrix:

1. For any node in the Model tab, you can see the classification matrix in your results. Go to View->Assessment->Classification Chart. If you want to see the numbers, click on the fourth icon (table button).

2. Alternatively, you can connect your MBR node to a Model Comparison node. You will see the classification matrix in the results of the Model Comparison node.
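
To make clear what the classification (confusion) matrix contains, here is a small Python sketch that builds one from actual and predicted labels. The label values and the function name `confusion_matrix` are hypothetical, and the layout (rows = actual, columns = predicted) is an assumption of this sketch, not necessarily how Enterprise Miner displays it.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return (labels, matrix) with rows = actual class, columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

# Made-up labels standing in for a scored validation set.
actual = ["attack", "attack", "normal", "normal", "normal", "attack"]
predicted = ["attack", "normal", "normal", "normal", "attack", "attack"]

labels, matrix = confusion_matrix(actual, predicted)

# Total error = share of rows that fall off the diagonal.
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
misclassification_rate = 1 - correct / total

print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print("misclassification rate:", round(misclassification_rate, 3))
```

The off-diagonal cells are the misclassified rows, so the total prediction error can be read directly from the matrix.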

There are several ways to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Set the role to "Rejected" for every variable that is neither your binary target nor one of your interval inputs.

The Filter node is used to filter out outliers and observations that can throw off your model. It is up to you whether to use it in your MBR flow. There is more information about the Filter node in the reference help.
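
As an illustration of the same selection done in code rather than in the Variables dialog, here is a rough Python sketch that keeps only the columns whose values all parse as numbers. Nearest-neighbor methods rely on a distance between rows, which is why non-numerical columns are set aside. The column names (`duration`, `protocol`, `src_bytes`, `label`) and the sample rows are invented; they are not taken from the actual Intrusion Detection file.

```python
import csv
import io

def is_number(text):
    """True if the string can be read as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Hypothetical stand-in for the .csv file; real column names will differ.
sample = io.StringIO(
    "duration,protocol,src_bytes,label\n"
    "0,tcp,181,normal\n"
    "2,udp,239,attack\n"
    "1,tcp,235,normal\n"
)
rows = list(csv.DictReader(sample))
target = "label"  # assumed name of the binary target

# Keep a column as an input only if every value in it is numeric.
numeric_inputs = [c for c in rows[0]
                  if c != target and all(is_number(r[c]) for r in rows)]
# Everything else (apart from the target) plays the part of a "Rejected" variable.
rejected = [c for c in rows[0] if c != target and c not in numeric_inputs]

print("inputs kept:", numeric_inputs)
print("rejected:", rejected)
```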

I hope it helps,

Miguel

All Replies

Posted in reply to M_Maldonado

07-03-2014 01:21 PM

Thank you very much Miguel! Your suggestions are very helpful; now I can do at least half of the exercise.

Do you know how to eliminate non-numerical features from my dataset analysis? I thought of using a Filter node, but I don't know if it's useful, since I must eliminate entire non-numerical features, not just some of their values. Without this I cannot do the rest of the exercise; I could only explain the procedure to follow.

Posted in reply to teodoro_stefanello

07-03-2014 01:37 PM

I highly recommend taking the course Advanced Analytics Using SAS Enterprise Miner to build a solid foundation in most Enterprise Miner analytics tasks.

In the meantime you can read the Getting Started with SAS Enterprise Miner section in the reference help (Help->Contents menu, or press F1), and other sections as you need them.

There are several ways to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Set the role to "Rejected" for every variable that is neither your binary target nor one of your interval inputs.

The Filter node is used to filter out outliers and observations that can throw off your model. It is up to you whether to use it in your MBR flow. There is more information about the Filter node in the reference help.

Good luck!

Posted in reply to M_Maldonado

06-08-2015 07:52 PM

Hi, Miguel:

Do you know the difference between PROC DISCRIM and MBR node in terms of KNN? I used both but got a totally different result.

I don't know which one to use for scoring my new data now.

Posted in reply to EricTsai

06-10-2015 12:23 PM

Discriminant analysis assumes you know the outcome to create your model; K nearest neighbour methods assume you don't.

DA is a supervised learning algorithm while KNN is an unsupervised learning algorithm.

Your new data gets scored with the original method you used.

Also, please start your own discussion in the future.

Posted in reply to Reeza

06-10-2015 01:48 PM

Hi, Reeza:

I used PROC DISCRIM with METHOD=NPAR, which in turn gives me the KNN (k-nearest neighbors) algorithm.

KNN is NOT an unsupervised learning algorithm; it is a supervised learning algorithm.
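
The distinction can be sketched in a few lines of Python. This is only an illustration on made-up points, with hypothetical function names `knn_classify` and `kmeans_assign`: the KNN vote needs the known labels of the training rows, while a K-means assignment step uses only the coordinates.

```python
import math
from collections import Counter

points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = ["normal", "normal", "attack", "attack"]  # known outcomes (made up)

def knn_classify(x, k):
    """Supervised: the prediction is a vote over the labels of the k nearest points."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def kmeans_assign(centroids):
    """Unsupervised: one K-means assignment step, the labels are never consulted."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

prediction = knn_classify((0.95, 0.9), k=3)
clusters = kmeans_assign([(0.0, 0.1), (1.0, 1.0)])

print(prediction)  # a class label taken from `labels`
print(clusters)    # anonymous cluster indices, one per point
```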

Posted in reply to EricTsai

06-10-2015 03:03 PM

I mixed up K-Means and KNN.

Posted in reply to teodoro_stefanello

07-08-2014 10:39 AM

You can use the Metadata node to drop variables in the middle of an EM workflow. You are right that the Filter node is for 'cutting values' of a variable. The Metadata node, as the name suggests, is about managing the metadata of your data set, such as variable roles.

Jason Xin

Posted in reply to M_Maldonado

06-10-2015 12:02 PM

Hi Miguel, thank you for your answer.

Is there a way to program a grid search for K instead of having to manually set different model nodes?

Thanks for your reply, best regards.

Ivan.