Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Analytics
- /
- SAS Data Science
- /
- Classification: K nearest neighbors (MBR)

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 07-03-2014 03:57 AM
(8257 views)

Hi all, I'm a math student who must pass a data mining exam in a week. I can not fix any point of this exercise using SAS Miner, someone can help me?

Create in SAS Enterprise Miner classifier MBR on Intrusion dataset Detection (a .csv file I downloaded from my university website), after:

1. Performed an exploratory analysis of the data which presented the main results.

2. Partitioned the dataset into training set, test and validation set (50% -30% -20%).

3. Eliminated the non-numerical features and have explained the reasons for such exclusion.

4. Chose a value of K (indicate why you chose it).

5. Calculate the total error of prediction on the test set.

6. Repeat steps 3 and 4 by changing the value of K chosen for a total of 3 iterations.

7. Viewing the curve of total error varying K. For which value of K you obtain a minor error?

8. For this value of k, show the confusion matrix on the validation set.

I have problems mainly with step 3, but also with steps 4 (how can I justify my choice of K?), 5, 7 (how can I see the curve of total error?) and 8 (How can I find the confusion matrix?).

I apologize for my poor English, and I hope for your help.

Teodoro

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi Teodoro,

To find a suitable number of nearest neighbors, I would run several MBR nodes with different number of neighbors, and then use a Model Comparison node to compare their fit statistics, and their score distribution. This is just my preference, not sure if there is a more theoretical way to do it.

Two options to see the classification matrix:

1. For any node in the Model tab, you can see the classification matrix in your results. Go ti View->Assessment->Classification Chart. If you want to see the numbers, click on the fourth icon (table button).

2. Another option, you can connect your MBR to a model comparison node. You will see the classification matrix in the results of your model comparison node.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of all variables that are not your binary target, or your interval inputs.

You use the filter node to filter outliers and observations that can throw off your model. Up to you if you want to use it in your MBR flow. More info about Filter node on the reference help.

I hope it helps,

Miguel

9 REPLIES 9

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi Teodoro,

To find a suitable number of nearest neighbors, I would run several MBR nodes with different number of neighbors, and then use a Model Comparison node to compare their fit statistics, and their score distribution. This is just my preference, not sure if there is a more theoretical way to do it.

Two options to see the classification matrix:

1. For any node in the Model tab, you can see the classification matrix in your results. Go ti View->Assessment->Classification Chart. If you want to see the numbers, click on the fourth icon (table button).

2. Another option, you can connect your MBR to a model comparison node. You will see the classification matrix in the results of your model comparison node.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of all variables that are not your binary target, or your interval inputs.

You use the filter node to filter outliers and observations that can throw off your model. Up to you if you want to use it in your MBR flow. More info about Filter node on the reference help.

I hope it helps,

Miguel

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you very much Miguel! Your suggestions are very helpful, now I can do at least half exercise.

Do you know how eliminate non-numerical features from my dataset analysis? I thought to use a Filter node, but I don't know if it's useful, since I must eliminate all non-numerical features, not some of their values. Without this I cannot do the rest of the exercise but explain only the procedure to follow.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I highly recommend you to take the course Advanced Analytics Using SAS Enterprise Miner to learn solid foundations on most Enterprise Miner Analytics tasks.

In the meantime you can read the Getting Started with SAS Enterprise Miner section in the reference help (Help->Contents menu, or press key F1), and other sections as you need them.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of all variables that are not your binary target, or your interval inputs.

You use the filter node to filter outliers and observations that can throw off your model. Up to you if you want to use it in your MBR flow. More info about Filter node on the reference help.

Good luck!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi, Miguel:

Do you know the difference between PROC DISCRIM and MBR node in terms of KNN? I used both but got a totally different result.

I don't know which one to use for scoring my new data now.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Discrimination analysis assumes you know the outcome to create your model, K nearest neighbour methods assume you don't.

DA is a supervised learning algorithm while KNN is an unsupervised learning algorithm.

Your new data gets scored with the original method you used.

Also, please start your own discussion in the future.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi, Reeza:

I used PROC DISCRIM with METHOD=NPAR, which in terms gives me KNN (k-nearest neighbors) algorithm.

KNN is NOT an unsupervised learning algorithm; it is a supervised learning algorithm.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I mixed up K-Means and KNN.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can use Metadata node to drop variables in the middle of EM workflow. You are right that Filter node is to 'cut values' of a variable. Metadata node, as the name suggests, is about managing data sets.

Jason Xin

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi Miguel, thank you for your answer.

Is there a way to program a grid search for K instead of having to manually set different model nodes?

Thanks for your reply, best regards.

Ivan.

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.