Hi all, I'm a math student who has to pass a data mining exam in a week. I can't solve any part of this exercise using SAS Enterprise Miner. Can someone help me?


In SAS Enterprise Miner, build an MBR classifier on the Intrusion Detection dataset (a .csv file I downloaded from my university website), after having:
1. Performed an exploratory analysis of the data and presented the main results.
2. Partitioned the dataset into training, test and validation sets (50%-30%-20%).
3. Eliminated the non-numerical features and explained the reasons for this exclusion.
4. Chosen a value of K (indicate why you chose it).
5. Calculate the total prediction error on the test set.
6. Repeat steps 3 and 4, changing the chosen value of K, for a total of 3 iterations.
7. View the curve of the total error as K varies. For which value of K do you obtain the lowest error?
8. For this value of K, show the confusion matrix on the validation set.


I have problems mainly with step 3, but also with steps 4 (how can I justify my choice of K?), 5, 7 (how can I see the curve of total error?) and 8 (How can I find the confusion matrix?).


I apologize for my poor English, and I hope for your help.

Teodoro

1 ACCEPTED SOLUTION

Accepted Solutions
M_Maldonado
Barite | Level 11

Hi Teodoro,

To find a suitable number of nearest neighbors, I would run several MBR nodes with different numbers of neighbors, and then use a Model Comparison node to compare their fit statistics and their score distributions. This is just my preference; I'm not sure whether there is a more theoretical way to do it.
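
If it helps to see the idea outside Enterprise Miner, here is a minimal plain-Python sketch of the same comparison: fit a nearest-neighbor classifier for a few values of K and compare the error on the test partition. The data is synthetic, and `knn_predict`, `error_rate` and the class labels are names made up for this illustration; it is not what the MBR node runs internally.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def error_rate(train, holdout, k):
    """Fraction of holdout rows whose predicted label is wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in holdout)
    return wrong / len(holdout)


# Synthetic two-class data (an assumption; replace with your own numeric inputs).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "normal") for _ in range(100)]
data += [((rng.gauss(2, 1), rng.gauss(2, 1)), "attack") for _ in range(100)]
rng.shuffle(data)

# 50% training, 30% test, 20% validation, as in the exercise.
n = len(data)
train = data[: n // 2]
test = data[n // 2 : n * 8 // 10]
valid = data[n * 8 // 10 :]

# One error value per candidate K: this is the "curve" of error versus K.
errors = {k: error_rate(train, test, k) for k in (1, 5, 15)}
best_k = min(errors, key=errors.get)
for k, err in errors.items():
    print(f"K={k:2d}  test error={err:.3f}")
print("lowest test error at K =", best_k)
```

Plotting `errors` against K gives the error curve asked for in step 7. The validation partition is left untouched here so that it can be used once, for the K that wins on the test set.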

 

Two options to see the classification matrix:

1. For any node in the Model tab, you can see the classification matrix in its results. Go to View->Assessment->Classification Chart. If you want to see the numbers, click on the fourth icon (the table button).

2. Alternatively, you can connect your MBR node to a Model Comparison node. You will see the classification matrix in the results of the Model Comparison node.
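
For reference, the classification (confusion) matrix is just a count of each (actual, predicted) pair. A small plain-Python sketch, with made-up labels and hypothetical helper names `confusion_matrix` and `print_matrix`, shows how the numbers in that table are obtained:

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))


def print_matrix(counts, labels):
    """Print rows as actual classes and columns as predicted classes."""
    print("actual \\ predicted".ljust(20) + "".join(l.rjust(10) for l in labels))
    for a in labels:
        print(a.ljust(20) + "".join(str(counts[(a, p)]).rjust(10) for p in labels))


# Hypothetical labels, for illustration only.
actual = ["attack", "attack", "normal", "normal", "normal", "attack"]
predicted = ["attack", "normal", "normal", "normal", "attack", "attack"]

counts = confusion_matrix(actual, predicted)
print_matrix(counts, ["attack", "normal"])

# Misclassification rate = off-diagonal count / total count.
errors = sum(n for (a, p), n in counts.items() if a != p)
print("misclassification rate:", errors / len(actual))
```

The off-diagonal cells are the misclassified observations, so the total error in step 5 can be read from the same table.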

 

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of every variable that is neither your binary target nor one of your interval inputs.
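
As for the reason for the exclusion: a nearest-neighbor method ranks observations by a distance such as the Euclidean distance, which is only defined on numeric values. The sketch below shows the same "reject everything that is not numeric" rule in plain Python. The column names and rows are invented for the example and are not taken from your dataset, and `numeric_columns` is a hypothetical helper.

```python
import csv
import io


def is_number(text):
    """True if the string parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_columns(rows, target):
    """Names of columns (other than the target) whose values are all numeric."""
    names = [name for name in rows[0] if name != target]
    return [name for name in names if all(is_number(r[name]) for r in rows)]


# Tiny made-up sample; the column names are hypothetical, not the real dataset's.
sample = io.StringIO(
    "duration,protocol,src_bytes,label\n"
    "0,tcp,181,normal\n"
    "2,udp,105,attack\n"
    "0,tcp,239,normal\n"
)
rows = list(csv.DictReader(sample))

kept = numeric_columns(rows, target="label")
rejected = [name for name in rows[0] if name not in kept and name != "label"]
print("kept:", kept)
print("rejected:", rejected)
```

In Enterprise Miner the equivalent of `rejected` is the set of variables whose role you set to "Rejected".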

 

You use the Filter node to filter out outliers and observations that can throw off your model. It is up to you whether to use it in your MBR flow. There is more information about the Filter node in the reference help.

 

I hope it helps,

Miguel

9 REPLIES 9
teodoro_stefanello
Calcite | Level 5

Thank you very much, Miguel! Your suggestions are very helpful; now I can do at least half of the exercise.

Do you know how to eliminate the non-numerical features from my analysis? I thought of using a Filter node, but I don't know whether it is useful, since I must eliminate the non-numerical features entirely, not just some of their values. Without this I cannot do the rest of the exercise; I can only explain the procedure to follow.

M_Maldonado
Barite | Level 11

I highly recommend that you take the course Advanced Analytics Using SAS Enterprise Miner to learn solid foundations for most Enterprise Miner analytics tasks.

In the meantime you can read the Getting Started with SAS Enterprise Miner section in the reference help (Help->Contents menu, or press F1), and other sections as you need them.

There are several options to eliminate your non-numerical inputs. One of them is to click on the Variables ellipsis (...) in the properties of your MBR node. A menu will open. Specify "Rejected" as the role of every variable that is neither your binary target nor one of your interval inputs.

You use the Filter node to filter out outliers and observations that can throw off your model. It is up to you whether to use it in your MBR flow. There is more information about the Filter node in the reference help.

Good luck!

EricTsai
Calcite | Level 5

Hi, Miguel:

Do you know the difference between PROC DISCRIM and the MBR node in terms of KNN? I used both but got totally different results.

I don't know which one to use for scoring my new data now.

Reeza
Super User

Discriminant analysis assumes you know the outcome to create your model, K nearest neighbour methods assume you don't.

DA is a supervised learning algorithm while KNN is an unsupervised learning algorithm.

Your new data gets scored with the original method you used.

Also, please start your own discussion in the future.

EricTsai
Calcite | Level 5

Hi, Reeza:

I used PROC DISCRIM with METHOD=NPAR, which in turn gives me the KNN (k-nearest neighbors) algorithm.

KNN is NOT an unsupervised learning algorithm; it is a supervised learning algorithm.

Reeza
Super User

I mixed up K-Means and KNN. 

JasonXin
SAS Employee

You can use the Metadata node to drop variables in the middle of an EM workflow. You are right that the Filter node is for 'cutting values' of a variable. The Metadata node, as the name suggests, is about managing data sets.

Jason Xin

IvanGV
Calcite | Level 5

Hi Miguel, thank you for your answer.

Is there a way to program a grid search for K instead of having to set up different model nodes manually?

Thanks for your reply, best regards.

Ivan.
