BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.

Dears,

I need to know how MBR node work in E-Miner ? what is the difference between (Scan, RD-tree) algorithms in MBR?

Please send me any support materials.

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
mohamed_zaki
Barite | Level 11

As you know MBR is also called ""Nearest Neighbor", it assign a score to new data observation using the average of the K nearest neighbor.

You may return to EM Help "F1", to get more information about the MBR node.

This method works in multiple dimensions, but requires that the inputs have equal variances and no correlation.(return to a previous thread for more details).

Regarding your second Question about the difference between Scan and RD-tree:

-Scan: choose the nearest neighbor by the smallest linear squared distance between the observation and all possible k neighbors.

-RD-tree: use a tree structure to organize data observation in dimension space.

>> A brief description is mentioned in the EM help, and also it mention that the RD-tree works better than scan to dimensional of 100 .

>>Also the only referance i find mention RD-tree in more details is Data Mining Using SAS Enterprise Miner By Randall Matignon , which refer to RD-tree as improvement over the famous KD-tree algorithm and mention the differences between the two. But i find SAS refer to the RD-tree as the Patented Reduced Dimensionality Tree in the SAS Enterprise Miner Fact Sheet and i did find a lot about it.

If you are looking for case studies where MBR node were used i found such two papers:

Improving Credit Risk Scorecards with Memory-Based Reasoning to Reject. Inference with SAS. ®. Enter...

Prediction of Rise in Sea Level by Memory-Based Reasoning Model using SAS® Enterprise Miner™ 12.1, where scan and RD-tree used and results compared to each other.

View solution in original post

6 REPLIES 6
mohamed_zaki
Barite | Level 11

As you know MBR is also called ""Nearest Neighbor", it assign a score to new data observation using the average of the K nearest neighbor.

You may return to EM Help "F1", to get more information about the MBR node.

This method works in multiple dimensions, but requires that the inputs have equal variances and no correlation.(return to a previous thread for more details).

Regarding your second Question about the difference between Scan and RD-tree:

-Scan: choose the nearest neighbor by the smallest linear squared distance between the observation and all possible k neighbors.

-RD-tree: use a tree structure to organize data observation in dimension space.

>> A brief description is mentioned in the EM help, and also it mention that the RD-tree works better than scan to dimensional of 100 .

>>Also the only referance i find mention RD-tree in more details is Data Mining Using SAS Enterprise Miner By Randall Matignon , which refer to RD-tree as improvement over the famous KD-tree algorithm and mention the differences between the two. But i find SAS refer to the RD-tree as the Patented Reduced Dimensionality Tree in the SAS Enterprise Miner Fact Sheet and i did find a lot about it.

If you are looking for case studies where MBR node were used i found such two papers:

Improving Credit Risk Scorecards with Memory-Based Reasoning to Reject. Inference with SAS. ®. Enter...

Prediction of Rise in Sea Level by Memory-Based Reasoning Model using SAS® Enterprise Miner™ 12.1, where scan and RD-tree used and results compared to each other.

husseinmazaar
Quartz | Level 8

Thanks mohamed zaki

Any comment  RalphAbbey

RalphAbbey
SAS Employee

Mohamed made a bunch of good points. The EM help does describe generally how the RD-Tree works, while the Scan method is just and exhaustive search.

I do want to add that for the MBR Node, all of the inputs need to be interval. In addition, the k-nearest neighbors method (in general, not just the MBR Node) assumes that the variables are orthogonal and standardized, so we recommend using principal components prior to MBR unless you know that your variables are orthogonal and standardized. If you don't have orthogonal or standardized variables, then some variables will skew the k-nearest neighbors more than others (the node will still run, but your results may not be as useful as you would like).

I hope this helps.

EricTsai
Calcite | Level 5

Hi, Ralph:

Do you know the difference between MBR node and PROC DISCRIM procedure?  I used both to run my KNN but I got totally different result, but I don't know which one to use for scoring my new data set.

Thanks for your help.

PatrickHall
Obsidian | Level 7

Hi,

This is a very good question. To echo the comments of many above, the basic difference between PROC DISCRIM in SAS/STAT and the MBR node in SAS Enterprise Miner is that the MBR node can use the RDTREE method to search for nearest neighbor observations with a known class in training data to classify new observations in test data. (The RDTREE method is a proprietary version of the popular KDTREE algorithm.) The tree-based neighbor search can be faster, but you may get different results between PROC DISCRIM and the MBR node.

If you want to have the closest correspondence between PROC DISCRIM and the MBR node, use to the SCAN option in the MBR node's METHOD property to request that the MBR node use a conventional distance calculation to classify new observations. However, the traditional distance calculation may be unsuited for big data.

Here is an example of how different your results could be in a low-dimensional simulated data set. (Your results could be even more different with real data.)


discrim_v_mbr.png

Notice that the classifications made by PROC DISCRIM and PROC PMBR (i.e. the MBR node) using the SCAN method are almost identical. Using the RDTREE method with a small EPSILON, you can also closely replicate the results of PROC DISCRIM on this sample data. But, if you change the value for EPSILON then your results can be noticeably different from PROC DISCRIM using its default settings.

Changing the BUCKETS property does not appear to change the classification results, but changing the EPSILON property does change the classification results. So what is going on here?

The BUCKETS property:

  • The value of the BUCKETS option should not affect classification results.
  • It is a parameter that is used to balance the speed vs. memory trade-off for the tree structure. It is basically the number of observations allowed to be in each node of the RDTREE.
  • Nodes of the tree are searched in O(log N) time; within each node the search for neighbors is slower. However building more nodes requires more memory.
  • A lower value for BUCKETS will result in a faster calculation, but more memory being used.
  • A higher value for BUCKETS will result in a slower calculation, but less memory being used.

The EPSILON property:

  • EPSILON controls the approximate nearest neighbor search; changing EPSILON can affect classification results.
  • A larger value for EPSILON will allow more points that may not be actual nearest neighbors to be used to classify a new observation.
  • A smaller value for EPSILON will use more points that are guaranteed to be nearest neighbors to classify a new observation.
  • Using a larger value for EPSILON should decrease execution time.

If you would like to try this experiment for yourself, the code is available here: PROC DISCRIM vs. the MBR node in Enterprise Miner

Message was edited by: Patrick Hall; updated code and added details for BUCKETS and EPSILON.

EricTsai
Calcite | Level 5

Thank you very much Patrick for sharing this information and specifically your code.

It really helps in explaining the difference between the output from PROC DISCRIM and MBR node.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 5038 views
  • 2 likes
  • 5 in conversation