Quartz | Level 8

## RD-Tree algorithm in MBR node

Dears,

I need to know how MBR node work in E-Miner ? what is the difference between (Scan, RD-tree) algorithms in MBR?

Please send me any support materials.

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Barite | Level 11

## Re: RD-Tree algorithm in MBR node

As you know MBR is also called ""Nearest Neighbor", it assign a score to new data observation using the average of the K nearest neighbor.

This method works in multiple dimensions, but requires that the inputs have equal variances and no correlation.(return to a previous thread for more details).

-Scan: choose the nearest neighbor by the smallest linear squared distance between the observation and all possible k neighbors.

-RD-tree: use a tree structure to organize data observation in dimension space.

>> A brief description is mentioned in the EM help, and also it mention that the RD-tree works better than scan to dimensional of 100 .

>>Also the only referance i find mention RD-tree in more details is Data Mining Using SAS Enterprise Miner By Randall Matignon , which refer to RD-tree as improvement over the famous KD-tree algorithm and mention the differences between the two. But i find SAS refer to the RD-tree as the Patented Reduced Dimensionality Tree in the SAS Enterprise Miner Fact Sheet and i did find a lot about it.

If you are looking for case studies where MBR node were used i found such two papers:

Improving Credit Risk Scorecards with Memory-Based Reasoning to Reject. Inference with SAS. ®. Enter...

Prediction of Rise in Sea Level by Memory-Based Reasoning Model using SAS® Enterprise Miner™ 12.1, where scan and RD-tree used and results compared to each other.

6 REPLIES 6
Barite | Level 11

## Re: RD-Tree algorithm in MBR node

As you know MBR is also called ""Nearest Neighbor", it assign a score to new data observation using the average of the K nearest neighbor.

This method works in multiple dimensions, but requires that the inputs have equal variances and no correlation.(return to a previous thread for more details).

-Scan: choose the nearest neighbor by the smallest linear squared distance between the observation and all possible k neighbors.

-RD-tree: use a tree structure to organize data observation in dimension space.

>> A brief description is mentioned in the EM help, and also it mention that the RD-tree works better than scan to dimensional of 100 .

>>Also the only referance i find mention RD-tree in more details is Data Mining Using SAS Enterprise Miner By Randall Matignon , which refer to RD-tree as improvement over the famous KD-tree algorithm and mention the differences between the two. But i find SAS refer to the RD-tree as the Patented Reduced Dimensionality Tree in the SAS Enterprise Miner Fact Sheet and i did find a lot about it.

If you are looking for case studies where MBR node were used i found such two papers:

Improving Credit Risk Scorecards with Memory-Based Reasoning to Reject. Inference with SAS. ®. Enter...

Prediction of Rise in Sea Level by Memory-Based Reasoning Model using SAS® Enterprise Miner™ 12.1, where scan and RD-tree used and results compared to each other.

Quartz | Level 8

## Re: RD-Tree algorithm in MBR node

Thanks mohamed zaki

Any comment  RalphAbbey

SAS Employee

## Re: RD-Tree algorithm in MBR node

Mohamed made a bunch of good points. The EM help does describe generally how the RD-Tree works, while the Scan method is just and exhaustive search.

I do want to add that for the MBR Node, all of the inputs need to be interval. In addition, the k-nearest neighbors method (in general, not just the MBR Node) assumes that the variables are orthogonal and standardized, so we recommend using principal components prior to MBR unless you know that your variables are orthogonal and standardized. If you don't have orthogonal or standardized variables, then some variables will skew the k-nearest neighbors more than others (the node will still run, but your results may not be as useful as you would like).

I hope this helps.

Calcite | Level 5

## Re: RD-Tree algorithm in MBR node

Hi, Ralph:

Do you know the difference between MBR node and PROC DISCRIM procedure?  I used both to run my KNN but I got totally different result, but I don't know which one to use for scoring my new data set.

Obsidian | Level 7

## Re: RD-Tree algorithm in MBR node

Hi,

This is a very good question. To echo the comments of many above, the basic difference between PROC DISCRIM in SAS/STAT and the MBR node in SAS Enterprise Miner is that the MBR node can use the RDTREE method to search for nearest neighbor observations with a known class in training data to classify new observations in test data. (The RDTREE method is a proprietary version of the popular KDTREE algorithm.) The tree-based neighbor search can be faster, but you may get different results between PROC DISCRIM and the MBR node.

If you want to have the closest correspondence between PROC DISCRIM and the MBR node, use to the SCAN option in the MBR node's METHOD property to request that the MBR node use a conventional distance calculation to classify new observations. However, the traditional distance calculation may be unsuited for big data.

Here is an example of how different your results could be in a low-dimensional simulated data set. (Your results could be even more different with real data.)

Notice that the classifications made by PROC DISCRIM and PROC PMBR (i.e. the MBR node) using the SCAN method are almost identical. Using the RDTREE method with a small EPSILON, you can also closely replicate the results of PROC DISCRIM on this sample data. But, if you change the value for EPSILON then your results can be noticeably different from PROC DISCRIM using its default settings.

Changing the BUCKETS property does not appear to change the classification results, but changing the EPSILON property does change the classification results. So what is going on here?

The BUCKETS property:

• The value of the BUCKETS option should not affect classification results.
• It is a parameter that is used to balance the speed vs. memory trade-off for the tree structure. It is basically the number of observations allowed to be in each node of the RDTREE.
• Nodes of the tree are searched in O(log N) time; within each node the search for neighbors is slower. However building more nodes requires more memory.
• A lower value for BUCKETS will result in a faster calculation, but more memory being used.
• A higher value for BUCKETS will result in a slower calculation, but less memory being used.

The EPSILON property:

• EPSILON controls the approximate nearest neighbor search; changing EPSILON can affect classification results.
• A larger value for EPSILON will allow more points that may not be actual nearest neighbors to be used to classify a new observation.
• A smaller value for EPSILON will use more points that are guaranteed to be nearest neighbors to classify a new observation.
• Using a larger value for EPSILON should decrease execution time.

If you would like to try this experiment for yourself, the code is available here: PROC DISCRIM vs. the MBR node in Enterprise Miner

Message was edited by: Patrick Hall; updated code and added details for BUCKETS and EPSILON.

Calcite | Level 5

## Re: RD-Tree algorithm in MBR node

Thank you very much Patrick for sharing this information and specifically your code.

It really helps in explaining the difference between the output from PROC DISCRIM and MBR node.

Discussion stats
• 6 replies
• 5272 views
• 2 likes
• 5 in conversation