Have you heard that Enterprise Miner is multi-threaded or that it can be run in a distributed environment? Have you ever wondered what exactly was meant when you heard this? In this tip we explore what these concepts mean and how they can help you use Enterprise Miner more effectively.
This is a topic that could quickly end up very technical, but I will try to keep this at a pretty high level. The goal is to gain a general understanding, as opposed to rigorous technical detail and accuracy. We will be using the terms “process” and “thread” throughout the discussion and it’s good to start off with an analogy for how they relate to each other.
Let’s think of our computer as a kitchen, making cookies as our “process,” and the people working on making cookies as our “threads.” You may only need one person making cookies (one thread working), but if you want to make a lot of cookies, you may want multiple people helping out! We will revisit this analogy later.
Enterprise Miner and Multiple Processes:
When you create an Enterprise Miner flow, you will often connect one node to another in a sequence. When you run this flow, each node runs one after another until completion. You may already know, however, that you can branch your flow and then put a Control Point Node at the end to run multiple branches at once. Let’s look at the sequence of pictures below. I’ve included a portion of my Windows Task Manager for us to examine as well.
Before the nodes are run, there is only one SAS process running.
After the nodes have started running, additional SAS processes have been started for each node.
In the above pictures we see three modeling nodes running concurrently, as opposed to one after another. Before running the nodes, only one SAS process exists. After the nodes start to run, Enterprise Miner creates a SAS process for each node, and your operating system handles the scheduling and running of these processes! Most nodes in Enterprise Miner only use one thread per process. A general exception, however, is that the nodes under the HPDM tab can use multiple threads in the underlying procedures. We’ll look at this next.
Think of this as making cookies, brownies, and pie all at the same time. Each is a separate process in your kitchen, but each can be done concurrently (as long as you have multiple cooks). You are technically running with multiple threads, because each process has its own thread.
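To make the "one process per node" idea concrete, here is a minimal Python sketch. It is not how Enterprise Miner launches SAS; the node names and the `run_nodes_concurrently` helper are made up for illustration. It simply starts one operating-system process per node and lets the OS schedule them, which is the same pattern you see in the Task Manager pictures.

```python
import subprocess
import sys

# Hypothetical node names; each stands in for one modeling node in a branched flow.
NODES = ["regression", "decision_tree", "neural_network"]

# Tiny stand-in for a node's work: report the node name and its own process ID.
CHILD_CODE = "import os, sys; print(sys.argv[1], os.getpid())"

def run_nodes_concurrently(nodes):
    # Start one OS process per node without waiting for the previous one to finish.
    running = [
        subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE, node],
            stdout=subprocess.PIPE,
            text=True,
        )
        for node in nodes
    ]
    # Collect each node's output once it finishes; the OS handles the scheduling.
    results = {}
    for proc in running:
        out, _ = proc.communicate()
        name, pid = out.split()
        results[name] = int(pid)
    return results

node_pids = run_nodes_concurrently(NODES)
```

Each entry in `node_pids` carries a different process ID, one per node, just as each running node shows up as its own SAS process.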
However, when we talk about multi-threading, we usually mean multiple threads per process.
Enterprise Miner and Multi-Threading:
As of Enterprise Miner 12.3, everyone has access to the high-performance data mining nodes, found under the HPDM tab. These nodes are special because they can run in a distributed environment (we’ll talk about this soon), and also because the underlying SAS procedures are multi-threaded.
If you run the HP Nodes, Enterprise Miner creates a SAS process for each HP Node, just as it did with the non-HP nodes. As we can see, there are three HP Nodes running at the same time.
Instead of only having one thread for each process, the underlying procedures behind the HP Nodes support working with multiple threads. By default, Enterprise Miner automatically sets the number of threads used in the high-performance procedures. However, you can manually set this under the project macro variables.
The option HPDM_NTHREADS here has been set to 256. That means whenever a high performance procedure is called in the HP nodes, that procedure will run with 256 threads.
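As a rough illustration of "multiple threads per process", the Python sketch below splits one task across a configurable number of worker threads inside a single process. The `n_threads` parameter plays a role loosely similar to HPDM_NTHREADS, but the function and its name are assumptions made for this example, not anything from SAS. Note also that standard Python threads do not speed up CPU-bound work the way threads in a SAS procedure can; the sketch only shows the structure.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(values, n_threads):
    # Split the input into roughly one slice per thread (ceiling division).
    chunk = max(1, -(-len(values) // n_threads))
    slices = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    # All worker threads live inside this single process and share its memory.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(lambda part: sum(v * v for v in part), slices)
        return sum(partials)

data = list(range(1, 1001))
one_thread = sum_of_squares(data, n_threads=1)
four_threads = sum_of_squares(data, n_threads=4)
```

Both calls return the same answer. Changing the thread count changes how the work is divided, not the result.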
NOTE: We should stop and discuss this for a moment. More threads DOES NOT necessarily mean faster! Think about our kitchen analogy. We can use the cliché of “too many cooks in the kitchen” to refer to our threads too. If we have a large kitchen (a computer with many cores) we can accommodate more threads. At a certain point, however, adding more threads will slow down the overall process. Unless you know your system well, it may be better to let Enterprise Miner handle this by default. Generally the default is related to the setup of your system. The capabilities of your computer’s processor and the number of cores it contains will dictate the best number of threads for your system.
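The "too many cooks" effect can be sketched with a toy cost model in Python. This is purely illustrative and is not a measurement of Enterprise Miner: the function name, the 8-core machine, the 100 seconds of work, and the fixed per-thread overhead are all assumptions chosen to make the shape of the trade-off visible.

```python
def estimated_runtime(work_seconds, threads, cores, overhead_per_thread=0.05):
    # Toy assumption: speedup is capped by the number of cores,
    # and every extra thread adds a small fixed coordination cost.
    effective_parallelism = min(threads, cores)
    return work_seconds / effective_parallelism + overhead_per_thread * threads

CORES = 8      # hypothetical machine
WORK = 100.0   # hypothetical single-thread runtime in seconds

# Try every thread count from 1 to 256 and keep the one with the lowest estimate.
best_threads = min(range(1, 257), key=lambda n: estimated_runtime(WORK, n, CORES))
runtime_at_best = estimated_runtime(WORK, best_threads, CORES)
runtime_at_256 = estimated_runtime(WORK, 256, CORES)
```

Under these assumptions the estimate is lowest when the thread count matches the core count, and asking for 256 threads on 8 cores comes out slower than asking for 8. Real systems are messier, but the shape is the same: past a certain point, extra threads only add overhead.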
Let’s think about our cookies, brownies, and pie again. With the HP Nodes we may have 2 people making cookies, 2 people making brownies, and 2 people making pie, instead of just 1 per task.
Enterprise Miner and Distributed Computing:
The HPDM Nodes in Enterprise Miner are also capable of running in a distributed environment (provided that you have the correct license and hardware setup). What this means is that the data is spread out in multiple locations, and the analytics are taking place in this spread out environment.
When the data is spread out over multiple machines you can tackle larger problems than you could in a normal desktop environment. This is because some underlying SAS procedures need to load portions of the data into memory. If you have a very large data set (“big data”), the procedure may run out of memory on a single machine before it can handle the data set. When you have a distributed grid set up, the data is distributed, so when it is loaded you will not run out of memory on a single machine (because each machine handles a smaller portion of the data).
The high performance procedures work on the distributed data using multiple threads per node in your grid. These procedures perform the necessary communication across your grid to obtain your results, and then the data is saved back to the grid – in a distributed manner.
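The general pattern of working on distributed data can be sketched in a few lines of Python. This is a simplified stand-in, not the algorithm the high-performance procedures use: the "machines" are just lists in one program, and the `partition`, `local_summary`, and `combine` helpers are hypothetical names. The point is that each machine only sees its own slice, and only small summaries need to be communicated.

```python
def partition(rows, n_machines):
    # Deal rows out round-robin so each hypothetical machine holds only part of the data.
    return [rows[i::n_machines] for i in range(n_machines)]

def local_summary(rows):
    # Work done on one machine: it only ever sees its own slice.
    return len(rows), sum(rows)

def combine(summaries):
    # The communication step: small summaries travel across the grid, not the raw data.
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(subtotal for _, subtotal in summaries)
    return total_sum / total_count

rows = list(range(1, 101))
shards = partition(rows, n_machines=4)
mean = combine([local_summary(shard) for shard in shards])
```

Here no single "machine" ever holds more than a quarter of the rows, yet the combined mean matches what you would get from the full data set.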
NOTE: Distributed Computing has the potential to solve larger problems than using a single machine, and in addition, it can often run faster than a single machine as well. However, it is important to note that for smaller problems (that aren’t “big data”) the grid communication can actually increase the time it takes to run a problem. In general, you may find that if the data would easily fit into your RAM on a single machine, then it will not run faster in a distributed environment.
Let’s now consider our analogy.
If you want to host a very large event with food, you may contract out the cooking to other companies. These companies may cook various portions of the food at their own facilities independently, and then bring the food to your event. In this case Enterprise Miner is cooking the various portions of the model development on different computers and finally bringing you back a model and results to your single client!
Enterprise Miner can create multiple SAS processes to concurrently work on your data mining flow. Take advantage of this by splitting your flows, and then putting a Control Point Node at the end. By running the Control Point Node, Enterprise Miner will automatically run multiple nodes at once for you.
In addition, Enterprise Miner HP Nodes allow you to use multi-threaded SAS procedures. Enterprise Miner can automatically set the number of threads, or you can set it yourself by changing the HPDM_NTHREADS macro variable in the project macro variables.
Finally, if you have the correct license and hardware setup, Enterprise Miner can run on distributed data. This lets you solve very large problems, often faster than a single machine could.
Hopefully this helps you understand multi-threading and distributed computing in Enterprise Miner.