09-28-2016 01:07 PM
Yesterday I tried running PROC FOREST with auto tuned hyperparameters with the example bank dataset. It ran for over three hours without completing and I eventually terminated the run. To be fair, in the auto tuning subsection of the Options tab for random forests, it warns the user that auto tuning may take a long time to run. After trying it yesterday, I had a few additional suggestions for the production release:
1. Progress - it would be very helpful to have some indication of progress, and especially an estimated time to completion, so the user can judge whether the job will finish in the time available. For instance, if the job has been running 15 minutes and the estimated completion time is 3 hours and I have 4 hours of time, I would let the job run. On the other hand, if it would take an estimated 50 hours and I only have 4 hours of time, I would cancel the job and pursue another analysis option. It seems to me that once the analysis has been running for a little while, the software should be able to estimate progress and remaining run time with sufficient accuracy (understanding that this would be a rough approximation rather than a precise prediction of when the run would finish).
2. Multiple sessions - I also noticed while running the random forest analysis yesterday that I was unable to take any other actions within the Viya environment. Would it be possible to somehow continue working on other things while the random forest analysis runs in the background?
With best wishes,
10-03-2016 04:01 PM
Hi Tor - Thanks for testing this out and providing the feedback.
First some comments on the autotuning process in general. This initial implementation of autotuning only supports sequential evaluation (training) of each candidate model, which is why the runtimes are so long for large data sets. It is trying to train many models, some of them expensive configurations (e.g., many trees in a forest, large neural nets) - and it is training them sequentially right now. Rest assured we have some significant enhancements coming that will support training multiple candidate models in parallel across the compute resources under control of the Viya distributed execution engine. So if you are running your GA with population size 10 it will train all 10 in parallel instead of sequentially (thus a 10x speedup!...well, assuming you have compute nodes to distribute to).
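To make the sequential vs. parallel comparison concrete, here is a rough back-of-the-envelope sketch in Python. It is not how the Viya engine actually schedules work; the function name `batch_wall_time`, the greedy assignment rule, and the 6-minute training times are all assumptions made up for illustration.

```python
def batch_wall_time(train_times, workers):
    """Wall-clock time to train one batch of candidate models.

    Hypothetical scheduling model: candidates are handed out, longest
    first, to whichever worker currently has the least work queued.
    With workers=1 this reduces to plain sequential training.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0.0] * workers
    for t in sorted(train_times, reverse=True):
        least_loaded = loads.index(min(loads))
        loads[least_loaded] += t
    # The batch is done when the busiest worker finishes.
    return max(loads)


# Assumed example: a population of 10 candidates, 6 minutes each.
times = [6.0] * 10
sequential = batch_wall_time(times, workers=1)   # 60.0 minutes
parallel = batch_wall_time(times, workers=10)    # 6.0 minutes
limited = batch_wall_time(times, workers=4)      # 18.0 minutes
print(sequential / parallel)  # 10.0
```

With only 4 workers available, the same batch takes 18 minutes under this model, so the speedup drops to about 3.3x. That is the "assuming you have compute nodes to distribute to" caveat in numbers.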
That being said, let's get to your specific questions/suggestions:
1) Lack of progress indication is a big sore point with many of us right now...we do plan to enhance this for sure. There are 2 aspects to this really - (a) just seeing that it is making progress, and (b) getting a sense for how much longer it will run.
For (a), unfortunately SAS Studio only flushes the log output at the end of a proc run in this release...that is being worked on. The underlying action does make intermediate progress information available, including the details of each candidate model - we are working on exposing that through the proc. Overall, I think you should see more progress/status info in the upcoming releases.
As for (b), this is challenging as I'm sure you realize. Take, for example, one batch of candidate models that we are training, all with different combinations of hyperparameter values. First, these might be running in parallel - or some might be held up in a queue waiting for resources. But even assuming unlimited available compute resources, each of these models will potentially take a significantly different amount of time to train (e.g. a 500 tree forest vs a 50 tree forest, or a neural net with hidden layers of 20 neurons vs 200 neurons). So each modeling algorithm would have to provide an estimate of its computational time based on the hyperparameter values at hand. Certainly possible, but not on our radar right now. The best we could do I think is track it based on the number of candidate models completed vs the total number (which would only be a rough estimate).
Either way - I completely get your point. This thing is running...we know it's going to take a while, and right now it doesn't give you any sense of whether it's making progress or how much longer it has to go. We'll continue to improve this user experience.
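The "number completed vs total" tracking mentioned under (b) could look something like the Python sketch below. This is only an illustration under stated assumptions, not anything the proc does today: the function name `estimate_remaining` is hypothetical, and it assumes candidates are evaluated one after another and cost roughly the same on average.

```python
def estimate_remaining(elapsed_seconds, completed, total):
    """Rough remaining run time from the count of finished candidates.

    Assumes the candidates still to run will take, on average, as long
    as the ones already finished. Returns None until at least one
    candidate has completed, since there is nothing to extrapolate from.
    """
    if completed < 0 or total < completed:
        raise ValueError("need 0 <= completed <= total")
    if completed == 0:
        return None
    average_per_candidate = elapsed_seconds / completed
    return average_per_candidate * (total - completed)


# Assumed example: 15 minutes in, 5 of 65 candidate models finished.
remaining = estimate_remaining(elapsed_seconds=900, completed=5, total=65)
print(remaining / 3600)  # 3.0 hours
```

As noted above, the average can be badly off when the remaining candidates are much more expensive than the finished ones (a 500 tree forest vs a 50 tree forest), so a number like this is a rough guide for deciding whether to let the job run, nothing more.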
2) SAS Studio does currently go modal (i.e., it locks you out) when you run a program/task. For now, what I suggest for anything you expect to take a while is to submit it in batch. Save your program to a file, then in the navigation pane on the left select "Server Files and Folders", find your program, right-click it and select "Background Submit". A message will pop up in the lower right saying it was submitted, and you can then check its status at any point by selecting "Background Job Status" under the "More application options" menu (the button next to the "?" button in the upper right).
Hope this info helps you.
Keep plugging away and keep the feedback coming.