I would like to implement usage note 22607, which suggests building a logistic regression model using a small sample, and passing the parameter estimates to a regression node that uses the full training/validation sets. The idea is that the models build quickly with a small sample, and the full model will build much faster when you start with estimates that are close to what you need.
I have been unable to find any documentation that says how to implement this. Simply hooking the nodes up doesn't work. I gave the full regression node two inputs: the full training/validation data set, and the output from the first regression node. When you click on the lines connecting the nodes you can see the data that's being passed, but you can't change it. I have spent a lot of time searching in vain for an example of how to do this with a SAS Code node.
I posted this question in a different format last Saturday, and I have received no responses. If I don't get a response by tomorrow I will be unable to include a proper logistic regression model in my assignment. My target is categorical with 6 levels, there are 70,000 total rows, and there are 11 inputs, mostly categorical. The modeling does not complete in several hours when two-factor interactions are set to yes (there are such interactions).
Thanks for any help.
Short answer -- there is not a way to do this in SAS Enterprise Miner unless you want to write the code to call the LOGISTIC procedure itself in a SAS Code node.
It is true that starting with parameters closer to optimal values, when such values exist, might lead to a solution in fewer iterations, but the problem you are attempting to solve is not (in my experience) an issue in SAS Enterprise Miner when it is used as intended. As you have already found, you are able to build far superior models (in this case) using other, more flexible modeling strategies. I have also seen situations where regression models have done as well as or better than far more flexible models.
In either case, your question was how to do this in SAS Enterprise Miner, and my answer was to try and explain why...
... the problem addressed by the Usage Note should not be an issue in SAS Enterprise Miner
... if the problem occurs, there are better ways to address it in SAS Enterprise Miner
... for this reason and others (see discussion about scoring, assessment, etc...) there is no functionality to pass the parameters from one Regression node to a subsequent Regression node
... the technique proposed for the LOGISTIC procedure is problematic for the data mining data sets that SAS Enterprise Miner was designed for, since there is a much higher likelihood of quasi-separation when the number of levels of a categorical variable increases, particularly when such variables are considered in interactions.
Hope this helps!
Doug
Why can't you ask your instructor for assistance?
The instructor is not a SAS expert. The course allows students to use the products they choose, whether it be R, SAS EM, Python, IBM Watson Analytics, etc.
Fair enough.
Here's how that would work in a code node. I'm not sure how to connect the nodes in the EM tool itself.
I don't know if you're stuck with SAS EM, but SAS UE is also available, and it supports SAS/STAT and logistic regression.
title 'Example 2. Modeling with Categorical Predictors';
data Neuralgia;
input Treatment $ Sex $ Age Duration Pain $ @@;
datalines;
P F 68 1 No B M 74 16 No P F 67 30 No
P M 66 26 Yes B F 67 28 No B F 77 16 No
A F 71 12 No B F 72 50 No B F 76 9 Yes
A M 71 17 Yes A F 63 27 No A F 69 18 Yes
B F 66 12 No A M 62 42 No P F 64 1 Yes
A F 64 17 No P M 74 4 No A F 72 25 No
P M 70 1 Yes B M 66 19 No B M 59 29 No
A F 64 30 No A M 70 28 No A M 69 1 No
B F 78 1 No P M 83 1 Yes B F 69 42 No
B M 75 30 Yes P M 77 29 Yes P F 79 20 Yes
A M 70 12 No A F 69 12 No B F 65 14 No
B M 70 1 No B M 67 23 No A M 76 25 Yes
P M 78 12 Yes B M 77 1 Yes B F 69 24 No
P M 66 4 Yes P F 65 29 No P M 60 26 Yes
A M 78 15 Yes B M 75 21 Yes A F 67 11 No
P F 72 27 No P F 70 13 Yes A M 75 6 Yes
B F 65 7 No P F 68 27 Yes P M 68 11 Yes
P M 67 17 Yes B M 70 22 No A M 65 15 No
P F 67 1 Yes A M 67 10 No P F 72 11 Yes
A F 74 1 No B M 80 21 Yes A F 69 3 No
;
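/* Draw a 25% simple random sample. SEED= makes the draw repeatable (12345 is an arbitrary value). */
proc surveyselect data=neuralgia out=sample1 samprate=0.25 seed=12345;
run;
/* Fit on the sample and save the parameter estimates with OUTEST=. */
proc logistic data=sample1 outest=demo1;
class Treatment ;
model Pain= Treatment Age / expb ;
run;
/* Refit on the full data, using the saved estimates as starting values via INEST=. */
proc logistic data=neuralgia inest=demo1;
class Treatment ;
model Pain= Treatment Age / expb ;
run;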
Thanks. I was able to find a SAS/STAT example, which I included in my other post. I don't have access to SAS/STAT.
I have access to SAS Enterprise Miner 14.2 (OnDemand), SAS Enterprise Guide (which I haven't installed), and SAS Studio.
I attempted to export the Logistic Regression node as a model, so I could edit the code to add 'inest' and then import the code as a new model. However, the code is in an entirely different format than the SAS/STAT examples, and there isn't even a call to reg, dmreg, etc.
My other post describes my attempts to drop the Train and Validate tables from the first regression node, and to assign the parameter estimates, which exist, in such a way as the next node uses them.
I just need an example where someone uses the parameter estimates in the next node, so I can adapt that to my variable/table names. All of my Google searches for the elements such an example would contain have come up empty.
SAS Studio on Academics on Demand includes SAS/STAT. It's built in, not an additional installation.
In SAS Studio, SAS Programmer is greyed out. I can select "New SAS Program."
In our previous course we learned to use SAS EM. If SAS/STAT is available, I would have much preferred to learn that. (My undergrad degree is in Computer Science.)
If SAS/STAT is available, learning how to completely rebuild my project using that is outside of the scope of this project.
Not ideal at all, but consider building the model in SAS/STAT similar to the example I provided and then you can use the variables and such to build your model in EM. I’m assuming you’re using some sort of selection method as well, so if the variables are not significant in Base you can remove them in EM which reduces the number of parameters and then it will run. That’s an approach that we can definitely help you complete today. I’m assuming you’ve tried HPLOGISTIC as well.
Within SAS Studio, look at the Tasks and see if there’s a Logistic Regression option.
I agree that moving everything to one tool would be too much work at this point in time.
Given the parameters you’ve stated, you’re running a multinomial or ordinal logistic regression model, but you have only 70,000 obs. Assuming that your categorical variables don't have something like 50 levels each, it should be fine, so I’m surprised that it’s not completing.
Thanks for the offer. At this point, unless I find a way to just hook it up (with a prior regression node) to speed up training, logistic regression is simply being beaten by every other modeling technique: Decision Trees, HP Trees, HP Forest, and Neural Networks - techniques that all quickly build models.
Hey Mike,
Not sure if I am reading your question right. But I think that you are asking about training a model with a few observations and using that model to score a larger data set.
There are a couple of ways to do this, and it all depends on whether you have one big data set that you want to split into Train/Validate/Test or you are going to import 2 data sets (one to be split into Training/Validation and the other to be scored by a Score node).
From your comment about "Simply hooking the nodes up doesn't work - where the full regression node has two inputs" it sounds like you are trying to do this with 2 data sets. To do this, use your smaller data set to train/validate your model, and then use a Score node to score your larger data set. You will find quick examples of how to do this in the Score node section.
Give it a try, and if you get into any trouble, add a screenshot of your diagram to make it easier to help you out.
Good luck!
>> But I think that you are asking about training a model with a few observations and using that model to score a larger data set.
No.
I am trying to implement the following usage note:
"Usage Note 22607: Preventing excessive time or memory use by PROC LOGISTIC"
http://support.sas.com/kb/22/607.html
This usage note says to train a logistic regression model using a small stratified sample of the data, and then to pass the parameter estimates to a regression model that uses the full data set, so it continues training from that point. The idea is that by starting closer to the eventual estimate results, the logistic regression model using the full data set will be able to complete in a reasonable amount of time.
The parameter estimate table is listed as an output from the first regression table. I just need to get the 2nd logistic regression node to use it, and to use the training/validation data from the main data source node. (In SAS/STAT this appears to be simple, you use outest = x on the first regression model, and then inest = x on the second one.)
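For reference, here is a minimal sketch of that OUTEST=/INEST= pattern for a multi-level target in SAS/STAT. The table work.train and the variables target, x1, and x2 are hypothetical placeholders, not names from my project. LINK=GLOGIT assumes the target is nominal, and the 10% sampling rate and the seed are arbitrary choices.

```sas
/* Hypothetical names: work.train, target, x1, x2 are placeholders. */

/* SURVEYSELECT expects the input sorted by the STRATA variable. */
proc sort data=work.train out=work.train_sorted;
  by target;
run;

/* 10% stratified sample, so each target level keeps its share. */
proc surveyselect data=work.train_sorted out=work.small
                  method=srs samprate=0.10 seed=12345;
  strata target;
run;

/* Step 1: fit on the small sample and save the estimates. */
proc logistic data=work.small outest=work.parms;
  class x1 x2 / param=ref;
  model target = x1 x2 x1*x2 / link=glogit;
run;

/* Step 2: refit on the full data, starting from those estimates. */
proc logistic data=work.train inest=work.parms;
  class x1 x2 / param=ref;
  model target = x1 x2 x1*x2 / link=glogit;
run;
```

One caveat I would expect: if a class level (or a combination of levels in the interaction) is missing from the sample, the parameters saved in work.parms will not line up with the parameters of the full model, so the sample has to be large enough to cover every level.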
Mike90,
The technique you are referring to seeks to address a difficulty with LOGISTIC that used to occur when you had excessively large data sets and limited computing power. The goal of the technique is to simplify the input data so that you are running the model against only selected key variables. This simplification might make it possible to run LOGISTIC in situations where running it on the full data set is too time consuming. If you were using SAS/STAT though, you would use HPLOGISTIC or HPGENSELECT rather than using that technique since these new procedures are multithreaded which can allow for faster performance, particularly when run in a distributed mode on an appliance that consists of a cluster of nodes. Faster performance might also be obtained on a single machine with several processors.
SAS Enterprise Miner is designed with large data sets in mind, and therefore it provides much more elegant ways to do variable selection (Variable Selection node, Tree node) while limiting some of the classical statistical output, which streamlines the modeling process so that you don't encounter the issue described above. SAS Enterprise Miner also includes High Performance modeling nodes, including HP Variable Selection, HP Tree, HP GLM and HP Regression, which can take advantage of distributed computing environments when available. If you are encountering any time issues with SAS Enterprise Miner, you can likely overcome them without using this somewhat dated approach to a problem that has largely been overcome by the advanced modeling procedures underlying SAS Enterprise Miner.
Are you encountering performance issues with your data using SAS Enterprise Miner? Have you tried using the Variable Selection node or one of the High Performance nodes? These are more modern approaches to handling this issue.
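For anyone working in SAS/STAT rather than Enterprise Miner, a rough sketch of the HPLOGISTIC alternative mentioned above might look like the following. The table work.train and the variables target, x1, and x2 are hypothetical placeholders, the thread count is an arbitrary assumption, and forward selection is just one possible choice of selection method.

```sas
/* Hypothetical names: work.train, target, x1, x2 are placeholders. */
proc hplogistic data=work.train;
  class x1 x2;
  /* Generalized logit link for a nominal multi-level target. */
  model target = x1 x2 x1*x2 / link=glogit;
  /* Let the procedure drop effects that do not help. */
  selection method=forward;
  /* Request several threads; 4 is an arbitrary value. */
  performance nthreads=4 details;
run;
```

The DETAILS option on the PERFORMANCE statement reports where the time goes, which can help show whether the fit itself or something else is the bottleneck.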
Cordially,
Doug
You are saying something very different from what the usage note says. You are saying the model is trained on a small set of data, and that model is the one used. The usage note says the technique shortens the training time using the full data set.
Does Variable Selection, set for two-way interactions, accomplish the same thing that would be accomplished with logistic regression being set to two-way?
I have selected the relevant variables. A couple of the class inputs have a large number of levels (200 and 60), which I have greatly reduced just for the logistic regression models, but they still don't build. HP Trees, HP Forest, and Neural Networks all build very quickly, without reducing those levels. I understand that the logistic regression technique gets exponentially larger with class variables with a high number of levels, as opposed to the other techniques. That is why I was trying to "start closer" to the final parameters by implementing that usage note.
Anyway, I have moved on. All of the other modeling techniques are simply beating logistic regression, the way I can run it. (I've already used the Variable Selection node in front of it, but I still have to greatly reduce the number of levels of those two inputs for it to complete, and setting two-way interactions in the Regression node is a no-go.)
>> You are saying something very different from what the usage note says. You are saying the model is trained on a small set of data, and that model is the one used. The usage note says the technique shortens the training time using the full data set.
I did not mean to imply that the model is built on the small set of data (meaning fewer observations). I am saying that by using the Variable Selection node you can identify potentially helpful main effects and (optionally) interactions, since they will automatically be set to the role of Input in a subsequent modeling node, while less useful effects will be set to Rejected and not considered. Modeling nodes in SAS Enterprise Miner do a great deal more than fit the logistic model, including scoring the input data and calculating assessment measures, but this extra work is not done efficiently when a lot of extra, potentially useless effects are included. The Variable Selection node is not saddled with assessment and scoring functionality, so it runs much faster. Doing initial variable selection in a Variable Selection node allows the modeling to be done on a "smaller" data set with regard to the number of variables considered, which greatly speeds up the process. There is not a way to pass initial parameters to the Regression node.
>> Does Variable Selection, set for two-way interactions, accomplish the same thing that would be accomplished with logistic regression being set to two-way?
Think of it as a different approach to get at essentially the same information. For categorical effects, the interaction between an m-level variable and an n-level variable could be evaluated as a single variable with m times n levels (one for each combination). If m=3 and n=2, you could create a 6-level variable which accounts for all combinations of the two variables. This is evaluated in the same way as a main effect to assess importance. It is done in a univariate way initially to rapidly remove all useless combinations from consideration, so it runs much faster than any stepwise method. It is still helpful to do some stepwise selection in a subsequent modeling node when using a Variable Selection node to avoid overfitting, since not all variables will add information over and above the others.
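To make the "single variable with m times n levels" idea concrete, here is a small sketch in a DATA step. The table work.train and the variables x1 and x2 are hypothetical placeholders, and this only illustrates the idea; it is not what the Variable Selection node runs internally.

```sas
/* Hypothetical names: work.train, x1, x2 are placeholders. */
data work.train_combo;
  set work.train;
  length x1_x2 $ 64;
  /* One combined level per observed (x1, x2) pair, e.g. "A_F". */
  x1_x2 = catx('_', x1, x2);
run;

/* NLEVELS reports how many distinct combined levels occur. */
proc freq data=work.train_combo nlevels;
  tables x1_x2 / noprint;
run;
```

With a 200-level and a 60-level input, the combined variable could have up to 12,000 levels, which gives a sense of how quickly interaction terms grow.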
>> HP Trees, HP Forest, and Neural Networks all build very quickly, without reducing those levels.
Tree based methods (HP Tree, HP Forest) handle interactions automatically rather than by fitting parameters. Logistic regression models can suffer from quasi-separation which is common in data mining problems since in order to fit a full logistic model, you must have events and non-events for every level of every variable. If you have interactions, then you must have sufficient events for every level of each combination of variables included as an interaction in the model. With data mining problems, there is typically a lot of data but it is not necessarily distributed so conveniently. Variables with a large number of levels are much more likely to encounter this problem, especially if they are brought in as interactions as well.
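One quick way to look for the empty cells behind quasi-separation is to tabulate the interaction against the target and list the combinations with no observations. This is a sketch only; the table work.train and the variables target, x1, and x2 are hypothetical placeholders.

```sas
/* Hypothetical names: work.train, target, x1, x2 are placeholders. */
proc freq data=work.train noprint;
  /* SPARSE keeps zero-count combinations in the OUT= data set. */
  tables x1*x2*target / sparse out=work.cells;
run;

/* Combinations of x1, x2 and target that never occur in the data. */
proc print data=work.cells;
  where count = 0;
run;
```

A long list here suggests that the interaction cannot be estimated for many level combinations, whatever the starting values are.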
You should also note that adding interactions to a regression model is an attempt to model the structure of the data. The regression model is not flexible like Neural Network models so if you don't have the precise structure of the model represented, the model will perform poorly. Neural Network models can adapt to complex surfaces without the need to create interactions due to how many times each variable is transformed and used. It is still advisable to consider variable selection prior to Neural Network models since Neural Network models have far more parameters required than the typical regression model.
Hope this helps,
Doug
>> There is not a way to pass initial parameters to the Regression node.
So you can't do this with SAS EM. That doesn't make sense to me.
proc logistic data=mydata outest=parms; ... run; ...
... proc logistic data=mydata inest=parms; ... run;