Dear All,
I would like to run hundreds of logistic regressions. My data include a dependent variable and hundreds of quantitative variables whose names have the form "_" followed by random numbers. It looks like this:
Dependent variable _123 _234 _341 _234 .........
0 23 12 0 1
1 45 48 9 12
1 0 23 6 23
0 89 12 34 7
1 .......
0
0
1
0
So I would like to run hundreds of logistic regressions, one for each of the _XXX variables. It might look something like this, but I'm not sure how to put it in a macro:
proc logistic; model dependent variable=_XXX; run;
Appreciate any advice! Thank you so much!!
Here's one approach. It uses every variable whose name begins with "_" as an independent variable, one regression per variable:
proc contents data=have noprint out=varnames (keep=name);
run;
data _null_;
set varnames;
where name =: '_';
call execute ('proc logistic data=have; model DEPENDENT = ' || name || '; run;');
run;
You'll need to replace the word DEPENDENT with the actual name of your dependent variable.
Don't. Transpose your data and use BY group processing instead. Then all your results can be captured into a single data set as well.
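The transpose-and-BY idea can be sketched roughly as follows. This is only a minimal sketch, assuming the data set is named HAVE, the response is a numeric 0/1 variable named DEPENDENT, and every predictor is numeric with a name that begins with "_". The names HAVE_ID, LONG, OBS_ID, VARNAME, VALUE and ESTIMATES are made up for illustration.

```sas
/* Add a row identifier so each original observation is transposed on its own. */
data have_id;
  set have;
  obs_id = _n_;
run;

/* Wide to long: one row per observation and predictor.                    */
/* Assumes all "_" variables are numeric, so COL1 comes out numeric too.   */
proc transpose data=have_id
               out=long (rename=(_name_=varname col1=value));
  by obs_id dependent;
  var _: ;
run;

/* BY-group processing needs the data sorted by the BY variable. */
proc sort data=long;
  by varname;
run;

/* One logistic regression per predictor; all estimates land in one data set. */
proc logistic data=long;
  by varname;
  model dependent (event='1') = value;
  ods output ParameterEstimates=estimates;
run;
```

With hundreds of predictors, the ESTIMATES data set is usually easier to work with than the printed output, since it holds one block of rows per VARNAME.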
Hi @huhuhu!
You can easily make a macro for what you are talking about, but if your variables are not numbered sequentially you will have to generate a macro call for each one, like this:
data example;
input dependent_var _123 _234 _341;
datalines;
0 23 12 0
1 45 48 9
1 0 23 6
0 89 12 34
;
run;
%macro regression(var);
proc logistic data=example;
model dependent_var = _&var;
run;
%mend;
%regression(123);
%regression(234);
%regression(341);
However, if your variables were numbered sequentially, for example from 1 to 100, you could generate the regression code like this:
%macro regression(numVars);
%local i;
%do i = 1 %to &numVars;
proc logistic data=example;
model dependent_var = _&i;
run;
%end;
%mend;
%regression(100);
Hope that helps!
Thank you for your suggestion. Unfortunately, my variables are not numbered sequentially, and writing the calls one by one is too time consuming since I have hundreds of them. Do you know how to rename my variables to a sequence? I would really appreciate it!
In that case, I think it might be helpful to use @Astounding's suggestion. 🙂
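If a macro loop is still preferred, the variables do not have to be renamed. One possible sketch is to read the names from the DICTIONARY.COLUMNS table into a macro variable and loop over that list. It assumes the data set is WORK.HAVE and the response is named DEPENDENT; the macro name RUN_ALL and the macro variable VARLIST are hypothetical.

```sas
/* Collect the predictor names into one space-separated macro variable. */
/* LIBNAME and MEMNAME are stored in uppercase in DICTIONARY.COLUMNS.   */
proc sql noprint;
  select name into :varlist separated by ' '
  from dictionary.columns
  where libname = 'WORK'
    and memname = 'HAVE'
    and substr(name, 1, 1) = '_';
quit;

%macro run_all;
  %local i var;
  /* Loop over the list by position, so the names need not be sequential. */
  %do i = 1 %to %sysfunc(countw(&varlist, %str( )));
    %let var = %scan(&varlist, &i, %str( ));
    proc logistic data=have;
      model dependent = &var;
    run;
  %end;
%mend run_all;

%run_all
```

This generates the same PROC LOGISTIC steps as the CALL EXECUTE approach. It is a matter of taste which one is easier to read.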
I consider the approach of performing many logistic regressions and picking the best to be extremely suboptimal (and that's ignoring the programming issues stated here) to the point you could easily be misled by the results. Any time you have hundreds of variables, they will be correlated with one another, and this causes logistic regression (and ordinary least squares regression) to provide parameter estimates and predicted values that have HUGE variances, to the point where you can get the wrong sign on a model parameter estimate.
A better approach is to use a modeling method that performs better in the presence of large numbers of correlated variables. That method is called Partial Least Squares regression — in SAS, it is PROC PLS. This method produces a model which is less susceptible to correlation between the variables, and it produces model coefficients and predicted values with much smaller root mean square errors than regression or logistic regression.
Thank you for your reply. Unfortunately, I would like to have many separate regressions, one for each of the _XXX variables.
Since I have hundreds of _XXX variables, I will have hundreds of logistic regressions. Do you have some experience with it?
Really appreciate your help!
I would like to have many separate regressions, one for each of the _XXX variables.
PROC PLS makes this unnecessary. It is one model that has ALL of your input variables; the variables that are not predictive of your response will get very low weights, and PLS still produces models with the lower mean square error of parameter estimates that I mentioned above.
And, it's a bazillion (that's a technical term) times easier than doing hundreds of logistic regressions.
Can PROC PLS do logistic regression? Can you show me an example?
You use a binary response variable.
How do I define which level to model? For example:
model sex(event='F')= .....
Also, logistic regression uses MLE, but PLS uses OLS.
Can OLS be applied to logistic regression?
PLS is not using OLS. It is using Partial Least Squares, a completely different algorithm.
The response variable takes on values 0 or 1.
Logistic regression is a modeling method that uses continuous x-variables to predict binary (or multinomial) responses. PLS with binary responses is a modeling method that uses continuous x-variables to predict binary (or multinomial) responses. In that respect, they are identical. Under the hood, however, they are different algorithms and will not produce the same answers. PLS is less susceptible to the problem of collinearity among the x-variables, and so will produce models that fit better (lower mean square error of regression coefficients and lower mean square error of predicted values).
But I can't find any example in the PROC PLS documentation.
Can you show an example of doing logistic regression with it?
Take SASHELP.CLASS as an example, where I want to model sex='M'.
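A rough sketch of what that might look like with SASHELP.CLASS is below. It is not taken from the PROC PLS documentation, and it rests on several assumptions: the response is recoded as a numeric 0/1 indicator (here called MALE, a made-up name), two factors are extracted (NFAC=2 is an arbitrary choice), and 0.5 is used as a cutoff. As far as I know there is no EVENT= option as in PROC LOGISTIC, so the level being modeled is simply whichever one is coded as 1. The predicted values are not probabilities and can fall outside the range 0 to 1.

```sas
/* PROC PLS needs a numeric response, so code sex as a 0/1 indicator. */
data class;
  set sashelp.class;
  male = (sex = 'M');
run;

/* Fit PLS with the indicator as the response and save the predictions. */
proc pls data=class method=pls nfac=2;
  model male = age height weight / solution;
  output out=pred predicted=p_male;
run;

/* Hypothetical classification rule: predict 'M' when the value is at least 0.5. */
data pred;
  set pred;
  length predicted_sex $1;
  if p_male >= 0.5 then predicted_sex = 'M';
  else predicted_sex = 'F';
run;
```

Comparing PREDICTED_SEX with SEX in the PRED data set gives a quick check of how well the cutoff separates the two groups.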