huhuhu
Obsidian | Level 7

Dear All,

 

I would like to generate hundreds of logistic regressions. My data includes a dependent variable and hundreds of quantitative variables whose names have the format "_" + random numbers. It looks like this:

 

Dependent variable   _123   _234   _341   _234   .........
0                    23     12     0      1
1                    45     48     9      12
1                    0      23     6      23
0                    89     12     34     7
1                    .......
0
0
1
0

 

So I would like to run hundreds of logistic regressions, one for each of the _XXX variables, maybe something like this, but I'm not sure how to put it in a macro:

 

proc logistic;
      model dependent variable=_XXX;
   run;

Appreciate any advice! Thank you so much!!

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Here's one approach.  It relies on using any and all variable names that begin with "_" as an independent variable:

 

proc contents data=have noprint out=varnames (keep=name);
run;

data _null_;
   set varnames;
   where name =: '_';
   call execute ('proc logistic data=have; model DEPENDENT = ' || name || '; run;');
run;

 

You'll need to replace the word DEPENDENT with the actual name of your dependent variable.

Reeza
Super User

Don't. Transpose your data and use BY group processing instead. Then all your results can be captured into a single data set as well.

 

https://communities.sas.com/t5/SAS-Communities-Library/How-do-I-write-a-macro-to-run-multiple-regres...
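A minimal sketch of the transpose-and-BY idea, assuming the data set is called HAVE and the response is a 0/1 numeric variable named DEPENDENT. Both names are placeholders, and ROW_ID, VARNAME, VALUE, LONG and ESTIMATES are names made up for this illustration:

```sas
/* Add a row identifier so every original observation can be tracked */
data have_id;
   set have;
   row_id = _n_;
run;

/* Long format: one row per observation per _XXX predictor */
proc transpose data=have_id
               out=long (rename=(_name_=varname col1=value));
   by row_id dependent;
   var _: ;                 /* every variable whose name starts with "_" */
run;

/* BY-group processing needs the data sorted by the BY variable */
proc sort data=long;
   by varname row_id;
run;

/* One logistic regression per predictor; estimates land in one data set */
ods output ParameterEstimates=estimates;
proc logistic data=long;
   by varname;
   model dependent(event='1') = value;
run;
```

With this layout, ESTIMATES should hold one intercept row and one slope row per predictor, identified by VARNAME, so the hundreds of fits can be compared without reading hundreds of listings.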

OliviaWright
SAS Employee

Hi @huhuhu!

 

You can easily make a macro for what you are talking about, but if your variables are not numbered sequentially you will have to generate a macro call for each one, like this: 

 

data example;
input dependent_var _123 _234 _341;
datalines;
0 23 12 0
1 45 48 9
1 0 23 6
0 89 12 34
;
run;

%macro regression(var);
proc logistic data=example;
 model dependent_var = _&var;
run;
%mend;

%regression(123);
%regression(234);
%regression(341);

However, if your variables were numbered sequentially, for example from 1 to 100, you could generate the regression code like this:

%macro regression(numVars);
%local i;
%do i = 1 %to &numVars;
   /* one logistic regression per variable _1, _2, ..., _&numVars */
   proc logistic data=example;
      model dependent_var = _&i;
   run;
%end;
%mend;
%regression(100);

Hope that helps!

huhuhu
Obsidian | Level 7

Thank you for your suggestion. Unfortunately, my variables are not numbered sequentially, and it is too time-consuming to write the calls one by one since I have hundreds of them. Do you know how to rename my variables so that they are numbered sequentially? I would really appreciate it!

OliviaWright
SAS Employee

In that case, I think it might be helpful to use @Astounding's suggestion. 🙂
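If you would rather keep a macro loop, renaming is not strictly necessary: the variable names can be read from the dictionary tables and looped over as they are. This is only a sketch, assuming the data set is WORK.HAVE and the response is a 0/1 variable named DEPENDENT (placeholder names); VARLIST, NVARS and RUN_ALL are names invented for this example:

```sas
/* Collect every variable name in WORK.HAVE that starts with "_" */
proc sql noprint;
   select name into :varlist separated by ' '
   from dictionary.columns
   where libname = 'WORK' and memname = 'HAVE'
     and substr(name, 1, 1) = '_';
quit;
%let nvars = &sqlobs;        /* number of names selected above */

%macro run_all;
   %local i var;
   %do i = 1 %to &nvars;
      /* pick the i-th name from the space-separated list */
      %let var = %scan(&varlist, &i, %str( ));
      proc logistic data=have;
         model dependent(event='1') = &var;
      run;
   %end;
%mend run_all;
%run_all
```

LIBNAME and MEMNAME are stored in uppercase in DICTIONARY.COLUMNS, which is why 'WORK' and 'HAVE' are written that way.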

PaigeMiller
Diamond | Level 26

I consider the approach of performing many logistic regressions and picking the best to be extremely suboptimal (and that's ignoring the programming issues stated here) to the point you could easily be misled by the results. Any time you have hundreds of variables, they will be correlated with one another, and this causes logistic regression (and ordinary least squares regression) to provide parameter estimates and predicted values that have HUGE variances, to the point where you can get the wrong sign on a model parameter estimate.

 

A better approach is to use a modeling method that performs better in the presence of large numbers of correlated variables. That method is called Partial Least Squares regression — in SAS, it is PROC PLS. This method produces a model which is less susceptible to correlation between the variables, and it produces model coefficients and predicted values with much smaller root mean square errors than regression or logistic regression.

--
Paige Miller
huhuhu
Obsidian | Level 7

Thank you for your reply. Unfortunately, I would like to have many separate regressions, one generated for each of the _XXX variables.

Since I have hundreds of _XXX variables, I will have hundreds of logistic regressions. Do you have some experience with it?

Really appreciate your help!

PaigeMiller
Diamond | Level 26

I would like to have many separate regressions that are generated with each of the _XXX variables.

 

PROC PLS makes this unnecessary. It is one model that has ALL of your input variables; the variables that are not predictive of your response will get very low weights, and PLS still produces models with the lower mean square error of parameter estimates that I mentioned above.

 

And, it's a bazillion (that's a technical term) times easier than doing hundreds of logistic regressions.

--
Paige Miller
Ksharp
Super User

@PaigeMiller

Can PROC PLS do logistic regression? Can you show me an example?

PaigeMiller
Diamond | Level 26

You use a binary response variable.

--
Paige Miller
Ksharp
Super User

How do you define which level to model? Like:

 

model  sex(event='F')= .....

Ksharp
Super User

And logistic regression uses MLE, but PLS uses OLS.

Can OLS be applied to logistic regression?

PaigeMiller
Diamond | Level 26

PLS is not using OLS. It is using Partial Least Squares, a completely different algorithm.

 

The response variable takes on values 0 or 1. 

 

Logistic regression is a modeling method that uses continuous x-variables to predict binary (or multinomial) responses. PLS with binary responses is a modeling method that uses continuous x-variables to predict binary (or multinomial) responses. So far, they are identical. Under the hood, however, they are different algorithms and will not produce the same answers. PLS is less susceptible to the problem of collinearity among the x-variables, and so will produce models that fit better (lower mean square error of regression coefficients and lower mean square error of predicted values).

--
Paige Miller
Ksharp
Super User

But I can't find any example in the PROC PLS documentation.

Can you show an example of doing logistic regression with it?

Take SASHELP.CLASS as an example: what if I want to model sex='M'?
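For reference, a sketch of what "use a binary response variable" could look like with SASHELP.CLASS. This was not posted in the thread. It assumes the level to model is chosen by how the 0/1 indicator is coded, since PROC PLS, as far as I know, has no EVENT= option; CLASS01, MALE, SCORED and P_MALE are made-up names:

```sas
/* Code the level of interest (sex = 'M') as 1 and the other level as 0 */
data class01;
   set sashelp.class;
   male = (sex = 'M');
run;

/* PLS with a numeric 0/1 response and continuous predictors */
proc pls data=class01 method=pls nfac=2;
   model male = age height weight;
   output out=scored predicted=p_male;
run;
```

PROC PLS treats MALE as an ordinary numeric response, so the values in P_MALE are not constrained to lie between 0 and 1 and are not fitted probabilities in the logistic sense.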

Discussion stats
  • 22 replies
  • 5275 views
  • 11 likes
  • 7 in conversation