I have 2 datasets; original
ID TARGET myVAR1 myVAR2 myVAR3 myVAR4 ... myVAR10000
1 yyy xx xx xx xx xx
2 yyy xx xx xx xx xx
3 yyy xx xx xx xx xx
and summary:
varNames NLevels NMissing
myVAR1 100 5
myVAR2 50 3
myVAR3 10 9
.
.
.
myVAR10000 5 0
I'm trying to run thousands of univariate regression by PROC GLM, but all of my predictors are categorical variables and my TARGET variable is numeric and TARGET is in the original dataset only, and not in 'summary.xls'. I cannot add it to my array, because array should contain the variables of the same type. I don't know how to include my TARGET in the proc glm, so I'm able to run thousands regressions.
The code that I'm able to try so far is:
* select desired variables from summary table;
proc import out=vars datafile = 'tables/summary.xls' DBMS = xls replace;
GETNAMES = YES;
run;
proc sql noprint;
select varNames, NLevels into :VarList separated by ' ', :nlevel
from vars
where NLevels > 2;
quit;
/* 1. transpose from wide (Y, X1 ,...,X1000) to long (varNum VarName Y Value) */ data Long; set original; array x [*] &VarList; do varNum = 1 to dim(x); VarName = vname(x[varNum]); /* variable name in char var */ Value = x[varNum]; /* value for each variable for each obs */ output; end; drop x:; run;
/* 2. Sort by BY-group variable */ proc sort data=Long; by VarName; run;
/* 3. Call PROC REG and use BY statement to compute all regressions */
ods graphics off;
ods exclude all; proc GLM data=Long;
CLASS Value; by VarName; model TARGET = Value;
ods output ParameterEstimates = PEOut; quit;
ods exclude none; /* Look at the results */ proc print data=PE(obs=5); var VarName Intercept Value; run;
Desired output:
obs VarName Intercept Value R2
1 myVAR1 xxx xxx xx
2 myVAR2 xxx xxx xx
3 myVAR3 xxx xxx xx
4 myVAR4 xxx xxx xx
5 myVAR5 xxx xxx xx
... .... .... .... ...
I'm very new to SAS. Your help would be greatly appreciated!
Well, it looks like the data is correctly shaped (at least to me). The error in the PROC GLM call arises from including outest=PE at this point. This looks like a relic from PROC REG code. In GLM, so far as I can tell, you have to use an ODS OUTPUT statement. To get parameter estimates and Rsquared, you would need this
ODS OUTPUT ParameterEstimates=ParameterEstimates FitStatistics=FitStatistics.
To get the estimates and Rsquared into the same dataset, you will have to do some shaping, and then a many-to-one merge or sql join, as you will have a lot of lines of parameter estimates for each by variable, but only one line in FitStatistics with the Rsquared value.
SteveDenham
The overall programming approach is sound. But since variable value in your long dataset is categorical, you will get an observation for every level of value in the ParameterEstimates output dataset. If what you are interested in is the RSquare, you should request the FitStatistics table in the ODS OUTPUT statement. You might also want to look at the ModelANOVA and OverallANOVA output tables.
On the statistics side, if you are going to fit that many ANOVAs, you should worry about spurious correlations. Unless you have a huge number of observations, there is almost certainly a predictor that will yield a fabulous fit, just by chance.
My main concern is how to include TARGET variable into the code which is numeric.
What problem did you encounter? Please show the log for one or two variables.
Would following your transpose with a merge on VarName solve the issue of getting the response variable into the analysis dataset?
There are some other considerations. The dataset PEout will only contain the values of the parameter estimates. If you want Rsquared values you will also have to output FitStatistics. These two datasets will have to be merged on the BY variable to get what you want for your PROC PRINT.
Also, consider looking at information criteria (AIC, corrected AIC) as a measure of fit. A larger number of levels for the categorical variable is going to artificially inflate the R-squared values. To get these, you would have to run the regressions in a likelihood based program (GENMOD, MIXED, etc.), but they would give you some idea of how much information is preserved from the raw data in the modeled results. Just a thought.
SteveDenham
Thank you @SteveDenham for your response. I don't know how to merge response variable with VarName in Long dataset.
@PGStats my problem is that, I don't know how to include the response variable (TARGET) from original table. @SteveDenham suggested to merge TARGET on VarName in Long table, which not sure how to implement it in SAS.
You shouldn't have to do the merge. Your transposition code should be keeping Target (since it is not in the DROP statement). It may be that the code needs some refinement. Please do the following: Given the first 3 records of original, what does the output of Long look like, following your code? Show the input (suitably truncated to 3 or 4 myVar's since 10K myVar's values would give 50K lines in Long) and the first 10 lines of Long, as well as the log. From that we should be able to come up with the refinements you need.
SteveDenham
I'm trying to run thousands of univariate regression by PROC GLM
Taking a step back, I know you want code to do this, but I'm wondering if there really is a good reason to run thousands of univariate regressions. What is the benefit? What will you do with them once you have them?
If your x-variable is categorical and your y-variable is continuous, maybe you could use PROC SUMMARY to get the means and standard deviations to see if the group means are different (provided you don't run out of memory), and this will give you the ability to do one more computation to compare the means. This ought to be the equivalent to your univariate regressions. And much simpler to program. But again, why?
Excellent point @PaigeMiller . As @PGStats mentioned, some of these regressions are going to be highly significant, if it were done regressing a continuous variable on a continuous independent variable and the independent variable was simply a random sample of numbers. Now consider that with categorical variables, there are N-1 regression coefficients for a categorical variable with N levels. Without ever doing the regression, I can say that the analyses with N large (and it looks like there are some where N>100) are going to give a higher Rsquared than analyses with a small number of levels. For example, look at the difference in Rsquared for a simple regression, a regression for the same data but on a 2 level class variable, and a regression for the same data on a 60 level class variable:
data two;
call streaminit(23062020);
do i = 1 to 100;
x = rand('uniform');
y = 2 + 3*x + 0.1*rand('uniform');
if x>0.5 then group=1;
else group=2;
grp3 = floor(100*x);
output;
end;
run;
proc glm data=two;
model y= x/solution;
quit;
run;
proc glm data=two;
class group;
model y=group/solution ;
quit;
run;
proc glm data=two;
class grp3;
model y=grp3/solution ;
quit;
run;
The Rsquared values for the three scenarios are 0.9988, 0.6967 and 0.9995. So just by making the grouping more granular, I greatly increased Rsquared, to the point it is indistinguishable from the continuous case.
OK - that is a long way around Robin Hood's barn to saying (as did @PaigeMiller ) - why? What is the benefit?
SteveDenham
@SteveDenham I ran my code on 3 variables over 250k random obs:
PROC IMPORT OUT= vars DATAFILE= './summary.xls'
DBMS=xls REPLACE;
GETNAMES=YES;
RUN;
/* Use PROC SQL to create a macro variable (MissingVarList) that contains
the list of variables that have a property such as missing values */
RSUBMIT;
proc sql noprint;
select Variable,NLevels into :VarList separated by ' ', :nlevel
from vars
where NLevels = 3;
quit;
%put &=VarList;
/* 1. transpose;
data Long;
set proj.original; /* <== specify data set name HERE */
array x [*] &VarList; /* <== specify explanatory variables HERE */
do varNum = 1 to 3;
VarName = vname(x[varNum]); /* variable name in char var */
Value = x[varNum]; /* value for each variable for each obs */
output;
end;
drop &VarList;
run;
/* 2. Sort by BY-group variable */
proc sort data=Long; by VarName; run;
ods graphics off;
ods exclude all;
proc glm data=Long outest=PE;
CLASS Value;
by VarName;
model target = Value / RSQUARE;
quit;
ods exclude none;
/* Look at the results */
proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
run;quit;
here's my log file after transpose (summary importing is working correctly):
31 /* 1. transpose from wide (Y, X1 ,...,X10000) to long (varNum VarName Y Value) */
54 data Long;
55 set proj.original; /* <== specify data set name HERE */
56 array x [*] &VarList; /* <== specify explanatory variables HERE */
57 do varNum = 1 to 3;
58 VarName = vname(x[varNum]); /* variable name in char var */
59 Value = x[varNum]; /* value for each variable for each obs */
60 output;
61 end;
65 drop &VarList;
66 run;
NOTE: There were 250000 observations read from the data set PROJ.ORIGINAL.
NOTE: The data set WORK.LONG has 750000 observations and 10027 variables.
NOTE: DATA statement used (Total process time):
real time 17.61 seconds
cpu time 17.61 seconds
68
69 /* 2. Sort by BY-group variable */
70 proc sort data=Long; by VarName; run;
NOTE: There were 750000 observations read from the data set WORK.LONG.
NOTE: The data set WORK.LONG has 750000 observations and 10027 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 38.00 seconds
cpu time 59.61 seconds
71
72
73 ods graphics off;
74 ods exclude all;
75 proc glm data=Long outest=PE;
------
22
76
ERROR 22-322: Syntax error, expecting one of the following: ;, (, ALPHA, DATA, MANOVA,
MULTIPASS, NAMELEN, NOPRINT, ORDER, OUTSTAT, PLOTS.
ERROR 76-322: Syntax error, statement will be ignored.
76 CLASS Value;
77 by VarName;
78 model target = Value / RSQUARE;
-------
22
202
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
ERROR 22-322: Syntax error, expecting one of the following: ;, ALIASING, ALPHA, CLI, CLM,
CLPARM, COVBYCLASS, E, E1, E2, E3, E4, EST, I, INTERCEPT, INVERSE, NOINT, NOUNI,
P, PREDICTED, SINGULAR, SOLUTION, SS1, SS2, SS3, SS4, TOLERANCE, X, XPX, ZETA.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
79 quit;
NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
80 ods exclude none;
81
91
92 /* Look at the results */
93 proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
ERROR: File WORK.PE.DATA does not exist.
94 run;
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
94 ! quit;
my Long dataset looks like this:
ID TARGET VAR1 ... VAR10000 VarNum VarName Value
1 yyy xx xx 1 Var1 var1_level1
2 yyy xx xx 2 Var2 var2_level1
3 yyy xx xx 3 Var3 var3_level1
4 yyy xx xx 1 Var1 var1_level2
5 yyy xx xx 2 Var2 var2_level2
6 yyy xx xx 3 Var3 var3_level1
7 yyy xx xx 1 Var1 var1_level1
8 yyy xx xx 2 Var2 var2_level2
9 yyy xx xx 3 Var3 var3_level2
PROC IMPORT OUT= vars DATAFILE= './summary.xls'
DBMS=xls REPLACE;
GETNAMES=YES;
RUN;
/* Use PROC SQL to create a macro variable (MissingVarList) that contains
the list of variables that have a property such as missing values */
RSUBMIT;
proc sql noprint;
select Variable,NLevels into :VarList separated by ' ', :nlevel
from vars
where NLevels = 3;
quit;
%put &=VarList;
/* 1. transpose from wide (Y, X1 ,...,X10000) to long (varNum VarName Y Value) */
data Long;
set proj.original; /* <== specify data set name HERE */
array x [*] &VarList; /* <== specify explanatory variables HERE */
do varNum = 1 to 3;
VarName = vname(x[varNum]); /* variable name in char var */
Value = x[varNum]; /* value for each variable for each obs */
output;
end;
drop &VarList;
run;
/* 2. Sort by BY-group variable */
proc sort data=Long; by VarName; run;
ods graphics off;
ods exclude all;
proc glm data=Long outest=PE;
CLASS Value;
by VarName;
model target = Value / RSQUARE;
quit;
ods exclude none;
/* Look at the results */
proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
run;quit;
Well, it looks like the data is correctly shaped (at least to me). The error in the PROC GLM call arises from including outest=PE at this point. This looks like a relic from PROC REG code. In GLM, so far as I can tell, you have to use an ODS OUTPUT statement. To get parameter estimates and Rsquared, you would need this
ODS OUTPUT ParameterEstimates=ParameterEstimates FitStatistics=FitStatistics.
To get the estimates and Rsquared into the same dataset, you will have to do some shaping, and then a many-to-one merge or sql join, as you will have a lot of lines of parameter estimates for each by variable, but only one line in FitStatistics with the Rsquared value.
SteveDenham
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.