BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mh2t
Obsidian | Level 7

I have 2 datasets; original

 

ID             TARGET   myVAR1 myVAR2  myVAR3  myVAR4   ...    myVAR10000

1                   yyy            xx            xx                 xx          xx                       xx

2                   yyy            xx             xx                xx          xx                       xx

3                   yyy            xx             xx                xx          xx                       xx

 

and summary:

 

varNames             NLevels   NMissing 

myVAR1                   100            5            

myVAR2                    50             3             

myVAR3                    10             9             

.

.

.

myVAR10000               5            0 

 

I'm trying to run thousands of univariate regression by PROC GLM, but all of my predictors are categorical variables and my TARGET variable is numeric and TARGET is in the original dataset only, and not in 'summary.xls'. I cannot add it to my array, because array should contain the variables of the same type. I don't know how to include my TARGET in the proc glm, so I'm able to run thousands regressions.

 

The code that I'm able to try so far is:

 

* select desired variables from summary table;
proc import out=vars datafile = 'tables/summary.xls' DBMS = xls replace;
GETNAMES = YES;
run;

proc sql noprint;
select varNames, NLevels into :VarList separated by ' ', :nlevel
from vars
where NLevels > 2;
quit;

/* 1. transpose from wide (Y, X1 ,...,X1000) to long (varNum VarName Y Value) */ data Long; set original; array x [*] &VarList; do varNum = 1 to dim(x); VarName = vname(x[varNum]); /* variable name in char var */ Value = x[varNum]; /* value for each variable for each obs */ output; end; drop x:; run;

/* 2. Sort by BY-group variable */ proc sort data=Long; by VarName; run;

/* 3. Call PROC REG and use BY statement to compute all regressions */
ods graphics off;
ods exclude all; proc GLM data=Long;
CLASS Value; by VarName; model TARGET = Value;
ods output ParameterEstimates = PEOut; quit;
ods exclude none;   /* Look at the results */ proc print data=PE(obs=5); var VarName Intercept Value; run;

 

 Desired output:

obs       VarName   Intercept   Value   R2 

1            myVAR1        xxx        xxx      xx

2            myVAR2        xxx         xxx     xx

3            myVAR3        xxx         xxx     xx 

4            myVAR4        xxx         xxx     xx

5            myVAR5        xxx         xxx     xx

...               ....              ....         ....       ...

 

I'm very new to SAS. Your help would be greatly appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

Well, it looks like the data is correctly shaped (at least to me). The error in the PROC GLM call arises from including outest=PE at this point.  This looks like a relic from PROC REG code.  In GLM, so far as I can tell, you have to use an ODS OUTPUT statement. To get parameter estimates and Rsquared, you would need this

ODS OUTPUT ParameterEstimates=ParameterEstimates FitStatistics=FitStatistics.

 

To get the estimates and Rsquared into the same dataset, you will have to do some shaping, and then a many-to-one merge or sql join, as you will have a lot of lines of parameter estimates for each by variable, but only one line in FitStatistics with the Rsquared value.

 

SteveDenham

View solution in original post

12 REPLIES 12
PGStats
Opal | Level 21

The overall programming approach is sound. But since variable value in your long dataset is categorical, you will get an observation for every level of value in the ParameterEstimates output dataset. If what you are interested in is the RSquare, you should request the FitStatistics table in the ODS OUTPUT statement. You might also want to look at the ModelANOVA and OverallANOVA output tables.

 

On the statistics side, if you are going to fit that many ANOVAs, you should worry about spurious correlations. Unless you have a huge number of observations, there is almost certainly a predictor that will yield a fabulous fit, just by chance.

 

PG
mh2t
Obsidian | Level 7

My main concern is how to include TARGET variable into the code which is numeric.

PGStats
Opal | Level 21

What problem did you encounter? Please show the log for one or two variables.

PG
SteveDenham
Jade | Level 19

Would following your transpose with a merge on VarName solve the issue of getting the response variable into the analysis dataset?

 

There are some other considerations.  The dataset PEout will only contain the  values of the parameter estimates.  If you want Rsquared values you will also have to output FitStatistics.  These two datasets will have to be merged on the BY variable to get what you want for your PROC PRINT.

 

Also, consider looking at information criteria (AIC, corrected AIC) as a measure of fit.  A larger number of levels for the categorical variable is going to artificially inflate the R-squared values. To get these, you would have to run the regressions in a likelihood based program (GENMOD, MIXED, etc.), but they would give you some idea of how much information is preserved from the raw data in the modeled results.  Just a thought.

 

SteveDenham

mh2t
Obsidian | Level 7

Thank you @SteveDenham for your response. I don't know how to merge response variable with VarName in Long dataset.

mh2t
Obsidian | Level 7

@PGStats my problem is that, I don't know how to include the response variable (TARGET) from original table. @SteveDenham suggested to merge TARGET on VarName in Long table, which not sure how to implement it in SAS.

SteveDenham
Jade | Level 19

You shouldn't have to do the merge.  Your transposition code should be keeping Target (since it is not in the DROP statement).  It may be that the code needs some refinement.  Please do the following: Given the first 3 records of original, what does the output of Long look like, following your code? Show the input (suitably truncated to 3 or 4 myVar's since 10K myVar's values would give 50K lines in Long) and the first 10 lines of Long, as well as the log.  From that we should be able to come up with the refinements you need.

 

SteveDenham

PaigeMiller
Diamond | Level 26

I'm trying to run thousands of univariate regression by PROC GLM

 

Taking a step back, I know you want code to do this, but I'm wondering if there really is a good reason to run thousands of univariate regressions. What is the benefit? What will you do with them once you have them?

 

If your x-variable is categorical and your y-variable is continuous, maybe you could use PROC SUMMARY to get the means and standard deviations to see if the group means are different (provided you don't run out of memory), and this will give you the ability to do one more computation to compare the means. This ought to be the equivalent to your univariate regressions. And much simpler to program. But again, why?

--
Paige Miller
SteveDenham
Jade | Level 19

Excellent point @PaigeMiller . As @PGStats mentioned, some of these regressions are going to be highly significant, if it were done regressing a continuous variable on a continuous independent variable and the independent variable was simply a random sample of numbers.  Now consider that with categorical variables, there are N-1 regression coefficients for a categorical variable with N levels.  Without ever doing the regression, I can say that the analyses with N large (and it looks like there are some where N>100) are going to give a higher Rsquared than analyses with a small number of levels.  For example, look at the difference in Rsquared for a simple regression, a regression for the same data but on a 2 level class variable, and a regression for the same data on a 60 level class variable:

 

data two;
call streaminit(23062020);
do i = 1 to 100;
	x = rand('uniform');
	y = 2 + 3*x + 0.1*rand('uniform');
	if x>0.5 then group=1;
	else group=2;
	grp3 = floor(100*x);
	output;
	end;
run;

proc glm data=two;
model y= x/solution;
quit;
run;

proc glm data=two;
class group;
model y=group/solution ;
quit;
run;

proc glm data=two;
class grp3;
model y=grp3/solution ;
quit;
run;

The Rsquared values for the three scenarios are 0.9988, 0.6967 and 0.9995.  So just by making the grouping more granular, I greatly increased Rsquared, to the point it is indistinguishable from the continuous case.

 

OK - that is a long way around Robin Hood's barn to saying (as did @PaigeMiller ) - why?  What is the benefit?

 

SteveDenham

 

mh2t
Obsidian | Level 7

@SteveDenham I ran my code on 3 variables over 250k random obs:

 

PROC IMPORT OUT= vars DATAFILE= './summary.xls'
            DBMS=xls REPLACE;
     GETNAMES=YES;
RUN;

/* Use PROC SQL to create a macro variable (MissingVarList) that contains
   the list of variables that have a property such as missing values */
RSUBMIT;
proc sql noprint;                              
 select Variable,NLevels into :VarList separated by ' ', :nlevel
 from vars
 where NLevels = 3;
quit;
%put &=VarList;

/* 1. transpose;
data Long;
set proj.original;             /* <== specify data set name HERE  */
array x [*] &VarList;        /* <== specify explanatory variables HERE */
do varNum = 1 to 3;
   VarName = vname(x[varNum]);  /* variable name in char var */
   Value = x[varNum];          /* value for each variable for each obs */
   output;
end;
drop &VarList;
run;


/* 2. Sort by BY-group variable */
proc sort data=Long;  by VarName;  run;

ods graphics off;
ods exclude all;
proc glm data=Long outest=PE;
   CLASS Value;
   by VarName;
   model target = Value / RSQUARE;
quit;
ods exclude none;

/* Look at the results */
proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
run;quit;

here's my log file after transpose (summary importing is working correctly):

 

31 /* 1. transpose from wide (Y, X1 ,...,X10000) to long (varNum VarName Y Value) */

54 data Long;
55 set proj.original; /* <== specify data set name HERE */
56 array x [*] &VarList; /* <== specify explanatory variables HERE */
57 do varNum = 1 to 3;
58 VarName = vname(x[varNum]); /* variable name in char var */
59 Value = x[varNum]; /* value for each variable for each obs */
60 output;
61 end;
65 drop &VarList;
66 run;

NOTE: There were 250000 observations read from the data set PROJ.ORIGINAL.
NOTE: The data set WORK.LONG has 750000 observations and 10027 variables.
NOTE: DATA statement used (Total process time):
real time 17.61 seconds
cpu time 17.61 seconds

68
69 /* 2. Sort by BY-group variable */
70 proc sort data=Long; by VarName; run;

NOTE: There were 750000 observations read from the data set WORK.LONG.
NOTE: The data set WORK.LONG has 750000 observations and 10027 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 38.00 seconds
cpu time 59.61 seconds


71
72
73 ods graphics off;
74 ods exclude all;
75 proc glm data=Long outest=PE;
------
22
76
ERROR 22-322: Syntax error, expecting one of the following: ;, (, ALPHA, DATA, MANOVA,
MULTIPASS, NAMELEN, NOPRINT, ORDER, OUTSTAT, PLOTS.
ERROR 76-322: Syntax error, statement will be ignored.
76 CLASS Value;
77 by VarName;
78 model target = Value / RSQUARE;
-------
22
202
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
ERROR 22-322: Syntax error, expecting one of the following: ;, ALIASING, ALPHA, CLI, CLM,
CLPARM, COVBYCLASS, E, E1, E2, E3, E4, EST, I, INTERCEPT, INVERSE, NOINT, NOUNI,
P, PREDICTED, SINGULAR, SOLUTION, SS1, SS2, SS3, SS4, TOLERANCE, X, XPX, ZETA.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
79 quit;

NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds


80 ods exclude none;
81
91
92 /* Look at the results */
93 proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
ERROR: File WORK.PE.DATA does not exist.
94 run;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

94 ! quit;

 

my Long dataset looks like this:

 

ID             TARGET   VAR1  ...    VAR10000  VarNum        VarName           Value

1                   yyy            xx                 xx                 1              Var1           var1_level1

2                   yyy            xx                 xx                 2              Var2           var2_level1

3                   yyy            xx                 xx                 3              Var3           var3_level1

4                   yyy            xx                 xx                 1              Var1           var1_level2

5                   yyy            xx                 xx                 2              Var2           var2_level2

6                   yyy            xx                 xx                 3              Var3           var3_level1

7                   yyy            xx                 xx                 1              Var1           var1_level1

8                   yyy            xx                 xx                 2              Var2            var2_level2

9                   yyy            xx                 xx                 3              Var3           var3_level2

mh2t
Obsidian | Level 7
@SteveDenham  I ran my code on 3 variables over 250k random obs:
PROC IMPORT OUT= vars DATAFILE= './summary.xls'
DBMS=xls REPLACE;
GETNAMES=YES;
RUN;

/* Use PROC SQL to create a macro variable (MissingVarList) that contains
the list of variables that have a property such as missing values */
RSUBMIT;
proc sql noprint;
select Variable,NLevels into :VarList separated by ' ', :nlevel
from vars
where NLevels = 3;
quit;
%put &=VarList;

/* 1. transpose from wide (Y, X1 ,...,X10000) to long (varNum VarName Y Value) */
data Long;
set proj.original; /* <== specify data set name HERE */
array x [*] &VarList; /* <== specify explanatory variables HERE */
do varNum = 1 to 3;
VarName = vname(x[varNum]); /* variable name in char var */
Value = x[varNum]; /* value for each variable for each obs */
output;
end;
drop &VarList;
run;


/* 2. Sort by BY-group variable */
proc sort data=Long; by VarName; run;

ods graphics off;
ods exclude all;
proc glm data=Long outest=PE;
CLASS Value;
by VarName;
model target = Value / RSQUARE;
quit;
ods exclude none;

/* Look at the results */
proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
run;quit;
 
 
here's my log file after transpose (summary importing is working correctly):
 
 
 
 
 
31 /* 1. transpose from wide (Y, X1 ,...,X10000) to long (varNum VarName Y Value) */
 
54 data Long;
55 set proj.original; /* <== specify data set name HERE */
56 array x [*] &VarList; /* <== specify explanatory variables HERE */
57 do varNum = 1 to 3;
58 VarName = vname(x[varNum]); /* variable name in char var */
59 Value = x[varNum]; /* value for each variable for each obs */
60 output;
61 end;
65 drop &VarList;
66 run;
 
NOTE: There were 250000 observations read from the data set PROJ.ORIGINAL.
NOTE: The data set WORK.LONG has 750000 observations and 10027 variables.
NOTE: DATA statement used (Total process time):
real time 17.61 seconds
cpu time 17.61 seconds
 
68
69 /* 2. Sort by BY-group variable */
70 proc sort data=Long; by VarName; run;
 
NOTE: There were 750000 observations read from the data set WORK.LONG.
NOTE: The data set WORK.LONG has 750000 observations and 10027 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 38.00 seconds
cpu time 59.61 seconds
 
 
71
72
73 ods graphics off;
74 ods exclude all;
75 proc glm data=Long outest=PE;
------
22
76
ERROR 22-322: Syntax error, expecting one of the following: ;, (, ALPHA, DATA, MANOVA,
MULTIPASS, NAMELEN, NOPRINT, ORDER, OUTSTAT, PLOTS.
ERROR 76-322: Syntax error, statement will be ignored.
76 CLASS Value;
77 by VarName;
78 model target = Value / RSQUARE;
-------
22
202
ERROR: No data set open to look up variables.
NOTE: The previous statement has been deleted.
ERROR 22-322: Syntax error, expecting one of the following: ;, ALIASING, ALPHA, CLI, CLM,
CLPARM, COVBYCLASS, E, E1, E2, E3, E4, EST, I, INTERCEPT, INVERSE, NOINT, NOUNI,
P, PREDICTED, SINGULAR, SOLUTION, SS1, SS2, SS3, SS4, TOLERANCE, X, XPX, ZETA.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
79 quit;
 
NOTE: PROCEDURE GLM used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
 
 
80 ods exclude none;
81
91
92 /* Look at the results */
93 proc print data=PE(obs=5 keep=VarName Intercept Value _RSQ_);
ERROR: File WORK.PE.DATA does not exist.
94 run;
 
NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
 
94 ! quit;
 
ID    TARGET   VAR1  ...  VAR10000  VarNum  VarName    Value
 
1      yyy              xx             xx                    1          Var1    var1_level1
 
2      yyy              xx             xx                     2           Var2    var2_level1
 
3      yyy              xx             xx                    3           Var3    var3_level1
 
4      yyy             xx             xx                    1            Var1    var1_level2
 
5      yyy            xx             xx                    2             Var2     var2_level2
 
6      yyy           xx             xx                     3             Var3     var3_level1
 
7      yyy          xx             xx                     1              Var1    var1_level1
 
8      yyy           xx             xx                    2               Var2    var2_level2
 
9      yyy          xx             xx                     3               Var3    var3_level2
SteveDenham
Jade | Level 19

Well, it looks like the data is correctly shaped (at least to me). The error in the PROC GLM call arises from including outest=PE at this point.  This looks like a relic from PROC REG code.  In GLM, so far as I can tell, you have to use an ODS OUTPUT statement. To get parameter estimates and Rsquared, you would need this

ODS OUTPUT ParameterEstimates=ParameterEstimates FitStatistics=FitStatistics.

 

To get the estimates and Rsquared into the same dataset, you will have to do some shaping, and then a many-to-one merge or sql join, as you will have a lot of lines of parameter estimates for each by variable, but only one line in FitStatistics with the Rsquared value.

 

SteveDenham

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1508 views
  • 6 likes
  • 4 in conversation