About RobF

RobF · ‎04-24-2015

Hi FriedEgg - No, with the deadlines for the global forum fast approaching, I didn't have time to include stopping criteria or wrapper code that uses 5- or 10-fold cross-validation to find optimal values of the alpha and lambda penalties. Instead I concentrated on just getting the algorithm to converge to a solution (:-P) by comparing output with glmnet in R using the diabetes test dataset for logistic and Poisson regressions. At ~1,000 iterations the code matches glmnet output reasonably well (though not perfectly). I'm hoping the code is useful as a niche solution for SAS users without access to Revolution Analytics who want to run a lasso on a dataset too big for R to handle. Robert
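A stopping criterion of the kind mentioned above might look roughly like the sketch below: record the largest change in any coefficient during a sweep and leave the iteration loop once it falls under a tolerance. The tolerance `tol`, the `target_` values, and the averaging update are all made-up placeholders, not part of the posted program; the real coordinate-descent update would go where the comment indicates.

```sas
%let numvars=3;
%let numvars2=4;
%let numiter=100000;
%let tol=1e-7;                 /* hypothetical convergence tolerance */

data _null_;
  array p_[0:&numvars.] _temporary_ (&numvars2.*1);
  array target_[0:&numvars.] _temporary_ (0.5 -1 0 2);  /* made-up fixed points */

  do i = 1 to &numiter.;
    max_change = 0;
    do j = 0 to &numvars.;
      p_old = p_[j];
      /* placeholder update; the real coordinate update would go here */
      p_[j] = 0.5*p_[j] + 0.5*target_[j];
      max_change = max(max_change, abs(p_[j] - p_old));
    end;
    /* no coefficient moved by more than tol in this sweep: stop iterating */
    if max_change < &tol. then leave;
  end;

  put "Stopped at sweep " i "with max_change=" max_change;
run;
```

With a check like this, `&numiter` becomes an upper bound rather than a fixed amount of work.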

RobF · ‎04-24-2015

Thanks Kurt - is there a way to curtail the runtime checking in SAS to increase performance?

RobF · ‎04-23-2015

Right, I am converging to a good solution after about 1,000 iterations which takes only about 3 seconds. I chose 100,000 iterations since that's the baseline # of iterations in the R glmnet routine. However, for much larger datasets which I foresee I'll be running in the future, the runtime could conceivably take hours. I don't know, maybe this is the best I can do in Base SAS. I just saw this article by Rick Wicklin which may help out even if I'm not using proc iml: http://blogs.sas.com/content/iml/2013/05/15/vectorize-computations.html

RobF · ‎04-23-2015

No - my company only has SAS EG loaded on the server. I'm hoping a fast & efficient program can be written in Base SAS without bringing in other modules.

RobF · ‎04-23-2015

I need help improving the execution speed of a numerical optimization program (see below) I've written in Base SAS. My program works fine - it successfully calculates parameters for a penalized logistic regression using a coordinate descent routine. The program is based on the algorithm used in the glmnet package in R, which is written in Fortran. Running glmnet in R is lightning fast; however, my code takes significantly longer.

With the help of others I've streamlined my program considerably - it reads my data into a two-dimensional temporary array and processes the array in a single data step before writing the final coefficient estimates to an output dataset. So I/O processing ought to be minimized, and my calculations should be conducted in memory, which ought to minimize execution time.

For trial runs I'm running the program on the publicly available diabetes dataset (http://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt) after standardizing the 10 predictor variables x1-x10, adding an intercept constant x0=1, and creating a binary response variable y_gt140=1 if y>140, else y_gt140=0. Running the code below takes ~5 minutes in SAS, compared with < 1 second to run glmnet in R. What else can I do to improve the execution time of my program? Thanks in advance!

    **********************************************************************
    * Logistic regression coordinate descent code for elastic net        *
    **********************************************************************;
    %let nobs=442;
    %let numvars=10;
    %let numvars2=11;
    %let numiter=100000;
    %let lambda_list=.1;
    %let alpha_list=.5;

    %macro parmlist; %do i = 0 %to &numvars; &&p&i %end; %mend parmlist;
    %macro meanlist; %do i = 0 %to &numvars; &&mean&i %end; %mend meanlist;
    %macro stdlist; %do i = 0 %to &numvars; &&std&i %end; %mend stdlist;

    data coord_descent_output (keep=alpha lambda parm_unstnd_0-parm_unstnd_&numvars2.);
      array xx[0:&numvars.] x0-x&numvars.;
      array x_[&nobs.,0:&numvars.] _temporary_;
      array y_[&nobs] _temporary_;
      array mean_[0:&numvars.] (%meanlist);
      array std_[0:&numvars.] (%stdlist);
      array parm_unstnd_[0:&numvars2.] (&numvars2.*0);

      *** Load data into two dimensional array ***;
      do _n_ = 1 to &nobs.;
        set work.diabetes_stnd_array2 nobs=nobs;
        do j=0 to &numvars.;
          x_[_n_,j] = xx[j];
          y_[_n_] = y_gt140;
        end;
      end;

      do alpha=&alpha_list;
        do lambda=&lambda_list;
          gamma=alpha*lambda;
          array p_[0:&numvars.] (&numvars2.*1);
          do i=1 to &numiter;
            do j=0 to &numvars;
              z = 0;
              sum_wtx_sq = 0;
              do _nn_ = 1 to &nobs;
                yhat = p_[0];
                do k=1 to &numvars;
                  yhat = sum(yhat, p_[k]*x_[_nn_,k]);
                end;
                proby = 1/(1 + exp(-yhat));
                if proby <= .00001 then do; proby = 0; weight = .00001; end;
                else if proby >= .99999 then do; proby = 1; weight = .00001; end;
                else weight = proby*(1 - proby);
                z = sum(z, (x_[_nn_,j]*(y_[_nn_] - proby) + weight*p_[j]*(x_[_nn_,j])**2));
                sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_nn_,j])**2);
              end;
              if j=0 then do;
                p_[j] = z/sum_wtx_sq;
              end;
              else if j>0 then do;
                if (z/&nobs > 0 and gamma < abs(z/&nobs)) then p_[j] = (z/&nobs - gamma)/(sum_wtx_sq/&nobs + lambda - gamma);
                else if (z/&nobs < 0 and gamma < abs(z/&nobs)) then p_[j] = (z/&nobs + gamma)/(sum_wtx_sq/&nobs + lambda - gamma);
                else if gamma >= abs(z/&nobs) then p_[j] = 0;
              end;
            end; * Inner loop ;
          end; * Outer loop ;

          *** Calculate "destandardized" regression coeff.'s from standardized predictor variables (Mean(x)=0, Var(x)=1). ******;
          parm_unstnd_[0] = p_[0];
          do l=1 to &numvars.;
            parm_unstnd_[0] = parm_unstnd_[0] - p_[l]*mean_[l]/std_[l];
            parm_unstnd_[l] = p_[l]/std_[l];
          end;
          output coord_descent_output;
          put "Final parm_unstnd_ =" parm_unstnd_[*];
        end;
      end;
    run;
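One change that might reduce the work per sweep, offered as an untested assumption rather than a measured diagnosis: the posted loop rebuilds yhat from all coefficients for every observation and every coordinate j. Because only p_[j] changes between coordinates, the linear predictor could instead be stored per observation and patched after each update. The sketch below uses made-up data and a stand-in update value; `eta_`, `p_old`, and `max_diff` are hypothetical names. It only checks that the patched predictor agrees with a full recomputation.

```sas
%let nobs=5;
%let numvars=2;

data _null_;
  array x_[&nobs.,0:&numvars.] _temporary_;
  array p_[0:&numvars.] _temporary_ (1 1 1);
  array eta_[&nobs.] _temporary_;        /* stored linear predictor */

  /* Made-up data: x0 = 1 (intercept), remaining columns random */
  call streaminit(1);
  do n = 1 to &nobs.;
    x_[n,0] = 1;
    do k = 1 to &numvars.;
      x_[n,k] = rand('normal');
    end;
  end;

  /* Build the linear predictor once, before the first sweep */
  do n = 1 to &nobs.;
    eta_[n] = 0;
    do k = 0 to &numvars.;
      eta_[n] = eta_[n] + p_[k]*x_[n,k];
    end;
  end;

  /* After coordinate j changes, patch eta_ with one pass over the rows */
  j = 1;
  p_old = p_[j];
  p_[j] = 0.25;                          /* stand-in for the real update */
  do n = 1 to &nobs.;
    eta_[n] = eta_[n] + (p_[j] - p_old)*x_[n,j];
  end;

  /* Compare against a full recomputation */
  max_diff = 0;
  do n = 1 to &nobs.;
    full = 0;
    do k = 0 to &numvars.;
      full = full + p_[k]*x_[n,k];
    end;
    max_diff = max(max_diff, abs(full - eta_[n]));
  end;
  put "Largest difference between patched and recomputed predictor: " max_diff;
run;
```

In the full program the patch would run right after each p_[j] update, and the observation loop would read eta_[_nn_] instead of looping over k.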

RobF · ‎04-13-2015

Thanks guys - unfortunately neither of these suggestions is working. The program runs without error; however, the output parameter values are incorrect (compared to running the code without any p_[j]=0 bypass at all). I have a feeling there's a logic error rather than a syntax error here, alas.

RobF · ‎04-13-2015

I've written a routine that calculates a penalized logistic regression by reading my dataset into a two-dimensional array, then iteratively looping through the dataset array by columns and rows and updating the values in the regression parameter array.

I'm having a surprisingly difficult time writing a statement that gracefully skips an iteration of the "j=0 to &numvars" do loop if the current value of p_[j] equals 0 (see the >>> arrows marking the relevant program lines below). Basically, if the current value of p_[j] = 0, then there's no need to update the value of p_[j] and the program should continue to the (j+1)th parameter in the do loop. I don't want to exit the "j=0 to &numvars" do loop - just skip to the next value of j in the sequence.

I've tried leave, continue, and goto statements, but no luck. Running the code below with the goto statement, I receive the following error:

    145    end; * End lambda values loop ;
    146
    147    end; * End alpha values loop ;
           ___
           161
    ERROR 161-185: No matching DO/SELECT statement.
    148
    149
    150
    151    run;

Any ideas what I'm doing wrong & how I ought to proceed? Thanks in advance!

    **********************************************************************
    * Logistic regression coordinate descent code for elastic net        *
    **********************************************************************;
    %let nobs=442;
    %let numvars=10;
    %let numvars2=11;
    %let numiter=1000;
    %let lambda_list=.1;
    %let alpha_list=1;

    data coord_descent_output (keep=alpha lambda p_0-p_&numvars2.);
      array xx[0:&numvars.] x0-x&numvars.;
      array x_[&nobs.,0:&numvars.] _temporary_;
      array y_[&nobs] _temporary_;

      *** Load data into two dimensional array ***;
      do _n_ = 1 to &nobs.;
        set diabetes_stnd_array2 nobs=nobs;
        do j=0 to &numvars.;
          x_[_n_,j] = xx[j];
          y_[_n_] = y_gt140;
        end;
      end;

      *** Coordinate descent routine ***;
      do alpha=&alpha_list; * Start alpha values loop ;
        do lambda=&lambda_list; * Start lambda values loop ;
          gamma=alpha*lambda;
          array p_[0:&numvars.] (&numvars2.*1); * Assign initial parameter values ;
          do i=1 to &numiter; * Start iteration loop ;
            do j=0 to &numvars; * Start data column loop ;
    >>>       if p_[j] = 0 then goto endloop; * Bypass calculations and proceed to p_[j+1] if p_[j]=0 ;
              z = 0;
              sum_wtx_sq = 0;
              do _nn_ = 1 to &nobs; * Start data record loop ;
                yhat = p_[0];
                do k=1 to &numvars;
                  yhat = sum(yhat, p_[k]*x_[_nn_,k]);
                end;
                proby = 1/(1 + exp(-yhat));
                if proby <= .00001 then do; proby = 0; weight = .00001; end;
                else if proby >= .99999 then do; proby = 1; weight = .00001; end;
                else weight = proby*(1 - proby);
                z = sum(z, (x_[_nn_,j]*(y_[_nn_] - proby) + weight*p_[j]*(x_[_nn_,j])**2));
                sum_wtx_sq = sum(sum_wtx_sq, weight*(x_[_nn_,j])**2);
              end; * End data record loop ;
              if j=0 then do;
                p_[j] = z/sum_wtx_sq;
              end;
              else if j>0 then do;
                if (z/&nobs > 0 and gamma < abs(z/&nobs)) then p_[j] = (z/&nobs - gamma)/(sum_wtx_sq/&nobs + lambda - gamma);
                else if (z/&nobs < 0 and gamma < abs(z/&nobs)) then p_[j] = (z/&nobs + gamma)/(sum_wtx_sq/&nobs + lambda - gamma);
                else if gamma >= abs(z/&nobs) then p_[j] = 0;
              end;
    >>>       endloop: end; * End p_[j]=0 bypass loop ;
            end; * End data column loop ;
          end; * End iteration loop ;
          output coord_descent_output;
          put "p_ =" p_[*];
        end; * End lambda values loop ;
      end; * End alpha values loop ;
    run;
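For reference, a minimal sketch of skipping to the next value of j with the CONTINUE statement, using made-up coefficients and a stand-in update. In the posted code, the line `endloop: end;` appears to add an END that has no DO of its own, which would leave one END too many by the time SAS reaches the alpha loop and would be consistent with ERROR 161-185.

```sas
data _null_;
  array p_[0:3] _temporary_ (1 0 2 0);   /* made-up coefficients */
  do i = 1 to 2;                          /* iteration loop */
    do j = 0 to 3;                        /* coefficient loop */
      if p_[j] = 0 then continue;         /* skip straight to the next j */
      p_[j] = p_[j]/2;                    /* stand-in for the real update */
      value = p_[j];
      put i= j= value=;
    end;
  end;
run;
```

The same effect is possible with GOTO by putting the label on a null statement (`endloop: ;`) just before the single END that closes the j loop. One caution on the logic itself: in coordinate descent a coefficient that is currently 0 can become nonzero on a later sweep, so a bypass like this freezes it at 0 and may change the final estimates.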

RobF · ‎03-31-2015

Will do, thanks Tom. So far my program is running fine with the keep statement. Thank you, all, for the suggestions. At the moment my worries about working with really big data that consumes most of the memory on my machine are mostly theoretical, but I'd like to write my code with that contingency in mind for efficiency's sake.

RobF · ‎03-30-2015

Yea, think I'll scratch the "data _null_;" idea and just use a regular "data report (keep=...);" statement to output the end results of my program computations into a separate dataset. That works fine - I just want SAS to avoid creating a duplicate dataset in memory, then dropping the excess columns after reading the keep= line in my data statement.
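A small sketch of that approach, reusing the three-row test dataset from the original question; the dataset name `work.report` is only illustrative. FILE and PUT write text to an external file, so the result is not a SAS dataset even when the file name ends in .sas7bdat, whereas naming the dataset on the DATA statement creates a real one.

```sas
data test_data;
  input y x1 x2;
  cards;
1 2 3
10 20 30
100 200 300
;
run;

/* keep= on the output dataset limits what is written; x1 and x2 stay
   available for calculations inside the step */
data work.report (keep=y);
  set test_data;
run;

proc print data=work.report;
run;
```

The KEEP= data set option is applied as observations are written, so no second, full-width copy of the data is created and trimmed afterwards.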

RobF · ‎03-30-2015

Well I'm actually doing a lot of data processing inside the data _null_ step by reading my data into an array, and then just keeping the final output, so not sure if a proc copy or proc datasets would work inside of the data _null_. (The example I offered in my question may be a bit misleading since it's grossly simplified.) I'll try Tom's idea and see how it works. I suppose the other option is just output the final results to the SAS log with a put statement, then copy and paste to Excel or whatever.

RobF · ‎03-30-2015

That may work . . . but I'm trying to keep overhead memory consumption at a minimum, especially when working with a big dataset. Hence the use of the data _null_ statement instead of simply doing what you suggested. Or are my worries unjustified?

RobF · ‎03-30-2015

Thanks ballardw - I was hoping there would be a quick fix while sticking with the data _null_ step.

RobF · ‎03-30-2015

I'm attempting to output observations to a SAS dataset on my company's server in SAS Enterprise Guide from within a data _null_ step. Here's the basic idea using a test dataset:

    data test_data;
    input y x1 x2;
    cards;
    1 2 3
    10 20 30
    100 200 300
    ;
    run;

    data _NULL_;
    set test_data;
    file 'E:\SAS Temporary Files\_TD21860_VO-DCA-VSAS01_\Prc2\report.sas7bdat';
    put y;
    run;

The file statement includes the address of the WORK folder on the company server. The code successfully creates a file named "report"; however, when I attempt to open the file I receive the following error message even though I specified the SAS data set extension ".sas7bdat" in the file statement:

    The open data operation failed. The following error occurred.
    [Error] File WORK.REPORT.DATA is not a SAS data set.

Neither can I successfully run:

    proc print data=report;
    run;

What am I doing wrong? Thanks in advance

Robert

RobF · ‎02-11-2015

I'm working on a SAS program that runs a coordinate descent algorithm which repetitively processes & updates dataset variables inside a macro loop:

    %macro test(dataset=, numvars=, numiter=, lambda=, alpha=);
      %do i=1 %to &numiter;
        %do j=1 %to &numvars;
          data &dataset (keep=y x1-x&numvars);
            set &dataset end=end_data;
            array x[&numvars] x1-x&numvars;
            < ... Code for variable manipulations & calculations ...>
          run;
        %end;
      %end;
    %mend;

    %test(dataset=&dataset, numvars=10, numiter=100, lambda=.1, alpha=1)

I wrote this code out of convenience, realizing that putting a data step inside the macro loop isn't terribly efficient since opening and closing the same dataset eats up CPU resources. Is there a better way to code this algorithm that will save significant time when running?

RobF · ‎01-10-2015

Thank you for the suggestions. In fact, I've also found that cutting and pasting data from Excel into this editor window for a new SAS Discussion question automatically formats the data. I can then cut and paste from the editor window directly into a data step & SAS creates a correctly formatted dataset.

    data hospital_data;
    length provider $50.;
    input PROVIDER $ Q1Q2_2011_denom Q1Q2_2011_numer Q3Q4_2013_denom Q3Q4_2013_numer;
    cards;
    …
    run;

