About DannyModlin


Imagine that you are given a new data set and a request to create a model that predicts the value of a response using several predictor variables. You start your exploration into the relationships between the predictors and the response. Someone then asks you if any of your variables are confounding with one another. Would you know what they are asking? How would you determine if there was confounding in your data? In this post, we will discuss the what and the how of confounding. We will present two possible ways to determine confounding with nominal/categorical variables and briefly mention Simpson’s Paradox.

It is important not to confuse statistical confounding with the presence of a statistical interaction. Statistical confounding is when a covariate is associated with both the response and another predictor variable. The estimate of the effect of the primary predictor variable on the response is distorted because it is mixed with the effect of the confounder. Confounding can be detected by noting changes in the parameter estimates when the covariate is added and when it is removed. Statistical interaction occurs when the effect of one covariate varies at different levels of another covariate. Interactions can be detected using hypothesis testing of a higher-order term that involves both covariates.

Let’s look at a visual representation of confounding. In the first image, where gender is in our model alone, we see that the average response is four units higher for females compared to males. In the second image, we include age in the analysis with gender. The result is now that the average response for females is approximately 1.3 units lower than for males. This comparison is made at the average age of 35. When age is present versus absent from the model, we see the relationship between gender and the response change.
Depending on the extent of this change, this could be indicative of confounding.

How does this compare to an interaction visually? In this image, we see that there is a positive relationship between age and the response for males and a negative relationship between age and the response for females. This is a change of the relationship between age and response at different levels of gender. This is an interaction.

So how do we assess confounding? There are two ways that people can proceed. One is to use stratified contingency tables and the other is to perform the Delta method.

With stratified contingency tables, we will compare the crude odds ratios with the adjusted odds ratios. Crude odds ratios are the ones calculated when the potentially confounding variable is absent from the analysis. The adjusted odds ratio is calculated when the potentially confounding variable is present in the analysis. In the following diagram, the crude odds ratio is found when just the predictor UI is assessed against the response variable LOW. The adjusted odds ratio is found when we stratify on SMOKE.

We can get this information from a PROC FREQ crosstabulation table output. Typically, the row variable is the predictor of interest, and the column variable is the response. In the TABLES line, the row is named first and the column second. The RELRISK option is what produces the table containing the crude odds ratio.

    proc freq data=birth;
       tables UI*LOW / RELRISK;
       title "Association Between UI and Low Birth Weight";
    run;

The adjusted odds ratio will be calculated from the stratified tables where the stratification is made on the variable SMOKE. This is also available in PROC FREQ output. SMOKE is the variable we think may be confounding with UI. On the TABLES line, we start with SMOKE as the stratification variable. We then follow with the UI*LOW part as before. The order is stratification*row*column.
The RELRISK option presents the odds ratios for each stratified grouping, but it is the CMH option that generates the Cochran-Mantel-Haenszel statistic and the adjusted odds ratio.

    proc freq data=birth;
       tables SMOKE*UI*LOW / CMH RELRISK;
       title "Association Between UI and Low Birth Weight, Stratified by SMOKE";
    run;

If there is a large change between the crude and the adjusted odds ratios, particularly in the significance, then we conclude that there is confounding between our two predictor variables. In our example, the crude odds ratio is 2.5778 and the adjusted odds ratio is 2.4570. Neither 95 percent confidence interval contains the value of 1. I am not seeing the presence of confounding between SMOKE and UI.

Alternatively, we can leave PROC FREQ and perform this assessment of confounding using our regression procedures, like PROC LOGISTIC. This will have us performing the Delta method. In this case, we run the regression analysis both with and without the potential confounding predictor.

The leading ODS statement requests that the parameter estimates table of each of the following logistic procedures be saved to a data set. The first one will be named parms. All subsequent ones will be named parms1 and so forth. We request a logistic regression both with and without the variable SMOKE. Not only is this checking for confounding with UI but also with all other variables in the model line.

    ods output parameterestimates(match_all persist=proc)=work.parms;

    proc logistic data=birth;
       class ETH(ref='3') SMOKE PTL HT UI FTV(param=ordinal) / param=ref ref=first;
       model LOW(event='Yes') = AGE LWT ETH SMOKE PTL HT UI FTV;
       title 'Full Model';
    run;

    proc logistic data=birth;
       class ETH(ref='3') SMOKE PTL HT UI FTV(param=ordinal) / param=ref ref=first;
       model LOW(event='Yes') = AGE LWT ETH PTL HT UI FTV;
       title 'Model Removing Smoke';
    run;

    ods output close;

We save our output from the regressions, sort, and then use PROC COMPARE to see the change in the values of the parameter estimates and also the p-values. The variable classval0 is used when you have categorical variables in your model. This term contains the level of the categorical variable.

    proc sort data=work.parms;
       by variable classval0;
    run;

    proc sort data=work.parms1;
       by variable classval0;
    run;

    proc compare base=work.parms compare=work.parms1;
       id variable classval0;
       var estimate probchisq;
    run;

The PROC COMPARE code will compare the parameter estimates (estimate) and the p-values (probchisq) between the two saved output data sets, parms and parms1. What we are looking for is a change of more than 10 percent in the parameter estimates. The direction of this change is not the focus. We can also look for a 10 percent change in the p-values, but then we also need to see a change in the significance in addition to this percent of change. In our example, look at UI. The change in the parameter estimate is only 6 percent; however, the change in the p-value is 23 percent. Despite the change in p-value being larger than 10 percent, the significance did not change between the two runs. We would say that the presence of confounding is not detected.

If confounding is determined to be present, then what do you do? The truer relationship between your predictor of interest and the response is viewed when the confounding variable is present in the model.
This is regardless of the significance of the confounding variable. You must fight the urge to remove the potentially non-significant confounding predictor from the model. What could happen if you do remove or ignore this confounder? You may find yourself dealing with Simpson’s Paradox. This is when you have one result when the confounder is ignored and a different result when you account for this confounding. For an excellent example of Simpson’s Paradox, see the blog post by Rick Wicklin.

Before checking for confounding, we first need to check if the relationship between the two variables is an interaction. To check for an interaction, we can run a regression analysis and put the interaction effect into the model. If the interaction is significant, we do not have to follow up with a check for confounding.

    proc logistic data=birth;
       class ETH(ref='3') SMOKE PTL HT UI FTV(param=ordinal) / param=ref ref=first;
       model LOW(event='Yes') = AGE LWT ETH SMOKE PTL HT UI FTV SMOKE*UI;
       title 'Testing Interaction of UI and SMOKE';
    run;

The interaction is not significant, so that means you can then check for confounding using either of the two methods previously described.

With these two methods for determining the presence of confounding explained, I hope that you are no longer confounded by the concept of confounding. Always keep this topic in mind when you are performing your regressions as you never know where confounding may be lurking.
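The 10 percent rule used with PROC COMPARE in the confounding post can also be applied in a DATA step. The following is only a minimal sketch: the two small data sets are made-up stand-ins for work.parms and work.parms1 (with values chosen to mirror the 6 percent and 23 percent changes quoted for UI), the percent change is taken relative to the full model, and 0.05 is assumed as the significance cutoff. None of these choices comes from the original analysis.

```sas
/* Made-up stand-in for work.parms (full model); replace with the real ODS output */
data work.demo_full;
   length Variable $8 ClassVal0 $8;
   input Variable $ ClassVal0 $ Estimate ProbChiSq;
   datalines;
AGE . -0.0200 0.5000
UI 1 0.9000 0.0400
;

/* Made-up stand-in for work.parms1 (model with SMOKE removed) */
data work.demo_reduced;
   length Variable $8 ClassVal0 $8;
   input Variable $ ClassVal0 $ Estimate ProbChiSq;
   datalines;
AGE . -0.0210 0.4800
UI 1 0.9540 0.0308
;

/* Both inputs are already sorted by Variable ClassVal0, as MERGE requires */
data work.delta_check;
   merge work.demo_full   (rename=(Estimate=est_full    ProbChiSq=p_full)    in=in_full)
         work.demo_reduced(rename=(Estimate=est_reduced ProbChiSq=p_reduced) in=in_reduced);
   by Variable ClassVal0;
   if in_full and in_reduced;          /* keep terms that appear in both models */

   /* percent change relative to the full model (an assumption) */
   pct_est = 100 * abs(est_reduced - est_full) / abs(est_full);
   pct_p   = 100 * abs(p_reduced - p_full) / p_full;

   /* did significance flip at the assumed 0.05 cutoff? */
   sig_flip = ((p_full < 0.05) ne (p_reduced < 0.05));

   /* flag: estimate moved more than 10 percent, or the p-value moved
      more than 10 percent and the significance also changed */
   flag = (pct_est > 10) or (pct_p > 10 and sig_flip);
run;

proc print data=work.delta_check;
   var Variable ClassVal0 pct_est pct_p sig_flip flag;
run;
```

With real output, the two stand-in data sets would be replaced by the sorted work.parms and work.parms1. A term that exists in only one model, such as SMOKE itself, is dropped by the subsetting IF.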

DannyModlin · 05-23-2024

Many analysts are interested in taking models they currently have and transitioning them to the Bayesian realm. Most leap from their favorite classical analysis procedure directly to PROC MCMC, the general-purpose Bayesian procedure. PROC MCMC is a powerful procedure and capable of completing very complex analyses. The leap to PROC MCMC might be a challenge if one is not accustomed to coding their own link functions or dummy-coding categorical variables. In this blog, we will discuss another procedure that can ease this transition from classical to Bayesian, PROC BGLIMM.

PROC BGLIMM (Bayesian Generalized Linear Mixed Models) is a very direct way to move many of your classical analyses to Bayesian. Think of PROC BGLIMM as an alternative to GLIMMIX. For this reason, classical models that were performed in REG, GLM, GLMSELECT, GENMOD, MIXED, and GLIMMIX can be performed within PROC BGLIMM.

Let's begin with an example to illustrate my point. Suppose that a toy manufacturer was interested in comparing the breaking strength of three different adhesives. They randomly select seven toys off the assembly line. In this instance, we want to perform inference across all toys and not just those in the study. This results in adhesive being a fixed effect and toy a random effect. The following PROC MIXED code would be the way to analyze this in a classical sense.

    proc mixed data=sasuser.toy;
       class adhesive toy;
       model pressure=adhesive / solution ddfm=kr;
       random toy;
    run;

If we were to move this model to a Bayesian approach using PROC MCMC, the complexity of the code will increase. Without a CLASS statement, we would either use arrays or create the design variables on our own for the analysis. In our example, we used an array. The priors selected for this example were non-informative due to the lack of any additional knowledge about our problem.

    proc mcmc data=toy seed=27513 diag=all dic outpost=mixed propcov=quanew
              thin=25 nbi=5000 ntu=5000 nmc=500000
              plots(smooth)=all mchistory=brief stats=all;
       array beta[3];
       parms beta: 0;
       parms s2t 1;
       parms s2g 1;
       prior beta: ~ normal(0, var = 1e5);
       prior s2: ~ igamma(2.001, scale = 1.001);
       random gamma ~ normal(0,var=s2g) subject=toy monitor=(gamma) namesuffix=position;
       mu = beta[adhesivebeta] + gamma;
       model pressure ~ normal(mu, var = s2t);
       title "Bayesian Analysis of the Toy Data Set";
    run;

Would PROC BGLIMM have kept the complexity of the code simple? Yes! PROC BGLIMM does have a CLASS statement that avoids the issue with arrays or generating our own design variables. By default, the priors for the analysis are the non-informative type. From looking at the code below, you likely see aspects of other familiar procedures. This is what makes PROC BGLIMM a smoother transition to Bayesian.

    proc bglimm data=sasuser.toy seed=8675309;
       class adhesive toy;
       model pressure=adhesive / dist=normal link=identity;
       random int / sub=toy;
    run;

Let's take a moment and discuss details about this procedure. There is a suite, 13 in total, of covariance structures that can be selected both on the G and R side for mixed models. Allowances are made for covariance heterogeneity modeling. Models can be compared using the DIC (deviance information criterion) statistic. This is like the AIC/BIC statistic in classical analysis.

Since BGLIMM is Bayesian generalized linear modeling, there are several response distributions that can be used: binomial, exponential, gamma, geometric, inverse Gaussian, negative binomial, normal, Poisson, and binary. There are also several link functions that can be selected: log, logit, probit, inverse, identity, log-log, complementary log-log, and powerminus2.

PROC BGLIMM has built-in priors for you to select across the different parameters used within your analysis. For the fixed effect parameters (betas), your prior options are flat/constant and normal.
For the scale parameter, your prior options are inverse gamma, gamma, and improper. For the G-side covariance parameters, your options are inverse Wishart, inverse gamma, uniform, halfcauchy, halfnormal, and siwishart. For the R-side covariance parameters, your options are inverse Wishart and inverse gamma.

As you can see, you will have the ability to build many diverse types of models very quickly within the Bayesian structure using PROC BGLIMM. Do realize that if your model does venture outside the bounds of the options within PROC BGLIMM, all is not lost. You are welcome to then move to PROC MCMC to model this more complex analysis.

Let's look at a few other examples moving models from your favorite classical procedures to Bayesian using BGLIMM. In the class data set, the response variable weight is being regressed against the predictor variables height, age, and sex. In the BGLIMM code, the COEFFPRIOR= option in the MODEL statement is changing the default flat prior of the betas to a normal prior with a mean of zero and variance of one million. The random seed is set for replication across multiple runs of PROC BGLIMM.

    *Simple Linear Regression with Class Variable (GLM);
    proc glm data=sashelp.class;
       class Sex;
       model Weight = Height Age Sex;
    run;

    *Simple Linear Regression with Class Variable (BGLIMM);
    proc bglimm data=sashelp.class seed=8675309;
       class sex;
       model Weight = Height Age Sex / dist=normal coeffprior=normal(variance=1e6);
    run;

In the crab data set, the response variable satellites represents the count of male horseshoe crabs orbiting around the nest of a female horseshoe crab. Like GENMOD and GLIMMIX, the DIST= and LINK= options establish our generalized linear model type.

    *Poisson Regression with Random Effects (GLIMMIX);
    proc glimmix data=work.crab;
       class color spine site;
       model satellites = color spine weight width / dist=poi link=log solution;
       random int / subject=site;
    run;

    *Poisson Regression with Random Effects (BGLIMM);
    proc bglimm data=work.crab seed=8675309 diag=all plots=all;
       class color spine site;
       model satellites = color spine weight width / dist=poisson link=log;
       random int / subject=site;
    run;

In the heartrate data set, the average heartrate is being compared across different drug treatments. Each patient was measured hourly, establishing a repeated measures analysis. Note that BGLIMM returns to the use of the REPEATED statement for R-side random effects.

    *Normal Response with Repeated Measures (MIXED);
    proc mixed data=work.heartrate;
       class drug hours;
       model heartrate = baseline drug drug*baseline / solution ddfm=kr2;
       repeated hours / type=un subject=patient;
    run;

    *Normal Response with Repeated Measures (BGLIMM);
    proc bglimm data=work.heartrate seed=8675309 diag=all plots=all;
       class drug hours patient;
       model heartrate = baseline drug drug*baseline / dist=normal;
       repeated hours / type=un sub=patient;
    run;

Give PROC BGLIMM a try with your classical models and see how you like it. If you want to move to PROC MCMC, check out this repository full of information or this Special Collection offered from SAS Publications.

DannyModlin · 04-10-2024

As an alternative to the method shown in the blog, you could use the LOGISTIC function within PROC IML to undo the logit in one move: prob = logistic(yhat); The input for this function is just the logit predictions. You can also format your output with the following: print pred[c=('low_1':'low_5')]; Thanks to Rick Wicklin for these suggestions.
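A small sketch of the two suggestions in this comment, using a made-up matrix of logits in place of the yhat from the scoring post (three saved iterations by two new observations). The values and the two column labels are placeholders for illustration only.

```sas
proc iml;
   /* Made-up logits: rows = saved iterations, columns = new observations */
   yhat = {-2.0 0.0,
           -1.5 0.5,
           -1.0 1.0};

   prob = logistic(yhat);     /* inverse logit applied to every element */
   pred = prob[:,];           /* column means: posterior mean for each new observation */

   /* label the columns low_1, low_2 in the printed output */
   print pred[c=('low_1':'low_2')];
quit;
```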

DannyModlin · 04-04-2024

You have explored your data and generated your model using a Bayesian perspective. It is now time to use this model to score new observations. What are your options for Bayesian scoring? This blog will address two ways that you can perform Bayesian scoring on your new data.

Scoring data is nothing new to a statistical analyst. From code statements to score statements, we have generated predictions from our classical models for some time. What makes Bayesian scoring different? In the classical (frequentist) world, your resulting model yields a single parameter estimate, beta-hat, for each term in your model. With these estimated, we take our new observation and "plug and chug" out our single prediction, y-hat, for each of the new observations.

Recall that Bayesian statistics yield posterior distributions rather than single estimates. During the MCMC algorithm, we save iterations that represent a sampling of the posterior distributions of each of our parameters. So, we now have more than just one beta-hat for each parameter in our problem. Each saved iteration is a realization of what we believe the values of the parameters are at one moment or iteration.

To score a new observation, in the Bayesian perspective, we take each observation, one at a time, and compute a y-hat prediction for each saved iteration, using the parameter values drawn at that iteration. This essentially creates a list of predictions for each new observation. This list is a sample of the posterior predictive distribution for that new observation. From this posterior sample, we compute posterior summary statistics and intervals that can be used to answer questions as needed.

So, how do we score our Bayesian models within SAS? One way is to use the PREDDIST statement within the MCMC procedure. Do not add the PREDDIST statement to the procedure until you have confirmed that you have a converged chain.
Adding it earlier will potentially waste time, as scoring performed with a non-converged chain will be useless. Within this statement, you provide the name of a data set that contains your new observations to score. Please be sure that the variable names in the new data set match those in the training data set. Options allow you to control what posterior summary statistics will be displayed to the screen. It is in your best interest to save the scored chain to a SAS data set for any other use.

In this example, we have an already converged chain that we wish to score. The PREDDIST statement is bringing in the observations to score from the new_birth data set using the covariates option. The scored information will be stored in the data set scored. The posterior summary results will be stored in scored_summaries.

    ods select none;

    proc mcmc data=sasuser.birth plots=none seed=27513 outpost=birthout1
              propcov=quanew nbi=5000 ntu=5000 nmc=400000 thin=10;
       parms (beta0 beta1 beta2 beta3 beta4) 0;
       prior beta1 ~ normal(1.0986,var=0.00116);
       prior beta0 beta2 beta3 beta4 ~ normal(0, var=100);
       p = logistic(beta0 + beta1*alcohol + beta2*hist_hyp + beta3*mother_wt + beta4*prev_pretrm);
       model low ~ binary(p);
       preddist outpred=scored covariates=new_birth stats(percent=50)=summary;
       ods output predsummaries=scored_summaries;
    run;

    ods select all;

    proc print data=scored_summaries;
       title "Posterior Summaries of Scored Observations";
    run;

The predictions returned, in our example, are 0s and 1s, reflecting the fact that the response variable low was noted as binary in the model statement. PREDDIST calculates the predicted logit on each iteration for each new observation, applies the inverse link to return to a probability, and then declares each 0/1 prediction by comparing that probability with a random uniform draw.
The posterior means you see in the provided output are the means of these 0/1 predictions, thus the proportion of 1s for each new observation.

PREDDIST does have a few limitations. First, the number of new observations to be scored is capped at 50,000. Secondly, if you use the GENERAL or DGENERAL function, PREDDIST will be ignored. What alternative could you use if this situation presented itself? You could use PROC IML.

To use PROC IML for Bayesian scoring, I will assume that you have saved the converged chain from the PROC MCMC run (birthout1) and that you have a new data set containing your observations to be scored (new_birth). After starting PROC IML, we begin by calling in the saved chain and new observations into two matrices, parms and newobs respectively. Listing the names of the columns in the braces ({ }) restricts the matrix to just the columns that we need and also sets their order. It is important to make sure that the columns for the parameters and the columns of the new observations are in matching orders within the matrices.

    proc iml;
    *putting saved chain into a matrix;
    use birthout1;
    read all var {beta0 beta1 beta2 beta3 beta4} into parms;
    close birthout1;

    *putting new observations into a matrix with variables in correct order;
    use new_birth;
    read all var {alcohol hist_hyp mother_wt prev_pretrm} into newobs;
    close new_birth;

Since we included an intercept, we will generate a column of 1s and append this to the front of the new observations’ matrix, newobs.

    *creating and appending column of 1s for the intercept in the model;
    intercept = j(nrow(newobs),1,1);
    newobs = intercept || newobs;

Next, we multiply the parameters matrix against the transpose of the observation matrix. The transpose ensures the correct dimension match. Note that this matrix multiplication is doing exactly what we want. Each new observation is being applied against each saved iteration and yielding a list of predictions, yhat.

    *matrix multiply to make the logits for each new observation for each iteration;
    yhat = parms*newobs`;

We now have a list, a posterior sample, of predicted logits for each of the new observations. To get back to predicted probabilities, we will undo the logit transformation. First, we will create a matrix e, the same dimension as the yhat matrix, filled with the value of the mathematical constant e. Then, we use this matrix e to exponentiate each logit value within the matrix yhat.

    *make a matrix of value e and then exponentiate each logit value;
    e = j(nrow(yhat),ncol(yhat),constant("e"));
    eta = e ## (yhat);

Next, we take these exponentiated versions of the logits and process the inverse link of the logit transformation.

    *undo the logit transformation to return to predicted probability;
    prob = eta / (eta+1);

The resulting matrix, prob, now contains predicted probabilities for each new observation. Remember that we have as many predicted probabilities for each new observation as there were saved iterations from the saved chain. We then can use this to generate any posterior summary statistics we wish. In my example, I generate the posterior mean of each posterior predictive distribution.

    *find posterior mean of probability across all saved iterations;
    pred = prob[:,];
    print pred;
    quit;

Note that we could continue with the final step, declare actual 0/1 predictions based on these probabilities, and then calculate the proportion of 1s, as PREDDIST does. In our example, we stopped at the point of the predicted probability and focused on the posterior mean of that distribution.

Do note that this inverse link step was needed due to the analysis being a logistic regression. If you were performing any other analysis, you would have to correctly undo the link function that was used. In the case of the link being identity, no additional inverse link work would be needed. Also, this generated a posterior predictive distribution for the average response.
If you were wanting to create the posterior for an individual prediction, say in the case of linear regression, there would be an additional step needed to generate the error element to add to the prediction. To do this, you would recall that the variance parameter was also saved in the chain. You would, at each iteration, draw a value from a normal distribution with a zero mean and a variance equal to the value of the variance component at that iteration. Repeat this to get a vector of errors. These random errors would then be added to the prediction of the observation at each respective iteration. For additional information about the PREDDIST statement within PROC MCMC, check out this link to the documentation. For more information about PROC IML, check out this link to the documentation and also The Do Loop blog written by Rick Wicklin.
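The extra step for an individual prediction in linear regression, described near the end of the scoring post, might look like the following in PROC IML. This is only a sketch under stated assumptions: the saved chain, its column layout (beta0, beta1, then the variance), and the new observations are all made up, and the seed is arbitrary. With a real chain, the matrices would be read from the OUTPOST= data set as in the logistic example.

```sas
proc iml;
   call randseed(27513);

   /* Made-up saved chain: one row per saved iteration, columns beta0 beta1 sigma2 */
   chain  = {1.0 2.0 0.25,
             1.1 1.9 0.30,
             0.9 2.1 0.20};
   parms  = chain[, 1:2];            /* regression coefficients */
   sigma2 = chain[, 3];              /* variance parameter at each iteration */

   /* Made-up new observations (one predictor) with a leading column of 1s */
   x      = {3, 5};
   newobs = j(nrow(x), 1, 1) || x;

   /* mean prediction: rows = iterations, columns = new observations */
   mu = parms * newobs`;

   /* standard normal draws, scaled row by row to the variance at that iteration */
   z = j(nrow(mu), ncol(mu), .);
   call randgen(z, "Normal");
   err = z # sqrt(sigma2);

   /* posterior predictive draws for an individual response */
   ypred = mu + err;

   /* posterior mean for each new observation */
   pred = ypred[:,];
   print pred;
quit;
```

Scaling standard normal draws by the square root of the saved variance is equivalent to drawing each error from a normal distribution with zero mean and that iteration's variance. It is used here so that one RANDGEN call fills the whole matrix.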

DannyModlin · 02-21-2024

With the increase in popularity of Bayesian analysis, more and more areas of statistical analyses are starting to explore the possibilities and advantages of incorporating Bayesian techniques. In this post, we will discuss some tips and techniques to bring Bayesian analysis to Time Series.

Let's begin our discussion of Bayesian time series structure with autoregressive elements. Time series analysis is no stranger to the value of a series (Y) at the current time point being dependent on a weighted combination of parameters (phi) and the value of the series at some past time point. To model this element, the lag( ) function was typically used in a preprocessing move prior to analysis. This ultimately put the work on you to create variables within the data set that contained lagged values of the response series.

In PROC MCMC, we have access to lead and lagged values for random variables that are indexed. What exactly do I mean by indexed? Two types of random variables are indexed in the MCMC procedure. The first is the response variable, our time series. In the MODEL statement, this variable is indexed by observations. The second is a random variable placed in the RANDOM statement. This variable is indexed by the SUBJECT= option.

To access both the lead and lagged values of these indexed variables, we state the variable name followed by either .L# or .N# to access lagged or next values respectively. For example, if our response series was named SALES and placed on the MODEL statement, SALES.L2 would represent the value of the SALES series two time points prior and SALES.N3 would represent the value of the SALES series three time points into the future. This greatly aids in the creation of models that would benefit from lagged elements such as a third-order autoregressive model (AR(3)).
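For reference, the AR(3) structure mentioned above can be written in the notation of this post, with Y for the series and phi for the weights. The normal error term and its variance symbol are assumptions added for completeness.

```latex
Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \phi_3 Y_{t-3} + \varepsilon_t,
\qquad \varepsilon_t \sim N(0, \sigma^2)
```

In the PROC MCMC notation, the three lagged terms correspond to Y.L1, Y.L2, and Y.L3.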
When lagged values are utilized within a model, we do have an issue that also needs to be addressed. To forecast the total sales at time position 5 in the series, we would include the values from time positions 4, 3, and 2. This is not a problem because those values are found within our response series. What happens if we wanted to forecast position 3 in the series? We have the total sales of time positions 2 and 1. You might now see the problem that we have. Do we simply drop these types of observations due to missing information in the model? No! In PROC MCMC, we can build in a way to account for the initial states of lagged variables that extend beyond our known data. What do I mean by initial states? As we approach the start of the time series, we run out of information for our lagged time values. This was a problem before the ICOND= option. In the MODEL statement or the RANDOM statement, we can account for these initial states (or initial conditions). In our example, we can include ICOND=(alpha beta gamma) in the MODEL statement. These initial states are treated as parameters in the problem, and we place priors against them just like any other parameter in our model. Three items are listed due to the maximum number of initial states needed being three when we are at the very beginning of the series. Using this technique, we do not lose data at the front from missing values. Now that we see how to use indexing and ICOND to our advantage in the MODEL statement, how does this help us in the RANDOM statement? Performing a Bayesian time series analysis also enables you to use a dynamic linear model setup. This setup is a very general type of nonstationary time series model. With this, you can create models with time-varying coefficients where you can explore stochastic shifts in regression parameters. To do this, we use random-effects models that specify time dependence between successive parameter values in the form of smoothness priors. 
The best application of this structure is for seasonality components. As you recall, seasonality components are deviations from the trend. These seasonality values sum to zero across the length of the seasonal period. For example, let's look at sales data that has been accumulated to quarterly averages. Upon inspection, we determine that there is a seasonal pattern existing across the quarterly values. From a deterministic approach, these quarterly seasonal component values will sum to zero across four consecutive time points. This is due to the period of quarterly data being four in length. Taking a more dynamic approach, this sum is zero in the mean of the distribution with an additional variability. The additional benefit of using the dynamic approach to this seasonal component is that we can now use the lag and next elements as well as initial conditions during the modeling process. Because we know that the sum of all the seasonal components should add to zero in the mean, we can model the seasonal component at the current time point as the sum of the negative previous seasonal components. There are other components that we can entertain such as trends that follow a random walk with drift where this drift could follow a first-order autoregressive process. Let’s look at a code example. 

    proc mcmc data=UKcoal nmc=100000 seed=123456 outpost=posterior propcov=quanew;
       parms alpha0;
       parms mu0;
       parms s0 s1 s2;
       parms theta1;
       parms theta2;
       parms theta3;
       parms theta4;
       parms theta_phi;
       parms phi;
       prior phi~normal(0,var=exp(theta_phi));
       prior alpha0~normal(0,var=theta2);
       prior mu0~normal(0,var=100);
       prior s:~normal(0,var=theta3);
       prior theta:~igamma(shape = 3/10, scale = 10/3);
       random alpha~normal(phi*alpha.l1,var=exp(theta2)) subject=t icond=(alpha0);
       random s~normal(-s.l1-s.l2-s.l3,var=exp(theta3)) subject=quarter icond=(s2 s1 s0) monitor=(s);
       random mu~normal(mu.l1 + alpha.l1,var=exp(theta1)) subject=t icond=(mu0);
       x=mu + s;
       model c~normal(x,var=exp(theta4));
    run;

The nine PARMS statements reference the model parameters. These are the ICOND parameters, the variance parameters, and the coefficient parameters. The three seasonal parameters are placed in the same block. This blocking is done to improve the mixing. The PRIOR statements place normal priors on most parameters and inverse gamma priors on the variance components. Sampling the variance parameters on the logarithmic scale is also a way to improve the mixing.

The RANDOM statements are each indexed through the SUBJECT= option (t for alpha and mu, quarter for s), and each contains lagged elements that build the random walk with drift and the seasonal effect. The ICOND option ensures that no observations are ignored due to a lack of information for the lagged effects. Note that adding a MONITOR option to the RANDOM lines would allow you to also have the lagged items presented within the diagnostic and posterior summary output.

The statement prior to the model line defines the value of the response at time t in terms of the other parameters of the problem. The MODEL statement specifies that the response variable has a normal distribution. The PREDDIST statement can be included if posterior predictive distributions would be needed from a Bayesian scoring perspective.
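Read term by term, the RANDOM and MODEL statements in the code above correspond to the following set of distributions. This is a transcription of the code as shown rather than an independent derivation. Every index is written as t for readability, which is an assumption, since the code indexes the seasonal term s by quarter. The second argument of each normal distribution is a variance.

```latex
\begin{aligned}
\alpha_t &\sim N\!\left(\phi\,\alpha_{t-1},\; e^{\theta_2}\right)
  && \text{drift, first-order autoregressive} \\
\mu_t &\sim N\!\left(\mu_{t-1} + \alpha_{t-1},\; e^{\theta_1}\right)
  && \text{trend, random walk with drift} \\
s_t &\sim N\!\left(-s_{t-1} - s_{t-2} - s_{t-3},\; e^{\theta_3}\right)
  && \text{seasonal component} \\
c_t &\sim N\!\left(\mu_t + s_t,\; e^{\theta_4}\right)
  && \text{response}
\end{aligned}
```

The initial conditions alpha0, mu0, and (s2, s1, s0) stand in for the lagged values that would fall before the first observation.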
For more information concerning Bayesian time series, see the SAS Support sample about Bayesian Time Series or the paper written by Aric LaBarr, “The Bayesians are Coming! The Bayesians are Coming! The Bayesians are Coming to Time Series!”

DannyModlin · ‎12-15-2023

Within SAS Studio, many users have already taken advantage of the ability to customize SAS Studio tasks to provide a starter for themselves and their coworkers. The time taken to build such custom tasks has allowed others to get started using analytical procedures within SAS without having to know all the details about coding. SAS Studio now has another means of customization available, Custom Steps. This post continues from part 1, picking up the process of creating our own SAS Studio Custom Step that can then be shared with others. We will add the OPTIONS and OUTPUT tabs to our GUI and then work with the SAS code associated with our Custom Step.

STEP 4: Creating the Step in Designer (Continued)

Starting from SAS Drive, use the applications menu to proceed to Develop Code and Flows. From the left pane, access the Steps area and click on Shared. We should see our KCLUSExample from before listed. Double-click to open our Custom Step.

In the Control Library, click Add Page. With this new page selected, set ID to options and Label to OPTIONS in the properties panel.

Add two Check Box items to the page. Set the following properties:
Check Box 1 - Set ID to imputeint. Make "Replace interval missing values with mean" the label.
Check Box 2 - Set ID to imputecat. Make "Replace nominal missing values with mode" the label.

Add a radio button group to the page. Set ID to radiocluster and Label to Number of clusters: . Click Add Many to add several items to the radio list. In the box that appears, enter Specify number of clusters and Calculate number of clusters, where each option appears on its own line. Click OK. If the original default radio button is still present in the list, click the box in front of that option (in properties) and click the trash button. Click the Map Values button to expand the options within the radio list. A new column named Value appears. On each line, change Value to specify and calculate respectively. Set the default radio button to Specify number of clusters.

Add a text or numeric field to the page. Set ID to numcluster and Label to Maximum number of clusters. Change Type to Numeric. Set the default value to 4, the minimum value to 2, and the maximum value to 6.

Sometimes we do not want something to appear on the GUI unless another option has been selected. For example, should the text/numeric object appear if we pick Calculate number of clusters? No. The Dependencies area of the properties allows us to specify when the object appears or disappears. In our case, we only want the text/numeric object to appear when we pick the Specify radio button. Expand the Dependencies area within the text/numeric object. Under Visibility, type ["$radiocluster","=","specify"]. Your OPTIONS pane should now look like this.

Now let's move on to the OUTPUT tab of the Custom Step. In the Control Library, click Add Page. With this new page selected, set ID to output and Label to OUTPUT. Add an output table to the page. Set ID to outputdata and Label to Select location of output table. Select the box for Required. Your OUTPUT pane should now look like this.

STEP 5: Adding the SAS Code

With the GUI created, we turn our attention to the backing code for the Custom Step. For ease, we will address each section of the code as it is added to the program code area. Recall that we needed to be careful when naming the IDs of each of our objects on the GUI. You will now see these ID names referenced as macro variables within the upcoming code.

Click on the Program tab just above the Control Library of the Step Designer. You will see a line that says /* SAS templated code goes here */. This is a comment and can be left or replaced with our code.
The first code block we will add creates two macro variables that contain the list of categorical inputs and continuous inputs selected by the user. The macro %DO loop shows how you can step through the list of selected variables one by one. This can be beneficial if you need to address aspects of these variables individually. The macro variable &catvars_1_name references the first categorical variable chosen by the user in the column selector. The &i and &j macro variables count through the list until reaching the number of variables selected, which is stored in the macro variables &catvars_count and &intvars_count. There are other ways to reference the entire list of selected variables more directly, but this approach works well if there are other things you need to do to the variables before analysis.

*Creates the lists of categorical and interval inputs;
%let catvarslist=;
%let intvarslist=;
%macro predictors;
   %local i j; /* keep the loop counters local to this macro */
   %do i=1 %to &catvars_count;
      %let catvarslist = &catvarslist &&catvars_&i._name;
   %end;
   %do j=1 %to &intvars_count;
      %let intvarslist = &intvarslist &&intvars_&j._name;
   %end;
%mend;
%predictors;

Next, let's start building the PROC KCLUS statement. The opening %let starts the PROC KCLUS statement and references the selected data set and the distance options. These are the same regardless of the other options the user selects on the GUI. The macro then adds options to this original PROC KCLUS statement based upon user selections. The macro variables &imputeint and &imputecat add options for imputation of the mean or mode as selected. The macro variable &radiocluster, when set to specify, adds the maximum number of clusters entered by the user, &numcluster. When &radiocluster is set to calculate, the option for the ABC calculation is added to the PROC statement.

*Creates the PROC KCLUS statement;
%let proc=proc kclus data=&inputdata distance=Euclidean distancenom=binary ;
%macro procline;
   %if (&imputeint) %then %do;
      %let proc = &proc impute=mean;
   %end;
   %if (&imputecat) %then %do;
      %let proc = &proc imputenom=mode;
   %end;
   %if (&radiocluster = specify) %then %do;
      %let proc = &proc maxclusters=&numcluster;
   %end;
   %if (&radiocluster = calculate) %then %do;
      %let proc = &proc noc=abc(minclusters=2) maxclusters=10;
   %end;
%mend;
%procline;

Add this code block after the previously added code for the input variables.

Now let's use these macros to establish the PROC KCLUS code we want to run. The macro variable &proc contains the completed PROC statement that begins KCLUS. The two INPUT statements reference the macro variables holding the interval list and the categorical list. The final statement, SCORE, creates the output data set that is referenced by the macro variable &outputdata.

&proc;
   input &intvarslist / level=interval;
   input &catvarslist / level=nominal;
   score out=&outputdata copyvars=(_all_);
run;

Add this code block after the previously added code for the PROC line generation. The code for our Custom Step for PROC KCLUS is complete. Be sure to save if you have not already done so. Recall that your Custom Step should appear in the Shared list when SAS Steps is selected on the left pane of SAS Studio.

STEP 6: Running Your Custom Step

To run your Custom Step, you have two options. You can right-click on the step and select "Open in a tab". This opens your step in a view very similar to a SAS Studio task. Alternatively, click New and then select Flow, then click and drag your Custom Step from the left pane into the flow. After adding and connecting data sets to and from the step, select the step in the flow and complete the GUI that appears at the bottom of your screen. When using a flow, the objects responsible for selecting the input and output data sets do not appear in the GUI.
They are completed when you connect data sets to the squares on each side of the step within the flow. Input data sets will connect on the left side and output data sets will connect to the right. Congratulations! You have created your first Custom Step. From here, you can now add more things to the GUI and code to expand the use of this step as needed. If you would like to investigate Custom Steps further, please visit here to access more information. Also check out the SAS Custom Step repository. Find more articles from SAS Global Enablement and Learning here.
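Before wiring the template into the Step Designer, it can help to dry-run the statement assembly from Step 5 in an ordinary SAS program. The sketch below is one way to do that under stated assumptions: the %let values are hypothetical stand-ins for what the GUI would supply, and the 1/0 values for the check boxes are an assumption consistent with how the %IF tests are written. Nothing is executed against data; the assembled statement is only written to the log.

```sas
/* Hypothetical stand-ins for the values the Custom Step GUI would supply. */
%let inputdata    = work.cars;
%let imputeint    = 1;         /* assumed: check box selected */
%let imputecat    = 0;         /* assumed: check box cleared  */
%let radiocluster = specify;
%let numcluster   = 4;

/* Same assembly logic as the Step 5 template. */
%let proc=proc kclus data=&inputdata distance=Euclidean distancenom=binary;
%macro procline;
   %if (&imputeint) %then %let proc = &proc impute=mean;
   %if (&imputecat) %then %let proc = &proc imputenom=mode;
   %if (&radiocluster = specify) %then
      %let proc = &proc maxclusters=&numcluster;
   %if (&radiocluster = calculate) %then
      %let proc = &proc noc=abc(minclusters=2) maxclusters=10;
%mend procline;
%procline;

/* Write the assembled statement to the log instead of running it. */
%put &=proc;
```

With these stand-in values, the log should show a statement that includes impute=mean and maxclusters=4 but not imputenom=mode. Changing radiocluster to calculate is a quick way to confirm the other branch.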

DannyModlin · ‎11-14-2023

Within SAS Studio, many users have already taken advantage of the ability to customize SAS Studio tasks to provide a starter for themselves and their coworkers. The time taken to build such custom tasks has allowed others to get started using analytical procedures within SAS without having to know all the details about coding. SAS Studio now has another means of customization available, Custom Steps. In this first part of a multipart blog, we will begin the process of creating our own SAS Studio Custom Step that can then be shared with others.

STEP 1: Getting Started

So, you have decided to create your own Custom Step within SAS Studio. What should you do first? Find a block of code, DATA step or PROC, that is used very often by you and your colleagues. This block of code should have many aspects that are commonly used but also some additional options that could be taken advantage of when the problem calls for them. Starting with actual working code is much easier than starting from an open-ended, scenario-based structure. With the code there, we can focus our attention on the specifics of the Custom Step.

In our example, we will select a block of code that performs k-means clustering. This code uses PROC KCLUS. PROC KCLUS performs k-means, k-modes, or k-prototypes clustering on the observations using the provided input variables. This clustering is based on distances that are computed from quantitative or qualitative variables (or both).

proc kclus data=work.cars distance=Euclidean distancenom=binary impute=mean imputenom=mode noc=abc(minclusters=2) maxclusters=10;
   input MPG_City MPG_Highway Horsepower / level=interval;
   input Type Origin / level=nominal;
   score out=work.clustered copyvars=(_all_);
run;

STEP 2: Generalizing and Expanding

Now that we have working code, let's decide where and how we can generalize this code to make it applicable to various situations. Begin by asking yourself, what items in the current code would change depending on the problem at hand? In our case, the following would change: the name of the input data set (work.cars), the lists of variables to use (MPG_City, MPG_Highway, Horsepower for interval, Type and Origin for nominal), and the name of the output data set (work.clustered). If you are familiar with macro variables in SAS, these are items where we could generalize the code with macro variable replacement. The same mentality is helpful when generalizing for Custom Steps.

What about expanding? These are items that are not necessary for the program to execute but could be used when needed. In our code, examples of this are the imputation of missing data (impute=mean imputenom=mode) and the choice of the number of clusters (noc=abc(minclusters=2) maxclusters=10). Yes, there are many other options and suboptions available within PROC KCLUS. Should we attempt to bring them all into our Custom Step? No, definitely not! Why not? When creating a GUI (graphical user interface), including more options and items makes things feel cluttered and makes it difficult to know where some options can be found. We should limit our expanding to the items most commonly used by you or your coworkers. Ask around. What do your coworkers use most often when they run PROC KCLUS?

Here is our KCLUS code with the generalizing noted within braces { } and the expanding noted within angle brackets < >.

proc kclus data={work.cars} distance=Euclidean distancenom=binary <impute=mean imputenom=mode> <noc=abc(minclusters=2) maxclusters=10>;
   input {MPG_City MPG_Highway Horsepower} / level=interval;
   input {Type Origin} / level=nominal;
   score out={work.clustered} copyvars=(_all_);
run;

STEP 3: Designing the GUI

We have our code ready. Now it is time to start thinking about the construction of the GUI for our Custom Step.

Within SAS Studio Step Designer, there are many items we can choose from to create the GUI just as we wish. Which types of items should we use with our Custom Step? The Designer has objects specific to the selection of data sets (both input and output) as well as variable selection objects. For these parts of the GUI, we will simply include them as we need them. But what about the other options (the expanding parts) of our code?

Let's start with the option to impute missing values within our data. Here, we will either impute or not. The answer is a simple yes or no. For this, a check box is the best object to use. Selecting the check box indicates we want to impute, while deselecting it indicates we do not.

What about the selection of the number of clusters? Our options here can be broken into two parts. First, we need to decide whether we want the number of clusters calculated using the ABC statistic or whether we want to specify it ourselves. If we decide to specify, then we need to provide a value for the maximum number of clusters. What object would be good for the first decision? Since we are selecting among only a few options, I would suggest a radio button object. This allows us to provide a list of options from which one will be selected. OK, but what about the ability to provide the actual maximum number of clusters? For this, we will use a text/numeric field object, which allows the value to be typed directly. Many other objects are available in the Step Designer; however, these are the ones that we will need for our Custom Step.

STEP 4: Creating the Step in Designer

Now we are ready to start our work within the Step Designer. The first thing we will assemble is the DATA tab for our Custom Step.
For each object that we drag onto our GUI, we will have a list of properties that we will update on the right side of the screen. These properties control appearance and behaviors for our object. Throughout the process of building the GUI, please be very careful when setting the ID property of each object. You will need to remember exactly what you called an object since this will be its macro name later when we start working with the code creation. Starting from SAS Drive, use the applications menu to proceed to Develop Code and Flows. From the menu bar, click New and select Custom Step. The Custom Step area opens into the Designer with Page 1 appearing. On the properties panel, set the ID of the page to data and the label to DATA. In the Control Library, scroll down to the Data area and drag Input Table to the DATA page. Set ID to inputdata and Label to "Select table for analysis". Leave the check box next to Required selected. Add two Column Selector items to the page. Set the following properties: Column Selector 1 - Set ID to catvars. Make "Select categorical inputs" the label. Under Link to input table, select (inputdata). For Column Type, choose All types. Column Selector 2 - Set ID to intvars. Make "Select continuous inputs" the label. Under Link to input table, select (inputdata). For Column Type, choose Numeric. You have the option to restrict the type of variables that will appear within the column selector using the Column Type property. This will isolate choices so that when a numeric variable is needed then we will not have someone choose a character variable. The Link property creates a dynamic link between the column selector object and the data set that was chosen. When the column selector is opened, the Custom Step will parse the associated data set and list the appropriate variables from that data set. Save the step using the name KCLUSExample. Currently, our Custom Step should look like this when you click the Preview button. 
In part 2 of this blog, we will continue the creation of our Custom Step. We will add our OPTIONS and OUTPUT tabs to the GUI and then add the associated code to the Custom Step. If you would like to investigate Custom Steps further, please visit here to access more information. Also check out the SAS Custom Step repository. Feel free to play around with this Custom Step before viewing part 2 of the blog. Just remember to preserve the Custom Step at this point so we can continue.
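The macro variable replacement mentioned in Step 2 can be sketched in plain SAS before any GUI exists. In the sketch below, the macro variable names (inputdata, intvarslist, catvarslist, outputdata) are illustrative choices rather than anything the Step Designer requires at this stage, and the values simply repeat the ones from the working code in Step 1. It assumes the same environment in which that original PROC KCLUS code runs.

```sas
/* Illustrative macro variables standing in for the generalized items. */
%let inputdata   = work.cars;
%let intvarslist = MPG_City MPG_Highway Horsepower;
%let catvarslist = Type Origin;
%let outputdata  = work.clustered;

/* Same statement as in Step 1, with the braces replaced by macro references. */
proc kclus data=&inputdata distance=Euclidean distancenom=binary
           impute=mean imputenom=mode
           noc=abc(minclusters=2) maxclusters=10;
   input &intvarslist / level=interval;
   input &catvarslist / level=nominal;
   score out=&outputdata copyvars=(_all_);
run;
```

Only the generalized items are parameterized here. The expanding items (imputation and the number of clusters) are left hard-coded, since they depend on conditional logic driven by the GUI selections.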

DannyModlin · ‎10-17-2023

Elasticity models are an important econometric tool for many researchers in marketing. Specifically, researchers are interested in how demand changes in relation to the price of a product or of an alternative product (cross-elasticity). In this blog, we will take two very common price elasticity model approaches and stretch them into the Bayesian environment.

Let's begin with our example. Our data focuses on cigarette sales. Our variables include the quantity sold, the price of the product, and the average income of the region containing the store. One common analysis of price elasticity is the log-log ordinary least squares model. Before we begin our analysis, we take the log of each variable. Taking the logs allows the estimated parameters associated with prices to be interpreted as elasticities. With the logs of the variables taken, we fit a typical ordinary least squares regression with PROC REG.

proc reg data=work.model plots(unpack)=all;
   model logQd=logprice logincome;
   output out=work.diagnostics p=logQdhat r=resid;
run;
quit;

Looking at the results, we can see the estimate of the price elasticity for the cigarettes while accounting for income (in log form).

This OLS model can be easily recreated within PROC MCMC.

proc mcmc data=work.model outpost=cigoutsimple diag=all dic propcov=quanew nbi=5000 ntu=5000 nmc=100000 thin=2 mchistory=brief plots(smooth)=all seed=27513 statistics=all monitor=(_PARMS_);
   parms (beta0 beta1 beta2) 0;
   parms sigma 1;
   prior beta: ~ normal(0,var=1e6);
   prior sigma: ~ igamma(shape=2.001,scale=1.001);
   mu = beta0 + beta1 * logPrice + beta2 * logIncome;
   model logQd ~ normal(mu, var=sigma);
run;
quit;

With the Bayesian approach, you can incorporate any additional (prior) information about the parameters in the model. By adjusting the PRIOR statements in PROC MCMC, you can deviate from the non-informative priors that we used here. Since we used non-informative priors for all parameters, our posterior means will closely match the estimates from the classical approach.

In this case, there is an issue with the regression model. Looking at the residual plot for logPrice, we see a visible curvilinear pattern. Besides being an issue with the assumptions of an OLS model, this is also, in our example, an indication of endogeneity. Our interpretation and use of the results are now in question. One method of dealing with endogeneity is to run a two-stage OLS model. The first stage takes the problematic, or endogenous, input variable (logprice) and regresses it on one or more other variables (instruments) that are correlated with logprice. From this first-stage OLS, we create predicted values of logprice and use these predictions as the input to the second-stage OLS, which returns us to the original model that we tried earlier. This can be performed using PROC SYSLIN. For further details on endogeneity, check out Greene's book (Greene, William H. (1993). Econometric Analysis (3rd ed.). Prentice-Hall.).

proc syslin data = work.diagnostics 2SLS out=syslin_output plots(only unpack)=(QQ RESIDHISTOGRAM);
   endogenous logPrice;
   instruments sales_tax;
   model logQd = logPrice logIncome;
   output r=residuals;
run;
quit;

Here are the results of this two-stage OLS. The estimated own-price elasticity has changed because the endogeneity has been mitigated.

How is this implemented within a Bayesian approach? We will run PROC MCMC twice and incorporate the PREDDIST statement. First, let's run the first-stage OLS. The PREDDIST statement takes the original training data and creates a posterior predictive distribution for each observation (logprice). We save the posterior summary statistics into a new SAS data set.

proc mcmc data=work.model outpost=cigout2stage diag=all dic propcov=quanew nbi=5000 ntu=5000 nmc=100000 thin=2 mchistory=brief plots(smooth)=all seed=27513 statistics=all monitor=(_PARMS_);
   parms (alpha0 alpha1) 0;
   parms (sigma2) 1;
   prior alpha: ~ normal(0,var=1e6);
   prior sigma: ~ igamma(shape=2.001,scale=1.001);
   mu = alpha0 + alpha1 * sales_tax;
   model logPrice ~ normal(mu, var=sigma2);
   preddist outpred=pred stats=summary;
   ods output predsummaries=prediction_summaries;
run;
quit;

The original data and the posterior summary statistics data sets match up nicely by observation. Using a simple merge, we combine these two data sets.

data newmodel;
   merge work.model work.prediction_summaries;
run;

From the posterior summary statistics, we can choose either the posterior mean or the posterior median as our point estimate for logprice in the second-stage OLS. In our example, we use the posterior mean.

proc mcmc data=work.newmodel outpost=cigout2stagealt diag=all dic propcov=quanew nbi=5000 ntu=5000 nmc=100000 thin=2 mchistory=brief plots(smooth)=all seed=27513 statistics=all monitor=(_PARMS_);
   parms (beta0 beta1 beta2) 0;
   parms (sigma) 1;
   prior beta: ~ normal(0,var=1e6);
   prior sigma: ~ igamma(shape=2.001,scale=1.001);
   mu = beta0 + beta1 * mean + beta2 * logIncome;
   model logQd ~ normal(mu, var=sigma);
run;
quit;

Again, we are using non-informative priors for this example. If we did have additional information about the parameters, we could have made the appropriate adjustments to the PRIOR statements in the code. Since we are using non-informative priors, our posterior means for the second-stage OLS are comparable to the solution from PROC SYSLIN.

Using PROC MCMC allows one to apply powerful Bayesian techniques to elasticity models. This can lead to a better understanding of the price elasticities, since we can now incorporate prior information and end with posterior distributions of the parameters.
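To make the remark about adjusting the PRIOR statements concrete, here is a minimal sketch of the second-stage model with an informative prior on the price coefficient. The prior mean of -0.5 and variance of 0.04 are made-up placeholders chosen only for illustration, not values from this analysis, and the output data set name cigout2stageinf is likewise hypothetical. Everything else mirrors the second-stage code above, with fewer display options.

```sas
/* Sketch only: the prior mean and variance for beta1 are placeholder
   values for illustration, not estimates from the post. */
proc mcmc data=work.newmodel outpost=cigout2stageinf
          nbi=5000 nmc=100000 thin=2 seed=27513;
   parms (beta0 beta1 beta2) 0;
   parms (sigma) 1;
   /* Intercept and income coefficient keep the diffuse prior. */
   prior beta0 beta2 ~ normal(0, var=1e6);
   /* Informative prior centered on a hypothetical elasticity. */
   prior beta1 ~ normal(-0.5, var=0.04);
   prior sigma ~ igamma(shape=2.001, scale=1.001);
   mu = beta0 + beta1 * mean + beta2 * logIncome;
   model logQd ~ normal(mu, var=sigma);
run;
```

With a tight prior like this, the posterior for beta1 is pulled toward the prior mean to a degree that depends on how much information the data carry, so comparing it with the diffuse-prior run shows how sensitive the elasticity estimate is to the prior.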
If you would like more information about price elasticity models, here is a suggested article to get you started. Find more articles from SAS Global Enablement and Learning here.

DannyModlin · ‎09-29-2023

Do you enjoy running statistical analyses without having to know the exact SAS code? SAS Studio tasks give users a nice reminder of the tasks within SAS Enterprise Guide: Open the task and make a few selections from those provided and click the RUN icon. No code knowledge is needed. But what if you wanted to customize these tasks? The purpose of this blog is to compare custom tasks and custom steps in SAS Studio.

Maybe your workplace would benefit from a task that is more specific to the needs of the company. That is the purpose of the custom task within SAS Studio. Using XML coding, you can create your own SAS Studio task and share it with colleagues.

Creating custom SAS Studio tasks reminds me of writing a program within Visual Basic. Step 1, you declare all the objects that will appear within the task. Step 2, you design the user interface (UI) using containers to generate the different tabs within the task. Step 3, you link the UI items to the program code using a macro style structure that grabs the elements selected from the UI. Optional step 4, you enhance your UI by establishing dependencies that will hide or reveal items as you make specific selections.

From Custom Tasks to Custom Step

With all this XML code to be written, there are many chances for mistakes that increase frustration and creation time. Let's look at custom steps within SAS Studio and compare them to custom tasks.

The largest difference, in development, between a custom task and a custom step is the amount of code you have to write. Rather than declaring the objects that will appear within the step, you simply use the Step Designer to create the tabs and place elements on these tabs. For each object that is placed in the Step Designer, you provide its properties. This is similar to functionality in SAS Enterprise Miner and to Model Studio in SAS Viya. Each object added requires an ID property. Take care when naming the ID property, as that is the reference name used in the program code. Dependencies, such as object visibility when certain selections are made, can also be established within the properties of an object.

The largest similarity between a custom task and a custom step is the program code. Much like tasks, the custom step uses a macro style structure linking the items from the UI to the backing program code.

When creating a custom step, you do not have to type the UI code directly. As you provide items and properties in the designer, the UI code is written for you. However, one neat trick is that if you are given JSON UI code and copy and paste it into the UI area, the designer items will be created for you.

Some objects generate macro variables that can assist you in the creation of the program code. For example, when you place a column selector in the custom step, a macro variable is created that tracks the number of variables selected by the user. This can be of use in loops within the program code. For more information about these macro variables and how to use them, please reference the SAS documentation.

Functionally, custom steps can be used either in a flow or in stand-alone mode. Stand-alone mode is similar to a task, where the step opens within a tab and you make all your selections. Within a flow, the structure matches the style of a process flow in SAS Enterprise Guide.

As you can see, there are advantages, in production and use, to using the custom step compared to the custom task. If you have taken the free e-learning course on creating a custom task, you will find that the transition to a custom step is pretty easy.

DannyModlin · ‎11-30-2020

Greetings. To choose which variance structure to use, include the TYPE= option in the RANDOM statement. For example, RANDOM .... / TYPE=UN requests an unstructured covariance matrix. Hope this helps.
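As a minimal sketch of where the option goes, assuming a hypothetical data set work.study with a response y, a continuous predictor time, and a grouping variable subject (none of these names come from the original question):

```sas
/* Hypothetical data set and variable names, for illustration only. */
proc mixed data=work.study;
   class subject;
   model y = time / solution;
   /* Random intercept and slope with an unstructured covariance matrix. */
   random intercept time / subject=subject type=un;
run;
```

Replacing type=un with another structure, such as type=vc, changes only the covariance structure assumed for the random effects.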

DannyModlin · ‎11-30-2020

Greetings. Holdout data is used to see how well a model generalizes and performs on data that it was not trained on. This helps guard against overfitting and against a false sense of how accurately a model is performing. Hope this helps.

DannyModlin · ‎11-30-2020

Greetings. When using linear mixed models, it is possible to generate the same marginal model with either the RANDOM statement or the REPEATED statement. You would just have to check the V matrix created by both structures. However, when you leave normality (generalized linear mixed models), the two approaches will not yield the same marginal model. Hope this helps.
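One familiar case of this equivalence is a random intercept versus a compound symmetry structure. The sketch below assumes a hypothetical data set work.study with a response y, a continuous predictor time, and a grouping variable subject. The V option on the RANDOM statement and the R option on the REPEATED statement print the covariance block for the first subject so the two can be compared.

```sas
/* Hypothetical data set and variable names, for illustration only. */

/* Random intercept: V = (intercept variance) * J + (residual variance) * I. */
proc mixed data=work.study;
   class subject;
   model y = time / solution;
   random intercept / subject=subject v;
run;

/* Compound symmetry through REPEATED: with no RANDOM statement, V equals R. */
proc mixed data=work.study;
   class subject;
   model y = time / solution;
   repeated / subject=subject type=cs r;
run;
```

The two printed blocks should agree whenever the estimated between-subject variance is nonnegative, which is the situation where the two specifications describe the same marginal model.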


Confounding What and How

PROC BGLIMM: The Smooth Transition to Bayesian Analysis

Re: Bayesian Analysis and PROC IML: What's the Score?

Bayesian Analysis and PROC IML: What's the Score?

It's About "Time" for Bayesian

Taking the Next Step in SAS Studio Customization - Part 2

Taking the Next Step in SAS Studio Customization - Part 1

Stretching Price Elasticity Models into Bayesian

Customizing SAS Studio Taking Tasks to the Next Step

Re: proc mixed classifying variance structure

Re: In Pipeline comparision tab, what is the use of holdout data set


Re: GENMOD and repeated measurement
