This article provides a quick summary of Bayesian Additive Regression Trees (BART) models. We discuss the mathematical model, the regularization priors, and the Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm.
Decision trees are a nonparametric supervised learning method used for both classification and regression tasks. A classification tree models a categorical response, and a regression tree models a continuous response. Ensemble-of-trees methods, such as gradient boosting and random forests, are now widely used for both regression and classification problems. Another ensemble method, Bayesian additive regression trees (BART) by Chipman et al. (2010), has gained momentum in recent years because of its flexibility in dealing with interactions and nonlinear effects. It can be applied to both regression and classification problems and yields competitive results compared to other predictive models.
Bayesian Additive Regression Trees (BART)
The BART model is a flexible Bayesian nonparametric approach to regression that consists of a sum of multiple decision trees. The model uses regularization to constrain each tree to be a weak learner, limiting its contribution to the ensemble prediction.
The model is fitted using a Bayesian backfitting Markov chain Monte Carlo (MCMC) algorithm that generates posterior samples of the sum-of-trees ensemble. The BART model starts with a fixed number of trees and iteratively updates them to draw many samples of the ensemble. The final model makes predictions by averaging the predictions across all saved posterior samples.
The Model
Consider a prediction problem with a continuous response (target). Given data of size n, with a continuous target Y and p covariates X = (x_1, x_2, ..., x_p), BART estimates a function f from a model of the form Y = f(X) + ε, where ε ~ N(0, σ²).
To estimate f(X), a sum of regression trees is specified as

f(x) = ∑_{j=1}^{m} g(x; T_j, M_j)    (Equation 1)

In Equation (1), T_j is the j-th binary tree structure and M_j = {µ_{1j}, . . ., µ_{k_j j}} is the vector of terminal-node parameters associated with T_j, where k_j is the number of leaves of tree j. The constant m represents the number of trees.
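To make the setup concrete, here is a minimal sketch that simulates data of this form in Python. The particular choice of f, the sample size, and the noise level are hypothetical and serve only to illustrate the model Y = f(X) + ε.

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 500, 3                            # sample size and number of covariates
X = rng.uniform(0, 10, size=(n, p))

# A hypothetical nonlinear f with an interaction, chosen only for illustration;
# BART would attempt to recover a function like this from (X, Y) alone.
f = np.sin(X[:, 0]) + 0.5 * X[:, 1] * (X[:, 2] > 5)

sigma = 1.0                              # noise standard deviation
Y = f + rng.normal(0.0, sigma, size=n)   # Y = f(X) + eps, eps ~ N(0, sigma^2)
```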
First, let us quickly revisit the basic structure of a decision tree model. A decision tree is read from the top down, starting at the root node. Each internal node represents a split based on the value of one of the inputs. An input can appear in any number of splits throughout the tree. A case moves down the branch that matches its input value.
In a binary tree with interval inputs, each internal node is a simple inequality. A case moves left if the inequality is true and right otherwise. The terminal nodes of the tree are called leaves, and they represent the predicted target. All cases reaching a particular leaf are given the same predicted value. The leaves of the decision tree partition the input space into rectilinear regions, so the fitted regression tree model is a multivariate step function. A step function is highly flexible and is capable of modeling nonlinear trends. To simplify the understanding of the sum-of-trees model, consider an example with 2 trees and 3 inputs, that is, m = 2 and p = 3. Let the structure of the 2 trees be as shown in Figure 1.
Figure 1
In this example, the final prediction is calculated by adding up the leaf values from all m (= 2) trees. So, to predict the value of Y for an i-th case whose X_1, X_2, and X_3 values are 0.59, 6.5, and 4.9, respectively, we apply the sum-of-trees model and allocate to case i a sum of leaf parameters. For the given values of the X's, regression tree 1 yields the predicted value µ_{31} and regression tree 2 yields µ_{12}. Hence E(Y_i | X_i) = µ_{31} + µ_{12}. Note that the predicted value allocated to the i-th case is the sum of the leaf parameters across the trees rather than their mean. This is because BART fits each regression tree function g(X; T_j, M_j) to a partial residual that leaves that tree out of the fit, a mechanism discussed shortly.
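The leaf-value bookkeeping can be sketched in a few lines of Python. Because Figure 1 is not reproduced in code here, the split rules below are hypothetical stand-ins; the point is only that the prediction for a case is the sum of one leaf parameter per tree. Under these made-up splits, the case x = (0.59, 6.5, 4.9) lands in leaf 3 of tree 1 and leaf 1 of tree 2, matching the narrative above.

```python
# Hypothetical leaf parameters mu_kj (k = leaf, j = tree), for illustration only
mu = {(3, 1): 2.3, (1, 1): 0.8, (2, 1): 1.1,    # tree 1 leaves
      (1, 2): -0.4, (2, 2): 0.9}                # tree 2 leaves

def tree1(x):
    # Hypothetical splits that route x = (0.59, 6.5, 4.9) to leaf mu_{31}
    if x[0] < 0.5:
        return mu[(1, 1)]
    return mu[(2, 1)] if x[1] < 6.0 else mu[(3, 1)]

def tree2(x):
    # Hypothetical split that routes the same case to leaf mu_{12}
    return mu[(1, 2)] if x[2] < 5.0 else mu[(2, 2)]

x_i = (0.59, 6.5, 4.9)
print(tree1(x_i) + tree2(x_i))   # E(Y_i | X_i) = mu_{31} + mu_{12} = 1.9
```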
Sampling the posterior: Bayesian Backfitting Markov Chain Monte Carlo (MCMC) algorithm
The Markov chain Monte Carlo (MCMC) method is a general simulation method for sampling from posterior distributions and computing posterior quantities of interest. MCMC methods sample successively from a target distribution, with each sample depending on the previous one, hence the notion of the Markov chain. The backfitting procedure is a modular way of fitting an additive model: it cycles through the predictors, replacing each current function estimate with a new function derived from smoothing a partial residual on each predictor.

Given a training data set, a Bayesian backfitting MCMC algorithm is used to draw samples from the posterior distribution (Chipman, George, and McCulloch 2010). This algorithm is a form of Gibbs sampling. Let T_{-j} denote the tree structures of the m − 1 trees in the ensemble, excluding the j-th tree, and let M_{-j} denote the corresponding sets of leaf parameters. As described by Chipman et al. (2010), the draws of the tree structure T_j and leaf parameters M_j are simplified by observing that their conditional distribution depends on T_{-j}, M_{-j}, and Y only through the partial residuals of the fit that excludes the j-th tree. Let R_j = Y − ∑_{k≠j} g(X; T_k, M_k) denote this partial residual. You then obtain the samples for T_j and M_j by taking successive draws from
T_j | R_j, σ²
M_j | T_j, R_j, σ²
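To see how these two conditionals drive the sampler, here is a deliberately simplified, runnable Python sketch of one backfitting sweep. It is not the SAS implementation: to keep it short, every tree is frozen as a single root node (so each M_j is one scalar and the T_j draw is skipped), and the hyperparameter values are arbitrary. The conjugate normal draw for the leaf mean and the scaled inverse chi-square draw for σ² follow the standard forms.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 3
Y = rng.normal(2.0, 1.0, size=n)   # toy response
mu = np.zeros(m)                   # leaf parameter of each (root-node) tree
sigma2 = 1.0                       # current draw of the error variance
sigma_mu2 = 0.5                    # prior variance of each leaf parameter (arbitrary)
nu, lam = 3.0, 1.0                 # hyperparameters of the sigma^2 prior (arbitrary)

for it in range(200):
    for j in range(m):
        # Partial residual R_j: remove the fit of every tree except tree j
        R_j = Y - (mu.sum() - mu[j])
        # Draw M_j | T_j, R_j, sigma2: conjugate normal update of the leaf mean.
        # (The draw of T_j | R_j, sigma2 via Metropolis-Hastings is omitted here.)
        prec = n / sigma2 + 1.0 / sigma_mu2
        mean = (R_j.sum() / sigma2) / prec
        mu[j] = rng.normal(mean, np.sqrt(1.0 / prec))
    # Draw sigma2 from its scaled inverse chi-square full conditional
    resid = Y - mu.sum()
    sigma2 = (nu * lam + (resid ** 2).sum()) / rng.chisquare(nu + n)

print(mu.sum(), sigma2)   # the summed leaf values should sit near mean(Y), about 2
```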
The draws of the tree structure T_j are obtained using the Metropolis-Hastings (MH) sampling algorithm described by Chipman et al. (1998). This algorithm considers four operations that modify the tree structure: splitting a terminal node, pruning a pair of terminal nodes, changing the splitting rule of a nonterminal node, and swapping a splitting rule between a parent node and a child node. The BART procedure in SAS Viya considers only the splitting (grow) and pruning operations. In the growing process, a terminal node is randomly selected and split into two new nodes, with the splitting variable and splitting point drawn uniformly. During a prune step, a parent of two terminal (leaf) nodes is randomly chosen, and its child nodes are removed.
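A sketch of the grow/prune proposal mechanism may help. This is an illustrative toy, not SAS code: a tree is a nested dict whose internal nodes hold a splitting variable and a cutpoint, both drawn uniformly, and the MH accept/reject step (which in practice would operate on a copy of the tree) is not shown.

```python
import random

def leaves(t, path=()):
    # Return the paths of all terminal nodes; a leaf is an empty dict
    if "var" not in t:
        return [path]
    return leaves(t["left"], path + ("left",)) + leaves(t["right"], path + ("right",))

def get(t, path):
    for step in path:
        t = t[step]
    return t

def propose(tree, p, rng=random):
    # Randomly pick one of the two moves considered by the BART procedure
    if rng.choice(["grow", "prune"]) == "grow":
        # Grow: split a random terminal node on a uniform variable/cutpoint
        node = get(tree, rng.choice(leaves(tree)))
        node.update(var=rng.randrange(p), cut=rng.random(), left={}, right={})
    else:
        # Prune: pick a parent whose children are both leaves, collapse it
        parents = [pth for pth in {q[:-1] for q in leaves(tree) if q}
                   if "var" not in get(tree, pth + ("left",))
                   and "var" not in get(tree, pth + ("right",))]
        if parents:
            get(tree, rng.choice(parents)).clear()
    return tree

tree = {}                 # start from a single root node, as BART does
for _ in range(10):
    propose(tree, p=3)    # a real sampler would accept/reject each proposal
```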
How about taking a simple example to understand this mechanism? In this example we consider a continuous response variable Y and three inputs X = (x_1, x_2, x_3), and we run the algorithm with three regression trees (m = 3) for 4 iterations. The discussion that follows presents the regression tree structures as you go through each MCMC step. The process begins by initializing the three regression trees to single root nodes, with the parameter of each root node initialized to µ = ȳ/m = ȳ/3.
Figure 2
Now let us see how the tree structures are drawn for each regression tree in the first iteration, that is, after the initialization step. Note that the order in which the trees are processed does not matter, so we may start with any tree. Here we start by determining the first regression tree (T_1, M_1). We calculate the partial residual R_1 that excludes the first tree: R_1 = Y − ∑_{j≠1} g(X; T_j, M_j) = Y − [g(X; T_2, M_2) + g(X; T_3, M_3)] = Y − 2ȳ/3.
The MH algorithm proposes a new tree structure. Let T_1* denote the sampled modification of the tree structure T_1; the algorithm then calculates the probability with which T_1* should be accepted or rejected. If T_1* is accepted, T_1 is updated to T_1*; otherwise nothing changes for T_1. In our example, T_1* was not accepted in the first iteration, so the tree structure remains a single root node (refer to Iteration 1, Tree j = 1 below). Next, a draw is taken for the set of leaf parameters M_1, which the algorithm updates based on the current tree structure T_1. The partial residual R_{j+1} is then updated for sampling the (j+1)-th tree structure T_{j+1} and the (j+1)-th set of leaf parameters M_{j+1}. To determine (T_2, M_2), the algorithm calculates
R_2 = Y − ∑_{j≠2} g(X; T_j, M_j) = Y − [g(X; T_1, M_1) + g(X; T_3, M_3)] = Y − µ̂_1 − ȳ/3, where µ̂_1 is the updated root-node parameter of T_1. Again, the MH algorithm is used to propose a new tree structure T_2* for T_2. The partial residual R_2 is used to calculate the acceptance probability and decide whether T_2* should be accepted or rejected. In our example, T_2* was accepted, so a new tree structure is used in iteration 1 (refer to Iteration 1, Tree j = 2 below), and the leaf parameters M_2 are then updated based on the new tree structure T_2*. Next, R_3 is used to determine a new structure for T_3.
R_3 now takes the form R_3 = Y − [g(X; T_1, M_1) + g(X; T_2, M_2)] = Y − µ̂_1 − g(X; T_2, M_2).
As seen in Figure 3, T_3* was not accepted, so the tree structure of T_3 remains a single root node.
Figure 3
Figure 4
Figure 5
Figure 6
Figures 2 through 6 depict the iterations from initialization through iteration 4 for the three regression trees. Note how the regression trees change at each iteration: each tree may increase or decrease its number of terminal nodes by one. This iterative process runs for a burn-in period (by default, 100 iterations in SAS Model Studio) before the main simulation loop. Burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the effect of the initial values on the posterior inference is minimized. Only the samples from the main simulation loop are saved for prediction and for computing posterior statistics.
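A final sketch shows how the saved samples are turned into predictions. Each kept draw after burn-in defines one sum-of-trees function; the BART prediction is simply the average of those functions at a new input. Here each draw is abstracted as a Python callable, which is a simplification of the stored tree structures.

```python
import numpy as np

def posterior_mean_prediction(x_new, saved_draws):
    # Average the sum-of-trees fit over all saved posterior draws
    return np.mean([f(x_new) for f in saved_draws])

# Toy usage: three "draws", each a constant ensemble for brevity
saved_draws = [lambda x: 1.9, lambda x: 2.1, lambda x: 2.0]
print(posterior_mean_prediction(0.5, saved_draws))   # -> 2.0
```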
Regularization prior
A BART model requires you to specify a prior for the tree structure, the leaf parameters, and the variance parameter σ². The use of a regularization prior enforces the so-called weak-learner property on the individual trees. This approach aligns with the idea that many weak learners combined often perform much better than a single strong model that requires careful tuning in order to perform well.
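As a concrete illustration of the regularization, Chipman et al. (2010) assign each node at depth d a prior probability α(1 + d)^(−β) of being split, with defaults α = 0.95 and β = 2, and shrink the leaf parameters with a N(0, σ_µ²) prior where σ_µ = 0.5/(k√m) after the response is rescaled. The short sketch below simply evaluates these formulas; the default values shown are the ones reported in that paper.

```python
import math

def split_prob(depth, alpha=0.95, beta=2.0):
    # Prior probability that a node at the given depth splits (Chipman et al. 2010)
    return alpha * (1.0 + depth) ** (-beta)

for d in range(4):
    print(d, round(split_prob(d), 3))   # 0.95, 0.238, 0.106, 0.059: deep trees are rare

# Leaf-parameter prior: mu ~ N(0, sigma_mu^2) with sigma_mu = 0.5 / (k * sqrt(m)),
# so each tree's contribution shrinks as the number of trees m grows (k = 2 default)
m, k = 50, 2.0
sigma_mu = 0.5 / (k * math.sqrt(m))
print(sigma_mu)
```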
At this point you should have a fair idea of the BART model and its underlying algorithm, and you may be eager to fit and explore one yourself. In the next post, I will discuss and demonstrate how to train a BART model in SAS Model Studio.
References:
Tan, Y. V., and Roy, J. (2019). "Bayesian Additive Regression Trees and the General BART Model." Statistics in Medicine.
SAS Documentation: "Sampling the Posterior."
Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). "BART: Bayesian Additive Regression Trees." The Annals of Applied Statistics 4:266–298.
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). "Bayesian CART Model Search." Journal of the American Statistical Association 93:935–948.