About DougWielenga

DougWielenga · ‎08-07-2017

The easiest way to view the surrogate rules is to go into the Interactive tree utility once your model has been fit (unless you are already there). To access this functionality, click on the desired Decision Tree node and then click on the ... to the right of Interactive. Once the Interactive utility opens, click on the desired node and then click on View --> Surrogate Rules Detail to see the splitting rule(s) that were created. Let me know if you have any trouble locating it. Hope this helps! Doug

DougWielenga · ‎08-07-2017

By "f-score", are you talking about the traditional F-measure or balanced F-score (F1 score), which is the harmonic mean of precision and recall, or are you referring to an F-test statistic of some sort? SAS Enterprise Miner calculates precision, defined as % true predicted events / (true predicted + false predicted), and recall, defined as the event classification rate. You could likely compute this statistic relatively easily from those two quantities, but I am not aware of anything by that name being generated by SAS Enterprise Miner. Do you have a formula for the f-score you wish to compute? Cordially, Doug
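For readers who do mean the balanced F-score, here is a minimal sketch of the arithmetic in Python, assuming precision and recall are defined as in the post above. The function names and the counts are hypothetical and are not anything SAS Enterprise Miner produces.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when nothing was predicted correctly
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1_score(p, r):.3f}")
```

The same two-line formula could be applied to the precision and recall columns that the Cutoff node reports.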

DougWielenga · ‎08-07-2017

It sounds like a difficult scenario to be sure. As an analyst, I have less of an appreciation for why there might be a requirement for using Java or Groovy. The challenge with Java or PMML is that these languages are far more limited than Base SAS. In addition, using transformations that create better models leads to even more work as you try to translate them into another language. It seems like a lot of manual effort is being put in place to avoid using SAS for scoring. Would it be possible to score in SAS and just upload the scores to the database rather than having to re-architect the process in another, less capable language? Cordially, Doug

DougWielenga · ‎08-07-2017

Should we expect rank ordering to hold good on out of time samples for machine learning classification models? If the model didn't perform reasonably well on out-of-time samples, it would not be a particularly useful model. The expectation, of course, is that model performance on out-of-time samples will not be as good as on the data that was used to train it, but it does provide a benchmark for using the model going forward. SAS Model Manager is designed to apply previously fit models to future data and evaluate the performance. Over time, the model performance is likely to degrade and require a refit. How quickly it degrades, though, is a function of many factors, including how well the training data reflected the population at the time of modeling, how the population and/or external factors have changed, and how well the model actually fit. Should the modeler desire to refit the model, SAS Model Manager can perform that task as well in most situations. In general, performance on the out-of-time sample provides the best evaluation of how useful a particular model is at that time. Hope this helps! Doug

DougWielenga · ‎08-07-2017

If you only have the beginning subscription dates, you will need some additional information to help identify predictors for someone subscribing. Assuming you have such variables, you have several options in how you prepare your data. One approach is to start by setting an observation window and a target window. For example, you might look at the data available for potential subscribers from January through March to predict who would subscribe during May or June. The missing month (April) is intended to give you some time to take action on those people the model identifies as being more likely to subscribe. If you have monthly data available, you can record those variables as lag1_var (end of March), lag2_var (end of February), and lag3_var (end of January) to try to capture changes in behavior that might make someone more likely to respond/subscribe. For more distant time periods, you might average together behaviors (e.g. lag46_var for the average of the variables for October/November/December). You could then see how many of those people who had not subscribed by the end of the observation period ended up subscribing in the 2-month period of May and June. The beauty of this type of approach is that it relies on recent behavior to predict future behavior. Since you are using rolling time periods, you can validate the model's performance at any time by updating the time intervals, and you can score current data to project subscriptions that will occur in the 2-month period starting 30 days later. As new data becomes available, you update the corresponding lag variables so that you can score the newest data. Hope this helps! Doug
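To make the lag-variable idea concrete, here is a small Python sketch, assuming one monthly value per customer ordered from oldest to newest, with the last value falling at the end of the observation window. The function name, the variable name "usage", and the numbers are all hypothetical.

```python
from statistics import mean


def build_lag_features(monthly: list[float], var: str = "var") -> dict[str, float]:
    """Build lag1..lag3 plus an averaged lag46 feature from at least six monthly values.

    monthly[-1] is the most recent month (end of the observation window).
    """
    if len(monthly) < 6:
        raise ValueError("need at least six months of history")
    return {
        f"lag1_{var}": monthly[-1],              # e.g. end of March
        f"lag2_{var}": monthly[-2],              # e.g. end of February
        f"lag3_{var}": monthly[-3],              # e.g. end of January
        f"lag46_{var}": mean(monthly[-6:-3]),    # e.g. average of Oct/Nov/Dec
    }


# Hypothetical monthly usage for one customer, October through March.
usage = [10, 12, 14, 20, 25, 40]
print(build_lag_features(usage, "usage"))
```

Rolling the window forward is then just a matter of appending the newest month and calling the function again, which is what makes rescoring on current data straightforward.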

DougWielenga · ‎08-07-2017

If you have SAS Enterprise Miner, you can incorporate decision weights into the target profile and/or you can choose options in the Decision Tree node that will allow the models to be assessed on just a portion of the data (e.g. the top decile). HPSPLIT does not currently have that functionality but a WEIGHT statement is planned for a future release that would allow you to specify a variable that assigns more weight to the desired target observations. Alternatively, you could try and oversample somewhat to generate a data set with more balance that might generate a more useful model. Hope this helps! Doug
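As a rough illustration of the oversampling suggestion, the Python sketch below keeps every rare event and draws a random sample of the non-events to reach a chosen ratio. This is only one way to rebalance a data set; the function name, the 4-to-1 ratio, and the toy data are assumptions made for the example.

```python
import random


def oversample_events(rows, target_key, event_value, nonevents_per_event, seed=123):
    """Keep all event rows and a random sample of non-event rows.

    The result has at most `nonevents_per_event` non-events for each event.
    """
    events = [r for r in rows if r[target_key] == event_value]
    nonevents = [r for r in rows if r[target_key] != event_value]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    keep = min(len(nonevents), nonevents_per_event * len(events))
    return events + rng.sample(nonevents, keep)


# Toy data: 5 events among 1000 rows (0.5% event rate).
data = [{"id": i, "target": 1 if i < 5 else 0} for i in range(1000)]
balanced = oversample_events(data, "target", 1, nonevents_per_event=4)
print(len(balanced), sum(r["target"] for r in balanced))
```

Predicted probabilities from a model fit to a rebalanced sample reflect the sample's event rate rather than the population's, so they generally need to be adjusted back (for example, through prior probabilities) before they are used for decisions.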

DougWielenga · ‎08-07-2017

From the SAS Enterprise Miner help, the File Import node is designed to convert selected external flat files, spreadsheets, and database tables into a format that Enterprise Miner recognizes as a data source and can use in data mining process flow diagrams. The files you are describing are text files that can be used by Text Miner. These files would not need to be imported into SAS Enterprise Miner to be used as text; instead, they would need to be put in a particular location with other text files where the Text Miner nodes could access and analyze them. I hope this helps! Doug

DougWielenga · ‎08-07-2017

Correlation among input variables could be a very important issue in classical regression, where the structure of the model was critical to generating useful results and interpretation. In most data mining scenarios, you have far more data than was available to historical approaches, as well as powerful methods (linear and nonlinear) that let you model relationships using flexible models which adapt to your data. You can use holdout data to validate the relationships empirically rather than relying on assumptions. The amount of interpretation available differs from model to model. Trees provide simple interpretability, while neural network and SVM models do not lend themselves to interpretation. Correlation is a concern for interpretation of simple regression models, but interpretation is not meaningful if the model is inadequate, which such models often are. If you would benefit from broader training in using these methods, check out the training available at http://support.sas.com/training/us/paths/dm.html where you can get a better understanding of how these different models can be used. Hope this helps! Doug

DougWielenga · ‎08-07-2017

SAS Enterprise Miner generates a ROC curve for the Train, Validate, and Test data sets in the Model Comparison node when modeling a binary target. It also generates a misclassification chart for the Train and Validate data sets, but it does not generate a misclassification chart for the Test data set. In the design of SAS Enterprise Miner, Test data sets are intended for a final unbiased evaluation of model performance, so they are not used by default when a Validate data set is present.

Please note that SAS Enterprise Miner always generates

F_<target variable name> : the target variable value
I_<target variable name> : the predicted target value (based on highest probability)

but it can also generate a D_<target variable name>, which contains the 'decision' outcome based on the decision weights and priors entered in the target profile when one is present. For example, if the target variable is named 'BAD', SAS Enterprise Miner would create the variables F_BAD, I_BAD, and D_BAD.

If a Test data set is available, you can add a SAS Code node after any modeling node and enter the following code in the Training code section. This example assumes the target variable is named BAD.

/*** BEGIN SAS CODE ***/
proc freq data=&em_import_test;
   tables F_BAD*I_BAD;   * actual target by predicted target;
   tables F_BAD*D_BAD;   * only available if Decision profile has been created;
run;
/*** END SAS CODE ***/

The code above will generate both misclassification charts if the target profile is available. Hope this helps! Doug

DougWielenga · ‎08-07-2017

Open the SAS Enterprise Miner help by starting SAS Enterprise Miner and clicking on Help --> Contents. In the panel on the left, navigate to Node Reference --> Assess Nodes --> Cutoff Node, and then in the panel on the right navigate to Cutoff Node Train Properties. Scroll down until you see Event Precision Equal Recall, where it says the following:

Event Precision Equal Recall: With precision defined as % true predicted events / (true predicted + false predicted) and recall defined as the event classification rate, this method chooses the point at which precision and recall are equal.

There are two ways to find this in the output:

(1) In the Overall Rates plot, a line is drawn at the requested point, and hovering over the line with the mouse will show the cutoff (see attached document for plot).

(2) In the Output section, you can see that the point at which the first two columns are closest is when the cutoff is at 0.36 (I added the bold -- see attached document for partial table).

|--------+-----------+----------+----------+----------------|
|        | Event     | True     | False    | Overall        |
|        | Precision | Positive | Positive | Classification |
| Cutoff | Rate      | Rate     | Rate     | Rate           |
|--------+-----------+----------+----------+----------------|
| 0.99   |    200.00 |     8.30 |     0.00 |         161.78 |
| 0.98   |    200.00 |    13.26 |     0.00 |         162.77 |
| 0.97   |    196.67 |    15.47 |     0.07 |         163.15 |
|  . . . |     . . . |    . . . |    . . . |          . . . |
| 0.38   |    134.11 |   125.06 |    15.31 |         172.80 |
| 0.37   |    132.51 |   127.47 |    16.18 |         172.59 |
| 0.36   |    131.01 |   129.67 |    17.01 |         172.36 |
| 0.35   |    128.45 |   131.27 |    18.22 |         171.71 |
|--------+-----------+----------+----------+----------------|

I hope this helps! Doug
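Reading option (2) programmatically amounts to finding the row where the first two rate columns are closest. The Python sketch below does that using only the rows shown in the partial table above (the elided rows are not included), with the values copied exactly as posted. The function and variable names are hypothetical.

```python
# (cutoff, event precision rate, true positive rate), copied from the partial table.
rows = [
    (0.99, 200.00, 8.30),
    (0.98, 200.00, 13.26),
    (0.97, 196.67, 15.47),
    (0.38, 134.11, 125.06),
    (0.37, 132.51, 127.47),
    (0.36, 131.01, 129.67),
    (0.35, 128.45, 131.27),
]


def closest_precision_recall_cutoff(table):
    """Return the cutoff whose precision and true positive rate differ the least."""
    best = min(table, key=lambda row: abs(row[1] - row[2]))
    return best[0]


print(closest_precision_recall_cutoff(rows))
```

For the rows listed, the smallest gap is at the 0.36 cutoff, which matches the value identified in the post.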

DougWielenga · ‎08-07-2017

On the PROC HPSPLIT statement, there is a PLOTS option that lets you display a subtree beginning at a node you choose and extending to a set depth. In complex trees, you will not be able to reasonably see the entire tree in one plot without losing many details. The code below refers to the SAMPSIO.HMEQ data set, which is available as a sample data set in SAS Enterprise Miner and is also attached here. The code requests the displayed tree to have a depth of 5 beginning from node "3":

/*** BEGIN SAS CODE ***/
libname x 'c:\data'; * < note: change libname and path as needed > ;

proc hpsplit data=x.hmeq seed=123 maxdepth=10
             plots=(zoomedtree(nodes=("3") depth=5));
   class bad reason job;
   model bad (event='1') = debtinc derog loan mortdue value job reason clno ninq yoj;
   grow entropy;
   prune off; * costcomplexity(leaves=all);
run;
/*** END SAS CODE ***/

I hope this helps! Doug

DougWielenga · ‎08-07-2017

IrinaN, The short answer is that there is no catalog of functions that are (generally) only used in SAS Enterprise Miner, since these would typically provide no benefit to the user. If you have questions about what a particular function does, you can look at the code (as you have done) or inquire with SAS Technical Support. In this case, the DMNORM function you mention is used for normalizing input field names and values to have no more than 32 characters in the name and no more than 32 characters in the field. SAS Enterprise Miner uses the internal normalized version of the variable for analysis and in the score code it generates, but you would never need these functions elsewhere. The normalization in this situation is important because many data management applications/utilities export data with unnecessarily wide fields (e.g. 200+ characters for a Yes/No variable). Since SAS Enterprise Miner is designed to generate score code, and the entire potential width of the field must be stored just in case it is needed, this limit prevents the data from becoming unnecessarily large and prevents the score code from becoming unnecessarily long, both of which would slow processing. Even if your grouping variables have levels that do not differ within the first 32 characters, SAS Enterprise Miner will still keep them distinct, but you will have to go back to the code to figure out which level each normalized level is assigned to. This is why we recommend avoiding unnecessarily long field names/values; if you do use them, make sure they differ in the first 20-25 characters so they can be easily distinguished. In general, the DMNORM function handles all this, but it is not a function that would typically be used directly by a user. I hope this helps! Doug

DougWielenga · ‎08-07-2017

It really doesn't matter to an algorithm where the data came from or whether there should be 'missing' values. The data structure for the techniques you are describing anticipates that there will be a distinct unit/observation/entity on each row (not spread across multiple rows) and that each column will contain an attribute for the unit/observation/entity on the corresponding row. So if we were looking at cars, your rows might correspond to a particular make and model of a car, and the columns might correspond to things like suggested retail price, city mpg, highway mpg, number of cylinders, drivetrain type (front/rear/all-wheel), bluetooth enabled (yes/no), etc. It is possible that you don't have complete information even in simple situations like this, since Mazda doesn't have cylinders (it's a chamber) in its rotary engine, and some models might not post certain information. It is important to note that a neural network, a support vector machine, or a regression model will drop any observation with incomplete data, which simply means there is a missing value for one or more of the input variables. Decision Tree models are able to incorporate these observations, but you must impute/guess the missing value if you want the observation to be considered at all in your neural network or regression model. Adding rows with incomplete data will not help these latter modeling types, but even incomplete data can be used by a Decision Tree model. If the rows that have been 'added' are not really contributing any additional information to the model, it is possible that one of those methods requiring complete data might be helpful. From a method standpoint, however, it is important to understand how the methods are interpreting your data and to decide what will generate meaningful results. I hope this helps, Doug
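The difference between dropping incomplete rows and imputing them can be sketched in a few lines of Python. The car data below is made up for illustration, and mean imputation is just one simple choice among many; it is not a description of what any particular SAS Enterprise Miner node does.

```python
from statistics import mean

# Hypothetical rows: one car per row, one attribute per column; None marks a missing value.
cars = [
    {"model": "A", "city_mpg": 28.0, "cylinders": 4},
    {"model": "B", "city_mpg": 22.0, "cylinders": 6},
    {"model": "C", "city_mpg": 19.0, "cylinders": None},  # e.g. a rotary engine
    {"model": "D", "city_mpg": None, "cylinders": 8},     # mpg not published
]
inputs = ["city_mpg", "cylinders"]


def complete_cases(rows, cols):
    """Rows a complete-data method (regression, neural network, SVM) would actually use."""
    return [r for r in rows if all(r[c] is not None for c in cols)]


def impute_mean(rows, cols):
    """Replace each missing input with the mean of the observed values in that column."""
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [{**r, **{c: means[c] if r[c] is None else r[c] for c in cols}} for r in rows]


print(len(complete_cases(cars, inputs)), "of", len(cars), "rows are complete")
print(len(complete_cases(impute_mean(cars, inputs), inputs)), "rows usable after imputation")
```

Half of this toy data set would be discarded by a complete-case method, while all four rows remain available after imputation.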

DougWielenga · ‎08-07-2017

Given that your variables are all strings of characters and symbols rather than interval/numeric, you might consider working first with a Decision Tree rather than a Neural Network or Regression model. Regarding the observations, I am not sure why you would choose to limit the input data initially. It is common to model a rare event using any of these approaches. When the number of observations is extremely large relative to the computing power, the law of diminishing returns comes into play, which is when one might consider sampling (or oversampling) as one approach to dealing with excessive time or resources being needed for modeling against the entire data set. The observation count you are describing is not excessive, but I still do not have a good understanding of what an observation is in your data set. In general, the methods you are discussing expect the data to contain one observation/entity on each row, with the attributes of that entity contained in the columns. From your description, it sounds like each row would correspond to either a malware app or a clean app. The target variable would flag each row as malware or clean (perhaps 1 and 0), there would be an ID to identify the particular app (one row for each such app), and the remaining columns would contain attributes of the app. You could also try neural network, support vector machine, and regression models, but these models require complete data. Therefore, if any of the apps have missing data (no known value for a column), you must either impute/guess the missing value or the observation will be dropped from consideration in fitting the model. Even if your data is complete, you should still consider many types of models, including a Decision Tree, as there is no way to know in advance which approach will provide the best performance. I hope this helps! Doug

DougWielenga · ‎08-04-2017

Kamal, The SAMPSIO library in SAS Enterprise Miner actually consists of several sample library locations and I would not recommend putting your data in any of those. Instead, you can simply copy the data to a location that is accessible from SAS Enterprise Miner and then click on File --> New --> Library... to define the path to the data location (relative to the server where SAS Enterprise Miner is installed). Note that you cannot use locally mapped drive letters but must specify the fully qualified path (e.g. //myserver/myfolders/myprojects/mydata) for any drive which is not local to the server doing the work. You can then define a new data source using the LIBNAME you defined in the wizard launched by requesting to create a new library. Let me know if you have any trouble. Cordially, Doug

Date Last Visited	‎11-19-2021 02:27 PM

Re: SAS EM: does clustering node have elbow method to select the optim...

Re: Quality Check of Training and Validation Set

Re: Putting a Support Threshold in Apriori Algorithm - SAS E-miner

Re: Lift chart value in business case after Random Forest model

Re: Confusion in decisions node (while scoring after oversampling)

Re: SAS Enterprise Miner: Imbalance data

Re: Calculating Percentiles in Miner

Re: SAS Enterprise Miner 14.3 - Corrupted Diagram

Re: How do I Display All Values in Nodes

Re: Logworth Drop to 0 when Editing Rule for Interactive Decision Tree...

Re: result of segment profile chart doesn't show up all. Is there some...

Re: SAS to Python

Re: Advantages & benefits of using EM Server vs EM for Desktop

Re: Run time error 1008 with market basket analysis node

Re: Using R code in SAS Enterprise Miner

Re: PROC HPSPLIT Decision Tree

Re: How to get ROC curve in Model Comparison

Re: Need links to instructions on reading these SAS charts: Mean Pred...

Re: Help with over/under sampling of the rare event in predictive mode...

Re: Next-Best Offer/Recommendation under the Association Node while us...

Re: SAS Enterprise Miner 14.1: Displaying surrogate rules for variabl...

Re: How to calculate f-score of classifiers

Re: Model Transformations - Intermediate Language or Format

Re: SAS EMiner Machine Learning Models - Stability Check

Re: Model to predict customers who are more likely to subscribe

Re: HPSPLIT and rare events

Re: Problems importing file with SAS Enterprise Miner

Re: Correlated variables in classifiers

Re: Testing Confusion matrix in Enterprise Miner

Re: Precision and recall scores in cutoff node

Re: PROC HPSPLIT Decision Tree

Re: Description of Data Mining Functions

Re: Missing values in a column

Re: Need an advice on Imbalance datasets

Re: Loading SAS Dataset in to SAS eMiner environment