adjgiulio Tracker
https://communities.sas.com/kntur85557/tracker
Sat, 15 Jun 2024 00:18:58 GMT2024-06-15T00:18:58ZRe: Is this a high dimensionality issue? How to deal with it?
https://communities.sas.com/t5/SAS-Data-Science/Is-this-a-high-demensionality-issue-How-to-deal-with-it/m-p/171469#M1949
<HTML><HEAD></HEAD><BODY><P>You can do a couple of things:</P><P>1) Group rare classes to fulfill EM's max number of classes requirement. That is usually OK in terms of loss of information, unless one of your rare classes is highly correlated with the target.</P><P>2) Replace the actual classes with their target mean and use this new continuous variable instead. But do keep in mind that this, if not done correctly, will increase the risk of overfitting. In particular, your validation and test datasets should not be used to calculate the target mean. That is somewhat hard to do in EM, but very easy to handle in R or Python.</P></BODY></HTML>Wed, 15 Oct 2014 16:40:31 GMThttps://communities.sas.com/t5/SAS-Data-Science/Is-this-a-high-demensionality-issue-How-to-deal-with-it/m-p/171469#M1949adjgiulio2014-10-15T16:40:31ZRe: Decision tree for dimensional reduction
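As a minimal sketch of the leakage-safe target-mean replacement described in point 2 (in Python, since the post notes this is easy to handle there; the data, names, and fallback choice are illustrative, not from the thread): compute the per-class means on the training split only, then map validation/test classes through them, with the overall training mean as a fallback for unseen classes.

```python
# Target-mean encoding fitted on the TRAINING split only, then applied
# to validation -- the leakage-safe approach described above.
# (All data and names below are illustrative assumptions.)
from collections import defaultdict

def fit_target_means(categories, targets):
    """Return {category: mean(target)}, learned from training data only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

def encode(categories, means, fallback):
    """Map categories to training means; unseen classes get the fallback."""
    return [means.get(c, fallback) for c in categories]

train_cat = ["a", "a", "b", "b", "b"]
train_y   = [1, 0, 1, 1, 0]
means = fit_target_means(train_cat, train_y)         # per-class training means
fallback = sum(train_y) / len(train_y)               # overall training mean
valid_encoded = encode(["a", "c"], means, fallback)  # "c" is unseen -> fallback
```

Because validation rows never contribute to `means`, the encoded variable carries no information leaked from the holdout sets.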
https://communities.sas.com/t5/SAS-Data-Science/Decision-tree-for-dimensional-reduction/m-p/172455#M1973
<HTML><HEAD></HEAD><BODY><P>One thing I'd add is that I wouldn't trust one decision tree to do variable selection for me. Decision trees are highly unstable models, and small changes in the data (i.e. even a change in SEED) can produce vastly different variable selections, especially when many of your variables are correlated. That is why I'd rather use a Random Forest to get a feeling for variable importance. Not sure EM has random forests though, haven't used EM in a while.</P></BODY></HTML>Wed, 15 Oct 2014 16:32:03 GMThttps://communities.sas.com/t5/SAS-Data-Science/Decision-tree-for-dimensional-reduction/m-p/172455#M1973adjgiulio2014-10-15T16:32:03ZRe: Help, Attrition Model Performance in SAS. Thanks
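To illustrate the forest-over-single-tree point, here is a small sketch using scikit-learn as an assumed stand-in for whatever forest implementation is available (the synthetic data and parameters are illustrative): the forest averages importance over many bootstrapped trees, so the resulting ranking is far less sensitive to a single SEED than one tree's splits.

```python
# Sketch: seed-averaged variable importance from a Random Forest,
# instead of trusting the splits of a single, unstable tree.
# (scikit-learn and the synthetic data are illustrative assumptions.)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ is averaged over the 200 bootstrapped trees.
ranked = sorted(enumerate(forest.feature_importances_), key=lambda kv: -kv[1])
top_vars = [i for i, _ in ranked[:3]]  # indices of the strongest inputs
```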
https://communities.sas.com/t5/SAS-Data-Science/Help-Attrition-Model-Performance-in-SAS-Thanks/m-p/123955#M1043
<HTML><HEAD></HEAD><BODY><P>Knowing nothing else, it seems to me that your training model is not generalizing well to the validation set, which is usually a sign of overfitting.</P><P>What tool are you using to create the initial model, and what technique?</P></BODY></HTML>Mon, 29 Apr 2013 17:50:46 GMThttps://communities.sas.com/t5/SAS-Data-Science/Help-Attrition-Model-Performance-in-SAS-Thanks/m-p/123955#M1043adjgiulio2013-04-29T17:50:46ZRe: Enterprise Miner 4.3: Questions around the transform node
https://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-4-3-Questions-around-the-transform-node/m-p/117279#M995
<HTML><HEAD></HEAD><BODY><P>You should feel free to try anything you want. In data mining, it is not uncommon to try dozens of transformations of the same variable, and then select those that seem to work best. EM offers a series of "Best Power" transformations (Maximum Normality, Maximum Correlation with Target,...), and each one will try for you several transformations: x, log(x), sqrt(x), e^x, x^(1/4), x^2, x^4.<BR />There isn't a rule regarding whether to do transformation, selection or imputation in a specific order. Try different things, and see what works best. EM makes it so easy that I don't see a reason why you wouldn't want to try different approaches just for the sake of curiosity.</P><P><BR />G</P></BODY></HTML>Wed, 24 Apr 2013 21:21:10 GMThttps://communities.sas.com/t5/SAS-Data-Science/Enterprise-Miner-4-3-Questions-around-the-transform-node/m-p/117279#M995adjgiulio2013-04-24T21:21:10ZRe: a question regarding statistics
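The "Best Power" search described above can be sketched outside EM as well. A minimal Python illustration (the candidate list follows the post; the toy data and selection-by-correlation criterion are my assumptions, not EM's exact algorithm): try each transformation of x and keep the one with the strongest correlation with the target.

```python
# Sketch of a "Maximum Correlation with Target" best-power search:
# apply several candidate transformations of x and keep the winner.
# (Toy data and candidate set are illustrative assumptions.)
import math

def corr(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

candidates = {
    "x":       lambda v: v,
    "log(x)":  math.log,
    "sqrt(x)": math.sqrt,
    "x^2":     lambda v: v * v,
}

x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [0.1, 1.1, 1.9, 3.2, 3.9]  # roughly linear in log(x)

best = max(candidates,
           key=lambda name: abs(corr([candidates[name](v) for v in x], y)))
```

On this toy data the log transform wins, as expected for a target that grows roughly linearly in log(x).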
https://communities.sas.com/t5/SAS-Data-Science/a-question-regarding-statistics/m-p/93308#M700
<HTML><HEAD></HEAD><BODY><P>I'm not sure how the author came up with the average t-statistic. I agree that averaging t-statistics is bad practice. If you have the opportunity to recommend a different solution, I would probably go for something like "% of regressions with a significant p-value". It's a way of saying that, for each independent variable, x% of the n "by groups" had a significant p-value.</P></BODY></HTML>Mon, 01 Apr 2013 16:52:41 GMThttps://communities.sas.com/t5/SAS-Data-Science/a-question-regarding-statistics/m-p/93308#M700adjgiulio2013-04-01T16:52:41ZRe: How to Export Logits in Enterprise Miner
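The suggested "% of regressions with a significant p-value" summary amounts to a one-liner; a sketch in Python (the p-values below are made-up illustrations, not from the thread): collect one p-value per by-group regression for the variable of interest and report the share below a chosen alpha.

```python
# Sketch of the "% of by-groups with a significant p-value" summary:
# one p-value per by-group regression, then the share below alpha.
# (The p-values are illustrative assumptions.)
def share_significant(p_values, alpha=0.05):
    """Fraction of by-group regressions with a significant coefficient."""
    return sum(p < alpha for p in p_values) / len(p_values)

pvals_by_group = [0.001, 0.20, 0.03, 0.60, 0.04]  # one per by-group
share = share_significant(pvals_by_group)          # 3 of 5 groups significant
```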
https://communities.sas.com/t5/SAS-Data-Science/How-to-Export-Logits-in-Enterprise-Miner/m-p/85075#M576
<HTML><HEAD></HEAD><BODY><P>I'm interested in seeing if someone can answer your questions. I looked into this same question myself not long ago. A logistic regression in EM is run through PROC DMREG. When you look at DMREG's documentation, it doesn't say anything about an option to export coefficients. The workaround I found, which is no good if you need an automated solution but worked for what I had to do that one time, is the following:</P><P>- go to the results of your regression.</P><P>- by default EM shows an Effect Plots chart. Highlight that chart.</P><P>- now go to View->Table. This will open up a table that has a few pieces of information on your inputs, including their coefficients, which you can use to calculate odds.</P><P>- you can now select all the data in the table and paste it into Excel, or you can save the table as a SAS table (File->Save as, after having highlighted the table).</P><P></P><P>G</P></BODY></HTML>Thu, 21 Mar 2013 22:33:42 GMThttps://communities.sas.com/t5/SAS-Data-Science/How-to-Export-Logits-in-Enterprise-Miner/m-p/85075#M576adjgiulio2013-03-21T22:33:42ZRe: Auto Neural Node and Optimization Statistics
https://communities.sas.com/t5/SAS-Data-Science/Auto-Neural-Node-and-Optimization-Stastistics/m-p/80109#M521
<HTML><HEAD></HEAD><BODY><P>Since there is an almost unlimited number of possible network configurations, EM offers a couple of approaches to cover a wide range of needs. The Neural node allows you to control a single hidden layer network. The AutoNeural node provides an algorithm to build a multilayer network. The default AutoNeural action is to simply train a single network to give you a baseline model. Full control only comes from PROC NEURAL.</P><P></P><P>G</P></BODY></HTML>Thu, 21 Mar 2013 21:27:49 GMThttps://communities.sas.com/t5/SAS-Data-Science/Auto-Neural-Node-and-Optimization-Stastistics/m-p/80109#M521adjgiulio2013-03-21T21:27:49ZRe: Proc Logistic, please help. Thanks
https://communities.sas.com/t5/SAS-Data-Science/Proc-Logistic-please-help-Thanks/m-p/77402#M484
<HTML><HEAD></HEAD><BODY><P>Hi,</P><P>being a classically trained statistician who was introduced to data mining only later in my career, I consider myself biased against data dredging. Stepwise selection is often brought up as a pragmatic example of using computational power to replace domain knowledge. In data mining it is not uncommon to start with hundreds or thousands of variables, and it is just impractical to analyze one variable at a time. That’s where I tend to use stepwise regression: as an initial variable selection method, used in combination with other variable selection methods such as decision trees, IV,…<BR />In your case it seems like you’re starting with a small number of variables. That’s where domain knowledge should come in to help decide what to include and what to exclude, sometimes regardless of their p-values.</P><P>G</P></BODY></HTML>Thu, 14 Mar 2013 19:53:07 GMThttps://communities.sas.com/t5/SAS-Data-Science/Proc-Logistic-please-help-Thanks/m-p/77402#M484adjgiulio2013-03-14T19:53:07ZRe: Scoring Code in Enterprise Miner
https://communities.sas.com/t5/SAS-Data-Science/Scoring-Code-in-Enterprise-Miner/m-p/133063#M1179
<HTML><HEAD></HEAD><BODY><P>The meaning of the two variables is:</P><P>I_ -- normalized category that the case is classified into</P><P>U_ -- unnormalized category that the case is classified into</P><P></P><P>From a practical perspective I haven't come across cases where they differ. In your case the interval vs. nominal format reflects the fact that your target is numeric (even though you have probably defined it as binary in metadata). If you were to use a GOOD/BAD binary target, I_ would also be nominal. U_ is always nominal.</P><P></P><P>G</P></BODY></HTML>Wed, 13 Mar 2013 15:35:08 GMThttps://communities.sas.com/t5/SAS-Data-Science/Scoring-Code-in-Enterprise-Miner/m-p/133063#M1179adjgiulio2013-03-13T15:35:08ZRe: Decision Tree Results are Blank
https://communities.sas.com/t5/SAS-Data-Science/Decicision-Tree-Results-are-Blank/m-p/131089#M1131
<HTML><HEAD></HEAD><BODY><P>There are many reasons why that could be happening. In theory, your explanatory variables might not have enough power to generate a split, though I very much doubt that is the case. If you provide more details it might be easier for us to help. Usually this comes down to a combination of factors. For example, if your sample is too small, you are trying to predict a rare event, and you also set the minimum leaf size too high, the tree might not be able to find a split that satisfies purity and minimum leaf size at the same time.</P><P><BR />G</P></BODY></HTML>Fri, 08 Mar 2013 16:03:23 GMThttps://communities.sas.com/t5/SAS-Data-Science/Decicision-Tree-Results-are-Blank/m-p/131089#M1131adjgiulio2013-03-08T16:03:23ZRe: propensity modeling, clustering
https://communities.sas.com/t5/SAS-Data-Science/propensity-modeling-clustering/m-p/125349#M1053
<HTML><HEAD></HEAD><BODY><P><BR />Propensity modeling is such a broad term. From your question it seems like what you really need is a clustering technique reference.</P><P>In general I really like Data Preparation for Data Mining Using SAS by Mamdouh Refaat. Many topics discussed in this book apply across the board to all commonly used techniques, including cluster analysis.</P><P>My go-to book for clustering using SAS is actually SAS' training material from their<A href="https://support.sas.com/edu/schedules.html?id=1446&ctry=US"> Applied Clustering Techniques </A>course. SAS has at least another couple of cluster-related courses, which I have not attended and cannot really speak to: a Customer Segmentation using EM course, and a Propensity Score Matching one.</P><P></P><P>G</P></BODY></HTML>Mon, 04 Mar 2013 19:04:29 GMThttps://communities.sas.com/t5/SAS-Data-Science/propensity-modeling-clustering/m-p/125349#M1053adjgiulio2013-03-04T19:04:29ZRe: Data Source Node in Enterprise Miner
https://communities.sas.com/t5/SAS-Data-Science/Data-Source-Node-in-Enterprise-Miner/m-p/113813#M954
<HTML><HEAD></HEAD><BODY><P>Think of the data source as metadata (data about the data). You're defining characteristics of the dataset and its variables. These characteristics can be leveraged later on through other nodes. For instance, if you defined lower and upper limits, then you could add a filter node and choose "Metadata limits" as default filtering method. That would essentially do what you had hoped the data source node would automatically do for you. The advantage of this approach is that it gives you a lot more flexibility in how you use limits. You can use them to filter through a filter node, you could replace values outside of your limits with a replacement node and so forth.</P><P><BR />G</P></BODY></HTML>Wed, 27 Feb 2013 19:42:10 GMThttps://communities.sas.com/t5/SAS-Data-Science/Data-Source-Node-in-Enterprise-Miner/m-p/113813#M954adjgiulio2013-02-27T19:42:10ZRe: How to implement oversampling in Enterprise Miner?
https://communities.sas.com/t5/SAS-Data-Science/How-to-implement-oversampling-in-Enterprise-Miner/m-p/121566#M1028
<HTML><HEAD></HEAD><BODY><P>Mike,</P><P></P><P>There are several ways to implement oversampling in EM. The first step is to determine what flavor of oversampling you are after. Is it oversampling, undersampling, weighting of observations, duplication of rare events? This choice is influenced by many factors, including the proportion of rare events (is it 10%, 1%, 0.1%...?) and how many observations you have. The ultimate goal is to have enough examples of your rare class to allow the model to identify meaningful patterns.</P><P>Under a typical scenario your target has a rare class, say 10%. If you had enough observations you could afford to oversample the rare class to 50%. You can do that using a Sample node with the following properties: Size/Type=Percentage, Size/Percentage=100, Stratified/Criterion=Equal. This will result in a 50-50 sample where all of your rare events are used and only a sample of 0’s is chosen.</P><P>At this point you can already start running models; however, all of your posterior probabilities and many performance metrics will not reflect the true priors. They are still good for model comparison and performance evaluation, as well as for ranking observations.</P><P>If you want your priors to be adjusted, then add a Decision node (after data partition, for example). Under the Custom Editor add the real priors. This will prompt EM to adjust all of your posterior probabilities.</P><P>However, and this is something to be careful with, the Decision node alone will NOT prompt EM to use the real priors as a cutoff value when choosing whether an observation is a 0 or a 1. In our example, even after using the Decision node, EM would use 0.5 as the cutoff value.</P><P>In order to get the cutoff right, you need to go back to the Decision node, go to the Decisions tab and select Yes, then click Default to Inverse Prior Weights.</P><P>Under the Decision Weights tab, copy the value in the lower right corner to the lower left corner but add a minus sign in front of it, then replace the lower right corner with a 0. Just keep in mind that, even after all of this work, some metrics (Misclassification in particular) will not reflect the actual priors. But the posteriors will be right, and the 0/1 decision will be right.</P><P></P><P>G</P></BODY></HTML>Tue, 26 Feb 2013 15:47:19 GMThttps://communities.sas.com/t5/SAS-Data-Science/How-to-implement-oversampling-in-Enterprise-Miner/m-p/121566#M1028adjgiulio2013-02-26T15:47:19ZRe: Inverse Prior Weights
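The posterior adjustment described above can be sketched with the standard prior-correction formula (a hand-rolled Python illustration, not EM's code; the 10% true prior and 50-50 oversample mirror the example in the post): rescale the sample-scale posterior by the ratio of true priors to sample proportions, then renormalize.

```python
# Sketch of prior correction for an oversampled posterior:
# adjusted = (p * pi1/rho1) / (p * pi1/rho1 + (1-p) * pi0/rho0),
# where pi are the true priors and rho the oversampled proportions.
# (Hand-rolled illustration; numbers mirror the 10% / 50-50 example.)
def adjust_posterior(p, true_prior, sample_prior):
    """Correct a posterior p(1|x) from the oversample back to true priors."""
    num = p * true_prior / sample_prior
    den = num + (1 - p) * (1 - true_prior) / (1 - sample_prior)
    return num / den

# A 0.5 posterior on the 50-50 oversample maps back to the 10% true prior:
adjusted = adjust_posterior(0.5, true_prior=0.10, sample_prior=0.50)
```

Note that a sample-scale posterior of 0.5 maps back exactly to the true prior, which is consistent with the cutoff discussion above: a 0.5 cutoff on unadjusted posteriors corresponds to a cutoff at the true prior on adjusted ones.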
https://communities.sas.com/t5/SAS-Data-Science/Inverse-Prior-Weights/m-p/118240#M1000
<HTML><HEAD></HEAD><BODY><P>There are tons of papers out there on oversampling, undersampling, weighting of observations and other techniques of this kind. One of my favorites is this one:</P><P><A href="http://www2.sas.com/proceedings/forum2007/073-2007.pdf">http://www2.sas.com/proceedings/forum2007/073-2007.pdf</A></P><P>It talks about oversampling starting on page 6.</P><P></P><P>Other articles I like are these two by Gordon Linoff:</P><P><A href="http://blog.data-miners.com/2009/11/oversampling-in-general.html">http://blog.data-miners.com/2009/11/oversampling-in-general.html</A></P><P><A href="http://blog.data-miners.com/2009/09/adjusting-for-oversampling.html">http://blog.data-miners.com/2009/09/adjusting-for-oversampling.html</A></P><P><BR /> If you're looking for more technical details, just google the topic and you'll find a lot more.</P><P></P><P>G</P></BODY></HTML>Mon, 25 Feb 2013 19:40:06 GMThttps://communities.sas.com/t5/SAS-Data-Science/Inverse-Prior-Weights/m-p/118240#M1000adjgiulio2013-02-25T19:40:06ZRe: Output SAS Miner data to Excel.
https://communities.sas.com/t5/SAS-Data-Science/Output-SAS-Miner-data-to-Excel/m-p/113018#M951
<HTML><HEAD></HEAD><BODY><P>How about you add a Code node after the node where your dataset is created and then use this code in the Code node:</P><P></P><P>PROC EXPORT DATA= &EM_IMPORT_DATA</P><P> OUTFILE= "your_export_location\test.xls"</P><P> DBMS=EXCELCS REPLACE;</P><P> SHEET="test";</P><P>RUN;</P><P></P><P>Notice that the exported dataset you want to save becomes the import dataset in the Code node. Play with the DBMS= value to make it work for you.</P><P></P><P>G</P></BODY></HTML>Thu, 21 Feb 2013 19:16:02 GMThttps://communities.sas.com/t5/SAS-Data-Science/Output-SAS-Miner-data-to-Excel/m-p/113018#M951adjgiulio2013-02-21T19:16:02Zchange variable role via Code node
https://communities.sas.com/t5/SAS-Data-Science/change-variable-role-via-Code-node/m-p/117487#M996
<HTML><HEAD></HEAD><BODY><P>I have a dataset where each observation has a common set of variables. Each observation also has a time series set of variables, with the length of the series changing from observation to observation.</P><P>For instance, say the max length of the time series is 36. A member who churned after 6 months would have 6 out of 36 time series data points (the remaining 30 would be missing values). Another member who churned after 22 months would have 22 data points out of 36.</P><P>Something like this:</P><P>obs age gender t1 t2 t3 t4 ... t36</P><P>1 23 0 9 8 3 . .</P><P>2 54 1 8 8 . . . </P><P>3 34 1 5 5 6 4 8</P><P></P><P>I want to create an ensemble model where a model is fitted to each subgroup of members according to the length of their time series. In order to do so, I need to be able to change the role of the unused time series variables to Rejected.</P><P>That can be done manually using an endless series of Metadata nodes, but I'd like a more flexible, code-driven solution. Is that possible?</P><P></P><P>Thanks,</P><P><BR />G</P></BODY></HTML>Thu, 21 Feb 2013 19:06:34 GMThttps://communities.sas.com/t5/SAS-Data-Science/change-variable-role-via-Code-node/m-p/117487#M996adjgiulio2013-02-21T19:06:34Z