JasonXin Tracker

Re: How many leaves and nodes should a tree

JasonXin — Sun, 04 Dec 2016 14:48:56 GMT

If possible, could you point me to the paper? Thanks.
Jason Xin

Re: How many leaves and nodes should a tree

JasonXin — Sat, 03 Dec 2016 20:54:58 GMT

Hi, If you don't see the button to the right to click, open and configure, chances are the image belongs to a different version of EM. Thanks. Jason Xin

Re: SAS EMiner Oversampling reduced the target sample size

JasonXin — Thu, 01 Dec 2016 21:12:52 GMT

Hi, First of all, there is no over-sampling node in EM. I figure you meant the Sample node. The Sample node offers random, systematic, First N, stratify... None of them lets you change the ratio between 1 and 0 on the target. The purpose of sampling is to take a subset, in one way or another, that represents the master source. The goal is to represent, not to alter. Oversampling, on the other hand, recomposes a sample and therefore, logically, alters it. The Sample node is often used in situations like this: the qualified model universe has 20 million observations, and I need to take a 5% sample to make it work in EM. In this sense, sampling really is not analytical/technical. But oversampling is every bit analytics. In other words, the reason you run sampling should not overlap with the reason driving oversampling, although the act of oversampling is, per se, sampling. Hope this helps? Thank you for using SAS. Best. Jason Xin
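
The distinction between representing and altering can be sketched in a few lines. This is a minimal illustration in plain Python, not EM code: the function names, the 10,000-row universe and its 2% event rate are all made up for the example, and "oversampling" is taken here to mean keeping every event and drawing fewer non-events.

```python
import random

def stratified_sample(rows, fraction, seed=0):
    """Take the same fraction from each target class, so the 1/0 ratio is preserved."""
    rng = random.Random(seed)
    sample = []
    for value in (0, 1):
        group = [r for r in rows if r["target"] == value]
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample

def oversample_events(rows, event_share, seed=0):
    """Keep every event (target=1) and draw just enough non-events so that
    events make up `event_share` of the result: the ratio is altered on purpose."""
    rng = random.Random(seed)
    events = [r for r in rows if r["target"] == 1]
    non_events = [r for r in rows if r["target"] == 0]
    k = round(len(events) * (1 - event_share) / event_share)
    return events + rng.sample(non_events, min(k, len(non_events)))

def event_rate(rows):
    return sum(r["target"] for r in rows) / len(rows)

# Hypothetical universe: 10,000 rows with a 2% event rate.
universe = [{"id": i, "target": 1 if i < 200 else 0} for i in range(10_000)]

sampled = stratified_sample(universe, fraction=0.05)    # represents the source
recomposed = oversample_events(universe, event_share=0.5)  # alters the source

print(event_rate(universe))    # 0.02
print(event_rate(sampled))     # 0.02
print(event_rate(recomposed))  # 0.5
```

The 5% sample keeps the 2% event rate of the source, while the recomposed sample deliberately moves it to 50%.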

Re: Enterprise miner Node Leaf size issues

JasonXin — Thu, 01 Dec 2016 21:03:29 GMT

I agree with your assessment about transmitting corporate data to SAS. It is doable. It just needs to go through some paperwork. The technology is ready, and SAS TS has experience with it. But it depends on whether your company grants it. Thanks.

Re: Enterprise miner Node Leaf size issues

JasonXin — Wed, 30 Nov 2016 22:19:33 GMT

It seems your max branch remains at 2. For the sake of testing, I wonder if you can relax it to, say, 5 or 8. Also run StatExplore (if you have not) under Explore to profile the input variables. See if you have any significant/strong/dominating and highly categorical variables in the input set, like the most significant binary variable you mentioned. When I run into a situation like this, I often hold out the (strong) categorical variables and build a DT with the continuous variables. Then I add the held-out variables back to check their impact on the model.

Re: SAS Enterprise Miner GBM Node

JasonXin — Mon, 21 Nov 2016 22:50:12 GMT

Hi, I have seen cases in the past where EM GBM performs at a speed comparable to R integrated into the same flow, everything else held roughly equal. Yes, I have seen cases where GBM is slower than R. And vice versa. So there is little to infer or conclude in general. To be really useful to you, I would eventually have to sit down in front of your data set and operations to help speed things up, as I have done several times in the past. Generally speaking, EM spends a lot of resources running the GUI operations, writing and rewriting code in the background, something that running R through the integration node does not entail. Often when an EM node runs this slowly, it indicates that the work space for the flow is likely running out of space. It is simply writing as it is swapping... This eventually becomes a SAS Management Console subject, where one can try to relocate and optimize space management. If GUI operation does not appeal to you that much, you can try the underlying procedure, TreeBOOST. If you go to Google.com and search for "Jason Xin, treeboost", you should quickly get to the full-fledged sample code I published years ago. Once you finish modeling with the procedure code, you can bring the predicted values back into EM using the Model Import node, so that model comparison lines up with the other models you are building in the EM GUI. Hope this helps. Thanks. Jason Xin

Re: SAS Enterprise Miner GBM Node

JasonXin — Mon, 21 Nov 2016 17:31:32 GMT

Hi, How many variables/observations are you trying with the GBM node? Is it possible to run variable selection before the GBM node? EM's HPFOREST node runs at least on the multi-threading ability (SMP) of the CPUs; if you are configured to run on MPP (massively parallel processing) engaging, say, 32 or 48 computers, the speed and other performance are expected to be better than SMP. The GBM node is not supported to run on SMP or MPP. It runs as a traditional single-threaded node; the node does not have an HP prefix in front of it (this is how you tell). The speed expectation, therefore, is not supposed to be in line with HPFOREST, inner algorithm differences aside. This is, on the other hand, in no way suggesting that random forest runs faster than GBM, or vice versa. The latest SAS Viya sports a genuine in-memory GBM procedure and actions that scale on wide/tall tables, real big data. SAS is not expected to upgrade the existing EM GBM node to something like "HPGBM". Hope this helps? Thank you for using SAS. Best Regards Jason Xin

Re: Missing/Not Applicable Values for Interval Variable

JasonXin — Mon, 21 Nov 2016 14:42:16 GMT

Hi, Yes, Distribution AND Tree should both work. You can try both and tell the difference. The Tree method uses more of the information in the other inputs, while the Distribution method remains essentially univariate. Please pay attention to the distribution inside the non-missing subgroups plus the % size of the non-missing. For argument's sake, if you only have 1% non-missing, I would be hard-pressed to do it. Converting to 'flags': this idea is always intriguing, in the sense that the resulting indicators are by definition associated with the sourcing element. In the linear regression context, classically we 'stay away' from categorical variables, almost by instinct. But facilities in EM or SAS STAT are equally robust in supporting categorical variables, in variable selection and estimation, by way of, say, the CLASS statement. Chances are, if you derive indicators, you can only use one of them, if it is useful at all. You could use a decision tree in EM to run a test. Make sure all the performance readings are taken off the validation data set. Best Regards Jason Xin

Re: SAS EMiner Oversampling reduced the target sample size

JasonXin — Sat, 19 Nov 2016 16:58:28 GMT

Hi, In EM, see the attached picture. Once you load the data into EM, the YES group (in the picture) should be 1 in your case and the NO group should be your 0 group. Count=999 should be your 518 and Count=967 should be your 252. To the right, in place of 0.5081, enter 1. In place of 0.4919, enter 2.055555556 (=518/252). In plain English, by doing so you are telling EM to treat the 518 observations in the 1 group as they are, and to treat the 252 observations in the 0 group as if there were 2.055555556*252, about 518, of them. Logically. Hope this helps? Jason Xin
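
The arithmetic behind those two entries can be checked quickly. This is a small sketch in plain Python, using only the counts quoted above; the function name is made up for the illustration.

```python
def balancing_weight(n_reference, n_to_boost):
    """Weight that makes the smaller group count as much as the reference group."""
    return n_reference / n_to_boost

n_events = 518      # target = 1, kept as it is (weight 1)
n_non_events = 252  # target = 0, to be weighted up

w = balancing_weight(n_events, n_non_events)

print(round(w, 9))                  # 2.055555556
print(round(w * n_non_events, 6))   # 518.0
```

With a weight of 1 on the 518 and 518/252 on the 252, both groups carry the same effective count.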

Re: Missing/Not Applicable Values for Interval Variable

JasonXin — Sat, 19 Nov 2016 16:46:32 GMT

Hi, If the missing values fall exactly along the line of 1 and 0, then simple imputation does not work, since the imputed variables will run into complete or quasi-complete separation. They will be, and should be, rejected outright by logistic regression or NN. It is, however, not entirely hopeless, besides the option to drop them. If you do not have EM but have STAT, take a look at PROC MI. You may need to build your final models by the group of values MI plugs in for you. If you have a license for EM, take a look at the Distribution option under the Impute node. In some cases the Tree option may work, but depending on the other variables, it is possible that you still may not be able to reduce the risk of quasi-separation. Tree imputation should be the secondary option to try after Distribution. Given that your target=1 is typically very small proportionately, make sure the non-missing portion is large enough and its distribution 'normal' or sensible enough for you. Hope this helps? Thank you for using SAS. Jason Xin
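
Before choosing an imputation method, it may help to check how closely missingness follows the target. The sketch below is plain Python with made-up data and hypothetical function names, not a SAS procedure: it cross-tabulates a missing flag against the target and reports empty cells, which are the sign that the flag lines up with 1 and 0.

```python
from collections import Counter

def missing_by_target(values, targets):
    """Cross-tabulate 'is the input missing?' against the binary target."""
    table = Counter()
    for v, t in zip(values, targets):
        table[(v is None, t)] += 1
    return table

def empty_cells(table):
    """Return the (missing_flag, target) cells with zero rows. An empty cell
    means missingness lines up with the target, which is where complete or
    quasi-complete separation comes from."""
    cells = [(m, t) for m in (False, True) for t in (0, 1)]
    return [c for c in cells if table[c] == 0]

# Hypothetical data: the input is missing for every target=1 row.
values  = [3.2, 4.1, None, 2.7, None, None, 5.0, 3.9]
targets = [0,   0,   1,    0,   1,    1,    0,   0]

table = missing_by_target(values, targets)
print(empty_cells(table))   # [(False, 1), (True, 0)]
```

Two empty cells, as here, means the missing flag predicts the target perfectly; one empty cell is the quasi-separation case.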

Re: How many leaves and nodes should a tree

JasonXin — Sat, 19 Nov 2016 15:51:07 GMT

Hi,

If you are using SAS software like Enterprise Miner or HPSPLIT, the default settings on these parameters, more often than not, give you a fairly good baseline decision tree model.

In the case of Enterprise Miner, where you can do what we call an interactive tree, you can inject any variable-based rules to stop, expand or prune a tree. You can also combine this kind of 'manual' tree with machine-built trees. Machine-built trees are what most predictive modelers mean when they talk about decision tree modeling. I believe your question is about the machine-built tree (DT).

Best this, best that, the key is one word: validation. Where to stop, how many trees, how many variables to try (in other words, if you have 500K variables, it is not a good idea to pump them all into the tree engine at once), pruning guidance, surrogates... should all be decided on hold-out samples. As for the deciding criteria (which I believe is what you are asking, literally), cost-complexity, the balance between training and validation, outweighs so-called accuracy. Best practice typically involves rounds and rounds of tweaking.
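
As a rough sketch of "decide on hold-out samples": given candidate subtrees of different sizes, the choice is driven by validation error, with a preference for the simpler tree when the difference is negligible. This is plain Python, not EM's subtree assessment; the function name, the tolerance parameter and all the error figures are made up for illustration.

```python
def pick_tree_size(candidates, tolerance=0.0):
    """candidates: list of (n_leaves, train_error, validation_error).
    Returns the smallest tree whose validation error is within `tolerance`
    of the best validation error; training error is never used to choose."""
    best = min(v for _, _, v in candidates)
    eligible = [c for c in candidates if c[2] <= best + tolerance]
    return min(eligible, key=lambda c: c[0])

# Made-up numbers: training error keeps falling, validation error turns back up.
subtrees = [
    (2, 0.210, 0.215),
    (4, 0.180, 0.190),
    (8, 0.150, 0.172),
    (16, 0.110, 0.170),
    (32, 0.060, 0.185),
    (64, 0.020, 0.205),
]

print(pick_tree_size(subtrees))                   # (16, 0.11, 0.17)
print(pick_tree_size(subtrees, tolerance=0.005))  # (8, 0.15, 0.172)
```

The 64-leaf tree looks best on training data alone, which is exactly why training accuracy is not the deciding criterion.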

In the latest and greatest SAS Viya ML suite, you have access to a facility called Auto Tuning that lets you set ranges on (hyper)parameters, like those mentioned in your question, and lets Viya tell you which combination is optimal. The search routine goes beyond the brute-force nature of grid search (Latin Hypercube, anyone?). It is directly and immediately scalable, so the modeler can run it against a huge data set in memory.
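
For readers who have not met Latin hypercube sampling, here is a minimal sketch of the idea in plain Python. It is not Viya's autotuning implementation; the function name, parameter names and ranges are hypothetical. Each range is cut into as many strata as there are samples, and every stratum is used exactly once per parameter, so a handful of candidates still covers each range evenly.

```python
import random

def latin_hypercube(ranges, n_samples, seed=0):
    """ranges: dict name -> (low, high). Returns n_samples dicts.
    Each range is cut into n_samples equal strata; every stratum is used
    exactly once per parameter, and strata are shuffled independently."""
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        width = (high - low) / n_samples
        points = [low + (i + rng.random()) * width for i in range(n_samples)]
        rng.shuffle(points)
        columns[name] = points
    return [{name: columns[name][i] for name in ranges} for i in range(n_samples)]

# Hypothetical search ranges for a decision tree.
ranges = {"max_depth": (2, 12), "min_leaf_size": (5, 200), "max_branch": (2, 8)}
for candidate in latin_hypercube(ranges, n_samples=5):
    print({k: round(v) for k, v in candidate.items()})
```

A full grid over the same three ranges at five levels each would need 125 runs; the five candidates above still touch every fifth of every range once.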

So what is the best of this and that? Go to work. Decision tree is unlike many other methods and algorithms. In many cases, the best is when you see it, like gardening. Because there is a visual tree for you to see.

Hope this helps?

Best Regards

Jason Xin

Re: Collaborative Filtering in SAS. Please help. Thank You

JasonXin — Fri, 18 Nov 2016 13:49:14 GMT

Hi, By product, SAS IMSTAT has direct support of recommendation modeling capability. Without IMSTAT, you may find several past papers to get you going on the subject. Here is one http://support.sas.com/resources/papers/proceedings14/1886-2014.pdf Another interesting one is: http://support.sas.com/resources/papers/proceedings13/511-2013.pdf Best Regards Jason Xin

Re: HPforest variable importance

JasonXin — Wed, 02 Nov 2016 01:42:16 GMT

1. Yes, VARS_TO_TRY=n means SAS HPFOREST (and actually any package that wants to legitimately call itself RF should) will randomly pick n out of the total number of variables input by the user to do splitting. Yes, the same n applies to every branch split. The rationale is this: the split criteria tend to become more and more 'ad hoc' as a larger and larger number of input variables is put to the test for splitting, regardless of how one adjusts (Kass or otherwise). So one thing revolutionary about RF is that it does not run a 'split significance test' on all the input variables fed in. It just picks a smaller number. The rule of thumb is the SQRT of the total. After the n variables are randomly picked, the split test is run against them. One direction that is taking place is to make split criteria more simulative.

2. A negative Gini reduction intuitively means you should drop the variable, since it is not a significant enough contributor. It is common practice to run RF once, drop the variables with negative Gini reduction, and re-run RF. That is, to use RF both to select variables and to build the model.

3. If you really believe there is such a thing as science in data science or statistics, or that there should be, then there is nothing one should or can generalize against one method or another. Since when has declaring one method universally better than another become the mission of data science, or of any science at all? My suggestion to you, my friend, is to focus on the work at hand, and on delivering value to those who hire you and need your work. Let fashion be fashion. No matter how the central tendency is going, in one way or another, study the data first. Spend most of your time studying data, not methods.

Thank you for using SAS. Best Regards. Jason Xin
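
A small sketch of points 1 and 2, in plain Python rather than HPFOREST syntax. The function names, the 100 input names and the importance values are hypothetical; only the square-root rule of thumb and the "drop negative Gini reduction, then re-run" practice come from the post.

```python
import math
import random

def vars_to_try_default(n_inputs):
    """Rule of thumb from the post: square root of the number of inputs."""
    return max(1, round(math.sqrt(n_inputs)))

def candidates_for_split(input_names, n, rng):
    """Draw n inputs at random, without replacement, for one split.
    Called again for every split, so each split sees a fresh subset."""
    return rng.sample(input_names, n)

def keep_for_rerun(gini_reduction):
    """Keep the inputs whose Gini reduction is not negative."""
    return [name for name, gain in gini_reduction.items() if gain >= 0]

inputs = [f"x{i}" for i in range(1, 101)]   # 100 hypothetical inputs
n = vars_to_try_default(len(inputs))        # 10
rng = random.Random(42)

for split_id in range(3):
    print(split_id, candidates_for_split(inputs, n, rng))

# Made-up importance readings from a first run.
print(keep_for_rerun({"x1": 0.031, "x2": -0.004, "x3": 0.012}))  # ['x1', 'x3']
```

Only the 10 drawn inputs are tested at a given split; the other 90 are not looked at until a later split draws them.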

Re: Enterprise Miner

JasonXin — Wed, 21 Sep 2016 03:31:18 GMT

I forgot that attachment. Here it is.

Re: Enterprise Miner

JasonXin — Wed, 21 Sep 2016 03:15:53 GMT

Hi, lokendra_devangan_corecompete_com,

If the client did the imputation before partition using the EM Impute node, the imputation data steps should have been collected automatically into the eventual scoring piece by the Score node or Score Export node. I attach a two-page PDF I took from the EM user guide that shows which nodes generate steps that the EM Score nodes automatically append together.

By design, nothing written under the EM SAS Code node is automatically picked up into the final scoring equation by the Score node or Score Export node. The subtlety is this: if and when it is set up properly (selecting the right tool type plus options), the SAS Code node runs as expected and produces what is expected. This, however, does not give license for the 'correct' stuff to be automatically incorporated into the final scoring equation. The reason behind this is essentially the price we pay for the great flexibility afforded by the SAS Code node. Almost anything and everything you have licensed, or anything you can stick into the BASE editor and make run, can be inserted into the SAS Code node and run successfully. If the SAS Code node were designed to automatically write everything placed under it into the score code, EM could very well end up like EG, and perform much worse than EG or any other SAS code writer.

In the past two years I have encountered two dozen SAS customers who have tried to insert thousands of lines of data preparation code into the SAS Code node and force EM to compile them into its final score code. The distinction is this: EM is a predictive modeling tool, while one should use BASE, or Java, or Hive or C or whatever to prepare the input model universe as much as possible before feeding the data set into EM for modeling. The SAS Code node is intended for 'mid-flow' or 'mid-stream' supplements that require facilities beyond EM's built-in scope. As a practical matter, it is much neater and much easier to inspect if you just lay bare your thousands of steps and procedure code in BASE or SAS Studio, instead of clicking deep into the thick of EM code.

For the purpose of completing your scoring piece, you don't need to copy over anything related to validation, if there is any, in the Code node. Validation requires the target variable to be present, which is typically not the case in your scoring process.

As for your lop-sided segment % of 90%, I think you need to trace back to make sure all the custom code the client analyst injected into this EM process has been adequately recovered and brought into your scoring job at hand, before you 'scream' again. Chances are that if, as you mentioned, the client analyst's pre-partition imputation logic is missing, then the 90% smells just like a category collapse caused by the missing imputation piece. I concur with you that if the client analyst did the imputation prior to the partitioning process in EM, the analyst may very well have done it on the data set before plugging the data set into EM to begin with.

Hope this helps? Thank you for using SAS.

Best Regards

Jason Xin

Re: bayesian network in em

JasonXin — Tue, 13 Sep 2016 20:57:39 GMT

Hi,

EM treats the variable you specify as TARGET as nominal. It does not care whether the variable is numeric or character type. So you may need to pick a variable that does not have too many categories. Normally you get an idea of which variable should be the target from your 'business'. Since your goal, as indicated in your question, appears to be finding associations, not predicting, I would say just use EM to test several different non-interval variables and see which association finding makes more sense to you.

If you set Automatic Model Selection=YES, EM will select the 'best' network for you. As a starter, that is often sufficient for network selection. Since you are running the HPDM GUI, you are entitled to access the HPDM procedure documentation, in addition to the access from within the EM product (Help Menu --> Contents). The within-product access is not bad, but it is mixed with EM operation instructions.

Once you get access to the HPDM procedure documentation, Examples 5.1 to 5.6 under HPBNET should pretty much answer all your questions, except which one should be your target variable (that really is a business question, not a technical question). Hope this helps? Thank you for using SAS.

Best Regards

Jason Xin

Re: SAS EM prior probabilities

JasonXin — Tue, 13 Sep 2016 15:20:40 GMT

Hello, When you set it at 0.25/0.75 (assuming 0.25 for target=1 and 0.75 for target=0), you are telling the software (here EM; the same holds if you set the weight statement variable value as such) that the effective, logical event rate on the incoming target is 25%. In marketing terms, you already have a historical response rate of 25%. Many, if not all, marketing managers would ask why we need a response model, because normally it is when the past response rate is <=5% that people think building a response model makes sense to boost it. When you set it to 0.05 vs 0.95, you are telling EM the incoming historical event/response rate is 5%. Therefore, with 25 vs 75, your model is OK; there is just little room for improvement, so the ROC appears just like the 45-degree random-toss line. When you have 5 vs 95, the curve appears 'normal'. This, of course, is the case if you hold other things unchanged. Hope this helps? Thanks for using SAS. Jason Xin
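
To make "telling the software the event rate" concrete, here is one common way a prior is applied to model scores: rescaling each posterior probability from the event rate in the modeling sample to the stated prior. This is a generic textbook adjustment written in plain Python, offered as a sketch only; it is an assumption that it resembles what EM does with the setting, and the function and argument names are made up.

```python
def adjust_for_prior(p, sample_event_rate, prior_event_rate):
    """Rescale a posterior probability p, estimated on a sample whose event
    rate is sample_event_rate, so that it reflects prior_event_rate instead."""
    event = p * prior_event_rate / sample_event_rate
    non_event = (1 - p) * (1 - prior_event_rate) / (1 - sample_event_rate)
    return event / (event + non_event)

# A score equal to the sample event rate maps onto the stated prior.
print(round(adjust_for_prior(0.5, sample_event_rate=0.5, prior_event_rate=0.25), 6))  # 0.25
print(round(adjust_for_prior(0.5, sample_event_rate=0.5, prior_event_rate=0.05), 6))  # 0.05
```

The same raw score of 0.5 reads as a 25% or a 5% event probability depending on which prior the software has been told.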

Re: Imputing vs Rejecting

JasonXin — Thu, 08 Sep 2016 17:52:10 GMT

Hi, The act of imputing is mainly to keep the observations, in other words to preserve the model universe. The price is how much distortion you can accept. The least distortive approach is to find out the reason behind the missing values.

It is very rare that the model universe has all its data sourced from just 1 or 2 tables. It is almost always the case that the model universe is assembled from various sources, easily in the range of >10 tables. One top source of missing values is what I sometimes call 'left join syndrome'. The left-side table is your master table of 160,000 IDs, but the right-hand table may only have 52% of the IDs that can match to the left-hand side. So you have ~48% missing on all the variables you are appending from the right-hand table. Now the nature of the right-hand table is key to your imputation. It is not really technique. It is business knowledge. The question is whether a table missing 48% is useful overall at all. If the answer is YES, then you can dive into individual variables. It is good practice to keep a 'missing lineage' when you merge throughout the universe preparation process.

After you go through the business background, here are some rules of thumb. For categorical/nominal variables with >50% missing, I would drop them, regardless of whether this is clustering or a supervised model. Because if you have many categories and you group the missing portion with one of them, you have no grounds to promote that non-missing group to dominate the variable. The artificial impact of this practice is more severe in clustering than in a supervised model. In, say, decision tree modeling, carrying missing values as they are can add value to the model with little distortion. Clustering does not have this sort of mechanism. If you assign a unique value to replace the missing portion, you create a dominating but artificial value. This is where it becomes tricky, depending on how your clustering solution parametrizes the categorical variable.

If the categorical variable has >> 50% non-missing, I am comfortable grouping the missing portion with one of the non-missing groups. In SAS clustering, there is a random option that allows you to impute according to the distribution of the non-missing. This is actually available for both categorical and interval variables.

As for interval variables, if you have many input variables to spend, you can afford to raise the non-missing requirement % bar and drop more variables. If you don't have many variables, you may tolerate variables that have many missing values. In playing with the requirement %, you need to closely consult the definition of the variable. Some 'important' variables with a large % missing may have to stay. In other words, you need to balance. One recommendation is to try different imputation methods (mean, median, random) and assess their respective impact on your clustering solution. When there are many variables, you may consider variable clustering with different imputation methods and assess the impact accordingly. There is no hard-and-fast rule as to which way is better. This is where packages like SAS EM provide a huge productivity edge, in that they document and compare more efficiently than code programming.

One thing special about clustering is scale. It is necessary to scale all the input variables together for clustering. Whether you should impute before or after re-scaling/standardization is another layer of complexity. There are other aspects related to which distance measure you use in your clustering. I will leave that for another day. Hope this helps? Thanks for using SAS. Best Regards Jason Xin
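
The 'left join syndrome' and the 'missing lineage' idea can be sketched as follows. This is plain Python with hypothetical table and variable names (a 100-ID master list and a "bureau" table covering 52 of them), scaled down from the 160,000-ID example; it is not SAS merge code.

```python
def left_join_with_lineage(master_ids, right_table, source_name):
    """Left-join `right_table` (dict: id -> dict of variables) onto the master
    ID list. Unmatched IDs get None for every appended variable, plus a
    lineage flag recording that the gap came from the join, not the source."""
    variables = sorted({v for row in right_table.values() for v in row})
    joined = []
    for id_ in master_ids:
        row = right_table.get(id_)
        record = {"id": id_, f"matched_{source_name}": row is not None}
        for v in variables:
            record[v] = row.get(v) if row is not None else None
        joined.append(record)
    return joined

def missing_rate(rows, variable):
    return sum(r[variable] is None for r in rows) / len(rows)

# Hypothetical: 100 master IDs, right-hand table covers 52 of them.
master = list(range(100))
bureau = {i: {"score": 600 + i} for i in range(52)}

rows = left_join_with_lineage(master, bureau, "bureau")
print(missing_rate(rows, "score"))                          # 0.48
print(sum(r["matched_bureau"] for r in rows) / len(rows))   # 0.52
```

Keeping the matched flag alongside the data lets you tell later whether a missing value came from an unmatched ID or was already missing in the source table, which is the business question that should drive the imputation.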

Re: Variables in Random Forests in SAS EM

JasonXin — Wed, 31 Aug 2016 23:43:26 GMT

Hi, Within EM, you can attach a Score node to the HP Forest node. Then, at the Score node, go to Results, then View, and navigate to where you normally pick up SAS code. There you can pick up the RF scoring piece, which is actually batch code that saves the PROC HPFOREST modeling syntax (invoked by your EM node operation), saves the model info to a directory location (which you can copy to another location to facilitate your EG scoring), plus syntax that calls PROC HP4SCORE to conduct the RF scoring. You can copy out the whole batch code and deploy it in EG. There are macro variables in the scoring piece that allow you to specify input and output files. In another in-memory analytics product, IMSTAT, when you build an RF model using the RANDOMWOODS statement, you can opt to save the scoring piece as .sas code. Especially when your RF is fairly complicated, the .sas file can become big. The file size can easily be >100MB. The HP4SCORE procedure used in HP EM creates a SAS proprietary binary file to capture the RF model info needed for scoring. The binary file is efficient and 'nimble', only you have to have the HPDM product installed to run it. It is highly recommended that scoring RF models happens in some kind of in-memory fashion, which is not the typical 'way of life' for EG. Among the HPDM machine learning nodes and procedures, RF (HPFOREST) is the only one that requires a separate HP procedure to support scoring. Hope this helps? Thanks for using SAS. Jason Xin

Re: Stochastic Gradient Boosting

JasonXin — Sat, 27 Aug 2016 18:24:23 GMT

Hi, The procedure that runs behind EM's non-HP GB node is proc treeboost. That is not an HP procedure. If you have an EM license, your work in the EM GUI is supported by SAS technical support. But your usage of proc treeboost is not officially supported. Depending on your specific motivation for wanting an HP version of GB, proc treeboost can still have some benefits; some EM users have told me they simply like 'line-coding' better than the GUI. (I have a blog that has the syntax of proc treeboost.) No, the SAS HPDM package does not have an HP GB procedure or EM GUI node, unlike random forest, which has HPFOREST plus a separate HP Forest node in HPDM. The Stochastic Gradient Boost node has been in regular, non-HP EM for many years, while random forest was added only several years ago. HP Forest was introduced into EM in HP mode. Therefore, the evolution paths for GB and RF in EM are different. In Viya, the latest SAS high-performance, in-memory offering, a GB procedure is included to run in distributed mode on big data sets. It is very scalable, if scalability is why you asked this question. Viya also supports calling the SAS API from languages like Python, Lua or Java. That means Viya has the potential (not a guarantee) to support some custom GB implementations. Hope this helps? Thank you for using SAS. Jason Xin