01-19-2017
JasonXin
SAS Employee
Member since
06-23-2011
- 122 Posts
- 5 Likes Given
- 17 Solutions
- 42 Likes Received
Activity Feed for JasonXin
- Got a Like for Re: proc glm class variables descending. 08-30-2024 10:47 AM
- Got a Like for Re: proc glm class variables descending. 04-11-2023 09:50 PM
- Got a Like for Re: How many leaves and nodes should a tree. 11-24-2021 05:14 PM
- Got a Like for Re: How to use the scoring code from gradient boosting?. 06-20-2019 06:00 PM
- Got a Like for Re: Imputing vs Rejecting. 01-11-2017 09:59 AM
- Got a Like for Re: What is the best way to continuously train a SAS EM Model?. 01-06-2017 04:37 PM
- Posted Re: How many leaves and nodes should a tree on SAS Data Science. 12-04-2016 09:48 AM
- Posted Re: How many leaves and nodes should a tree on SAS Data Science. 12-03-2016 03:54 PM
- Posted Re: SAS EMiner Oversampling reduced the traget sample size on SAS Data Science. 12-01-2016 04:12 PM
- Posted Re: Enterprise miner Node Leaf size issues on SAS Data Science. 12-01-2016 04:03 PM
- Posted Re: Enterprise miner Node Leaf size issues on SAS Data Science. 11-30-2016 05:19 PM
- Posted Re: SAS Enterprise Miner GBM Node on SAS Data Science. 11-21-2016 05:50 PM
- Posted Re: SAS Enterprise Miner GBM Node on SAS Data Science. 11-21-2016 12:31 PM
- Got a Like for Re: How many leaves and nodes should a tree. 11-21-2016 10:24 AM
- Posted Re: Missing/Not Applicable Values for Interval Variable on SAS Data Science. 11-21-2016 09:42 AM
- Posted Re: SAS EMiner Oversampling reduced the traget sample size on SAS Data Science. 11-19-2016 11:58 AM
- Posted Re: Missing/Not Applicable Values for Interval Variable on SAS Data Science. 11-19-2016 11:46 AM
- Posted Re: How many leaves and nodes should a tree on SAS Data Science. 11-19-2016 10:51 AM
- Posted Re: Collaborative Filtering in SAS. Please help. Thank You on SAS Data Science. 11-18-2016 08:49 AM
- Posted Re: HPforest variable importance on SAS Data Science. 11-01-2016 09:42 PM
Latest posts by JasonXin
12-03-2016
03:54 PM
Hi, If you don't see the button to the right to click, open, and configure, chances are the image belongs to a different version of EM. Thanks. Jason Xin
12-01-2016
04:12 PM
Hi,
First of all, there is no oversampling node in EM; I assume you mean the Sample Node. The Sample Node offers random, systematic, First N, stratified... None of these lets you change the ratio between 1 and 0 on the target. The purpose of sampling is to take a subset, in one way or another, that represents the master source. The goal is to represent, not to alter. Oversampling, on the other hand, recomposes a sample, and therefore deliberately alters it. The Sample Node is often used in a situation like this: the qualified model universe has 20 million observations, and I need a 5% sample to make it workable in EM. In that sense, sampling is not really analytical/technical, while oversampling is analytics through and through. In other words, the reason you run sampling should not overlap with the reason driving oversampling, even though the act of oversampling per se is sampling. Hope this helps? Thank you for using SAS. Best. Jason Xin
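A minimal sketch of the distinction outside EM, using PROC SURVEYSELECT (data set and variable names are hypothetical): a plain 5% simple random sample represents the source, while a stratified sample with unequal rates recomposes the target mix.

/* A representative 5% sample: keeps the master source's 1/0 mix */
proc surveyselect data=master out=rep_sample
                  method=srs samprate=0.05 seed=20161201;
run;

/* Oversampling: recompose the mix -- keep all 1s, thin the 0s */
proc sort data=master;
  by target;
run;

proc surveyselect data=master out=over_sample
                  method=srs samprate=(0.10 1.00) seed=20161201;
  strata target;   /* rates apply to strata in sorted order: 0, then 1 */
run;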
12-01-2016
04:03 PM
I agree with your assessment about transmitting corporate data to SAS. It is doable; it just needs to go through some paperwork. The technology is ready, and SAS Technical Support has experience with it. But it depends on whether your company grants it. Thanks.
11-30-2016
05:19 PM
It seems your max branch remains at 2. For the sake of testing, I wonder if you can relax it to, say, 5 or 8. Also run StatExplore (if you have not) under Explore to profile the input variables. See if you have any significant/strong/dominating and highly categorical variables in the input set, like the most significant binary variable you mentioned. When I run into situations like this, I often hold out the (strong) categorical variables and build a DT with the continuous variables only, then add the held-out variables back to check their impact on the model.
11-21-2016
05:50 PM
Hi,
I have seen cases in the past where EM's GBM runs at a speed comparable to R integrated into the same flow, everything else held roughly equal. Yes, I have seen cases where GBM is slower than R, and vice versa. So there is little general to infer or conclude. To be really useful to you, I would eventually have to sit down in front of your data set and operations to help speed things up, as I have done several times in the past.
Generally speaking, EM spends a lot of resources running the GUI operations, writing and rewriting code in the background, something that running R through the integration node does not entail. Often when one EM node runs this slowly, it indicates that the workspace for the flow is likely running out of space: it is writing while it is swapping... That eventually becomes a SAS Management Console subject, where one can try to relocate and optimize space management.
If the GUI operation does not appeal to you that much, you can try the underlying procedure, TREEBOOST. If you search Google for "Jason Xin, treeboost", you should quickly get to the full-fledged sample code I published years ago. Once you finish modeling with the procedure code, you can bring the predicted values back into EM through the Model Import Node to align model comparison with the other models you are building in the EM GUI.
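For orientation, a minimal PROC TREEBOOST sketch, with hypothetical data set and variable names (option names may vary by EM release; the published sample code has the full detail):

proc treeboost data=train iterations=200 shrinkage=0.1
               maxbranch=2 leafsize=5;
  input x1 x2 x3 / level=interval;   /* interval inputs */
  input region   / level=nominal;    /* a categorical input */
  target bad / level=binary;         /* binary target */
  code file='treeboost_score.sas';   /* score code to bring back via Model Import */
run;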
Hope this helps. Thanks.
Jason Xin
11-21-2016
12:31 PM
Hi,
How many variables/observations are you trying with the GBM node? Is it possible to run variable selection before the GBM node?
EM's HPFOREST node runs at least with the multi-threading (SMP) ability of the CPUs; if you are configured to run on MPP (massively parallel processing), engaging, say, 32 or 48 computers, the speed and other performance are expected to be better than SMP.
The GBM node is not supported on SMP or MPP. It runs as a traditional single-threaded node; the node does not have an HP prefix in front of it (this is how you tell). The speed expectation, therefore, is not supposed to be in line with HPFOREST, aside from the inner algorithmic differences. This is, on the other hand, in no way suggesting that random forest runs faster than GBM, or vice versa.
The latest SAS Viya sports a genuine in-memory gradient boosting procedure and actions that scale on wide/tall tables, truly big data. SAS is not expected to upgrade the existing EM GBM node to something like "HPGBM".
Hope this helps? Thank you for using SAS.
Best Regards
Jason Xin
11-21-2016
09:42 AM
Hi,
Yes, the Distribution and Tree methods should both work. You can try both and compare. The Tree method borrows information from the other variables, while the Distribution method remains essentially univariate. Please pay attention to the distribution inside the non-missing subgroups plus the % size of the non-missing portion. For argument's sake, if you only have 1% non-missing, I would be hard-pressed to do it.
Converting to 'flags': this idea is always intriguing, in the sense that the resulting indicators by definition are associated with the source element. In the linear regression context, classically we 'stay away' from categorical variables, almost by instinct. But the facilities in EM and SAS/STAT are equally robust in supporting categorical variables, in variable selection and estimation, by way of, say, the CLASS statement. Chances are that if you derive indicators, you will only be able to use one of them, if any turns out useful at all. You could use a decision tree in EM to run a test. Make sure all the performance readings come off the validation data set.
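A minimal sketch of the CLASS-statement route in SAS/STAT (data set and variable names are hypothetical); the procedure generates the indicator coding internally, so there is no need to derive the flags by hand:

proc logistic data=train;
  class region (param=ref);            /* categorical input, reference coding */
  model bad(event='1') = x1 x2 region;
run;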
Best Regards
Jason Xin
11-19-2016
11:58 AM
Hi,
In EM, see the attached picture. Once you load the data into EM, the YES group (in the picture) should be your 1 group and the NO group should be your 0 group. Count=999 should be your 518 and Count=967 should be your 252. To the right, in place of 0.5081, enter 1. In place of 0.4919, enter 2.055555556 (=518/252). In plain English, doing so tells EM to treat the 518-strong 1 group as it is, and to treat the 252-strong 0 group as if there were 2.055555556*252 ≈ 518 of them. Logically.
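The same arithmetic can be expressed outside EM as a weight variable, a sketch assuming hypothetical data set and variable names:

data train_w;
  set train;
  if target = 1 then wt = 1;    /* the 518 ones, treated as they are */
  else wt = 518 / 252;          /* each of the 252 zeros counts as ~2.056 */
run;

proc logistic data=train_w;
  weight wt;                    /* balances the 1/0 groups logically */
  model target(event='1') = x1 x2;
run;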
Hope this helps?
Jason Xin
11-19-2016
11:46 AM
Hi,
If the missing values fall exactly along the line of 1 and 0, then simple imputation does not work, since you will run into complete or quasi-complete separation. Such variables will be, and should be, rejected by logistic regression or a neural network outright.
It is, however, not entirely hopeless, apart from the option of dropping them. If you do not have EM but have SAS/STAT, take a look into PROC MI. You may need to build your final models over the groups of values MI plugs in for you. If you have a license for EM, take a look at the Distribution option under the Impute Node. In some cases the Tree option may work, but depending on the other variables, you still may not be able to reduce the risk of quasi-complete separation. Tree imputation should be the secondary option to try after Distribution. Given that your target=1 group is typically very small proportionately, make sure the distribution of the non-missing values is large, 'normal', or sensible enough for you.
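A minimal PROC MI sketch (data set and variable names are hypothetical); it produces NIMPUTE completed data sets, stacked in the output with an _Imputation_ index, and you would fit your model within each group:

proc mi data=have out=imputed nimpute=5 seed=20161119;
  var x1 x2 x3;          /* variables with missing values to impute */
run;

/* fit the model within each completed data set */
proc logistic data=imputed;
  by _Imputation_;
  model target(event='1') = x1 x2 x3;
run;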
Hope this helps? Thank you for using SAS.
Jason Xin
11-19-2016
10:51 AM
2 Likes
Hi,
If you are using SAS software like Enterprise Miner or HPSPLIT, the default settings on these parameters, more often than not, serve you a fairly good baseline decision tree model.
In Enterprise Miner, where you can do what we call an interactive tree, you can inject any variable-based rules to stop, expand, or prune a tree. You can also combine this kind of 'manual' tree with machine-built trees. Machine-built trees are what most predictive modelers mean when they talk about decision tree modeling. I believe your question is about machine-built trees (DT).
Best this, best that: the key is one word, validation. Where to stop, how many trees, how many variables to try (in other words, if you have 500K variables, it is not a good idea to pump them all into the tree engine at once), pruning guidance, surrogates... should all be decided on hold-out samples. As for the deciding criterion (which I believe is what you are literally asking), cost-complexity, a balance between training and validation, outweighs so-called accuracy. Best practice typically involves rounds and rounds of tweaking.
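A minimal PROC HPSPLIT sketch of that validation-guided approach (data set and variable names are hypothetical), with pruning decided on a hold-out sample via cost-complexity:

proc hpsplit data=train seed=20161119 maxdepth=10;
  class bad region;                    /* categorical target and input */
  model bad(event='1') = x1 x2 region;
  partition fraction(validate=0.3);    /* hold-out sample guides pruning */
  grow gini;                           /* growth (split) criterion */
  prune costcomplexity;                /* cost-complexity, not raw accuracy */
run;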
In the latest and greatest SAS Viya ML suite, you have access to a facility called Autotuning that allows you to set ranges on the (hyper)parameters, like those mentioned in your question, and lets Viya tell you the optimal combination. The search routine goes beyond the brute-force nature of grid search (Latin hypercube, anyone?). It is directly and immediately scalable, letting the modeler run it against a huge data set in-memory.
So what is the best of this and that? Go to work. Decision trees are unlike many other methods and algorithms. In many cases, the best is when you see it, like gardening, because there is a visual tree for you to see.
Hope this helps?
Best Regards
Jason Xin
11-18-2016
08:49 AM
Hi,
On the product side, SAS IMSTAT directly supports recommendation modeling. Without IMSTAT, you may find several past papers to get you going on the subject. Here is one:
http://support.sas.com/resources/papers/proceedings14/1886-2014.pdf
Another interesting one is: http://support.sas.com/resources/papers/proceedings13/511-2013.pdf
Best Regards
Jason Xin
11-01-2016
09:42 PM
1. Yes, VARS_TO_TRY=n means SAS HPFOREST (and, really, any package that wants to legitimately call itself RF) will randomly pick n out of the total number of variables input by the user to do the splitting. Yes, the same n applies at every branch split. The rationale is that split criteria tend to become more and more 'ad hoc' as larger and larger numbers of input variables are put to the test for splitting, regardless of how one adjusts (Kass or otherwise). So one revolutionary thing about RF is not to run the 'split significance test' on all the fed input variables; just pick a smaller number. The rule of thumb is the square root of the total (see the sketch at the end of this post). After the n variables are randomly picked, the split test is run against them. One direction under way is to make split criteria more simulative.
2. A negative Gini reduction intuitively means you should drop the variable, since it is not a significant enough contributor. It is common practice to run RF once, drop the variables with negative Gini reduction, and re-run RF, thereby using RF both to select variables and to build the model.
3. If you really believe there is such a thing as science in data science or statistics, or that there should be, then there is nothing one should or can generalize against one method or another. Since when did declaring one method universally better than another become the mission of data science, or of any science at all? My suggestion to you, my friend, is to focus on the work at hand and on delivering value to those who hire you and need you to work. Let fashion be fashion. No matter which way the central tendency is going, study the data first. Spend most of your time studying data, not methods. Thank you for using SAS. Best Regards. Jason Xin
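A minimal PROC HPFOREST sketch tying points 1 and 2 together (data set and variable names are hypothetical, and the ODS table name is an assumption): with 49 inputs, the square-root rule of thumb suggests VARS_TO_TRY=7.

proc hpforest data=train maxtrees=200 vars_to_try=7 seed=20161101;
  input x1-x45 / level=interval;           /* interval inputs */
  input c1-c4  / level=nominal;            /* categorical inputs */
  target bad / level=binary;
  /* assumed ODS table name for the importance report */
  ods output VariableImportance=varimp;
run;

/* point 2: drop variables with negative Gini reduction in varimp, then re-run */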
09-20-2016
11:15 PM
1 Like
Hi, lokendra_devang an_corecompete_ com,
If the client did the imputation before partitioning using the EM Impute node, the imputation data steps should have been collected into the eventual scoring piece by the Score Node or Score Export Node, automatically. I attach a two-page PDF taken from the EM user guide that shows which nodes generate steps that are automatically appended together by the EM Score nodes.
By design, nothing written under the EM SAS Code node is automatically picked up into the final scoring equation by the Score Node or Score Export Node. The subtlety is this: if and when set up properly (selecting the right tool type and options), the SAS Code node runs as expected and produces what is expected. This, however, does not grant a license for the 'correct' stuff to be automatically incorporated into the final scoring equation. The reason is essentially the price we pay for the great flexibility afforded by the SAS Code node: almost anything you have licensed, or anything you can stick into the BASE editor and make run, can be inserted into the SAS Code node and run successfully. If the SAS Code node were designed to automatically compile everything placed under it into the score code, EM could very well end up like EG, and perform much worse than EG or any other SAS code writer.
In the past two years I have encountered two dozen SAS customers who tried to insert thousands of lines of data preparation code into the SAS Code node and force EM to compile them into its final score code. The distinction is this: EM is a predictive modeling tool, and one should use BASE, Java, Hive, C, or whatever else to prepare the input model universe as much as possible before feeding the data set into EM for modeling. The SAS Code node is intended for 'mid-flow' or 'mid-stream' supplements that require facilities beyond EM's built-in scope. As a practical matter, it is much neater and easier to inspect if you lay your thousands of steps and procedure code bare in BASE or SAS Studio, instead of clicking deep into the thick of EM code.
For the purpose of completing your scoring piece, you don't need to copy over everything in the Code node that relates to validation, if any. Validation requires a target variable to be present, which typically is not available in your scoring process.
As for your lopsided segment share of 90%, I think you need to trace back and make sure all the custom code the client analyst injected into this EM process has been adequately recovered and brought back into the scoring job at hand before you 'scream' again. Chances are, as you mentioned, the client analyst's pre-partition imputation logic is missing, in which case the 90% smells just like a category collapse due to that missing imputation piece. I concur with you: if the client analyst did the imputation prior to the partitioning process in EM, the analyst may very well have done it on the data set before plugging the data set into EM to begin with.
Hope this helps? Thank you for using SAS.
Best Regards
Jason Xin