Hi,
First, thanks for using SAS. My name is Jason Xin, an advanced analytics solution architect at SAS Institute.
Your raw data with the 6-to-1 ratio is not really that imbalanced from a predictive modeling perspective. A 'response rate' (the percentage of 1s in the model universe) anywhere from 40% down to 0.5% is considered 'normal', 'not a rare event', or 'just fine'. As a matter of fact, your raw response rate of roughly 14-16% is ideal for seeking lift from predictive models. If the raw response rate is very low, it is easy to build a model that shows great lift; we might even say the lower the incoming response rate, the easier it is to boost apparent performance. If the raw rate is fairly high, say 35%, it becomes challenging to build a model with great lift or ROC.
An ideal 6-to-1 response ratio does not necessarily make the sample right, or true to the business at hand. The reality is that the constraints you face in collecting the data and/or assembling the model universe may well differ from where and when you want to deploy the model. In statistical terms, the sample may not reflect the source population or target audience. This is typical, and quite frankly it is the only legitimate incentive to adjust the sample.
All the remarks above are independent of random forest being the method you are tinkering with. They are general model design practice.
Now back to the HP Forest (random forest, RF) procedure. Unlike HPLOGISTIC, it does not have a WEIGHT statement. Weighting tells a procedure to treat one physical record as if the data set contained many copies of it. By assigning one weight to the event entries and another to the non-event entries, you virtually alter the effective count ratio between YES and NO. But machine learning methods like RF build models by splitting the sample into subsamples and finally assembling/voting them back together. There is no practical way (this is not a SAS problem; this is everyone's problem) to propagate a weight properly down to the subsamples once it has been imposed on the whole model universe, the way HPLOGISTIC does. RF actually thrives on the target ratio being skewed around as it splits and builds, going deeper and deeper.
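To make the contrast concrete, here is a minimal sketch of weighting in HPLOGISTIC. The data set and variable names (mydata, response, x1-x3, w) are placeholders, not from your post:

```sas
/* Sketch only: names are placeholders. */
data mydata_w;
   set mydata;
   /* treat each event record as if the data set held 3 copies of it */
   if response = 1 then w = 3;
   else w = 1;
run;

proc hplogistic data=mydata_w;
   model response(event='1') = x1 x2 x3;
   weight w;        /* PROC HPFOREST has no equivalent statement */
run;
```

The WEIGHT statement rescales every record's contribution to the likelihood in one pass over the whole table, which is exactly the step that has no clean analogue once RF starts drawing subsamples.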
Return to your question.
1. If I were you, I would stop the first approach entirely, i.e., "the pre-sampling approach (throwing away a large proportion of the non-event observations)". If 50-50 is true to your business, you could randomly target this group and then use the response data from that random campaign to build a model; if the true rate really is 50-50, a random toss should perform very close to a model anyway.
2. You can very well stick with your second practice, if you are comfortable that the 6-to-1 ratio is representative of your population.
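For reference, the pre-sampling in point 1 is typically done with PROC SURVEYSELECT; a sketch, with placeholder names, shown only so it is clear what I am advising against:

```sas
/* Sketch only: mydata and response are placeholders. */
proc sort data=mydata;
   by response;                 /* STRATA requires sorted input */
run;

proc surveyselect data=mydata out=balanced seed=12345
                  method=srs samprate=(0.17 1);
   /* with response sorted ascending (0 first), keep ~1/6 of the
      non-events and all of the events, moving 6:1 toward ~1:1 */
   strata response;
run;
```

Note that any model fitted on `balanced` sees an artificial 50-50 world, so its predicted probabilities would need to be re-calibrated back to the true base rate before deployment.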
SAS enabled random forest as a high-performance procedure because with big data (tall tables and/or wide tables, i.e., many variables and complex relationships), implementing RF generally yields better model accuracy as you train deeper and engage more data. RF has bagging built in, so it is less prone to over-fitting.
Inside PROC HPFOREST, the procedure does not automatically (or deliberately) seek to balance the target, although as the trees split randomly from the root, a node may well hit a ratio near 50-50. That is automatic, but coincidental.
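A minimal PROC HPFOREST sketch on the unsampled 6-to-1 data; the data set and variable names are placeholders, and the option values are common starting points rather than recommendations for your data:

```sas
/* Sketch only: names and option values are placeholders. */
proc hpforest data=mydata
              maxtrees=200          /* number of trees to grow          */
              vars_to_try=10        /* candidate inputs per split       */
              trainfraction=0.6;    /* bagging fraction per tree        */
   target response / level=binary;
   input x1 x2 x3  / level=interval;
   input region    / level=nominal;
   /* out-of-bag fit statistics by tree count, for choosing maxtrees */
   ods output fitstatistics=fitstats;
run;
```

Because each tree is grown on its own bagged subsample, the event/non-event mix a given split sees drifts naturally, which is the behavior described above.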
Hope this helps. Happy holidays. Thanks.
Best Regards
Jason Xin