About munitech4u

munitech4u · 04-25-2016

Does the scoring code take care of the decision node preference automatically?

munitech4u · 04-15-2016

Well, it is a smaller step in one of the loops that I was using. But this worked fine for me: %sysfunc(strip(&bin371));

JasonXin · 04-14-2016

Hi,

When we build EM flows, EM actually writes out SAS code in the background behind each (most) node. When a model is built and scoring code is generated, it typically retains all the analytically relevant preceding code leading up to the final scoring equation. For example, if a variable transformation node is involved in the flow, all the transformations are automatically retained, including all the renames, what-if... There is also a separate SAS Score Code Node that helps.

One may ask: I transformed 800 variables and derived 1000 variables, but only used 12 in the final model. Does the scoring code have all of them? No, the scoring code only contains what survives into the final model. If you see both score code and optimized score code, make sure you pick the optimized one.

The exception is custom code: if you insert your own code using a SAS Code node, it is not automatically copied over. Also, this 'way' may not work for some methods like random forest, but I recall gradient boosting is fine.

Over the years I have seen EM users going back to the transformation node to pick up the code behind the scenes, then study it and improve their SAS programming that way. Although conventions like _ appear a bit odd, the coding there is 'best'.

Enjoy.
Jason Xin

munitech4u · 04-14-2016

Is there a way I can optimize it? Here I am setting the dataset for each variable. Can I avoid that and let the code write only the IF conditions and then run?

JasonXin · 04-13-2016

Under ASSESS, there is a Decisions Node. If you have a multi-class target, it should be populated there for you to customize the proportions. I am not in front of EM, but recall there is a Custom Editor you can click. Jason Xin

munitech4u · 04-12-2016

Well, I am having trouble running it again. Now the target variable is different, with 4 classes: 0, 1, 2, 3, with 0 being the reference. I am not sure why it is not producing any output. I have attached the log.

FreelanceReinh · 03-30-2016

@munitech4u wrote:
%do %until(&n1=&n2); Now this is sum of g in 2 tables. and we want to run the loop, until it gets equal. Don't you think, there are chances that these numbers may match prior to optimum? e.g. 1+2+5+10 and 1+3+4+10 are same, especially when the observations get large

I don't think so. My reasoning behind this criterion was: After the initial assignment of group numbers on one "side" (i.e. based on ID1) in dataset HAVE2, the same procedure of reassignments of group numbers is applied alternately based on ID2, ID1, ID2, ID1 and so on. As the new group numbers are always less than or equal to the previous group numbers (they are replaced by the minimum of their respective BY group), the sum of all group numbers is monotonically decreasing from one iteration step to the next.

If it stays the same in one step (&n1=&n2), not a single group number may have changed in this step (because otherwise the decrease of this group number would have reduced the sum of all group numbers as well). This means, after the group numbers have been made consistent with regard to one of the two IDs in the previous iteration step, they now turn out to be already consistent also with regard to the other ID. Hence, the algorithm terminates.

A situation where the two sequences of group numbers from two consecutive steps are (1, 2, 5, 10) and (1, 3, 4, 10) is impossible, because group numbers can never increase (from 2 to 3 or from 4 to 5). The same argument holds for any two different sequences with equal sums (&n1=&n2).

Another question would be: Is it possible that the algorithm ends up in a sort of "local" optimum which is not the "global" optimum? I say no. Because: In this case there would be at least two different IDs x, y such that their cluster numbers were equal (say, 1) in the "global optimum" solution and unequal (say, 1 and 2) in the "local optimum" solution. The former implies that there is a (finite) chain of linked IDs (linked by pairs in dataset HAVE2) from x to y all of which have cluster number 1. Following this same chain from both ends (x in cluster 1 and y in cluster 2) in the "local optimum" solution necessarily leads to a contradiction (an ID with two different cluster numbers).

Thanks for analyzing the algorithm so thoroughly. It would indeed be interesting to prove its correctness mathematically. Since other solutions have now been suggested, you can also compare the results for validation purposes.
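The alternating reassignment and the stopping rule described above can be sketched as follows. This is only an illustration in Python, not the SAS macro from the thread: the function names, the sample pairs, and the initial numbering rule (group numbers handed out in order of first appearance of ID1) are assumptions made for the example.

```python
def relabel(pairs, groups, side):
    """Replace each group number by the minimum within its BY group on one side.

    side=0 groups rows by ID1, side=1 groups rows by ID2.
    """
    lowest = {}
    for row, number in zip(pairs, groups):
        key = row[side]
        lowest[key] = min(number, lowest.get(key, number))
    return [lowest[row[side]] for row in pairs]


def cluster_pairs(pairs):
    """Assign one group number per (ID1, ID2) pair by alternating minimum propagation."""
    # Assumed initial assignment: rows sharing ID1 get the same number.
    first_seen = {}
    groups = []
    for id1, _ in pairs:
        groups.append(first_seen.setdefault(id1, len(first_seen) + 1))

    side = 1                 # ID1 is already consistent, so start with ID2
    total = sum(groups)      # plays the role of &n1
    while True:
        groups = relabel(pairs, groups, side)
        new_total = sum(groups)   # plays the role of &n2
        if new_total == total:
            # No number decreased in this step, so both sides are consistent.
            return groups
        total = new_total
        side = 1 - side      # alternate ID2, ID1, ID2, ...


# Hypothetical sample data: a-x-b-y-d form one cluster, c-z-e another.
pairs = [("a", "x"), ("b", "x"), ("b", "y"), ("c", "z"), ("d", "y"), ("e", "z")]
result = cluster_pairs(pairs)
print(result)  # [1, 1, 1, 3, 1, 3]
```

On this sample the sums of the group numbers go 17, 12, 11, 10, 10, so the loop stops at the first step that leaves the sum unchanged, which matches the monotonicity argument.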

FreelanceReinh · 03-17-2016

@munitech4u wrote:
Thanks, but do you recommend running it on a dataset as large as 4 million?

No, given this new information I would choose a different approach:

    /* "Blow up" the test dataset and add an ID to identify observations */
    data Remission;
      set Remission;
      do i=1 to 148149;
        id=(_n_-1)*148149+i;
        output;
      end;
      drop i;
    run; /* 4000023 obs. */

    /* Run an arbitrary logistic regression, write predicted probabilities to dataset PRED */
    proc logistic data=Remission;
      model remiss(event='1')=li;
      output out=pred p=p;
    run;

    /* Select "tied" observations */
    proc sql;
      create table tied_obs(drop=_level_) as
      select *
      from pred
      group by p
      having count(distinct remiss)>1;
    quit; /* 1185192 obs. */

This has the additional advantage that you have the other variables from dataset PRED in dataset TIED_OBS, so you can start your analysis immediately.

Edit: Simplified HAVING condition: count(*)>1 was redundant.
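The selection rule in the PROC SQL step (keep every observation whose predicted probability is shared by more than one distinct outcome) can be sketched outside SAS as well. The following is a minimal Python illustration under assumed data: the function name and the five sample rows are made up, and only the column names remiss and p come from the code above.

```python
from collections import defaultdict


def tied_observations(rows):
    """Return the rows whose predicted probability p occurs with more than one outcome."""
    outcomes_by_p = defaultdict(set)
    for row in rows:
        outcomes_by_p[row["p"]].add(row["remiss"])
    # More than one distinct outcome already implies more than one row,
    # so a separate row-count condition is not needed.
    return [row for row in rows if len(outcomes_by_p[row["p"]]) > 1]


# Hypothetical predictions: ids 1 and 2 are tied, ids 3 and 4 share p but not the outcome mix.
rows = [
    {"id": 1, "remiss": 1, "p": 0.8},
    {"id": 2, "remiss": 0, "p": 0.8},
    {"id": 3, "remiss": 1, "p": 0.3},
    {"id": 4, "remiss": 1, "p": 0.3},
    {"id": 5, "remiss": 0, "p": 0.1},
]
tied = tied_observations(rows)
tied_ids = [row["id"] for row in tied]
print(tied_ids)  # [1, 2]
```

As in the SQL version, whole rows are returned, so any other variables carried along with the predictions remain available for the follow-up analysis.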

Reeza · 03-14-2016

You can specify a PRIOR dataset in your SCORE statement.
http://support.sas.com/documentation/cdl/en/statug/65328/HTML/default/viewer.htm#statug_logistic_syntax31.htm
I'm not sure that's a valid way to develop a model.

munitech4u · 03-14-2016

Well, I looked at the log: one of the input variables has more than 32000 categories. I think that is the problem.

M_Maldonado · 03-04-2016

My bad, I don't know how to count... I meant to ask for the misclassification rate (Misc) of your 4 models and their ensemble. I am curious whether the ensemble of all 4 models is getting worse, and whether the Tree or the SVM is tripping it up... While you are at it, can you include the classification charts as well? Maybe that will shed more light on what's going on. Thanks!

WendyCzika · 03-03-2016

You need to use the Metadata node following the Neural Network node to change the roles of the observed target to REJECTED and the predicted target (posterior probability for the event level if a nominal target) to TARGET. Then use a Decision Tree node after that.

munitech4u · 02-29-2016

It says a run-time error was encountered and to check the log, but the log doesn't show any error. Towards the end I see the following, which might be the reason:

    47423 %let _EM_TREECONVERSION=0;
    47424 data _null_;
    47425 set EMWS1.EM_NODEID end=eof;
    47426 where upcase(Component) ='DECISIONTREE' and CLASS = 'SASHELP.EMMODL.DECISIONTREE.CLASS';
    47427 if eof then call symput('_EM_TREECONVERSION', '1');
    47428 run;
    NOTE: There were 0 observations read from the data set EMWS1.EM_NODEID.
    WHERE (UPCASE(Component)='DECISIONTREE') and (CLASS='SASHELP.EMMODL.DECISIONTREE.CLASS');

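Tom · 07-24-2015

Use dataset B to generate the code to subset dataset A.

    filename code temp;

    /* Write one IF statement per row of B to the temporary file */
    data _null_;
      set b;
      file code;
      put 'if ' var '>' value 'then delete;';
    run;

    /* Apply the generated IF statements while reading A */
    data want;
      set A;
      %include code / source2;
    run;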
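The idea of driving the subset of A from the rows of B can also be sketched without code generation. This is a rough Python analogue, not Tom's SAS program: the function name, the cutoff list standing in for dataset B, and the sample rows standing in for dataset A are all assumptions for illustration.

```python
def build_filter(cutoffs):
    """Build a keep/drop test from (variable, cutoff) rows, the role dataset B plays."""
    def keep(row):
        # Mirrors one generated "if var > value then delete;" statement per cutoff.
        return not any(row[var] > value for var, value in cutoffs)
    return keep


# Hypothetical stand-ins for dataset B (cutoffs) and dataset A (rows to subset).
cutoffs = [("x", 3), ("y", 5)]
a = [
    {"x": 1, "y": 10},
    {"x": 5, "y": 2},
    {"x": 2, "y": 3},
]

keep = build_filter(cutoffs)
want = [row for row in a if keep(row)]
print(want)  # [{'x': 2, 'y': 3}]
```

The cutoffs are read once and then applied in a single pass over A, which is the same efficiency argument as generating all the IF statements first and including them in one data step.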

munitech4u · 06-16-2015

Thanks for your help. Since there can be duplicates for the geo_key, it is giving me multiple columns, around col1-col50, so just summing across col1 won't solve my problem. Instead I summed across all rows and set a new indicator to 1 if the value is >=1, and then summed on that variable, which seems to be doing the job.

Online Status: Offline
Date Last Visited: 04-04-2019 08:32 AM

Re: How to check duplicate customer IDs using SAS

Re: How to check duplicate customer IDs using SAS

Re: How to check duplicate customer IDs using SAS

Re: How to check duplicate customer IDs using SAS

Re: How to check duplicate customer IDs using SAS

How to check duplicate customer IDs using SAS

Re: Calculating GEODIST for various combinations, cross looping

Calculating GEODIST for various combinations, cross looping

Ensemble of Random Forest and Neural network node in Enterprise Miner

Re: Looping over a dataset to check values in another dataset

Re: How to check duplicate customer IDs using SAS

Re: How to check duplicate customer IDs using SAS

Re: How to check duplicate customer IDs using SAS

Re: Looping over a dataset to check values in another dataset

Re: Looping over a dataset to check values in another dataset

Re: How to store proc freq output of n variables in one dataset?

Re: Scoring new data from random forest node

Re: Positional parameters error sas macro

Re: How to use the scoring code from gradient boosting?

Re: Why this code is failing?

Re: How to perform oversampling using EM, when target variable is mult...

Re: EM Gradient Boosting node, not producing any output

Re: How to perform this in sas?

Re: Proc logistic, how to get the observations with ties?

Re: Using bayesian analysis in SAS.

Re: Why decision tree is throwing run time error?

Re: Why ensemble is decreasing my overall lift?

Re: How to access Variable importance in neural network in EM?

Re: Run time error occurred in EM,while creating an ensemble of model

Re: how to subset efficiently based on cut off value from other datase...

Re: How to select count of distinct key based on indicator in another ...