Sample after data partitioning

Occasional Contributor
Posts: 15

Hi everyone,

 

Everything below refers to SAS Enterprise Miner.

While building different models I found that KNN is by far the most time-consuming method. For example, where the Neural Network takes a few minutes to complete, KNN can take several hours on the same training and validation samples. I have decided not to treat KNN as a workable solution for our project, but in order to at least "get some results" I would like to apply it to smaller subsamples of both the EXISTING training and validation data sets. To be specific:

(1) I have already run Partition, Metadata (role assignment), Replacement, Impute and PCA before KNN, and I want to apply the results of PCA (i.e. the generated principal components) to the KNN method;

(2) Between the PCA node and the KNN node, I would like to resample both the training and validation data sets at the same rate, e.g. 10% of the exported PCA training data set and 10% of the exported PCA validation data set;

(3) The SAS Code node will not be considered, since it is not easy for other users to work with.

This is a really difficult problem for me. If you have any ideas, please share them; I would really appreciate it.

Thanks very much!



All Replies
SAS Employee
Posts: 6

Re: Sample after data partitioning

You should use the Sample Node (http://go.documentation.sas.com/?docsetId=emref&docsetTarget=p15ebk1ysuwhqln11brhbopvmpj7.htm&docset...)

 

This should get you what you need!

Occasional Contributor
Posts: 15

Re: Sample after data partitioning

Posted in reply to Capt_VA_SAS

Hi Capt_VA_SAS,

 

No offense, but I think you either misunderstood the situation I described or did not try to reproduce it.

To be specific, if you place the "Data Partition" node (assume Train:Validation = 70:30) before the "Sample" node (assume 10% sampling), the "Imported Data" property of the "Sample" node shows that only the training data set from the "Data Partition" node was imported. In other words, you can sample only from the training data set, not from both the training AND validation data sets.

 

This requirement is important to me because some models (e.g. kNN) DO NOT WORK on large data sets (e.g. 2 million observations). If I want to apply the same partition, imputation and variable selection steps to all models (e.g. random forest, logistic regression and kNN) and then pass only a subsample to the kNN method, I need a small sample (e.g. 100 or 200 thousand observations) AFTER those steps. This is what the SAS EM Sample node CANNOT do at present.
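For a sense of scale, here is a small arithmetic sketch. It combines the 2 million observations with the 70:30 partition and the 10% sampling rate from the example; combining them this way is an assumption, not something stated in the post.

```python
# Assumed figures: 2,000,000 observations, a 70:30 train/validation
# partition, and a 10% sample drawn from each partition.
total_obs = 2_000_000
train_pct, valid_pct = 70, 30
sample_pct = 10

# Integer arithmetic keeps the counts exact.
train_n = total_obs * train_pct // 100
valid_n = total_obs * valid_pct // 100

train_sample_n = train_n * sample_pct // 100
valid_sample_n = valid_n * sample_pct // 100
total_sample_n = train_sample_n + valid_sample_n

print(train_n, valid_n)                # 1400000 600000
print(train_sample_n, valid_sample_n)  # 140000 60000
print(total_sample_n)                  # 200000
```

Under these assumptions, a 10% sample of each partition gives about 200 thousand observations in total, which is within the range mentioned.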

 

If you do have a good solution, please tell me; I would really appreciate it.

Many thanks.

SAS Employee
Posts: 6

Re: Sample after data partitioning

Hi YG1992,
So I've been mulling over this issue some and have a few suggestions for you. You may have thought of these already but I'll throw them out there.

1. Sample smaller and further upstream, before splitting into Train and Validate. Now, I totally get it: sometimes the last thing you want to do is re-run all of your candidate models from scratch, especially if that takes hours to complete. Thus, #2...
2. Create a separate process flow for KNN where you're sampling upstream. Once you're done, you can still compare its performance against the other models (if not through the Model Comparison node, then with PROC EYEBALL).
These are less than ideal, but at least you can get there. Sorry I don't have a better answer.
Solution
‎02-04-2018 01:43 PM
Occasional Contributor
Posts: 15

Re: Sample after data partitioning

Posted in reply to Capt_VA_SAS

Hi Capt_VA_SAS,

 

Sorry for my late reply, and many thanks for your suggestions; I really appreciate your efforts. Your second suggestion is workable in theory and I understand it, but unfortunately I cannot apply it to our workflow, since my boss doesn't want anything complicated (although it is not complicated for me).

The good news is that I finally solved this "sample after partition" problem using the SAS Code node and the EM macro variables. Using simple SQL statements that generate new export data sets, we successfully drew a smaller sample for the kNN method separately.
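The post does not include the actual code. As a rough sketch of the idea only, written in Python rather than SAS and therefore not the author's code, the step amounts to drawing the same fraction of rows independently from the training and validation tables and passing the two reduced tables on. The function name `sample_partition`, the toy data, the 10% rate and the seeds are all illustrative assumptions.

```python
import random


def sample_partition(rows, rate, seed):
    """Return a simple random sample of about `rate` of `rows`, without replacement."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    k = round(len(rows) * rate)
    rng = random.Random(seed)  # fixed seed so reruns give the same sample
    return rng.sample(rows, k)


# Hypothetical stand-ins for the two partitions coming out of the PCA step.
train = [{"id": i, "pc1": i * 0.1} for i in range(7000)]
validate = [{"id": i, "pc1": i * 0.1} for i in range(7000, 10000)]

# Sample each partition separately so the train/validation roles are preserved.
train_small = sample_partition(train, rate=0.10, seed=1)
validate_small = sample_partition(validate, rate=0.10, seed=2)

print(len(train_small), len(validate_small))
```

Sampling each partition on its own, instead of pooling and re-splitting, keeps every sampled row in the role it was given by the original partition, so the kNN results stay comparable with models trained on the full partitions.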

Thank you again for your time and suggestions. Hope you have a nice week!

☑ This topic is solved.
