YG1992
Obsidian | Level 7

Hi everyone,

 

Everything below concerns SAS Enterprise Miner.

While building different models, I found that KNN is by far the most time-consuming method. For example, where the Neural Network takes a few minutes to complete, KNN takes several hours on the same training and validation samples. I have decided not to treat KNN as a workable solution for our project, but in order to at least "get some results", I would like to apply it to smaller subsamples of both the EXISTING training and validation datasets. To be specific:

 

(1) I have already run Partition, Metadata (role assignment), Replacement, Impute and PCA before KNN, and I want to feed the results of PCA (i.e. the generated principal components) into the KNN method;

(2) Between the PCA node and the KNN node, I would like to resample both the training and validation datasets at the same proportion, e.g. 10% of the exported PCA training dataset and 10% of the exported PCA validation dataset;

(3) I would rather not use the SAS Code node, since it is not easy for other users to work with.

 

This is really a difficult problem for me. If any of you have ideas, please share them; I would really appreciate it.

Thanks very much!

1 ACCEPTED SOLUTION

Accepted Solutions
YG1992
Obsidian | Level 7

Hi Capt_VA_SAS,

 

Sorry for my late reply, and many thanks for your suggestions; I really appreciate your efforts. Your second suggestion is workable in theory and I understand it, but unfortunately I cannot apply it to our workflow, since my boss doesn't want anything complicated (although it is not complicated for me).

The good news is that I finally solved this "sample after partition" problem using the SAS Code node and the EM macro variables. Using simple SQL statements that generate new export datasets, we successfully drew a smaller sample for the kNN method separately.
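The post does not show the actual SAS code, so the following is only a minimal Python sketch of the underlying idea: draw the same fraction, without replacement, from the training and the validation partition independently, so each partition keeps its role. The function name `sample_partition`, the toy rows, the 10% rate and the seed are all illustrative assumptions and do not correspond to any Enterprise Miner macro variable or to the poster's SQL.

```python
import random


def sample_partition(rows, fraction, seed):
    """Return a simple random sample (without replacement) of `fraction` of `rows`."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = round(len(rows) * fraction)   # number of rows to keep
    rng = random.Random(seed)         # fixed seed makes the draw reproducible
    return rng.sample(rows, k)


# Hypothetical stand-ins for the two partitions exported by the PCA node.
train = [{"id": i, "role": "TRAIN"} for i in range(7000)]
validate = [{"id": i, "role": "VALIDATE"} for i in range(7000, 10000)]

# Sample each partition separately at the same rate, so roles are preserved.
knn_train = sample_partition(train, 0.10, seed=12345)
knn_validate = sample_partition(validate, 0.10, seed=12345)

print(len(knn_train), len(knn_validate))
```

Sampling each partition on its own, rather than sampling the combined data, is what keeps the training-to-validation ratio of the reduced data equal to that of the original partition.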

Thank you again for your time and suggestions. Hope you have a nice week!


4 REPLIES
YG1992
Obsidian | Level 7

Hi Capt_VA_SAS,

 

No offense, but I think you misunderstood the situation I described, or you didn't actually try to reproduce it.

 

To be specific, if you use the "Data Partition" node (assume Train:Validation = 70:30) before the "Sample" node (assume 10% sampling), then you will see in the "Imported Data" property of "Sample" node that only the training data set from the "Data Partition" node was imported to the "Sample" node; in other words, you can only sample from the training data set but not sample from both training AND validation data sets.

 

This requirement is important to me because some models (e.g. kNN) DO NOT WORK for large data sets (e.g. 2 million observations). If I want to first apply the same partition, imputation and variable selection operations for all models (e.g. random forest, logistic regression and kNN) and then feed only a subsample to the kNN method, I need a small sample (e.g. 100 or 200 thousand observations) AFTER those operations. This is what the SAS EM Sample node CANNOT do at present.
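To put rough numbers on why the subsample matters, here is a back-of-the-envelope Python sketch. It assumes a brute-force nearest-neighbour search, where every scored row is compared against every training row; the actual node may use an indexed search that does better, so treat this as intuition only. The 2 million rows, the 70:30 split and the 10% rate are the example figures used in this thread.

```python
def brute_force_distance_count(n_train, n_score):
    """Distance evaluations needed to score n_score rows against n_train rows."""
    return n_train * n_score


total_rows = 2_000_000
n_train = total_rows * 70 // 100      # 70% training partition
n_valid = total_rows - n_train        # 30% validation partition

full = brute_force_distance_count(n_train, n_valid)
# Keeping 10% of BOTH partitions shrinks both factors of the product.
sampled = brute_force_distance_count(n_train // 10, n_valid // 10)

print(f"full data : {full:,} distance evaluations")
print(f"10% sample: {sampled:,} distance evaluations")
print(f"reduction : {full // sampled}x")
```

Under this assumption, sampling both partitions at 10% cuts the work by a factor of 100, whereas sampling only the training partition would cut it by a factor of 10.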

 

Or if you have any excellent solutions, please tell me and I would really appreciate it.

Many thanks to you.

 

 

Capt_VA_SAS
SAS Employee
Hi YG1992,
So I've been mulling over this issue some and have a few suggestions for you. You may have thought of these already but I'll throw them out there.

1. Sample smaller and further upstream, before splitting into Train and Validate. Now, I totally get it. Sometimes the last thing you want to do is re-model all of your candidate models from scratch, especially if that takes hours to complete. Thus, #2...
2. Create a separate process flow for KNN where you're sampling upstream. Once you're done you can still compare its performance against the other models (if not through the Model Comparison node, then PROC EYEBALL).
These are less than ideal but at least you can get there. Sorry I don't have a better answer.
