Hi everyone,
Everything below refers to SAS Enterprise Miner.
While building different models I found that KNN is by far the most time-consuming method. For example, where the Neural Network takes a few minutes to complete, KNN can take several hours on the same training and validation samples. I have decided not to include KNN as a workable solution in our project, but in order to at least "get some results", I would like to apply it to smaller subsamples of both the EXISTING training and validation datasets. To be specific,
(1) I have already run Partition, Metadata (role assignment), Replacement, Impute and PCA before KNN, and I want to feed the results of PCA (i.e. the generated principal components) into the KNN method;
(2) Between the PCA node and the KNN node, I would like to resample both the training and validation datasets at the same rate, e.g. 10% of the exported PCA training dataset and 10% of the exported PCA validation dataset;
(3) The SAS Code node will not be considered, since it is not easy for other users to work with.
This is a really difficult problem for me. If any of you have ideas, I would be glad to discuss them and would really appreciate the help.
Thanks very much!
Hi Capt_VA_SAS,
Sorry for my late reply, and many thanks for your suggestions; I really appreciate your efforts. Your second suggestion is workable in theory and I understand it, but unfortunately I cannot apply it to our workflow since the boss doesn't want anything complicated (although it is not complicated for me).
The good news is that I finally solved this "Sample after partition" problem using the SAS Code node and the EM macro variables. By using simple SQL statements and generating new export datasets, we were able to draw a smaller sample for the kNN method separately.
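For anyone with the same problem, here is a minimal sketch of what such a SAS Code node could look like. It is not the exact code from our project. It assumes the node sits between the PCA node and the kNN node, that both the training and validation partitions are imported, and that writing tables to the export macro variables is enough for the next node to pick them up. The macro variable names knn_rate and knn_seed and their values are made up for illustration. Note that filtering on ranuni() gives approximately, not exactly, 10% of the rows.

```sas
/* Hypothetical settings, chosen only for illustration */
%let knn_rate = 0.10;   /* share of rows to keep in each partition */
%let knn_seed = 12345;  /* fixed seed so the sample is repeatable */

proc sql;
    /* Sample the training partition imported from the preceding node */
    create table &EM_EXPORT_TRAIN as
    select *
    from &EM_IMPORT_DATA
    where ranuni(&knn_seed) < &knn_rate;

    /* Sample the validation partition at the same rate */
    create table &EM_EXPORT_VALIDATE as
    select *
    from &EM_IMPORT_VALIDATE
    where ranuni(&knn_seed) < &knn_rate;
quit;
```

If an exact sample size or stratification on the target is needed, PROC SURVEYSELECT would probably be a better fit than a ranuni() filter.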
Thank you again for your time and suggestions. Hope you have a nice week!
You should use the Sample Node (http://go.documentation.sas.com/?docsetId=emref&docsetTarget=p15ebk1ysuwhqln11brhbopvmpj7.htm&docset...)
This should get you what you need!
Hi Capt_VA_SAS,
No offense but I think you failed to understand the situation I mentioned, or you didn't really try to simulate the described situation in my question.
To be specific, if you use the "Data Partition" node (assume Train:Validation = 70:30) before the "Sample" node (assume 10% sampling), then you will see in the "Imported Data" property of "Sample" node that only the training data set from the "Data Partition" node was imported to the "Sample" node; in other words, you can only sample from the training data set but not sample from both training AND validation data sets.
This requirement is important for me since some models (e.g. kNN) DO NOT WORK for large data sets (e.g. 2 million observations). If I want to first apply the same partition, imputation and variable selection steps for all models (e.g. random forest, logistic regression and kNN) and then give only a subsample to the kNN method, I need a small sample (e.g. 100 or 200 thousand observations) AFTER those steps. This is what the SAS EM Sample node CANNOT do at present.
If you have another solution, please tell me and I would really appreciate it.
Many thanks to you.