🔒 This topic is solved and locked.
YG1992
Obsidian | Level 7

Hello everyone,

 

I just used the HP SVM node in SAS Enterprise Miner to build an SVM model, and it turned out to perform pretty well. But one thing confuses me: in the output of the SVM model, I see the following information:

                Training Results
 
Inner Product of Weights               72.2366916
Bias                                   5.27294747
Total Slack (Constraint Violations)    2962.97082
Norm of Longest Vector                 12.6467077
Number of Support Vectors                   69998
Number of Support Vectors on Margin             0
Maximum F                              24.7895353
Minimum F                              0.04925285
Number of Effects                              18
Columns in Data Matrix                         18
Columns in Kernel Matrix                      190

(1) I don't understand why it reports "Number of Support Vectors" as 69998, which is exactly the size of my training dataset. It seems impossible to me that the model uses every observation in the training dataset as a support vector, since the AUC on the training dataset is NOT equal to 1 and there is only very slight overfitting.

 

(2) Also, could anyone tell me how I can see the REAL number of support vectors that the model uses?

 

Thanks very much.


6 REPLIES
taiphe
SAS Employee

From the training results table I can see that you selected the polynomial kernel with degree=2. In this case, the expanded number of variables is 190, and it is possible for all of the observations to be reported as support vectors (see the sketch at the end of this reply for where the 190 comes from). It is good to hear that you got a pretty good model; at the same time, it could be slightly overfit. There are several ways you can adjust the model:

1. Do a data partition and validate the model on the validation dataset.

2. Build a model with the linear kernel instead of the polynomial kernel.

3. Try a different penalty value.

 

For the second question: unfortunately, the individual support vectors are not reported.
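As for where the 190 comes from: a degree-2 polynomial kernel on 18 inputs implicitly expands them into all monomials of degree at most 2, and there are C(18+2, 2) = 190 of those (1 constant + 18 linear + 18 squared + 153 cross-product terms). The snippet below is a minimal illustration in Python/scikit-learn rather than the HP SVM node (the synthetic data and model settings are made up for the example); it checks that count and shows how an analogous polynomial-kernel SVM exposes its support-vector count, which is the kind of detail the HP SVM training output does not break out.

```python
from math import comb

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Degree-2 polynomial expansion of 18 inputs: all monomials of degree <= 2
# (constant + linear + squares + cross products) -> C(18 + 2, 2) = 190,
# matching the "Columns in Kernel Matrix" value in the training results.
n_inputs, degree = 18, 2
print(comb(n_inputs + degree, degree))  # 190

# Illustration only (scikit-learn, not the HP SVM node): a kernel SVM keeps an
# explicit list of its support vectors, so their true count can be inspected.
X, y = make_classification(n_samples=2000, n_features=18, random_state=1)
model = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)
print(model.support_.shape[0])  # total number of support vectors
print(model.n_support_)         # support vectors per class
```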

YG1992
Obsidian | Level 7

Hi taiphe,

 

Thanks very much for your quick reply; I found it very helpful. Based on your feedback, I have some further questions:

1. How did you make the judgment that I have trained a good model? In other words, how do you read the training results table? Which statistics are useful? How would you suggest analyzing the results? (I ask because I usually add a "comparison node" after each model and tend to put emphasis ONLY on the training ROC and validation ROC.)

2. I also included a linear-kernel SVM model. What if there is some overfitting for that model as well?

3. According to the official documentation of the SAS EM high-performance procedures, the HP SVM node uses the primal-dual interior-point method by default to solve the quadratic programming problem. Why wasn't the sequential minimal optimization (SMO) algorithm chosen instead? What do you think are the advantages and disadvantages of the two algorithms?

 

I know that it may take some time to answer the questions above - especially the third one - so please take your time. As a newcomer to the field of machine learning, I always like to hear valuable and practical opinions about different models and methods.

Thank you very much!

taiphe
SAS Employee

Here are my answers. I hope they make sense and are helpful.

1. Basically, a good model should make relatively accurate predictions without much overfitting. To see the accuracy values, you can look at the misclassification table and the fit statistics table in the training output. The training results table does not contain the training accuracy information, though you can get other valuable information from it, such as "Inner Product of Weights", "Constraint Violations", and "Number of Support Vectors". To see whether a model is overfitting, score the model on the validation dataset and compare the accuracy with that of the training dataset. If the validation accuracy is pretty close to the training accuracy, then there is no overfitting; otherwise overfitting might exist (a rough illustration of this check is sketched after this list). Assuming there is no overfitting, the models can then be compared through their ROCs.
2. For the linear-kernel SVM model, the chance of overfitting is relatively low. If overfitting does happen, try adjusting the penalty value to get a different model.
3. The default technique for the SVM node is the interior-point method. This method supports multiple threads and distributed computation, while the SMO algorithm is sequential and single-threaded. For relatively large datasets, the interior-point method runs much faster than the SMO method. For small data problems, you can also select the active-set method in the SVM node; in that case, a non-linear kernel can be used to obtain a higher-accuracy model, but you run a higher risk of overfitting. By the way, the SMO method is a special case of the active-set method.
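To make the check in point 1 concrete: the sketch below is a minimal illustration in Python/scikit-learn with synthetic data, not SAS EM (where the equivalent workflow is the Data Partition node plus the fit statistics and Model Comparison nodes). It scores the same kind of polynomial-kernel SVM on the training and validation partitions and shows how changing the penalty value, as suggested in point 2, affects the gap between the two AUCs.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; in SAS EM the split would come from the Data Partition node.
X, y = make_classification(n_samples=5000, n_features=18, random_state=2)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=2
)

def fit_and_score(penalty):
    """Fit a degree-2 polynomial-kernel SVM and return train/validation AUC."""
    model = SVC(kernel="poly", degree=2, C=penalty).fit(X_train, y_train)
    # Decision-function scores are enough for AUC; no probability calibration needed.
    auc_train = roc_auc_score(y_train, model.decision_function(X_train))
    auc_valid = roc_auc_score(y_valid, model.decision_function(X_valid))
    return auc_train, auc_valid

# A large gap between training and validation AUC suggests overfitting;
# lowering the penalty (more regularization) usually narrows it.
for penalty in (10.0, 1.0, 0.1):
    auc_train, auc_valid = fit_and_score(penalty)
    print(f"C={penalty:<5} train AUC={auc_train:.3f}  valid AUC={auc_valid:.3f}")
```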

YG1992
Obsidian | Level 7

Thanks very much! Your answers are really helpful, especially the key terms you mentioned, which I think will be useful for my future study. Have a happy weekend!

AnnaBrown
Community Manager

Hi @YG1992,

 

I'm glad you found some useful info! If one of @taiphe's replies was the exact solution to your problem, could you "Accept it as a solution"? Or if one was particularly helpful - and both seem like they were - feel free to "Like" one or both. This will help other community members who run into the same issue know what worked.

Thanks!
Anna



YG1992
Obsidian | Level 7

Hi Anna,

 

Already done. Thanks for your work.

