Hi everyone!
I'm currently learning SAS programming and wanted to start a project of my own. I have access to SAS Viya, so I was thinking of doing unsupervised classification of emails (multi-class classification) with VDMML and VTA.
I was thinking of running the text through VTA, extracting the score code from the Categories node, and then processing that data in VDMML to train a classification model. However, I'm not sure what kind of pipeline would suit this approach, as most of the current pipelines seem to cater to supervised learning.
Any help in this area would be appreciated. Apologies if this is a very basic question, and thank you!
I'm not so sure this is the best little project to learn SAS programming ... but anyway.
In SAS terminology, multi-class classification and multi-label classification are always supervised.
You probably need unsupervised learning instead: clustering or topic detection.
If there's no pipeline template for clustering in Model Studio (VDMML), you can always build such a pipeline yourself starting from a data node (or an empty pipeline).
After you have used Singular Value Decomposition (SVD) or Latent Dirichlet Allocation (LDA) to reduce the dimensionality of the weighted term-by-document frequency matrix, you can apply a clustering algorithm to the result. Note that with clustering every e-mail belongs to exactly one cluster, whereas with topic detection in VTA a single e-mail may contain several topics.
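To make the "one cluster per e-mail" point concrete, here is a minimal sketch in plain Python. It is not SAS code and does not use any VTA or VDMML API. It assumes a TF-IDF weighting and a basic k-means, and it skips the SVD/LDA reduction step because the toy matrix is tiny. The e-mails, the function names (tfidf_vectors, kmeans) and the seed choice are all made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build unit-length TF-IDF rows (one per document) from tokenized docs."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        rows.append([x / norm for x in v])
    return vocab, rows

def kmeans(rows, seeds, iters=20):
    """Hard-assign every row to its nearest centroid (squared Euclidean)."""
    centroids = [list(rows[i]) for i in seeds]  # assumption: seeds picked by hand
    labels = [None] * len(rows)
    for _ in range(iters):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(r, centroids[k])))
            for r in rows
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy e-mails, already reduced to keywords.
emails = [
    "invoice payment overdue invoice",
    "payment received invoice attached",
    "meeting schedule tomorrow agenda",
    "agenda meeting room schedule",
    "refund payment invoice dispute",
    "schedule meeting call agenda",
]
docs = [e.split() for e in emails]
vocab, rows = tfidf_vectors(docs)
labels = kmeans(rows, seeds=[0, 2])  # two clusters, seeded with e-mails 0 and 2
print(labels)
```

Each e-mail ends up with exactly one cluster label. An e-mail that mixed billing and scheduling words would still be forced into a single cluster here, which is the case where topic detection is the better fit.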
Koen
