Re: Text Processing

WWD · Posted 09-05-2021 09:02 AM

Course : AI and Machine Learning Professional

Module: Natural Language and Computer Vision

Sub-Module - Visual Text Analytics using SAS Viya

I have three questions, none of which are specific to any of the demonstrations or discussions within this submodule.

1. Is it true that SAS Viya considers a categorical variable within a data set to be a "Target" variable? If the categorical variable has only two outcomes, does SAS then basically treat the problem as logistic regression?

2. If the original data set does not have any categorical variables and a categorical node is not added to the pipeline, is it true that no scoring (classifying) of the original data nor any "new" data is possible?

3. In the Machine Learning Specialist Module, a text node was added to the pipeline to help predict if a cable-company customer would churn. The output of the text node was a matrix of SVD values. Are SVDs generated and used behind the scenes in the model-building process demonstrated in the "Natural Language and Computer Vision" node?

Thank you,

Bill Donaldson

HarrySnart · Posted 09-28-2021 09:34 AM

Hi,

The Visual Text Analytics projects are a bit different to the Data Mining projects, as you can get different types of models depending on what you require (tokenization, classification, sentiment analysis...). My answers follow each of your questions below.

1. Is it true that SAS Viya considers a categorical variable within a data set to be a "Target" variable? If the categorical variable has only two outcomes, does SAS then basically treat the problem as logistic regression?

In the Visual Text Analytics project pipeline you can set categorical variables as inputs (Category); however, the 'Target' is the text corpus (Text). The additional categorical inputs can be added to supplement classification nodes in order to generate linguistic rules. This is different to the target in the Data Mining project you create in VDMML, where the task can be classification or regression. In the case of a categorical target with two levels, the model will be a binary classification model, but not necessarily a logistic regression (it could be, for example, XGBoost, a Decision Tree or a Random Forest). In the example screenshot below you can see the metadata definition for a Visual Text Analytics project where I've added my text corpus and some supplementary categorical fields.
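To illustrate that a two-level target fixes the task (binary classification) but not the algorithm, here is a minimal Python sketch. It is not SAS code and not what VDMML runs internally; the toy data, the function names and the two hand-written learners (a gradient-descent logistic regression and a depth-one decision tree) are all assumptions made for illustration.

```python
import math

# Hypothetical toy data: one standardized numeric input and a two-level target.
X = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit a one-feature logistic regression by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(w * xi + b)))
            grad_w += (p - yi) * xi
            grad_b += p - yi
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_logistic(model, x):
    """Predict level 1 when the fitted probability is at least 0.5."""
    w, b = model
    return 1 if 1.0 / (1.0 + math.exp(-(w * x + b))) >= 0.5 else 0

def fit_stump(X, y):
    """Fit a depth-one decision tree: pick the split with the fewest errors."""
    xs = sorted(set(X))
    candidates = [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]

    def errors(threshold):
        return sum((1 if xi > threshold else 0) != yi for xi, yi in zip(X, y))

    return min(candidates, key=errors)

def predict_stump(threshold, x):
    """Predict level 1 for inputs above the learned threshold."""
    return 1 if x > threshold else 0

# Two different algorithms, same binary target.
logit_model = fit_logistic(X, y)
stump_model = fit_stump(X, y)
logit_preds = [predict_logistic(logit_model, x) for x in X]
stump_preds = [predict_stump(stump_model, x) for x in X]
```

Both learners produce a prediction for each of the two target levels; which algorithm is used is a modelling choice, not something the target's type decides.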

2. If the original data set does not have any categorical variables and a categorical node is not added to the pipeline, is it true that no scoring (classifying) of the original data nor any "new" data is possible?

In a Visual Text Analytics pipeline, I believe, it is only mandatory to have a Text input. The Category input is supplementary to the model and is used for the Categories classification node. You can also generate categories without a Category input, for example by combining topics from the Topics node, or use a combination of both. See the screenshot below of a Visual Text Analytics pipeline I built with the SAS Global Forum agenda. The topic collections are at the top and have classification rules based on topics trained earlier in the pipeline. The Category input that I added in my metadata definition on the Data tab breaks into separate rules for each of the levels. For example, when Session Type = "Business" the classification model scores on key terms like "presentation", "journey" and "customer". These nodes don't have to be run in a set order, and you may not always need a classification node if you're more interested in identifying topics or matching patterns based on the LITI syntax using the Concepts node.
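As a rough picture of what "scores on key terms" means, here is a minimal Python sketch of term-based category scoring. It is not LITI syntax and not the rule format the Categories node generates; the "Business" terms come from the example above, while the "Technical" level, its terms and all function names are hypothetical.

```python
# Hypothetical key terms per category level. The "Business" terms are taken
# from the example above; the "Technical" level is invented for illustration.
CATEGORY_TERMS = {
    "Business": {"presentation", "journey", "customer"},
    "Technical": {"code", "macro", "procedure"},
}

def score_document(text, category_terms=CATEGORY_TERMS):
    """Count how many distinct key terms of each category occur in the text."""
    tokens = {tok.strip(".,;:!?\"'()").lower() for tok in text.split()}
    return {cat: len(terms & tokens) for cat, terms in category_terms.items()}

def classify(text, category_terms=CATEGORY_TERMS):
    """Return the best-scoring category, or None if no key term matches."""
    scores = score_document(text, category_terms)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

doc = "A presentation on the customer journey."
scores = score_document(doc)
label = classify(doc)
```

Because the rules only need text to fire, new documents can be scored this way even when they carry no category column of their own.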

3. In the Machine Learning Specialist Module, a text node was added to the pipeline to help predict if a cable-company customer would churn. The output of the text node was a matrix of SVD values. Are SVDs generated and used behind the scenes in the model-building process demonstrated in the "Natural Language and Computer Vision" node?

You are able to generate datasets from the 'results' option of nodes in a Visual Text Analytics pipeline. The output varies by node type. See below an example from the Categories node, which gives both a transactional and a modelling-ready format. The modelling-ready format is close to the SVD output in the Data Mining project, except that it is based on classifiers rather than individual tokens.
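The difference between the two formats can be sketched in Python as a pivot from one row per match to one row per document. The column layout, document IDs and category names here are assumptions for illustration only and do not reproduce the actual tables that the Categories node writes.

```python
# Hypothetical transactional output: one row per (document, matched category).
transactional = [
    (1, "Business"),
    (1, "Technical"),
    (2, "Business"),
    (3, "Technical"),
]

def to_modelling_ready(rows):
    """Pivot (doc_id, category) rows into one 0/1 indicator row per document."""
    categories = sorted({cat for _, cat in rows})
    wide = {}
    for doc_id, cat in rows:
        # Create the document's row on first sight, with every flag set to 0.
        row = wide.setdefault(doc_id, {c: 0 for c in categories})
        row[cat] = 1
    return categories, wide

columns, table = to_modelling_ready(transactional)
```

The wide table has one numeric column per category, so it can be joined to other inputs and fed to a predictive model in the same way a matrix of SVD columns would be.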

Hope this helps answer your question

Harry

WWD · Posted 09-30-2021 07:20 AM

Harry:

Thank you for responding and helping further my understanding of the topics.

Bill Donaldson
