
Semantic Segmentation Using the ModelZoo Action Set


Semantic segmentation is a computer vision technique that allows us to predict the contents of an image at the pixel level. These algorithms play a crucial role in a wide range of applications, including autonomous vehicles, object detection, medical imaging, and more. The goal of this article is to provide an introduction to segmentation for those unfamiliar with the topic and, more importantly, to show users how to use the Model Zoo action set, a new CAS action set for deep learning recently introduced by R&D.

 

Introduction to Semantic Segmentation

 

Before we talk about the Model Zoo action set and show an example of how we can use it to build a segmentation model, let’s dive deeper into some of the details of segmentation models.

 

To build a deep learning model, or more broadly a machine learning model, we need a training dataset. For segmentation models, the training dataset consists of images, which act as the inputs, and their ground-truth annotations (often referred to as masks), which act as the targets.

 

Figure 1. Raw Image (Left) and Corresponding Mask (Right).

 


 

The masks needed to train a segmentation model can be generated in different ways, including with tools such as CVAT (Computer Vision Annotation Tool), where users manually annotate a set of images and assign classes to the segmented regions of each image. The annotated images can be exported in different formats; the format used in this blog is one in which each pixel value in the exported image corresponds to a class. The export assigns a unique color to each class, which means that each class has a unique set of pixel values.

 

Figure 2. Annotating Images using CVAT.
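
To make the exported format concrete, the snippet below shows one way a color-coded annotation could be converted into a class-index mask with NumPy. The two-color palette here is purely hypothetical; the colors assigned in a real export depend on your project setup.

import numpy as np

# Hypothetical palette: each class was exported with a unique RGB color
palette = {(0, 0, 0): 0,      # background
           (0, 128, 0): 1}    # turtle

rgb_mask = np.zeros((256, 256, 3), dtype=np.uint8)         # stand-in for an exported annotation image
class_mask = np.zeros(rgb_mask.shape[:2], dtype=np.uint8)  # one class index per pixel
for color, class_id in palette.items():
    class_mask[np.all(rgb_mask == np.array(color), axis=-1)] = class_id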

 

Ultimately, trained segmentation models are used to generate segmentation masks. The output mask assigns each pixel to the class with the highest predicted probability. As stated before, these masks can be used in a wide array of applications and in a variety of ways. This data comes from a project I worked on with Data4Good for the UNC Center for Galapagos Studies where, due to the size, quality, and distribution of the data, I built a segmentation model that identifies turtles in an image and segments them out. Doing this reduces the amount of noise in an image, since turtles can appear against all kinds of different backgrounds (e.g., boats, sand, underwater, in front of someone, etc.), which can hurt performance when matching turtles to one another. The predicted masks were used to alter the original images so that only the turtle remains (background removed). The altered images were then used to identify individual turtles from a list of ~600 different turtles.
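
To illustrate how a predicted mask can be used to remove the background, here is a minimal NumPy sketch; the image and mask arrays are random stand-ins for a real photo and its prediction.

import numpy as np

# Stand-ins for a real photo (H x W x 3) and its predicted mask (H x W, 0 = background, 1 = turtle)
image = np.random.randint(0, 256, size=(256, 256, 3), dtype=np.uint8)
mask = np.random.randint(0, 2, size=(256, 256), dtype=np.uint8)

# Keep only the turtle pixels; every background pixel becomes black
turtle_only = image * (mask == 1)[..., np.newaxis]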

 

Now that we’ve covered segmentation more broadly, in the next section we will choose a model architecture to perform segmentation and dive a little deeper into those details.

 

U-Net Model Architecture

 

To build a segmentation model, a suitable convolutional neural network (CNN) based architecture is chosen or designed to capture relevant features and learn the intricate relationships between pixels. Among the various architectures used for semantic segmentation, U-Net has been one of the most popular options for years due to its performance and intuitive design. The U-Net model has a U-shaped architecture (hence the name!) with skip connections that combine both high-resolution and low-resolution feature maps.

 

Figure 3. U-Net Model Architecture.

 

The first half of the U-Net model, or down sampling half, does the following:

 

  1. Passes an input image through two back-to-back 3x3 convolutional layers with ReLU activation.
  2. Passes the feature map generated by the second convolutional layer through a single 2x2 max pooling layer.
  3. Repeats steps 1 and 2 three more times, for a total of four down sampling steps.
  4. Passes the feature map generated by the last max pooling layer through a final set of two back-to-back 3x3 convolutional layers with ReLU activation.

 

The second half of the U-Net, or the up-sampling half, starts at the bottom of the U-Net after we perform our fifth set of 3x3 convolutions. The process looks like down sampling except we’re going in the opposite direction:

 

  1. Start the up-sampling portion by performing a 2x2 up-convolution, also known as a transposed convolution, on the feature map generated after the last convolutional layer.
  2. Take the feature map output by the convolutional layers on the other side of the U-Net (since the U is symmetrical), then copy and crop it to match the dimensions of our up-sampled feature map.
  3. Concatenate the copied-and-cropped feature map with our new up-sampled feature map.
  4. Pass the concatenated feature maps through a set of two back-to-back 3x3 convolutional layers with ReLU activation.
  5. Repeat steps 1 through 4 three more times, for a total of four up sampling steps.
  6. Pass the final up-sampled feature map through a final set of two back-to-back 3x3 convolutional layers with ReLU activation.
  7. Pass the last feature map through a 1x1 convolution layer to produce the final feature map.

 

This process allows us to combine both high-dimensional and low-dimensional features, which appears to be a key contributor to the performance of these models on segmentation tasks. The network can also be modified so that the output matches the input image dimensions; this way the masks can be overlaid on the input image if necessary. One final key distinction between this architecture and other common CNNs is that it does not have fully connected layers at the end. Instead, the output feature map of a U-Net model is passed directly to a pixel-wise softmax link function, which generates the class probabilities for each pixel in the resulting image.
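
For readers who think in code, here is a minimal PyTorch sketch of the building blocks just described (double 3x3 convolutions, 2x2 max pooling, 2x2 up-convolution, a skip connection, and a final 1x1 convolution). It is only a one-level illustration of the pattern, not the implementation used by Model Zoo, and it uses padded convolutions so the skip connections line up without cropping.

import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two back-to-back 3x3 convolutions with ReLU, as in each U-Net stage."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    """A single down/up step of the U-Net pattern; the full model repeats it four times."""
    def __init__(self, in_ch=3, n_classes=2):
        super().__init__()
        self.down = DoubleConv(in_ch, 64)
        self.pool = nn.MaxPool2d(2)                                      # 2x2 max pooling
        self.bottom = DoubleConv(64, 128)
        self.up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)   # 2x2 up-convolution
        self.fuse = DoubleConv(128, 64)                                  # after concatenating skip + upsampled maps
        self.head = nn.Conv2d(64, n_classes, kernel_size=1)              # final 1x1 convolution

    def forward(self, x):
        skip = self.down(x)
        x = self.bottom(self.pool(skip))
        x = self.up(x)
        x = self.fuse(torch.cat([skip, x], dim=1))    # skip connection: concatenate feature maps
        return self.head(x)                           # per-pixel class scores; softmax is applied in the loss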

 

Introduction to Model Zoo

 

Now that we have some background on semantic segmentation and the U-Net architecture, we can start to discuss Model Zoo. Model Zoo is a CAS action set for deep learning based on the PyTorch programming framework. The action set has a backend that calls the PyTorch C++ API and a frontend that is responsible for initiating and terminating actions, reading and writing CAS tables, and communicating with the backend. Aside from the architectural differences, Model Zoo takes a different approach to deep learning than the traditional deepLearn CAS action set.

 

Building a neural network that works well on a specific type of data can be very time-consuming and does not always yield the best results. For this reason, many users choose neural network architectures published by reputable research groups that perform very well on specific tasks. These groundbreaking architectures are generally published alongside an open-source implementation that often makes its way into some of the most popular open-source deep learning libraries, such as TensorFlow and PyTorch. For those reasons, Model Zoo currently puts a focus on using pre-defined model architectures, which is how many users apply deep learning in practice. That being said, Model Zoo also allows users to define custom architectures using PyTorch, a machine learning framework for deep learning built for Python and C++. There is a lot that can be said about defining custom models using PyTorch for Model Zoo, but that is beyond the scope of this blog. Readers should note that Model Zoo was released in the 2022.09 LTS release of Viya, so there is still a lot of development being done on it, and readers can expect a lot more functionality to be added to the action set over the next few years.

 

Now that we’ve given some background about Model Zoo, let’s discuss how we can use Model Zoo to train a U-Net segmentation model.

 

Model Zoo U-Net Segmentation Example

 

This demonstration will be performed using Python; however, users should be aware that they can also use Model Zoo in CASL. The first thing we need to do, as with most Python programs, is import the necessary packages:

 

import os 
import yaml
import swat
import dlpy
import numpy as np
import pandas as pd
from glob import glob

 

This may end up looking like a typical list of packages when using Model Zoo. Below is a table briefly describing the use of each package in this demonstration:

 

os: Required to save files to disk.
yaml: Used to ensure that the YAML files used by Model Zoo are syntactically accurate and complete.
swat: Provides the CAS connection object and is used to call the necessary CAS action sets, including the Model Zoo actions.
dlpy: Used to perform deep learning tasks with a Keras-like syntax. DLPy can also train Model Zoo models through the MZModel class; however, to keep this demonstration more adaptable for CASL users, we will not use that approach.
glob: Used to create a list of files stored in a directory.
numpy: Used for array manipulation, including images.
pandas: Used to create data frames/tables that can be saved in different formats.

Table 1. Python Package Description.

 

Now that we have imported the packages, we can establish a connection to CAS using SWAT’s CAS class and load the necessary action sets. In this demonstration I will load the dlModelzoo action set, the image action set (to explore images saved in CAS), and the sampling action set (to partition tables).

 

conn = swat.CAS("connection information")

conn.loadactionset("dlModelzoo")
conn.loadactionset("image")
conn.loadactionset("sampling")

 

Now that we have gone through some of the basic bookkeeping, we can start getting into the details of the requirements for building segmentation models. We need a table that contains two columns: the locations of the image files, complete with extensions, and the locations of the mask files, also complete with extensions. Note that these paths should be relative to the caslib that you’re planning on using. There are a variety of ways to create this table, such as creating a pandas data frame and saving it as a CSV, using SAS programming and exporting a CSV, creating it in Excel, creating it with the DLPy package, etc. Once you have the CSV, you can load it into memory using the loadTable action from the table action set. Below is a sample of what your CSV can look like:

 

Figure 4. Sample of CSV Containing Image Paths and Mask Paths.
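
For reference, here is one way the part_metadata table used later in this article could be created: load the CSV into CAS with the loadTable action and then add a partition indicator with the srs action from the sampling action set. The file name, caslib, and split percentage below are hypothetical choices for this sketch.

# Load the CSV of image/mask paths into CAS (file name and caslib are hypothetical)
conn.loadtable(path="metadata.csv", caslib="mycl",
               importOptions=dict(fileType="csv"),
               casOut=dict(name="metadata", replace=True))

# Add a partition indicator column (_PartInd_): 1 = training rows, 0 = validation rows
conn.sampling.srs(table="metadata", sampPct=80, partInd=True, seed=12345,
                  output=dict(casOut=dict(name="part_metadata", replace=True),
                              copyVars="ALL"))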

 

One difference between Model Zoo and the deepLearn CAS action set is that we need to create a YAML file that the actions use to specify model and/or dataset specific parameters. Although this YAML file may seem a little strange at first, it gives users the flexibility to have a fixed location where they can quickly make changes to the model, such as image transformations, input size changes, the number of target classes in the chosen algorithm, etc. It is possible to have all the information within a single YAML file; however, for the sake of readability I generally prefer to split it into two YAML files: one for training and one for scoring. One suggestion I would like to give users when creating the YAML file is to use a text editor such as Notepad++. The main reason is that the YAML file is sensitive to spacing, and the subsections are delimited by their indentation. Below is the YAML file that was used to train the model, as well as a table explaining the options used in the YAML file:

 

documents = """ 
sas:
    dlx:
        train: 
            label: "galapagos_unet"
            dataset:
                type: "Segmentation"
                organization: "CoCo"     # enum[IMAGENET, OPENIMAGE, COCO]
            preProcessing:
                - modelInput: 
                    label: input_tensor1
                    imageTransformation:
                        resize:
                            type: TO_FIX_DIM
                            size: 256 256 
                            target_size: 256 256
                        imgStdType: STD 
                            
            model:
                type: "TORCHNATIVE" # enum [TORCHSCRIPT, TORCHNATIVE]
                name: "SAS_TORCH_UNET" # ignored when type is "TORCHSCRIPT"
                caslib: "mycl" 
                classNumber: 2   
                inputs:
                    - label: input_tensor1
                      size:
                      - 0
                outputs:
                    - label: output_tensor1
                      size: 
                      - 0
"""

 

label: Used by the train and score actions to call a specific section of the YAML file.
dataset: Specifies the dataset type the action receives; the type determines how the data is read in and processed.
dataset.type: Supported dataset types are Univariate, ObjDetect, Segmentation, and Autoencoder.
preProcessing: Specifies the pre-processing (augmentation) of the input data as well as the target data.
preProcessing.modelInput: The action looks for the matching label in the model.inputs section to find the stream of inputs to apply the pre-processing to.
modelInput.label: Gives the inputs a name and indicates the inputs (data) that the pre-processing will be applied to.
modelInput.imageTransformation: Allows users to apply transformations to the image, such as resizing, color transformations, random transformations, etc.
imageTransformation.resize: Used to resize the inputs to a specific height and width.
imageTransformation.imgStdType: Normalizes the image pixel values such that each value falls in the range [0, 1].

Table 2. Train and Score YAML Options.

 

Options such as dataset.organization are there to give the user a little more information and are not used by the Model Zoo CAS actions. Your YAML file should start with sas, then dlx, and the subsection after dlx can be either train or score, depending on whether the YAML file will be used to train the algorithm or to score with it. Most of the options specified under the train subsection are explained in the table above, so I will not repeat them here. In this demonstration we’re using the Segmentation dataset type and two options to preprocess the input dataset. The first preprocessing step is to resize the images to fixed dimensions of 256 by 256 pixels for both the input images and the target images. The other transformation applied to the inputs is a standardization of the image pixels. Image pixels take values between 0 and 255, so to keep the weight updates from being too drastic in scale and to help information flow through the network, normalization steps like this one are common. Note that there are many preprocessing techniques available in Model Zoo; in this example I chose not to apply any beyond these two.

The second important subsection that we need to specify is the model subsection, and this is where we give more details about what kind of model we ultimately want to build. The table below explains some of the options that were used:

 

type: Currently supports two types of models, TorchScript and TorchNative. TorchNative models are pre-written in C++ and built into the action library; this type is used for pre-built models such as YOLO and U-Net. TorchScript is used to import custom models written in PyTorch.
name: Used only for TorchNative models to specify which model the user wants to use.
caslib: The caslib where model weights can be loaded from and saved to.
classNumber: The number of classes in our targets.
inputs: Can be used to apply changes and transformations to the input and target tensors.
inputs.label & outputs.label: Specifies the name of the tensor to be modified.
inputs.size & outputs.size: Can be used to reshape the specified tensor to a given size. A value of 0 means no reshaping; otherwise, the values are specified as channel, height, width.

Table 3. Model YAML Options.

 

Underneath the model subsection, the first thing we need to specify is what type of model we want to build. TorchNative corresponds to the pre-built models, while TorchScript corresponds to user-defined models written in PyTorch. Afterwards, you need to specify the name of the model that you want to build, which in this case is SAS_TORCH_UNET. The Model Zoo documentation shows which models are currently available within the action set, with presumably more on the way in the future. Another necessary option when building a segmentation model is the number of pixel classes. The objective of the model in this case is to distinguish between the background and the turtles, so my class number is two: one class corresponding to background and one class corresponding to a turtle. We do not need to resize the input or output tensors, so we leave the size option as 0. That completes the YAML file, so next we can check whether it is syntactically correct by using the yaml package along with the following lines of code:

 

for data in yaml.safe_load_all(documents):
    print(data)

 

If the YAML file is syntactically correct, the output will be a dictionary-style printout that displays all of the user-specified options. Keep in mind that this doesn't guarantee that you won't have any errors when the YAML file is used to train or score the model; it only ensures that the YAML file is syntactically correct, in other words, that every option is delimited and spaced correctly.

 

Although users can build their own YAML file from scratch, the file can also be generated automatically via the DLPy package and modified as needed. The code below creates an object called unet that has the YAML file saved as part of an attribute that can be accessed with the documents_train attribute.

 


from dlpy import mzmodel

# Defines the UNet Model

unet = mzmodel.MZModel(conn, model_type = "TorchNative", model_name = "unet", num_classes = 2, dataset_type = "segmentation")

# Defines transformations for the UNet Model

unet.add_image_transformation(image_resize_type = "RETAIN_ASPECTRATIO", image_size = 256, target_size = 256)

# Displays model information

unet.documents_train

 

Now that the YAML file is complete and syntactically correct, we can move forward with training the segmentation model. The way that we're going to do that is by using the dlmztrain action from the Model Zoo action set. Below is the dlmztrain action as well as a table with a description of the options being used in the action:

 

train_action = conn.dlmztrain(loglevel = "DEBUG", table = dict(name = "part_metadata", where = "_PartInd_ = 1"),  
                              inputs = "image_path", targets = "label_path", ngpus = 1, 
                              validationTable = dict(name = "part_metadata", where = "_PartInd_ = 0"),
                              modelOut = dict(name = "galapagos_unet", replace = True), checkpointBest = True,
                              outputIndexMap = dict(name = "galapagos_outputindex", replace = True),
                              optimizer=dict(loss='cross_entropy',
                                             mode=dict(type='synchronous', syncFreq=1,),
                                             algorithm=dict(
                                                            learningRate=dict(value = 0.0002, scalefactor=1e-3),   
                                                            method='ADAM',
                                                            weight_decay=0.0005
                                                            ),
                                             batchSize = 10, seed=12345, maxEpochs=50),
                              learningRateScheduler = dict(policy = "STEP", stepsize = 10, gamma = 0.5),
                              extraoptions=dict(yaml = documents, label='galapagos_unet'))

 

loglevel: Reporting level for progress messages sent to the client. DEBUG allows users to see more information related to the training process.
table: The CAS table that stores the input data for training the deep learning model.
inputs: The input variables for the training task. Currently, image data is supported as input; the input column can be either a string with an image path or a binary with the image data.
targets: The target variables for the training task.
ngpus: Specifies the number of GPUs to use (across the entire grid when used together with hyperparameter tuning). The GPUs with the lowest amount of memory currently allocated are chosen. This option is mutually exclusive with the gpu option.
validationTable: The CAS table that contains the validation dataset. Used to assess the model weights after each epoch.
modelOut: The CAS table used to store the trained model and model weights.
checkPointBest: Specifies whether to save the model weights that performed best on validation or the weights produced in the final epoch.
outputIndexMap: The output CAS table containing a mapping from nominal class values to numeric values. The index table built during training is written to this CAS table at the end of the action.
optimizer: A key component of training any type of neural network. Currently supports a variety of optimization algorithms, including SGD, Adagrad, ADAM, and ADAMW.
tuner: Specifies settings for hyperparameter tuning.
learningRateScheduler: Used to specify a learning rate policy, such as a fixed learning rate or one that is modified after every n steps.
extraOptions: Used to specify the YAML file that should be read when training or scoring the model, as well as which section to read by using the label option.

Table 4. dlmzTrain Action Options.

 

The first option I would like to highlight is the loglevel option. There are a variety of values it can take; personally, I like to set the log level to DEBUG so that I get more messages about the training process. In the table option we specify the name of the table, imported from the CSV, that contains the paths of the images and their corresponding masks. We use the inputs option to specify the name of the column that contains the image paths, and the targets option to specify the name of the column that contains the mask paths. If we are using a machine with GPU capabilities, we can use the gpu option to specify which GPU to use, or the ngpus option (as above) to specify how many GPUs to use.

 

Our model weights are stored in a CAS table called galapagos_unet and, thanks to the checkpointBest option, the stored weights are those from the epoch with the best validation performance. The optimizer parameter is very important because it controls the process used to optimize the model weights. The loss option specifies how the loss will be calculated; loss functions are closely tied to the distribution of the target and its number of classes, as well as the type of model you are trying to build. The algorithm option lets us specify details of the algorithm we want to use, such as the learning rate, momentum, optimization technique, weight decay, and more. Other options specified in the optimizer include the batch size, which is the number of images used in every iteration of the optimization algorithm; the seed, to make the optimization reproducible; and maxEpochs, which specifies the maximum number of epochs to run. The learning rate scheduler can be used to specify whether we want a fixed learning rate for the optimization, or whether the learning rate should change as the optimization progresses. In this demo, the learning rate is adjusted after every 10 epochs. Lastly, we use extraoptions to specify the YAML file and, via the label option, which section of the YAML file should be read to train the model.
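
As a quick illustration of the STEP policy above (stepsize = 10, gamma = 0.5), and assuming the usual convention that the learning rate is multiplied by gamma every stepsize epochs, the effective learning rate would decay over 50 epochs roughly as follows:

# Illustration only; assumes lr_epoch = base_lr * gamma ** (epoch // stepsize)
base_lr, gamma, stepsize = 0.0002, 0.5, 10
for epoch in range(0, 50, 10):
    print(epoch, base_lr * gamma ** (epoch // stepsize))
# 0 0.0002, 10 0.0001, 20 5e-05, 30 2.5e-05, 40 1.25e-05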

 

Figure 5. Sample of dlmzTrain Action Output.

 

Keep in mind that the output will differ depending on the log level; a level of DEBUG gives a lot of information regarding the training process. We can also see information about the model, such as its name, the kind and number of layers, how the loss progresses, and the misclassification error and mean intersection over union (IoU). At the conclusion of the training process, we get a reason why the optimization stopped and a message stating that the action completed successfully.
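
For readers unfamiliar with the metric, mean intersection over union averages, over the classes, the overlap between the predicted and ground-truth pixels for each class divided by their union. A generic NumPy sketch (not the exact Model Zoo computation) looks like this:

import numpy as np

def mean_iou(pred, truth, n_classes=2):
    """Average over classes of |prediction AND truth| / |prediction OR truth|."""
    ious = []
    for c in range(n_classes):
        intersection = np.logical_and(pred == c, truth == c).sum()
        union = np.logical_or(pred == c, truth == c).sum()
        if union > 0:                      # skip classes absent from both masks
            ious.append(intersection / union)
    return float(np.mean(ious))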

 

Now that the model is trained, we can move on to scoring with it. To score using our model, we once again need instructions in a YAML file. The good news is that the instructions in this YAML file look very similar to the ones in the training YAML file.

 

score_doc = """
sas:    
    dlx:
        score: 
            label: "galapagos_unet_score"
            dataset:
                type: "Segmentation"
                organization: "CoCo"     # enum[IMAGENET, OPENIMAGE, COCO]
            preProcessing:
                - modelInput: 
                    label: input_tensor1
                    imageTransformation:
                        resize:
                            type: TO_FIX_DIM
                            size: 256 256 
                            target_size: 256 256
                        imgStdType: STD 
                            
            model:
                type: "TORCHNATIVE" # enum [TORCHSCRIPT, TORCHNATIVE]
                name: "SAS_TORCH_UNET" # ignored when type is "TORCHSCRIPT"
                caslib: "mycl" 
                classNumber: 2    
                inputs:
                    - label: input_tensor1
                      size:
                      - 0
                outputs:
                    - label: output_tensor1
                      size: 
                      - 0
"""

 

The main differences between this YAML file and the one we used for training are that the third suboption is score instead of train, and the label is now “galapagos_unet_score” so that this section has a different name. Outside of these changes everything remains the same; in fact, in most cases the training YAML file can be the same as the scoring YAML file. When using DLPy to build your U-Net models, DLPy makes no distinction between these two files. With a complete YAML file we can now use the dlmzscore action. Below is the code, as well as a table that provides a brief description of all the options used in the action:

 

score_action = conn.dlmzscore(modelTable = "galapagos_unet", table = dict(name = "part_metadata", where = "_PartInd_ = 0"), 
                              inputs = "image_path", targets = "label_path", batchsize = 10, gpu = {0}, loglevel = "DEBUG",  
                              tableout = dict(name = "unet_output", replace = True),
                              extraoptions = dict(yaml = score_doc, label = "galapagos_unet_score"))

 

loglevel: Reporting level for progress messages sent to the client. DEBUG allows users to see more information related to the scoring process.
modelTable: The CAS table containing the model and model weights. This parameter is optional. When it is specified, the table stores a binary blob containing the model and weights in PyTorch format; when it is not specified, you can specify the model table file path, relative to a caslib path, in the YAML file passed through the extraOptions parameter.
table: The input CAS table that stores the input data for scoring the deep learning model.
inputs: The input variables for the scoring task. Currently, image data is supported as input; the input column can be either a string with an image path or a binary with the image data.
targets: The target variables for the scoring task.
batchsize: The number of images to score in each batch.
gpu: Specifies the GPU to use when scoring.
tableOut: The output table used to store the output generated from scoring.
extraOptions: Used to specify the YAML file that should be read when training or scoring the model, as well as which section to read by using the label option.

Table 5. dlmzScore Action Options.

 

Similar to the dlmztrain action, the amount of information we get as part of the output depends on the log level. Since the log level is set to DEBUG once again, we get as much information as possible about the scoring process. Once again we see information about our model, such as the number of layers, the kinds of layers, the number of parameters, and how the loss, misclassification error, and mean IoU changed as we scored our batch of validation images. The tableout option dictates the name of the output table that contains the predicted images; the table generated by the score action here is unet_output.

 

 

Figure 6. Sample of dlmzScore Action Output.

 

If we want to save these images, we can use the saveImages action from the image action set. Simply exploring these images will show what appears to be a plain black image. The reason is that the output masks contain values of zero and one: a pixel value of 0 corresponds to background, while a pixel value of 1 corresponds to a turtle. To fully visualize a mask, a little additional post-processing was performed on the output.
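
The post-processing in question essentially amounts to stretching the 0/1 class labels to the 0-255 display range. A minimal sketch, assuming the predicted mask has already been pulled back to the client as a NumPy array:

import numpy as np

# Stand-in for a predicted mask with values 0 (background) and 1 (turtle)
pred_mask = np.random.randint(0, 2, size=(256, 256), dtype=np.uint8)

# Stretch the class labels to the displayable 0-255 range so the turtle shows up as white
display_mask = (pred_mask * 255).astype(np.uint8)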

 

Figure 7. Image Input (Left), Raw Mask Prediction (Middle), and Processed Mask (Right).

 

Visualizing these images, we can see that the model performed quite well, which is also corroborated by the mean IoU on the validation dataset. It should be noted that the purpose of this blog is to act as an introduction to segmentation models and to Model Zoo and its syntax; as such, not much time was spent fine-tuning this model. I hope that this blog has given you a better understanding of segmentation models and provided you with enough context to get started with the action set!

 

Additional Resources

 

 

 

Find more articles from SAS Global Enablement and Learning here.
