Deep convolutional neural networks (CNNs) are powerful tools for image classification tasks. Image classification involves training a model on a large dataset containing examples of each class; the fitted model is then used to predict the class of new instances. For example, to classify images of vehicles, we would ideally train the model with many images of each type of vehicle. Some networks can identify thousands of classes with very good performance because they are trained on very large datasets. However, training on a large dataset is not always feasible.
What if we don't have enough images of each class to build a robust model? What if the classes change so frequently that retraining and redeploying the model is not practical? We can address these issues with image embedding, a method that represents an input image in a condensed form known as a low-dimensional feature vector. An image embedding model maps similar images close together and dissimilar images far apart in the embedding space. To compare two images, we simply compute the distance between their embeddings; if the distance is below a threshold, the two images are considered to match. Thus, an image classification problem can be transformed into an image similarity problem. One application of image embedding is facial recognition: instead of training a model with many images of each person's face, we can use image embedding to measure how similar one picture of a face is to another. The same technique can also be used for image search, anomaly detection, feature generation, and more.
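To make that comparison step concrete, here is a minimal sketch in plain NumPy. The vectors and the threshold value are purely illustrative; in practice the embeddings come from the trained model described later in this article.

import numpy as np

def are_similar(embedding_a, embedding_b, threshold=1.0):
    """Return True if two embedding vectors are closer than the threshold."""
    distance = np.linalg.norm(embedding_a - embedding_b)  # Euclidean distance
    return distance < threshold

# Illustrative 4-dimensional embeddings; real ones are produced by the CNN
face_1 = np.array([0.12, 0.80, 0.33, 0.05])
face_2 = np.array([0.10, 0.78, 0.36, 0.07])
print(are_similar(face_1, face_2))  # True -> likely the same person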
Figure 1 represents the architecture of an image embedding model. Passing an image through a series of convolutional, pooling, and fully connected layers produces a feature vector: a set of numbers that represents the embedding of the input image. In a classification model, this feature vector typically feeds into an output softmax layer, but we do not use that output layer when building an image embedding model. Instead, the feature vectors of the input images are compared for similarity, as shown in Figure 1. The output of the embedding layer can be passed on to other machine learning techniques such as clustering, k-nearest-neighbor analysis, and so on.
SAS provides a variety of layer types within its deep learning framework, including an embedding loss layer. The embedding loss layer computes losses such as contrastive loss, triplet loss, and quartet loss in order to compare the feature vectors. The source layers for these three loss functions must be fully connected (dense) layers with exactly the same number of neurons; that number of neurons defines the embedding dimension. As the names suggest, contrastive loss (Siamese) networks, triplet loss networks, and quartet loss networks ingest two, three, and four input image data streams, respectively.
Figure 2 shows each of the three architectures. What do the different input streams represent? In a Siamese network, the similarity or difference between the two input streams is tracked by an additional 'target' variable. A triplet network requires three input image data streams: the first input layer contains anchor (reference) examples, the second contains positive examples (images similar to the anchors), and the third contains negative examples (images dissimilar to the anchors). As a result, there is no need to specify a target column. All weights are shared across the three model branches. The triplet loss minimizes the distance between anchors and positives while maximizing the distance between anchors and negatives, up to a user-specified threshold. In quartet loss, the first three input layers follow the same requirements as the triplet loss, and the fourth input layer contains another set of negative examples.
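As a rough guide to what these losses compute, the sketch below shows the standard contrastive and triplet loss formulations in plain NumPy. The margin values, variable names, and the target-coding convention are illustrative; the exact computation inside SAS's embedding loss layer is not reproduced here.

import numpy as np

def contrastive_loss(emb_a, emb_b, target, margin=2.0):
    # Convention used here: target = 0 for a similar pair, 1 for a dissimilar pair
    d = np.linalg.norm(emb_a - emb_b)
    return (1 - target) * d**2 + target * max(margin - d, 0)**2

def triplet_loss(anchor, positive, negative, margin=2.0):
    # Pull the anchor toward the positive; push it from the negative, up to the margin
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0)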
An image embedding architecture can be defined in a few easy steps:
from dlpy.applications import ResNet18_Caffe
from dlpy.layers import Dense
from dlpy.embedding_model import EmbeddingModel

# Base network, embedding (dense) layer, and the Siamese embedding model
resnet18_model = ResNet18_Caffe(s, width=w, height=h, random_mutation='random')
embedding_layer = Dense(n=n, act='identity')
model_tr = EmbeddingModel.build_embedding_model(resnet18_model, model_table='test_tr',
                                                embedding_model_type='siamese', margin=m,
                                                embedding_layer=embedding_layer)
Figure 3 shows the architecture of a Siamese network (not all layers are shown) built in DLPy. There are two branches built from a base ResNet-18 network. DLPy handles replicating the ResNet-18 architecture across the two branches, removing the output layer from the base architecture, adding the fully connected layer to each branch, and finally joining them in the embedding loss layer. The two branches share the same parameters and weights. A triplet or quartet network would have three or four such branches, respectively. Note the 'margin' option in the code snippet above. For a Siamese network, the margin parameter is the upper bound for the distance between two dissimilar data samples; it keeps that distance from growing without bound.
The training data resides in a CASLIB-accessible location, so we can employ a server-side load of the data. Images are separated into directories by class, with the directory name indicating the class label. The input CAS table can be created and the images resized on the fly during training simply by specifying the path to the data. Model training is then carried out as follows:
# 'optimizer' is a DLPy Optimizer object defined beforehand (see the sketch that follows)
res = model_tr.fit_embedding_model(optimizer=optimizer, gpu=True, seed=1234,
                                   path='path_to_image_files_on_the_server_side',
                                   n_samples=n_samples, max_iter=max_iter,
                                   resize_width=w, resize_height=h)
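The optimizer argument above is a standard DLPy Optimizer object configured ahead of time; a minimal sketch is shown below. The solver choice, learning rate, batch size, and epoch count are illustrative values, not recommendations.

from dlpy.model import Optimizer, MomentumSolver

# Illustrative settings; tune these for your own data
solver = MomentumSolver(learning_rate=0.0001, momentum=0.9)
optimizer = Optimizer(algorithm=solver, mini_batch_size=8, max_epochs=20, log_level=2)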
If the model performs as desired, it can be applied to the validation/test data to verify the predictions.
An analytic store, or ASTORE, is a binary file that saves the state of an analytic object (such as a predictive model) after the training stage is completed. DLPy allows us to save the ASTORE from an image embedding model in two modes: (1) a branch model, which keeps only a single branch of the network, and (2) a full model, which keeps the entire multi-branch network:
branch_model = model_tr.deploy_embedding_model(output_format='astore', model_type='branch',
                                               path='path_to_astore')
full_model = model_tr.deploy_embedding_model(output_format='astore', model_type='full',
                                             path='path_to_astore')
To score a new data set, simply invoke the saved ASTORE and apply it to the new data. Scoring produces an n-dimensional feature vector for each image, where n is the embedding dimension determined during training. Note that when scoring with a branch ASTORE (option 1 above), we can pass a single stream of images, whereas a full ASTORE (option 2) requires as many streams of images as the type of network dictates (Siamese/triplet/quartet). This example takes you through the steps for creating an embedding table in DLPy; the resulting table could be scored against a full ASTORE. Finally, the feature vector obtained from scoring can be ingested by other analytical tasks, such as k-means clustering or k-nearest neighbors, as shown in this example, to rank or cluster the most similar objects.
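As a hint of what that downstream step could look like, the sketch below feeds a table of scored embedding vectors into scikit-learn for clustering and nearest-neighbor ranking. It assumes the scoring output has been exported to the client as a flat file ('scored_embeddings.csv' is a hypothetical name) with one column per embedding dimension; the parameter values are purely illustrative.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# One row per image, one column per embedding dimension (hypothetical export of the scoring output)
embeddings_df = pd.read_csv('scored_embeddings.csv')

# Group similar images into clusters
kmeans = KMeans(n_clusters=5, random_state=1234).fit(embeddings_df)

# Rank the images most similar to the first image
nn = NearestNeighbors(n_neighbors=5).fit(embeddings_df)
distances, indices = nn.kneighbors(embeddings_df.iloc[[0]])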
Many applications across industry, ranging from facial recognition to document ranking, can be tackled by placing contextually similar data points close together and dissimilar data points far apart in the projected embedding space. In this article, we looked at how image embedding can be performed using SAS DLPy. For more information on creating image embedding models, please visit the SAS DLPy GitHub page, and let us know how you plan to use it in your business problems.