Deep convolutional neural networks (CNNs) are powerful tools for image classification tasks. Image classification involves training a model on a large dataset containing examples of each class; the fitted model is then used to predict the class of new instances. For example, to classify images of vehicles, we would ideally train the model with many images of each type of vehicle. Some networks can identify thousands of classes with very good performance because they are trained on very large datasets. However, training on a large dataset is not always feasible.
What if we don't have enough images of each class to build a robust model? What if the classes change so frequently that retraining and redeploying the model is not practical? We can address these issues with image embedding, a method that represents an input image in a condensed form known as a low-dimensional feature vector. An image embedding model maps similar images close together and dissimilar images far apart in the embedding space. To compare two images, we simply compute the distance between their embeddings; if the distance is below a threshold, the two images are considered to match. Thus, an image classification problem can be transformed into an image similarity problem. One application of image embedding is facial recognition: instead of training a model with many images of each person's face, we can use image embedding to measure how similar one picture of a face is to another. The same technique can also be used for image search, anomaly detection, feature generation, and more.
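To make that comparison step concrete, here is a minimal sketch in plain NumPy. The vectors and the threshold value are purely illustrative; in practice the embeddings come from the trained model described later in this article.

import numpy as np

def are_similar(embedding_a, embedding_b, threshold=1.0):
    """Return True if two embedding vectors are closer than the threshold."""
    distance = np.linalg.norm(embedding_a - embedding_b)  # Euclidean distance
    return distance < threshold

# Illustrative 4-dimensional embeddings; real ones are produced by the CNN
face_1 = np.array([0.12, 0.80, 0.33, 0.05])
face_2 = np.array([0.10, 0.78, 0.36, 0.07])
print(are_similar(face_1, face_2))  # True -> likely the same person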
Figure 1 represents the architecture of an image embedding model. Passing an image through a series of convolutional, pooling, and fully connected layers produces a feature vector: a set of numbers that represents the embedding of the input image. In a classification model, this feature vector typically feeds into an output softmax layer, but we do not use that output layer when building an image embedding model. Instead, the feature vectors of the input images are compared for similarity, as shown in Figure 1. The output of the embedding layer can be passed on to other machine learning techniques such as clustering, k-nearest-neighbor analysis, and so on.
SAS provides a variety of layer types within its deep learning framework, including an embedding loss layer. The embedding loss layer computes losses such as contrastive loss, triplet loss, and quartet loss in order to compare the feature vectors. The source layers for these three loss functions must be fully connected (dense) layers with exactly the same number of neurons; that number of neurons defines the embedding dimension. As the names suggest, contrastive loss (Siamese) networks, triplet loss networks, and quartet loss networks ingest two, three, and four input image data streams, respectively.
Figure 2 shows each of the three architectures. What do the different input streams represent? In a Siamese network, the similarity or difference between the two input streams is tracked by an additional 'target' variable. A triplet network requires three input image data streams: the first input layer contains anchor (reference) examples, the second contains positive examples (images similar to the anchors), and the third contains negative examples (images dissimilar to the anchors). As a result, there is no need to specify a target column. All weights are shared across the three model branches. The triplet loss minimizes the distance between anchors and positives while maximizing the distance between anchors and negatives, up to a user-specified threshold. In quartet loss, the first three input layers follow the same requirements as the triplet loss, and the fourth input layer contains another set of negative examples.
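As a rough guide to what these losses compute, the sketch below shows the standard contrastive and triplet loss formulations in plain NumPy. The margin values, variable names, and the target-coding convention are illustrative; the exact computation inside SAS's embedding loss layer is not reproduced here.

import numpy as np

def contrastive_loss(emb_a, emb_b, target, margin=2.0):
    # Convention used here: target = 0 for a similar pair, 1 for a dissimilar pair
    d = np.linalg.norm(emb_a - emb_b)
    return (1 - target) * d**2 + target * max(margin - d, 0)**2

def triplet_loss(anchor, positive, negative, margin=2.0):
    # Pull the anchor toward the positive; push it from the negative, up to the margin
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0)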
An image embedding architecture can be defined in a few easy steps:
from dlpy.applications import ResNet18_Caffe
from dlpy.layers import Dense
from dlpy.embedding_model import EmbeddingModel

# Base network, embedding (dense) layer, and the Siamese embedding model
resnet18_model = ResNet18_Caffe(s, width=w, height=h, random_mutation='random')
embedding_layer = Dense(n=n, act='identity')
model_tr = EmbeddingModel.build_embedding_model(resnet18_model, model_table='test_tr',
                                                embedding_model_type='siamese', margin=m,
                                                embedding_layer=embedding_layer)
Figure 3 shows the architecture of a Siamese network (not all layers are shown) built in DLPy. There are two branches built from a base ResNet-18 network. DLPy handles replicating the ResNet-18 architecture across the two branches, removing the output layer from the base architecture, adding the fully connected layer to each branch, and finally joining them in the embedding loss layer. The two branches share the same parameters and weights. A triplet or quartet network would have three or four such branches, respectively. Note the 'margin' option in the code snippet above. For a Siamese network, the margin parameter is the upper bound for the distance between two dissimilar data samples; it keeps that distance from growing without bound.
The training data resides in a CASLIB-accessible location, so we can employ a server-side load of the data. Images are separated into directories by class, with the directory name indicating the class label. The input CAS table can be created and the images resized on the fly during training simply by specifying the path to the data. Model training is then carried out as follows:
# 'optimizer' is a DLPy Optimizer object defined beforehand (see the sketch that follows)
res = model_tr.fit_embedding_model(optimizer=optimizer, gpu=True, seed=1234,
                                   path='path_to_image_files_on_the_server_side',
                                   n_samples=n_samples, max_iter=max_iter,
                                   resize_width=w, resize_height=h)
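The optimizer argument above is a standard DLPy Optimizer object configured ahead of time; a minimal sketch is shown below. The solver choice, learning rate, batch size, and epoch count are illustrative values, not recommendations.

from dlpy.model import Optimizer, MomentumSolver

# Illustrative settings; tune these for your own data
solver = MomentumSolver(learning_rate=0.0001, momentum=0.9)
optimizer = Optimizer(algorithm=solver, mini_batch_size=8, max_epochs=20, log_level=2)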
If the model performs as desired, it can be applied to the validation/test data to verify the predictions.
An analytic store, or ASTORE, is a binary file that saves the state of an analytic object (such as a predictive model) after the training stage is completed. DLPy allows us to save the ASTORE from an image embedding model in two modes: (1) a branch model, which keeps only a single branch of the network, and (2) a full model, which keeps the entire multi-branch network:
branch_model = model_tr.deploy_embedding_model(output_format='astore', model_type='branch',
                                               path='path_to_astore')
full_model = model_tr.deploy_embedding_model(output_format='astore', model_type='full',
                                             path='path_to_astore')
To score a new data set, simply invoke the saved ASTORE and apply it to the new data. Scoring produces an n-dimensional feature vector for each image, where n is the embedding dimension determined during training. Note that when scoring with a branch ASTORE (option 1 above), we can pass a single stream of images, whereas a full ASTORE (option 2) requires as many streams of images as the type of network dictates (Siamese/triplet/quartet). This example takes you through the steps for creating an embedding table in DLPy; the resulting table could be scored against a full ASTORE. Finally, the feature vector obtained from scoring can be ingested by other analytical tasks, such as k-means clustering or k-nearest neighbors, as shown in this example, to rank or cluster the most similar objects.
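As a hint of what that downstream step could look like, the sketch below feeds a table of scored embedding vectors into scikit-learn for clustering and nearest-neighbor ranking. It assumes the scoring output has been exported to the client as a flat file ('scored_embeddings.csv' is a hypothetical name) with one column per embedding dimension; the parameter values are purely illustrative.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

# One row per image, one column per embedding dimension (hypothetical export of the scoring output)
embeddings_df = pd.read_csv('scored_embeddings.csv')

# Group similar images into clusters
kmeans = KMeans(n_clusters=5, random_state=1234).fit(embeddings_df)

# Rank the images most similar to the first image
nn = NearestNeighbors(n_neighbors=5).fit(embeddings_df)
distances, indices = nn.kneighbors(embeddings_df.iloc[[0]])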
Many applications across industry, ranging from facial recognition to document ranking, can be tackled by placing contextually similar data points close together and dissimilar data points far apart in the projected embedding space. In this article, we looked at how image embedding can be performed using SAS DLPy. For more information on creating image embedding models, please visit the SAS DLPy GitHub page, and let us know how you plan to use it in your business problems.