Document embeddings (or vectors, as the fashionable like to say) have emerged as a popular area due to the focus on Generative AI. Visual Text Analytics, a SAS Viya offering for Natural Language Processing (NLP), provides an option to train embeddings through the Topics node, backed by the Singular Value Decomposition algorithm. I encourage you to refer here for a detailed discussion of topics.
The purpose of this article is to highlight a sometimes overlooked task when applying document embeddings to similarity-based search: normalisation of vectors, which helps obtain relevant matches.
First, let's consider vector embeddings. Simply put, these are numerical representations of the text contained within a document. Represented as a series of columns in a table, each column refers to some feature (also known as a dimension) of the source document, and together, these columns represent the document as a whole.
Why do we need to transform a document into embeddings in the first place? Text content can be represented in multiple styles and forms, making it hard to organise, classify and analyse. Embeddings give every document a consistent numerical representation, which makes those tasks, along with operations such as similarity search, tractable.
Now, let's consider the definition of a vector. In mathematics, a vector is a quantity that has both magnitude (length) and direction. It is therefore not a single number (which would make it a scalar) but a set of numbers, one for each dimension.
This is an extremely useful property, since it allows for operations which measure how similar two documents are based on the distance between their vectors. Let's take a simple case involving a two-dimensional vector.
Yes, I know. Poor William's turning over somewhere in Stratford-upon-Avon, but that's the price you pay for fame.
The image above shows vectors for two documents depicted in two-dimensional space. Given their coordinate points, vectors enable calculation of the distance between the embeddings, a simple and common implementation of which is Euclidean distance. With Text 1 at (3, 9) and Text 2 at (4, 8), this works out to square root((4 - 3)² + (8 - 9)²) = square root(2), or approximately 1.414. As the graph also shows, the vector distance can be viewed as the deviation in direction between the two vectors. A low value indicates that the two documents are more or less similar, which seems to be the case here, albeit to the horror of purists.
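If you'd like to verify the arithmetic yourself, here's a minimal DATA step sketch (the coordinates come from the illustration above; the variable names are placeholders of my choosing):

data _null_;
   /* Coordinates of the two example documents */
   x1 = 3; y1 = 9;   /* Text 1 */
   x2 = 4; y2 = 8;   /* Text 2 */
   /* Euclidean distance: square root of the sum of squared differences */
   distance = sqrt((x2 - x1)**2 + (y2 - y1)**2);
   put distance=;    /* writes distance=1.4142135624 to the log */
run;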
However, the utility of the above measure is limited! The reason is that this distance is highly vulnerable to scaling differences which may have been introduced during the embedding training process. Note that embeddings could originate from different sources, and we cannot take their representation as standard. This also affects the extent to which we can interpret any distance measure that's derived. Is 1.414 small (indicating similar documents) or large (indicating divergent ones)? We can't tell without a standard to compare against. Such a standard is achieved through a process known as normalisation.
The principle behind vector normalisation is intuitive. Let's consider the same example again.
Let's introduce the unit vector. The unit vector refers to the vector values within the small green box bounded by (1,1). A unit vector is defined as a vector with a magnitude of 1. A magnitude, simply expressed, refers to the length of a vector. Recalling Pythagoras, who used to haunt our geometry books, this can be calculated using the formula for the hypotenuse of a right-angled triangle, namely,
square root(sum of squares of the vector's components), i.e. for a vector v = (v1, ..., vn), magnitude = square root(v1² + v2² + ... + vn²)
Another name for the magnitude is norm, hence the term normalising the vector. To arrive at a normalised value, you simply divide the individual vector values by the magnitude. The resultant vector is a unit vector, which acts as a standard for carrying out similarity search and other vector-based operations.
In our simple example, the unit vectors work out to:
Document | Dimension 1         | Dimension 2
Text 1   | 3 / square root(90) | 9 / square root(90)
Text 2   | 4 / square root(80) | 8 / square root(80)
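To make this concrete, here's a small sketch that builds the two example vectors and normalises them (the dataset and column names are placeholders of my choosing, not the ones in the actual program):

/* Build the two example document vectors */
data work.embeddings;
   input document $ dim1 dim2;
   datalines;
Text1 3 9
Text2 4 8
;
run;

/* Divide each component by the vector's magnitude (its norm) */
data work.normalised;
   set work.embeddings;
   array dims {2} dim1 dim2;
   magnitude = sqrt(uss(of dims(*)));   /* uss = uncorrected sum of squares */
   do i = 1 to dim(dims);
      dims{i} = dims{i} / magnitude;
   end;
   drop i;
run;

Each row of work.normalised now has a magnitude of 1, so any distance you subsequently compute between documents reflects direction alone.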
Gladly. Please refer here for a SAS program which takes in an input table (or dataset) with vectors, and normalises the columns to a magnitude of 1.
The business end of this program can be found between lines 197 and 329. Notice that this program can run on both CAS and SAS (i.e. SAS 9 / SAS Compute or SAS Programming Runtime Environment) engines and uses array logic to normalise the vectors. Also to be noted is the use of the dictionary.columns table, which helps us identify all "vector" columns in the input table that conform to a given name pattern. This is highly convenient when dealing with typical vector data, which tends to run into the hundreds of columns. Imagine writing an array reference for each one of those!
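To give a flavour of that technique, here's a simplified sketch along the same lines (the library, table name, and the DIM% name pattern are assumptions for illustration; the full program behind the link is more general):

/* Find all columns in the input table whose names match the vector pattern */
proc sql noprint;
   select name into :vector_cols separated by ' '
   from dictionary.columns
   where libname = 'WORK'
     and memname = 'EMBEDDINGS'
     and upcase(name) like 'DIM%';
quit;

/* Normalise every matched column with a single array, however many there are */
data work.normalised;
   set work.embeddings;
   array dims {*} &vector_cols.;
   magnitude = sqrt(uss(of dims(*)));
   do i = 1 to dim(dims);
      dims{i} = dims{i} / magnitude;
   end;
   drop i;
run;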
Give the code a whirl and let me know your feedback. You might also notice that the core logic comes wrapped in a fair amount of surrounding code, a strong hint of my intention to also make it available as a SAS Studio Custom Step. Soon.
Cash would be better. Actually, thank you, but no sweat. I'm happy to answer further questions, though. You can email me by clicking here. Glad you enjoyed it.