Started ‎05-06-2024 by
Modified ‎05-09-2024 by

Document embeddings (or vectors, as the fashionable like to say) have emerged as a popular area due to the focus on Generative AI.  SAS Visual Text Analytics, a SAS Viya offering for Natural Language Processing (NLP), provides an option to train embeddings through its Topics node, which is backed by the Singular Value Decomposition (SVD) algorithm.  I encourage you to refer here for a detailed discussion of topics.

# Why is this important?

First, let's consider vector embeddings.  Simply put, these are numerical representations of the text contained within a document.  Represented as a series of columns in a table, each column refers to some feature (also known as a dimension) of the source document, and together, these columns represent the document as a whole.

Why do we need to transform a document into embeddings in the first place?  Text content can be represented in multiple styles and forms, making it hard to organise, classify and analyse.  Motivations for embedding documents include the following:

• data standardisation - similar terms are encoded as nearby values within dimensions rather than being treated as distinct units
• feature engineering - data is organised under different dimensions, each of which may carry a different meaning
• transformation for downstream applications such as analytics and machine learning, which are more amenable to numerical input
• masking - data is no longer represented as readable text, but as numerical proxies

Now, let's consider the definition of a vector.  In mathematics, a vector is a quantity that has both magnitude (length) and direction.  It is therefore not a single number (which would make it a scalar) but an ordered set of numbers, one per dimension.

This is an extremely useful property, since it allows for operations which measure how similar two documents are based on the distance between their vectors.  Let's take a simple case involving a two-dimensional vector.

Yes, I know.  Poor William's turning over somewhere in Stratford-upon-Avon, but that's the price you pay for fame.

The image above shows the vectors for two documents depicted in two-dimensional space.  Given their coordinate points, vectors enable calculation of the distance between the embeddings, a simple and common measure of which is Euclidean distance.  Here it works out to approximately 1.414 (the square root of 2).  As the graph also shows, the vector distance can be viewed as the deviation in direction between the two vectors.  A low value indicates that the two documents are more or less similar, which seems to be the case here, albeit to the horror of purists.
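As a quick sanity check, the Euclidean distance between the two example vectors, (3, 9) and (4, 8) (the values used in the table further down), can be computed in a few lines.  This is an illustrative Python sketch of the arithmetic, not part of the SAS program discussed later:

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared differences across dimensions
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

text_1 = (3, 9)  # Document "Text 1" embedding (two dimensions)
text_2 = (4, 8)  # Document "Text 2" embedding

print(round(euclidean_distance(text_1, text_2), 3))  # → 1.414
```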

However, the utility of the above measure is limited!  The reason is that this distance is highly sensitive to scaling differences which may have been introduced during the embedding training process.  Note that embeddings can originate from different sources, so we cannot take their scale as standard.  This also affects how we interpret any distance measure that's derived.  Is 1.414 small (indicating similar documents) or large (indicating divergent ones)?  We can't tell without a common yardstick.  Establishing one is achieved through a process known as normalisation.

# So, what should I do?

The principle behind vector normalisation is intuitive.  Let's consider the same example again.

Let's introduce the unit vector.  In the image above, it refers to the vector bounded by the small green box at (1,1).  A unit vector is defined as a vector with a magnitude of 1.  Magnitude, simply expressed, refers to the length of a vector.  Recalling Pythagoras, who used to haunt our geometry books, it can be calculated using the formula for the hypotenuse of a right-angled triangle, namely,

magnitude = square root(sum of squares of the dimensions)
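For the first document's vector (3, 9), that formula works out to the square root of 90.  An illustrative Python sketch:

```python
import math

def magnitude(v):
    # Pythagorean magnitude: square root of the sum of squared components
    return math.sqrt(sum(x * x for x in v))

print(magnitude((3, 9)))  # square root of (9 + 81) = square root of 90, roughly 9.4868
```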

Another name for the magnitude is the norm, hence the term normalising the vector.  To arrive at a normalised value, you simply divide each individual vector value by the magnitude.  The resultant vector is a unit vector, which acts as a standard for carrying out similarity search and other vector-based operations.

In our simple example, the unit vectors work out to:

| Document | Dimension 1 | Dimension 2 |
|---|---|---|
| Text 1 | 3 / square root(90) | 9 / square root(90) |
| Text 2 | 4 / square root(80) | 8 / square root(80) |
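These values can be verified by dividing each component by its vector's magnitude and confirming that each resulting vector has a magnitude of 1.  Again, an illustrative Python sketch of the arithmetic:

```python
import math

def normalise(v):
    # Divide each component by the vector's magnitude (its norm)
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

text_1 = normalise((3, 9))  # (3 / sqrt(90), 9 / sqrt(90))
text_2 = normalise((4, 8))  # (4 / sqrt(80), 8 / sqrt(80))

# Both normalised vectors now have magnitude 1
for v in (text_1, text_2):
    print(round(math.sqrt(sum(x * x for x in v)), 6))  # → 1.0
```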

# Do it for me, please?

Gladly.  Please refer here for a SAS program which takes an input table (or dataset) containing vectors and normalises the columns to a magnitude of 1.

The business end of this program can be found between lines 197 and 329.  Notice that the program runs on both CAS and SAS (i.e. SAS 9 / SAS Compute, also known as the SAS Programming Runtime Environment) engines and uses array logic to normalise the vectors.  Also worth noting is the use of the dictionary.columns table, which helps identify all "vector" columns in the input table that conform to a given name pattern.  This is highly convenient when dealing with typical vector data, which tends to run into hundreds of columns.  Imagine writing an array statement for each one of those!
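The SAS program itself is linked above; as a rough sketch of the same idea in Python — pick out the columns matching a name pattern, then normalise each row across just those columns — consider the following.  The `col` prefix and the toy rows are assumptions for illustration, not the program's actual naming convention or data:

```python
import math

# Toy table: each dict is a row; vector columns share a name prefix,
# much as dictionary.columns is queried in the SAS program to find them
rows = [
    {"doc_id": "Text 1", "col1": 3.0, "col2": 9.0},
    {"doc_id": "Text 2", "col1": 4.0, "col2": 8.0},
]

# Identify vector columns by name pattern instead of listing them by hand
vector_cols = [c for c in rows[0] if c.startswith("col")]

for row in rows:
    # Row-wise magnitude over the vector columns only
    norm = math.sqrt(sum(row[c] ** 2 for c in vector_cols))
    for c in vector_cols:
        row[c] /= norm  # normalise the row to magnitude 1
```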

Give the code a whirl and let me know your feedback.  You might also notice that the code has a number of other programs wrapped around it, a strong hint of my intention to also make it available as a SAS Studio Custom Step.  Soon.

# I want to meet you, shake your hand, and shower praise upon you.

Cash would be better.  Actually, thank you, but no sweat.  I'm happy to answer further questions, though.  You can email me by clicking here.  Glad you enjoyed it.
