
Visualizing High Dimensional Data Using t-SNE

Started ‎05-26-2023; modified ‎05-26-2023

Dealing with high-dimensional data is now imperative. The dimension of your data refers to the number of input variables. Analyzing high-dimensional data is challenging because human intuition about the geometry of high dimensions fails. One of the biggest challenges in data visualization is finding general representations of data that can display the multivariate structure of more than two variables. Appropriate visuals are especially helpful when you're trying to find relationships among hundreds or thousands of variables to determine their relative importance, or whether they are important at all.

 

The t-distributed stochastic neighbor embedding (t-SNE) is a method in SAS Viya for visualizing high-dimensional data. The t-SNE method computes a low-dimensional representation, also called an embedding, of high-dimensional data into two or three dimensions.

 

Unlike other dimension reduction methods, such as principal component analysis (PCA), t-SNE is appropriate for nonlinear data and emphasizes existing groupings in the data. The method is named t-SNE because it models the pairwise distances in low dimensions according to Student’s t-distribution. The t-distribution with one degree of freedom has heavier tails than the Gaussian distribution, which means that it assigns higher probability values to large distances. This enables t-SNE to relax pairwise distances for non-neighboring observations, whereas distances between closely neighboring observations are more exactly preserved. This behavior is desirable because it mitigates the crowding problem in high-dimensional data representation and makes existing groups in the data visually evident.
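The heavy-tailed kernel that gives t-SNE its name is easy to see numerically. The sketch below is plain Python, not SAS; the function names are illustrative and not part of any t-SNE implementation. It compares the unnormalized Gaussian kernel used for high-dimensional similarities with the Student's t kernel (one degree of freedom, i.e., a Cauchy kernel) used in the embedding space:

```python
import math

def gaussian_similarity(d, sigma=1.0):
    # Unnormalized Gaussian affinity, as used for high-dimensional
    # similarities in SNE/t-SNE.
    return math.exp(-d**2 / (2 * sigma**2))

def student_t_similarity(d):
    # Unnormalized Student's t affinity with one degree of freedom,
    # as used for low-dimensional similarities in t-SNE: (1 + d^2)^-1.
    return 1.0 / (1.0 + d**2)

# At small distances the two kernels are comparable; at large distances
# the t kernel's heavier tail assigns far more probability mass.
for d in (0.5, 2.0, 5.0):
    print(d, gaussian_similarity(d), student_t_similarity(d))
```

At a distance of 5, the Gaussian kernel is effectively zero while the t kernel still returns about 0.038, which is why t-SNE can place non-neighboring points far apart without incurring a large penalty.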

 

[Figure: illustration of t-SNE pairwise similarity (ss_1_t_SNE.png)]


 

For example, in the above illustration, the point-of-interest observation is represented in green, and the other two observations, 1 and 2, are represented in orange. Observation 1 has a higher similarity score than observation 2. Thus, it is closer to the point-of-interest observation.

 

The t-SNE plot is available using the Data Exploration node in Model Studio, which might help identify naturally occurring clusters in the data. You can also use the TSNE procedure in a SAS Code node.

 

The t-SNE method operates on an input data table. For each observation in the input data table, the procedure returns either two or three computed columns that contain the embedding coordinates. It computes the embedding by minimizing the Kullback-Leibler divergence between the joint probabilities in high dimensions and the joint probabilities in low dimensions, and it stores the embedding in the output data table. To speed up computations, the t-SNE method uses the Barnes-Hut approximation to the perplexity loss.
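To make the objective concrete, here is a minimal Python sketch (not SAS code; the toy probabilities are invented purely for illustration) of the Kullback-Leibler divergence that t-SNE minimizes between the high-dimensional joint probabilities P and the low-dimensional joint probabilities Q:

```python
import math

def kl_divergence(p, q):
    # Kullback-Leibler divergence KL(P || Q) between two discrete
    # distributions over the same pairs of observations.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy joint probabilities over three observation pairs: P comes from the
# high-dimensional similarities, Q from a candidate embedding.
p = [0.7, 0.2, 0.1]
q_good = [0.65, 0.25, 0.10]   # embedding that roughly preserves P
q_bad  = [0.10, 0.20, 0.70]   # embedding that inverts the neighborhoods

# t-SNE's optimizer moves embedding points to reduce KL(P || Q).
print(kl_divergence(p, q_good), kl_divergence(p, q_bad))
```

An embedding whose Q roughly matches P yields a small divergence, while one that inverts the neighborhood structure yields a much larger one; the optimizer descends this loss by gradient steps on the embedding coordinates.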

 

In the Data Exploration node, the t-SNE projection property specifies whether to perform t-SNE projections of input variables. This option is deselected by default. The t-SNE perplexity property specifies the t-SNE perplexity that controls the separation of points in the projected space. Possible values range from 1 to 100. The default value is 30.
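Perplexity can be read as a smooth measure of the effective number of neighbors each point considers. The following Python sketch is illustrative only (PROC TSNE computes this internally from its conditional probabilities):

```python
import math

def perplexity(p):
    # Perplexity of a discrete distribution: 2**H(p), where H is the
    # Shannon entropy in bits. Roughly the effective number of neighbors
    # a point "sees" when t-SNE builds its conditional probabilities.
    h = -sum(pi * math.log2(pi) for pi in p if pi > 0)
    return 2 ** h

# A uniform distribution over k neighbors has perplexity exactly k...
print(perplexity([0.25] * 4))
# ...while a peaked distribution has fewer effective neighbors.
print(perplexity([0.85, 0.05, 0.05, 0.05]))
```

Raising the perplexity property therefore makes each point attend to more neighbors, which tends to emphasize global structure at the expense of fine-grained clusters; lowering it does the opposite.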

 

The following example shows how to use the TSNE procedure to obtain an embedding from observations in a data table. The example uses the Iris data from Fisher (1936), which contain morphological measurements of 50 specimens from each of three species of iris flowers: Iris setosa, I. versicolor, and I. virginica.

proc tsne data=mycas.iris
   nDimensions=2
   perplexity=5
   learningRate=100
   maxIters=500;
   input SepalLength SepalWidth PetalLength PetalWidth;
   output out=mycas.tsne_out copyvars=(id species);
run;

The PROC TSNE statement and at least one INPUT statement are required; you can specify multiple INPUT statements. The NDIMENSIONS=2 option requests that the model return two embedding dimensions, the PERPLEXITY=5 option specifies the perplexity value, the LEARNINGRATE=100 option specifies the learning rate for the optimization, and the MAXITERS=500 option caps the number of optimization iterations. The INPUT statement specifies that the SepalLength, SepalWidth, PetalLength, and PetalWidth variables be used as inputs.

 

PROC TSNE returns a two-dimensional representation of each observation. The OUTPUT statement requests that the embedding be written to the data table mycas.tsne_out, and the COPYVARS= option requests that the ID and Species variables be copied to the output.

 

The following PROC SGPLOT statements plot embedding dimension _DIM_2_ against embedding dimension _DIM_1_:

 

proc sgplot data=mycas.tsne_out;
   title "Scatter Plot of Iris Embedding";
   scatter x=_DIM_1_ y=_DIM_2_ / group=species
           markerattrs=(symbol=CircleFilled);
run;

The visualization below shows the resulting scatter plot.

 

[Figure: scatter plot of the iris t-SNE embedding, colored by species (ss_2_Iris_tSNEclusters.png)]

 

The colors in the scatter plot indicate three distinct clusters, which correspond to the three iris species. The two axes of the low-dimensional embedding have no intrinsic meaning.

 

The algorithm offers two methods for gradient calculation: METHOD=BARNES_HUT and METHOD=EXACT. By default, METHOD=EXACT. For larger data sets, the Barnes-Hut method is recommended over the exact method because it gives good visualization accuracy at a much lower computational cost.

 

The t-SNE method doesn't construct an explicit mapping between the high-dimensional and low-dimensional spaces. Rather, t-SNE captures structure in the sense that neighboring points in the high-dimensional input space tend to remain neighbors in the resulting low-dimensional space. Exercise care when interpreting a t-SNE plot:

  1. Unlike PCA or MDS, larger distances can't necessarily be interpreted. If points are separated in the input space, t-SNE separates them in the low-dimensional space, but how far apart it places them carries no meaning.
  2. At times, t-SNE breaks overlapping or continuous regions of data into pieces and artificially separates them, particularly at low perplexity settings.

Note that the t-SNE method is well suited for visualizing high-dimensional data, and less so for dimension reduction, feature engineering, or preprocessing for subsequent clustering and modeling.

 

Related Links:

Find more articles from SAS Global Enablement and Learning here.

