
Scalable R Clustering and Visualization with SAS Viya Workbench by Anand Phand

Started ‎05-15-2025 by
Modified ‎07-28-2025 by


 

Summary 

In this blog, we dive into the seamless integration of R programming within the SAS Viya Workbench, highlighting how users can write and execute R code directly in notebooks. This powerful feature allows data scientists to leverage the flexibility of R while benefiting from the scalability and collaboration capabilities of the SAS environment. 

We illustrate this through a complete end-to-end use case using the popular penguins dataset. The journey begins with exploratory data analysis (EDA) to understand key patterns and distributions. We then handle missing values using imputation techniques, followed by feature scaling to prepare the data for clustering. Using the K-means algorithm, we uncover natural groupings within the dataset. Finally, we bring the analysis to life with visualizations using R libraries like ggplot2 and plotly. 

This use case highlights how R users can comfortably perform advanced analytics in the SAS Viya ecosystem, making it a versatile platform for modern data science workflows. 

 

Introduction to SAS Viya Workbench 

Hello R users! With SAS Viya Workbench, you can now choose your favorite IDE and start coding in R. Creating an R project is easy with Viya Workbench, as it takes minimal effort to spin up a session in the cloud in seconds. Depending on the project requirements, you can size your server by selecting the number of cores, the amount of memory, and GPU support. In this blog, we will briefly go through the server setup and then perform a simple data analysis with the penguins dataset. We will perform clustering, an unsupervised technique, to create clusters based on the features, aiming to identify the 3 groups of penguin species, and use the plotly library to visualize the results.  

 

The tutorial “SAS Tutorial | Getting Started with SAS Viya Workbench for Learners” describes a step-by-step procedure to start a workbench instance. Once the resource is created for you, you can click on the options button to start a workbench instance. 


 

After the status changes from stopped to running (shown in green), you can select your preferred IDE from the drop-down menu. For this blog, we will select Jupyter Lab-Python and R to launch.  


 

And that is it! Your IDE will open with the default workspace folder as your working directory, and the Launcher tab will offer options to create R or Python notebooks.  


Once your notebook is created, you can rename the file and start writing your code in R.  

 

About Data: Penguin is a new Iris!  

For many years, Iris (published in Annals of Eugenics in 1936 by Sir Ronald A. Fisher under the title “The use of multiple measurements in taxonomic problems”) was the go-to dataset for students, researchers, and practitioners studying statistical machine learning algorithms. Originally, this data was used for discriminant analysis and classification problems. It later proved to be an ideal data source for understanding segmentation, decision trees, support vector machines, logistic regression, etc. 

Later, in 2020, another such dataset was published in the open-source programming world through the R package “palmerpenguins”. The data were collected in 2007-2009 by researcher Dr. Kristen Gorman. It soon gained popularity across the community as another source of reliable data from a real-world study. In this blog, we will analyse the penguins data with visualization tools, perform clustering, and apply a dimension reduction technique, the t-SNE algorithm, whose output can be visualized as an interactive plot using the plotly library in R. 

 

Working with R on SAS Viya Workbench 

We will install all required libraries and load them in the session.  
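As a rough sketch of this step, the snippet below installs whatever is missing and then attaches the packages. The package list is our assumption based on the libraries named in this blog (palmerpenguins, ggplot2, plotly) plus Rtsne for t-SNE; the CRAN mirror URL is also an assumption and may need to change for your environment.

```r
# Assumed package list; adjust it to match your own notebook
pkgs <- c("palmerpenguins", "ggplot2", "plotly", "Rtsne")

# Find packages that are not installed yet
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only the missing ones (mirror URL is an assumption)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}

# Attach every package to the session
invisible(lapply(pkgs, library, character.only = TRUE))
```

Installing only the missing packages keeps repeated notebook runs fast.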


 

We will load the penguins data from the palmerpenguins library as df and check the top 5 rows to get a feel for the data. The dataset contains data on 3 penguin species observed in Antarctica, with features such as flipper length, body mass, bill dimensions, and island of observation. 
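A minimal sketch of this step, assuming the palmerpenguins package is installed:

```r
library(palmerpenguins)

# Copy the packaged data into a plain data frame called df, as in the text
df <- as.data.frame(penguins)

# Show the first five rows
head(df, 5)
```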


 

Exploratory Data Analysis 

We will perform basic analysis on the data to check for missing values, the distribution of the numeric variables, and the frequency of the target variable, species. 

First, use the str() function to check the basic structure of the R object.  


 

Get a count of each penguin species and summary statistics of all variables.  
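The structure check and the summaries described above could look roughly like this. The variable names species_counts and na_counts are ours, introduced for illustration only.

```r
library(palmerpenguins)
df <- as.data.frame(penguins)

# Column types and a preview of the values
str(df)

# Number of penguins per species
species_counts <- table(df$species)
species_counts

# Summary statistics for every column (NA counts appear here too)
summary(df)

# Explicit count of missing values per column
na_counts <- colSums(is.na(df))
na_counts
```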


 

Missing Value Imputation 

There are missing values present in the bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and sex variables. Let’s write a utility function to handle missing values using mean imputation for numeric columns. Mean imputation is simple but can bias the dataset. More advanced methods like KNN or model-based imputation might be preferred depending on the analysis goals. For the sex variable we will simply remove the rows corresponding to the missing observations. 
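One way to write such a utility is sketched below. The function name impute_mean and the data frame name df_clean are hypothetical, chosen for this illustration rather than taken from the original notebook.

```r
library(palmerpenguins)
df <- as.data.frame(penguins)

# Hypothetical helper: replace NAs in every numeric column with that column's mean
impute_mean <- function(data) {
  num_cols <- vapply(data, is.numeric, logical(1))
  data[num_cols] <- lapply(data[num_cols], function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
    x
  })
  data
}

df_clean <- impute_mean(df)

# Drop the rows where sex is missing
df_clean <- df_clean[!is.na(df_clean$sex), ]

nrow(df_clean)            # rows left after cleaning
colSums(is.na(df_clean))  # remaining missing values per column
```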


 

After handling missing values, the summary statistics show no remaining missing values, and the row count drops from 344 to 333.  

 

Feature Scaling and One-Hot Encoding 

Next, we can perform feature scaling on numeric variables and one-hot encoding of the categorical variables. 
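The sketch below shows one possible way to do both steps with base R. As a simplification, it drops incomplete rows with na.omit() instead of repeating the imputation step, and it assumes “Torgersen” and “female” are the base levels that receive no indicator column. The object names (scaled_num, dummies, features) are ours.

```r
library(palmerpenguins)

# Simplification: drop incomplete rows rather than imputing
df <- na.omit(as.data.frame(penguins))

num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")

# Standardize numeric features to mean 0 and standard deviation 1
scaled_num <- scale(df[num_cols])

# Make Torgersen and female the base levels so they get no indicator column
df$island <- relevel(df$island, ref = "Torgersen")
df$sex    <- relevel(df$sex, ref = "female")

# model.matrix() builds 0/1 indicator columns; drop its intercept column
dummies <- model.matrix(~ island + sex, data = df)[, -1]

# Combine scaled numeric features with the indicator columns
features <- cbind(scaled_num, dummies)
head(features)
```

Scaling matters for K-means because the algorithm relies on Euclidean distances, so a variable measured in grams would otherwise dominate one measured in millimetres.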

 


 

Only the non-base categorical levels are encoded, with “Torgersen” island and “female” sex treated as the base levels. 

Visualization for Distribution Analysis 

Overlaying histograms of a numeric variable, one per level of a categorical variable, lets you compare the distribution of that variable across distinct categories. This helps answer questions like: 

  • Do the groups have different means or spreads? 
  • Are the distributions skewed? 
  • Are there overlaps between groups? 

This is especially helpful for: 

  • Clustering: Do groups naturally separate based on the variable? 
  • Classification: Would this variable help a model distinguish categories? 

If the histograms are well-separated, that variable is likely to be a strong predictor. 
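A minimal ggplot2 sketch of such an overlay, using bill depth as the numeric variable and species as the category. The transparency, bin count, and labels are our choices, not necessarily those of the original plot.

```r
library(palmerpenguins)
library(ggplot2)

# Simplification: drop incomplete rows
df <- na.omit(as.data.frame(penguins))

# position = "identity" overlays the histograms instead of stacking them;
# alpha makes overlapping regions visible
p <- ggplot(df, aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(alpha = 0.5, position = "identity", bins = 30) +
  labs(title = "Bill depth by species", x = "Bill Depth (mm)", y = "Count")
p
```

Swapping bill_depth_mm for another numeric column is enough to repeat the comparison for other features.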


 

Here we have plotted the distribution of the Bill Depth (mm) variable, which shows a clear difference between Gentoo and the other two species. The ‘Adelie’ and ‘Chinstrap’ distributions overlap, with no strong separation between the two. However, other features in the data may have markedly different distributions for ‘Adelie’ and ‘Chinstrap’. You can try plotting histograms of the other feature variables and observe the differences.  

 

Heatmap for Clustering Variables and Observations 

A heatmap is a graphical representation of data where individual values are represented by colour. When applied to numeric variables, it shows at a glance which variables and which observations behave similarly. 

Heatmaps can show: 

  • Clusters of variables with similar behavior (via hierarchical clustering) 
  • Block patterns that might suggest latent structures in the data 

In R, you can plot a heatmap of a numeric matrix that groups similar observations together as well as similar variables. This is useful for selecting features for clustering. 
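As an illustration, base R's heatmap() function clusters both rows and columns hierarchically and reorders them accordingly. This is a sketch using the four scaled numeric measurements; the original screenshot may have used a different function or package.

```r
library(palmerpenguins)

# Simplification: drop incomplete rows
df <- na.omit(as.data.frame(penguins))

num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
mat <- scale(as.matrix(df[num_cols]))

# Rows are penguins, columns are features. The data are already scaled,
# so heatmap()'s own scaling is switched off. Row labels are hidden
# because there are too many rows to read.
hm <- heatmap(mat, scale = "none", labRow = NA, margins = c(10, 2),
              main = "Scaled penguin measurements")
```

The returned list holds the row and column orderings (hm$rowInd, hm$colInd) produced by the dendrograms.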


 

 

Now we are ready to apply a clustering algorithm to the scaled data, using a few selected features that we identified from the exploratory data analysis. 

 

Apply K-Means Clustering 

Using the kmeans() function, we will apply a clustering algorithm to our scaled data with the selected features only. Since we already know that the dataset contains 3 species of penguins, we will generate 3 clusters based on the features and then check whether these predicted clusters align with the actual species.
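A minimal sketch of this step follows. The feature selection (the four numeric measurements), the seed, and nstart = 25 are our assumptions, since the original selection is only visible in a screenshot.

```r
library(palmerpenguins)

# Simplification: drop incomplete rows
df <- na.omit(as.data.frame(penguins))

# Assumed feature selection for illustration
sel <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
X <- scale(df[sel])

# K-means starts from random centers, so fix the seed for reproducibility;
# nstart = 25 keeps the best of 25 random starts
set.seed(42)
km <- kmeans(X, centers = 3, nstart = 25)

# Cross tabulation: rows are cluster ids, columns are the actual species
table(cluster = km$cluster, species = df$species)
```

Cluster ids are arbitrary labels, so what matters in the cross tabulation is whether each species falls mostly into a single cluster, not which number that cluster has.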


 

 

As we can see here, the cross tabulation shows that the predicted clusters are aligned with the known species labels, and the clustering algorithm has performed well with selected features.  

 

Visualization of Clusters Using t-SNE for Dimensionality Reduction 

A t-SNE plot (short for t-Distributed Stochastic Neighbor Embedding) is a powerful tool for visualizing high-dimensional data. Although we have only a few variables here, we can still try out this algorithm and see if we can visualize our clusters in a 3D plot. 

t-SNE is particularly good at preserving local structure, meaning: 

  • Points that are close together in high dimensions tend to remain close together in the 2D/3D plot. 
  • This can reveal natural clusters, even if no clustering algorithm was applied. 

We will provide the full feature set for dimensionality reduction. The t-SNE algorithm will reduce it to 3 variables, and the final dataset, with the transformed features, will be used for visualization. 
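The reduction could be sketched with the Rtsne package as shown below. The choice of Rtsne, the perplexity value, and the seed are our assumptions; the full feature set here is taken to be the scaled numeric columns plus the indicator columns for island and sex.

```r
library(palmerpenguins)
library(Rtsne)

# Simplification: drop incomplete rows
df <- na.omit(as.data.frame(penguins))

num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")

# Full feature matrix: scaled numerics plus 0/1 indicators for island and sex
X <- cbind(scale(df[num_cols]),
           model.matrix(~ island + sex, data = df)[, -1])

# t-SNE is stochastic, so fix the seed; dims = 3 requests a 3D embedding
set.seed(42)
tsne_out <- Rtsne(X, dims = 3, perplexity = 30, check_duplicates = FALSE)

# Collect the three embedded coordinates together with the species label
tsne_df <- data.frame(tsne_out$Y)
names(tsne_df) <- c("dim1", "dim2", "dim3")
tsne_df$species <- df$species
head(tsne_df)
```

The embedded coordinates have no direct interpretation; only the relative positions of the points are meaningful.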


Using the plotly library, we will visualize the clusters in a 3-dimensional plot. 
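A self-contained sketch of such a plot is shown below. It recomputes the clusters and the 3D embedding on the four scaled numeric measurements so that it runs on its own; the feature choice, seed, and marker size are our assumptions.

```r
library(palmerpenguins)
library(Rtsne)
library(plotly)

# Simplification: drop incomplete rows
df <- na.omit(as.data.frame(penguins))
X <- scale(df[c("bill_length_mm", "bill_depth_mm",
                "flipper_length_mm", "body_mass_g")])

# Both steps are stochastic, so fix the seed
set.seed(42)
clusters <- kmeans(X, centers = 3, nstart = 25)$cluster
emb <- Rtsne(X, dims = 3, check_duplicates = FALSE)$Y

plot_df <- data.frame(dim1 = emb[, 1], dim2 = emb[, 2], dim3 = emb[, 3],
                      cluster = factor(clusters), species = df$species)

# Interactive 3D scatter: color by predicted cluster, show species on hover
fig <- plot_ly(plot_df, x = ~dim1, y = ~dim2, z = ~dim3,
               color = ~cluster, text = ~species,
               type = "scatter3d", mode = "markers",
               marker = list(size = 3))
fig
```

Coloring by cluster while showing the species on hover makes it easy to spot points whose cluster disagrees with their species.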


 

Concluding Remarks 

Exploring R programming within the SAS Viya Workbench opens powerful possibilities for data scientists and analysts. The ability to write, execute, and manage R code within a scalable, secure environment combines the flexibility of open-source R with the enterprise strength of SAS. Through our end-to-end analysis of the penguins dataset, from exploratory data analysis and data cleaning to clustering and visualization, we've seen how easy it is to build complete analytical workflows. As more organizations embrace multi-language data science platforms, SAS Viya Workbench stands out as a versatile and collaborative space for working with R. 

References: 

https://www.sas.com/en_us/software/viya/workbench.html 

https://allisonhorst.github.io/palmerpenguins/ 

https://medium.com/data-science/t-sne-clearly-explained-d84c537f53a 

https://medium.com/@hdpoorna/export-3d-plots-in-python-with-plotly-dfa0cbff671c 

https://uc-r.github.io/kmeans_clustering 
