yesterday
Watch this Ask the Expert session to explore core techniques for effectively building with large language models (LLMs).
Watch the Webinar
You will get:
- An overview of LLMs and GenAI development.
- Encouragement to develop LLM-based solutions.
- LLM and GenAI use cases in the life sciences domain.
Q&A
To be added
Recommended Resources
Example of Retrieval Augmented Generation through Azure OpenAI
Example of maintaining prompts through a Prompt Catalog
Examples of Interacting with vector databases and search
Safety Signals from Patient Narratives through Generative AI
Please see additional resources in the attached slide deck.
Want more tips? Be sure to subscribe to the Ask the Expert board to receive follow up Q&A, slides and recordings from other SAS Ask the Expert webinars.
Wednesday
Okay, let me abuse the term 'stream of consciousness' and provide a quick outline of a situation I faced (and rectified within 20 minutes) yesterday.
Among other things, I dabble in Natural Language Processing (NLP). I also dabble in another seemingly unrelated area of creating low-code reusable components, which is basically a posh way of referring to wrappers on top of SAS programs. We call them SAS Studio Custom Steps. Both capabilities are offered on SAS Viya, a unified analytics platform which is highly addictive for people like me who like to ... well, dabble.
Which is why, yesterday, upon facing one of those weird errors that crop up in Visual Text Analytics (the SAS Viya product offering NLP), I panicked for only 15 minutes. Me knowing me, that's a record.
The error occurred because I decided to do something which I have decided I won't ever do in future: change the structure of an input table for a Visual Text Analytics project. Although VTA does offer an option to replace a data source (which is useful in many cases such as refreshing input data for some crazy and complicated projects), there is a link maintained between columns in the input table and metadata in a text analytics project. A change to a data source of the same structure (same number of columns with the same names) should be pretty seamless around 99% of the time (not 100%, because there is a god or something like that), but a change to the structure of input data (such as a new column added or an important text column getting deleted) increases the chances of error (because, well, there does seem to be a god).
I don't know the root cause of the error yet, but will probably be able to rectify it soon. The error is not the point of this article. The impact due to the error is my main area of focus. I had been working on something pretty interesting: a fairly comprehensive project with information extraction and categorisation rules for a complex taxonomy on technical paper abstracts for PharmaSUG 2025. Not heard of PharmaSUG? You should attend, if you are interested in applications of SAS and allied open source technologies to improve processes and outcomes in Life Sciences. Read here to learn more.
The point is, I had a project comprising 30 different information extraction rules and 20 different categorisation rules, which no human can remember. I needed to take remedial measures. But, first, I needed to scream.
Figure 1: The VTA error encountered. One day, I will find out what happened.....
Minutes 16 to 20
As mentioned, the first 15 minutes consisted of various childish activities. Then, I remembered an indiscretion from my somewhat recent (1-1.5 years ago) past: I had contributed a SAS Studio Custom Step, the "NLP - Extract Rule Configuration" step, to the SAS Studio Custom Steps GitHub repository. It had seemed fun at the time, and was motivated primarily by the following factors:
1. Transparently surface rule logic for aiding understanding by stakeholders
2. Identify changes to a set of rules
3. Help satisfy governance requirements.
I had even written an article about the same previously, available here. Now, the time had come to add an additional requirement satisfied by this step.
- A recovery mechanism in case of stuff happening
You may understand why this tends not to be the primary message behind positioning the custom step, as it hints at the possibility of an error, which nobody likes to talk about. But, the reality is, stuff does happen, more often than you think, and no system is immune or foolproof (even if one exists, well, stuff just hasn't happened yet). It's beneficial to build in mechanisms which help you to be resilient.
Back to action. The Extract Rule Configuration step helps extract rule configurations from an existing Visual Text Analytics project. This may refer to either a Concepts (information extraction) model or a Categories model. The step requires a reference to the VTA project in order to get started. More specifically, the project is tied to an Analytics caslib (a folder location for CAS tables) which contains all back-end tables and metadata created by the project. The easiest possible way of obtaining this is to refer to the front page of the VTA project.
Figure 2: Identifying the Project caslib location for VTA projects in Model Studio
After copying this somewhere, I then opened SAS Studio and created a new SAS Studio Flow. Since the input tables in the project caslib happen to be CAS tables, I first established a connection to SAS Cloud Analytics Services (CAS) as follows:
/* Provide a name for your CAS connection and connect */
cas ss;
/* Optional - ensure all caslibs are assigned SAS librefs */
caslib _all_ assign;
Then, I dragged a copy of the NLP - Extract Rule Configuration step onto the canvas. Refer here for instructions on how to make a custom step available in your SAS Viya environment.
In my particular case, I had two nodes to extract the rules for. To identify them, I needed to obtain a list of rule configuration tables within the caslib, which happens to be the first option available in the step (Generate a list of rule configurations). Select the first option and then provide the name of the analytics caslib in the space provided. Finally, attach an output port (right click -> attach output port) to the step and provide the name of a SAS dataset (which can be located in WORK; this is only a temporary dataset to hold the names of the config tables).
Figure 3: Generate a list of rule configurations
Now that we have a list of config tables ready, let's go ahead and extract them. For this purpose, drag a copy of the same step again to the canvas and this time, select the option named "Extract all rule configurations as per an input list" . Connect the ruleconfig list (the WORK dataset created during the first run of the step) to this step. In this case, all you have to do is to provide the name of a libref pointing to a CAS engine (I prefer PUBLIC since it's a shared caslib and easy to access), and this happens to be the place where the configuration tables get output. The names of the tables that get saved are located in the config list (WORK.RULECONFIG) created earlier.
Figure 4: The list of rule configuration tables generated. Note that there's a Concept and a Category table
Figure 5: Extract Rule Configuration tables to a Caslib
Load these tables to memory and open them using SAS Visual Analytics, the simplest application through which you can take a quick look at the rules. From this stage onwards, I found it extremely easy to export the rules over to an Excel table. I then created a new Visual Text Analytics project (my recovery project), added a Concepts node followed by a Categories node, and used my Excel sheet to quickly copy and paste all the rule names and rules into the new project. In very little time, I was back up and running, and had shelved my previous error for later analysis. My project's purring along perfectly now, thank you very much.
Figure 6: An example of the rule configuration table in Visual Analytics, which I then exported to Excel to easily copy over to a new project. Time : ~ 5 mins (end-to-end)
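In case you prefer a programmatic route over the Visual Analytics detour, here's a minimal sketch. The CAS connection name follows the earlier snippet, while the table name and output path are placeholders rather than the actual ones from my project:
/* Hypothetical sketch: export an extracted rule configuration table to Excel */
cas ss;
caslib _all_ assign;                          /* surface caslibs, including PUBLIC, as librefs */
proc export data=PUBLIC.ruleconfig_concepts   /* placeholder table name                        */
    outfile="/tmp/ruleconfig_concepts.xlsx"   /* placeholder output location                   */
    dbms=xlsx
    replace;
run;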
In Summary, Did I Learn Anything?
Yes, actually, and not just on the personal discovery side (such as: I never knew I knew such words). For one, this was one more reinforcement of how a mid-project change carries risks. In my case, the risk came from my decision to change the input table structure. The other lesson is not to trust software capabilities alone. Even the best designed software can be stumped by new edge cases; there's a reason why testing continues to be relevant in any software process, and my support goes out to all software developers and testers out there (lest they put a hex on me or something).
The biggest lesson I learnt, as mentioned earlier, was on the importance of being resilient in the face of a setback. I'm not going to go all 'Taleb' on you and venture into philosophy (though this is a pretty darn good book, if only I can get past page 101). What I appreciate most, from this experience, is the value of going beyond the UI. Automated GUIs with their bells and whistles are nice, but it always helps to understand how things work under the hood. That knowledge helped me design the custom step a while back, and I found that it came to the rescue this time.
A final point is that there are always other options for recovery, and this specific case is no exception. As a classic example of being wiser after the fact, I acknowledge that saving my rules periodically to the Exchange (favourite number one in this, by the way), which acts as a repository of templates, would have significantly reduced rework. At the same time, maybe I wouldn't want to use the Exchange for one-off projects with little scope for reuse. Another option would have been to refrain from changing the main project's table altogether, and to have instead created a new project for the new table structure. You live and learn.
Do you have any recovery 'miracles', or perhaps sob stories to share? Or even further questions about the NLP-related Custom Steps? Drop me an email and we can chat more.
- Find more articles tagged with:
- SAS Studio Custom Steps
3 weeks ago
SAS Studio Custom Steps provide a low-code user interface to SAS programs, promoting code reusability and accelerating analytics development. As a contributor to the SAS Studio Custom Steps GitHub repository (and a maintainer of other similar custom step repos), which continues to grow, I have found myself increasingly preoccupied with the following thoughts:
Multiple interfaces benefit from the same code: From a SAS perspective, this appeals to me because many programmers execute SAS programs without custom steps, and some prefer alternatives to SAS Studio such as Visual Studio Code (for which a SAS extension is available). Also, during the past year, SAS launched a developer-centric offering, SAS Viya Workbench (refer to some demos here), which does not require access to an enterprise SAS Viya environment. Leaving SAS-specific considerations aside, I still find the idea appealing due to potential ease of maintenance.
Robust code requires rigorous testing: Most custom steps require numerous input parameters, some combinations of which might result in error. A simple example is when input data requires different execution paths based on whether it is a SAS dataset or a CAS table. Human nature leads custom step designers to confine themselves to ‘happy paths’ and limited edge cases (also remember, most custom steps are created by analytics practitioners looking for quick solutions). As the number of test scenarios increases, it becomes labour-intensive to design and execute them. In effect, I wait for my ‘customers’ (users) to helpfully raise bugs in the future.
These thoughts drove me to devise an approach which eases custom step code testing, which I’d like to share through this article. Interestingly, this approach does not have much to do with the custom step builder in SAS Studio, but is more concerned with the broader programming environment under which code runs. This is because I prefer testing the code standalone first (in Visual Studio Code) and then wiring it to a custom step UI (involving further tests out of scope for this article). This also implies that, if useful, you can extend this approach to testing any SAS program, not just those which power custom steps. (A quick side-note: SAS code also refers to Python script called during the execution of SAS programs.)
About the Environment
The environment is what the environment facilitates. Okay, I allow my practical side latitude over the pedantic in this description. I focus upon just two aspects of the programming environment that help me in my objective, viz.
Autoexecs
The autoexec.sas file is a SAS program that runs every time you start a SAS session, and enables you to specify environment variables (and/or macro variables) relating to different test scenarios that can be picked up by the custom step’s SAS program (the ‘test’ program).
Let us consider this custom step that, incidentally, is under development at the time of writing this article (and will likely be published shortly). Without getting into details, this step helps users interact with an LLM (Large Language Model), passing data contained in a SAS dataset (or a CAS table). As you may notice from the list of variables used in this step (available here), there are quite a number of input parameters asked for. Multiple options provide additional opportunity for error. For example, the user is free to specify either SAS or CAS engines for both input and output tables, in itself necessitating four different tests (and that’s only testing one aspect of the code!)
The SAS program which powers the custom step is available for testing here. Now, let us consider the programming interface. As mentioned, I prefer the SAS extension for Visual Studio Code but, in deference to those who like SAS Studio, here’s how you access the autoexec in a SAS Studio session:
Go to SAS Studio (Develop Code and Flows)
Select Options -> Autoexec file
As depicted below. Interestingly, this was an opportunity for me to sign in as a non-administrator after a long time, just to ensure that the steps are relevant for non-admin users (who have fewer privileges than administrators). Thankfully, no surprises.
What if you were in Visual Studio Code? Here, you don’t have the benefit of being inside the 'mother ship'. Rather, imagine yourself as a small rocket trying to communicate with a larger space station (or whatever it’s called, conquering outer space is best left to some others). Here, not only do you have to submit your test program, but you also have to submit the code that you want executed prior to the program. Follow the instructions provided here.
Autoexec Contents
The next question could be: but, what do I enter in the autoexec code? Specify values for your required test input parameters either as global macro variables (recognised within SAS programs) or as environment variables, through the options set command. While I have used 'options set' throughout for reasons of uniformity (these are environment variables set in the underlying operating system), your mileage may differ and you might need to consider this decision carefully. Be cognizant of the differences between environment variables and macro variables, and their respective advantages and limitations.
Using the earlier example, my list of parameters works out to something like the below. Of course, I’m being careful and don’t expose all details (i.e. sensitive details). I’m sure you’ll let me know if I’ve slipped up somewhere (seriously, please do let me know through comments; I need to address and improve).
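To give a flavour, here is a minimal sketch of what such an autoexec could contain. The parameter names are the ones used by the step; every value shown is a made-up placeholder:
/* Hypothetical autoexec.sas contents - each options set= creates an environment variable */
/* that the test program later reads back with %sysget()                                  */
options set=inputData           "SASHELP.CARS";
options set=docId               "Make";
options set=textCol             "Model";
options set=systemPrompt        "You are a helpful assistant.";
options set=userPrompt          "Summarise the following text.";
options set=azureKeyLocation    "/path/to/azure_key.txt";
options set=azureOpenAIEndpoint "https://my-resource.openai.azure.com";
options set=genModelDeployment  "my-deployment-name";
options set=temperature         "0.7";
options set=outputTable         "WORK.LLM_OUTPUT";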
Remember, when you use “options set=..”, you are basically dealing with environment variables. Some environment variables can be introduced even at the level of the node which hosts the compute session pod. Therefore, be aware of the variables and options you set and the scope of the same, especially on a shared system.
Compute Contexts
Let’s shift gears a bit and consider compute contexts. Note that compute contexts are controlled by a SAS Administrator, always a reminder that it helps to be in the good books of those people (by the way, elevated privileges on a Viya environment gives you a SAS Administrator role). Compute contexts are documented here. We don’t need to get into the details, but, suffice it to say that compute contexts provide a medium to specify information that’s used when running a workload (program) in a SAS Compute server session. This information includes autoexec settings, which is where contexts prove useful.
For a majority of cases, it may be enough to use autoexec.sas directly to change test settings. However, as frequently experienced, a change made to a program at later stages of development may break tests that passed earlier - a classic case for regression testing. In such a case, instead of manually changing autoexec contents (or swapping out autoexec.sas files), you might find it easier to just switch contexts, each of which gets wired to a set of test parameters. Therefore, as you engage in development of a custom step, you might find it useful to work with your administrator beforehand and set up contexts for testing as shown below.
For Visual Studio Code, there’s an added bonus: not only can you use multiple contexts (that your administrator helped create), you can also define multiple profiles that are associated with autoexec.sas files containing different parameters. This also implies you can reuse a standard context such as SAS Job Execution Compute Context, and specify different autoexec.sas files per VS Code profile, reducing dependency on the admin for multiple contexts.
To define multiple profiles, press Ctrl (or Cmd, in case of Mac) + Shift + P in your Visual Studio Code window, type SAS: Add New Connection Profile, and provide the required parameters. As shown below. Note however that autoexecs execute upon the start of a SAS session, not merely by switching between profiles in Visual Studio Code.
Wiring your Program for Tests
Given a SAS program under development, how do you ensure it receives updated test parameters without repeated changes? For this purpose, I make it a point to add a commented “debug section” (call it a test section if you like) in my code. The debug section, when uncommented, simply takes its values from system options or global macro variables defined upstream.
The variables defined in the section below are from the earlier example, and obviously change based on your program’s objective.
/*-----------------------------------------------------------------------------------------*
   DEBUG Section
   Code under the debug section SHOULD ALWAYS remain commented unless you are tinkering with
   or testing the step!
*------------------------------------------------------------------------------------------*/
/* Provide test values for parameters */
data _null_;
   call symput('inputData', "%sysget(inputData)");
   call symput('systemPrompt', "%sysget(systemPrompt)");
   call symput('userPrompt', "%sysget(userPrompt)");
   call symput('userExample', "%sysget(userExample)");
   call symput('docId', "%sysget(docId)");
   call symput('textCol', "%sysget(textCol)");
   call symput('azureKeyLocation', "%sysget(azureKeyLocation)");
   call symput('azureOpenAIEndpoint', "%sysget(azureOpenAIEndpoint)");
   call symput('azureRegion', "%sysget(azureRegion)");
   call symput('openAIVersion', "%sysget(openAIVersion)");
   call symput('outputTable', "%sysget(outputTable)");
   call symput('genModelDeployment', "%sysget(genModelDeployment)");
   call symputx('temperature', %sysget(temperature));
run;
What happens during a production run, i.e. once the Custom Step has been published? The debug section is commented and does not run. Only when an error demands trial runs and iterations does the step consumer need to uncomment this section and try changing parameters to see what happens. This way, a test tool (during development) becomes a debug tool (during production).
The other advantage is that the user needn’t confine themselves to autoexecs and variables defined through contexts, but is free to modify the input parameters ad hoc, as long as it helps them in resolving their error.
Custom step creators can also take advantage of this mechanism to include some tests or examples as a way to demonstrate the SAS program behind a published step.
In summary
Autoexecs, compute contexts, and features in third-party editors such as Visual Studio Code enable Custom Step creators to test SAS programs more rigorously prior to hooking them up to a UI (or even running them standalone). An easier testing experience also drives readiness to design and execute more test scenarios, thus ensuring rigour in the process.
This improves the quality of output and first-pass yield, reducing the likelihood of future bugs. Also, creators can decouple the code from the UI and work on each component in a modular and focussed manner. Such focus can yield a SAS program that is suitable for, or easily modifiable for, other targets (SAS Viya Workbench, the Visual Studio Code extension and SAS Job Execution, among others).
As a practice, I follow a folder structure which helps me implement this framework effectively. I save my SAS program and my UI components separately and only combine them when it’s time to build the final Custom Step. In GitHub repos, as this recent example shows, the program resides in an /extras folder, explained here.
Progress on this front has also motivated other tools to help semi-automate creation of the UI, and even build the step, details of which can be shared in a subsequent post.
Feel free to experiment with this approach. It works for me, but perhaps you might like to do something slightly different, even radically different. In any case, please share your views through comments, or you can email me here: Sundaresh.sankaran@sas.com.
- Find more articles tagged with:
- SAS Studio Custom Steps
09-03-2024 04:42 PM
The concept of an autoexec file (specifically, an autoexec.sas file) has been around in SAS for many years. SAS programs rely on a host of contextual information in the form of data locations (paths, file references or libraries), system options, predefined macro variables and many others. Users find it beneficial to use autoexec.sas to execute pre-processing (you may consider them pre-'program') code because it saves time and simplifies maintenance of execution code, allowing it to run on multiple environments with different settings.
SAS Viya Workbench provides you multiple environments to test, iterate and develop solutions at scale. The ephemeral nature of Workbench instances, however, requires that you take certain steps to ensure that your autoexec specifications are recognised across those instances. Here's a quick overview:
Remember to set up a home directory!
Many prior unsuccessful attempts at using autoexec.sas stem from omitting this first step. As mentioned earlier, every Viya Workbench instance is spun up using containers and, unless persisted, data on these containers is specific to the instance alone and goes away with its deletion. You therefore need persistent storage to house the autoexec.sas file, and this role is performed by the home directory. Watch this quick video to learn how to set up a home storage location.
As you might observe, this is fairly easy. Note that users can create only one home storage location, which is automatically treated as the user's home folder only for subsequent workbench instances (not already running ones).
Create an autoexec.sas file
Keep an example autoexec.sas file ready. You might like to keep things simple for the first test. I typically test the same by defining a new libname (pointing to a location on my persistent storage) and then checking if this libname is automatically recognised in a future Workbench instance. Note this means you have to start a Workbench instance (without an autoexec.sas defined, yet), with storage attached, create a file, and then take steps to apply the same. As shown in the following video:
Note: In the video, for convenience, we saved the test_autoexec.sas in our persistent storage area. Ideally, however, you'd like to save this in your home directory for reasons which will soon be apparent.
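For reference, the first test autoexec.sas can be as small as a single libname statement. A minimal sketch, with the path standing in for wherever your persistent storage is mounted:
/* test_autoexec.sas - assign a libref that should then be available automatically */
/* in any new Workbench instance; replace the path with your own storage location  */
libname MYTEST "/path/to/my/persistent/storage";
%put NOTE: MYTEST libref assigned by autoexec.;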
Associate your autoexec.sas with the SAS session
To do this, we take advantage of the settings within the Visual Studio Code Extension For SAS, available here. A bonus is that you can edit this settings file for any use of the Visual Studio Code extension for SAS (not just Workbench). The following video shows you how.
It's easy to slip up here, so let me point out that you need to click on the tab named "Remote [xxxx.my.workbench.sas.com]" to get to the correct settings.json page. Otherwise you only edit the Workspace settings, which are local to the session.
Further, why bother adding a new autoexec.sas file, when there seem to be some default autoexec.sas files already existing? While you are free to edit those files directly, it might be convenient to add a new reference to your autoexec.sas file so that you can easily manage the same and maintain a separation between default autoexec settings and some settings which may be short-term or more dynamic in nature.
Another question could crop up regarding the need for a home directory. In addition to providing some persistent storage, the home directory also happens to be unique to a user's profile, i.e. there can be only one home directory available. On the other hand, there could be multiple other persistent storage locations defined, but the user is allowed to mount only one of them at a time.
Test your changes
Once you have made your edits, it's time to test if your changes work. As you might observe in the following video, upon creating a new Workbench instance and directly calling a program using a libname MYTEST, I find that MYTEST has been automatically assigned, proving that my autoexec.sas file works!
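A verification program can be as simple as the following sketch (the dataset name is arbitrary):
/* If the autoexec ran, MYTEST is already assigned - no libname statement needed here */
data MYTEST.autoexec_check;
    status = "autoexec works";
run;
proc print data=MYTEST.autoexec_check;
run;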
Have fun testing your autoexec.sas files. Remember, they take effect only for subsequent Workbench instances that you create.
08-09-2024 02:58 PM
In an earlier tip, we shared a GitHub repository with examples to help get started with SAS Viya Workbench. Repositories from GitHub (and similar services such as GitLab or BitBucket) are not just for examples. They are also used for source code management, collaboration and governance.
To optimise your experience working with ephemeral Workbench environments, Git repositories provide a centralised home for your SAS or Python code (whether development or production).
Option 1: Using the Extension
Let us look at two ways through which you access your code from Git repositories. The first makes use of the built-in Visual Studio Code Git extension. Watch the following video which shows how to use the Git extension to pull from a repository and start work in no time.
Some gotchas to look out for:
1. First, close the workspace folder in order to get the "Clone a repository" button to come up on screen.
2. The sign-on to GitHub (or other provider) is shown as an example. It's an optional step, but highly likely to be encountered, especially if you're planning to access a restricted / private repository. Note that the Visual Studio Code server, which operates inside a container, may not be as readily 'aware' of the connections it needs as a Visual Studio Code instance running on your own machine, and therefore may require authentication when interacting with external repositories.
3. Remember to choose your workspace folder (also identified by the $WORKSPACE environment variable) as the place where a local copy of your repository should reside. Since Workbench instances are ephemeral, only contents saved in the workspace folder are persisted and available later, should you decide to access the code through another Workbench instance.
Before you commit your changes..
Assuming you have made changes to your code, it's time to commit them. The extension comes in useful but, as you might notice at the end of the following video, you might run into a problem when trying to commit.
Git config - one more gotcha
Wait - you notice an error stating that Git has not been configured! Any guesses as to the reason why, 🙂 ? Recall that Workbench instances are ephemeral, and therefore settings necessary for interacting with a GitHub repository at a 'write' level (for example, committing changes and pushing them to a repo) require git to be properly configured. Note that I spelt git in lowercase, indicating that this refers to the 'git' command-line utility and not the Git repository.
Git configuration can be easily carried out via the VS Code terminal, using a git config command. At a minimum, the user.name and user.email fields need to be set. Additional configuration might be required for accessing more restricted repositories.
# Text within braces is a placeholder - replace it (braces included) with your own details
git config --global user.name "{your Git repo's username}"
git config --global user.email "{email used to access Git repo}"
Once you've done this, it's time to push your changes, as shown in the following video. For other useful operations, you might like to refer to tutorials such as this.
Option 2: Using git commands
Users who happen to be comfortable with the git command-line utility can also use it to clone, commit, and carry out other operations on a Git repository. The same gotchas mentioned earlier apply. Ensure you have carried out the git config statements and that you persist your local repo in the workspace folder. If interested, you can refer to this tutorial for more git commands.
Have fun porting your code across Workbench instances of your choice.
07-31-2024 01:18 AM
Advantages offered by SAS Viya Workbench include fast availability of a compute environment, and support for multilingual analytics (Python or SAS, based on skills and preference).
When it comes to Python environments, however, users demand flexible and convenient package installation to aid rapid iteration and experimentation. The default method is to use the standard Python package installer (pip) to install packages. However, pip installs can take a long time, and package installation becomes complex when many packages and dependencies are involved.
This tip suggests the use of virtual Python environments to help isolate and manage your Python package installation.
Using this opportunity, for informational purposes only, we also describe another Python installer utility called uv. The uv package is written in Rust and claims to be 10x - 100x faster than traditional pip and pip-tools, through use of a global cache and better dependency resolution. It follows a syntax very similar to pip, making it easy to adopt and use. Check out its PyPi page for more details.
IMPORTANT: Use uv based on your own discretion. The SAS Viya Workbench documentation specifies pip install as the standard approach to install packages. This tip regarding uv is only meant as a suggestion and I recommend you first try it out on a local Python environment for experience.
Virtual Environment creation
Virtual environments in Python allow you to create separate sub-environments within an existing Python environment. Any additional packages you install in the virtual environment are not accessible by the base (main) Python environment. This provides you, the Python user, benefits such as isolation, control over package dependencies, and rapid experimentation through different virtual environments. As environment definitions can be persisted, you also have an option of discarding your work in case of issues, and reverting to the base Python environment.
Creating a virtual environment in Workbench is easy and can be achieved through the following commands (for e.g., through a Terminal window on Workbench's Visual Studio Code application):
# Create a virtual env - provide your desired name in place of new_env
python -m venv new_env
# Activate virtual env
. new_env/bin/activate
The 'uv' angle: It's also possible to create virtual environments using uv. Install uv, using pip, into your target environment just like any other Python package. Once that's done, you can create a virtual environment and activate the same.
# Install uv and upgrade pip
pip install --upgrade pip uv
# Create virtual env using uv
uv venv new_uv_venv
# Activate virtual env
. new_uv_venv/bin/activate
Note how the commands for creating a virtual environment using uv are similar to traditional Python commands, i.e. python -m venv <virtual_environment_name>.
Deactivation: In order to revert to the base Python environment, use the deactivate command to leave the virtual environment. This applies for both virtual environments created through venv or uv.
# Deactivate virtual env
deactivate
echo "deactivated system"
# OPTIONAL - Remove virtual env folder
rm -rf new_env
Package installation
Once you've activated a virtual environment, install packages using the pip install command, either through a list of required packages, or a text file, conventionally named requirements.txt, containing a list of packages. Pip is a powerful utility which also provides other operations. The pip documentation is a useful reference.
# Install packages
pip install --upgrade -r requirements.txt
The 'uv' angle: The uv command is as easy as the traditional pip command. Just prepend the 'uv' command to a standard pip installation command, either specifying the packages as a list, or using a requirements.txt file. As shown in the following snippet.
# Install packages from requirements.txt
uv pip install --upgrade -r requirements.txt
Let's take a look at the effect. Taking an example where a minimal set of packages (for e.g., torch for machine learning, plus some additional helper packages) were specified in requirements.txt, the following animated gif shows how quickly uv downloads the same (faster than pip).
A quick impression about uv's speed improvements
To form a quick impression, we specified an example requirements.txt file consisting of the following common machine learning and analytics packages:
torch
swat
tensorflow
matplotlib
pandas
flask
Note that this is NOT meant as an exact test, rather a rough assessment which was sufficient to form an initial impression. We found that using uv took, on average, about 30% of the time taken by traditional pip commands.
Note that the time to install can be affected by package composition and several other infrastructure-specific / local details. uv provides its own claims and test details on its Python project page. An assessment of uv is beyond the scope of this tip.
If you are interested in a detailed script containing the commands provided above, feel free to access them from this GitHub repository.
Have fun trying out package installation and virtual environment methods on SAS Viya Workbench and share your experiences.
07-11-2024 05:50 PM
SAS Viya Workbench is now available for you to accelerate your code development! Here's a link to the product page on the AWS Marketplace: https://aws.amazon.com/marketplace/pp/prodview-oyybm2xk34dos?applicationId=AWSMPContessa&ref_=beagle&sr=0-1
Get Started!
As with any new software, you might feel a bit shy to try things out. Guess what? We've put together a set of tips that'll make you shed your inhibitions in a flash! Take a look at SAS Viya Workbench Examples, our GitHub repo containing a number of
- Example datasets,
- Python programs, and
- SAS programs
which provide you an experience of how to rapidly develop and prototype analytics using SAS Viya Workbench.
Stay tuned for more tips in future!
05-06-2024 08:04 PM
Document embeddings (or vectors, as the fashionable like to say) have emerged as a popular area due to the focus on Generative AI. Visual Text Analytics, a SAS Viya offering providing Natural Language Processing (NLP) capabilities, offers an option to train embeddings through the Topics node, backed by the Singular Value Decomposition (SVD) algorithm. I encourage you to refer here for a detailed discussion of topics.
The purpose of this article is to highlight a sometimes overlooked task when applying document embeddings for similarity-based search: normalisation of vectors, which helps obtain relevant matches.
Why is this important?
First, let's consider vector embeddings. Simply put, these are numerical representations of the text contained within a document. Represented as a series of columns in a table, each column refers to some feature (also known as a dimension) of the source document, and together, these columns represent the document as a whole.
Why do we need to transform a document into embeddings in the first place? Text content can be represented in multiple styles and forms, making it hard to organise, classify and analyse. Motivations for embedding documents include the following:
data standardisation - similar terms are packed as close numbers within dimensions rather than being treated as distinct units
feature engineering - data is organised under different dimensions each of which may carry different meaning
transformation for downstream applications such as analytics and machine learning, for which numbers are more amenable
masking - data is no longer represented as readable text, but as numerical proxies
Now, let's consider the definition of a vector. In mathematics, a vector is a quantity that has both magnitude (length) and direction. Therefore, it isn't just one number (which would make it a scalar) but a set of numbers, one for each dimension.
This is an extremely useful property, since it allows for operations which measure how similar two documents are based on the distance between their vectors. Let's take a simple case involving a two-dimensional vector.
Yes, I know. Poor William's turning over somewhere in Stratford-upon-Avon, but that's the price you pay for fame.
The image above shows vectors for two documents depicted in two-dimensional space. Given their coordinate points, vectors enable calculation of the distance between the embeddings, a simple and common implementation of which is Euclidean distance. This works out to 1.414 (the approximate square root of 2). As the graph also shows, the vector distance can be viewed as the deviation in direction between the two vectors. A low value indicates that the two documents are more or less similar, which seems to be the case here, albeit to the horror of purists.
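To make the arithmetic concrete, using the example coordinates that appear in the table further below - Text 1 at (3, 9) and Text 2 at (4, 8) - the Euclidean distance is square root((4 − 3)² + (8 − 9)²) = square root(2) ≈ 1.414.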
However, the utility of the above measure is limited! The reason is that this distance is highly vulnerable to scaling differences which may have been introduced during the embedding training process. Note that embeddings could originate from different sources and we cannot take their representation as standard. This also affects the extent to which we interpret any distance measure that's derived. Is 1.414 small (indicating similar) or large (divergent)? I'll never know until I use a standard. This is achieved through a process known as normalisation.
So, what should I do?
The principle behind vector normalisation is intuitive. Let's consider the same example again.
Let's introduce the unit vector. The unit vector refers to the vector values within the small green box bounded by (1,1). A unit vector is defined as a vector with a magnitude of 1. A magnitude, simply expressed, refers to the length of a vector. Recalling Pythagoras, who used to haunt our geometry books, this can be calculated using the formula for the hypotenuse of a right-angled triangle, namely:
magnitude = square root(sum of squares of the dimension values)
Another name for the magnitude is norm, hence the term normalising the vector. To arrive at a normalised value, you simply divide the individual vector values by the magnitude. The resultant vector is a unit vector, which acts as a standard for carrying out similarity search and other vector-based operations.
In our simple example, the unit vectors work out to:
Document | Dimension 1 | Dimension 2
Text 1 | 3 / square root(90) | 9 / square root(90)
Text 2 | 4 / square root(80) | 8 / square root(80)
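As a quick sanity check on Text 1: the magnitude is square root(3² + 9²) = square root(90) ≈ 9.49, so the normalised vector is roughly (0.316, 0.949), whose own magnitude is 1 - a unit vector, as required.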
Do it for me, please?
Gladly. Please refer here for a SAS program which takes in an input table (or dataset) with vectors, and normalises the columns to a magnitude of 1.
The business end of this program can be found between lines 197 and 329. Notice that this program can run on both CAS and SAS (i.e. SAS 9 / SAS Compute or SAS Programming Runtime Environment) engines and uses array logic to normalise the vectors. Also to be noted is the use of the dictionary.columns table, which helps us identify all "vector" columns in the input table which conform to a given name pattern. Highly convenient when dealing with typical vector data, which does tend to run into the 100s of columns. Imagine writing an array for each one of those!
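If you'd like a feel for the core logic without opening the full program, here's a stripped-down sketch (not the linked program itself): it assumes the vector columns share a hypothetical _Col_ prefix and normalises each row to unit length using an array.
/* Minimal sketch (hypothetical table name and column prefix) - not the full program */
data work.normalised;
    set work.embeddings;                 /* input table containing the vector columns */
    array dims {*} _Col_:;               /* all columns matching the name pattern     */
    magnitude = 0;
    do i = 1 to dim(dims);
        magnitude = magnitude + dims{i}**2;
    end;
    magnitude = sqrt(magnitude);         /* the vector's norm                         */
    if magnitude > 0 then do i = 1 to dim(dims);
        dims{i} = dims{i} / magnitude;   /* divide each dimension by the norm         */
    end;
    drop i magnitude;
run;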
Give the code a whirl and let me know your feedback. You might also notice that the code has a lot of other programs wrapped around the same, a strong hint of my intention to also make it available as a SAS Studio Custom Step. Soon.
I want to meet you, shake your hand, and shower praise upon you.
Cash would be better. Actually, thank you, but no sweat. I'm happy to answer further questions, though. You can email me by clicking here. Glad you enjoyed it.
- Find more articles tagged with:
- generative AI
- SAS programming
- vectors
03-11-2024 09:44 AM
Hi @touwen_k ,
Thank you for trying it out.
I copied your code and got the following output. At first, PROC_PYPATH didn't resolve since it wasn't declared as global. I mentioned this as global in the docstring but didn't explicitly add a %global PROC_PYPATH statement - thank you for pointing it out.
Regardless of the above, the values of the two macro variables do show up in the log. A suggestion would be to show them as notes, as below, in order to make them more prominent.
%put NOTE: The value of error_flag is &error_flag ;
%put NOTE: &error_desc ;
If you do not observe anything in the log, I'd suggest 'breaking the macro' and running the statements in smaller blocks, starting with the top-level if block and then getting into the nested blocks. Let me know if you need any help. Sundaresh.sankaran@sas.com
Log when PROC_PYPATH was not declared as global:
Log when PROC_PYPATH is declared as global:
02-21-2024 04:43 PM
I gave myself what I considered a simple task - to design a macro that checks whether Python is available to a given compute or batch session. As things turned out, I learnt a bit about SAS system options and environment variables used inside SAS sessions.
In a cloud-native world, most environments can be considered ephemeral. Workloads can be intended for more than one predesignated compute environment. Those environments may have different configuration.
Preflight checks, run through SAS macros prior to the main execution code, are extremely useful in assessing an environment for necessary characteristics.
Also, I found this task useful because it allowed the program to fail gracefully if it couldn't find Python.
Graceful failure, if accompanied by the right level of log messages, can help the developer quickly take remedial steps. It also helps avoid 'cleanup' or rollback situations where part of a program has already run and datasets may have been modified or created.
Access the Macro
The macro can be accessed from a GitHub repository I maintain for various utility SAS programs. I also use them in underlying code of many SAS Studio Custom Steps (low-code SAS Studio components which promote ease of use, code reusability and automation). Some of my custom steps happen to have proc python blocks, and in future, you may expect to see me including this macro in those steps.
Link to the Python check macro program: click here
Link to the GitHub repository (of utility programs): click here
Link to an example test code for the above macro: click here
How does it work?
We'll not venture too deeply into the inner workings of the macro's code here (which, at the end of the day, is pretty simple), but highlight key decision variables which help in determining access to Python. These can be understood through the following questions:
Does the SAS session know where Python is located?
This is informed by an environment variable called PROC_PYPATH. As Scott McCauley describes the process in his article on configuring Python in SAS Viya, PROC_PYPATH is set when configuring SAS Viya to access open-source languages, and provides a path to a Python executable invoked whenever PROC PYTHON is run.
Even if specified, does the Python executable really exist?
Environments can break, and it's possible Python might never have been installed, or was installed incorrectly or somewhere else. The macro checks the contents of the PROC_PYPATH variable to verify that the Python executable file (e.g. python3) mentioned actually exists and is known to the session. Note that in batch sessions, sometimes, a path to Python may not have been mounted as a volume, an error situation which can be identified in this case.
Is LOCKDOWN enabled or not?
LOCKDOWN is a security-centric status in SAS servers which disables certain operations and access methods to protect the system. Settings that allow access to external environments and languages like Python are disabled by default and have to be explicitly enabled. Certain environment variables control whether LOCKDOWN is enabled or not. These include COMPUTESERVER_LOCKDOWN_ENABLE & BATCHSERVER_LOCKDOWN_ENABLE, applied to compute and batch server sessions, respectively. Note that when they are set to 0, LOCKDOWN is disabled for that SAS server! This is not a desirable situation (even though it means that Python can run in that session) because it carries potential for compromise from a security perspective.
Are methods required to run Python enabled?
A final check is to ensure the following three access methods - PYTHON, PYTHON_EMBED and SOCKET - have been enabled. This means they would form part of the values in an environment variable called VIYA_LOCKDOWN_USER_METHODS. Although PYTHON_EMBED is specific to one way of running Python (using a submit block in PROC PYTHON), we include it as part of the check all the same. You can edit PYTHON_EMBED out if you don't want to perform this check.
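If you'd like a quick manual peek at the same variables the macro inspects, the sketch below prints them to the log (note that any variable not set in your environment will surface a warning from %SYSGET):
/* Print the environment variables which help determine Python availability */
%put NOTE: PROC_PYPATH = %sysget(PROC_PYPATH);
%put NOTE: COMPUTESERVER_LOCKDOWN_ENABLE = %sysget(COMPUTESERVER_LOCKDOWN_ENABLE);
%put NOTE: VIYA_LOCKDOWN_USER_METHODS = %sysget(VIYA_LOCKDOWN_USER_METHODS);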
When all these checks pass, a set of macro variables, passed as arguments to the macro, are populated with values that indicate there is no error, and also a description that states that a path to Python is available in this compute session and that Python has been enabled.
Even with all these checks, the macro should not be considered foolproof. One callout is when Python is accessed in the SAS Cloud Analytics Services (CAS) server through proc cas inside a compute server session. Even though the compute server is used in this case, the Python under question refers to the environment available to CAS, and may be further governed by a SAS External Language Settings File (EXTLANG). Extlang provides its own messages back to the calling program in case a user without necessary privileges attempts to run Python through a CAS action.
At the same time, it seems a safe assumption to say that remaining situations warranting further checks are rare and can be considered edge cases. Do feel free to write in if you have additional suggestions and pointers which can improve this macro!
Specify the macro within your SAS programs
The simplest way to use this macro within a SAS program would be to directly copy and paste it in your SAS program and then call the same. However, as you may have noticed, the macro's code is pretty long (including comments :)). Here's an alternative method which hopefully makes it easier to define the macro. It uses the Filename statement which creates a reference to the URL where this macro is located, and then "includes" it in the SAS program. This inclusion causes the macro to be specified (but not executed, yet) in your SAS session.
filename getsasf URL "https://raw.githubusercontent.com/SundareshSankaran/sas_utility_programs/main/code/Check_For_Python/macro_python_check.sas";
%include getsasf;
filename getsasf clear;
Of course, ensure you have a connection to the GitHub repository (you should, as long as your application's connected to the internet). There's always copy-paste as your best friend should you find things difficult.
Where, within a SAS program, would you specify and call this macro? While preferences and structures vary, I have found that it's useful to divide your code into "function code" and "execution code". "Function code" is usually defined upfront and tends to consist of macros, any user-defined functions, or other modularised elements you would like to call in your "execution code". You may like to define the macro within your "function code" and then call the macro (next section) in your execution code. Of course, this is only a suggestion. Use this wherever you like, as long as it works for you! :).
Call the macro
You typically call the macro at the start of your execution code. First, define the following macro variables (you can name them whatever you like).
1. A macro variable for an error flag: Specify this variable as global so that it can be used downstream. This macro variable represents a flag with a value of 0 indicating no errors from the check, and a value of 1 indicating some error.
2. A macro variable for an error message: Specify this as a global variable too. This is meant to hold a description of the error (or the absence of an error) that may have occurred.
An example is shown below:
%global python_error_flag;
%global python_error_desc;
Next, call the macro. You have a choice here. The check depends on the type of SAS server - whether a compute server or a batch server - you happen to execute your workload from. A compute server is the type of environment used when you open applications such as SAS Studio in the Viya platform. A batch server is typically used for batch submissions made using the sas-viya command line interface (CLI) in batch mode. Organisations may in some cases like to develop code using compute servers and then schedule them to run in batch.
If you neither know nor can control the target server, the _env_check_python macro can be used. It makes a determination about the server where the code runs and calls the relevant macro.
/* Note that the names of the error flag and error description macro variables are quoted when sent over as arguments - this is required.*/;
%_env_check_python("python_error_flag","python_error_desc");
An important note is that the names of the error flag and error description macro variables are quoted when provided as arguments. The macro is designed to take their names as references to the variables that need to be used.
If you do happen to know the type of target server, then either the _env_check_python_compute or the _env_check_python_batch macro can be called directly. The syntax is the same. For example,
/* In case of a compute server */;
%_env_check_python_compute("python_error_flag","python_error_desc");
/* In case of a batch server */;
%_env_check_python_batch("python_error_flag","python_error_desc");
Here's an example reference to the macro variables and result (for a successful identification of a path to Python). Error descriptions, when error situations occur, differ according to the circumstances and stage of the check where they were found. A quick read of the macro will provide you different error messages.
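For completeness, here's a hedged sketch of how the flag might gate downstream code. The handling shown (log a note and continue, or log an error and cancel the rest of the submission) is just one possible choice:
/* Act on the check results - a minimal, hypothetical handling pattern */
%macro check_and_report;
   %if &python_error_flag. = 0 %then %do;
      %put NOTE: &python_error_desc.;
      /* downstream code that depends on Python goes here */
   %end;
   %else %do;
      %put ERROR: &python_error_desc.;
      %abort cancel;   /* stop the submission rather than fail part-way through */
   %end;
%mend check_and_report;
%check_and_report;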
Acknowledgements
Thanks go out to a number of people who helped me in learning more about the variables that define access to Python, either directly or as a sounding board. Thanks especially to Wilbram Hazejager (@Wilbram-SAS) who identified potential improvements to an initial, lazy attempt and set me off on a path to find out what really goes on with all those options and environment variables. Also many thanks to Quan Zhou, Bengt Pederson, Rob Collum, Edoardo Riva, Doug Haigh and others who helped me.
- Find more articles tagged with:
- DataOpsWeek - Environmental Management
- open-source
- Preflight Checks
- python
11-09-2023 12:09 AM
Analytics developers require flexible and integrated pipelines where they can access all available tools for their needs.
Sometimes, developers may wish to use methods and functionality from open-source languages such as Python and R. SAS Viya provides access to these languages through its integration with open source and specifically, in SAS Studio, through procedures like Proc Python.
A challenge is that the Python and SAS compute engines operate in separate environments. Data needs to be transferred between those environments seamlessly in order to take full advantage of the integration. A new contribution - the Python - Load Objects to SAS custom step - helps facilitate this data exchange.
To illustrate, once your Python program has done its job, you can transfer desired data objects to a SAS Viya in-memory environment (or the SAS Programming Runtime Environment (also known as SPRE)) for accessing specific functionality or better performance. At the same time, you can easily free up memory taken up by these objects within Python.
An important note: this step may be new, but similar capability has existed for a while in the form of the SAS object in Proc Python. This custom step extends such functionality. Those who are familiar with the SAS object and the SAS.df2sd (DataFrame to SAS dataset) method are free to continue using them (and it's also used within this Custom Step). The additional benefits provided by this custom step are:
1. It provides a low-code wrapper around the data exchange process and makes it more transparent, instead of burying it in code.
2. It provides additional options for more Python data objects beyond pandas data frames, such as single objects (strings and integers), lists, and dictionary objects.
3. It promotes good memory and object management by deleting the Python object after transfer and running garbage collection on the same.
Those who are new to the paradigm of programming with SAS and Python in a combined fashion will find this step a useful aid in development of their programs.
Access and Use
This custom step is part of the SAS Studio Custom Steps GitHub repository, a collection of low-code components providing a productive and enjoyable developer experience. These steps provide a user interface for entering parameters, abstract away complex logic, and enable code reuse across many analytics tasks and programs.
Access the "Python - Load Objects to SAS" step from:
Link to the repository folder:
Python - Load Objects to SAS
Link to the README:
README
To use this step within SAS Studio in a SAS Viya environment, a recommended approach is to follow the instructions to upload a selected custom step to SAS Viya. An alternative is to use the Git integration functionality already available in SAS Studio: clone the SAS Studio Custom Steps GitHub repository and copy the required custom steps into your SAS Content folders. Refer to this post for some useful tips.
Application
The most common scenario for this step is transferring a pandas data frame to a corresponding SAS table (either a sas7bdat dataset or an in-memory table in SAS Cloud Analytics Services (CAS)). Pandas data frames are among the most commonly used Python data structures in data science, and analytics practitioners use them to carry out transformations such as calculating new columns, transforming existing ones and reshaping data inside Python.
As mentioned earlier, a built-in SAS.df2sd method exists in Proc Python, which is meant for transferring data frames to target SAS tables. The SAS object has been created for use within a SPRE environment (also known as SAS compute), but can also be used for CAS table targets.
When CAS targets are specified, there's a dependency that a CAS session must exist prior to the SAS callback object's execution, something not all analysts may be aware of. For this reason, the custom step offers an alternative to the SAS.df2sd method for CAS targets, making use of the Python swat package to transfer the data frame from pandas to a CAS table. Some Python coders may already be familiar with the swat package as a means of running CAS actions from Python, and may choose the custom step for this purpose.
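To make the two routes concrete, here is a hedged sketch (not the code the custom step generates) of a manual transfer inside Proc Python; the data frame contents, table names, caslib and CAS connection details are illustrative assumptions:
/* Illustrative sketch only: the two transfer routes the custom step wraps */;
proc python;
submit;
import pandas as pd
import swat

df = pd.DataFrame({"id": [1, 2, 3], "score": [10.5, 20.1, 30.7]})

# Route 1: SAS callback method - pandas data frame to a SAS dataset (compute target)
SAS.df2sd(df, "work.scores")

# Route 2: swat package - pandas data frame to a CAS table (CAS target)
# Connection details are assumptions; reuse your site's CAS host, port and authentication.
conn = swat.CAS("your-cas-host", 5570)
conn.upload_frame(df, casout={"name": "scores", "caslib": "casuser"})
endsubmit;
run;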
Garbage Collection
After data is transferred from Python to SAS, the pandas data frame continues to reside in memory in the Python environment's namespace. In most cases the data goes through further transformations in SAS anyway, so the pandas data frame may no longer be needed. However, in Python, deleting the data frame only removes the binding between the name and the underlying object. To free up the memory promptly, garbage collection needs to run, which operates under some constraints and can be invoked explicitly by the developer. To make this convenient, the custom step provides an option to delete the data frame and run garbage collection after the transfer completes. This helps keep memory lean.
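Here is roughly what that cleanup option amounts to, sketched inside Proc Python (the data frame name and target table are assumptions):
/* Illustrative sketch of the post-transfer cleanup performed by the custom step */;
proc python;
submit;
import gc
import pandas as pd

df = pd.DataFrame({"x": range(1000)})
SAS.df2sd(df, "work.out")   # hand the data off to SAS
del df                      # drop the name binding to the data frame
gc.collect()                # explicitly reclaim the memory it referenced
endsubmit;
run;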
The "Quick Promote"
Some flows perform the entire analytics process in Python and use a CAS table only as a final destination for visualisation, for example in SAS Visual Analytics (VA). For such cases, this custom step also provides an option to promote the transferred table to global scope in CAS, making it accessible from Visual Analytics for use within a report.
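Outside the custom step, the equivalent promote can be sketched as follows, assuming the transferred table is named SCORES in the CASUSER caslib and that a CAS session can be started with default connection settings:
/* Hedged sketch: promote a session-scope CAS table to global scope */;
cas mysess;
proc casutil;
   promote casdata="scores" incaslib="casuser" outcaslib="casuser";
quit;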
Other Python Objects
A bit of trivia: did you know that what's commonly referred to as a variable is, in Python, a name (reference) bound to an object? There are some discussions online around this (and interestingly, some relate to the constraints around garbage collection mentioned earlier). In any event, this custom step enables you to transfer other objects (i.e. not data frames) to corresponding SAS objects. These other common object types are:
Pandas dataframes can be transferred to either CAS tables or SAS datasets - we've already covered this.
Standard Python objects (int, str etc.) can be transferred to SAS macro variables (a minimal sketch of this follows below)
Lists (array-like data structures) can be transferred to CAS tables or SAS datasets (with a user-specified name serving as the column name for the list)
Python dict objects, which resemble JSON, can be transferred to CAS tables or SAS datasets, using pandas data frames as an intermediary.
All the above options are available in the step; feel free to play with them.
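For the scalar case, here is a minimal sketch, assuming the SAS.symput callback is available and using an illustrative macro variable name:
/* Illustrative sketch: hand a single Python value to a SAS macro variable */;
proc python;
submit;
threshold = 0.75
SAS.symput("py_threshold", str(threshold))   # macro variable name is illustrative
endsubmit;
run;

%put NOTE: Value received from Python: &py_threshold.;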
Have fun with the "Load Objects to SAS" custom step and I hope it helps you with your open source integration initiatives. Feel free to get in touch with any thoughts or questions.
- Find more articles tagged with:
- DataOpsWeek - Environmental Management
- SAS Studio Custom Steps
11-07-2023
05:34 PM
Text data varies in structure, size and content. Some documents may be short and brief (responses to pointed questions in a survey), while others tend to be longer and address multiple areas of interest (for example, the review you post after spending the night at an inexpensive rat-infested motel with a window overlooking the sea).
Natural Language Processing (NLP) applications tend to analyse documents in their entirety. In certain cases, however, analysis at a more granular level, such as at the sentence or paragraph-level, may be considered because:
1. It enhances the quality of analysis by enabling more granular processing and localising context in some situations.
2. In other situations, it contributes to more efficient processing by providing opportunities to remove noise and reduce the size of the individual payload processed against a set of rules.
A recent open-source contribution, the NLP - Sentence Splitter SAS Studio Custom Step, helps data scientists and data engineers split text observations into constituent sentences while retaining identifiers and other metadata. It is an easy-to-use pre-processing component which can be applied prior to running NLP.
This is part of the SAS Studio Custom Steps GitHub repository, a collection of low-code components providing a productive and enjoyable developer experience. These steps provide a user interface for entering parameters, abstract away complex logic, and enable code reuse across many analytics tasks and programs.
Link to the repository folder:
NLP - Sentence Splitter
Link to the README:
README
Access & Use
We first start with the question of accessing these steps from within a SAS Viya environment. A recommended approach is to follow the instructions to upload a selected custom step to SAS Viya. An alternative is to use the Git integration functionality already available in SAS Studio: clone the SAS Studio Custom Steps GitHub repository and copy the required custom steps into your SAS Content folders. Refer to this post for some useful tips.
Using this Custom Step is easy. Referring to the README, you need:
1. A table loaded to SAS Cloud Analytics Services (CAS)
2. A column containing text (which may comprise one or more sentences)
3. A column to be used as a document ID
Upon connecting the input table, assigning the required parameters and specifying an output table, you'll notice the following (a rough sketch of this output shape follows the list):
1. The output table will now contain a column called _match_text_. This column contains sentences as separate observations.
2. Each sentence is identified by an ID variable, the _sentence_id_. Note that this sentence ID is within the context of the original document, i.e. Doc 1 will contain sentence IDs 1, 2, 3, ..., n (where n is the number of sentences it contains) and Doc 2 will again number its sentences 1, 2, 3, ..., k.
3. To ensure a unique ID for each combination of sentence and document ID, a new ID column called Obs_ID concatenates the Document ID with the Sentence ID (with some padding applied).
4. Individual offsets (_start_ and _end_) denoting where sentences start and end are also provided.
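To get a feel for this output shape, here is a rough, purely illustrative DATA step approximation that splits on terminal punctuation. It is not what the custom step does (the step performs the splitting in CAS with more robust handling), and the input table and column names are assumptions:
/* Rough illustration only: naive split on sentence-ending punctuation */;
data sentences;
   length _match_text_ $ 1024 obs_id $ 64;
   set reviews;                                  /* assumes columns DOC_ID (character) and TEXT */
   rx = prxparse('/[.!?]+/o');                   /* terminal punctuation; naive, so "Mr." splits too */
   _sentence_id_ = 0;
   do while (lengthn(text) > 0);
      call prxsubstr(rx, text, pos, len);
      if pos = 0 then do;                        /* no punctuation left: take the remainder */
         _match_text_ = strip(text);
         text = '';
      end;
      else do;
         _match_text_ = strip(substr(text, 1, pos + len - 1));
         if pos + len <= lengthn(text) then text = strip(substr(text, pos + len));
         else text = '';
      end;
      if lengthn(_match_text_) > 0 then do;
         _sentence_id_ = _sentence_id_ + 1;
         obs_id = catx('_', doc_id, put(_sentence_id_, z3.));
         output;
      end;
   end;
   keep doc_id _sentence_id_ _match_text_ obs_id;
run;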
An Application
Let's now consider a quick application showing the benefit of sentence-level analysis: feature-level sentiment! Consider this delightful review of an establishment I have had the honour of recently staying at (with their name changed since they don't like fame, naturally):
"I felt wonderful when I noticed the pretty facade of the Jolly Rodent motel. The kindly lady at the counter was gracious enough to let us know we were early, but she would still make a room ready in five minutes. Passing through the courtyard, I admired the ornate water fountain in the middle of a floral garden. Alas! The scene proves completely different when I took in our room. Not that there was much to take in, it seemed to end right where it began! Gritting my teeth and reassuring myself that a bathroom door located next to the cooking range was a novelty, not an hazard, I tried to freshen up. However, the moldy nature of the bathroom and the rust stains on the basin freaked me out. The restaurant food was middling and insipid, and the rude service of the waiters made us wish we were out of there as soon as possible."
Note that I've colour-coded the passage above according to differing levels of sentiment (green indicates positive sentiment, red indicates negative, and brown is somewhat negative to neutral). On the whole, upon reading the entire review, it might seem that it expresses a generally negative sentiment. But this isn't really fair to the proprietors of the Jolly Rodent, who should be complimented on their beautiful building, front-desk service and water fountain. This is an ideal opportunity for the Sentence Splitter to uncover insights that might otherwise have been obscured. Another benefit is that the smaller text units may provide efficiency gains by allowing developers to weed out noise prior to analytics. Finally, sentence-level analysis supports feature-level sentiment analysis and allows a categorisation or inference to be extracted from a smaller, localised context.
Having run the Sentence Splitter on a table containing the above text, we can then apply an NLP task (such as sentiment analysis) on individual sentences and obtain the following:
Notice how the Sentence Splitter has now enabled individual (positive) sentences to stand out on their own from within an overall negative review? This use case can be extended and modified for Feature - Sentiment association, categorisation, progression of sentiment and other applications.
Have fun using the NLP - Sentence Splitter step. Feel free to get in touch in case of any questions or comments.
- Find more articles tagged with:
- SAS Studio Custom Steps
09-25-2023
02:39 PM
2 Likes
Modernization of analytics platforms requires focus on costs and higher efficiency.
Workload Management, the SAS Viya solution for handling many intensive compute workloads efficiently, has been generally available as an add-on to SAS Viya since November 2021.
Two recent developments bring renewed focus upon Workload Management:
It's now more easily accessible! From monthly stable 2023.08 onwards, Workload Management is provided out of the box with the majority of SAS Viya offerings. You only need to enable and configure it.
Configuration now enables automated scaling of compute nodes to accommodate workloads of varied profile! This provides multiple benefits. For one, automation can quickly address current demand. More significantly, administrators can differentiate resources to suit the type of workload being submitted.
The documentation on how to administer autoscaling policies is pretty straightforward and can be found here. This article demonstrates how we configured an example deployment for autoscaling and were able to execute workloads using the right resource level and type.
Infrastructure
We've based our example on Azure cloud resources, but configuration and setup follow a similar pattern across providers. For the infrastructure, we made use of the GitHub Viya 4 Infrastructure-as-Code (IAC) repository for Azure, in which we specify the desired infrastructure; an automation tool (Terraform) then interacts with the cloud provider to provision it. Here's the topology of SAS Compute nodes we took into account (since Workload Management is currently concerned with Compute nodes only, we have omitted other node pools, network and storage, but those also have to be factored in).
Table 1: Topology of Compute Nodes
Node Pool Name | Purpose | Machine Type | Min - Max Range | Labels
Compute | Interactive users (SAS Studio and similar) | Standard_E4bds_v5 (4 vCPUs, 32 GiB RAM, 150 GiB disk space) | 1 - 1 | wlm/nodeType="interactive"
Combatsm | Small SAS jobs submitted through batch | Standard_E4bds_v5 (4 vCPUs, 32 GiB RAM, 150 GiB disk space) | 0 - 8 | wlm/nodeType="batchsmall"
Combatmd | Medium SAS jobs submitted through batch | Standard_E8bds_v5 (8 vCPUs, 64 GiB RAM, 300 GiB disk space) | 0 - 2 | wlm/nodeType="batchmed"
Combatlg | Large SAS jobs submitted through batch | Standard_E16bds_v5 (16 vCPUs, 128 GiB RAM, 600 GiB disk space) | 0 - 1 | wlm/nodeType="batchlarge"
All node pools additionally carry the labels workload.sas.com/class="compute" and launcher.sas.com/prepullImage="sas-programming-environment".
Some salient points:
As already stated, this covers only SAS Compute node pools, which handle programs that run in either a compute server or batch server session. These are mostly SAS programs operating on SAS datasets, but they may also call other compute engines such as SAS Cloud Analytics Services (CAS), Python or R (where configured). All nodes in the above node pools need to carry the label workload.sas.com/class="compute"; only nodes labelled as such are considered for Compute workloads.
The above is opinionated, meaning that we, as administrators, decided to provision this configuration for the purposes of this example. How do organizations decide on the types of node pools to harness? Some may choose to go with just one, while others may have a wider range at their disposal. The choice is based on their current profile of SAS workloads and many other factors (including cost). One tool which can help your organization make this decision is Ecosystem Diagnostics, described further in this article.
Notice that a majority of the planned node pools start with a minimum of 0 nodes. Even though we provision a variety of node pools, autoscaling enables scale-up from zero, so we don't pay for compute resources unless they are actually used.
At the same time, notice the outlier (“Compute”, for Interactive purposes) node pool which has a minimum of 1. This is done on purpose, because interactive users appreciate a compute server which is always on due to the nature of interactive workloads. Simply put, in this day and age, you don’t want interactive users staring at spinning wheels or cranking up a node to make it start. You have the flexibility to keep a warm node to serve a subset of users, maybe with a small machine in order to keep cloud costs low.
Don't forget to add the additional label launcher.sas.com/prepullImage="sas-programming-environment" to all nodes; it pre-pulls the SAS programming environment container image so new nodes can start sessions faster, and saves you a lot of angst.
Configuration
This is the fun part. Given provisioned infrastructure, let’s look at how to configure Workload Management & optimize usage as per needs.
You can configure Workload Management through a plugin called Workload Orchestrator (WLO) in the SAS Environment Manager application. Administrators use WLO to implement decisions about the appropriate resource to run a workload on. For users, it is a great place to monitor the status of jobs.
The configuration process is greatly eased when you treat the entire configuration as a single JSON file containing all the required details. In the configuration page of WLO, simply click "Import", import the sample configuration provided here, make the required changes, and you are set. Conversely, you can export the configuration at any time to reuse it in a different environment later.
Of course, even more fun is actually looking at the individual components making up the configuration. The official documentation on configuring Workload Orchestrator provides more details. Here, we’ll focus on tasks which support the following basic flow of events.
Figure 1: Workload Execution - user flow
There's a lot going on in the picture above, so let's summarize:
Users require execution of their workloads (SAS programs).
They submit workloads through interfaces like SAS Studio (which contain an element of interactivity) or via batch jobs from the command line.
Every submission is flagged with a context, which indicates the broad set of parameters under which this job will run. The context could be either a Compute context or a batch context.
This context is wired to run in a SAS Workload Orchestrator queue. Queues are defined in Workload Orchestrator to govern, among other things, where and when the jobs may be executed.
The queue is configured to request that the job be run on a particular host type, where one is defined.
Host types are configured with host properties that specify the labels (from the table in Infrastructure, above) which identify candidate nodes to run the workload on.
The host types are also flagged as being enabled for autoscaling or not.
Given a request for a job to run on a host type, Workload Manager makes a request for an available node. If the node is available, the requested session (either sas-compute or sas-batch) is started on that node to execute the job.
If a node is not available, but the autoscaling flag is enabled on that host type, then Workload Manager works along with the Kubernetes cluster-autoscaler to signal a need for a node to execute the session on. The cluster-autoscaler responds to this signal by requesting a node from the cluster, which is then spun up to execute the job.
There are a number of conditions that determine whether a node is available, which are explained in detail in the following documentation link.
Experience
Let's now watch things in action! In the initial state, with everybody goofing off (er, let's just say it's the start of the day), here is the state of the system as represented by SAS Workload Orchestrator.
Figure 2: Dashboard of SAS Workload Orchestrator
Figure 2 is the dashboard view of Workload Orchestrator. On the left-hand side, there is information about the version, license expiration, build date and GUI build date. The top half lists the queue status. In our case, there are four queue status tiles representing what's configured: default, batch-large-queue, batch-medium-queue and batch-small-queue. From the queue status, you can see that all the queues are open and active, none are closed, and all have 0 jobs pending. The lower half shows the host status, with one server shown as Open and in OK status.
This can also be pictured as follows:
Figure 3: Initial State
Whoa, you may chortle in righteous indignation: how come there's a machine switched on if there's no work? Well, that's the warm node kept alive for interactive users. It's a small price to pay to have nodes available for users who may come back to their desks and start coding. Luckily, thanks to Workload Management, you can keep this lean by provisioning only a small machine (minimal CPU and memory) to satisfy this usage pattern and keep cloud costs low.
Let’s now take up a case when work actually starts to happen.
Figure 4: Screenshot of sas-viya CLI
Figure 4 demonstrates submitting a program using the command-line interface. The RunMe.sas program is submitted using the default (batch) context, which maps to the batch-small-queue, so one job is now pending in that queue. Since autoscaling is enabled, the job stays pending until the cluster autoscaler requests a node and that node becomes available and ready for use. As mentioned above, the cluster autoscaler receives the signal to scale from Workload Management, based on the configuration provided.
Figure 5: Updated Dashboard View
Figure 5 shows a change in state in the updated Dashboard where the program RunMe.sas is in a pending state in the batch-small-queue. The job will stay in the pending state until the new node is available.
Figure 6: Updated Dashboard View
Figure 6 shows a change in state in the updated Dashboard where the program RunMe.sas is now running in the batch-small-queue and a new node is up.
Figure 7: Calling README.sas through the Command Line Interface
Figure 7 shows the completed job, and Figure 8 displays the current state of the dashboard. The job has completed; however, the new compute node is still active and showing Open-OK, waiting for other jobs before it scales down. After a certain period (governed by configuration), if the node remains idle, it is picked up by Kubernetes for termination. This link describes the conditions that trigger a scale-down of nodes.
Figure 8: View of Workload Orchestrator after program run
Now it is time to really have fun: jobs are submitted to the batch-large-queue, batch-medium-queue and default queues in Figure 7.
Figure 9: View Queues tab
The Queues tab shows 1 job running in the batch-small-queue, 1 job pending in the batch-large-queue and 1 job pending in the batch-medium-queue. Why are they pending? Recall from Figure 8 that, since we had already run a job from the batch-small-queue, a node was up and waiting for more requests. Now we are waiting for nodes for the batch-medium and batch-large contexts to fire up and reach readiness, which shows as Open-OK against the host.
Figure 10: New WLO view
Figure 10 shows the dashboard view again. Focusing on the host status, we can see the interactive host waiting for interactive jobs, and the host associated with the batch-small-queue active since a job is running. Figure 11 shows another host that has scaled up to run the job from the batch-medium-queue. Figure 12 shows the batch-large-queue running a job, with the associated host Open and OK.
Figure 11: Dashboard View with batch-medium-host available
Figure 12: Dashboard View with batch-large-host available
The above (Figure 12) is the state when all available Compute node types (as detailed in the Infrastructure section) are utilized. As workloads increase based on business needs, the extent to which these node types are used will vary, highlighting the ability to differentiate resources according to the needs of the workload. Pictorially, Figure 12 can also be represented as follows:
Figure 13: Workload Management in a busy state
Figure 14: Hosts tab
Figure 14 displays the Hosts tab. It presents the same information as the dashboard, just in a different view.
With all the jobs complete, the hosts have scaled down and we are left with the interactive host waiting for interactive jobs (Figure 15).
Figure 15: WLO rests
In summary
As evidenced by SAS Viya’s move to cloud-based architecture, modernization of analytics platforms focusses on costs and higher efficiency. Workload Management, through its recent autoscaling capabilities and other elements, facilitates the following:
Reduced idle capacity
Differentiated & right-sized resources per workload
Automated decision making on resources, triggered by user activity
Reduced pending jobs and higher queue utilization
Centralized administrative activity and interfaces (less overlap between Kubernetes & SAS Viya administration control)
Drop us an email with any additional questions.
References
About Azure Virtual Machines: https://learn.microsoft.com/en-us/azure/virtual-machines/
SAS Ecosystem Diagnostics: https://communities.sas.com/t5/Ask-the-Expert/Why-Do-I-Need-SAS-Enterprise-Session-Monitor-and-Ecosystem/ta-p/854162
Documentation related to Workload Management and cluster autoscaler: https://go.documentation.sas.com/doc/en/sasadmincdc/default/wrkldmgmt/n1s5vpyfr4sq3zn1i1dp1aotpzka.htm
- Find more articles tagged with:
- DataOpsWeek - Environmental Management
- DataOpsWeek - Orchestration
- Grid
- SAS Workload Management
09-05-2023
09:09 AM
Thanks, @Wilbram-SAS , links on the article have been updated.
08-27-2023
01:42 PM
2 Likes
Beyond a certain point, programming language shouldn't matter.
Analytics developers appreciate unified platforms which accommodate different programming environments and languages, be they SAS, Python, R or any other. Access to multiple programming languages also requires that we consider seamless interoperability.
SAS Studio, an application within SAS Viya, offers powerful data engineering and analytics through low-code and programming components. SAS Studio already provides an easy interface to Python through a Python editor and the Proc Python procedure. However, at present, similarly easy access to R does not exist.
This article describes R Runner, a SAS Studio Custom Step, which helps you program in R from SAS Studio on SAS Viya.
Teams who code in both SAS and R can now develop more integrated analytics. This especially benefits certain industries where we've noticed a lot of interest in using SAS and R together, such as Pharma, Healthcare & Life Sciences, pockets of the insurance industry, and the public sector.
Watch this video for a quick description of what you can do with R Runner.
Access R Runner
SAS Studio Custom Steps are low-code components which abstract complex programming logic into an easily consumable package, used across programs and sessions in a repeatable manner.
Access R Runner from this folder on the SAS Studio Custom Steps GitHub repository. Also, here's a direct link to the README. Note and follow instructions within to import the custom step into a SAS Viya environment.
Use R Runner
R Runner offers a simple, no-nonsense capability: run R programs from a SAS Studio session. This can be done standalone, or as part of a larger process, typically designed as a SAS Studio Flow. Here are some simple building blocks to help illustrate.
Provide input data
Most analytics require input data. Attach input data to R Runner through an input table port. Note that if you are running the step within a flow, you may not see this port at first; right-click on the step and select "Add input port" for the "inputtable" port. If running the step standalone, select an input table as directed under "Provide an input table".
Here's what happens. Given an attached input table, upon execution, Proc Python converts the table to a pandas data frame using the SAS callback method under the covers. Then a Python package called rpy2 converts this pandas data frame to an R data frame, the tabular data structure that R processes. There's no need for the user to provide code for this conversion. Once the conversion is complete, user-submitted R code is taken up for execution, and the uploaded data frame is known as "r_input_table" inside the R environment.
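For the curious, here is a hedged sketch of what this handoff looks like if done by hand; the input table name is an assumption, and the exact rpy2 conversion API varies slightly by version:
/* Hedged sketch: SAS table to pandas data frame to R data frame named r_input_table */;
proc python;
submit;
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

df = SAS.sd2df("work.class_scores")              # SAS dataset to pandas data frame

with localconverter(ro.default_converter + pandas2ri.converter):
    r_df = ro.conversion.py2rpy(df)              # pandas data frame to R data frame

ro.globalenv["r_input_table"] = r_df             # visible to R code as r_input_table
endsubmit;
run;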
Run an R snippet
Users have two options for submitting code to the step for execution. Sometimes you may wish to run a short set of R commands, such as summarising a data frame or creating a frequency table. You can provide such short snippets inside the text area on the custom step. The text area is limited to a maximum of 32,768 characters; for longer programs, attach an R program file instead (described below).
Here's what happens. R commands submitted inside the text area are written to a temporary file which is then passed to the r object within rpy2.robjects. Refer here for an example in the rpy2 documentation that shows the code executed behind the scenes. The benefit to the user is that similar code is baked into this custom step, so R programs are passed to the R interpreter seamlessly using the rpy2 package.
In the interest of full transparency, note that the text area is a component of the SAS Studio Custom Steps framework and should not be considered an editor capable of interpreting R code. In short, do not expect features like syntax checking, automatic indentation or any of the other magic you may encounter in editors such as Visual Studio Code or RStudio.
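As an illustration of the underlying mechanism (the custom step itself routes your snippet through a temporary file first), a short snippet reaches the embedded R interpreter roughly like this:
/* Illustrative sketch: evaluate a short R snippet via rpy2 */;
proc python;
submit;
import rpy2.robjects as ro

r_code = """
str(r_input_table)
print(summary(r_input_table))
"""
ro.r(r_code)     # parse and evaluate the R code
endsubmit;
run;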
Refer an R program
Not all R programs are short enough to be submitted directly to the text area described above. For longer, more involved code that you wish to lift and shift to SAS Viya, simply attach your R program file to the step for execution. This makes it convenient to reuse an existing R codebase quickly and minimises the scope for changes.
Here's what happens. When you provide an R file reference, R Runner checks the location of the file and then references it directly within the r object of rpy2.robjects. This ensures direct access to the code as-is, without any intermediate processing.
A second advantage of referencing an R program: SAS Studio's Git integration lets users reference R programs located in a local folder linked to a Git repository. Upstream changes in the R codebase, once managed with Git, can seamlessly sync to the local folder, and your process (which may combine this custom step with other SAS Studio objects) automatically picks them up. Users are advised to reference R programs located in the filesystem (disk storage attached to the Viya environment) rather than files in SAS Content (the Infrastructure Data Server); the custom step has been built to work with filesystem content so that program artefacts can be easily transferred (and take advantage of Git).
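A minimal sketch of that direct reference, with an illustrative filesystem path:
/* Illustrative sketch: source an R program file from the filesystem */;
proc python;
submit;
import rpy2.robjects as ro
ro.r['source']('/path/to/my_program.R')   # path is illustrative
endsubmit;
run;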
Export output data
Finally, once your R program has finished, it's time to access any output data generated. The output data in R could be in the form of an R variable or an R data frame (unless the program explicitly writes results out to a file). Users might like to use this output data in downstream processes, which may involve other SAS programs (one of the advantages of running R under a unified platform). This custom step makes it easy to export output data frames from the R process to a SAS dataset for downstream processing. Users can specify the name of the desired output data frame and provide an output dataset to hold the resultant data.
Here's what happens. This part of the process is in some ways the reverse of ingesting input data. First, the specified R data frame is transferred to a pandas data frame using the rpy2 package. This pandas data frame is then converted to the desired SAS dataset using Proc Python and the SAS callback method. Once the data is output, users are free to interact with it in any way they like (using SAS or Python programs). Alternatively, if they wish to continue working on the R data frame, they can skip creating an output dataset and simply reference the same R global environment variable (i.e. the data frame name) in a subsequent R Runner step. The same R session is maintained as long as a valid Python session (with the rpy2 object) is in place.
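Here is a hedged sketch of that reverse handoff done by hand; the R data frame name r_results and the output table name are assumptions:
/* Hedged sketch: R data frame to pandas data frame to SAS dataset */;
proc python;
submit;
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

with localconverter(ro.default_converter + pandas2ri.converter):
    df = ro.conversion.rpy2py(ro.globalenv["r_results"])   # R data frame to pandas

SAS.df2sd(df, "work.r_results")                            # pandas to SAS dataset
endsubmit;
run;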
An eye to the future
We've released the first version of R Runner with a focus on delivering the essentials: ensuring R code can be executed, data can be sent to R, and data can be output from R, all in the service of a larger integrated process in SAS Studio. We are actively considering and developing some exciting future improvements, including an easy outlet for graphics (plotting images and charts), better redirection of output wherever feasible, and more transparent logging. We welcome your suggestions for enhancements and improvements. Please drop us an email by clicking here. Have fun with R Runner.
- Find more articles tagged with:
- DataOpsWeek - Environmental Management
- Open Source Integration
- r
- SAS and R
- SAS Studio Custom Steps