Develop corpus of SAS coding data to train LLM

AndrewZ · ‎01-24-2025

Increasingly, software developers and data scientists rely on LLMs to help with coding, but LLMs are poor at SAS coding. Hallucinations are common, and the LLM-generated SAS code often does not run without major changes. This may lead to poor outcomes, such as SAS coders turning to alternatives like Python, which has top-tier support in LLMs, whether used in the traditional dialog format or embedded in coding assistants such as GitHub Copilot and Windsurf.

The academic paper "The Llama 3 Herd of Models" in section 4.3.1 lists Meta's 10 top-tier languages (notably, not including SAS), and the paper details how Meta improved the ability of LLMs to generate better code. One potential solution to LLMs' struggle with SAS coding is for SAS to emulate Meta's approach by developing a corpus of SAS-specific training data that all LLMs can freely use. Then, SAS could publish this data set on Hugging Face and promote it to Meta, OpenAI, Google, and Anthropic.

The Llama paper gives a template for this process. In the case of SAS, coding questions and solutions could be automatically collected from resources such as the SAS documentation (example code), this SAS forum, Stack Overflow, SAS support cases, and SAS blogs (to the extent permissible by copyrights, licenses, and ToS). Some strategies in the Llama paper: remove PII, automatically evaluate quality with LLMs, automatically write unit tests, and automatically test solutions in sandbox environments.
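To make the collection step a bit more concrete, here is a minimal Python sketch of the first pass over scraped question/solution pairs: scrub obvious PII, drop solutions that do not look like SAS, and remove near-duplicates. It is only an illustration under my own assumptions, not the method from the Llama paper. The record fields (`question`, `solution`, `source`), the regex patterns, and the function names are all hypothetical, and the later steps (LLM evaluation, unit tests, sandbox runs) are not covered here because they would need a real SAS runtime.

```python
import hashlib
import json
import re

# Hypothetical patterns: a real pipeline would use a more thorough PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Rough heuristic (an assumption): SAS code opens a DATA or PROC step
# and closes it with RUN; or QUIT;.
SAS_STEP_RE = re.compile(r"^\s*(data|proc)\s+\w+", re.IGNORECASE | re.MULTILINE)
SAS_END_RE = re.compile(r"\b(run|quit)\s*;", re.IGNORECASE)


def scrub_pii(text):
    """Replace e-mail addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


def looks_like_sas(code):
    """Cheap filter that only checks for a step opener and a step terminator."""
    return bool(SAS_STEP_RE.search(code)) and bool(SAS_END_RE.search(code))


def build_corpus(records):
    """Scrub, filter, and de-duplicate question/solution records."""
    seen = set()
    corpus = []
    for rec in records:
        question = scrub_pii(rec["question"].strip())
        solution = scrub_pii(rec["solution"].strip())
        if not looks_like_sas(solution):
            continue
        # Crude duplicate key: collapse whitespace and ignore case.
        normalized = " ".join(solution.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        corpus.append({
            "question": question,
            "solution": solution,
            "source": rec.get("source", "unknown"),
        })
    return corpus


def to_jsonl(corpus):
    """Serialize as JSON Lines, a common format for shared training data."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in corpus)
```

Calling `to_jsonl(build_corpus(records))` on a list of scraped records would give one JSON object per line. Lowercasing for the duplicate key is deliberately crude, since it also folds case inside SAS string literals; a real pipeline would want something smarter.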

An intriguing strategy would be to develop a list of Python data science and business intelligence questions, and then translate the solutions to SAS. This assumes that coders in each language are facing similar questions, but the Python questions are more abundant on the open Internet.

quickbluefish · ‎01-24-2025

This would be great, though honestly this is just another case of SAS as a company shooting itself in the foot. I say this as a SAS user of nearly 20 years. As long as SAS insists on not open-sourcing the language (while still of course maintaining a for-profit model for its products, EG, Viya, Grid, etc. and conferences) and stipulating fairly draconian rules for use of SAS OnDemand (cannot even be used for training purposes the last I checked) which, as a consequence, severely limit the amount of content on places like YouTube and the size of the self-taught user community, the amount of code out there in the wild on which to train LLMs will be very limited in comparison to open source alternatives like R and Python. Just my $0.02. Not trying to start fights. I use SAS basically every day, all day. Open-sourcing is probably a bridge too far and a complete non-starter for a for-profit company, but, e.g., relaxing the terms for use of SAS OnDemand would be a major step to broadening the user base and the amount of content out there.

AndrewZ · ‎01-24-2025

@quickbluefish

Yes 💯! The consequences affect this issue in several ways. First, there are fewer SAS users overall, reducing organic SAS-related content on the open internet, so there is less training data. Second, because of a lack of popularity, SAS does not register as a priority, even as a second-tier language like PHP, in LLM development. Third, even if LLM developers wanted to better support SAS, they couldn't run SAS to follow the data synthesis process outlined in the Llama paper.

I've been using SAS daily for about 15 years.

jleirer · ‎01-30-2025

@AndrewZ ,

I appreciate your suggestion and your commitment to SAS and the SAS Language. We also noticed that the quality of SAS code generated by foundation LLMs is not up to our or our users' standards. So we've embarked on a process to create a SAS Viya Copilot that can, among other features, generate SAS code. We are currently in the process of further testing and fine-tuning our models, which are available in a private preview.

You can learn more about it here:
https://blogs.sas.com/content/sascom/2024/04/17/3-ways-you-can-use-a-copilot-in-sas/

If you'd like to be a part of the private preview, reach out to your sales representative and they can get in touch with the right folks here at SAS to see if you and your organization would be a good fit for the private preview.

Thanks.

jleirer · ‎01-30-2025

We launched the private preview for this at Innovate 2024. Keep a look out for more announcements and recent developments at Innovate 2025.

AndrewZ · ‎01-30-2025

@jleirer

A proprietary LLM is not a great solution. The existing LLMs are smarter in general, moving at a faster pace, and integrated into a variety of tools that developers already like.

We don't use SAS Viya anymore, and we're not getting any more SAS licenses. Our team has been using SAS since the 1990s, but next year, IT plans to not renew the SAS contract because of high costs. I'd like to keep the SAS licenses, but I'm not sure that's an option.

Also, our IT department makes it almost impossible to get new AI tools approved, so our SAS license will have expired before the SAS copilot software is approved.

AndrewZ · ‎01-30-2025

Status changed to: Suggestion Implemented

It's not implemented until the training data is shared with non-SAS LLMs like those by Meta, OpenAI, and Anthropic. I suggest posting the data to Hugging Face.

ballardw · ‎01-30-2025

Please at least once describe or expand on any given TLAs (three-letter acronyms), so I have a chance of knowing if I should care what an LLM may be at all.

jleirer · ‎01-30-2025

@ballardw ,

An LLM is a Large Language Model. LLMs are machine learning models primarily built on the transformer architecture for natural language processing (NLP) tasks. They have been a key driver in the Generative AI boom. One popular task that LLMs have seen early success in is coding. To learn more about how SAS is working with LLMs, I recommend the following resources:
https://www.sas.com/en_us/solutions/ai/generative-ai.html
https://blogs.sas.com/content/subconsciousmusings/2024/04/05/llm-prompts-with-sas/
https://communities.sas.com/t5/SAS-Communities-Library/Generative-AI-and-Large-Language-Models-Demys...

Thanks.

AndrewZ · ‎01-30-2025

@ballardw

Common examples of LLMs are ChatGPT (GPT=Generative pre-trained transformer), Anthropic Claude, Google Gemini, and Meta Llama. The name LLaMa itself is a sort of pun on LLM. People often use LLMs to generate code, autocomplete when coding, comment code, and answer questions about code.

Why does it matter? Many coders have their favorite coding LLMs and favorite coding tools (e.g., Visual Studio), and an important part of my suggestion is to let coders continue to use their favorite LLMs and favorite coding tools for SAS. They may not want to be locked in to SAS Viya and SAS's LLM. Widespread support for SAS across LLMs would help SAS continue to thrive (e.g., increasing the productivity of seasoned SAS coders, making it easier to onboard new staff to SAS), so at a deeper level, my goal is for SAS to not decline as a language or as a company. It's a complex issue, but it's a layer of protection against users turning to alternatives. It's relevant because LLM-driven coding is at the beginning of a boom, while SAS is on the sidelines.

An important TLA here is SAS: Semicolon, Always Semicolon.... j/k 🤣