How accurate is GPT-4 at generating SWAT code to perform light data management tasks in SAS Viya? With eighteen sets of prompts we tested two custom agents: one using the “base” GPT-4 model and a second using a GPT-4 model grounded in documents highly relevant to SWAT code generation. Which one performed better? Are there significant advantages to a RAG approach? Read the post and watch the videos to find out.
Left: DALL-E generated image. Right: "Show me Morpheus" image. Source: imgur.
In the movie "The Matrix," when Neo exclaimed "I know Kung Fu," Morpheus responded with "Show me," requesting a demonstration of the claimed skill.
If a GPT-4 custom agent were to state "I know SWAT" (SAS Wrapper for Analytics Transfer) as a parallel to Neo's line, my response would follow Morpheus' lead, inviting the custom agent to demonstrate that knowledge in a relevant situation.
And here came the response to that hypothetical "Show me".
In our experiments, we compared side by side the SWAT code generation skills of two Azure OpenAI GPT-4 models, both version 1106-preview:

- GPT-4 "Base": the standard model, with no grounding documents.
- GPT-4 with RAG: the same model grounded in documents highly relevant to SWAT code generation.
This approach builds upon the methods we detailed in our previous post, SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain: Behind the Scenes.
To understand the experiment, you might want to watch the following short video:
After running eighteen distinct sets of prompts, we've compiled the outcomes of our experiments with two GPT-4 models: the standard 'Base' model and an enhanced version incorporating a Retrieval-Augmented Generation (RAG) technique. Here's how they performed:
| Results | GPT-4 "Base" | GPT-4 with RAG |
|---|---|---|
| Successful | 13 | 14 |
| Partial success (different results) | 2 | 2 |
| Unsuccessful | 3 | 2 |
| Total tasks | 18 | 18 |
Counting partial successes, the Base model handled 15 of 18 tasks and the RAG-enhanced model 16 of 18, a strong performance for both. This suggests that both GPT-4 models are quite adept at light data management tasks, with the RAG-enhanced model showing a slight edge.
Nevertheless, we must approach these figures with a discerning eye. In the age of Business Intelligence (BI), it was not uncommon for five different dashboards to present five distinct sales figures. Language models, including the latest LLMs like GPT-4, haven't entirely resolved this issue. It's crucial to remember that while language models can significantly aid in data management tasks, the reliability of their outputs must be thoroughly vetted, particularly when those outputs inform critical business decisions.
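The vetting step is worth making concrete. A minimal sketch, using hypothetical data and a hypothetical agent-reported figure: before an agent's row count feeds a business decision, recompute it directly against the data.

```python
# Hypothetical rows standing in for a CAS table pulled back to the client.
rows = [
    {"make": "Acura", "msrp": 43755},
    {"make": "Audi", "msrp": 31840},
    {"make": "BMW", "msrp": 41315},
]

model_reported_count = 2  # figure quoted by the agent (assumed for this sketch)

# Recompute the same filter independently before trusting the agent's number.
actual_count = sum(1 for r in rows if r["msrp"] > 40000)

assert actual_count == model_reported_count, (
    f"Trust issue: agent said {model_reported_count}, data says {actual_count}"
)
```

The same cross-check applies to tasks 9 and 12 below, where the two agents passed yet returned different results.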
We prompted a series of data management tasks to evaluate the two configurations of GPT-4, the 'Base' model and the 'GPT-4 with RAG' model. The tasks varied in complexity, from simple listings to multi-step joins and saves.
The models' performance should be viewed in light of their training data; the quality of their output is influenced by the data they've been exposed to during training. Here's how they fared:
| ID | Prompt | GPT-4 "Base" | GPT-4 with RAG | Conclusion |
|---|---|---|---|---|
| 1 | List files | Pass | Pass | Similar |
| 2 | List caslibs | Pass | Pass | Similar |
| 3 | List in-memory tables | Pass | Pass | Similar |
| 4 | Load a CSV from a URL into a promoted table | Pass | Pass | Similar |
| 5 | Confirm the table has been loaded | Pass | Pass | Similar |
| 6 | Column info | Pass | Pass (with an extra prompt) | Base model slightly ahead |
| 7 | Table summary statistics | Pass | Pass | Similar |
| 8 | Describe a table | Pass | Pass | Similar |
| 9 | Filter a table; provide row counts | Pass | Pass | Different results: trust issue |
| 10 | New calculated column | Pass | Pass | Similar |
| 11 | Top n | Pass | Pass | Similar |
| 12 | Group by + aggregate | Pass | Pass | Different results: trust issue |
| 13 | Rename a column; column info to confirm | Pass | Pass | Similar, but RAG unaware of its success |
| 14 | Unique count for values in a column | Pass | Pass | RAG has better intent understanding |
| 15 | Count the missing values in a table | Pass | Pass | RAG performs better; Base needs guidance |
| 16 | Create a new promoted (global) table with a few lines of data | Fail | Pass | RAG handles promotion well; Base fails to promote |
| 17 | Join a table with the newly created table; the model must figure out the join key | Fail | Fail | Challenging for both models |
| 18 | Filter an existing table, then save the result as a promoted table | Fail | Fail | Both models struggle with table saving |
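Tasks 16 and 18 hinge on table scope: a CAS table is session-local unless it is promoted to global scope. A minimal sketch of the SWAT pattern involved, with hypothetical table, caslib, host, and column names; the calls that need a live CAS server are shown as comments.

```python
# Target for task 16: promote=True asks CAS to create a global (promoted)
# table instead of a session-scoped one. All names here are hypothetical.
casout_new = dict(name="PRODUCTS", caslib="casuser", promote=True)

# Target for task 18: save the filtered result as a new promoted table.
casout_saved = dict(name="EXPENSIVE_CARS", caslib="casuser", promote=True)

# With a live CAS connection, the calls would look roughly like:
#   import swat
#   conn = swat.CAS("viya-host.example.com", 5570)   # hypothetical host/port
#   conn.upload_frame(df, casout=casout_new)         # task 16: new global table
#   cars = conn.CASTable("CARS", caslib="casuser")
#   cars.query("MSRP > 40000").partition(casout=casout_saved)  # task 18

print(casout_new["promote"], casout_saved["promote"])
```

Forgetting `promote=True` (or an equivalent promote step) is exactly the kind of omission that made the Base model fail task 16.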
This video presents a head-to-head challenge of ten rounds, where we prompt the model to tackle a series of data management tasks in SAS Viya. The tasks range from simpler ones, such as describing columns and summarizing tables, to more complex operations like creating calculated columns, sorting, and identifying top values. We also cover grouping with aggregations, renaming columns, performing unique counts, and saving tables after applying filters.
In most of the cases, both agents succeed at the given tasks, with the same results. Sometimes they both succeed but the results are different! Sometimes the RAG agent needs an extra “nudge” or further instructions. At other times, the RAG agent succeeds where the other failed or they both fail.
I won’t comment on the full 28 minutes of the video, although I added a few explanations. Enjoy watching or scrolling through!
Two custom agents were tested: one using the "base" GPT-4 model (1106-preview) and one using the same model grounded, via RAG, in documents highly relevant to SWAT code generation.
Overall, the results slightly favor the GPT-4 with RAG model, indicating a marginal edge in understanding and executing complex data management tasks.
Ultimately, the performance difference between the two models is relatively small. Considering the additional resources and time required to set up the RAG, one must weigh these against the need for precision.
For rapid outcomes where the highest accuracy is not critical, the 'Base' model is your go-to option: it provides quick results without the extra setup. The GPT-4 1106-preview model is a far cry from the earlier davinci-code-003 model I tested for SAS code generation.
However, if your priority is tailored accuracy and you're dealing with complex tasks where nuanced understanding is key, the 'GPT-4 with RAG' model is likely the better choice, despite the additional investment.
The study emphasizes the importance of verifying the output of language models, especially when informing critical business decisions.
I hope you found this article insightful. Please feel free to reach out with feedback or suggestions for enhancing the agent or taking its capabilities to the next level.
Thanks to Peter Styliadis for his great SWAT Series.
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about the approach. If you wish to get more information, please write me an email.
Find more articles from SAS Global Enablement and Learning here.