How Accurate is GPT-4 at SAS Viya Data Management Tasks?
In short: a comparison of GPT-4 models on SWAT code generation for SAS Viya, across eighteen sets of prompts. How accurate is GPT-4 at generating SWAT code to perform light data management tasks in SAS Viya? We tested two custom agents: one using the "base" GPT-4 model and a second using a GPT-4 model grounded in documents highly relevant to SWAT code generation. Which one performed better? Does a RAG approach offer any significant advantages? Read the post and watch the videos to find out.
Left: DALL-E generated image. Right: "Show me Morpheus" image. Source: imgur.
In the movie "The Matrix," when Neo exclaims "I know Kung Fu," Morpheus responds with "Show me," requesting a demonstration of the claimed skill.
If a GPT-4 custom agent were to state "I know SWAT" (SAS Scripting Wrapper for Analytics Transfer) as a parallel to Neo's line, my response would follow Morpheus' lead, inviting the custom agent to demonstrate that knowledge in a relevant situation.
And here is the response to that hypothetical "Show me."
GPT-4 Base vs GPT-4 with RAG
In our experiments, we compared side by side the SWAT code generation skills of two Azure OpenAI GPT-4 models, version 1106-preview:
- The 'Base' model refers to the standard deployment of GPT-4.
- The 'GPT-4 with RAG' variant was enhanced with a Retrieval-Augmented Generation (RAG) process informed by a collection of nineteen documents: posts from Peter Styliadis' Getting Started with Python Integration to SAS® Viya® series, which we downloaded and converted to Word files to serve as the model's knowledge base.
This approach builds upon the methods we detailed in our previous post, SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain: Behind the Scenes.
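For readers who want a feel for the grounding step, here is a minimal sketch of how such a knowledge base could be assembled with LangChain and Azure OpenAI. It assumes a recent LangChain installation and Azure OpenAI credentials in environment variables; the folder name, deployment name, and chunking parameters are illustrative placeholders, not the exact configuration we used:

```python
from pathlib import Path

from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the Word files converted from the SWAT blog series
# ("swat_docs" is a placeholder folder name).
docs = []
for path in Path("swat_docs").glob("*.docx"):
    docs.extend(Docx2txtLoader(str(path)).load())

# Split into overlapping chunks so retrieval returns focused passages.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and index them in a local FAISS vector store.
# Assumes AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY are set;
# the deployment name is a placeholder.
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-ada-002")
vector_store = FAISS.from_documents(chunks, embeddings)

# At question time, the top-k chunks are retrieved and prepended
# to the prompt sent to the GPT-4 deployment.
retriever = vector_store.as_retriever(search_kwargs={"k": 4})
```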
To understand the experiment, you might want to watch the following short video:
Summary of Results for Data Management Tasks Using GPT-4
After running eighteen distinct sets of prompts, we've compiled the outcomes of our experiments with two GPT-4 models: the standard 'Base' model and an enhanced version incorporating a Retrieval-Augmented Generation (RAG) technique. Here's how they performed:
| Results | GPT-4 "Base" | GPT-4 with RAG |
|---|---|---|
| Successful | 13 | 14 |
| Partial success (different results) | 2 | 2 |
| Unsuccessful | 3 | 2 |
| Total tasks | 18 | 18 |
Counting partial successes, that is 15 out of 18 tasks for the Base model and 16 out of 18 for the RAG model, a strong performance. It suggests that both GPT-4 models are quite adept at handling light data management tasks, with the RAG-enhanced model showing a slight edge.
Nevertheless, we must approach these figures with a discerning eye. In the age of Business Intelligence (BI), it was not uncommon for five different dashboards to present five distinct sales figures. Language models, including the latest LLMs like GPT-4, haven't entirely resolved this issue. It's crucial to remember that while language models can significantly aid in data management tasks, the reliability of their outputs must be thoroughly vetted, particularly when those outputs inform critical business decisions.
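One inexpensive safeguard is to re-derive a generated figure through a second, independent route before trusting it. As a minimal vetting sketch with SWAT, assuming an open CAS connection `conn` and a loaded table (the `cars` table and the filter below are illustrative placeholders):

```python
# Assumes an open CAS connection `conn` and a loaded in-memory table.
tbl = conn.CASTable('cars', caslib='casuser')

# Suppose the model generated this filter and reported a row count.
filtered = tbl[tbl['MSRP'] > 40000]

# Re-derive the count through two routes and compare them.
print(len(filtered))           # client-side, pandas-style length
print(filtered.recordcount())  # server-side table.recordCount action
# If the two counts disagree, inspect the generated code before trusting it.
```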
Detailed Results for Data Management Tasks Using GPT-4
We prompted the models with a series of data management tasks to evaluate the two configurations of GPT-4, the 'Base' model and the enhanced 'GPT-4 with RAG' model. The tasks varied in complexity (a SWAT sketch of a few of them follows this list):
- Light Tasks: Listing caslibs, files, and tables.
- Medium Tasks: Generating table summaries, filters, top n results, group by operations, aggregations, and calculated columns.
- Heavy Tasks: Creating and saving tables, determining join columns for table joins, and promoting tables, some of the most challenging operations for the models.
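For context, here is roughly what correct SWAT code looks like for a few of the light and medium tasks. A minimal sketch, with the host, credentials, and table names as placeholders:

```python
import swat

# Connect to the CAS server (host, port, and credentials are placeholders).
conn = swat.CAS('cas-host.example.com', 5570, 'username', 'password')

# Light tasks: list caslibs, files, and in-memory tables.
conn.caslibinfo()                     # table.caslibInfo action
conn.fileinfo(caslib='casuser')       # table.fileInfo action
conn.tableinfo(caslib='casuser')      # table.tableInfo action

# Medium tasks: summaries, filters, top-n on a loaded table.
tbl = conn.CASTable('cars', caslib='casuser')  # illustrative table name
tbl.summary()                                  # simple.summary statistics
tbl[tbl['Origin'] == 'Europe'].head()          # pandas-style filter
tbl.nlargest(5, 'MSRP')                        # top-n rows by a column

conn.terminate()
```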
The models' performance should be viewed in light of their training data: output quality depends on what they were exposed to during training. Here's how they fared:
| ID | Prompt | GPT-4 "Base" | GPT-4 with RAG | Conclusion |
|---|---|---|---|---|
| 1 | List files | Pass | Pass | Similar |
| 2 | List caslibs | Pass | Pass | Similar |
| 3 | List in-memory tables | Pass | Pass | Similar |
| 4 | Load a CSV from a URL into a promoted table | Pass | Pass | Similar |
| 5 | Confirm the table has been loaded | Pass | Pass | Similar |
| 6 | Column info | Pass | Pass (with an extra prompt) | Base model slightly ahead |
| 7 | Table summary statistics | Pass | Pass | Similar |
| 8 | Describe a table | Pass | Pass | Similar |
| 9 | Filter a table; provide row counts | Pass | Pass | Different results; trust issue |
| 10 | New calculated column | Pass | Pass | Similar |
| 11 | Top n | Pass | Pass | Similar |
| 12 | Group by + aggregate | Pass | Pass | Different results; trust issue |
| 13 | Rename a column; column info to confirm | Pass | Pass | Similar; RAG unaware of its success |
| 14 | Unique count for values in a column | Pass | Pass | RAG has better intent understanding |
| 15 | Count the missing values in a table | Pass | Pass | RAG performs better; Base needs guidance |
| 16 | Create a new promoted (global) table with a few lines of data | Fail | Pass | RAG handles promotion well; Base fails to promote |
| 17 | Join a table with the newly created table; the model must figure out the join key | Fail | Fail | Challenging for both models |
| 18 | Save a table and promote it: filter an existing table, save as a promoted table | Fail | Fail | Both models struggle with table saving |
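To give a sense of why tasks 16 and 18 trip the models up, here is one way the creation, promotion, and saving steps can be written in SWAT. This is a sketch under the assumption of an open connection `conn`; table and file names are placeholders, and other patterns (for example, the `table.promote` action) would also work:

```python
import pandas as pd

# Task 16: create a small table and promote it to global scope.
# promote=True in casout makes the uploaded table visible to all sessions.
df = pd.DataFrame({'product': ['A', 'B', 'C'], 'sales': [120, 80, 200]})
new_tbl = conn.upload_frame(
    df, casout=dict(name='products', caslib='casuser', promote=True))

# Task 18: filter an existing table, materialize the result as a
# promoted table, then save it to disk as a .sashdat file.
cars = conn.CASTable('cars', caslib='casuser')
filtered = cars[cars['MSRP'] > 40000]
filtered.partition(casout=dict(name='expensive_cars',
                               caslib='casuser', promote=True))

result = conn.CASTable('expensive_cars', caslib='casuser')
result.save(name='expensive_cars.sashdat', caslib='casuser')
```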
Ten Rounds of Prompts
This video presents a head-to-head challenge of ten rounds in which we prompt the models to tackle a series of data management tasks in SAS Viya. The tasks range from simpler ones, such as describing columns and summarizing tables, to more complex operations like creating calculated columns, sorting, and identifying top values. We also cover grouping with aggregations, renaming columns, performing unique counts, and saving tables after applying filters.
In most cases, both agents succeed at the given tasks with the same results. Sometimes they both succeed but produce different results! Sometimes the RAG agent needs an extra "nudge" or further instructions. At other times, the RAG agent succeeds where the other fails, or they both fail.
I won't comment on the full 28 minutes of the video, although I have added a few explanations. Enjoy watching or scrolling through!
Instead of Conclusions
Two custom agents were tested:
- GPT-4 "Base" excels in 2 tasks: column info and rename a column followed by column info.
- GPT-4 with RAG shows superiority in 3 tasks: unique counts, missing values, promoting tables.
Overall, the results slightly favor the GPT-4 with RAG model, indicating a marginal edge in understanding and executing complex data management tasks.
Ultimately, the performance difference between the two models is relatively small. Considering the additional resources and time required to set up the RAG, one must weigh these against the need for precision.
For rapid outcomes where the highest accuracy is not critical, the 'Base' model is your go-to option: it provides quick results without the extra setup. The GPT-4 1106-preview model is a far cry from the earlier text-davinci-003 model I tested for SAS code generation.
However, if your priority is tailored accuracy and you're dealing with complex tasks where nuanced understanding is key, the 'GPT-4 with RAG' model is likely the better choice, despite the additional investment.
The study emphasizes the importance of verifying the output of language models, especially when informing critical business decisions.
I hope you found this article insightful. Please feel free to reach out with feedback or suggestions for enhancing the agent or taking its capabilities to the next level.
Acknowledgements
Thanks to Peter Styliadis for his great SWAT Series.
Additional Resources
- How Retrieval Augmented Generation (RAG) Works?
- SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain: Behind the Scenes.
- SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain.
- SASPY: Submit workloads to SAS Viya from Python.
- LangChain Custom Agent.
- GPT-4 Assisted Data Management in SAS Viya: A Custom LangChain Agent Approach.
- How to Create Your Custom LangChain Agent for SAS Viya.
- Conversing with Data: Turning Queries into Conversations with SAS Viya, Azure OpenAI and LangChain.
- Exploring LangChain and Azure OpenAI’s Ability to Write SQL and Join Tables To Answer Questions.
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about the approach. If you wish to get more information, please write me an email.
Find more articles from SAS Global Enablement and Learning here.