How accurate is GPT-4 at generating SWAT code to perform light data management tasks in SAS Viya? With eighteen sets of prompts we tested two custom agents: one using the “base” GPT-4 model and a second using a GPT-4 model grounded in documents highly relevant to SWAT code generation. Which one performed better? Are there significant advantages to a RAG approach? Read the post and watch the videos to find out.
Left: DALL-E generated image. Right: "Show me Morpheus" image. Source: imgur.
In the movie "The Matrix," when Neo exclaimed "I know Kung Fu," Morpheus responded with "Show me," requesting a demonstration of the claimed skill.
If a GPT-4 custom agent were to state "I know SWAT" (SAS Wrapper for Analytics Transfer) as a parallel to Neo's line, my response would follow Morpheus' lead, inviting the custom agent to demonstrate that knowledge in a relevant situation.
And here came the response to that hypothetical "Show me".
In our experiments, we compared side by side the SWAT code generation skills of two Azure OpenAI GPT-4 models, both version 1106-preview:

- GPT-4 "Base": the standard model, with no grounding documents.
- GPT-4 with RAG: the same model grounded in documents highly relevant to SWAT code generation.
This approach builds upon the methods we detailed in our previous post, SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain: Behind the Scenes.
To understand the experiment, you might want to watch the following short video:
After running eighteen distinct sets of prompts, we've compiled the outcomes of our experiments with two GPT-4 models: the standard 'Base' model and an enhanced version incorporating a Retrieval-Augmented Generation (RAG) technique. Here's how they performed:
| Results | GPT-4 "Base" | GPT-4 with RAG |
|---|---|---|
| Successful | 13 | 14 |
| Partial success (different results) | 2 | 2 |
| Unsuccessful | 3 | 2 |
| Total tasks | 18 | 18 |
Counting partial successes, the Base model handled 15 of 18 tasks and the RAG-enhanced model 16 of 18, a strong performance for both. This suggests that both GPT-4 models are quite adept at light data management tasks, with the RAG-enhanced model showing a slight edge.
Nevertheless, we must approach these figures with a discerning eye. In the age of Business Intelligence (BI), it was not uncommon for five different dashboards to present five distinct sales figures. Language models, including the latest LLMs like GPT-4, haven't entirely resolved this issue. It's crucial to remember that while language models can significantly aid in data management tasks, the reliability of their outputs must be thoroughly vetted, particularly when those outputs inform critical business decisions.
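The vetting step is worth making concrete. A minimal sketch, using hypothetical data and a hypothetical agent-reported figure: before an agent's row count feeds a business decision, recompute it directly against the data.

```python
# Hypothetical rows standing in for a CAS table pulled back to the client.
rows = [
    {"make": "Acura", "msrp": 43755},
    {"make": "Audi", "msrp": 31840},
    {"make": "BMW", "msrp": 41315},
]

model_reported_count = 2  # figure quoted by the agent (assumed for this sketch)

# Recompute the same filter independently before trusting the agent's number.
actual_count = sum(1 for r in rows if r["msrp"] > 40000)

assert actual_count == model_reported_count, (
    f"Trust issue: agent said {model_reported_count}, data says {actual_count}"
)
```

The same cross-check applies to tasks 9 and 12 below, where the two agents passed yet returned different results.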
We prompted a series of data management tasks to evaluate the two configurations of GPT-4, the 'Base' model and the 'GPT-4 with RAG' model. The tasks varied in complexity, from simple listings to multi-step joins and saves.
The models' performance should be viewed in light of their training data; the quality of their output is influenced by the data they've been exposed to during training. Here's how they fared:
| ID | Prompt | GPT-4 "Base" | GPT-4 with RAG | Conclusion |
|---|---|---|---|---|
| 1 | List files | Pass | Pass | Similar |
| 2 | List caslibs | Pass | Pass | Similar |
| 3 | List in-memory tables | Pass | Pass | Similar |
| 4 | Load a CSV from a URL into a promoted table | Pass | Pass | Similar |
| 5 | Confirm the table has been loaded | Pass | Pass | Similar |
| 6 | Column info | Pass | Pass (with an extra prompt) | Base model slightly ahead |
| 7 | Table summary statistics | Pass | Pass | Similar |
| 8 | Describe a table | Pass | Pass | Similar |
| 9 | Filter a table; provide row counts | Pass | Pass | Different results: trust issue |
| 10 | New calculated column | Pass | Pass | Similar |
| 11 | Top n | Pass | Pass | Similar |
| 12 | Group by + aggregate | Pass | Pass | Different results: trust issue |
| 13 | Rename a column; column info to confirm | Pass | Pass | Similar, but RAG unaware of its success |
| 14 | Unique count for values in a column | Pass | Pass | RAG has better intent understanding |
| 15 | Count the missing values in a table | Pass | Pass | RAG performs better; Base needs guidance |
| 16 | Create a new promoted (global) table with a few lines of data | Fail | Pass | RAG handles promotion well; Base fails to promote |
| 17 | Join a table with the newly created table; the model must figure out the join key | Fail | Fail | Challenging for both models |
| 18 | Filter an existing table, then save the result as a promoted table | Fail | Fail | Both models struggle with table saving |
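Tasks 16 and 18 hinge on table scope: a CAS table is session-local unless it is promoted to global scope. A minimal sketch of the SWAT pattern involved, with hypothetical table, caslib, host, and column names; the calls that need a live CAS server are shown as comments.

```python
# Target for task 16: promote=True asks CAS to create a global (promoted)
# table instead of a session-scoped one. All names here are hypothetical.
casout_new = dict(name="PRODUCTS", caslib="casuser", promote=True)

# Target for task 18: save the filtered result as a new promoted table.
casout_saved = dict(name="EXPENSIVE_CARS", caslib="casuser", promote=True)

# With a live CAS connection, the calls would look roughly like:
#   import swat
#   conn = swat.CAS("viya-host.example.com", 5570)   # hypothetical host/port
#   conn.upload_frame(df, casout=casout_new)         # task 16: new global table
#   cars = conn.CASTable("CARS", caslib="casuser")
#   cars.query("MSRP > 40000").partition(casout=casout_saved)  # task 18

print(casout_new["promote"], casout_saved["promote"])
```

Forgetting `promote=True` (or an equivalent promote step) is exactly the kind of omission that made the Base model fail task 16.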
This video presents a head-to-head challenge of ten rounds, where we prompt the model to tackle a series of data management tasks in SAS Viya. The tasks range from simpler ones, such as describing columns and summarizing tables, to more complex operations like creating calculated columns, sorting, and identifying top values. We also cover grouping with aggregations, renaming columns, performing unique counts, and saving tables after applying filters.
In most of the cases, both agents succeed at the given tasks, with the same results. Sometimes they both succeed but the results are different! Sometimes the RAG agent needs an extra “nudge” or further instructions. At other times, the RAG agent succeeds where the other failed or they both fail.
I won’t comment on the full 28 minutes of the video, although I added a few explanations. Enjoy watching or scrolling through!
Two custom agents were tested: one using the "base" GPT-4 model (1106-preview) and one using the same model grounded, via RAG, in documents highly relevant to SWAT code generation.
Overall, the results slightly favor the GPT-4 with RAG model, indicating a marginal edge in understanding and executing complex data management tasks.
Ultimately, the performance difference between the two models is relatively small. Considering the additional resources and time required to set up the RAG, one must weigh these against the need for precision.
For rapid outcomes where the highest accuracy is not critical, the 'Base' model is your go-to option: it provides quick results without the extra setup. The GPT-4 1106-preview model is a far cry from the earlier davinci-code-003 model I tested for SAS code generation.
However, if your priority is tailored accuracy and you're dealing with complex tasks where nuanced understanding is key, the 'GPT-4 with RAG' model is likely the better choice, despite the additional investment.
The study emphasizes the importance of verifying the output of language models, especially when informing critical business decisions.
I hope you found this article insightful. Please feel free to reach out with feedback or suggestions for enhancing the agent or taking its capabilities to the next level.
Thanks to Peter Styliadis for his great SWAT Series.
Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about the approach. If you wish to get more information, please write me an email.
Find more articles from SAS Global Enablement and Learning here.