
How Accurate is GPT-4 at SAS Viya Data Management Tasks?


How accurate is GPT-4 at generating SWAT code to perform light data management tasks in SAS Viya? Using eighteen sets of prompts, we tested two custom agents: one built on the "base" GPT-4 model, and a second built on a GPT-4 model grounded in documents highly relevant to SWAT code generation. Which one performed better? Does a RAG approach offer significant advantages? Read the post and watch the videos to find out.

 

Left: DALL-E generated image. Right: "Show me" Morpheus image. Source: imgur.

 

In the movie "The Matrix," when Neo exclaims "I know Kung Fu," Morpheus responds with "Show me": a request for a demonstration of the claimed skill.

 

If a GPT-4 custom agent were to state "I know SWAT" (SAS Scripting Wrapper for Analytics Transfer) as a parallel to Neo's line, my response would follow Morpheus' lead and invite the custom agent to demonstrate that knowledge in a relevant situation.

 

Here is the response to that hypothetical "Show me."

 

GPT-4 Base vs GPT-4 with RAG

 

In our experiments, we compared side by side the SWAT code generation skills of two Azure OpenAI GPT-4 models, version 1106-preview:

 

  • The 'Base' model refers to the standard deployment of GPT-4.
  • The 'GPT-4 with RAG' variant was enhanced with a Retrieval-Augmented Generation (RAG) process informed by a collection of nineteen documents: posts by Peter Styliadis from his Getting Started with Python Integration to SAS® Viya® series, which we downloaded and converted to Word files to serve as a knowledge base for the model.

 

This approach builds upon the methods we detailed in our previous post, SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain: Behind the Scenes.
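
As a rough illustration of what such a pipeline can look like, here is a minimal sketch of a document-grounded agent built with LangChain and Azure OpenAI. This is not the exact setup from that post: the folder name, deployment names, and chunking parameters are assumptions you would adapt to your environment, and the sketch assumes the AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, and OPENAI_API_VERSION environment variables are set.

```python
# Minimal RAG sketch: index Word documents and answer SWAT questions
# with an Azure OpenAI GPT-4 deployment. Names and paths are illustrative.
from pathlib import Path

from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# Load the knowledge base: the Word files converted from the blog posts.
docs = []
for path in Path("kb_docs").glob("*.docx"):  # hypothetical folder
    docs.extend(Docx2txtLoader(str(path)).load())

# Split the documents into chunks and build a vector index over them.
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=200).split_documents(docs)
index = FAISS.from_documents(
    chunks, AzureOpenAIEmbeddings(azure_deployment="text-embedding-ada-002"))

# Wire the retriever to a GPT-4 (1106-preview) deployment.
llm = AzureChatOpenAI(azure_deployment="gpt-4-1106-preview", temperature=0)
qa = RetrievalQA.from_chain_type(llm=llm,
                                 retriever=index.as_retriever(search_kwargs={"k": 4}))

print(qa.invoke("Generate SWAT code to list all caslibs.")["result"])
```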

 

To understand the experiment, you might want to watch the following short video:

 

 

Summary of Results for Data Management Tasks Using GPT-4

 

After running eighteen distinct sets of prompts, we've compiled the outcomes of our experiments with two GPT-4 models: the standard 'Base' model and an enhanced version incorporating a Retrieval-Augmented Generation (RAG) technique. Here's how they performed:

 

Results                             | GPT-4 "Base" | GPT-4 with RAG
Successful                          | 13           | 14
Partial Success (Different Results) | 2            | 2
Unsuccessful                        | 3            | 2
Total Tasks                         | 18           | 18

 

Counting partial successes, scores of 15 out of 18 (Base) and 16 out of 18 (RAG) represent a strong performance. Both GPT-4 models are quite adept at handling light data management tasks, with the RAG-enhanced model showing a slight edge.

 

Nevertheless, we must approach these figures with a discerning eye. In the age of Business Intelligence (BI), it was not uncommon for five different dashboards to present five distinct sales figures. Language models, including the latest LLMs like GPT-4, haven't entirely resolved this issue. It's crucial to remember that while language models can significantly aid in data management tasks, the reliability of their outputs must be thoroughly vetted, particularly when those outputs inform critical business decisions.  

 

Detailed Results for Data Management Tasks Using GPT-4

 

We prompted two configurations of GPT-4, the 'Base' model and the enhanced 'GPT-4 with RAG' model, with a series of data management tasks to evaluate their capabilities. The tasks varied in complexity (a hand-coded SWAT sketch for a few of them follows the list):

 

  • Light Tasks: Listing caslibs, files, and tables.
  • Medium Tasks: Generating table summaries, filters, top n results, group by operations, aggregations, and calculated columns.
  • Heavy Tasks: Creating and saving tables, determining join columns for table joins, and promoting tables—some of the most challenging tasks for the model.
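
To make these tiers concrete, here is how a few of the tasks look when coded by hand with SWAT. This is a hedged sketch: the host, credentials, and the cars table are illustrative assumptions, not the exact prompts or data from the experiment.

```python
import swat

# Connect to the CAS server (host, port, and credentials are illustrative).
conn = swat.CAS("cas-server.example.com", 5570, "username", "password")

# Light tasks: list caslibs, files, and in-memory tables.
conn.table.caslibInfo()
conn.table.fileInfo(caslib="casuser")
conn.table.tableInfo(caslib="casuser")

# Medium tasks: summary statistics, a filter with row counts,
# and a group-by aggregation, using the pandas-like CASTable API.
tbl = conn.CASTable("cars", caslib="casuser")   # assumes the table is loaded
tbl.summary()
filtered = tbl[tbl["MSRP"] > 40000]             # server-side filter
print(filtered.shape[0])                        # row count after filtering
tbl.groupby("Origin")["MSRP"].mean()

# Heavy task: promote a session table to global scope
# so that other sessions can see it.
conn.table.promote(name="cars", caslib="casuser")
```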

 

The models' performance should be viewed in light of their training data: the quality of their output reflects the data they were exposed to during training. Here's how they fared:

 

ID | Prompt                                                                            | GPT-4 "Base" | GPT-4 with RAG               | Conclusion
1  | List files                                                                        | Pass         | Pass                         | Similar
2  | List caslibs                                                                      | Pass         | Pass                         | Similar
3  | List in-memory tables                                                             | Pass         | Pass                         | Similar
4  | Load a CSV from a URL into a promoted table                                       | Pass         | Pass                         | Similar
5  | Confirm the table has been loaded                                                 | Pass         | Pass                         | Similar
6  | Column info                                                                       | Pass         | Pass (with an extra prompt)  | Base model slightly ahead
7  | Table summary statistics                                                          | Pass         | Pass                         | Similar
8  | Describe a table                                                                  | Pass         | Pass                         | Similar
9  | Filter a table; provide row counts                                                | Pass         | Pass                         | Different results; trust issue
10 | New calculated column                                                             | Pass         | Pass                         | Similar
11 | Top n                                                                             | Pass         | Pass                         | Similar
12 | Group by + aggregate                                                              | Pass         | Pass                         | Different results; trust issue
13 | Rename a column; column info to confirm                                           | Pass         | Pass                         | Similar; RAG unaware of its success
14 | Unique count for values in a column                                               | Pass         | Pass                         | RAG has better intent understanding
15 | Count the missing values in a table                                               | Pass         | Pass                         | RAG performs better; Base needs guidance
16 | Create a new promoted (global) table with a few lines of data                     | Fail         | Pass                         | RAG handles promotion well; Base fails to promote
17 | Join a table with the newly created table; the model must figure out the join key | Fail         | Fail                         | Challenging for both models
18 | Save a table and promote it: filter an existing table, save as a promoted table   | Fail         | Fail                         | Both models struggle with table saving
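
For reference, here is one way tasks 16 through 18 can be written by hand in SWAT. This is a sketch under assumed table and column names (cars, tariffs, Origin, MSRP), and the join uses the fedSQL action set, which is one of several possible approaches.

```python
import pandas as pd
import swat

conn = swat.CAS("cas-server.example.com", 5570, "username", "password")

# Task 16: create a small promoted (global-scope) table from a DataFrame.
df = pd.DataFrame({"Origin": ["Asia", "Europe", "USA"],
                   "Tariff": [0.05, 0.03, 0.0]})
conn.upload_frame(df, casout={"name": "tariffs", "caslib": "casuser",
                              "promote": True})

# Task 17: join on a shared key via fedSQL. Here the key (Origin) is
# spelled out by hand; inferring it was the part both models failed at.
conn.loadactionset("fedsql")
conn.fedsql.execdirect(query="""
    CREATE TABLE casuser.cars_tariffs AS
    SELECT c.*, t.Tariff
    FROM casuser.cars c
    INNER JOIN casuser.tariffs t ON c.Origin = t.Origin
""")

# Task 18: filter an existing table, materialize the result,
# save it to disk, and promote it to global scope.
cars = conn.CASTable("cars", caslib="casuser")
cars[cars["MSRP"] > 40000].partition(
    casout={"name": "expensive_cars", "caslib": "casuser"})
conn.table.save(table={"name": "expensive_cars", "caslib": "casuser"},
                name="expensive_cars.sashdat", caslib="casuser", replace=True)
conn.table.promote(name="expensive_cars", caslib="casuser")
```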

 

Ten Rounds of Prompts

 

This video presents a head-to-head challenge of ten rounds, in which we prompt the two agents to tackle a series of data management tasks in SAS Viya. The tasks range from simpler ones, such as describing columns and summarizing tables, to more complex operations like creating calculated columns, sorting, and identifying top values. We also cover grouping with aggregations, renaming columns, performing unique counts, and saving tables after applying filters.

 

In most cases, both agents succeed at the given tasks with the same results. Sometimes they both succeed, but with different results! Sometimes the RAG agent needs an extra "nudge" or further instructions. At other times, the RAG agent succeeds where the Base agent fails, or both fail.

 

I won’t comment on the full 28 minutes of the video, although I added a few explanations. Enjoy watching or scrolling through!

 

 

Excellence, or Instead of Conclusions

 

Two custom agents were tested:

 

  • GPT-4 "Base" excels in 2 tasks: column info and rename a column followed by column info.
  • GPT-4 with RAG shows superiority in 3 tasks: unique counts, missing values, promoting tables.

 

Overall, the results slightly favor the GPT-4 with RAG model, indicating a marginal edge in understanding and executing complex data management tasks.

 

Ultimately, the performance difference between the two models is relatively small. Considering the additional resources and time required to set up the RAG, one must weigh these against the need for precision.

 

For rapid outcomes where the highest accuracy is not critical, the 'Base' model is your go-to option: it provides quick results without the extra setup. The 1106-preview version of GPT-4 is also a far cry from the earlier text-davinci-003 model I tested for SAS code generation.

 

However, if your priority is tailored accuracy and you're dealing with complex tasks where nuanced understanding is key, the 'GPT-4 with RAG' model is likely the better choice, despite the additional investment.

 

The study emphasizes the importance of verifying the output of language models, especially when informing critical business decisions.

 

I hope you found this article insightful. Please feel free to reach out with feedback or suggestions for enhancing the agent or taking its capabilities to the next level.  

 

Acknowledgements

 

Thanks to Peter Styliadis for his great SWAT Series.

 

Additional Resources

  • SWAT Code Generation and Execution in SAS Viya with Azure OpenAI and LangChain: Behind the Scenes (our previous post)
  • Getting Started with Python Integration to SAS® Viya® (Peter Styliadis' series)
  • The python-swat package repository on GitHub (sassoftware/python-swat)

 

Thank you for your time reading this post. If you liked the post, give it a thumbs up! Please comment and tell us what you think about the approach. If you wish to get more information, please write me an email.

 

 

Find more articles from SAS Global Enablement and Learning here.

