
Vibe Coding with Generative AI and Test-Driven Development


Generative AI is transforming how software engineers and developers approach their work. By enabling code creation and modification through natural language prompts, it has great potential to introduce unprecedented efficiencies into the development process. A 2024 Gartner survey of 400 software engineering leaders in the U.S. and U.K. revealed that up to half of their teams are already leveraging AI tools to enhance workflows. Importantly, Gartner emphasizes that "Generative AI is not a replacement for software engineers, but a powerful augmentation tool". It automates repetitive tasks, freeing engineers to focus on more complex, creative, and strategic challenges—ultimately boosting productivity while preserving the value of human insight and innovation. Other researchers believe that beyond being an augmentation tool, AI assistants can amplify the skill level of engineers to allow them to focus more on problem solving, systems thinking, and design.

 

In June 2025 the authors of this post, from the Customer Intelligence Division at SAS Institute, set out to explore whether agentic Generative AI could be combined with Test-Driven Development (TDD) to build a fully functional application—without writing a single line of code themselves. This ambitious experiment was conducted during a three-day internal Hackathon. The core hypothesis was that if the agentic AI could remain actively engaged in the TDD cycle, the human team could shift their focus from implementation to design guidance. This approach promised not only greater efficiency but also a more structured and disciplined development process, fostering a shared understanding between human and AI collaborators throughout the engineering lifecycle.

 

Agentic AI

 

Agentic AI refers to artificial intelligence systems that can operate with a degree of autonomy, making decisions and taking actions to achieve specific goals without constant human intervention. These systems are designed to be proactive, goal-oriented, and capable of adapting to changing environments. Unlike traditional AI models that respond passively to prompts, agentic AI can plan, reason, and execute multi-step tasks—often coordinating with other agents or systems. This makes them particularly powerful in dynamic, complex domains where continuous decision-making is required. For example, it can significantly enhance the way software developers write code by acting as an intelligent, autonomous collaborator throughout the development process. Unlike traditional code assistants that respond to prompts, agentic AI can proactively identify coding needs, plan implementation strategies, and execute tasks with minimal guidance.

 

Test Driven Development

 

Test-Driven Development (TDD) is a software development approach where tests are written before the actual code. It follows a simple but powerful cycle known as Red-Green-Refactor. First, in the Red phase, the developer writes a test for a new feature or function that doesn’t yet exist—so the test fails (hence, red). This failure confirms that the test is valid and that the feature is not yet implemented. Next comes the Green phase, where the developer writes just enough code to make the test pass. The goal here is not to write perfect or optimized code, but simply to satisfy the test conditions. Once the test passes (green), the developer enters the Refactor phase. In this step, the code is cleaned up—improving structure, readability, or performance—without changing its behavior. The tests ensure that the refactoring doesn’t break anything. This cycle promotes clean, reliable, and maintainable code, and helps catch bugs early in the development process.
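
To make the cycle concrete, here is a minimal sketch of a single Red-Green-Refactor iteration using Python and pytest. The Scoreboard class, scoring rule, and file layout are illustrative assumptions only, not code from the hackathon project described later.

```python
# Hypothetical illustration of one Red-Green-Refactor iteration.
# In a real project the test (e.g., tests/test_scoreboard.py) is written and
# run first, and it fails because Scoreboard does not exist yet (Red).

class Scoreboard:
    """Green: the minimal implementation written only to make the test pass."""

    ARC_DISTANCE_FT = 22  # Refactor: a magic number later extracted into a named constant

    def __init__(self):
        self._scores = {}

    def record_basket(self, team, distance_ft):
        points = 3 if distance_ft >= self.ARC_DISTANCE_FT else 2
        self._scores[team] = self._scores.get(team, 0) + points

    def score(self, team):
        return self._scores.get(team, 0)


def test_given_a_made_shot_from_beyond_the_arc_then_three_points_are_scored():
    board = Scoreboard()
    board.record_basket(team="home", distance_ft=24)
    assert board.score("home") == 3
```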

 

Despite its well-documented benefits, Test-Driven Development (TDD) has seen relatively limited adoption among software developers. Several key factors seem to contribute to this:

 

  • Perceived Overhead and Time Pressure - Many developers view TDD as adding extra steps to the development process, especially in fast-paced environments where delivering features quickly is prioritized. Writing tests before writing code can feel counterintuitive or time-consuming, particularly when deadlines are tight. This perception of overhead often discourages teams from adopting TDD, even though it can save time in the long run by reducing bugs and rework.
  • Lack of Training and Cultural Support - TDD requires a shift in mindset and discipline that many developers haven't been formally trained in. Without strong organizational support or experienced mentors, developers may struggle to apply TDD effectively. In some teams, testing is still seen as a secondary concern or the responsibility of QA, rather than an integral part of development. This cultural gap can hinder widespread adoption.
  • Tooling and Legacy Code Challenges - While modern tools support TDD, integrating it into existing projects—especially those with large, untested codebases—can be daunting. Developers often find it difficult to retrofit tests into legacy systems or to maintain tests when requirements change frequently. This can lead to frustration and abandonment of the practice.

 

Hypothesis: Agentic Generative AI and TDD together as a force multiplier


Bringing agentic AI and Test-Driven Development together could have significant benefits. If agentic generative AI is integrated into a Test-Driven Development workflow, then software development teams can achieve significantly higher efficiency and code quality by offloading repetitive implementation tasks to the AI while maintaining human oversight through test design and validation. This hypothesis rests on the assumption that agentic AI can autonomously define test cases, generate corresponding code, and iteratively refine the implementation based on test outcomes. By keeping the AI "in the loop" of the red-green-refactor cycle, human developers can focus on higher-level design and architectural decisions, while the AI handles the mechanical aspects of coding. This could lead to faster development cycles, fewer bugs, and a more disciplined engineering process.

 

Starting out – Application requirements

 

At the start of the hackathon project, Microsoft Copilot (in the web browser) was used to brainstorm and refine ideas for a two-player basketball video game. Giving Copilot the "vibe" of a fast-paced game in the style of "Space Jam", the human engineers initially asked it to suggest basic gameplay mechanics, such as how players might control movement, shooting, and defense. Copilot responded with a simple framework involving keyboard controls, scoring logic, and game flow. From there, the engineers engaged in a few iterative discussions—asking follow-up questions, tweaking the rules, and exploring different game modes like timed matches or first-to-score challenges. With each iteration, Copilot helped clarify requirements, refine the game logic, and even suggested enhancements like power-ups and AI-controlled "smack talk" callouts. This back-and-forth process allowed the engineers to quickly evolve a rough concept into a more structured and engaging game design. Finally, the requirements for the video game were captured by asking Copilot to describe the complete application in Markdown format and save it as a downloadable .md file.

 

Working with agentic AI - Defining the rules of TDD

 

When using agentic AI, it is possible to provide a set of instructions that it should follow when autonomously performing tasks. Giving structured, detailed instructions to an agentic AI system - especially one embedded in a development environment like Visual Studio Code - serves several important purposes that can enhance its effectiveness, reliability, and alignment with human workflows:

 

  • Constrains Behavior to Best Practices - By embedding methodologies like Test-Driven Development, the instructions ensure the AI adheres to industry-standard practices. This can reduce the risk of introducing bugs, encourage modular and testable code, and promote continuous integration.
  • Reduces Ambiguity - Agentic AI operates based on patterns and context. Clear, step-by-step instructions eliminate ambiguity in how tasks should be approached, especially in complex workflows. This is critical for maintaining consistency and predictability in the AI’s behavior.
  • Promotes Incremental Progress - The Red-Green-Refactor cycle of TDD enforces a disciplined, incremental approach to development. This helps the AI avoid over-engineering, ensures that each change is validated by tests, and reduces debugging effort by isolating changes that can be quickly reverted in the event of issues or failures.
  • Supports Human-AI Collaboration - By requiring the AI to ask for clarification when multiple suggestions are present or when information is missing, the instructions can foster a collaborative dynamic. This keeps the human in the loop and ensures alignment with the developer’s intent.
  • Improves Traceability and Accountability - Instructing the AI to commit directly to the `main` branch with descriptive messages ensures that every change is documented and traceable. This is especially important for retaining the context of a change and is useful in regulated industries where auditability matters. Instructions that guide the AI through a structured process also create a paper trail capturing how the system was incrementally designed and implemented.
  • Encourages Self-Correction and Learning - The inclusion of steps in the instructions for reverting changes after failed tests and explaining why they failed encourages the AI to retain more context on its past actions. This mirrors human practices of learning through fast feedback and helps maintain a clean, working codebase.

 

The following instructions were given to the agentic AI (Copilot GPT-4.1 in Visual Studio Code) so that it followed the TDD cycle. Feel free to copy these for your own experiments (no warranty given!).

 

# Copilot Instructions for TDD and Trunk-Based Development

## Test-Driven Development (TDD) with Red-Green-Refactor

For each task, the Copilot agent should follow the TDD methodology:

1. **Red**: Write a single failing test (one at a time).
2. **Green**: Implement code to make that test pass.
3. **Refactor**: Improve the code while ensuring all tests pass.

### Detailed Steps:

You should follow these steps in order as they appear below:

**Important:** Only one feature, behavior, or test case should be implemented at a time. Do not attempt to implement multiple suggestions or 
features in parallel. If multiple suggestions are provided, ask for clarification or proceed with only the first one until it is fully 
completed through the Red-Green-Refactor cycle.

1. Start with one failing test only.
2. Write a test that describes the desired behavior or functionality.
3. Use Gherkin-style test names to describe the test's purpose.
4. Ensure the test is clear, concise, complete, and accurate.
5. Run the test to confirm it fails (this is the "Red" phase).
6. Implement the minimal code necessary to pass the test (this is the "Green" phase).
7. Check that all tests, including the new one, are passing again to ensure the new code does not break existing functionality.
8. If all tests are passing, go to step 13.
9. Following an attempted implementation, if the test fails, explain why it failed.
10. After you explain why a test failed, revert the changes to the last known good commit and ask for clarification.
11. Only after reverting the changes, remember why the test failed prior to attempting to implement the code.
12. After any reversion, run all tests again to ensure they pass.
13. If all tests pass, commit to the `main` branch except after a reversion.
14. After each commit, please review the entire codebase to determine if any code needs to be refactored.
15. If required, refactor the code to improve readability, maintainability, or performance explaining what you did.
16. Please inform us if no refactoring is required.
17. Run all tests again to ensure they still pass after refactoring.
18. If any test fails after refactoring, revert the changes to the last known good commit and ask for clarification.
19. Only when and if all tests are passing after refactoring, commit the changes to the `main` branch.
20. Use descriptive commit messages that explain the changes made and the tests added.

## Handling Multiple Suggestions

If asked to suggest what to implement next, provide a prioritized list. However, do not attempt to implement more than one suggestion at a 
time. Wait for confirmation or instruction before proceeding to the next item in the list. Always complete the full TDD cycle 
(Red-Green-Refactor) for one item before moving on.

## Trunk-Based Development

- Commit directly to the `main` branch.

 

Creating context and setting the AI to work

 

The instructions given above to the agentic AI were refined over a number of iterations. At the beginning of each new work session, the agent was prompted to read the instructions and review the complete code base. It was also told to ask for clarification on any items that were ambiguous or unclear. The agent was then prompted to suggest which work items the team should attempt next. Prompting the agent in this way created context before any work began. This was necessary because agentic AI systems based on large language models (LLMs) require context to be provided explicitly: they possess neither persistent memory nor an inherent understanding of ongoing tasks. Unlike humans, who can draw on lived experience and situational awareness, LLMs operate by predicting the most likely next output based on the input they receive at a given moment. This means that for an agent to act effectively—whether it's planning, reasoning, or executing tasks—it must be "primed" with relevant information about the goals, constraints, environment, and expectations. Providing this context allowed the model to align its responses and actions with the required intent, effectively simulating understanding and continuity. Without this grounding, an agent may generate plausible but irrelevant or suboptimal outputs, as it lacks the situational awareness needed to make informed decisions. In this experiment, when the agent did not perform as expected, the instructions were updated to fine-tune its behavior. After each update, the agent was asked to confirm that the instructions were clear and understood. At one point, the agent was also asked to rewrite the instructions so that they were more concise, which seemed to further aid its understanding.
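
For illustration, a session-start prompt along the following lines captures this priming routine; the wording is paraphrased from the description above rather than the exact prompt used during the hackathon.

```
Read the Copilot instructions file and review the complete codebase.
If anything in the instructions or the code is ambiguous or unclear, ask for clarification before doing anything else.
Then suggest, as a prioritized list, which work items we should attempt next, and wait for confirmation before starting the TDD cycle on the first one.
```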

 

Results of the experiment

 

Staying in the loop

As development on the application began, the initial red-green-refactor cycles were primarily guided by the human engineers. During its first attempt at contributing, the agent wrote six tests in a single step - prompting a reminder that it should only write one test at a time. Interestingly, this is a common mistake among those new to Test-Driven Development. To address this, the agent’s instructions were updated to clearly emphasize the principle of writing just one test per iteration. This early misstep highlighted the importance of clear, precise instructions when working with agentic AI. Unlike human developers, the agent lacks an intuitive grasp of development norms and must rely entirely on the guidance it receives. As the team iterated, they began to treat the agent more like a junior developer - providing structured feedback, refining prompts, and adjusting expectations based on its behavior. This collaborative dynamic not only improved the agent’s performance but also surfaced valuable insights about how to better scaffold AI contributions in complex workflows.

 

Commit early and often

As the agentic AI became more proficient, it began to “play all the right notes” - producing code and tests that were technically correct and aligned with the task. However, it soon became apparent that it wasn’t always “playing the notes in the right order.” One telling example was its failure to commit code to the Git repository at each green state in the red-green-refactor cycle. Although the instructions clearly specified this step, the agent often skipped it or delayed it until later in the process. It had to be reminded that committing at each green state is a core discipline of TDD, ensuring traceability and incremental progress. This behavior reflects a fundamental characteristic of large language models: they are excellent at generating plausible outputs, but they don’t inherently understand procedural constraints or the importance of execution order unless those constraints are made explicit and reinforced. When given a list of tasks, the model may treat them as loosely related suggestions rather than a strict sequence to follow. Its internal logic is driven by statistical associations, not by a built-in sense of causality or workflow discipline.
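
The discipline the agent had to be reminded of can be expressed as a simple gate: run the full suite, and only commit when everything is green. The sketch below assumes a pytest project inside a Git repository; the commit message is a placeholder.

```python
import subprocess
import pytest

# Gate for the "commit at every green state" rule: run the full test suite
# and commit to main only if every test passes.
if pytest.main(["-q"]) == 0:
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(
        ["git", "commit", "-m", "Green: describe the test added and the change made"],
        check=True,
    )
else:
    print("Suite is not green; do not commit.")
```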

 

Algorithmic Instructions

To address task sequence issues, the team refined the way instructions were presented. Rather than changing to a step-by-step interaction model, they continued to provide all instructions upfront - but reformatted them into clearly numbered, algorithmic steps. This structure helped the agent better recognize the intended sequence and reduced the likelihood of it skipping or reordering tasks. By making the procedural logic more explicit in the prompt itself, the team was able to preserve the efficiency of a single-file instruction format while improving the agent’s adherence to process. This experience underscored a broader insight: agentic AI doesn’t just need instructions - it needs orchestration. Like a skilled musician sight-reading a score, the agent can perform impressively, but without a conductor to guide the tempo and sequence, the performance can drift. The team’s role evolved from simply assigning tasks to actively managing the rhythm and flow of the agent’s contributions, ensuring that the right actions happened at the right time.

 

Refactor step matters

Another subtle but important challenge emerged around the practice of refactoring - an essential step in the Test-Driven Development cycle. While the agent was capable of writing tests and implementing features effectively, it often hesitated when it came to refactoring. Even when prompted with the standard red-green-refactor rhythm, the agent would frequently conclude that no refactoring was necessary. It would justify this by asserting that the current implementation was already clean or optimal, based on its internal model of what “good code” looks like. This reluctance appeared to stem from the way large language models generate code: they tend to produce what they perceive as the “ideal” version of a solution at the time of generation. As a result, each new feature was implemented with an implicit assumption that the surrounding codebase was already in its best possible state. From the agent’s perspective, there was no need to revisit or restructure what had just been written - unless explicitly told otherwise. In contrast, human developers understand that refactoring is not just about aesthetics or optimization, but about maintaining long-term clarity, reducing duplication, and improving design as the code evolves.
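
The kind of change the Refactor step exists to catch is often mundane, such as removing duplication once the tests are green. The snippet below is a hypothetical before-and-after in Python, not code taken from the game.

```python
# Before refactoring: two functions, both green, duplicating the scoring rule.
def points_for_home_basket(distance_ft):
    return 3 if distance_ft >= 22 else 2

def points_for_away_basket(distance_ft):
    return 3 if distance_ft >= 22 else 2


# After refactoring: one function and a named constant; behavior is unchanged,
# which the existing tests confirm when they are re-run.
ARC_DISTANCE_FT = 22

def points_for_basket(distance_ft):
    return 3 if distance_ft >= ARC_DISTANCE_FT else 2
```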

 

Prompts are the key to success

To address the importance of refactoring, the team dedicated a specific portion of the instruction set (steps 14 through 19) to that phase of the cycle. These steps outlined the expectation that the agent should pause after each green state to evaluate the code for opportunities to improve structure, readability, and maintainability. The structured emphasis across multiple steps helped signal that refactoring was not optional or occasional, but a routine and essential part of the workflow. This experience highlighted a key difference between human and AI reasoning in software development. Human developers often refactor based on intuition, experience, and a sense of evolving design. The agent, by contrast, relies on pattern recognition and statistical inference. Without a strong signal in the prompt or training data that refactoring is expected - even when the code “looks fine” - it defaults to leaving the code as-is. By embedding the expectation directly into the instruction flow, the team was able to guide the agent toward more disciplined and maintainable development practices.

 

Test failures must be real

One of the foundational principles of TDD is that tests must fail before they pass. This ensures that the test is meaningful - that it’s actually verifying behavior that doesn’t yet exist. Without this step, there’s a risk of writing tests that are either ineffective or falsely passing due to existing code. The team quickly realized that this principle needed to be explicitly enforced with the agentic AI. Early on, the agent would sometimes proceed directly from writing a test to implementing the corresponding feature, without verifying that the test failed first. This shortcut undermined the core feedback loop of TDD. To correct this, the team prompted the agent with a strict requirement: after writing a new test, the agent had to run it and confirm that it failed. If the test passed unexpectedly, the agent was instructed to revert to the last known green state and investigate why the failure didn’t occur - whether due to an overly broad test, preexisting functionality, or a mistake in the test logic itself. [Note: this enhancement was not made in the instructions file - but should have been].
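
This check can also be automated. The sketch below assumes a pytest project layout and a hypothetical test id; it runs only the newly written test and stops if that test does not fail, mirroring the rule the team enforced through prompting.

```python
# Minimal failure-first check (assumed pytest setup): run only the new test
# and insist that it fails before any implementation work begins.
import sys
import pytest

# Hypothetical node id of the test that was just written.
NEW_TEST = "tests/test_possession.py::test_given_a_steal_when_the_ball_is_taken_then_possession_changes"

exit_code = pytest.main(["-q", NEW_TEST])
if exit_code == 0:
    # The test passed with no new code: it may be too broad, the behavior may
    # already exist, or the test logic may be wrong. Stop and investigate.
    sys.exit("New test passed unexpectedly; revert to the last green state and review the test.")
print("New test fails as expected (Red); proceed to the Green phase.")
```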

 

Agents are good at writing tests when intent is explicit

The failing-test discipline was critical not only for maintaining the integrity of the development process but also for helping the agent internalize the rhythm of TDD. It reinforced the idea that each test should drive the next increment of functionality, and that the absence of a failure is a signal that something may be wrong—not a green light to proceed. Over time, this practice helped the agent become more reliable in its adherence to the red-green-refactor cycle. The need for this kind of enforcement again highlighted a key difference between human and AI reasoning. A human developer might instinctively recognize the importance of a failing test as a validation step. The agent, however, operates based on patterns and probabilities—it doesn’t “understand” the purpose of a failing test unless that purpose is explicitly encoded in its instructions. By embedding this check into the workflow, the team ensured that the agent’s behavior aligned more closely with the disciplined, test-first mindset that TDD demands.

 

Use Gherkin-style test names

To improve clarity and maintain continuity across sessions, the team adopted Gherkin-style naming conventions for tests - using the familiar “Given-When-Then” format. This approach made each test’s purpose immediately understandable, both to humans and to the agent. For example, a test named `Given the player has possession, when they press the shoot button near the basket, then the shot should animate and score if unobstructed` clearly communicates the scenario being tested without needing to inspect the test body. This naming convention also helped the agent maintain context more effectively. Because Gherkin-style names encode both the preconditions and expected outcomes, they served as a lightweight form of documentation that persisted across sessions. When the agent resumed work or reviewed previous tests, these descriptive names provided a natural anchor for understanding the current state of the game logic and the intent behind each test. As a result, the agent was better able to reason about what gameplay mechanics had already been covered and what behavior still needed to be implemented or refined.
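
In a Python test suite, the same convention can be expressed directly in the test function name; the example below is hypothetical and simply mirrors the shooting scenario described above, with the body elided.

```python
def test_given_the_player_has_possession_when_shoot_is_pressed_near_the_basket_then_the_shot_animates_and_scores_if_unobstructed():
    # Arrange a player with possession near the basket, press shoot,
    # then assert on the animation state and the score (details elided).
    ...
```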

 

Save the chat history

To ensure transparency and reproducibility, the team had the agent save its prompt history at the end of each TDD cycle and commit it to the repository - just like any other part of the project. This practice served multiple purposes. First, it created a clear audit trail of the agent’s reasoning and decision-making process, which was invaluable for reviewing its work. Second, it helped maintain continuity across sessions, allowing both the agent and human collaborators to pick up exactly where they left off. By treating the prompt history as a first-class artifact of the development process, the team reinforced the idea that the agent’s context is not ephemeral - it’s part of the evolving state of the project.

 

Mind the OS

When switching machines or environments, the team made it a point to remind the agent to adapt to the active development environment—whether it was macOS, Windows, or Linux. This was especially important for tasks involving file paths, shell commands, or environment-specific tooling. Without this reminder, the agent might default to assumptions from a previous session, leading to subtle errors or incompatibilities. By explicitly including environment-awareness in the prompt, the team ensured smoother transitions and more reliable execution across platforms. It was a small but essential habit that helped maintain consistency in a multi-OS development workflow.
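
On the code side, some of this friction can be reduced by steering the agent toward portable constructs rather than hard-coded, OS-specific paths or shell commands. A small Python illustration (file and directory names are hypothetical):

```python
import platform
from pathlib import Path

# Portable path handling that behaves the same on macOS, Windows, and Linux.
ASSETS_DIR = Path(__file__).resolve().parent / "assets"
SETTINGS_FILE = Path.home() / ".turbodunk" / "settings.json"  # hypothetical file name

# Any genuinely OS-specific behavior is isolated behind an explicit check
# instead of being assumed from a previous session's environment.
if platform.system() == "Windows":
    SETTINGS_FILE = Path.home() / "AppData" / "Roaming" / "TurboDunk" / "settings.json"
```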

 

Know your limits (prompt limits, that is!)

In environments where the number of requests or tokens is capped—whether due to API quotas, billing constraints, or platform limitations - agentic development can consume those resources quickly. Each cycle of prompting, testing, and refinement adds up, especially when the agent is verbose or exploratory in its responses. The team learned to be frugal: minimizing unnecessary back-and-forth, streamlining instructions, and reusing context where possible. Treating prompt tokens like a finite resource encouraged more thoughtful, efficient interactions and helped avoid hitting limits mid-development.

 

Conclusions

 

This experiment demonstrated that agentic AI, when paired with a structured methodology like Test-Driven Development (TDD), can serve as a powerful collaborator in software engineering. The AI was able to generate meaningful tests, implement features, and follow a disciplined development cycle—provided it was given clear, algorithmic instructions and consistent reinforcement of best practices. The team’s approach of treating the AI like a junior developer—complete with onboarding, feedback, and structured guidance—proved effective in aligning its behavior with human expectations.

 

Key insights emerged around the importance of orchestration, context-setting, and prompt design. The AI’s tendency to skip steps, reorder tasks, or avoid refactoring highlighted its lack of procedural awareness and the need for explicit instruction. Practices such as using Gherkin-style test names, saving prompt history, and adapting to the development environment helped maintain continuity and clarity. The experiment also revealed that while the AI can simulate understanding, it does not possess intrinsic awareness of process, intent, or quality—making human oversight essential.

 

Recommendations for Future Investigation

 

  1. Instruction Optimization and Compression - Explore ways to make instructions more compact and modular without sacrificing clarity. Investigate whether instruction templates or reusable prompt components can reduce token usage while maintaining effectiveness.
  2. Persistent Memory and Context Retention - Evaluate the impact of integrating persistent memory or long-term context tracking to reduce the need for repeated priming and improve continuity across sessions and environments.
  3. Refactoring Awareness Models - Research how to train or fine-tune models to better recognize when refactoring is appropriate, even without explicit prompts—potentially using code quality heuristics or design pattern recognition.
  4. Environment Detection and Adaptation - Develop automated mechanisms for the agent to detect the active OS and development environment, reducing the need for manual reminders and improving cross-platform reliability.
  5. Prompt History as a First-Class Artifact - Formalize the practice of saving and versioning prompt history. Investigate tools or formats that make this history easier to query, visualize, and reuse in future development cycles.
  6. Failure-First Test Validation - Incorporate automated checks to ensure that new tests fail before implementation begins. This could be enforced through tooling or integrated into the agent’s internal logic.
  7. Human-Agent Collaboration Patterns - Study different collaboration models—such as pair programming, review loops, or mentorship metaphors—to identify best practices for guiding agentic AI in real-world development teams.
  8. Token Efficiency Strategies - Investigate techniques for reducing token consumption, such as summarizing prior context, using compact instruction formats, or leveraging retrieval-augmented generation (RAG) to offload context storage.

 

Ta-da! TurboDunk, made with 100% AI-generated code.

 

Thanks go to my co-creators: Trey Hamilton, David Olaleye, and Visual Studio Code Copilot (GPT-4.1).
