How to Write Chatbot Test Cases for Support Flows

A team sat down to test their new support bot and wrote forty questions in a conference room. The bot passed all forty. Two weeks after launch, the most common real question was "where's my stuff," which nobody in the room had thought to write down because nobody on the team talks like that. The bot had never seen it. It guessed, got it half right, and the tickets came anyway.

That is the core problem with chatbot test cases written from imagination. Your team knows the business too well to phrase questions like a confused customer. The test cases that protect you are the ones you pull from how customers actually behave, not the ones you invent. This post is about where good test cases come from, what a good one looks like, and how many you actually need before deployment.

If you want the why behind testing at all, the post on pre-deployment chatbot testing that works makes that case. This post is the how: building the test set itself.

Where real test cases come from

Good chatbot test cases are not brainstormed. They are harvested. You already have a record of every question your customers ask, sitting in systems you check every day. The job is to mine it, not to invent it.

Source	What you get	Why it beats brainstorming
Historical tickets	The exact questions customers send, in their words	Real phrasing, real frequency, real edge cases
Live chat transcripts	Short, messy, conversational questions	Shows how people type when they expect a fast answer
Help-center search logs	What customers look for before they contact you	Reveals intents you may not have content for
Agent macros and saved replies	The answers your team already trusts	Tells you the expected output for each question
Escalation and refund threads	The high-risk, emotional, money-tied cases	The cases where a wrong bot answer costs the most

Notice what is not on that list: the team's intuition. Intuition is useful for spotting gaps, but it is a terrible source of phrasing. Customers abbreviate, misspell, skip context, and lead with frustration. Your documentation does none of those things, so test cases derived from documentation test a customer who does not exist.

Start with the last 90 days of tickets and sort by volume. The top of that list is where your bot earns or loses its value. A test case covering a question you get 200 times a month is worth more than ten clever edge cases you invented.
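A minimal sketch of that sort, assuming your help desk can export tickets with a creation date and an intent or category tag. The field names (`created_at`, `intent`) and the sample rows are hypothetical; swap in whatever your export actually provides.

```python
from collections import Counter
from datetime import datetime, timedelta

def top_intents(tickets, now, days=90, limit=20):
    """Return (intent, count) pairs for tickets from the last `days` days, most frequent first."""
    cutoff = now - timedelta(days=days)
    counts = Counter(t["intent"] for t in tickets if t["created_at"] >= cutoff)
    return counts.most_common(limit)

# Hypothetical export rows, for illustration only.
now = datetime(2024, 6, 30)
tickets = [
    {"created_at": datetime(2024, 6, 1), "intent": "order_status"},
    {"created_at": datetime(2024, 6, 3), "intent": "order_status"},
    {"created_at": datetime(2024, 5, 20), "intent": "refund_timing"},
    {"created_at": datetime(2023, 12, 1), "intent": "order_status"},  # outside the window, ignored
]
print(top_intents(tickets, now))  # [('order_status', 2), ('refund_timing', 1)]
```

If your tickets are not tagged by intent yet, tagging a sample of them by hand is usually enough to see which question types dominate.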

The anatomy of a good test case

A test case is not just a question. A question alone tells you nothing, because you have no way to grade the answer. A usable test case has three parts.

The question as customers phrase it. Not "What is your return policy for opened items?" but "can i send back something i already opened??". Copy the real phrasing, typos and all. If you clean it up, you are testing a different question.
The expected source. Which page, policy, or document holds the correct answer. This is what lets you check grounding, not just plausibility. If you do not know which source should answer a question, the bot does not either.
The expected behavior. What the bot should do, chosen from a small set: answer directly, ask a clarifying question, or escalate to a human. Deciding this in advance is what turns a vague "did that seem okay" into a pass or fail.

That third part matters most and gets skipped most. Not every test case should expect an answer. Some questions should make the bot ask for clarification. Some should make it escalate. A test case whose expected behavior is "escalate" passes when the bot escalates and fails when the bot tries to answer, even if the answer is good.

Expected behavior	When you want it	What a failure looks like
Answer	The question is in scope and the answer is in your content	Bot declines or escalates a question it should own
Clarify	The question is ambiguous or missing key detail	Bot guesses instead of asking which order, which plan
Escalate	High-risk, out-of-policy, emotional, or no source exists	Bot improvises an answer it has no basis for

A test case with no expected behavior is not a test. It is a demo. If you cannot say in advance whether the right move is answer, clarify, or escalate, you have no way to fail the bot, which means you have no way to catch it failing customers.
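One way to make the three parts concrete is to store each case as a small record and grade it with a function. This is a sketch, not a prescribed format: the names `BotTestCase`, `Behavior`, and `grade` are hypothetical, and checking the source only on "answer" cases is an assumption you may want to change.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Behavior(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    ESCALATE = "escalate"

@dataclass(frozen=True)
class BotTestCase:
    question: str                   # copied verbatim from a real ticket, typos and all
    expected_source: Optional[str]  # page or policy holding the answer; None if no source exists
    expected_behavior: Behavior

def grade(case, actual_behavior, actual_source=None):
    """Pass only if the behavior matches and, for answers, the expected source was used."""
    if actual_behavior != case.expected_behavior:
        return False
    if case.expected_behavior is Behavior.ANSWER:
        return actual_source == case.expected_source
    return True

case = BotTestCase("i was charged twice??", "Billing FAQ", Behavior.ESCALATE)
print(grade(case, Behavior.ANSWER, "Billing FAQ"))  # False: the bot answered a case that should escalate
print(grade(case, Behavior.ESCALATE))               # True
```

Because the expected behavior is written down before the run, the grade is a plain pass or fail rather than a judgment call made after reading the answer.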

A starter set of test cases

Here is a starter table you can adapt for a generic ecommerce or SaaS support mix. The phrasing is intentionally messy, because that is how the questions arrive. Replace the expected sources with your real page names, then expand each category with cases pulled from your own tickets.

#	Question (as a customer types it)	Category	Expected source	Expected behavior
1	where is my thing	Order status	Order tracking page	Clarify, then answer
2	order says delivered but i dont have it	Order status	Lost package policy	Answer, offer escalation
3	how long do refunds take	Refunds	Refund policy	Answer
4	can i return this if i opened it	Refunds	Return policy, final sale rules	Answer
5	i was charged twice??	Billing	Billing FAQ	Escalate with context
6	why is my bill higher this month	Billing	Pricing and plan page	Answer or clarify
7	cant log in	Account access	Password reset guide	Clarify, then answer
8	i need to change the email on my account	Account access	Account settings guide	Answer
9	do you ship to canada	Shipping	Shipping regions page	Answer
10	is the blue one in stock	Product detail	Product or inventory page	Clarify or escalate if no live data
11	cancel my subscription before it renews	Billing	Cancellation policy	Answer, confirm timing
12	this is the third time im contacting you	Emotional	None	Escalate, de-escalate first
13	i want to speak to a human	Handoff	None	Escalate immediately
14	asdjkl ??	Gibberish	None	Clarify
15	what's the weather	Off-topic	None	Decline, redirect to support

That is fifteen cases across the categories that drive most support volume. Notice the spread: not all of them expect an answer, several expect a clarify or an escalate, and a few exist only to confirm the bot behaves under nonsense. A test set that is all answerable questions is testing a fantasy version of your inbox.

How many test cases you actually need

The instinct is to write hundreds. Resist it. You will burn your afternoon chasing rare cases while the common ones go unverified.

The rule of thumb: cover your top 20 intents before you chase the tail. An intent is a question type, not a phrasing. "Where is my order" and "order hasn't arrived" are one intent. For each of your top 20 intents, write three to five test cases with different phrasings, including one messy version and one edge case. That is roughly 60 to 100 cases, and it covers the overwhelming majority of real volume.
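The rule above is easy to check mechanically. The sketch below, with hypothetical intent labels and a hypothetical `coverage_gaps` helper, flags any top intent that has fewer than three test cases. With 20 intents, three to five cases each works out to 20 × 3 = 60 at the low end and 20 × 5 = 100 at the high end.

```python
from collections import Counter

def coverage_gaps(top_intents, case_intents, min_cases=3):
    """Return {intent: case_count} for each top intent with fewer than `min_cases` test cases."""
    counts = Counter(case_intents)
    return {intent: counts[intent] for intent in top_intents if counts[intent] < min_cases}

# Hypothetical data: the intent label attached to each test case in the set.
top = ["order_status", "refund_timing", "login_problem"]
cases = ["order_status"] * 4 + ["refund_timing"] * 2
print(coverage_gaps(top, cases))  # {'refund_timing': 2, 'login_problem': 0}
```

An empty result means every top intent has at least the minimum number of phrasings.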

Volume is not evenly spread. A small number of intents usually account for most of your tickets. Verify those thoroughly before you write a single test case for something you see once a quarter. Depth on common questions beats breadth on rare ones.

Only after the top intents pass cleanly should you add cases for the long tail. And even then, add them in response to evidence: a real ticket the bot mishandled, a new policy, a product launch. Do not pad the set to feel thorough. A test set that is mostly questions nobody asks is noise that hides the failures that matter.

TideReply and running the set

Once you have a test set, you need to run it without spamming your own escalation inbox or creating fake customer conversations. TideReply's bot simulator runs your whole batch in dry-run mode: the real retrieval pipeline executes, but no conversation is created and no escalation email goes out. For each question you see the actual answer, the similarity scores behind it, and whether the bot would answer or escalate. That maps directly onto the three-part test case structure, because you can compare the bot's behavior against your expected behavior column at a glance and see, per case, whether it pulled the source you expected. When a case fails, you fix the content and re-run the set instead of retyping questions one at a time.

Maintaining the set as a living asset

A test set is not a launch artifact you file away. It is an asset that loses value the moment it goes stale. Policies change, products ship, and customers invent new phrasings. A test set frozen at launch tests last quarter's business.

Build a small habit instead of a big project:

Add a case whenever the bot mishandles a real conversation. Every production miss becomes a permanent regression test. The bot should never fail the same way twice.
Review the set when policies or products change. A new return window or a retired plan means some expected answers are now wrong. Update them as part of the same change, so the set is not grading the bot against an outdated policy.
Prune cases that no longer reflect reality. A test for a product you discontinued is just noise. Remove it.
Re-run before any meaningful content update goes live. The set is most valuable as a regression check, confirming a fix did not quietly break something that used to work.
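The regression check in the last habit can be as simple as comparing two runs of the same set. This is a sketch under an assumed format: each run is a dict mapping a case ID to pass or fail, and the IDs shown are made up.

```python
def regressions(previous, current):
    """Return IDs of cases that passed in `previous` and fail in `current`.

    Cases missing from `current` (for example, pruned ones) are ignored.
    """
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and case_id in current and not current[case_id]
    )

# Hypothetical results from before and after a content update.
before = {"refund-01": True, "login-02": True, "billing-03": False, "old-plan-04": True}
after = {"refund-01": True, "login-02": False, "billing-03": True}
print(regressions(before, after))  # ['login-02']
```

Here the update fixed `billing-03` but broke `login-02`, which is the kind of quiet breakage a re-run is meant to surface.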

Done this way, the test set keeps paying off. It also feeds the rest of your operation. The cases the bot keeps failing are exactly the questions where you can reduce support tickets with automation once the content is fixed, and the patterns across failures tell you where your documentation is weak. If you want the full pre-launch sweep that sits around these cases, the chatbot QA checklist covers the categories to test beyond individual questions.

Test cases are how you make the bot accountable

A bot without test cases is a bot you cannot hold to a standard. You can tell it sounds good or sounds off, but you cannot prove it answers your top questions correctly, and you certainly cannot prove it still does after last week's policy change.

Test cases change that. They turn "the bot seems fine" into "the bot passes 94 of our 100 cases and here are the 6 it does not." That is the difference between hoping automation works and knowing what it can be trusted with. Pull the cases from real customer behavior, write down the expected behavior before you run them, and keep the set alive as the business moves. The work is not glamorous, but it is what lets you put a bot in front of customers and answer for what it does.