A team sat down to test their new support bot and wrote forty questions in a conference room. The bot passed all forty. Two weeks after launch, the most common real question was "where's my stuff," which nobody in the room had thought to write down because nobody on the team talks like that. The bot had never seen it. It guessed, got it half right, and the tickets came anyway.
That is the core problem with chatbot test cases written from imagination. Your team knows the business too well to phrase questions like a confused customer. The test cases that protect you are the ones you pull from how customers actually behave, not the ones you invent. This post is about where good test cases come from, what a good one looks like, and how many you actually need before deployment.
If you want the why behind testing at all, the post on pre-deployment chatbot testing that works makes that case. This post is the how: building the test set itself.
Where real test cases come from
Good chatbot test cases are not brainstormed. They are harvested. You already have a record of every question your customers ask, sitting in systems you check every day. The job is to mine it, not to invent it.
| Source | What you get | Why it beats brainstorming |
|---|---|---|
| Historical tickets | The exact questions customers send, in their words | Real phrasing, real frequency, real edge cases |
| Live chat transcripts | Short, messy, conversational questions | Shows how people type when they expect a fast answer |
| Help-center search logs | What customers look for before they contact you | Reveals intents you may not have content for |
| Agent macros and saved replies | The answers your team already trusts | Tells you the expected output for each question |
| Escalation and refund threads | The high-risk, emotional, money-tied cases | The cases where a wrong bot answer costs the most |
Notice what is not on that list: the team's intuition. Intuition is useful for spotting gaps, but it is a terrible source of phrasing. Customers abbreviate, misspell, skip context, and lead with frustration. Your documentation does none of those things, so test cases derived from documentation test a customer who does not exist.
Start with the last 90 days of tickets and sort by volume. The top of that list is where your bot earns or loses its value. A test case covering a question you get 200 times a month is worth more than ten clever edge cases you invented.
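As a rough sketch of that first pass, assuming you can export tickets with a date and some kind of intent or topic label (the tuple layout, the labels, and the sample rows below are all made up for illustration), counting by volume can be this small:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical ticket export: (date received, intent label, customer's raw text).
# The labels might come from helpdesk tags or a quick manual pass; that is an assumption.
tickets = [
    (date(2024, 5, 20), "order_status", "where's my stuff"),
    (date(2024, 5, 18), "order_status", "order says delivered but i dont have it"),
    (date(2024, 5, 11), "refunds", "how long do refunds take"),
    (date(2024, 4, 2), "order_status", "wheres my order??"),
    (date(2023, 11, 30), "billing", "i was charged twice??"),  # outside the 90-day window
]


def top_intents(tickets, today, window_days=90):
    """Count tickets per intent inside the window, most frequent first."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(intent for received, intent, _ in tickets if received >= cutoff)
    return counts.most_common()


ranked = top_intents(tickets, today=date(2024, 6, 1))
print(ranked)  # [('order_status', 3), ('refunds', 1)]
```

The raw text stays in the export on purpose: once you know which intents lead the list, those rows are where the real phrasings for your test cases come from.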
The anatomy of a good test case
A test case is not just a question. A question alone tells you nothing, because you have no way to grade the answer. A usable test case has three parts.
- The question as customers phrase it. Not "What is your return policy for opened items?" but "can i send back something i already opened??". Copy the real phrasing, typos and all. If you clean it up, you are testing a different question.
- The expected source. Which page, policy, or document holds the correct answer. This is what lets you check grounding, not just plausibility. If you do not know which source should answer a question, the bot does not either.
- The expected behavior. What the bot should do, chosen from a small set: answer directly, ask a clarifying question, or escalate to a human. Deciding this in advance is what turns a vague "did that seem okay" into a pass or fail.
That third part matters most and gets skipped most. Not every test case should expect an answer. Some questions should make the bot ask for clarification. Some should make it escalate. A test case whose expected behavior is "escalate" passes when the bot escalates and fails when the bot tries to answer, even if the answer is good.
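One way to make the three parts concrete is a small record plus a grading function. This is a minimal sketch, not a prescribed format: the names `BotTestCase`, `BotResult`, and `grade` are invented here, and the rule that an answer must also cite the expected source is one reasonable reading of "check grounding, not just plausibility."

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Behavior(Enum):
    ANSWER = "answer"
    CLARIFY = "clarify"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class BotTestCase:
    question: str                   # verbatim customer phrasing, typos included
    expected_source: Optional[str]  # page or policy holding the answer; None if no source exists
    expected_behavior: Behavior


@dataclass(frozen=True)
class BotResult:
    behavior: Behavior              # what the bot actually did
    source: Optional[str]           # the source it drew on, if any


def grade(case: BotTestCase, result: BotResult) -> bool:
    """Pass only if the behavior matches; for answers, the source must match too."""
    if result.behavior != case.expected_behavior:
        return False
    if case.expected_behavior is Behavior.ANSWER:
        return result.source == case.expected_source
    return True


opened_return = BotTestCase(
    "can i send back something i already opened??", "Return policy", Behavior.ANSWER
)
wants_human = BotTestCase("i want to speak to a human", None, Behavior.ESCALATE)

print(grade(opened_return, BotResult(Behavior.ANSWER, "Return policy")))  # True
print(grade(wants_human, BotResult(Behavior.ANSWER, "Contact page")))     # False
```

The second case fails even though the bot produced an answer, which is the point: the expected behavior was to escalate, so answering is a miss regardless of how good the answer reads.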
| Expected behavior | When you want it | What a failure looks like |
|---|---|---|
| Answer | The question is in scope and the answer is in your content | Bot declines or escalates a question it should own |
| Clarify | The question is ambiguous or missing key detail | Bot guesses instead of asking which order, which plan |
| Escalate | High-risk, out-of-policy, emotional, or no source exists | Bot improvises an answer it has no basis for |
A test case with no expected behavior is not a test. It is a demo. If you cannot say in advance whether the right move is answer, clarify, or escalate, you have no way to fail the bot, which means you have no way to catch it failing customers.
A starter set of test cases
Here is a starter table you can adapt for a generic ecommerce or SaaS support mix. The phrasing is intentionally messy, because that is how the questions arrive. Replace the expected sources with your real page names, then expand each category with cases pulled from your own tickets.
| # | Question (as a customer types it) | Category | Expected source | Expected behavior |
|---|---|---|---|---|
| 1 | where is my thing | Order status | Order tracking page | Clarify, then answer |
| 2 | order says delivered but i dont have it | Order status | Lost package policy | Answer, offer escalation |
| 3 | how long do refunds take | Refunds | Refund policy | Answer |
| 4 | can i return this if i opened it | Refunds | Return policy, final sale rules | Answer |
| 5 | i was charged twice?? | Billing | Billing FAQ | Escalate with context |
| 6 | why is my bill higher this month | Billing | Pricing and plan page | Answer or clarify |
| 7 | cant log in | Account access | Password reset guide | Clarify, then answer |
| 8 | i need to change the email on my account | Account access | Account settings guide | Answer |
| 9 | do you ship to canada | Shipping | Shipping regions page | Answer |
| 10 | is the blue one in stock | Product detail | Product or inventory page | Clarify or escalate if no live data |
| 11 | cancel my subscription before it renews | Billing | Cancellation policy | Answer, confirm timing |
| 12 | this is the third time im contacting you | Emotional | None | Escalate, de-escalate first |
| 13 | i want to speak to a human | Handoff | None | Escalate immediately |
| 14 | asdjkl ?? | Gibberish | None | Clarify |
| 15 | what's the weather | Off-topic | None | Decline, redirect to support |
That is fifteen cases across the categories that drive most support volume. Notice the spread: not all of them expect an answer, several expect the bot to clarify or escalate, and a few exist only to confirm the bot behaves under nonsense. A test set that is all answerable questions is testing a fantasy version of your inbox.
How many test cases you actually need
The instinct is to write hundreds. Resist it. You will burn your afternoon chasing rare cases while the common ones go unverified.
The rule of thumb: cover your top 20 intents before you chase the tail. An intent is a question type, not a phrasing. "Where is my order" and "order hasn't arrived" are one intent. For each of your top 20 intents, write three to five test cases with different phrasings, including one messy version and one edge case. That is roughly 60 to 100 cases, and it covers the overwhelming majority of real volume.
Volume is not evenly spread. A small number of intents usually account for most of your tickets. Verify those thoroughly before you write a single test case for something you see once a quarter. Depth on common questions beats breadth on rare ones.
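To see how uneven your own volume is, you can walk the intents from most to least frequent and stop once they cover a chosen share of tickets. The sketch below is illustrative only: the monthly counts are invented, and the 80% target is an assumption, not a figure from this post.

```python
def intents_to_cover(volume_by_intent, target_share=0.8):
    """Return the top intents needed to reach target_share of volume, plus the share reached."""
    total = sum(volume_by_intent.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    ranked = sorted(volume_by_intent.items(), key=lambda kv: kv[1], reverse=True)
    for intent, count in ranked:
        if covered / total >= target_share:
            break
        chosen.append(intent)
        covered += count
    return chosen, covered / total


# Hypothetical monthly ticket counts per intent.
monthly_volume = {
    "order_status": 200,
    "refunds": 120,
    "billing": 80,
    "account_access": 50,
    "shipping": 30,
    "other": 20,
}

chosen, share = intents_to_cover(monthly_volume)
print(chosen, share)  # ['order_status', 'refunds', 'billing'] 0.8
```

In this made-up inbox, three intents out of six carry 80% of the tickets, so three to five phrasings each for those three is where the first cases should go.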
Only after the top intents pass cleanly should you add cases for the long tail. And even then, add them in response to evidence: a real ticket the bot mishandled, a new policy, a product launch. Do not pad the set to feel thorough. A test set that is mostly questions nobody asks is noise that hides the failures that matter.
TideReply and running the set
Once you have a test set, you need to run it without spamming your own escalation inbox or creating fake customer conversations. TideReply's bot simulator runs your whole batch in dry-run mode: the real retrieval pipeline executes, but no conversation is created and no escalation email goes out. For each question you see the actual answer, the similarity scores behind it, and whether the bot would answer or escalate. That maps directly onto the three-part test case structure, because you can compare the bot's behavior against your expected behavior column at a glance and see, per case, whether it pulled the source you expected. When a case fails, you fix the content and re-run the set instead of retyping questions one at a time.
Maintaining the set as a living asset
A test set is not a launch artifact you file away. It is an asset that loses value the moment it goes stale. Policies change, products ship, and customers invent new phrasings. A test set frozen at launch tests last quarter's business.
Build a small habit instead of a big project:
- Add a case whenever the bot mishandles a real conversation. Every production miss becomes a permanent regression test. The bot should never fail the same way twice.
- Review the set when policies or products change. A new return window or a retired plan means some expected answers are now wrong. Update them before the bot does.
- Prune cases that no longer reflect reality. A test for a product you discontinued is just noise. Remove it.
- Re-run before any meaningful content update goes live. The set is most valuable as a regression check, confirming a fix did not quietly break something that used to work.
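The regression check in that last habit comes down to comparing two runs of the same set. A minimal sketch, assuming each run is stored as a mapping from a case ID to pass or fail (the IDs and storage format are hypothetical):

```python
def regressions(previous, current):
    """Case IDs that passed in the previous run and fail in the current one.

    Cases missing from the current run are ignored, so pruned cases
    do not show up as regressions.
    """
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and case_id in current and not current[case_id]
    )


# Hypothetical results before and after a content update.
before = {"refund-timing": True, "opened-return": True, "charged-twice": False}
after = {"refund-timing": True, "opened-return": False, "charged-twice": True}

print(regressions(before, after))  # ['opened-return']
```

Here the update fixed one case and quietly broke another, which is exactly the kind of change a before-and-after comparison is meant to surface.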
Done this way, the test set keeps paying off. It also feeds the rest of your operation. The cases the bot keeps failing are exactly the questions where automation could reduce support tickets once the content is fixed, and the patterns across failures tell you where your documentation is weak. If you want the full pre-launch sweep that sits around these cases, the chatbot QA checklist covers the categories to test beyond individual questions.
Test cases are how you make the bot accountable
A bot without test cases is a bot you cannot hold to a standard. You can tell it sounds good or sounds off, but you cannot prove it answers your top questions correctly, and you certainly cannot prove it still does after last week's policy change.
Test cases change that. They turn "the bot seems fine" into "the bot passes 94 of our 100 cases and here are the 6 it does not." That is the difference between hoping automation works and knowing what it can be trusted with. Pull the cases from real customer behavior, write down the expected behavior before you run them, and keep the set alive as the business moves. The work is not glamorous, but it is what lets you put a bot in front of customers and answer for what it does.