
How to Write Chatbot Test Cases for Support Flows

Learn how to write chatbot test cases before deployment, where real test cases come from, and see chatbot test case examples for common support flows.

Tomas Peciulis
Founder at TideReply

A team sat down to test their new support bot and wrote forty questions in a conference room. The bot passed all forty. Two weeks after launch, the most common real question was "where's my stuff," which nobody in the room had thought to write down because nobody on the team talks like that. The bot had never seen it. It guessed, got it half right, and the tickets came anyway.

That is the core problem with chatbot test cases written from imagination. Your team knows the business too well to phrase questions like a confused customer. The test cases that protect you are the ones you pull from how customers actually behave, not the ones you invent. This post is about where good test cases come from, what a good one looks like, and how many you actually need before deployment.

If you want the why behind testing at all, the guide on pre-deployment chatbot testing that works makes that case. This post is the how: building the test set itself.

Where real test cases come from

Good chatbot test cases are not brainstormed. They are harvested. You already have a record of every question your customers ask, sitting in systems you check every day. The job is to mine it, not to invent it.

| Source | What you get | Why it beats brainstorming |
| --- | --- | --- |
| Historical tickets | The exact questions customers send, in their words | Real phrasing, real frequency, real edge cases |
| Live chat transcripts | Short, messy, conversational questions | Shows how people type when they expect a fast answer |
| Help-center search logs | What customers look for before they contact you | Reveals intents you may not have content for |
| Agent macros and saved replies | The answers your team already trusts | Tells you the expected output for each question |
| Escalation and refund threads | The high-risk, emotional, money-tied cases | The cases where a wrong bot answer costs the most |

Notice what is not on that list: the team's intuition. Intuition is useful for spotting gaps, but it is a terrible source of phrasing. Customers abbreviate, misspell, skip context, and lead with frustration. Your documentation does none of those things, so test cases derived from documentation test a customer who does not exist.

Start with the last 90 days of tickets and sort by volume. The top of that list is where your bot earns or loses its value. A test case covering a question you get 200 times a month is worth more than ten clever edge cases you invented.
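As a rough illustration of the "last 90 days, sorted by volume" step, here is a minimal Python sketch. It assumes each ticket has already been tagged with an intent label and a creation date; the field names `intent` and `created_at`, the function `top_intents`, and the sample data are all hypothetical, so adapt them to whatever your helpdesk export actually contains.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_intents(tickets, now, days=90, limit=20):
    """Return the most frequent intents among tickets created in the last `days` days."""
    cutoff = now - timedelta(days=days)
    counts = Counter(t["intent"] for t in tickets if t["created_at"] >= cutoff)
    return counts.most_common(limit)


# Hypothetical sample data standing in for a helpdesk export.
tickets = [
    {"intent": "order_status", "created_at": datetime(2024, 5, 20)},
    {"intent": "order_status", "created_at": datetime(2024, 5, 28)},
    {"intent": "refund_timing", "created_at": datetime(2024, 4, 2)},
    {"intent": "order_status", "created_at": datetime(2023, 11, 1)},  # outside the window
]

ranked = top_intents(tickets, now=datetime(2024, 6, 1))
print(ranked)
```

The ranked list tells you which intents to write cases for first. The phrasings themselves still have to come from the raw ticket text, not from the intent labels.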

The anatomy of a good test case

A test case is not just a question. A question alone tells you nothing, because you have no way to grade the answer. A usable test case has three parts.

  1. The question as customers phrase it. Not "What is your return policy for opened items?" but "can i send back something i already opened??". Copy the real phrasing, typos and all. If you clean it up, you are testing a different question.
  2. The expected source. Which page, policy, or document holds the correct answer. This is what lets you check grounding, not just plausibility. If you do not know which source should answer a question, the bot does not either.
  3. The expected behavior. What the bot should do, chosen from a small set: answer directly, ask a clarifying question, or escalate to a human. Deciding this in advance is what turns a vague "did that seem okay" into a pass or fail.

That third part matters most and gets skipped most. Not every test case should expect an answer. Some questions should make the bot ask for clarification. Some should make it escalate. A test case whose expected behavior is "escalate" passes when the bot escalates and fails when the bot tries to answer, even if the answer is good.

| Expected behavior | When you want it | What a failure looks like |
| --- | --- | --- |
| Answer | The question is in scope and the answer is in your content | Bot declines or escalates a question it should own |
| Clarify | The question is ambiguous or missing key detail | Bot guesses instead of asking which order, which plan |
| Escalate | High-risk, out-of-policy, emotional, or no source exists | Bot improvises an answer it has no basis for |

A test case with no expected behavior is not a test. It is a demo. If you cannot say in advance whether the right move is answer, clarify, or escalate, you have no way to fail the bot, which means you have no way to catch it failing customers.
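One way to make the three-part structure concrete is to store each case as a small record and grade it mechanically. The sketch below is only an illustration: the names `BotTestCase` and `grade` are made up for this example, and it assumes your bot or test harness can report which behavior it chose and which source it used.

```python
from dataclasses import dataclass
from typing import Optional

# The three behaviors named in this post; extend the set if your bot has more.
BEHAVIORS = {"answer", "clarify", "escalate"}


@dataclass(frozen=True)
class BotTestCase:
    question: str                   # real customer phrasing, typos and all
    expected_source: Optional[str]  # page or policy that holds the answer, if any
    expected_behavior: str          # one of BEHAVIORS

    def __post_init__(self):
        if self.expected_behavior not in BEHAVIORS:
            raise ValueError(f"unknown behavior: {self.expected_behavior!r}")


def grade(case, observed_behavior, observed_source=None):
    """Pass only if the behavior matches and, for answers, the source does too."""
    if observed_behavior != case.expected_behavior:
        return False
    if case.expected_behavior == "answer":
        return observed_source == case.expected_source
    return True


refund = BotTestCase("can i send back something i already opened??", "Return policy", "answer")
handoff = BotTestCase("i want to speak to a human", None, "escalate")

print(grade(refund, "answer", "Return policy"))   # True: right behavior, right source
print(grade(handoff, "answer", "Return policy"))  # False: should have escalated
```

Note that the second call fails no matter how good the answer was, which is exactly the point of deciding the expected behavior in advance.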

A starter set of test cases

Here is a starter table you can adapt for a generic ecommerce or SaaS support mix. The phrasing is intentionally messy, because that is how the questions arrive. Replace the expected sources with your real page names, then expand each category with cases pulled from your own tickets.

| # | Question (as a customer types it) | Category | Expected source | Expected behavior |
| --- | --- | --- | --- | --- |
| 1 | where is my thing | Order status | Order tracking page | Clarify, then answer |
| 2 | order says delivered but i dont have it | Order status | Lost package policy | Answer, offer escalation |
| 3 | how long do refunds take | Refunds | Refund policy | Answer |
| 4 | can i return this if i opened it | Refunds | Return policy, final sale rules | Answer |
| 5 | i was charged twice?? | Billing | Billing FAQ | Escalate with context |
| 6 | why is my bill higher this month | Billing | Pricing and plan page | Answer or clarify |
| 7 | cant log in | Account access | Password reset guide | Clarify, then answer |
| 8 | i need to change the email on my account | Account access | Account settings guide | Answer |
| 9 | do you ship to canada | Shipping | Shipping regions page | Answer |
| 10 | is the blue one in stock | Product detail | Product or inventory page | Clarify or escalate if no live data |
| 11 | cancel my subscription before it renews | Billing | Cancellation policy | Answer, confirm timing |
| 12 | this is the third time im contacting you | Emotional | None | Escalate, de-escalate first |
| 13 | i want to speak to a human | Handoff | None | Escalate immediately |
| 14 | asdjkl ?? | Gibberish | None | Ask for clarification |
| 15 | what's the weather | Off-topic | None | Decline, redirect to support |

That is fifteen cases across the categories that drive most support volume. Notice the spread: not all of them expect an answer, several expect a clarify or an escalate, and a few exist only to confirm the bot behaves under nonsense. A test set that is all answerable questions is testing a fantasy version of your inbox.
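A quick sanity check on the spread can be automated. The sketch below, with the hypothetical helper `missing_behaviors`, reports which of the three expected behaviors no case in a set exercises. It assumes each case has been reduced to a single primary behavior, which is a simplification of compound entries such as "Clarify, then answer".

```python
from collections import Counter

REQUIRED = ("answer", "clarify", "escalate")


def missing_behaviors(cases):
    """Return the expected behaviors that no case in the set exercises."""
    seen = Counter(behavior for _question, behavior in cases)
    return [b for b in REQUIRED if seen[b] == 0]


# A deliberately lopsided set: nothing here expects an escalation.
cases = [
    ("how long do refunds take", "answer"),
    ("do you ship to canada", "answer"),
    ("asdjkl ??", "clarify"),
]

print(missing_behaviors(cases))  # ['escalate']
```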

How many test cases you actually need

The instinct is to write hundreds. Resist it. You will burn your afternoon chasing rare cases while the common ones go unverified.

The rule of thumb: cover your top 20 intents before you chase the tail. An intent is a question type, not a phrasing. "Where is my order" and "order hasn't arrived" are one intent. For each of your top 20 intents, write three to five test cases with different phrasings, including one messy version and one edge case. That is roughly 60 to 100 cases, and it covers the overwhelming majority of real volume.

Volume is not evenly spread. A small number of intents usually account for most of your tickets. Verify those cold before you write a single test case for something you see once a quarter. Depth on common questions beats breadth on rare ones.
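To see how concentrated your own volume is, you can compute the share of tickets covered by the top N intents. This is a minimal sketch with invented numbers; `coverage_of_top` is a hypothetical helper, and the counts would come from your own ticket data.

```python
def coverage_of_top(counts, n):
    """Share of total ticket volume covered by the n most frequent intents."""
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:n]) / total if total else 0.0


# Hypothetical monthly ticket counts per intent.
counts = {"order_status": 200, "refund_timing": 120, "login": 60, "gift_wrap": 20}

for n in (1, 2, 3):
    print(n, f"{coverage_of_top(counts, n):.0%}")
```

If the curve flattens quickly, as it does in this made-up data, the first few intents deserve most of your test cases.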

Only after the top intents pass cleanly should you add cases for the long tail. And even then, add them in response to evidence: a real ticket the bot mishandled, a new policy, a product launch. Do not pad the set to feel thorough. A test set that is mostly questions nobody asks is noise that hides the failures that matter.

TideReply and running the set

Once you have a test set, you need to run it without spamming your own escalation inbox or creating fake customer conversations. TideReply's bot simulator runs your whole batch in dry-run mode: the real retrieval pipeline executes, but no conversation is created and no escalation email goes out. For each question you see the actual answer, the similarity scores behind it, and whether the bot would answer or escalate. That maps directly onto the three-part test case structure, because you can compare the bot's behavior against your expected behavior column at a glance and see, per case, whether it pulled the source you expected. When a case fails, you fix the content and re-run the set instead of retyping questions one at a time.

Maintaining the set as a living asset

A test set is not a launch artifact you file away. It is an asset that loses value the moment it goes stale. Policies change, products ship, and customers invent new phrasings. A test set frozen at launch tests last quarter's business.

Build a small habit instead of a big project:

  1. Add a case whenever the bot mishandles a real conversation. Every production miss becomes a permanent regression test. The bot should never fail the same way twice.
  2. Review the set when policies or products change. A new return window or a retired plan means some expected answers are now wrong. Update them before the bot does.
  3. Prune cases that no longer reflect reality. A test for a product you discontinued is just noise. Remove it.
  4. Re-run before any meaningful content update goes live. The set is most valuable as a regression check, confirming a fix did not quietly break something that used to work.
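The regression check in the last step can be as simple as comparing two runs of the same set. The sketch below assumes each run is stored as a mapping from a case ID to a pass/fail flag; the IDs and the `regressions` helper are hypothetical.

```python
def regressions(previous, current):
    """Return IDs of cases that passed in the previous run but fail in the current one.

    Cases missing from the current run are skipped, since they may have been pruned.
    """
    return sorted(
        case_id
        for case_id, passed in previous.items()
        if passed and case_id in current and not current[case_id]
    )


# Hypothetical results before and after a content update.
previous = {"refund-timing": True, "double-charge": True, "ship-canada": False, "old-product": True}
current = {"refund-timing": True, "double-charge": False, "ship-canada": True}

print(regressions(previous, current))  # ['double-charge']
```

Anything this returns is a case the content update broke, and it should block the update until it passes again.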

Done this way, the test set keeps paying off. It also feeds the rest of your operation. The cases the bot keeps failing are exactly the questions where automation could start reducing support tickets once the content is fixed, and the patterns across failures tell you where your documentation is weak. If you want the full pre-launch sweep that sits around these cases, the chatbot QA checklist covers the categories to test beyond individual questions.

Test cases are how you make the bot accountable

A bot without test cases is a bot you cannot hold to a standard. You can tell it sounds good or sounds off, but you cannot prove it answers your top questions correctly, and you certainly cannot prove it still does after last week's policy change.

Test cases change that. They turn "the bot seems fine" into "the bot passes 94 of our 100 cases and here are the 6 it does not." That is the difference between hoping automation works and knowing what it can be trusted with. Pull the cases from real customer behavior, write down the expected behavior before you run them, and keep the set alive as the business moves. The work is not glamorous, but it is what lets you put a bot in front of customers and answer for what it does.
