
Introducing Chat: 4 Use Cases to Ship a High Quality Chatbot

Nearly 95% of GenAI projects never make it past proof-of-concept. Chatbots that perform well in public benchmarks or demos often stumble when faced with the messy, unpredictable conversations of real users.

To help teams close that gap, we’re introducing Chat: full support for creating and evaluating chat-based AI experiences right inside Label Studio. Chat conversations are now a native data type that you can annotate and automate just like images, video, audio, and text, at both the message and conversation level.

In this blog, we highlight four of the most common ways teams are already using Chat:

  • Evaluate your chatbot to understand performance gaps and uncover areas for improvement
  • Generate realistic conversations for fine-tuning so your chatbot learns from the kinds of questions and behaviors real users actually bring
  • Red-team your chatbot before your users do by stress-testing against jailbreaks and adversarial prompts
  • Continuously improve with RLHF by evaluating production chats with detailed human feedback that refines accuracy and safety over time

With Chat, your chatbots don’t just launch and sit idle: you get the tools and data you need to continuously refine them based on real conversations.

Use Cases

Use Case 1: Evaluate Your Chatbot’s Baseline

Out-of-the-box models rarely perform well enough for production-level chatbots. Maybe you’re starting with an open-source LLM, an early RAG system, or a legacy chatbot that just isn’t meeting your needs. The first step is to evaluate your chatbot so you understand its current quality level and exactly what you need to improve to get it to production. In short, you are assessing the limits of your chatbot’s knowledge and capabilities.

The important part is to make sure your evaluation is actionable: consolidating the results should tell you which parts of the chatbot need improvement. So, have your team evaluate the assistant's responses to understand whether your users' questions are being answered, whether the right supporting materials are being provided, and whether the tone is right. Inevitably, when your evaluation comes back below your requirements, you’ll know what to do: spin out a fine-tuning project, improve your knowledge base, and/or enhance your system prompt.

Project Setup:
| Step | What to Do | Why It Matters |
| --- | --- | --- |
| Create tasks with conversation data | Take a representative sample of conversations from your chatbot and upload them to your project | You want your evaluation dataset to be indicative of typical user behavior to get an accurate sense of performance |
| Per-message evaluation (user) | Rate the clarity of the request and classify the type of question the user is asking | Helps pinpoint the types of questions or instructions the chatbot needs help with and whether the assistant is asking clarifying questions when needed |
| Per-message evaluation (assistant) | Assess the assistant’s clarity and accuracy; classify whether user questions were sufficiently answered, whether sufficient supporting documentation was provided, and whether the proper tone was used | Shows whether your chatbot can answer user questions, has enough domain knowledge, and matches your brand’s desired tone |
| Limit conversation length | Cap the length of each task by setting the number of messages (the maxMessages parameter) equal to the number of messages in the imported data | Ensures annotators cannot add messages and extend the conversation |
| Disable message editing | Set message editing to false in your project’s labeling configuration | Ensures annotators are only evaluating at this stage |
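
For reference, here’s a minimal sketch of what the import step could look like with the label-studio-sdk Python client. It assumes your Chat-based labeling config reads the transcript from a task field (called messages here) in OpenAI-style role/content format; the URL, API key, project ID, and field name are placeholders to adapt to your setup.

```python
# A minimal sketch of importing sampled conversations as evaluation tasks.
# Assumptions: the project already uses a Chat-based labeling config, and the
# Chat tag reads its transcript from the "messages" task field in
# OpenAI-style role/content format. Adjust names and IDs to your project.
from label_studio_sdk import Client

ls = Client(url="https://app.humansignal.com", api_key="YOUR_API_KEY")
project = ls.get_project(12345)  # hypothetical project ID

# A representative sample of conversations pulled from your chatbot logs
conversations = [
    [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and choose 'Reset password'."},
    ],
    # ...more sampled conversations
]

tasks = [{"data": {"messages": conversation}} for conversation in conversations]
project.import_tasks(tasks)
```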

Open in Label Studio

Use Case 2: Generate Conversation Data for Fine-Tuning

After completing the evaluation, you’ll likely have work to do. You might find that your chatbot’s responses degrade as a conversation goes on, that it fails to ask clarifying questions when necessary, that it doesn’t consolidate different types of documentation effectively, or that the tone and style are off. For any problem like this, you’ll want to spin out a project to generate conversational data for fine-tuning your chatbot.

Chat makes this process straightforward, giving you two paths:

Generate a Conversation with an LLM of your Choice

Connect any OpenAI-format-compatible LLM (including your own) and run live, interactive conversations. To fine-tune your chatbot effectively, guide your team to create a large volume of exchanges that reflect real user behavior.
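
If you’re not sure whether your model qualifies, “OpenAI-format-compatible” simply means the endpoint accepts chat-completions-style requests. As a rough illustration (the base URL, API key, and model name below are placeholders for your own deployment), this is the kind of request such an endpoint serves:

```python
# Illustration only: an OpenAI-format-compatible endpoint accepts
# chat.completions-style requests like this one. The base_url, api_key,
# and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="https://your-llm-host/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "I can't find my invoice for March."},
    ],
)
print(response.choices[0].message.content)
```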

Project Setup
| Step | What to Do | Why It Matters |
| --- | --- | --- |
| Create tasks | Define conversation goals: topics, tone, and scenarios (e.g. setup, billing, troubleshooting) | Ensures data reflects domain-specific user needs and trains your chatbot to specialize on key tasks |
| Select your LLM | Connect any OpenAI-format-compatible LLM (including your own) in Label Studio | Provides a foundation for generating live, interactive conversations |
| Encourage mixed topics | Instruct annotators to weave multiple subjects into one chat | Mimics real user behavior, teaching your chatbot to switch context smoothly |
| Enable message editing | Allow annotators to refine both user and assistant turns | Produces polished back-and-forth data for fine-tuning |
| Add conversation-level comments | Let annotators note strengths, weak spots, or tone/style issues | Captures qualitative insights that guide fine-tuning decisions |
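
As a rough sketch of the “Create tasks” step, you can pre-seed each task with a conversation goal so annotators know what kind of chat to drive with the connected LLM. The field names below (topic, tone, goal) are illustrative, not a required schema; surface them in your labeling config however your template expects.

```python
# A minimal sketch of generating tasks that seed conversation goals for
# annotators. The "topic"/"tone"/"goal" fields are illustrative assumptions,
# not a required schema.
import itertools
import json

topics = ["setup", "billing", "troubleshooting"]
tones = ["frustrated customer", "curious first-time user", "power user in a hurry"]

tasks = []
for topic, tone in itertools.product(topics, tones):
    tasks.append({
        "data": {
            "topic": topic,
            "tone": tone,
            "goal": (
                f"Hold a realistic {topic} conversation as a {tone}; "
                "weave in at least one unrelated question mid-chat."
            ),
        }
    })

# Import this file into the project, or push the tasks via the SDK.
with open("chat_generation_tasks.json", "w") as f:
    json.dump(tasks, f, indent=2)
```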

Open in Label Studio

Generate a Conversation Without an LLM

Simulate a conversation by acting as both user and assistant (or get creative and invite a teammate over and alternate!). This is important for shaping the tone and personality of your chatbot. LLMs often have identifiable traits: overusing em-dashes, leaning on too many emojis, sounding overly sycophantic, or favoring certain words (looking at you, “delve” and “intricate”; source).

While including this information in your system prompt is recommended and will definitely help, we’ve found the best way to match your brand’s tone, and avoid your chatbot defaulting to the style of leading frontier chatbots, is to fine-tune with data that isn’t generated by those chatbots. Some of our clients also have mandates to not use LLM-generated data when training their chatbots.

Project Setup
| Step | What to Do | Why It Matters |
| --- | --- | --- |
| Create tasks | Assign prompts that specify conversation themes, tone, or scenarios (e.g. onboarding questions, troubleshooting, billing) | Ensures coverage of the kinds of conversations your users will actually have |
| Alternate roles | Act as both user and assistant, or pair teammates to simulate realistic back-and-forth | Produces richer, more natural conversations than one-sided drafting |
| Message editing | Allow annotators to refine both user and assistant turns | Lets you shape ideal responses and conversational flow for fine-tuning |
| Conversation-level comments | Give annotators space to leave notes on tone, clarity, or style alignment | Captures qualitative insights to guide brand voice and training data curation |
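
Once the conversations are drafted and polished, a small script can turn your project export into fine-tuning data. The sketch below assumes each exported task ends up with its final transcript as a messages list of role/content dicts and writes OpenAI-style chat fine-tuning JSONL; adapt the extraction to wherever your export actually stores the conversation.

```python
# A minimal sketch of converting exported conversations into OpenAI-style
# chat fine-tuning JSONL (one {"messages": [...]} object per line).
# Assumption: each exported task stores its final transcript under
# data["messages"] as role/content dicts; adapt to your actual export.
import json

def to_finetune_record(messages, system_prompt):
    # Prepend a system message and keep the annotator-polished turns as-is.
    return {"messages": [{"role": "system", "content": system_prompt}] + messages}

with open("export.json") as f:  # Label Studio JSON export (assumed location)
    exported_tasks = json.load(f)

system_prompt = "You are Acme's support assistant. Be concise and friendly."

with open("finetune.jsonl", "w") as out:
    for task in exported_tasks:
        messages = task.get("data", {}).get("messages", [])
        if messages:
            out.write(json.dumps(to_finetune_record(messages, system_prompt)) + "\n")
```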

Open in Label Studio

Use Case 3: Red-Teaming: Break Your Chatbot Before Your Users Do

GenAI is inherently non-deterministic. That unpredictability makes GenAI chatbots and agents powerful, but also risky. Left untested, they can leak private data, produce harmful outputs, or be tricked into unsafe behavior. In some industries (pharma or finance, for example), the cost of an error like this can be extremely high.

That’s why red-teaming matters. This use case is about deliberately bypassing your chatbot’s safeguards so you can find weaknesses before real users exploit them. With Chat, you can design projects that encourage annotators to push the limits using diverse jailbreaking methods, while classifying every step of the conversation.

After a red-teaming project, you’ll have the evidence you need to act: update your system prompt, add guardrails, or fine-tune with negative examples.

Project Setup:
| Step | What to Do | Why It Matters |
| --- | --- | --- |
| Jailbreak ideas | Provide annotators with prompts like injection techniques, roleplay, obfuscation, or long-running conversations that build “trust” | Ensures coverage across a wide range of realistic exploits |
| Minimum messages | Require a high minimum conversation length | Forces annotators to test the chatbot’s context window and push boundaries |
| Per-message evaluation (user) | Classify each user message by jailbreak tactic and severity (benign → risky → clear violation) | Pinpoints which attack methods are most effective |
| Per-message evaluation (assistant) | Mark where the chatbot was exploited and rate the quality of refusals | Identifies weaknesses and highlights effective refusal patterns |
| Message editing | Disable for consumer-facing bots (since users can’t edit); enable for API chatbots to let annotators send false assistant messages | Matches evaluation to the real-world environment |
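
After the project runs, a quick aggregation shows which tactics break through most often. The sketch below is simplified and makes assumptions about the export: it expects choice results named “tactic” and “exploited” in the standard Label Studio JSON export and pairs them at the annotation level rather than per message; map these to the control names your labeling config actually uses.

```python
# A simplified sketch of summarizing red-team results: how often each
# jailbreak tactic led to an exploited response. The "tactic" and "exploited"
# control names are assumptions; this pairs them per annotation rather than
# per message for brevity.
import json
from collections import Counter

with open("redteam_export.json") as f:
    tasks = json.load(f)

attempts, successes = Counter(), Counter()

for task in tasks:
    for annotation in task.get("annotations", []):
        tactic, exploited = None, False
        for result in annotation.get("result", []):
            choices = result.get("value", {}).get("choices", [])
            if result.get("from_name") == "tactic" and choices:
                tactic = choices[0]
            if result.get("from_name") == "exploited" and "Yes" in choices:
                exploited = True
        if tactic:
            attempts[tactic] += 1
            successes[tactic] += int(exploited)

for tactic, n in attempts.most_common():
    print(f"{tactic}: {successes[tactic]}/{n} exploits ({successes[tactic] / n:.0%})")
```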

Open in Label Studio

Use Case 4: Evaluate and Monitor Production Conversations for Continuous Improvement via RLHF

Once your chatbot is in production, feedback usually comes in the simplest form: a thumbs up or down. That’s useful, but it doesn’t tell you why the chatbot worked, or failed.

With Chat, you can import production conversations and dig deeper. Look at the ones users liked to refine what’s working. Study the ones they disliked to uncover where your chatbot falls short. And don’t just rely on explicit feedback; sample random conversations to get a full picture of user behavior.

Project Setup:
| Step | What to Do | Why It Matters |
| --- | --- | --- |
| Upload conversations | Import production chats as predictions into Label Studio | Moves beyond simple thumbs-up/down to structured evaluation |
| Evaluate each message | Use Likert scales to capture detailed preference signals | Provides richer feedback than binary ratings and pinpoints strengths/weaknesses |
| Aggregate results | Combine feedback to guide RLHF | Turns scattered user reactions into actionable improvements for your chatbot |
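
To feed those ratings into an RLHF or preference-tuning pipeline, aggregate the per-message scores into conversation-level signals. The sketch below assumes a rating control named “response_quality” on a 1–5 scale in the standard JSON export; swap in the control name and scale your labeling config actually defines.

```python
# A minimal sketch of rolling per-message Likert ratings up into
# conversation-level scores and splitting strong vs. weak conversations.
# The "response_quality" control name and 1-5 scale are assumptions.
import json
from statistics import mean

with open("production_eval_export.json") as f:
    tasks = json.load(f)

scored = []
for task in tasks:
    ratings = []
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            if result.get("from_name") == "response_quality":
                ratings.append(result["value"]["rating"])
    if ratings:
        scored.append((task["id"], mean(ratings)))

# High scorers can anchor "chosen" examples; low scorers become "rejected"
# examples or targets for deeper review.
chosen = [t for t in scored if t[1] >= 4]
rejected = [t for t in scored if t[1] <= 2]
print(f"{len(chosen)} strong conversations, {len(rejected)} weak conversations")
```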

Open in Label Studio

This approach turns simple thumbs-up/thumbs-down reactions into actionable data, giving your team richer signals to guide continuous improvement.

Looking Forward

We’re proud to support the use cases above, and we’re excited to keep investing in all of our clients’ GenAI use cases. We’re interested in extending Chat’s capabilities in the following ways:

  1. Multi-LLM Comparison – have a single user message generate multiple responses in parallel (likely from different LLMs) so users can select their preferred response.
  2. Support tool evaluation – understand which tools your chatbot is calling and evaluate whether they’re the right ones.
  3. Multi-Modal – use image, audio, and video in your requests and responses.

Whether you’re exploring pricing, tackling technical challenges, or looking to scale high-quality labeling, our team is ready to guide you. Chat with one of our humans.

Conclusion

We’re excited to help make your GenAI projects successful. Because GenAI is so open-ended, our focus is on giving you flexibility so you can adapt Chat to your own use cases. The four use cases we covered are just the beginning, and we want to learn how you’ll push this feature further.

Here are a couple of ideas to spark experimentation:

  • Run controlled comparisons: Spin up two identical projects, change a single variable (the model, the system prompt, or the ranker model in your RAG system), and see quantitatively which performs best. Relying on gut feel is what lands most projects in the 95% that never deliver value.
  • Capture reasoning data: Create a project that uses only user messages. Ask SMEs to break down complex problems step by step, then use that structured chain-of-thought data to fine-tune a chatbot that leverages a reasoning model (see the sketch below).
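
For the reasoning-data idea, here’s a hypothetical sketch of packaging SME-written, step-by-step reasoning into chat-format fine-tuning records. The problem/steps/answer fields are illustrative only, not a prescribed schema.

```python
# A hypothetical sketch of turning SME-written reasoning steps into
# chat-format fine-tuning records. Field names here are illustrative.
import json

def to_reasoning_record(problem: str, steps: list[str], answer: str) -> dict:
    reasoning = "\n".join(f"Step {i + 1}: {step}" for i, step in enumerate(steps))
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": f"{reasoning}\n\nAnswer: {answer}"},
        ]
    }

record = to_reasoning_record(
    problem="A customer was double-billed after a plan upgrade. What refund do they get?",
    steps=[
        "Identify the two charges and the dates they were applied.",
        "Confirm the plan change date and which charge overlaps it.",
        "Compute the prorated amount for the overlapping period.",
    ],
    answer="Refund the prorated charge for the duplicated billing period.",
)
print(json.dumps(record, indent=2))
```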

We hope you enjoy working with Chat, and we’re looking forward to your feedback!
