Nearly 95% of GenAI projects never make it past proof-of-concept. Chatbots that perform well in public benchmarks or demos often stumble when faced with the messy, unpredictable conversations of real users.
To help teams close that gap, we’re introducing Chat in Label Studio: full support for creating and evaluating chat-based AI experiences right inside the platform. Chat conversations are now a native data type that you can annotate and automate just like images, video, audio, and text, at both the message and conversation level.
In this blog, we highlight four of the most common ways teams are already using Chat:
With Chat, your chatbots don’t just launch and sit idle; you get the tools and data you need to continuously refine them based on real conversations.
Out-of-the-box models rarely perform well enough for production-level chatbots. Maybe you’re starting with an open-source LLM or an early RAG system, or trying to update a legacy chatbot that just isn’t meeting your needs. The first step is to evaluate your chatbot so you can understand its current quality level and exactly what you need to improve to get it to production. In short, you are assessing the limits of your chatbot’s knowledge and capabilities.
The important part is to make your evaluation actionable. Consolidating your evaluation results should tell you which part of the chatbot needs to be improved. So have your team evaluate the assistant's responses to understand whether your users' questions are being answered, whether the right supporting materials are being provided, and whether the tone is right. Inevitably, when your evaluation comes back below your requirements, you’ll know what to do: spin out a fine-tuning project, improve your knowledge base, and/or enhance your system prompt.
| Step | What to Do | Why It Matters |
|---|---|---|
| Create tasks with conversation data | Take a representative sample of conversations from your chatbot and upload them to your project | You want your evaluation dataset to be indicative of typical user behavior to get an accurate sense of performance |
| Per-message evaluation (user) | Rate the clarity of the request and classify the type of question the user is asking | Helps pinpoint the types of questions or instructions the chatbot needs help with, and whether the assistant asks clarifying questions when needed |
| Per-message evaluation (assistant) | Assess the assistant’s clarity and accuracy; classify whether user questions were sufficiently answered, whether sufficient supporting documentation was provided, and whether the proper tone was used | Shows whether your chatbot can answer user questions, has enough domain knowledge, and matches your brand/desired tone |
| Limit conversation length | Cap the length of each task by setting the number of messages (the maxMessages parameter) equal to the number of messages in the imported data | Ensures annotators cannot add messages and extend the conversation |
| Disable message editing | Set message editing to false in your project’s labeling configuration | Ensures annotators are only evaluating at this stage |
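To make the first step above concrete, here is a minimal sketch of uploading sampled conversations as evaluation tasks with the Label Studio Python SDK. It assumes the legacy Client interface, a hypothetical project ID, and a "messages" data key; point the key at whatever field your Chat labeling configuration actually reads from.

```python
# A minimal sketch: upload a representative sample of real conversations as
# evaluation tasks. Assumes the legacy label-studio-sdk Client; the URL,
# API key, and project ID are placeholders.
from label_studio_sdk import Client

ls = Client(url="https://app.humansignal.com", api_key="YOUR_API_KEY")
project = ls.get_project(12345)  # hypothetical project ID

# Conversations sampled from your chatbot, in OpenAI chat format
conversations = [
    [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security and click 'Reset password'."},
        {"role": "user", "content": "I don't see a Security tab."},
        {"role": "assistant", "content": "On older plans it's labeled 'Account & Security'."},
    ],
    # ... more sampled conversations
]

# One task per conversation. The "messages" key is an assumption: match it to
# your Chat labeling configuration, and set maxMessages equal to len(conv)
# so annotators can't extend the chat.
tasks = [{"data": {"messages": conv}} for conv in conversations]
project.import_tasks(tasks)
```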
After completing the evaluation, you’ll likely have work to do. You might find that your chatbot’s responses degrade as a conversation goes on, that it fails to ask clarifying questions when necessary, that it fails to consolidate different types of documentation effectively, or that the tone/style is off. For any problem like this, you’ll want to spin out a project to generate conversational data to fine-tune your chatbot.
Chat makes this process straightforward, giving you two paths:
Connect any OpenAI-format-compatible LLM (including your own) and run live, interactive conversations. To fine-tune your chatbot effectively, guide your team to create a large volume of exchanges that reflect real user behavior.
| Step | What to Do | Why It Matters |
|---|---|---|
| Create tasks | Define conversation goals: topics, tone, and scenarios (e.g. setup, billing, troubleshooting) | Ensures data reflects domain-specific user needs and trains your chatbot to specialize on key tasks |
| Select your LLM | Connect any OpenAI-format-compatible LLM (including your own) in Label Studio | Provides a foundation for generating live, interactive conversations |
| Encourage mixed topics | Instruct annotators to weave multiple subjects into one chat | Mimics real user behavior, teaching your chatbot to switch context smoothly |
| Enable message editing | Allow annotators to refine both user and assistant turns | Produces polished back-and-forth data for fine-tuning |
| Add conversation-level comments | Let annotators note strengths, weak spots, or tone/style issues | Captures qualitative insights that guide fine-tuning decisions |
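If you’re connecting your own model, it can save time to confirm it really speaks the OpenAI chat-completions format before pointing Label Studio at it. Here’s a rough sanity check using the openai Python client; the endpoint URL, API key, and model name are placeholders for your own deployment.

```python
# A rough sanity check that a self-hosted model speaks the OpenAI
# chat-completions format before connecting it in Label Studio.
# The base_url, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.internal.example.com/v1",  # hypothetical endpoint
    api_key="YOUR_ENDPOINT_KEY",
)

response = client.chat.completions.create(
    model="my-support-model",  # hypothetical model name
    messages=[
        {"role": "system", "content": "You are a concise billing-support assistant."},
        {"role": "user", "content": "Why was I charged twice this month?"},
    ],
)
print(response.choices[0].message.content)
```

If this call returns a sensible completion, the same endpoint should be ready to drive your live conversation tasks.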
Simulate a conversation by acting as both user and assistant (or get creative: invite a teammate over and alternate!). This is important for shaping the tone and personality of your chatbot. LLMs often have identifiable traits: overusing em-dashes, leaning on too many emojis, sounding overly sycophantic, or favoring certain words (looking at you, “delve” and “intricate”).
While including this information in your system prompt is recommended and will definitely help, we’ve found the best way to match your brand’s tone, and avoid your chatbot defaulting to the style of leading frontier chatbots, is to fine-tune with data that isn’t generated by those chatbots. Some of our clients also have mandates to not use LLM-generated data when training their chatbots.
| Step | What to Do | Why It Matters |
|---|---|---|
| Create tasks | Assign prompts that specify conversation themes, tone, or scenarios (e.g. onboarding questions, troubleshooting, billing) | Ensures coverage of the kinds of conversations your users will actually have |
| Alternate roles | Act as both user and assistant, or pair teammates to simulate realistic back-and-forth | Produces richer, more natural conversations than one-sided drafting |
| Message editing | Allow annotators to refine both user and assistant turns | Lets you shape ideal responses and conversational flow for fine-tuning |
| Conversation-level comments | Give annotators space to leave notes on tone, clarity, or style alignment | Captures qualitative insights to guide brand voice and training data curation |
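Once the simulated conversations are annotated, you’ll want them in a fine-tuning-ready format. Below is a sketch that writes one JSONL line per conversation in the common OpenAI chat format; extract_messages() is a hypothetical helper, since how you pull the final message list out of your export depends on your labeling configuration.

```python
# A sketch that turns exported, human-edited conversations into a JSONL
# fine-tuning file in the common OpenAI chat format ({"messages": [...]}
# per line). extract_messages() is a hypothetical helper.
import json

def extract_messages(exported_task: dict) -> list[dict]:
    """Return the edited conversation from one exported task as
    [{"role": ..., "content": ...}, ...]. Adjust to your export structure."""
    return exported_task["data"]["messages"]  # assumption

with open("label_studio_export.json") as f:
    exported_tasks = json.load(f)

with open("finetune_dataset.jsonl", "w") as out:
    for task in exported_tasks:
        messages = extract_messages(task)
        if len(messages) >= 2:  # skip empty or one-sided drafts
            out.write(json.dumps({"messages": messages}) + "\n")
```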
GenAI is inherently non-deterministic. That unpredictability makes GenAI chatbots and agents powerful, but also risky. Left untested, they can leak private data, produce harmful outputs, or be tricked into unsafe behavior. In some industries (pharma or finance, for example), the cost of an error like this can be extremely high.
That’s why red-teaming matters. This use case is about deliberately bypassing your chatbot’s safeguards so you can find weaknesses before real users exploit them. With Chat, you can design projects that encourage annotators to push the limits using diverse jailbreaking methods, while classifying every step of the conversation.
After a red-teaming project, you’ll have the evidence you need to act: update your system prompt, add guardrails, or fine-tune with negative examples.
| Step | What to Do | Why It Matters |
|---|---|---|
| Jailbreak ideas | Provide annotators with tactics such as prompt injection, roleplay, obfuscation, or long-running conversations that build “trust” | Ensures coverage across a wide range of realistic exploits |
| Minimum messages | Require a high minimum conversation length | Forces annotators to test the chatbot’s context window and push boundaries |
| Per-message evaluation (user) | Classify each user message by jailbreak tactic and severity (benign → risky → clear violation) | Pinpoints which attack methods are most effective |
| Per-message evaluation (assistant) | Mark where the chatbot was exploited and rate the quality of refusals | Identifies weaknesses and highlights effective refusal patterns |
| Message editing | Disable for consumer-facing bots (since users can’t edit). Enable for API chatbots to let annotators send false assistant messages | Matches evaluation to the real-world environment |
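When the red-teaming project wraps up, the per-message classifications can be rolled up into a simple report of which tactics got through. Here’s an illustrative sketch; it assumes you’ve flattened your export into one record per annotated user message, and the "tactic" and "exploited" field names are placeholders for your own labels.

```python
# An illustrative roll-up of red-teaming results: exploit rate per jailbreak
# tactic. Assumes one record per annotated user message; "tactic" and
# "exploited" are placeholder field names mapped to your own labels.
import json
from collections import Counter

with open("redteam_messages.json") as f:  # placeholder: your flattened export
    records = json.load(f)

attempts, exploits = Counter(), Counter()
for rec in records:
    attempts[rec["tactic"]] += 1
    if rec["exploited"]:
        exploits[rec["tactic"]] += 1

print(f"{'tactic':<24}{'attempts':>10}{'exploit rate':>14}")
for tactic, n in attempts.most_common():
    print(f"{tactic:<24}{n:>10}{exploits[tactic] / n:>14.0%}")
```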
Once your chatbot is in production, feedback usually comes in the simplest form: a thumbs up or down. That’s useful, but it doesn’t tell you why the chatbot worked or where it failed.
With Chat, you can import production conversations and dig deeper. Look at the ones users liked to refine what’s working. Study the ones they disliked to uncover where your chatbot falls short. And don’t just rely on explicit feedback; sample random conversations to get a full picture of user behavior.
| Step | What to Do | Why It Matters |
|---|---|---|
| Upload conversations | Import production chats as predictions into Label Studio | Moves beyond simple thumbs-up/down to structured evaluation |
| Evaluate each message | Use Likert scales to capture detailed preference signals | Provides richer feedback than binary ratings and pinpoints strengths/weaknesses |
| Aggregate results | Combine feedback to guide RLHF | Turns scattered user reactions into actionable improvements for your chatbot |
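As a sketch of the first step above, here is one way to import production chats with users’ thumbs-up/down attached as predictions, again using the Label Studio Python SDK’s legacy Client. The project ID, the "messages" data key, and the from_name/to_name values are placeholders that must match your own Chat labeling configuration.

```python
# A sketch of importing production chats with user feedback attached as
# predictions. Project ID, data key, and from_name/to_name are placeholders.
from label_studio_sdk import Client

ls = Client(url="https://app.humansignal.com", api_key="YOUR_API_KEY")
project = ls.get_project(67890)  # hypothetical project ID

production_logs = [
    {
        "messages": [
            {"role": "user", "content": "Cancel my subscription."},
            {"role": "assistant", "content": "Done, your plan ends on the 30th. Anything else?"},
        ],
        "user_feedback": "thumbs_down",
    },
    # ... include liked, disliked, and randomly sampled conversations
]

tasks = []
for log in production_logs:
    task = {"data": {"messages": log["messages"]}}
    if log.get("user_feedback"):
        task["predictions"] = [{
            "model_version": "prod-feedback",
            "result": [{
                "from_name": "user_feedback",  # placeholder control name
                "to_name": "dialogue",         # placeholder Chat tag name
                "type": "choices",
                "value": {"choices": [log["user_feedback"]]},
            }],
        }]
    tasks.append(task)

project.import_tasks(tasks)
```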
This approach turns simple thumbs-up/thumbs-down reactions into actionable data, giving your team richer signals to guide continuous improvement.
While we support the use cases above today, we’re excited to keep investing in all of our clients’ GenAI use cases. In particular, we’re interested in extending Chat in the following ways:
Whether you’re exploring pricing, tackling technical challenges, or looking to scale high-quality labeling, our team is ready to guide you. Chat with one of our humans.
We’re excited to help make your GenAI projects successful. Because GenAI is so open-ended, our focus is on giving you flexibility, so you can adapt Chat to your own use cases. The four examples we covered are just the beginning. We want to learn how you’ll push this feature further.
Here are a couple of ideas to spark experimentation:
We hope you enjoy working with Chat, and we’re looking forward to your feedback!