Large language models (LLMs) are powerful, but their raw outputs aren't always useful in practice. To get them to perform better, we fine-tune them with human feedback. This is how we align them with our preferences and mitigate bias.
A common technique for fine-tuning LLMs is Reinforcement Learning with Verifiable Rewards (RLVR). RLVR trains a model against a clear reward signal, but the process only works when you have a large dataset of human-preferred responses to learn from.
This is where Label Studio becomes valuable. It offers ready-made templates that make it easier to gather preference data. Each template supports a different annotation style and outputs data in a structured format.
In this article, we'll look at the role of preference data in RLVR, explore Label Studio's generative AI templates, and build our own custom RLVR template for fine-tuning a math tutor bot. You can also try it out here.
Human feedback is essential for RLVR training because it is used to build a reward model that acts as a scoring system or guide. The reward model is trained on human-annotated data to learn what makes a response "good" or "bad". Its purpose is to automate feedback by assigning a numerical score, or "reward", to any text the language model generates.
However, directly defining a reward function that captures human nuance is nearly impossible. Human preferences are complex and subjective, which makes them extremely difficult to encode in a static reward function.
Templates provide an effective way to collect annotator data that defines what “good” and “bad” responses look like. These comparisons act as direct signals that help train the system to give better answers over time. Templates also make it easier to gather feedback consistently across projects without the need to build custom tools.
After you've gathered feedback, it's used to shape the reward model, which in turn guides the LLM's learning. This iterative feedback loop is the foundation of the RLVR process. The reward model provides a learning signal, which is used by a reinforcement learning algorithm like Proximal Policy Optimization (PPO) to shape the LLM’s behavior. This process allows feedback to be refined and reused across different projects.
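To make the reward model's role concrete, here is a deliberately simplified sketch. The scoring function below is only a heuristic stand-in that favors step-by-step, explanatory answers (in the spirit of the math tutor example later in this article), not a learned reward model, but it illustrates the basic contract: a generated response goes in, a numerical score comes out, and the RL algorithm uses that number to adjust the model.

# Toy stand-in for a reward model. A real reward model is a neural network
# trained on human preference labels and conditions on the prompt as well;
# this heuristic only illustrates the "response in, numerical score out" contract.
def toy_reward(prompt: str, response: str) -> float:
    words = response.lower().split()
    score = 0.0
    if set(words) & {"because", "so", "then", "therefore", "step"}:
        score += 0.5                       # reward explanatory language
    score += min(len(words) / 40, 0.5)     # reward longer answers, up to a cap
    return score

candidates = [
    "x = 2",
    "Subtract 3 from both sides to get 2x = 4, then divide by 2, so x = 2.",
]
for response in candidates:
    print(round(toy_reward("Solve for x: 2x + 3 = 7", response), 2), "->", response)

Here the detailed, explanatory answer scores higher than the bare answer, which is exactly the behavior we want the learned reward model to pick up from human annotations.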
Here are the benefits of using human preference data with RLVR:
Variety of Applications: RLVR is widely used to train language models to generate more helpful responses. It can also train bots and game-playing agents based on human preferences.
Label Studio offers a diverse range of Generative AI templates, along with options for Computer Vision, Natural Language Processing, Audio/Speech, and more. These templates facilitate the creation of datasets for fine-tuning models like GPT-4, LLaMA, and PaLM 2.
Each of these templates is useful at a different point in the fine-tuning process.
Sometimes the existing templates are not enough for a specific fine-tuning project, so we have to create our own. Let’s say we’re training a math tutor bot with RLVR. In this case, we want annotators to mark detailed, explanatory answers as “good” and short or unhelpful answers as “bad”.
If you want to follow along with the code in this example, you can find it at
When you create a custom Label Studio template, you can configure it so the exported dataset already includes reward signals: 1 for "good" answers and 0 for "bad". The export then comes with rewards built in, ready to use for RLVR right away. You can build this in the UI or with a few lines of code using our Python SDK.
Label Studio templates are defined in XML. Here’s the custom config we’ll use:
<View>
<Text name="input" value="$prompt"/>
<Text name="output" value="$response"/>
<Choices name="reward" toName="output">
<Choice value="1">Good</Choice>
<Choice value="0">Bad</Choice>
</Choices>
</View>
The <View> tag in this layout acts as the container for the entire labeling interface. The <Text> tags display the input prompt and the model’s output response. The <Choices> tag gives annotators two simple choices: mark the response as Good, which exports a reward of 1, or Bad, which exports a reward of 0.
The benefit of using Label Studio is that you can inspect and edit the XML behind existing templates as well as build new ones from scratch.
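For reference, here is a minimal sketch of the project setup step using the Label Studio Python SDK (the label-studio-sdk package, 1.x series). The URL and API key below are placeholders, and exact import paths and method names may vary slightly between SDK versions.

from label_studio_sdk.client import LabelStudio

LS_URL = "http://localhost:8080"   # placeholder: your Label Studio instance
API_KEY = "your-api-key"           # placeholder: token from your user settings

# The custom labeling config from above.
label_config = """
<View>
  <Text name="input" value="$prompt"/>
  <Text name="output" value="$response"/>
  <Choices name="reward" toName="output">
    <Choice value="1">Good</Choice>
    <Choice value="0">Bad</Choice>
  </Choices>
</View>
"""

ls = LabelStudio(base_url=LS_URL, api_key=API_KEY)
project = ls.projects.create(title="RLVR Scoring", label_config=label_config)
pid = project.id  # used below when importing tasks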
This creates a new project called RLVR Scoring with our custom template using the SDK. You’ll need to provide `LS_URL`, the URL that points to your Label Studio instance, and an `API_KEY`, the access token found in your user settings.
tasks = [
    {
        "prompt": "Solve for x: 2x + 3 = 7",
        "response": "Subtract 3 from both sides: 2x = 4, then divide by 2. So, x = 2."
    },
    {
        "prompt": "What is the derivative of x^2?",
        "response": "2x"
    },
    {
        "prompt": "Calculate the area of a circle with radius 3",
        "response": "To calculate the area of a circle, use the formula Area = π * r^2. With r = 3, Area = π * 3 * 3 = 9π."
    }
]
ls.projects.import_tasks(id=pid, request=tasks)
Each task includes a prompt (the math question) and a response (the tutor bot’s answer).
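Once annotators have labeled these tasks, the reward signal is already part of the project's JSON export. The snippet below is a sketch of how that export could be turned into (prompt, response, reward) triples for RLVR training; it assumes Label Studio's standard JSON export structure for Choices annotations, and the filename export.json is just a placeholder.

import json

# Placeholder filename: use whatever your exported file is called.
with open("export.json") as f:
    exported_tasks = json.load(f)

dataset = []
for task in exported_tasks:
    if not task.get("annotations"):
        continue  # skip tasks that haven't been labeled yet
    # For a Choices tag, the selected option is stored under result -> value -> choices.
    choice = task["annotations"][0]["result"][0]["value"]["choices"][0]
    dataset.append({
        "prompt": task["data"]["prompt"],
        "response": task["data"]["response"],
        "reward": int(choice),  # 1 = Good, 0 = Bad
    })

print(f"{len(dataset)} labeled examples ready for RLVR fine-tuning")

Each entry now pairs the original prompt and response with the numerical reward the annotator assigned, which is the shape of data an RLVR training loop expects.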