Researchers from Stanford University and Google DeepMind have introduced a novel technique known as Step-Wise Reinforcement Learning (SWiRL). This method aims to significantly improve the capabilities of large language models (LLMs), particularly in handling intricate tasks that necessitate multi-step reasoning and the use of various tools.
As the integration of AI agents and LLMs into various industries accelerates, SWiRL stands to offer notable advantages for businesses intending to incorporate advanced reasoning models into their operations.
The Challenge of Multi-Step Problems
Many real-world business applications are not single-shot tasks but processes that unfold over multiple stages. Planning a detailed marketing campaign, for instance, may involve conducting market research, analyzing internal data, calculating budgets, and reviewing customer support inquiries. Completing such tasks typically requires online research, access to proprietary databases, and code execution.
Conventional reinforcement learning (RL) techniques employed for refining LLMs, such as Reinforcement Learning from Human Feedback (RLHF) and RL from AI Feedback (RLAIF), predominantly focus on enhancing models for singular-step reasoning tasks.
According to the principal authors of the SWiRL study, Anna Goldie from Google DeepMind and Azalia Mirhoseini from Stanford University, traditional training approaches for LLMs fall short in effectively addressing the multi-step reasoning required for real-world applications.
As they put it, “LLMs trained via traditional methods typically struggle with multi-step planning and tool integration, meaning they have difficulties performing tasks that require retrieving documents from various sources or engaging in multiple reasoning and arithmetic calculations.”
Understanding Step-Wise Reinforcement Learning (SWiRL)
SWiRL aims to overcome the multi-step challenge through a combination of generating synthetic data and using a tailored RL approach. This technique facilitates the training of models on complete action sequences.
In their research paper, the team explains, “Our goal is to instruct the model on how to break down complex problems into a series of manageable subtasks, determine when to utilize a tool, formulate requests accurately, make effective use of the results, and synthesize the findings appropriately.”
SWiRL operates in two stages. First, it generates and filters large quantities of multi-step reasoning and tool-use data. Second, it applies a step-wise RL algorithm to fine-tune a base LLM on these generated trajectories.
The authors highlight a key practical benefit of this approach: “We can swiftly generate large volumes of multi-step training data through parallel calls, preventing delays in the training process due to slow tool execution. Furthermore, this offline method enhances reproducibility since it utilizes a consistent dataset.”
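As a rough illustration of that offline, parallelizable setup, the sketch below fans rollout calls out across worker threads. The `generate_trajectory` placeholder, the worker count, and the record format are illustrative assumptions rather than details from the paper.

```python
# Illustrative sketch only: generate_trajectory is a hypothetical stand-in
# for a full rollout like the one sketched in the next section.
from concurrent.futures import ThreadPoolExecutor

def generate_trajectory(question: str) -> list[dict]:
    # Placeholder rollout: in practice this alternates reasoning steps,
    # tool calls, and a final answer for the given question.
    return [{"type": "final_answer", "content": "..."}]

def generate_dataset(questions: list[str], num_workers: int = 16) -> list[list[dict]]:
    """Generate trajectories in parallel so that slow tool execution
    (search, calculators) does not bottleneck offline data collection."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(generate_trajectory, questions))
```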
Creating Training Data
SWiRL data generation process Credit: arXiv
The data generation phase is essential for SWiRL’s learning process. An LLM is provided access to specific tools, such as calculators or search engines, and is prompted in a manner that encourages it to create a “trajectory”—a structured sequence of steps aimed at solving a problem. During this process, the model can either provide its internal reasoning, invoke a tool, or produce a final result. If it opts for a tool call, the generated query is executed, and the outcome is reintroduced into the model’s context for subsequent steps. This continues until a conclusive answer is achieved.
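A minimal sketch of one such rollout, under assumed interfaces, might look like the following; `llm_generate`, `run_tool`, the step dictionary format, and the step cap are stand-ins for illustration, not the authors' implementation.

```python
def generate_trajectory(question, llm_generate, run_tool, max_steps=10):
    """Roll out one trajectory: the model reasons, optionally calls a tool,
    sees the tool result, and eventually emits a final answer."""
    context = [{"role": "user", "content": question}]
    trajectory = []
    for _ in range(max_steps):
        step = llm_generate(context)  # reasoning step, tool call, or final answer
        trajectory.append(step)
        context.append({"role": "assistant", "content": step["content"]})
        if step["type"] == "tool_call":
            result = run_tool(step["tool"], step["query"])
            # Feed the tool output back into the model's context for the next step.
            context.append({"role": "tool", "content": result})
            trajectory.append({"type": "tool_result", "content": result})
        elif step["type"] == "final_answer":
            break
    return trajectory
```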
Each trajectory—from the initial question to the final output—is deconstructed into overlapping sub-trajectories. Each sub-trajectory highlights the process leading up to a specific action, hence offering a detailed view of the model’s reasoning step-by-step. To assemble substantial datasets, the research team utilized multi-hop question-answering examples (HotPotQA) and math problem-solving benchmarks (GSM8K), resulting in tens of thousands of trajectories.
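Assuming the simple step-list format sketched above, the decomposition into overlapping sub-trajectories could be expressed roughly as follows, with each prefix of the trajectory becoming a training example whose target is the next action.

```python
def split_into_subtrajectories(question, trajectory):
    """Expand one trajectory into overlapping prefixes: example k pairs the
    question plus the first k steps with step k as the action to predict."""
    examples = []
    for k in range(len(trajectory)):
        examples.append({
            "context": [question] + trajectory[:k],  # everything seen so far
            "target_action": trajectory[k],          # the next step to learn
        })
    return examples
```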
The researchers also evaluated four distinct data filtering techniques: unfiltered, outcome-based filtering (correctness of the final answer), process filtering (judging the reasonableness of each step), and a combination of both process and outcome filters.
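As a rough sketch of what those four regimes might look like in code, the function below keeps or drops each record accordingly; `judge_step`, `judge_outcome`, and the record fields are hypothetical stand-ins for an LLM judge and an answer checker, not the paper's implementation.

```python
def filter_trajectories(records, mode, judge_step, judge_outcome):
    """Keep records according to one of the four filtering regimes:
    'none', 'outcome', 'process', or 'process_and_outcome'."""
    kept = []
    for rec in records:
        steps_ok = all(judge_step(step) for step in rec["trajectory"])
        outcome_ok = judge_outcome(rec["final_answer"], rec["reference_answer"])
        keep = {
            "none": True,
            "outcome": outcome_ok,
            "process": steps_ok,  # ignores whether the final answer is right
            "process_and_outcome": steps_ok and outcome_ok,
        }[mode]
        if keep:
            kept.append(rec)
    return kept
```

In this framing, process filtering keeps every record whose intermediate steps are judged reasonable, regardless of whether the final answer checks out, which is the setting the authors found most effective.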
In contrast to many conventional methodologies, which rely heavily on exact “golden labels” and often discard data that leads to incorrect final answers, SWiRL achieved its best results with process-filtered data. This indicates that even when the end result was incorrect, the model could still learn valuable insights from intermediate steps that were logically sound.
The team concluded that “SWiRL can learn from trajectories that result in incorrect outcomes. In fact, we find our optimal results arise when incorporating process-filtered data, no matter the final answer’s correctness.”
Training LLMs Utilizing SWiRL
SWiRL training process Credit: arXiv
During the second phase, SWiRL applies reinforcement learning to further train the foundational LLM on the synthetic trajectories produced earlier. At each step, the model is refined to predict the next appropriate action—be it a reasoning step, tool invocation, or the final answer—based on the prior context.
A separate generative reward model provides feedback during each step, evaluating the model’s generated actions in relation to the context preceding them.
The researchers note, “Our detailed, iterative finetuning method allows the model to grasp both local decision-making and the broader trajectory optimization, all while receiving prompt feedback about the accuracy of each prediction.”
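Conceptually, each training example pairs a context with the next action, and the reward model scores that action in context. The sketch below renders this as a simple policy-gradient-style update; `policy_logprob`, `reward_model_score`, and `update_policy` are hypothetical stand-ins, and the actual optimization recipe in the paper may differ.

```python
def swirl_update(batch, policy_logprob, reward_model_score, update_policy):
    """One step-wise update: reinforce each predicted action in proportion
    to the per-step reward it receives from the generative reward model."""
    losses = []
    for example in batch:
        context, action = example["context"], example["target_action"]
        reward = reward_model_score(context, action)   # step-level feedback
        log_prob = policy_logprob(context, action)     # model's log-likelihood
        losses.append(-reward * log_prob)              # REINFORCE-style term
    loss = sum(losses) / len(losses)
    update_policy(loss)  # backprop and optimizer step happen here
    return loss
```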
When a SWiRL-trained model is put to use, it follows a similar iterative approach: it receives a prompt, generates a response, and if it produces a tool call, the system executes it and reintroduces the outcome into the model’s context. This process continues until a final answer is produced or a limit on steps is reached.
Goldie and Mirhoseini commented, “By instructing the model to take reasonable steps consistently, we aim to mitigate a core limitation exhibited by traditional LLMs—namely, their fragility when facing complex, multi-step challenges, particularly as the likelihood of successful outcomes diminishes exponentially with increasing task complexity. Effective enterprise AI will ultimately need to incorporate a diverse array of tools, intertwining them into elaborate sequences.”
SWiRL in Practice
The team from Stanford and Google DeepMind assessed SWiRL's performance on several challenging multi-step question-answering and mathematical reasoning tasks. Their results showed significant accuracy improvements over baseline models, ranging from 11% to more than 21% on datasets such as GSM8K, HotPotQA, MuSiQue, and BeerQA.
The experiments indicated that training a Gemma 2-27B model with SWiRL leveraging process-filtered data yielded superior outcomes, surpassing models trained with outcome-filtered data or through traditional Supervised Fine-Tuning (SFT). This result implies that SWiRL is more effective in grasping the underlying reasoning processes rather than merely recalling correct answer paths, thus enhancing performance on previously unseen problems.
Furthermore, SWiRL displayed impressive generalization capabilities. For instance, a model trained using SWiRL on text-based question-answering made notable strides in mathematical reasoning tasks, despite the absence of direct training in that area.
This ability to adapt across varied tasks and tool categories is increasingly valuable in the burgeoning landscape of agentic applications for language models. Techniques that foster generalization across different datasets and tasks are likely to simplify, expedite, and reduce costs in adapting to new circumstances.
Goldie and Mirhoseini remarked, “While our findings indicate that SWiRL’s generalization is robust within the domains we examined, exploring its effectiveness in new areas like coding would be intriguing. Our results suggest that an enterprise AI model developed using SWiRL for a primary task would likely exhibit considerable performance gains in otherwise unrelated tasks without requiring specific fine-tuning for each task. SWiRL appears to generalize more effectively with larger models, indicating its potential impact will grow as foundational capabilities advance.”
Source: venturebeat.com