By Andre He, Vivek Myers
The pursuit of developing robots capable of understanding and executing human tasks has been a prominent endeavor in the domain of robot learning. Natural language offers a user-friendly method for humans to articulate tasks, yet training robots to effectively interpret such instructions remains a complex challenge. One method, known as language-conditioned behavioral cloning (LCBC), trains robotic policies to replicate expert actions in alignment with specific language cues. However, this approach necessitates an extensive amount of human-annotated data and exhibits limited generalization across varying scenes and behaviors. Conversely, contemporary goal-conditioned methods have shown enhanced performance in manipulation tasks but fall short in providing an intuitive way for humans to communicate tasks. The question then arises: how can we merge the intuitive task specification found in LCBC with the superior performance derived from goal-conditioned learning?
To successfully follow instructions, a robot must possess two essential capabilities: the ability to contextualize language instructions within its physical environment and the competence to execute a sequence of actions to fulfill the specified task. These capabilities do not have to be developed solely from human-annotated data; instead, they can be cultivated independently from appropriate data sources. Utilizing vision-language data obtained from non-robotic environments can aid in teaching language grounding, promoting generalization across various instructions and visual contexts. Additionally, unannotated robot trajectories can facilitate the training of robots to attain specific goal states, even in the absence of corresponding language instructions.
By conditioning on visual goals, such as goal images, we enhance policy learning in complementary ways. Goals serve as a scalable form of task specification, as they can be generated retrospectively (any reached state can become a goal). This makes it possible to train policies with goal-conditioned behavioral cloning (GCBC) on large datasets of unannotated and unstructured trajectories, including autonomously collected data. Goals are also easier to ground: as images, they can be compared with other states directly, pixel by pixel.
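The "any reached state can become a goal" property is what makes goal supervision so cheap to obtain. Below is a minimal sketch of hindsight goal relabeling under that idea; the data layout and field names are illustrative and not the actual Bridge-v2 pipeline.

```python
# A minimal sketch of hindsight goal relabeling: any state reached later in a
# trajectory can be treated as the goal image for the current step, so
# unannotated trajectories yield (observation, goal, action) training tuples.
import random


def relabel_with_goals(trajectory):
    """trajectory: list of (observation_image, action) pairs in time order."""
    examples = []
    for t, (obs, act) in enumerate(trajectory):
        goal_t = random.randint(t, len(trajectory) - 1)   # any future state
        goal_image = trajectory[goal_t][0]
        examples.append({"observation": obs, "goal_image": goal_image, "action": act})
    return examples
```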
Nonetheless, conveying goals to human users may be less intuitive than using natural language. Typically, articulating a desired task through words can be simpler than supplying a goal image, as generating such an image might necessitate executing the task first. By integrating a language interface into goal-conditioned policies, we can harmonize the advantages of both task specification methods, ultimately enabling the command of versatile robots with ease. Our proposed method, detailed below, introduces such an interface, capable of generalizing across a variety of instructions and settings by leveraging vision-language data while enhancing physical skills utilizing extensive unstructured robot datasets.
Goal representations for instruction following
The GRIF model consists of a language encoder, a goal encoder, and a policy network. The encoders map language instructions and goal images into a shared task representation space, which conditions the policy network when predicting actions. The model can be conditioned on either language instructions or goal images, but we primarily use goal-conditioned training as a way of improving the language-conditioned use case.
Our methodology, known as Goal Representations for Instruction Following (GRIF), employs a joint training approach for policies that are conditioned on both language and goals with congruent task representations. Our pivotal discovery is that aligning these representations across the language and goal domains enables an effective fusion of the benefits stemming from goal-conditioned learning and language-specific policies. Following training predominantly on unlabeled demonstration data, the learned policies are capable of effectively generalizing across different languages and scenes.
The GRIF model was trained on a version of the Bridge-v2 dataset containing 7,000 labeled demonstration trajectories and 47,000 unlabeled ones in a kitchen manipulation setting. Being able to use the 47,000 unannotated trajectories directly, without the costly manual annotation process, significantly improves training efficiency.
To learn from both the labeled and unlabeled datasets, GRIF is trained jointly with language-conditioned behavioral cloning (LCBC) and goal-conditioned behavioral cloning (GCBC). The labeled dataset contains both language and goal task specifications, so we use it to supervise predictions in both modalities. The unlabeled dataset contains only goals and is used for GCBC alone. The difference between LCBC and GCBC comes down to selecting the task representation from the corresponding encoder, which is passed into a shared policy network to predict actions.
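That setup corresponds roughly to the training objective sketched below. This is a minimal sketch, assuming continuous actions supervised with a squared error; the module names, batch fields, and the MSE loss are illustrative assumptions, not the authors' actual implementation (which may, for example, use an action distribution and a log-likelihood loss).

```python
# A sketch of joint LCBC/GCBC behavioral cloning with a shared policy network.
# All module and batch-field names are illustrative.
import torch
import torch.nn.functional as F


def joint_bc_loss(language_encoder, goal_encoder, policy, labeled_batch, unlabeled_batch):
    """Behavioral cloning on both modalities with a shared policy network."""
    # Labeled data: supervise actions from both the language task representation
    # (LCBC) and the goal-image task representation (GCBC).
    z_lang = language_encoder(labeled_batch["instruction"])
    z_goal = goal_encoder(labeled_batch["observation"], labeled_batch["goal_image"])

    lcbc_loss = F.mse_loss(policy(labeled_batch["observation"], z_lang),
                           labeled_batch["action"])
    gcbc_loss = F.mse_loss(policy(labeled_batch["observation"], z_goal),
                           labeled_batch["action"])

    # Unlabeled data: goals can always be relabeled from reached states,
    # so only the goal-conditioned term is available here.
    z_goal_u = goal_encoder(unlabeled_batch["observation"], unlabeled_batch["goal_image"])
    gcbc_loss_u = F.mse_loss(policy(unlabeled_batch["observation"], z_goal_u),
                             unlabeled_batch["action"])

    return lcbc_loss + gcbc_loss + gcbc_loss_u
```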
By utilizing a shared policy network, we anticipate some improvement from the unlabeled dataset for goal-conditioned training. However, GRIF significantly enhances the transfer of learning between the two modalities by recognizing that certain language instructions and goal images dictate similar behaviors. This structure is leveraged by ensuring that the language and goal representations closely align for identical semantic tasks. Such structural consistency implies that unlabeled data may also augment the language-conditioned policy, as the goal representation approximates the absent instruction.
Alignment through contrastive learning
We establish direct alignment between goal-conditioned and language-conditioned tasks within the labeled dataset via contrastive learning methods.
Given that language often describes relative transformations, we focus on aligning representations of state-goal pairs alongside language instructions rather than purely correlating goals with language. This strategy proves beneficial as it allows the representations to hone in on the critical changes from initial state to goal, often omitting less relevant details present in the images.
This alignment is learned with an infoNCE objective on the instructions and images from the labeled dataset. We jointly train an image encoder and a text encoder with a contrastive loss over matching pairs of language and goal representations: the objective encourages high similarity between representations of the same task and low similarity for distinct tasks, using representations from other trajectories as negative examples.
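A minimal sketch of such an infoNCE objective with in-batch negatives is shown below; the temperature value and the symmetric two-direction form are illustrative assumptions, and the paper defines the exact objective.

```python
# A sketch of a symmetric infoNCE loss over aligned task representations.
import torch
import torch.nn.functional as F


def infonce_alignment_loss(z_lang, z_goal, temperature=0.1):
    """z_lang: (B, D) instruction embeddings; z_goal: (B, D) (state, goal) embeddings.
    Matching pairs share a row index; all other rows in the batch act as negatives."""
    z_lang = F.normalize(z_lang, dim=-1)
    z_goal = F.normalize(z_goal, dim=-1)

    logits = z_lang @ z_goal.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(z_lang.shape[0], device=z_lang.device)

    # Pull matching (instruction, state-goal) pairs together and push apart
    # mismatched pairs, symmetrically in both directions.
    loss_l2g = F.cross_entropy(logits, labels)
    loss_g2l = F.cross_entropy(logits.T, labels)
    return 0.5 * (loss_l2g + loss_g2l)
```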
Utilizing straightforward negative sampling often leads to learned representations that overlook the actual task in favor of merely aligning instructions and goals associated with the same settings. For practical application, it is more advantageous for the policy to distinguish different tasks within an identical scene rather than merely associating language with a specific scene. To refine our approach, we implement a hard negative sampling strategy, sourcing up to half of the negative examples from various trajectories occurring in the same scene.
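One simple way to realize this, sketched below, is to build each contrastive batch so that up to half of the non-matching examples come from other trajectories in the same scene as the anchor. The "scene_id" field and the scene_index mapping are assumed, illustrative data-layout choices.

```python
# A sketch of hard-negative batch construction: up to half of each contrastive
# batch is drawn from other trajectories in the same scene as the anchor, so the
# loss must distinguish tasks rather than scenes.
import random


def sample_contrastive_batch(dataset, scene_index, batch_size=64, hard_fraction=0.5):
    """dataset: list of labeled examples; scene_index: scene_id -> example indices."""
    anchor_idx = random.randrange(len(dataset))
    same_scene = [i for i in scene_index[dataset[anchor_idx]["scene_id"]]
                  if i != anchor_idx]

    n_hard = min(int(batch_size * hard_fraction), len(same_scene))
    hard = random.sample(same_scene, n_hard)
    # Fill the rest of the batch from the whole dataset (rare collisions with
    # the anchor are ignored in this sketch).
    easy = random.sample(range(len(dataset)), batch_size - 1 - n_hard)

    return [dataset[i] for i in [anchor_idx] + hard + easy]
```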
This contrastive learning configuration hints at the potential of pre-trained vision-language models like CLIP, known for their robust zero-shot and few-shot generalization capabilities in vision-language tasks, and for effectively harnessing knowledge from large-scale pre-training datasets. However, conventional vision-language models primarily align static images with their captions rather than comprehending environmental changes, often struggling to focus on a single object amid cluttered scenes.
To overcome these limitations, we devised a mechanism to adapt and fine-tune CLIP for aligning task representations. By modifying the CLIP architecture to accept a pair of images stacked channel-wise (early fusion), we obtain an effective initialization for encoding pairs of state and goal images while retaining the benefits of CLIP's pre-training.
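A rough sketch of this kind of early-fusion adaptation is shown below, using OpenAI's CLIP ViT as an example backbone. The visual.conv1 attribute matches that particular implementation, and the halved duplicate-weight initialization is an illustrative choice, not necessarily the exact scheme used in GRIF.

```python
# A sketch of channel-wise early fusion: expand CLIP's patch-embedding layer to
# accept a stacked (state, goal) image pair while reusing pre-trained weights.
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

model, _ = clip.load("ViT-B/32")
old_conv = model.visual.conv1                      # Conv2d(3, width, patch, patch)

new_conv = nn.Conv2d(6, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     bias=False)
with torch.no_grad():
    # Initialize both the state and goal channel groups from the pre-trained
    # RGB weights (halved so activations start at a similar scale).
    new_conv.weight.copy_(torch.cat([old_conv.weight, old_conv.weight], dim=1) / 2)
model.visual.conv1 = new_conv

# A (state, goal) pair is then stacked channel-wise before encoding, e.g.:
# x = torch.cat([state_rgb, goal_rgb], dim=1)      # (B, 6, H, W)
# z_goal = model.visual(x.type(model.dtype))
```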
Robot policy results
In our primary evaluation, we tested the GRIF policy in real-world scenarios across 15 tasks within three distinct settings. The instructions selected included a combination of familiar directives present in the training data as well as novel commands demanding a degree of compositional generalization. One of the scenes incorporated an unfamiliar array of objects.
We benchmarked GRIF against plain LCBC and against stronger baselines inspired by previous work, LangLfP and BC-Z. LLfP denotes our LangLfP-style baseline, which jointly trains LCBC and GCBC. BC-Z adapts that method to our setting by training on LCBC, GCBC, and a simple alignment term that minimizes the cosine distance between task representations, without using image-language pre-training.
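For reference, that simple alignment term amounts to something like the sketch below; how the term is weighted against the behavioral cloning losses is an assumption left out here.

```python
# A sketch of the BC-Z-style alignment term: the cosine distance between the
# language and goal task representations of the same labeled example, with no
# contrastive negatives and no vision-language pre-training.
import torch.nn.functional as F


def cosine_alignment_loss(z_lang, z_goal):
    """z_lang, z_goal: (B, D) task representations for matching labeled examples."""
    return (1.0 - F.cosine_similarity(z_lang, z_goal, dim=-1)).mean()
```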
During testing, the policies were susceptible to two significant failure modes. The first was misinterpreting the language instruction, which led to either attempting an irrelevant task or taking no useful action at all. When language grounding was weak, a policy would sometimes even start an unintended task after successfully completing the original one, since the instruction was no longer grounded correctly.
Examples of grounding failures
“put the mushroom in the metal pot”
“put the spoon on the towel”
“put the yellow bell pepper on the cloth”
“put the yellow bell pepper on the cloth”
The other prevalent failure mode related to object manipulation. This could stem from failing to grasp objects adequately, executing movements with poor precision, or releasing objects prematurely. It is crucial to note that these issues are not fundamental shortcomings of the robotic setup, as a GCBC policy trained on the complete dataset demonstrates consistent success in manipulation tasks. Instead, these failure modes often indicate ineffective application of goal-conditioned training data.
Examples of manipulation failures
“move the bell pepper to the left of the table”
“put the bell pepper in the pan”
“move the towel next to the microwave”
Comparing the baselines, each was susceptible to these failure modes to a different degree. LCBC relies solely on the small labeled trajectory dataset, and its weak manipulation capability prevented it from completing any tasks. LLfP, which jointly trains on labeled and unlabeled data, improved significantly over LCBC in manipulation but still failed to ground more complex instructions. BC-Z's alignment strategy also improved manipulation, likely because it facilitates transfer between the two modalities; however, without external vision-language data, it still struggled to generalize to new instructions.
In contrast, GRIF outperformed others in generalization while simultaneously exhibiting robust manipulation skills. It demonstrated the ability to accurately ground language commands and execute tasks across diverse scenarios, even amidst a range of possible actions within a single environment. Below are examples of rollout tasks executed by GRIF.
Policy Rollouts from GRIF
“move the pan to the front”
“put the bell pepper in the pan”
“put the knife on the purple cloth”
“put the spoon on the towel”
Conclusion
GRIF empowers robots to leverage substantial amounts of unlabeled trajectory data to develop goal-conditioned policies while furnishing a corresponding “language interface” through aligned task representations of language and goals. Unlike previous language-image alignment techniques, our approach aligns changes in state with corresponding language, yielding considerable advancements over traditional CLIP-based image-language alignment objectives. Our experimental findings indicate that this methodology effectively capitalizes on unlabeled robotic trajectories, resulting in significant performance improvements compared to baseline models and those solely utilizing language-annotated data.
Nonetheless, our method possesses certain limitations that merit consideration for future research endeavors. For instance, GRIF may not be optimally equipped for tasks where instructions delineate the method of execution rather than the end goal (e.g., “pour the water slowly”), which could necessitate alternative alignment losses that account for intermediate execution steps. Furthermore, GRIF assumes that all language grounding is derived from either fully annotated sections or from pre-trained vision-language models. A promising direction for future exploration would be to expand our alignment loss to incorporate human-generated video data, enhancing the learning of intricate semantics from extensive datasets. Such advancements could further optimize grounding of language outside the confines of the robot dataset, facilitating the development of universally adaptable robot policies that efficiently execute detailed user instructions.
This post is based on the following paper:
BAIR Blog is the official blog of the Berkeley Artificial Intelligence Research (BAIR) Lab.