Photo credit: www.therobotreport.com
Figure AI Inc. has unveiled a practical application for its humanoid robots and Helix model: sorting packages within logistics operations.
According to the Sunnyvale, California-based firm, this task necessitates human-like speed, precision, and adaptability, thereby advancing the field of learned manipulation in robotics.
Central to this development is Helix, Figure’s proprietary vision-language-action (VLA) model. The recently announced framework integrates perception, language comprehension, and learned control into a single model.
Addressing the Logistics Challenge
Handling and sorting packages is a crucial function in the logistics sector, often involving the transfer of items from one conveyor belt to another while ensuring accurate label orientation for effective scanning. This task poses several challenges, as highlighted by Figure AI. Packages vary widely in size, shape, weight, and rigidity—ranging from rigid boxes to soft bags—making simulation and replication difficult.
The Figure 02 humanoid robot is designed to determine the most effective moments and methods for grasping these moving packages and reorienting them for label visibility. Additionally, it must monitor the flow of numerous items on a constantly moving conveyor, all while maintaining high processing speeds.
Given the unpredictable nature of the environment, the system is also programmed for self-correction. Overcoming these challenges is not only integral to Figure’s business model but has also contributed to significant enhancements in Helix System 1 that benefit various other applications, according to the company.
Advancements in Visual Recognition
Figure AI asserts that its system has gained a sophisticated 3D comprehension of its surroundings, facilitating more accurate depth-aware movements. Previously reliant on monocular visual input, the updated System 1 utilizes a stereo vision backbone combined with a multiscale feature extraction network to effectively capture complex spatial hierarchies.
Instead of processing image features from each camera separately, the system merges inputs from both cameras within a multiscale stereo framework before tokenization. This approach ensures that the overall count of visual tokens fed into Figure’s cross-attention transformer remains constant, thus alleviating computational strain.
The incorporation of multiscale features enables the system to discern subtle details as well as broader contextual signals, thereby enhancing control reliability based on visual inputs, according to Figure’s findings.
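As a toy illustration of why merging before tokenization matters, the sketch below contrasts per-camera tokenization with a per-scale stereo merge. All shapes, the fusion rule, and the token counts are our own assumptions, not Figure's architecture:

```python
# Toy illustration: merging the two camera views per pyramid scale before
# tokenization keeps the visual token budget constant, whereas tokenizing
# each camera separately would double it.

def tokenize(pyramid, tokens_per_scale=4):
    # emit a fixed number of tokens per pyramid scale; only the count
    # matters for this illustration
    return [(scale, i) for scale in range(len(pyramid))
            for i in range(tokens_per_scale)]

def separate_streams(left, right):
    # naive approach: tokenize each camera independently, then concatenate
    return tokenize(left) + tokenize(right)

def merged_stereo(left, right):
    # stereo merge: fuse the two views per scale (an elementwise average
    # stands in for a learned stereo block), then tokenize once
    fused = [[(a + b) / 2 for a, b in zip(fl, fr)]
             for fl, fr in zip(left, right)]
    return tokenize(fused)

# a three-scale feature pyramid per camera, with tiny feature vectors
left  = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
right = [[1.2, 1.8], [3.1, 3.9], [5.2, 5.8]]

print(len(separate_streams(left, right)), len(merged_stereo(left, right)))
# 24 vs 12 tokens entering the cross-attention transformer
```

The fixed token count is what keeps the transformer's compute budget unchanged even though two cameras now feed the model.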
Scaling Deployments Effectively
To deploy a uniform operational policy across numerous robots, it is critical to address distribution shifts resulting from hardware variations among individual units. These variations stem from differences in sensor calibration—which affect input observations—and joint response characteristics—which impact action execution. Failure to compensate for these differences can hinder policy performance.
Given the complexity of the upper-body action space, traditional calibration methods cannot scale effectively across a fleet of robots. As a solution, Figure trains a visual proprioception model capable of estimating the 6D poses of end effectors using solely onboard visual input.
This “self-calibration” process facilitates the transfer of policy insights across robots with minimal downtime. By leveraging the learned calibration alongside the visual proprioception module, Figure has been able to implement the same operational policy, initially developed using data from a single robot, across multiple units. This consistency helps maintain high levels of manipulation performance, despite variances in sensor calibration and minor hardware discrepancies.
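A highly simplified, translation-only sketch of the self-calibration idea follows; the data layout and function names are invented, and Figure's actual module estimates full 6D end-effector poses from onboard vision:

```python
# Hypothetical sketch: estimate a per-robot offset by comparing a
# vision-based end-effector position estimate against the position reported
# by that robot's kinematics, then apply it so one shared policy sees
# observations in a common canonical frame.

def estimate_offset(visual_poses, kinematic_poses):
    # average discrepancy between vision-based and kinematics-based
    # end-effector positions for one robot
    n = len(visual_poses)
    return [sum(v[i] - k[i] for v, k in zip(visual_poses, kinematic_poses)) / n
            for i in range(3)]

def calibrated_observation(kinematic_pose, offset):
    # shift this robot's proprioception into the policy's canonical frame
    return [p + o for p, o in zip(kinematic_pose, offset)]

# toy data: this unit's kinematics read about 2 cm low along x
visual    = [[0.52, 0.10, 0.30], [0.62, 0.20, 0.30]]
kinematic = [[0.50, 0.10, 0.30], [0.60, 0.20, 0.30]]

offset = estimate_offset(visual, kinematic)
corrected = calibrated_observation([0.70, 0.25, 0.30], offset)
print([round(c, 3) for c in corrected])  # [0.72, 0.25, 0.3]
```

The point of the scheme is that the correction is learned from the robot's own cameras, so no manual calibration rig is needed when a new unit joins the fleet.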
Enhancing Data Utilization and Manipulation Speed
Figure AI has made a concerted effort to ensure the quality of data sourced from human demonstrations, filtering out slower or failed attempts while retaining those that included corrective actions prompted by environmental variability. Working alongside teleoperators has further refined and standardized manipulation strategies, contributing to tangible advancements.
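The curation logic described above can be sketched as a simple filter; the field names and thresholds here are our assumptions, not Figure's pipeline:

```python
# Illustrative filter: keep successful demonstrations, drop unusually slow
# ones, but retain slow attempts that contain corrective actions prompted
# by environmental variability.

demos = [
    {"success": True,  "duration_s": 4.1, "has_correction": False},
    {"success": False, "duration_s": 3.0, "has_correction": False},  # failed
    {"success": True,  "duration_s": 9.5, "has_correction": False},  # too slow
    {"success": True,  "duration_s": 8.8, "has_correction": True},   # corrective
]

def curate(demos, max_duration_s=6.0):
    return [d for d in demos
            if d["success"]
            and (d["duration_s"] <= max_duration_s or d["has_correction"])]

print(len(curate(demos)))  # 2 of 4 demonstrations survive curation
```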
In pursuit of surpassing human agility in manipulation, Figure developed a simple but effective technique that boosts test-time performance. Dubbed “Sport Mode,” it interpolates the action outputs of Figure’s System 1 policies, which generate action “chunks” executed at a 200Hz control rate.
To run a policy 20% faster, for example, the system linearly resamples an action chunk into a shorter one and executes that shorter trajectory at the same 200Hz control rate, with no change to the training framework.
“Sport Mode” can yield speed increases of up to 50%; beyond that threshold, however, performance degrades, and imprecise actions force more frequent system resets. Even at 50% acceleration, the system still outperformed the expert training data, handling objects more efficiently.
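The resampling idea can be sketched as a one-dimensional linear interpolation; the function and details below are our own minimal reconstruction, since Figure has not published the implementation:

```python
# Minimal sketch of the interpolation behind "Sport Mode": linearly
# resample an action chunk to fewer steps, then play the shorter chunk
# back at the same fixed control rate so the motion completes sooner.

def resample_chunk(chunk, speedup):
    """Linearly resample a 1-D action chunk to ~len(chunk)/speedup steps."""
    n_out = max(2, round(len(chunk) / speedup))
    out = []
    for i in range(n_out):
        # map each output index back onto the original chunk's timeline
        t = i * (len(chunk) - 1) / (n_out - 1)
        lo = int(t)
        hi = min(lo + 1, len(chunk) - 1)
        frac = t - lo
        out.append(chunk[lo] * (1 - frac) + chunk[hi] * frac)
    return out

chunk = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # e.g. one joint's targets
fast = resample_chunk(chunk, speedup=1.2)          # ~20% faster playback
print(len(chunk), len(fast))  # 8 -> 7 steps at the same fixed control rate
```

Because only the executed trajectory is shortened, the trained policy itself is untouched; the speedup is purely a test-time knob.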
Insights from Helix’s Performance
Figure AI measures performance with normalized effective throughput (T_eff), which indicates package-handling speed relative to the expert data used for training. Values above 1.0 denote manipulation speeds exceeding those of the training trajectories. The company reports considerable improvements in system performance, particularly from the integration of multiscale feature extraction and stereo input.
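One plausible reading of such a normalized metric is the ratio of the policy's handling rate to the expert demonstrators' rate; Figure has not published the exact formula, so the definition below is our assumption:

```python
# Hedged sketch of a normalized throughput metric: the policy's
# packages-per-minute divided by the expert data's packages-per-minute.

def effective_throughput(policy_packages, policy_minutes,
                         expert_packages, expert_minutes):
    policy_rate = policy_packages / policy_minutes  # packages per minute
    expert_rate = expert_packages / expert_minutes
    return policy_rate / expert_rate

# e.g. 33 packages in 10 min vs. 30 in 10 min for the expert data
print(round(effective_throughput(33, 10, 30, 10), 2))  # 1.1
```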
For instance, the stereo model boasts a 60% increase in throughput compared to systems lacking stereo capabilities. Additionally, the enhanced model can generalize its operations to flat envelopes that were not part of its training regimen.
Figure’s research highlights the importance of data quality over sheer quantity: a model trained on curated, high-quality demonstrations achieved 40% higher throughput despite using one-third less training data.
In summary, Figure AI’s developments showcase how the combination of high-quality datasets and architectural advancements like stereo multiscale vision and real-time calibration can lead to increased dexterity in robotic manipulation within real-world logistics environments. The progress made with Helix underscores the potential for scaling visuo-motor policy applications in complex industrial settings that demand both speed and precision.
Source
www.therobotreport.com