NVIDIA's RVT can learn new tasks after just 10 demos
NVIDIA Robotics Research has announced new work that combines text prompts, video input, and simulation to more efficiently teach robots manipulation tasks, such as opening drawers, dispensing soap, or stacking blocks, in real life. Methods for 3D object manipulation generally perform better when they build an explicit 3D representation rather than relying only on camera images. NVIDIA wanted an approach with lower compute costs that was easier to scale than explicit 3D representations such as voxels. To do so, the company used a type of neural network called a multi-view transformer to create virtual views from the camera input. The team's multi-view transformer, the Robotic View Transformer (RVT), is both scalable and accurate. RVT takes camera images and a language description of the task as inputs and predicts the gripper pose action. In simulation, NVIDIA's research team found that a single RVT model can work well across 18 RLBench tasks ...
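The pipeline described above — virtual views re-rendered from camera input, fed together with a language embedding into a transformer that predicts a gripper pose — can be sketched in miniature. Everything below (the three axis-aligned orthographic views, the toy single-head attention, the token layout, and the shapes) is an illustrative assumption, not RVT's actual architecture:

```python
import numpy as np

def orthographic_views(points, image_size=32):
    """Re-render a 3D point cloud (N, 3) into virtual orthographic depth
    images from three axis-aligned viewpoints. This mimics the idea of
    synthesizing virtual views from camera input; the view set and the
    rendering are illustrative assumptions, not RVT's renderer."""
    axes = [(0, 1, 2), (0, 2, 1), (1, 2, 0)]  # (u, v, depth) axis choices
    lo, hi = points.min(0), points.max(0)
    norm = (points - lo) / np.maximum(hi - lo, 1e-8)  # normalize to [0, 1]
    views = []
    for u, v, d in axes:
        img = np.zeros((image_size, image_size))
        ui = np.clip((norm[:, u] * (image_size - 1)).astype(int), 0, image_size - 1)
        vi = np.clip((norm[:, v] * (image_size - 1)).astype(int), 0, image_size - 1)
        np.maximum.at(img, (ui, vi), norm[:, d])  # keep nearest-surface depth
        views.append(img)
    return np.stack(views)  # (3, H, W)

def attention(q, k, v):
    """Scaled dot-product attention, the core mixing step of a transformer."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
cloud = rng.uniform(size=(500, 3))            # stand-in for fused RGB-D points
views = orthographic_views(cloud)             # (3, 32, 32) virtual views
tokens = views.reshape(3, -1)                 # one token per view (toy tokenization)
lang = rng.normal(size=(1, tokens.shape[1]))  # stand-in language-description embedding
x = np.vstack([lang, tokens])                 # joint sequence: language + view tokens
out = attention(x, x, x)                      # transformer-style cross-view mixing
pose_logits = out[0]                          # pose prediction read off one token
```

In a real system the pose logits would be decoded into a 6-DoF gripper action and the attention block would be a full multi-layer transformer; the sketch only shows how multiple virtual views and a language token can be mixed in one sequence.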