Collaborative Robotics · Winter 2026

Bimanual Granular Scooping

A bimanual mobile robot that perceives objects through RGB-D sensing, plans grasps in 3D, navigates autonomously, and scoops granular materials using coordinated two-arm control.

Alaz Cig·Edward Lee·Haoyue Xiao·Jeffery Cai·Pengyu Mo·Saimai Lau·Zeyi Liu·Zhanyi Sun

Listed alphabetically

Overview

We demonstrate object manipulation, navigation-integrated grasping, and coordinated bimanual candy scooping with tool use, in simulation and in the real world.

Task 3. Bimanual candy scooping in simulation.

Tasks 1 & 2. Manipulation and locomotion in simulation.

Tasks 1 & 2. Manipulation and locomotion with the real robot.

System Architecture

The control architecture connects perception, navigation, grasp planning, and a high-level controller. An initial state-machine approach was augmented with an LLM-powered controller (Claude Opus 4.6) for generalizable voice-driven interaction.
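
The orchestration split between the rule-based and LLM-driven controllers can be sketched as a small action dispatcher. The action names, registry, and `dispatch` helper below are hypothetical stand-ins; in the real system these are ROS 2 actions and the intent comes from Claude Opus parsing a transcribed voice command.

```python
# Hypothetical action registry: each verb maps to an executor call.
# In the deployed system these would be ROS 2 action clients.
ACTIONS = {
    "pick": lambda obj: f"grasp({obj})",
    "place": lambda obj: f"dropoff({obj})",
    "goto": lambda obj: f"navigate({obj})",
}

def dispatch(intent: dict) -> str:
    """Map a parsed intent (as the LLM would return it) to an executor call."""
    verb, target = intent["verb"], intent["object"]
    if verb not in ACTIONS:
        raise ValueError(f"unknown verb: {verb}")
    return ACTIONS[verb](target)

# Here the intent is stubbed by hand; the real pipeline fills it from the
# LLM given the transcribed voice command.
print(dispatch({"verb": "pick", "object": "banana"}))  # grasp(banana)
```

The state machine and the LLM controller can share this dispatch layer: the former selects verbs from fixed rules, the latter from parsed natural language.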

External (eGPU & cloud)
  • SAM3 — segmentation inference on eGPU
  • AnyGrasp — grasp inference on eGPU
  • Google Cloud — speech processing
  • Gemini — task parsing
  • ZeroMQ — message transport
  • gRPC — remote procedure calls

Perception
  • SAM3 Tracker — masks & camera tracking
  • AnyGrasp Server — ICP & 6-DOF grasps
  • ArUco Node — fiducial detection
  • Camera Driver — RGB-D stream

↓ topics & services

Controller
  • State Machine + Vosk speech recognition — rule-based orchestration
  • Claude Opus + Google Cloud Speech — LLM-driven orchestration

↓ actions & services

Execution
  • Execute Grasp — approach, grasp, retreat
  • Navigator — three-stage PID
  • Dropoff — place & release
  • Motion Planner — inverse kinematics

↓ joint commands & velocity

Hardware
  • Arms — /joint_cmd
  • Base — /cmd_vel
  • Grippers — /gripper/cmd
  • Camera Servo — /pan_tilt_cmd

Figure 1. Robot system architecture. External GPU servers handle perception inference and voice processing. A high-level controller orchestrates all modules via services and actions. Execution nodes translate plans into hardware commands. The system is implemented in ROS 2.

Tasks 1 & 2

Manipulation & Locomotion

Bimanual pick-and-place powered by RGB-D perception, real-time tracking, and autonomous navigation. AnyGrasp processes depth and color data to generate 6-DOF grasp poses, while SAM3 provides segmentation masks for continuous object tracking as the arms move. A PID-based controller handles straight-line path following and turning, coordinating base locomotion with arm manipulation.
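
The navigator's behavior can be sketched as a three-stage loop: turn to face the goal, drive the straight segment, then align to the goal heading. The gains, limits, and thresholds below are illustrative, not the tuned values from the real Navigator node, and the full system uses PID rather than the proportional-only terms shown here.

```python
import math

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def p_control(error, kp, limit):
    """Proportional term with saturation (a minimal stand-in for full PID)."""
    return max(-limit, min(limit, kp * error))

def navigate(pose, goal, steps=2000, dt=0.02):
    """Three-stage navigation sketch on a unicycle model:
    turn toward the goal, drive the straight line, align to goal heading."""
    x, y, th = pose
    gx, gy, gth = goal
    for stage in ("turn", "drive", "align"):
        for _ in range(steps):
            if stage == "drive":
                dist = math.hypot(gx - x, gy - y)
                if dist < 0.01:
                    break
                heading = math.atan2(gy - y, gx - x)
                v = p_control(dist, kp=1.5, limit=0.5)
                w = p_control(wrap(heading - th), kp=2.0, limit=1.0)
            else:
                target = math.atan2(gy - y, gx - x) if stage == "turn" else gth
                err = wrap(target - th)
                if abs(err) < 0.01:
                    break
                v, w = 0.0, p_control(err, kp=2.0, limit=1.0)
            x += v * math.cos(th) * dt
            y += v * math.sin(th) * dt
            th = wrap(th + w * dt)
    return x, y, th
```

Splitting turning from driving keeps each stage a one-dimensional control problem, which is why simple PID terms suffice for straight-line paths.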

Key Components

  • AnyGrasp server — generates 6-DOF grasp poses from RGB-D point clouds
  • Iterative closest point (ICP) — refines point clouds using known 3D object mesh models (STL) before grasp detection
  • SAM3 tracker — provides segmentation masks for continuous object tracking with proportional camera control
  • PID navigation controller — handles straight-line path following and turning for autonomous locomotion
  • Execute grasp server — plans and executes 6-DOF grasps from AnyGrasp poses
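
The ICP refinement step can be illustrated with a minimal point-to-point variant: repeatedly match each observed point to its nearest model point, then solve the best rigid alignment in closed form (Kabsch/SVD). This is a from-scratch sketch with brute-force nearest neighbours; the real pipeline runs ICP inside the AnyGrasp server against STL mesh models.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Kabsch/SVD: rigid transform (R, t) aligning paired points src -> dst."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def icp(source, model, iters=30):
    """Point-to-point ICP: align an observed cloud to the known model cloud.
    Brute-force nearest neighbours, for brevity only."""
    src = source.copy()
    for _ in range(iters):
        d = ((src[:, None, :] - model[None, :, :]) ** 2).sum(-1)
        nn = model[d.argmin(1)]     # closest model point per source point
        R, t = best_fit_transform(src, nn)
        src = src @ R.T + t
    return src
```

Because grasp detection is sensitive to noisy depth, snapping the observed cloud onto the known mesh before calling AnyGrasp yields cleaner 6-DOF grasp poses.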

Figure 3a. AnyGrasp point cloud with grasp pose for a banana.


Figure 3b. AnyGrasp point cloud with grasp pose for a cube.

Task 3

Scooping Granular Objects

Full candy scooping pipeline: the robot grasps a scoop and bucket, navigates to a candy box, scoops the candy, and transfers it to the bucket. This integrates bimanual grasping, autonomous navigation, and coordinated tool use in one continuous sequence.

Candy Scooping Pipeline

1 Detect ArUco markers on tools
2 Pick up scoop & bucket
3 Move to candy box
4 Scoop candy
5 Detect bucket center & move scoop to bucket
6 Drop candy in bucket

Figure 4. Candy scooping pipeline with six coordinated steps.
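
The six steps run as a fixed sequence, which can be sketched as a minimal sequential state machine with per-step retries. The step names follow Figure 4; the `execute` callback and retry policy are hypothetical, standing in for the ROS 2 action calls the real controller issues.

```python
# Step names follow the Figure 4 pipeline.
PIPELINE = [
    "detect_aruco_markers",
    "pick_up_scoop_and_bucket",
    "move_to_candy_box",
    "scoop_candy",
    "detect_bucket_and_move_scoop",
    "drop_candy_in_bucket",
]

def run_pipeline(execute, max_retries=2):
    """Run each step in order; retry a failed step before aborting.
    `execute(step)` returns True on success (hypothetical interface)."""
    for step in PIPELINE:
        for _attempt in range(max_retries + 1):
            if execute(step):
                break
        else:                       # every attempt failed
            return f"aborted at {step}"
    return "done"
```

Keeping the sequence declarative makes it easy for either controller, rule-based or LLM-driven, to launch, skip, or resume steps.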

Key Components

  • ArUco detection node — locates the scoop and bucket using ArUco fiducial markers
  • Execute grasp server — coordinates bimanual grasp sequences
  • Waypoint trajectories — arm paths for scooping and transfer motions
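
A waypoint trajectory can be densified by straight-line interpolation in joint space before it is streamed to the arms. The helper below is an illustrative sketch; the real scooping motion uses hand-tuned waypoints, and a deployed version would also respect joint velocity limits.

```python
import numpy as np

def interpolate_waypoints(waypoints, points_per_segment=20):
    """Linearly interpolate joint-space waypoints into a dense trajectory.
    Returns an array of shape (n_points, n_joints)."""
    wp = np.asarray(waypoints, dtype=float)
    traj = []
    for a, b in zip(wp[:-1], wp[1:]):
        # Sample each segment half-open so shared waypoints are not doubled.
        for s in np.linspace(0.0, 1.0, points_per_segment, endpoint=False):
            traj.append((1 - s) * a + s * b)
    traj.append(wp[-1])             # include the final waypoint exactly
    return np.array(traj)
```

Dense interpolation keeps the scoop's motion smooth through the dig-lift-tilt sequence without requiring a full motion-planner query per segment.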

Figure 5a. Bimanual candy scooping scene setup.


Figure 5b. ArUco tag-based pose estimation for tool localization.


Figure 5c. Scooping perspective during candy pickup.

Development & Debugging Tools

Custom tools for visualizing sensor streams, diagnosing perception failures, and validating grasp plans before deploying to hardware.


Figure 6a. Teleoperation interface for manual robot control.


Figure 6b. Real-time sensor and state display.


Figure 7a. Graphical interface with Claude Opus integrated as an autonomous controller. Originally built to record examples for training a vision-language-action (VLA) or action chunking with transformers (ACT) policy.


Figure 7b. Claude control panel for voice-driven interaction, with Google Cloud speech-to-text, Gemini task parsing, and active listening with keyword detection.