Collaborative Robotics · Winter 2026

Bimanual Granular Scooping

A bimanual mobile robot that perceives objects through RGB-D sensing, plans grasps in 3D, navigates autonomously, and scoops granular materials using coordinated two-arm control.

Alaz Cig·Edward Lee·Haoyue Xiao·Jeffery Cai·Pengyu Mo·Saimai Lau·Zeyi Liu·Zhanyi Sun

Listed alphabetically

Overview

We demonstrate object manipulation, navigation-integrated grasping, and coordinated bimanual candy scooping with tool use, in simulation and in the real world.

Task 3. Bimanual candy scooping in simulation.

Tasks 1 & 2. Manipulation and locomotion in simulation.

Tasks 1 & 2. Manipulation and locomotion with the real robot.

System Architecture

The control architecture connects perception, navigation, grasp planning, and a high-level controller. An initial state-machine approach was augmented with an LLM-powered controller (Claude Opus 4.6) for generalizable voice-driven interaction.
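
The orchestration split between the rule-based and LLM-driven controllers can be sketched as a small action dispatcher. The action names, registry, and `dispatch` helper below are hypothetical stand-ins; in the real system these are ROS 2 actions and the intent comes from Claude Opus parsing a transcribed voice command.

```python
# Hypothetical action registry: each verb maps to an executor call.
# In the deployed system these would be ROS 2 action clients.
ACTIONS = {
    "pick": lambda obj: f"grasp({obj})",
    "place": lambda obj: f"dropoff({obj})",
    "goto": lambda obj: f"navigate({obj})",
}

def dispatch(intent: dict) -> str:
    """Map a parsed intent (as the LLM would return it) to an executor call."""
    verb, target = intent["verb"], intent["object"]
    if verb not in ACTIONS:
        raise ValueError(f"unknown verb: {verb}")
    return ACTIONS[verb](target)

# Here the intent is stubbed by hand; the real pipeline fills it from the
# LLM given the transcribed voice command.
print(dispatch({"verb": "pick", "object": "banana"}))  # grasp(banana)
```

The state machine and the LLM controller can share this dispatch layer: the former selects verbs from fixed rules, the latter from parsed natural language.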

External (eGPU & cloud)
  • SAM3 — segmentation inference on eGPU
  • AnyGrasp — grasp inference on eGPU
  • Google Cloud — speech processing
  • Gemini — task parsing
  • ZeroMQ — message transport
  • gRPC — remote procedure calls

Perception
  • SAM3 Tracker — masks & camera tracking
  • AnyGrasp Server — ICP & 6-DOF grasps
  • ArUco Node — fiducial detection
  • Camera Driver — RGB-D stream

↓ topics & services

Controller
  • State Machine + Vosk speech recognition — rule-based orchestration
  • Claude Opus + Google Cloud Speech — LLM-driven orchestration

↓ actions & services

Execution
  • Execute Grasp — approach, grasp, retreat
  • Navigator — three-stage PID
  • Dropoff — place & release
  • Motion Planner — inverse kinematics

↓ joint commands & velocity

Hardware
  • Arms — /joint_cmd
  • Base — /cmd_vel
  • Grippers — /gripper/cmd
  • Camera Servo — /pan_tilt_cmd

Figure 1. Robot system architecture. External GPU servers handle perception inference and voice processing. A high-level controller orchestrates all modules via services and actions. Execution nodes translate plans into hardware commands. The system is implemented in ROS 2.

Tasks 1 & 2

Manipulation & Locomotion

Bimanual pick-and-place powered by RGB-D perception, real-time tracking, and autonomous navigation. AnyGrasp processes depth and color data to generate 6-DOF grasp poses, while SAM3 provides segmentation masks for continuous object tracking as the arms move. A PID-based controller handles straight-line path following and turning, coordinating base locomotion with arm manipulation.
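
The navigator's behavior can be sketched as a three-stage loop: turn to face the goal, drive the straight segment, then align to the goal heading. The gains, limits, and thresholds below are illustrative, not the tuned values from the real Navigator node, and the full system uses PID rather than the proportional-only terms shown here.

```python
import math

def wrap(a):
    """Wrap an angle to (-pi, pi]."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def p_control(error, kp, limit):
    """Proportional term with saturation (a minimal stand-in for full PID)."""
    return max(-limit, min(limit, kp * error))

def navigate(pose, goal, steps=2000, dt=0.02):
    """Three-stage navigation sketch on a unicycle model:
    turn toward the goal, drive the straight line, align to goal heading."""
    x, y, th = pose
    gx, gy, gth = goal
    for stage in ("turn", "drive", "align"):
        for _ in range(steps):
            if stage == "drive":
                dist = math.hypot(gx - x, gy - y)
                if dist < 0.01:
                    break
                heading = math.atan2(gy - y, gx - x)
                v = p_control(dist, kp=1.5, limit=0.5)
                w = p_control(wrap(heading - th), kp=2.0, limit=1.0)
            else:
                target = math.atan2(gy - y, gx - x) if stage == "turn" else gth
                err = wrap(target - th)
                if abs(err) < 0.01:
                    break
                v, w = 0.0, p_control(err, kp=2.0, limit=1.0)
            x += v * math.cos(th) * dt
            y += v * math.sin(th) * dt
            th = wrap(th + w * dt)
    return x, y, th
```

Splitting turning from driving keeps each stage a one-dimensional control problem, which is why simple PID terms suffice for straight-line paths.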

Key Components

  • AnyGrasp server — generates 6-DOF grasp poses from RGB-D point clouds
  • Iterative closest point (ICP) — refines point clouds using known 3D object mesh models (STL) before grasp detection
  • SAM3 tracker — provides segmentation masks for continuous object tracking with proportional camera control
  • PID navigation controller — handles straight-line path following and turning for autonomous locomotion
  • Execute grasp server — plans and executes 6-DOF grasps from AnyGrasp poses
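
The ICP refinement step can be illustrated with a minimal point-to-point variant: repeatedly match each observed point to its nearest model point, then solve the best rigid alignment in closed form (Kabsch/SVD). This is a from-scratch sketch with brute-force nearest neighbours; the real pipeline runs ICP inside the AnyGrasp server against STL mesh models.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Kabsch/SVD: rigid transform (R, t) aligning paired points src -> dst."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def icp(source, model, iters=30):
    """Point-to-point ICP: align an observed cloud to the known model cloud.
    Brute-force nearest neighbours, for brevity only."""
    src = source.copy()
    for _ in range(iters):
        d = ((src[:, None, :] - model[None, :, :]) ** 2).sum(-1)
        nn = model[d.argmin(1)]     # closest model point per source point
        R, t = best_fit_transform(src, nn)
        src = src @ R.T + t
    return src
```

Because grasp detection is sensitive to noisy depth, snapping the observed cloud onto the known mesh before calling AnyGrasp yields cleaner 6-DOF grasp poses.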

Figure 3a. AnyGrasp point cloud with grasp pose for a banana.


Figure 3b. AnyGrasp point cloud with grasp pose for a cube.

Task 3

Scooping Granular Objects

Full candy scooping pipeline: the robot grasps a scoop and bucket, navigates to a candy box, scoops the candy, and transfers it to the bucket. This integrates bimanual grasping, autonomous navigation, and coordinated tool use in one continuous sequence.

Candy Scooping Pipeline

1 Detect ArUco markers on tools
2 Pick up scoop & bucket
3 Move to candy box
4 Scoop candy
5 Detect bucket center & move scoop to bucket
6 Drop candy in bucket

Figure 4. Candy scooping pipeline with six coordinated steps.
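
The six steps run as a fixed sequence, which can be sketched as a minimal sequential state machine with per-step retries. The step names follow Figure 4; the `execute` callback and retry policy are hypothetical, standing in for the ROS 2 action calls the real controller issues.

```python
# Step names follow the Figure 4 pipeline.
PIPELINE = [
    "detect_aruco_markers",
    "pick_up_scoop_and_bucket",
    "move_to_candy_box",
    "scoop_candy",
    "detect_bucket_and_move_scoop",
    "drop_candy_in_bucket",
]

def run_pipeline(execute, max_retries=2):
    """Run each step in order; retry a failed step before aborting.
    `execute(step)` returns True on success (hypothetical interface)."""
    for step in PIPELINE:
        for _attempt in range(max_retries + 1):
            if execute(step):
                break
        else:                       # every attempt failed
            return f"aborted at {step}"
    return "done"
```

Keeping the sequence declarative makes it easy for either controller, rule-based or LLM-driven, to launch, skip, or resume steps.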

Key Components

  • ArUco detection node — locates the scoop and bucket using ArUco fiducial markers
  • Execute grasp server — coordinates bimanual grasp sequences
  • Waypoint trajectories — arm paths for scooping and transfer motions
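
A waypoint trajectory can be densified by straight-line interpolation in joint space before it is streamed to the arms. The helper below is an illustrative sketch; the real scooping motion uses hand-tuned waypoints, and a deployed version would also respect joint velocity limits.

```python
import numpy as np

def interpolate_waypoints(waypoints, points_per_segment=20):
    """Linearly interpolate joint-space waypoints into a dense trajectory.
    Returns an array of shape (n_points, n_joints)."""
    wp = np.asarray(waypoints, dtype=float)
    traj = []
    for a, b in zip(wp[:-1], wp[1:]):
        # Sample each segment half-open so shared waypoints are not doubled.
        for s in np.linspace(0.0, 1.0, points_per_segment, endpoint=False):
            traj.append((1 - s) * a + s * b)
    traj.append(wp[-1])             # include the final waypoint exactly
    return np.array(traj)
```

Dense interpolation keeps the scoop's motion smooth through the dig-lift-tilt sequence without requiring a full motion-planner query per segment.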

Figure 5a. Bimanual candy scooping scene setup.


Figure 5b. ArUco tag-based pose estimation for tool localization.


Figure 5c. Scooping perspective during candy pickup.

Development & Debugging Tools

Custom tools for visualizing sensor streams, diagnosing perception failures, and validating grasp plans before deploying to hardware.


Figure 6a. Teleoperation interface for manual robot control.


Figure 6b. Real-time sensor and state display.


Figure 7a. Graphical interface with Claude Opus integrated as an autonomous controller. Originally built to record examples for training a vision-language-action (VLA) or action chunking with transformers (ACT) policy.


Figure 7b. Claude control panel for voice-driven interaction, with Google Cloud speech-to-text, Gemini task parsing, and active listening with keyword detection.