IFG: Internet-Scale Functional Grasping

Anonymous Authors

IFG teaser

Abstract

Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control a dexterous robotic hand for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasp-generation pipeline that understands the local geometry of the hand and of the objects in the scene. Because this pipeline is slow and requires ground-truth observations, the generated dataset is distilled into a diffusion model that operates directly on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of simulation-based, locally aware force-closure optimization, IFG achieves high-performance semantic grasping without any manually collected training data.

Grasps

Single Object Grasp Examples

Water Bottle
Hammer
Large Spoon

Crowded Scene Grasp Examples

Cluttered Scene 1
Cluttered Scene 2
Cluttered Scene 3
Cluttered Scene 4

Method

Overview

Method overview
IFG takes an object mesh and a task prompt as input. To incorporate semantic understanding, it renders the object from multiple viewpoints, applies a VLM-based segmentation model combining SAM and VLPart, and reprojects the results into 3D space to identify task-relevant regions. For geometric grounding, it initializes a force closure objective at these regions and optimizes for functional grasps. The resulting data is then used to train a diffusion model for fast grasp synthesis from depth.

Pipeline

1) Inputs & Useful Region Proposal

IFG renders multi-view images and applies VLM-guided segmentation (SAM + VLPart) to extract task-relevant parts. These are reprojected to 3D and aggregated per mesh face to identify a “useful region” that localizes functional geometry.
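The reprojection-and-voting step above can be sketched as follows. The function names, the pinhole projection, and the vote threshold are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of per-face mask voting, assuming known pinhole cameras.
# `project_point`, `useful_region`, and `vote_frac` are illustrative names.
import numpy as np

def project_point(K, R, t, p):
    """Project a 3D point p into pixel coordinates with intrinsics K, pose (R, t)."""
    cam = R @ p + t
    uv = K @ cam
    return uv[:2] / uv[2]

def useful_region(face_centroids, views, vote_frac=0.5):
    """Mark faces whose centroid lands inside the segmentation mask
    in at least `vote_frac` of the views that observe them.
    `views` is a list of (K, R, t, mask) tuples, one per rendered viewpoint."""
    votes = np.zeros(len(face_centroids))
    seen = np.zeros(len(face_centroids))
    for K, R, t, mask in views:
        for i, c in enumerate(face_centroids):
            u, v = project_point(K, R, t, c)
            ui, vi = int(round(u)), int(round(v))
            if 0 <= vi < mask.shape[0] and 0 <= ui < mask.shape[1]:
                seen[i] += 1
                votes[i] += mask[vi, ui]
    seen = np.maximum(seen, 1)  # avoid division by zero for unseen faces
    return (votes / seen) >= vote_frac
```

Aggregating votes per mesh face, rather than per pixel, makes the region estimate robust to segmentation noise in any single view.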

2) Geometric Grasp Synthesis

Candidate hand poses are sampled near the useful region. A force-closure-based energy with joint limits and collision penalties is minimized to produce physically valid and functional grasps.
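A minimal sketch of such an energy, assuming a common force-closure surrogate (the norm of the summed contact normals) plus quadratic joint-limit and penetration penalties. The surrogate and the weights are stand-ins for the full objective, not the paper's exact formulation.

```python
# Toy grasp energy: force-closure surrogate + joint-limit + collision penalties.
# Weights w_joint, w_col are illustrative assumptions.
import numpy as np

def grasp_energy(contact_normals, q, q_lo, q_hi, penetrations,
                 w_joint=1.0, w_col=10.0):
    # Force-closure surrogate: opposing contact normals should cancel out,
    # driving the summed normal vector toward zero.
    e_fc = np.linalg.norm(contact_normals.sum(axis=0))
    # Quadratic penalty for joint angles outside their limits [q_lo, q_hi].
    e_joint = np.sum(np.maximum(q - q_hi, 0) ** 2 +
                     np.maximum(q_lo - q, 0) ** 2)
    # Penalize positive penetration depths (hand-object / self collision).
    e_col = np.sum(np.maximum(penetrations, 0) ** 2)
    return e_fc + w_joint * e_joint + w_col * e_col
```

In the full pipeline this scalar would be minimized with a gradient-based optimizer over the hand pose and joint angles, starting from samples near the useful region.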

3) Simulation Evaluation

Each grasp is perturbed and tested in Isaac Gym on tasks such as Lift and Pick&Shake. Success rates across perturbations become continuous labels, and unstable grasps are filtered out.
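The perturb-and-evaluate labeling could look like the following sketch, where `simulate_lift` is a hypothetical stand-in for an Isaac Gym rollout and the noise scale and filtering threshold are assumed values.

```python
# Sketch of the perturb-and-evaluate labeling loop.
# `simulate_lift` is a hypothetical callback returning True on a successful lift.
import numpy as np

def label_grasp(grasp, simulate_lift, n_perturb=20, pos_noise=0.005,
                keep_thresh=0.5, rng=None):
    """Return a continuous success label in [0, 1], or None if the grasp
    is too unstable under perturbation and should be filtered out."""
    rng = rng or np.random.default_rng(0)
    successes = 0
    for _ in range(n_perturb):
        # Jitter the grasp parameters before each simulated trial.
        perturbed = grasp + rng.normal(0, pos_noise, size=grasp.shape)
        successes += bool(simulate_lift(perturbed))
    rate = successes / n_perturb
    return rate if rate >= keep_thresh else None
```

Using the success rate across perturbations as a continuous label, rather than a single binary trial, gives the downstream model a graded notion of grasp stability.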

4) Diffusion Policy Distillation

A diffusion model is trained to map a noisy grasp, conditioned on a basis point set (BPS) encoding of the depth observation, to a final grasp. This combines the semantic priors of the VLMs with the geometric accuracy of the optimizer, while allowing fast inference.
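The BPS conditioning can be sketched as below, following the standard basis-point-set idea (fixed random basis points, each encoded by the distance to its nearest cloud point); the basis size and scale here are assumptions.

```python
# Sketch of a Basis Point Set (BPS) encoding of a depth point cloud.
# n_basis and scale are illustrative choices, not the paper's values.
import numpy as np

def bps_encode(point_cloud, n_basis=512, scale=0.15, seed=0):
    """Encode an (N, 3) point cloud as the distance from each of `n_basis`
    fixed random basis points to its nearest point in the cloud."""
    rng = np.random.default_rng(seed)
    basis = rng.uniform(-scale, scale, size=(n_basis, 3))
    # Pairwise distances of shape (n_basis, N); take the min over the cloud.
    d = np.linalg.norm(basis[:, None, :] - point_cloud[None, :, :], axis=-1)
    return d.min(axis=1)
```

Because the basis points are fixed (here via a fixed seed), the encoding is a constant-length vector regardless of cloud size, which makes it a convenient conditioning input for the diffusion model.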

Results

Phase 1: Synthetic Data Generation (Optimization)

This section evaluates our simulation-based pipeline's ability to generate high-quality "ground truth" labels. This "Teacher" data is what we use to train our final model.

1.1 Functional Alignment

Qualitative Comparisons
Qualitative Comparison: IFG seeds optimization on semantic regions (e.g., handles), ensuring grasps are not just stable, but functional.

1.2 Single-Object Generation Metrics

Table: Full Single-Object Generation Success (Lift)

Object            Get a Grip   IFG (Ours)
Water Bottle      49.1         62.8
Large Detergent   51.2         62.5
Spray Bottle      43.1         54.5
Pan               48.1         52.1
Small Lamp        56.8         85.7
Spoon             42.7         50.9
Vase              32.2         55.9
Hammer            45.8         45.8
Shark Plushy      19.8         25.1
Overall Average   50.93        51.11
Single Object Top-k Success
Top-k Success: Success rates for single object generation across varying numbers of grasp attempts.
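The top-k metric in the plot above can be computed as in this sketch: a trial counts as a success if any of the k best-ranked grasps succeeds.

```python
def top_k_success(ranked_outcomes, k):
    """ranked_outcomes: per-object lists of boolean grasp outcomes,
    ordered best-ranked first. Returns the fraction of objects for which
    at least one of the top-k grasps succeeded."""
    return sum(any(o[:k]) for o in ranked_outcomes) / len(ranked_outcomes)
```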

1.3 Optimization Pipeline Ablation

Table: Incremental Pipeline Improvements (Success %)

Configuration            Single (Lift)   Cluttered (Lift)
Single camera only       47.83           18.53
+ Multi-camera views     48.37           24.70
+ Two-means clustering   49.04           31.59
Full IFG Pipeline        51.11           32.23
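The two-means step in the ablation is not detailed here; assuming it splits candidate region points into two spatial clusters (e.g., to reject outlier segmentation votes), a generic two-means routine looks like:

```python
# Generic two-means (k-means with k=2) sketch; what exactly is clustered in
# the pipeline is an assumption, since the page reports it only as a row.
import numpy as np

def two_means(points, iters=20, seed=0):
    """Split an (N, 3) point set into two clusters by Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        d = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels, centers
```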

Phase 2: Trained Diffusion Model (Inference)

This section evaluates the final model (the "Student"). Unlike the optimizer, this model runs in real-time and only sees raw point cloud data from a single depth camera.

2.1 Full Crowded Scene Performance

Table: Full Performance Breakdown on 35 Dense Test Scenes (Lift Success %)

Object Category           DexGraspNet2   GraspTTA   ISAGrasp   IFG (Ours)
Tomato Soup Can           47.8           38.3       52.0       45.5
Mug                       33.2           26.9       22.6       60.4
Drill                     32.1           20.8       36.4       57.5
Scissors                  9.7            0.0        33.7       20.2
Screw Driver              0.0            8.3        40.0       22.0
Shampoo Bottle            50.6           25.4       18.8       53.1
Elephant Figure           23.6           29.6       24.2       35.8
Peach Can                 61.8           28.0       55.3       60.3
Face Cream Tube           32.1           22.5       20.7       35.5
Tape Roll                 22.7           13.9       9.8        43.2
Camel Toy                 12.8           14.3       21.3       21.8
Body Wash                 40.2           22.3       29.4       58.3
Object Average (Hard)     30.55          20.86      30.35      42.80
Scene Average (Overall)   36.71          25.64      32.51      34.16

2.2 Generalization and Coverage Analysis

Balanced Coverage Plot
Concentration vs. Coverage: Baselines focus on easy objects. IFG ensures the robot can target any object in the clutter.
Cluttered Scene Success Rate
Threshold Sensitivity: Success rates across different lift height requirements in cluttered scenes.

Discussion: The "Balanced Coverage" Advantage

While overall "Scene Averages" can be skewed by models that only pick the easiest objects in a pile, our Object Average (42.80% vs. 30.55% for the best baseline) shows that IFG handles a wider variety of geometries. By seeding the model with VLM semantic knowledge, we prevent the "student" from overfitting to simple shapes and encourage a deeper understanding of functional grasping.
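The distinction between the two aggregates can be made concrete with a small sketch: the object average first averages within each object category and then across categories, while the scene average averages within each scene and then across scenes. The example matrix below is made up for illustration.

```python
# Sketch of the two aggregate metrics: object average weights every object
# category equally; scene average weights every scene equally.
import numpy as np

def aggregates(success):
    """success[i, j]: success rate of object category i in scene j
    (NaN where the object does not appear in that scene)."""
    per_object = np.nanmean(success, axis=1)  # average over scenes per object
    per_scene = np.nanmean(success, axis=0)   # average over objects per scene
    return per_object.mean(), per_scene.mean()
```

A model that only grasps the easy object in every scene inflates the scene average while its object average stays low, which is why the two numbers can rank methods differently.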