IFG: Internet-Scale Functional Grasping

Ray Muxin Liu*, Mingxuan Li*, Kenneth Shaw, Deepak Pathak

Carnegie Mellon University

* indicates equal contribution

IFG teaser

Abstract

Large Vision Models trained on internet-scale data excel at segmenting and semantically understanding object parts, even in cluttered scenes. However, while these models can guide a robot toward the general region of an object, they lack the geometric precision needed to control dexterous robotic hands for precise 3D grasping. To address this, IFG leverages simulation through a force-closure grasp generation pipeline that captures local hand–object geometries, then distills this slow, ground-truth-dependent process into a diffusion model that operates in real time on camera point clouds. By combining the global semantic understanding of internet-scale vision with the geometric accuracy of simulation-based local reasoning, IFG achieves high-performance semantic grasping without any manually collected training data.

Grasps

Single Object Grasp Examples

water bottle
Hammer
Large Spoon

Crowded Scene Grasp Examples

Cluttered Scene 1
Cluttered Scene 2
Cluttered Scene 3
Cluttered Scene 4

Method

Overview

Method overview
IFG takes an object mesh and a task prompt as input. To incorporate semantic understanding, it renders the object from multiple viewpoints, applies a VLM-based segmentation model combining SAM and VLPart, and reprojects the results into 3D space to identify task-relevant regions. For geometric grounding, it initializes a force closure objective at these regions and optimizes for functional grasps. The resulting data is then used to train a diffusion model for fast grasp synthesis from depth.

Pipeline

1) Inputs & Useful Region Proposal

IFG renders multi-view images and applies VLM-guided segmentation (SAM + VLPart) to extract task-relevant parts. These are reprojected to 3D and aggregated per mesh face to identify a “useful region” that localizes functional geometry.
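As a rough sketch of this aggregation step (the renderer's face-ID buffer, the vote threshold, and the mask source below are assumptions for illustration, not the exact IFG implementation), each mesh face can be scored by how often it falls inside the task-relevant mask across the views that actually see it:

```python
# Illustrative sketch: aggregating multi-view 2D part masks onto mesh faces.
# Assumes a rasterizer that returns a per-pixel face-ID buffer for each view;
# the SAM + VLPart segmentation is treated as a black box yielding one boolean
# mask per rendered image.
import numpy as np

def useful_region_from_views(face_id_buffers, part_masks, num_faces, vote_threshold=0.5):
    """face_id_buffers: list of (H, W) int arrays, -1 where no face is visible.
    part_masks:       list of (H, W) bool arrays from VLM-guided segmentation.
    Returns indices of mesh faces voted as task-relevant ("useful")."""
    votes = np.zeros(num_faces)    # views in which the face landed inside the mask
    visible = np.zeros(num_faces)  # views in which the face was visible at all

    for face_ids, mask in zip(face_id_buffers, part_masks):
        seen = face_ids >= 0
        np.add.at(visible, face_ids[seen], 1.0)
        np.add.at(votes, face_ids[seen & mask], 1.0)

    score = votes / np.maximum(visible, 1.0)
    return np.where(score >= vote_threshold)[0]
```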

2) Geometric Grasp Synthesis

Candidate hand poses are sampled near the useful region. A force-closure-based energy with joint limits and collision penalties is minimized to produce physically valid and functional grasps.
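A minimal sketch of such a composite energy is shown below, assuming contact points and normals have already been extracted from the hand's fingertip links via forward kinematics and that signed distances from hand points to the object surface are available; the exact force-closure formulation and weights used by IFG may differ:

```python
# Illustrative composite grasp energy in the spirit of differentiable
# force-closure optimization; all terms and weights are placeholders.
import torch

def grasp_energy(contact_points, contact_normals, joint_angles,
                 joint_lower, joint_upper, signed_distances, useful_center,
                 w_fc=1.0, w_region=1.0, w_joint=10.0, w_pen=10.0):
    # Force-closure surrogate: forces applied along inward contact normals
    # should be able to cancel, so penalize the net force and net torque.
    net_force = contact_normals.sum(dim=0)
    torques = torch.cross(contact_points - contact_points.mean(dim=0),
                          contact_normals, dim=-1)
    e_fc = net_force.norm() + torques.sum(dim=0).norm()

    # Keep contacts close to the task-relevant ("useful") region.
    e_region = (contact_points - useful_center).norm(dim=-1).mean()

    # Soft joint-limit penalty.
    e_joint = (torch.relu(joint_lower - joint_angles) +
               torch.relu(joint_angles - joint_upper)).sum()

    # Collision penalty: hand points with negative signed distance penetrate the object.
    e_pen = torch.relu(-signed_distances).sum()

    return w_fc * e_fc + w_region * e_region + w_joint * e_joint + w_pen * e_pen
```

Candidate poses sampled near the useful region would then be refined by gradient descent on an energy of this form.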

3) Simulation Evaluation

Each grasp is perturbed and tested in Isaac Gym on tasks such as Lift and Pick & Shake. Success rates across perturbations become continuous labels, and unstable grasps are filtered out.
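The labeling step can be sketched as the loop below, where `run_trial` stands in for an Isaac Gym rollout of a Lift or Pick & Shake trial; the perturbation scales and keep threshold are illustrative assumptions:

```python
# Illustrative sketch: converting perturbed simulator rollouts into continuous labels.
import numpy as np

def label_grasp(grasp_translation, joint_angles, run_trial,
                n_perturb=20, trans_std=0.005, joint_std=0.02, keep_threshold=0.3):
    rng = np.random.default_rng(0)
    successes = 0
    for _ in range(n_perturb):
        # Perturb the grasp before executing it in simulation. (A full
        # implementation would also perturb the wrist rotation on SO(3).)
        t = grasp_translation + rng.normal(0.0, trans_std, size=grasp_translation.shape)
        q = joint_angles + rng.normal(0.0, joint_std, size=joint_angles.shape)
        successes += bool(run_trial(t, q))

    label = successes / n_perturb                      # continuous success-rate label
    return label if label >= keep_threshold else None  # filter out unstable grasps
```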

4) Diffusion Policy Distillation

A diffusion model is trained to map a noisy grasp and a depth-based basis point set (BPS) encoding to a final grasp, distilling the semantic priors from VLMs and the geometric accuracy of optimization into a single fast feedforward generator.
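A minimal sketch of the BPS encoding and one denoising training step is given below; the basis-point count, network size, and noise schedule are placeholders rather than the configuration used in IFG:

```python
# Illustrative sketch: BPS features from a depth point cloud plus a toy
# noise-prediction training step for the grasp diffusion model.
import torch
import torch.nn as nn

def bps_encode(point_cloud, basis_points):
    """Basis Point Set feature: distance from each fixed basis point to its
    nearest observed point. point_cloud: (N, 3), basis_points: (B, 3) -> (B,)."""
    dists = torch.cdist(basis_points, point_cloud)   # (B, N)
    return dists.min(dim=1).values

class GraspDenoiser(nn.Module):
    def __init__(self, grasp_dim, bps_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(grasp_dim + bps_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, grasp_dim),             # predicts the injected noise
        )

    def forward(self, noisy_grasp, bps_feat, t):
        return self.net(torch.cat([noisy_grasp, bps_feat, t], dim=-1))

def training_step(model, grasp, bps_feat, optimizer):
    """Corrupt a ground-truth grasp vector and train the model to recover the noise."""
    t = torch.rand(grasp.shape[0], 1)                          # diffusion time in [0, 1]
    noise = torch.randn_like(grasp)
    noisy = torch.sqrt(1 - t) * grasp + torch.sqrt(t) * noise  # simplified schedule
    loss = nn.functional.mse_loss(model(noisy, bps_feat, t), noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```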

Results

1. Qualitative Results

Our method produces stable and task-oriented functional grasps across diverse objects and environments. In single-object scenes, grasps align with affordance-relevant regions such as handles or rims. In crowded scenes, the VLM-guided segmentation allows IFG to isolate the correct target object and avoid collisions with distractors.

Qualitative — Single Object
Qualitative Results

2. Generalization Across Objects

IFG generalizes to unseen object instances and novel categories without retraining. By combining visual-language part reasoning with geometric grasp optimization, our method successfully transfers to objects that differ in shape, topology, or affordance layout from training data. We evaluate this in both isolated scenes and cluttered arrangements.

Table: Single-object generation success rates (%).
Object                    Get a Grip    Ours
water bottle              49.1          62.8
large detergent bottle    51.2          62.5
spray bottle              43.1          54.5
pan                       48.1          52.1
small lamp                56.8          85.7
spoon                     42.7          50.9
vase                      32.2          55.9
hammer                    45.8          45.8
shark plushy              19.8          25.1
Table: Crowded-scene lift grasp success rates (%).
Object               DexGraspNet2    GraspTTA    ISAGrasp    Ours
Tomato Soup Can      47.8            38.3        52.0        45.5
Mug                  33.2            26.9        22.6        60.4
Drill                32.1            20.8        36.4        57.5
Scissors              9.7             0.0        33.7        20.2
Screw Driver          0.0             8.3        40.0        22.0
Shampoo Bottle       50.6            25.4        18.8        53.1
Elephant Figure      23.6            29.6        24.2        35.8
Peach Can            61.8            28.0        55.3        60.3
Face Cream Tube      32.1            22.5        20.7        35.5
Tape Roll            22.7            13.9         9.8        43.2
Camel Toy            12.8            14.3        21.3        21.8
Body Wash            40.2            22.3        29.4        58.3

In single-object settings, IFG consistently outperforms Get a Grip across most categories, especially for functionally complex objects such as bottles, lamps, and vases. In crowded scenes, IFG achieves state-of-the-art performance, often surpassing DexGraspNet2, GraspTTA, and ISAGrasp by accurately focusing on the target object using language-guided segmentation.

Single Object Success Rate
Single Object Top-k Success Rate
Cluttered Scene 1
Cluttered Scene Generalization
Generalization — Cluttered Scene
Cluttered Scene Success Rate

Visualizations demonstrate IFG’s transfer to novel categories. In single-object settings, the method identifies object parts relevant to the task (e.g., mug handle, hammer shaft). In cluttered scenes, IFG isolates the correct object from nearby distractors and maintains collision-free grasp synthesis.

3. Generation Success Rate

We evaluate the quality of grasps generated before diffusion training by measuring physical execution success in simulation. IFG surpasses Get a Grip on single-object lift and pick-and-shake tasks. In crowded scenes, IFG remains competitive with large-scale pretrained models such as DexGraspNet2, despite using no supervised grasp annotations.

Table: Single-object grasp generation success rates.
Method          Pick & Shake (%)    Lift (%)
Ours            16.14               51.11
Get a Grip      11.82               50.93
Table: Crowded-scene grasp generation success rates.
Method           Lift (%)
Ours             32.23
GraspTTA         25.64
ISAGrasp         32.51
DexGraspNet2     36.71

4. Diffusion Success Rate

Finally, we distill optimized grasps into a diffusion model, enabling fast feedforward grasp generation. The diffusion model preserves task-oriented grasp behaviors and achieves high success rates during execution, validating that physically optimized grasps can be effectively transferred to a generative policy.

Diffusion Success Rate

BibTeX

@misc{liu2025ifginternetscaleguidancefunctional,
      title={IFG: Internet-Scale Guidance for Functional Grasping Generation}, 
      author={Ray Muxin Liu and Mingxuan Li and Kenneth Shaw and Deepak Pathak},
      year={2025},
      eprint={2511.09558},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.09558}, 
}

Acknowledgments

We thank Jason Liu, Andrew Wang, Yulong Li, Jiahui (Jim) Yang, and Sri Anumakonda for helpful discussions and feedback. This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant No. FA9550-23-1-0747 and by the Office of Naval Research (ONR) MURI under Grant No. N00014-24-1-2748.