IFG: Internet-Scale Functional Grasping

Ray Muxin Liu*, Mingxuan Li*, Kenneth Shaw, Deepak Pathak

Carnegie Mellon University

* indicates equal contribution

IFG teaser

Abstract

Large Vision Models trained on internet-scale data excel at segmenting and semantically understanding object parts, even in cluttered scenes. However, while these models can guide a robot toward the general region of an object, they lack the geometric precision needed to control dexterous robotic hands for precise 3D grasping. To address this, IFG leverages simulation through a force-closure grasp generation pipeline that captures local hand–object geometries, then distills this slow, ground-truth-dependent process into a diffusion model that operates in real time on camera point clouds. By combining the global semantic understanding of internet-scale vision with the geometric accuracy of simulation-based local reasoning, IFG achieves high-performance semantic grasping without any manually collected training data.

Grasps

Single Object Grasp Examples

water bottle
Hammer
Large Spoon

Crowded Scene Grasp Examples

Cluttered Scene 1
Cluttered Scene 2
Cluttered Scene 3
Cluttered Scene 4

Method

Overview

Method overview
IFG takes an object mesh and a task prompt as input. To incorporate semantic understanding, it renders the object from multiple viewpoints, applies a VLM-based segmentation model combining SAM and VLPart, and reprojects the results into 3D space to identify task-relevant regions. For geometric grounding, it initializes a force closure objective at these regions and optimizes for functional grasps. The resulting data is then used to train a diffusion model for fast grasp synthesis from depth.

Pipeline

1) Inputs & Useful Region Proposal

IFG renders multi-view images and applies VLM-guided segmentation (SAM + VLPart) to extract task-relevant parts. These are reprojected to 3D and aggregated per mesh face to identify a “useful region” that localizes functional geometry.
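As a rough sketch of this aggregation step (the renderer's face-ID buffer, the vote threshold, and the mask source below are assumptions for illustration, not the exact IFG implementation), each mesh face can be scored by how often it falls inside the task-relevant mask across the views that actually see it:

```python
# Illustrative sketch: aggregating multi-view 2D part masks onto mesh faces.
# Assumes a rasterizer that returns a per-pixel face-ID buffer for each view;
# the SAM + VLPart segmentation is treated as a black box yielding one boolean
# mask per rendered image.
import numpy as np

def useful_region_from_views(face_id_buffers, part_masks, num_faces, vote_threshold=0.5):
    """face_id_buffers: list of (H, W) int arrays, -1 where no face is visible.
    part_masks:       list of (H, W) bool arrays from VLM-guided segmentation.
    Returns indices of mesh faces voted as task-relevant ("useful")."""
    votes = np.zeros(num_faces)    # views in which the face landed inside the mask
    visible = np.zeros(num_faces)  # views in which the face was visible at all

    for face_ids, mask in zip(face_id_buffers, part_masks):
        seen = face_ids >= 0
        np.add.at(visible, face_ids[seen], 1.0)
        np.add.at(votes, face_ids[seen & mask], 1.0)

    score = votes / np.maximum(visible, 1.0)
    return np.where(score >= vote_threshold)[0]
```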

2) Geometric Grasp Synthesis

Candidate hand poses are sampled near the useful region. A force-closure-based energy with joint limits and collision penalties is minimized to produce physically valid and functional grasps.
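A minimal sketch of such a composite energy is shown below, assuming contact points and normals have already been extracted from the hand's fingertip links via forward kinematics and that signed distances from hand points to the object surface are available; the exact force-closure formulation and weights used by IFG may differ:

```python
# Illustrative composite grasp energy in the spirit of differentiable
# force-closure optimization; all terms and weights are placeholders.
import torch

def grasp_energy(contact_points, contact_normals, joint_angles,
                 joint_lower, joint_upper, signed_distances, useful_center,
                 w_fc=1.0, w_region=1.0, w_joint=10.0, w_pen=10.0):
    # Force-closure surrogate: forces applied along inward contact normals
    # should be able to cancel, so penalize the net force and net torque.
    net_force = contact_normals.sum(dim=0)
    torques = torch.cross(contact_points - contact_points.mean(dim=0),
                          contact_normals, dim=-1)
    e_fc = net_force.norm() + torques.sum(dim=0).norm()

    # Keep contacts close to the task-relevant ("useful") region.
    e_region = (contact_points - useful_center).norm(dim=-1).mean()

    # Soft joint-limit penalty.
    e_joint = (torch.relu(joint_lower - joint_angles) +
               torch.relu(joint_angles - joint_upper)).sum()

    # Collision penalty: hand points with negative signed distance penetrate the object.
    e_pen = torch.relu(-signed_distances).sum()

    return w_fc * e_fc + w_region * e_region + w_joint * e_joint + w_pen * e_pen
```

Candidate poses sampled near the useful region would then be refined by gradient descent on an energy of this form.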

3) Simulation Evaluation

Each grasp is perturbed and tested in Isaac Gym on tasks such as Lift and Pick & Shake. Success rates across perturbations become continuous labels, and unstable grasps are filtered out.
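The labeling step can be sketched as the loop below, where `run_trial` stands in for an Isaac Gym rollout of a Lift or Pick & Shake trial; the perturbation scales and keep threshold are illustrative assumptions:

```python
# Illustrative sketch: converting perturbed simulator rollouts into continuous labels.
import numpy as np

def label_grasp(grasp_translation, joint_angles, run_trial,
                n_perturb=20, trans_std=0.005, joint_std=0.02, keep_threshold=0.3):
    rng = np.random.default_rng(0)
    successes = 0
    for _ in range(n_perturb):
        # Perturb the grasp before executing it in simulation. (A full
        # implementation would also perturb the wrist rotation on SO(3).)
        t = grasp_translation + rng.normal(0.0, trans_std, size=grasp_translation.shape)
        q = joint_angles + rng.normal(0.0, joint_std, size=joint_angles.shape)
        successes += bool(run_trial(t, q))

    label = successes / n_perturb                      # continuous success-rate label
    return label if label >= keep_threshold else None  # filter out unstable grasps
```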

4) Diffusion Policy Distillation

A diffusion model is trained to map a noisy grasp and a depth-based basis point set (BPS) encoding to a final grasp, distilling the semantic priors from VLMs and the geometric accuracy of optimization into a single fast feedforward generator.
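A minimal sketch of the BPS encoding and one denoising training step is given below; the basis-point count, network size, and noise schedule are placeholders rather than the configuration used in IFG:

```python
# Illustrative sketch: BPS features from a depth point cloud plus a toy
# noise-prediction training step for the grasp diffusion model.
import torch
import torch.nn as nn

def bps_encode(point_cloud, basis_points):
    """Basis Point Set feature: distance from each fixed basis point to its
    nearest observed point. point_cloud: (N, 3), basis_points: (B, 3) -> (B,)."""
    dists = torch.cdist(basis_points, point_cloud)   # (B, N)
    return dists.min(dim=1).values

class GraspDenoiser(nn.Module):
    def __init__(self, grasp_dim, bps_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(grasp_dim + bps_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, grasp_dim),             # predicts the injected noise
        )

    def forward(self, noisy_grasp, bps_feat, t):
        return self.net(torch.cat([noisy_grasp, bps_feat, t], dim=-1))

def training_step(model, grasp, bps_feat, optimizer):
    """Corrupt a ground-truth grasp vector and train the model to recover the noise."""
    t = torch.rand(grasp.shape[0], 1)                          # diffusion time in [0, 1]
    noise = torch.randn_like(grasp)
    noisy = torch.sqrt(1 - t) * grasp + torch.sqrt(t) * noise  # simplified schedule
    loss = nn.functional.mse_loss(model(noisy, bps_feat, t), noise)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```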

Results

1. Qualitative Results

Our method produces stable and task-oriented functional grasps across diverse objects and environments. In single-object scenes, grasps align with affordance-relevant regions such as handles or rims. In crowded scenes, the VLM-guided segmentation allows IFG to isolate the correct target object and avoid collisions with distractors.

Qualitative — Single Object
Qualitative Results

2. Generalization Across Objects

IFG generalizes to unseen object instances and novel categories without retraining. By combining visual-language part reasoning with geometric grasp optimization, our method successfully transfers to objects that differ in shape, topology, or affordance layout from training data. We evaluate this in both isolated scenes and cluttered arrangements.

Table: Single-object generation success rates (%).
Object                    Get a Grip    Ours
water bottle              49.1          62.8
large detergent bottle    51.2          62.5
spray bottle              43.1          54.5
pan                       48.1          52.1
small lamp                56.8          85.7
spoon                     42.7          50.9
vase                      32.2          55.9
hammer                    45.8          45.8
shark plushy              19.8          25.1
Table: Crowded-scene lift grasp success rates (%).
Object               DexGraspNet2    GraspTTA    ISAGrasp    Ours
Tomato Soup Can      47.8            38.3        52.0        45.5
Mug                  33.2            26.9        22.6        60.4
Drill                32.1            20.8        36.4        57.5
Scissors              9.7             0.0        33.7        20.2
Screw Driver          0.0             8.3        40.0        22.0
Shampoo Bottle       50.6            25.4        18.8        53.1
Elephant Figure      23.6            29.6        24.2        35.8
Peach Can            61.8            28.0        55.3        60.3
Face Cream Tube      32.1            22.5        20.7        35.5
Tape Roll            22.7            13.9         9.8        43.2
Camel Toy            12.8            14.3        21.3        21.8
Body Wash            40.2            22.3        29.4        58.3

In single-object settings, IFG consistently outperforms Get a Grip across most categories, especially for functionally complex objects such as bottles, lamps, and vases. In crowded scenes, IFG achieves state-of-the-art performance, often surpassing DexGraspNet2, GraspTTA, and ISAGrasp by accurately focusing on the target object using language-guided segmentation.

Single Object Success Rate
Single Object Top-k Success Rate
Cluttered Scene 1
Cluttered Scene Generalization
Generalization — Cluttered Scene
Cluttered Scene Success Rate

Visualizations demonstrate IFG’s transfer to novel categories. In single-object settings, the method identifies object parts relevant to the task (e.g., mug handle, hammer shaft). In cluttered scenes, IFG isolates the correct object from nearby distractors and maintains collision-free grasp synthesis.

3. Generation Success Rate

We evaluate the quality of grasps generated before diffusion training by measuring physical execution success in simulation. IFG surpasses Get a Grip on single-object lift and pick-and-shake tasks. In crowded scenes, IFG remains competitive with large-scale pretrained models such as DexGraspNet2, despite using no supervised grasp annotations.

Table: Single-object grasp generation success rates.
Method          Pick & Shake (%)    Lift (%)
Ours            16.14               51.11
Get a Grip      11.82               50.93
Table: Crowded-scene grasp generation success rates.
Method           Lift (%)
Ours             32.23
GraspTTA         25.64
ISAGrasp         32.51
DexGraspNet2     36.71

4. Diffusion Success Rate

Finally, we distill optimized grasps into a diffusion model, enabling fast feedforward grasp generation. The diffusion model preserves task-oriented grasp behaviors and achieves high success rates during execution, validating that physically optimized grasps can be effectively transferred to a generative policy.

Diffusion Success Rate

BibTeX

@misc{liu2025ifginternetscaleguidancefunctional,
      title={IFG: Internet-Scale Guidance for Functional Grasping Generation}, 
      author={Ray Muxin Liu and Mingxuan Li and Kenneth Shaw and Deepak Pathak},
      year={2025},
      eprint={2511.09558},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.09558}, 
}

Acknowledgments

We thank Jason Liu, Andrew Wang, Yulong Li, Jiahui (Jim) Yang, and Sri Anumakonda for helpful discussions and feedback. This work was supported in part by the Air Force Office of Scientific Research (AFOSR) under Grant No. FA9550-23-1-0747 and by the Office of Naval Research (ONR) MURI under Grant No. N00014-24-1-2748.