IntentFuse: Language-Guided 3D Scene Understanding via Prompt Filtering and Fusion

Ahalya Ravendran, Madhawa Perera, Feng Xu, Lars Petersson, Dadong Wang, Xun Li

CSIRO Data61


Abstract

IntentFuse is a lightweight middleware that grounds natural language queries in 3D scenes by connecting a compact language model with a pretrained LERF (Language Embedded Radiance Field). It reformulates free-form queries into structured prompts, handling affordances and negations without extra training. Experiments show clear gains over LERF, enabling intuitive affordance grounding for robotics and AR/VR exploration.

Methodology

[Figure: IntentFuse pipeline]

IntentFuse Query Engine overview. The Query Evaluator extracts key roles from natural language, the Context Provider resolves ambiguities using scene priors, and the structured output is passed to the LERF engine for precise 3D grounding.
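The flow described in this overview can be illustrated with a minimal Python sketch. It is not the authors' implementation: the source describes a compact language model, whereas this sketch substitutes simple regular-expression rules so that it runs on its own. The names `StructuredPrompt`, `evaluate_query`, `provide_context`, `build_prompt`, the `AFFORDANCE_PRIORS` table, and the negation cue list are all hypothetical, and the call into LERF is left out entirely.

```python
import re
from dataclasses import dataclass, field

# Hypothetical scene priors: affordance phrase -> objects that could satisfy it.
AFFORDANCE_PRIORS = {"tell time": ["clock", "watch"]}

# Hypothetical negation cues; the group captures the excluded concept.
NEGATION = re.compile(
    r"\b(?:unlike|not|without|except)\s+(?:(?:a|an|the)\s+)?(.+)$"
)


@dataclass
class StructuredPrompt:
    """Assumed structured output: concepts to find and concepts to exclude."""
    positives: list = field(default_factory=list)
    negatives: list = field(default_factory=list)


def evaluate_query(query: str) -> StructuredPrompt:
    """Stand-in for the Query Evaluator: split a query into roles."""
    text = query.strip().rstrip(".").lower()
    negatives = []
    match = NEGATION.search(text)
    if match:
        # Everything after the negation cue is treated as the excluded concept.
        negatives.append(match.group(1).strip())
        text = text[: match.start()].rstrip(" ,")
    # Drop the generic lead-in ("something", "something to").
    text = re.sub(r"^something\s+(?:to\s+)?", "", text).strip()
    positives = [text] if text else []
    return StructuredPrompt(positives, negatives)


def provide_context(prompt: StructuredPrompt,
                    priors=AFFORDANCE_PRIORS) -> StructuredPrompt:
    """Stand-in for the Context Provider: map affordances to object names."""
    resolved = []
    for phrase in prompt.positives:
        # Phrases without a prior pass through unchanged.
        resolved.extend(priors.get(phrase, [phrase]))
    return StructuredPrompt(resolved, list(prompt.negatives))


def build_prompt(query: str) -> StructuredPrompt:
    """Run both stages; the result is what would be handed to LERF."""
    return provide_context(evaluate_query(query))
```

Under these assumptions, `build_prompt("Something to tell time")` yields the positives `["clock", "watch"]` with no negatives, and `build_prompt("Something wooden, unlike a soft toy.")` yields the positive `["wooden"]` and the negative `["soft toy"]`. An object-only query such as "Desk lamp" passes through unchanged apart from lowercasing.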

Experiments

Affordance Query

"Something to tell time"

[Figure: LERF (baseline) vs. Ours (IntentFuse)]

Descriptive Query

"Something wooden, unlike a soft toy."

[Figure: LERF (baseline) vs. Ours (IntentFuse)]
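The query above pairs a wanted attribute ("wooden") with an excluded concept ("soft toy"). The source does not say how the two are combined when scoring the scene. One simple possibility, shown here purely as an assumption, is to subtract the relevancy of the excluded concept from the relevancy of the wanted one and clip at zero. The function name `fuse_relevancy`, the `penalty` parameter, and the scores are illustrative only and are not results from the paper.

```python
def fuse_relevancy(positive, negative, penalty=1.0):
    """Assumed fusion rule: suppress points that match the excluded concept.

    positive, negative: per-point relevancy scores in [0, 1], same length.
    penalty: hypothetical weight on the excluded concept.
    """
    if len(positive) != len(negative):
        raise ValueError("relevancy lists must have the same length")
    return [max(0.0, p - penalty * n) for p, n in zip(positive, negative)]


# Made-up scores for three scene points: a desk, a toy block, a wall.
wooden = [0.75, 0.5, 0.25]
soft_toy = [0.0, 0.5, 0.0]

fused = fuse_relevancy(wooden, soft_toy)
# Index of the point that best matches "wooden" but not "soft toy".
best = max(range(len(fused)), key=fused.__getitem__)
```

With these made-up scores the toy block is suppressed to zero and the desk (index 0) remains the strongest match.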

Descriptive Query

"Decorative pillow with tree silhouette in cream and brown."

[Figure: LERF (baseline) vs. Ours (IntentFuse)]

Object-only Query

"Desk lamp"

[Figure: LERF (baseline) vs. Ours (IntentFuse)]

BibTeX

@inproceedings{ravendran2025intentfuse,
  title={IntentFuse: Language-Guided 3D Scene Understanding via Prompt Filtering and Fusion},
  author={Ravendran, Ahalya and Perera, Madhawa and Xu, Feng and Petersson, Lars and Wang, Dadong and Li, Xun},
  booktitle={International Conference on Digital Image Computing: Techniques and Applications},
  year={2025}
}