IntentFuse: Language-Guided 3D Scene Understanding via Prompt Filtering and Fusion

Ahalya Ravendran, Madhawa Perera, Feng Xu, Lars Petersson, Dadong Wang, Xun Li

CSIRO Data61

Abstract

IntentFuse is a lightweight middleware layer that grounds natural-language queries in 3D scenes by connecting a compact language model to a pretrained Language Embedded Radiance Field (LERF). It reformulates free-form queries into structured prompts and handles affordances and negations without additional training. Experiments show clear gains over the LERF baseline, enabling intuitive affordance grounding for robotics and AR/VR exploration.

Methodology

Overview of the IntentFuse Query Engine. The Query Evaluator extracts key roles from the natural-language query, the Context Provider resolves ambiguities using scene priors, and the resulting structured output is passed to the LERF engine for precise 3D grounding.
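
To make the idea of a structured prompt concrete, here is a minimal sketch of what the Query Evaluator's output might look like for a query containing a negation. Everything in it is an assumption made for illustration: the `StructuredPrompt` class, the `evaluate_query` function, and the `NEGATION_CUES` list are hypothetical names, and the keyword-based split is a toy stand-in for the compact language model that IntentFuse actually uses. The Context Provider and the LERF engine are not modelled.

```python
import re
from dataclasses import dataclass, field

# Hypothetical cue words that mark the start of a negated phrase.
NEGATION_CUES = ("unlike", "not", "except", "without")


@dataclass
class StructuredPrompt:
    """Illustrative container for the roles extracted from a query."""
    positives: list[str] = field(default_factory=list)  # what to look for
    negatives: list[str] = field(default_factory=list)  # what to exclude


def evaluate_query(query: str) -> StructuredPrompt:
    """Toy stand-in for the Query Evaluator.

    Splits the query at the first negation cue: text before the cue is
    treated as the positive prompt, text after it as the negative prompt.
    A real system would use a language model here, not a regex.
    """
    text = query.strip().rstrip(".").lower()
    pattern = r"\b(" + "|".join(NEGATION_CUES) + r")\b"
    # With a capturing group, re.split keeps the cue: [before, cue, after].
    parts = re.split(pattern, text, maxsplit=1)
    positive = parts[0].strip(" ,")
    negative = parts[2].strip(" ,") if len(parts) == 3 else ""
    return StructuredPrompt(
        positives=[positive] if positive else [],
        negatives=[negative] if negative else [],
    )


if __name__ == "__main__":
    print(evaluate_query("Something wooden, unlike a soft toy."))
    print(evaluate_query("Desk lamp"))
```

Under these assumptions, the query "Something wooden, unlike a soft toy." yields one positive prompt ("something wooden") and one negative prompt ("a soft toy"), while "Desk lamp" yields a single positive prompt and no negatives. Keeping the positive and negative parts in separate fields lets a downstream grounding step treat them differently instead of passing the raw sentence through unchanged.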

Experiments

Affordance Query

"Something to tell time"

Descriptive Query

"Something wooden, unlike a soft toy."

Descriptive Query

"Decorative pillow with tree silhouette in cream and brown."

Object-only Query

"Desk lamp"

BibTeX

@inproceedings{ravendran2025intentfuse,
  title={IntentFuse: Language-Guided 3D Scene Understanding via Prompt Filtering and Fusion},
  author={Ravendran, Ahalya and Perera, Madhawa and Xu, Feng and Petersson, Lars and Wang, Dadong and Li, Xun},
  booktitle={International Conference on Digital Image Computing: Techniques and Applications},
  year={2025}
}