Scene-Aware Urban Design
This paper presents a human-in-the-loop computer vision system that uses Grounding DINO and a subset of ADE20K to detect urban objects and learn their spatial co-occurrence patterns. From these learned co-occurrences, users receive a short list of likely complements to a selected anchor object. A vision–language model then reasons over the scene to suggest a further element, supporting micro-scale interventions and more continuous, locally grounded participation.
Date: 2025
Project Type: Research
Keywords: AR, Urban Planning, VLM
Team: Rodrigo Gallardo, Oz Fishman, Alex Htet Kyaw

The pipeline runs lightweight object detection in the background as scenes are processed. When a scene qualifies for intervention, the user selects an anchor object, which triggers a two-branch decision process. In the statistical branch, the system retrieves urban objects that commonly co-occur with the anchor in the ADE20K-based analysis, and the user chooses one as the complement. In the semantic branch, the anchor–complement pair is passed, together with the scene, to a vision–language model, which proposes up to five additional context-aware objects.
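
The paper ships no reference code, so the sketches below are illustrative rather than the authors' implementation. This first one covers the background detection step, assuming the Hugging Face transformers port of Grounding DINO; the model ID, image path, and label vocabulary are placeholders, and keyword names such as box_threshold vary slightly across transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

MODEL_ID = "IDEA-Research/grounding-dino-tiny"  # assumption: any Grounding DINO checkpoint works here
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForZeroShotObjectDetection.from_pretrained(MODEL_ID)

image = Image.open("street_scene.jpg").convert("RGB")  # placeholder scene image
# Grounding DINO expects lowercase phrases separated by periods.
labels = "bench. street light. trash can. tree. bicycle rack."

inputs = processor(images=image, text=labels, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map raw logits back to (label, score, box) detections at the image's resolution.
detections = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
for label, score, box in zip(detections["labels"], detections["scores"], detections["boxes"]):
    print(label, round(score.item(), 2), [round(v) for v in box.tolist()])
```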

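The statistical branch can be approximated with plain conditional frequencies over per-scene label sets; the paper's co-occurrence embeddings are presumably richer, and the function names and toy scenes here are hypothetical stand-ins for the ADE20K-based analysis.

```python
from collections import Counter, defaultdict

def build_cooccurrence(scenes):
    """Count, for every anchor label, how often each other label shares a scene with it."""
    pair_counts = defaultdict(Counter)
    for labels in scenes:
        for anchor in labels:
            for other in labels:
                if other != anchor:
                    pair_counts[anchor][other] += 1
    return pair_counts

def top_complements(pair_counts, anchor, k=5):
    """Return the k most frequent co-occurrers with the anchor, as conditional frequencies."""
    counts = pair_counts.get(anchor, Counter())
    total = sum(counts.values()) or 1
    return [(obj, n / total) for obj, n in counts.most_common(k)]

# Toy per-scene label sets standing in for the annotated ADE20K subset.
scenes = [
    {"bench", "tree", "street light"},
    {"bench", "trash can", "tree"},
    {"bench", "street light", "bicycle rack"},
]
print(top_complements(build_cooccurrence(scenes), "bench", k=3))
# e.g. [('tree', 0.33), ('street light', 0.33), ('trash can', 0.17)] (tie order may vary)
```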

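For the semantic branch, the paper does not name the vision–language model it uses; this sketch assumes an OpenAI-style multimodal chat endpoint, with gpt-4o as an arbitrary stand-in and suggest_additions as a hypothetical helper that sends the scene image together with the anchor–complement pair.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def suggest_additions(image_path, anchor, complement, k=5):
    """Ask a VLM for up to k further objects that suit the scene, anchor, and chosen complement."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    prompt = (
        f"This street scene contains a {anchor}, and a {complement} will be added near it. "
        f"Suggest up to {k} additional urban objects that suit this specific context, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: the paper does not specify which VLM is used
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    # Strip list markers and numbering, keep at most k non-empty suggestions.
    return [line.strip("-*• 0123456789.") for line in lines if line.strip()][:k]

# e.g. suggest_additions("street_scene.jpg", "bench", "street light")
```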


