Soundscape-to-Landscape

When we hear the sounds of a place, we can often mentally “see” its environment. Can AI simulate this ability?

People experience the world through multiple senses simultaneously, and this multisensory experience shapes our sense of place. The SounDiT project investigates the Geo-contextual Soundscape-to-Landscape (GeoS2L) problem: synthesizing geographically realistic landscape images from environmental soundscapes. Each place is characterized by a unique pairing of visual and acoustic environments: bird calls in a quiet park, bustling downtown traffic at peak hours, the crash of ocean waves at a coastal beach, the ambient hum of a busy urban street, or the rustle of leaves along a forested trail. Understanding such geo-contextual linkages, that is, how visual and acoustic environments co-occur, benefits a broad range of applications in geography, environmental psychology, ecology, and urban planning. SounDiT aims to enable AI to replicate this cognitive process, bridging acoustic environments and visual scene understanding within a geographically consistent framework.

Prior quantitative geography studies have mostly emphasized human visual perception and neglected auditory perception of place, largely because acoustic environments are difficult to characterize vividly. At the same time, prior audio-to-image generation methods typically rely on general-purpose datasets and overlook geographic and environmental contexts, producing unrealistic images that are misaligned with real-world environmental settings. Few studies have brought these two perceptual dimensions, auditory and visual, together in understanding the human sense of place. SounDiT addresses these gaps through a novel geo-contextual computational framework that explicitly integrates geographic knowledge into multimodal generative modeling.

To support GeoS2L research, we construct two large-scale, geo-contextual multimodal datasets. A soundscape denotes the acoustic environment as perceived by humans at a place, comprising sounds from both natural and anthropogenic sources (e.g., bird calls and vehicle noise), while a landscape refers to the geographical environment shaped by natural and built features. SoundingSVI pairs diverse soundscapes with geo-referenced street-view imagery, and SonicUrban pairs soundscapes sourced from manually verified YouTube recordings with real-world landscape images. Together, these datasets cover diverse geographic locations and scene types, from beaches and forests to busy streets and residential neighborhoods, supporting geographically realistic synthesis.
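
For concreteness, a single example in these paired datasets can be pictured as a record like the one sketched below. The sketch only illustrates the soundscape-image pairing described above; the field names, types, and values are our own and do not reflect the released data schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class SoundscapeLandscapePair:
        """One geo-contextual example: a soundscape clip paired with a landscape
        image and its geographic context (all field names are illustrative)."""
        audio_path: str             # path to the soundscape recording (e.g., a short WAV clip)
        image_path: str             # path to the paired landscape or street-view image
        latitude: Optional[float]   # geo-reference, when available
        longitude: Optional[float]
        scene_label: str            # coarse scene type, e.g., "beach", "forest", "urban street"
        source: str                 # e.g., "SoundingSVI" or "SonicUrban"

    # A made-up record for illustration only:
    example = SoundscapeLandscapePair(
        audio_path="audio/clip_000123.wav",
        image_path="images/svi_000123.jpg",
        latitude=30.2849,
        longitude=-97.7341,
        scene_label="urban street",
        source="SoundingSVI",
    )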

We propose the SounDiT model, a novel Diffusion Transformer (DiT)-based generative framework that incorporates geo-contextual scene conditioning to synthesize geographically coherent landscape images. At its core is the SounDiT Block, which integrates multimodal encoders (Soundscape, Scene, and Landscape) with a Time-Aware Prototype Mixture-of-Experts (TA-PMoE) module and a geo-contextual Retrieval-Augmented Generation (RAG) mechanism. Together these components enable the generation of geo-consistent visual scenes that faithfully reflect the acoustic properties of the input soundscape—going beyond generic image generation to produce environments that are geographically meaningful.
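
As a rough sketch of how such conditioning can work, the toy PyTorch block below injects an audio-derived condition into a DiT-style transformer block via cross-attention. It is a simplified illustration under our own assumptions, not the published SounDiT Block: the TA-PMoE module, the retrieval-augmented component, and the actual encoder designs are omitted, and all dimensions and names are ours.

    import torch
    import torch.nn as nn

    class ConditionalDiTBlock(nn.Module):
        """Toy DiT-style block conditioned on soundscape tokens (illustrative only)."""
        def __init__(self, dim: int = 512, heads: int = 8):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm3 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x, cond):
            # x:    noisy image latent tokens, shape (B, N, dim)
            # cond: soundscape/scene condition tokens, shape (B, M, dim)
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h, need_weights=False)[0]          # mix latent tokens
            h = self.norm2(x)
            x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]   # inject the condition
            return x + self.mlp(self.norm3(x))                              # position-wise MLP

    # Toy usage: 256 latent tokens conditioned on 32 audio-derived tokens.
    block = ConditionalDiTBlock()
    latents = torch.randn(2, 256, 512)
    audio_tokens = torch.randn(2, 32, 512)
    out = block(latents, audio_tokens)   # shape (2, 256, 512)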

Standard image generation metrics such as FID may fall short of capturing the geographic and semantic alignment between input soundscapes and generated images. To address this, we propose a practically informed, geo-contextual evaluation framework, the Place Similarity Score (PSS), which comprises three metrics: element-level (individual scene components), scene-level (overall composition), and human perception-level (subjective sense of place). Extensive experiments demonstrate that SounDiT outperforms existing baselines in both visual fidelity and geographic consistency. This work not only establishes foundational benchmarks for GeoS2L generation but also highlights the importance of incorporating geographic domain knowledge into multimodal generative models, opening new directions at the intersection of generative AI, geography, urban planning, and environmental sciences, and advancing the frontiers of Human-Centered GeoAI.
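
As a purely illustrative sketch of how three such levels could be combined into a single score, the function below averages element-level, scene-level, and human perception-level sub-scores. The weighting and aggregation here are our own placeholder, not the PSS formulation from the paper.

    def place_similarity_score(element: float, scene: float, perception: float,
                               weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
        """Illustrative composite of three evaluation levels, each assumed in [0, 1].
        This is a placeholder aggregation, not the paper's PSS definition."""
        w_e, w_s, w_p = weights
        return w_e * element + w_s * scene + w_p * perception

    # Example: strong element- and scene-level agreement, moderate perceived sense of place.
    print(round(place_similarity_score(0.82, 0.76, 0.61), 2))   # ~0.73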

References

Wang, J., Tan, H., Liao, B., Jiang, A., Fei, T., Huang, Q., Tu, Z., Ye, S., & Kang, Y. (2025). SounDiT: Geo-Contextual Soundscape-to-Landscape Generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025).
