Environmental acoustic intelligence through sound event localization and detection: a review

Yeow, Jun-Wei; Tan, Ee-Leng; Peksi, Santi; Gan, Woon-Seng

doi:10.1038/s44384-025-00036-3

Article
Open access
Published: 05 December 2025

npj Acoustics volume 1, Article number: 31 (2025)

Abstract

Sound Event Localization and Detection (SELD) is a critical capability for environmental acoustic intelligence, enabling systems to jointly identify what sounds are active and where they originate. This technology is foundational for applications ranging from smart-city monitoring and autonomous systems to immersive media. Modern SELD research is dominated by deep learning approaches that leverage multi-channel audio, explicit spatial representations, and unified output formats that resolve complex and densely polyphonic scenes. This review provides a comprehensive synthesis of the field, charting its progress from foundational concepts to the state-of-the-art. We systematically analyze key methodological advancements, spatially-informed feature engineering, and the sophisticated data augmentation pipelines that underpin top-performing systems on public benchmarks. This review also highlights emerging opportunities for future research such as distance-aware 3-D SELD and the advancement of data-efficient learning paradigms. By translating these challenges into concrete research directions, this work aims to accelerate the progress of SELD toward robust, field-ready environmental intelligence tools.

Introduction

In today’s digital age, the proliferation of low-cost sensors and inter-connected devices has created an unprecedented volume of audio data, capturing the rich, multifaceted soundscapes of daily life. These acoustic environments are dense with information, offering valuable cues about human activities, social interactions, mechanical operations, and potential environmental hazards¹. While humans possess an innate ability to discern not only the nature of a sound but also its spatial origin^2,3 and approximate distance⁴, manually analyzing this vast and continuous stream of audio is fundamentally impractical. This challenge has catalyzed a paradigm shift toward intelligent, automated systems for interpreting our acoustic world, a field we term Environmental Acoustic Intelligence. The technical foundation for this intelligence is the domain of computational auditory scene analysis², which seeks to imbue machines with human-like auditory perception.

A cornerstone of Environmental Acoustic Intelligence is the field of Sound Event Localization and Detection (SELD), a critical technology that automates the process of identifying what sounds are present in an environment and where they originate. It accomplishes this by fusing two complementary tasks into a single, cohesive framework: Sound Event Detection (SED) and Direction-of-Arrival Estimation (DOAE). The first task, SED, addresses the “what” by identifying the class of a sound event (e.g., “car horn” or “music”) and its precise onset and offset times. In contrast, DOAE addresses the “where” by estimating the spatial origin of a sound source relative to the recording device, typically in terms of azimuth and elevation. While useful in isolation, integrating these two capabilities allows SELD systems to provide a comprehensive spatiotemporal understanding of an auditory scene⁵. Unlike traditional methods which are often limited to localizing specific source types such as speech or drone noises⁶, SELD distinguishes itself by its ability to simultaneously detect and localize a diverse array of overlapping sound events.

As illustrated in Fig. 1, a SELD system can monitor a complex acoustic environment, such as a busy street, and concurrently identify a blaring alarm on the right while tracking pedestrian activity on the left. This ability to generate a comprehensive spatiotemporal understanding of an acoustic scene is invaluable for a wide range of applications; incorporating SELD can enhance situational awareness in autonomous robotics⁷, enable smarter surveillance systems in dense urban environments^8,9,10, and create deeply immersive experiences in virtual and augmented reality⁵. Despite these benefits, however, computationally replicating the proficiency of the human auditory system in parsing such complex scenes remains a formidable scientific challenge^2,3.

**Fig. 1: Conceptual illustration of a SELD system.**

While substantial progress has been made in the independent domains of SED and DOAE, their integration into a single framework introduces unique complexities. Specifically, SELD not only inherits the challenges of its constituent tasks but also confronts the critical data association problem: accurately linking multiple concurrent sound events to their respective spatial origins⁵. This associative challenge is exacerbated in real-world settings characterized by overlapping sources, reverberation, and dynamic background noise¹¹.

In recent years, the advent of deep learning has driven remarkable progress in the field, with deep learning methods setting new standards on many benchmark tasks^3,12. This rapid evolution has led to a diverse and increasingly specialized body of work, creating a clear need for a comprehensive synthesis to map the current state-of-the-art. This review addresses this need by providing the first holistic survey of the deep learning-centric methodologies that define modern SELD research. We systematically outline methodological breakthroughs and persistent limitations across the entire SELD pipeline, including advances in neural network architectures, benchmark datasets, feature engineering strategies, and training paradigms.

This comprehensive, task-specific scope distinguishes our work from previous literature. Most prior work has surveyed the constituent sub-tasks in isolation, providing valuable in-depth analyses of either SED^13,14,15 or DOAE^6,16. While some reviews have addressed the joint SELD problem, their scope has been limited. For instance, some offer only a brief overview of SELD while maintaining a primary focus on DOAE¹⁷ or a narrow selection of early model architectures¹⁸. Perhaps the most relevant prior work is the overview by Politis et al.¹⁹, which provided a foundational analysis of the first Detection and Classification of Acoustic Scenes and Events (DCASE) challenge task on SELD in 2019. However, the field has advanced at a remarkable pace since that publication, with significant breakthroughs in model architectures, feature engineering, and output formats. In contrast, this review synthesizes these recent, multi-faceted developments across the entire SELD pipeline to provide a current and comprehensive map of the state-of-the-art.

By offering insights into both the state-of-the-art and ongoing challenges, we aim to guide researchers towards meaningful future directions, ultimately facilitating the development of more robust and deployable SELD systems for enhanced environmental acoustic intelligence. Although promising efforts are beginning to address multi-modal²⁰ and 3-D SELD with distance estimation²¹, these fields are still in their formative stages and therefore lie beyond the scope of this review.

Sound event localization and detection

At its core, SELD is a machine listening framework designed to emulate the human capacity for parsing complex acoustic environments. To achieve this, SELD systems must simultaneously answer two fundamental questions: what sound events are occurring, and where are they originating? This is accomplished by integrating two traditionally separate yet interdependent tasks into a unified model: polyphonic SED and high-resolution DOAE.

The first task, SED, entails understanding the temporal and semantic aspects of an auditory scene. Its objective is to identify the class of each active sound event and determine its precise onset and offset times. While SED can be monophonic, where the most dominant event is identified, real-world applications demand polyphonic capability, where multiple event classes can be detected simultaneously within the same time frame²². Polyphonic SED, while more challenging, is essential, as natural acoustic environments are rarely composed of isolated, sequential sounds^13,15.
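
As a rough illustration of what polyphonic labels look like in practice, the sketch below rasterizes a list of annotated events into a frame-by-class activity matrix. The 100 ms hop, the class list, and the helper name `events_to_frames` are assumptions made for this example and are not taken from any particular dataset.

```python
import numpy as np

def events_to_frames(events, classes, clip_seconds, hop_seconds=0.1):
    """Rasterize (label, onset, offset) events into a frames x classes 0/1 matrix."""
    n_frames = int(round(clip_seconds / hop_seconds))
    activity = np.zeros((n_frames, len(classes)), dtype=np.int8)
    for label, onset, offset in events:
        start = int(np.floor(onset / hop_seconds))
        stop = min(n_frames, int(np.ceil(offset / hop_seconds)))
        # Mark every frame that the event touches, even partially.
        activity[start:stop, classes.index(label)] = 1
    return activity

# Hypothetical annotations: music plays throughout, a car horn overlaps it briefly.
classes = ["car horn", "music", "speech"]
events = [("music", 0.0, 1.0), ("car horn", 0.45, 0.62)]
activity = events_to_frames(events, classes, clip_seconds=1.0)

# Frames where more than one class is active are the polyphonic ones.
polyphonic_frames = np.where(activity.sum(axis=1) > 1)[0]
```

A monophonic system would have to pick a single class in the overlapping frames, whereas a polyphonic one predicts each column independently.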

The second task, DOAE, also commonly known as Sound Source Localization, provides the spatial context for an acoustic scene. Using multi-channel audio, DOAE methods estimate the spatial origin of a sound source, typically represented by its azimuth (left-right) and elevation (up-down) angles relative to the recording device^5,17. By itself, DOAE can determine the location of acoustic activity but offers little information about the identity of the sound source, perfectly complementing the semantic information provided by SED.
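
To make the angular representation concrete, the following sketch converts an azimuth and elevation pair into a Cartesian unit vector and measures the angle between two directions. It assumes azimuth is measured counter-clockwise from the front (x) axis and elevation upward from the horizontal plane; other conventions exist, and the function names are illustrative only.

```python
import numpy as np

def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Map (azimuth, elevation) in degrees to a unit vector (x, y, z)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    return np.array([np.cos(az) * np.cos(el),
                     np.sin(az) * np.cos(el),
                     np.sin(el)])

def angular_error_deg(doa_a, doa_b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs."""
    cos_angle = np.dot(doa_to_unit_vector(*doa_a), doa_to_unit_vector(*doa_b))
    # Clip guards against rounding pushing the cosine slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

front_vs_left = angular_error_deg((0.0, 0.0), (90.0, 0.0))
front_vs_up = angular_error_deg((0.0, 0.0), (0.0, 90.0))
```

Working with unit vectors avoids the wrap-around at ±180° azimuth that makes direct differences of angles misleading.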

While SED and DOAE offer valuable insights independently, their integration within the SELD framework unlocks a much richer understanding of the environment. The central challenge of SELD is not merely executing these tasks in parallel, but correctly linking each detected sound with its precise spatial origin. This becomes particularly difficult in acoustically complex scenes with multiple overlapping sound sources. This complete spatiotemporal representation is critical for a growing number of applications across industrial, commercial, and scientific domains^19,23. These applications demonstrate the core value of Environmental Acoustic Intelligence, where raw audio is translated into actionable, spatially-aware insights. For instance:

Public Safety and Surveillance: Accurately linking detections of “gunshots” or “alarms” with precise spatial coordinates can significantly enhance public safety monitoring and policing efforts^24,25. Urban security systems can dispatch responders more effectively without relying on cameras, which may be cost-prohibitive and raise privacy concerns⁸.
Autonomous Vehicles: Autonomous vehicles typically rely on visual or LiDAR sensors for context, which can fail in low light or poor weather. Incorporating SELD enables vehicles to detect and localize non-line-of-sight warning sounds such as sirens or horns, substantially improving situational awareness and decision-making in complex traffic scenarios²⁶.
Biodiversity Monitoring: Integrating SELD can automate non-invasive acoustic surveys of wildlife. By detecting and localizing animal calls, these systems can help track populations over vast habitats, greatly reducing the need for costly and labor-intensive manual fieldwork²⁷.

Initial research into SELD treated the task as a modular pipeline, often combining separate algorithms for each sub-task. For instance, early systems first performed event detection using machine learning methods such as Hidden Markov Models (HMMs)^28,29 or Support Vector Machines³⁰. Parametric methods, such as the Steered Response Power with Phase Transform (SRP-PHAT), then handled localization^28,29. While foundational, these early frameworks struggled to associate detected sound events with their corresponding DOAs³¹ and were generally incapable of scaling to complex scenarios with overlapping sound sources⁵.

The challenge of parsing these acoustic scenes has been met with the transformative power of deep learning³. Deep neural networks (DNNs) are exceptionally well-suited to the task, as they can learn the intricate spectral and spatial patterns directly from multi-channel audio. This has led to the development of robust end-to-end solutions that perform well even in challenging acoustic conditions⁵, making deep learning the dominant framework for SELD. Accordingly, this review focuses on deep learning-based methods for SELD, given their demonstrated prowess in localization and detection³².

A pivotal catalyst in this research domain has been the DCASE challenges³³, which introduced a dedicated SELD task in 2019. The DCASE community has been instrumental in fostering a collaborative research ecosystem, providing widely adopted public datasets^31,34,35, open-source baseline systems^5,21, and unified evaluation metrics¹⁹.

Table 1 provides a timeline that summarizes key methodological breakthroughs that have defined contemporary SELD research. Furthermore, the confluence of community-driven benchmarking and deep learning techniques has given rise to a generic SELD pipeline, as depicted in Fig. 2. The process typically involves five key stages:

1. Data Collection: Acquiring multi-channel audio recordings paired with precise temporal and spatial annotations for all sound events of interest.
2. Feature Extraction: Transforming raw audio signals into robust time-frequency (e.g., log-Mel spectrograms) and spatial representations (e.g., intensity vectors). Data augmentation is often applied at this stage to increase the diversity of the training set.
3. Neural Network Inference: Training a DNN, commonly a Convolutional Recurrent Neural Network (CRNN), to jointly predict event activity probabilities and their corresponding DOAs.
4. Output Formatting: Structuring the raw predictions of the network into a human-readable format that provides frame-by-frame classifications of sound events alongside their estimated spatial coordinates.
5. Evaluation: Systematically assessing model performance using established evaluation metrics to guide iterative improvements and benchmark against the state-of-the-art.

Table 1 A timeline of selected methodological milestones in deep learning-based SELD

Deep learning

Deep learning is the driving force behind modern SELD systems. As a subfield of machine learning, it enables models to learn complex, high-dimensional patterns directly from raw or minimally processed data¹², making it particularly suited to the intricacies of audio analysis. The success of deep learning in SELD has been fueled by advancements in computational power, the availability of large-scale datasets, and the development of innovative network architectures that have revolutionized fields from medical imaging^36,37 to materials analysis³⁸. This section introduces the foundational concepts and neural network architectures that underpin current SELD systems.

Fundamentally, deep learning relies on artificial neural networks, which draw inspiration from the structure of the human brain³⁹. These networks consist of multiple interconnected layers of “neurons”, where each layer applies a learnable transformation to its inputs. By stacking these layers, the network progressively extracts increasingly abstract features, learning a rich data hierarchy. While early neural networks were relatively shallow, researchers discovered that increasing network depth and incorporating specialized modules allowed models to capture far more intricate patterns⁴⁰, leading to the highly expressive and accurate models seen today.

Different classes of specialized architectures have emerged, each tailored to specific data structures^12,38,41. For tasks involving grid-like data, such as time-frequency representations of audio (spectrograms), Convolutional Neural Networks (CNNs) are especially effective⁴². In a CNN, learnable filters “slide” across the input spectrograms to capture local patterns, such as spectral shape and temporal patterns. This property makes CNNs a highly effective front-end for SELD, ideal for extracting the foundational spectral cues needed for event detection and localization.

In contrast, Recurrent Neural Networks (RNNs) are designed to model sequential data and temporal dependencies. By maintaining internal “memory” or hidden states across time steps, RNNs excel at capturing sequential patterns in time-series data. Variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) are particularly adept at preserving long-range temporal information¹². This temporal modeling capacity is crucial for SELD, where understanding the onset, duration, and co-occurrence of events over time is essential for accurate detection and localization.

Recognizing the need to model both local features and long-term temporal structure, hybrid architectures that combine CNN and RNN components have become commonplace in SELD. In particular, the CRNN has become the dominant architecture for the task, stacking convolutional layers to capture spectral and spatial features, followed by recurrent layers to model their temporal dynamics^43,44,45. More recently, attention-based models such as the Transformer⁴⁶ and its audio-specialized variants (e.g., Conformer⁴⁷) have emerged as compelling alternatives. These systems use self-attention mechanisms to capture global dependencies, often demonstrating superior performance in modeling long-range context in audio-related tasks.

In modern SELD research, CRNN-based architectures continue to form the foundation for many competitive SELD systems⁴⁸. Fig. 3 illustrates a generic CRNN pipeline adapted from the first publicly available SELD baseline proposed by Adavanne et al.⁵. The system begins by converting multi-channel audio into time-frequency representations. The convolutional layers then act as a powerful feature extractor, identifying localized spectro-temporal patterns and creating a sequence of feature vectors (one for each time frame). These vectors are then fed into the recurrent layers, which integrate contextual information across time to produce enriched sequential representations. Finally, fully connected output layers interpret these time-distributed representations to yield frame-wise predictions for both event activity and corresponding spatial location. This architectural blueprint has served as a springboard for a wide range of more complex and powerful SELD systems.

**Fig. 3: Block diagram of a generic CRNN architecture.**
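
The data flow of such a CRNN can be traced with a toy forward pass. The sketch below is only a shape walk-through under several assumptions: weights are random, a plain tanh recurrence stands in for the GRU or LSTM layers, and the layer sizes, the 7-channel input, and the 13 classes are hypothetical rather than those of any published baseline.

```python
import numpy as np

def conv2d_same(x, kernels):
    """3x3 'same' convolution. x: (c_in, frames, freq); kernels: (c_out, c_in, 3, 3)."""
    _, n_frames, n_freq = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernels.shape[0], n_frames, n_freq))
    for dt in range(3):
        for df in range(3):
            patch = padded[:, dt:dt + n_frames, df:df + n_freq]
            out += np.einsum("oc,ctf->otf", kernels[:, :, dt, df], patch)
    return out

def simple_rnn(sequence, w_in, w_rec):
    """Plain tanh recurrence over time; returns one hidden state per frame."""
    hidden = np.zeros(w_rec.shape[0])
    states = []
    for x_t in sequence:
        hidden = np.tanh(w_in @ x_t + w_rec @ hidden)
        states.append(hidden)
    return np.array(states)

def crnn_forward(features, params, pool=8):
    conv = np.maximum(conv2d_same(features, params["kernels"]), 0.0)  # ReLU
    c, t, f = conv.shape
    pooled = conv.reshape(c, t, f // pool, pool).max(axis=-1)   # pool over frequency only
    sequence = pooled.transpose(1, 0, 2).reshape(t, -1)         # one vector per frame
    states = simple_rnn(sequence, params["w_in"], params["w_rec"])
    sed = 1.0 / (1.0 + np.exp(-(states @ params["w_sed"])))     # per-class activity
    doa = np.tanh(states @ params["w_doa"]).reshape(t, -1, 3)   # per-class (x, y, z)
    return sed, doa

rng = np.random.default_rng(0)
n_channels, n_frames, n_freq, n_classes, n_filters, n_hidden = 7, 20, 64, 13, 8, 32
params = {
    "kernels": 0.1 * rng.standard_normal((n_filters, n_channels, 3, 3)),
    "w_in": 0.1 * rng.standard_normal((n_hidden, n_filters * n_freq // 8)),
    "w_rec": 0.1 * rng.standard_normal((n_hidden, n_hidden)),
    "w_sed": 0.1 * rng.standard_normal((n_hidden, n_classes)),
    "w_doa": 0.1 * rng.standard_normal((n_hidden, 3 * n_classes)),
}
features = rng.standard_normal((n_channels, n_frames, n_freq))
sed, doa = crnn_forward(features, params)
```

Pooling only along frequency keeps the frame rate intact, so the output heads still produce one prediction per input frame.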

Audio formats

Accurately localizing sound sources hinges on decoding spatial cues such as inter-channel time delays, level differences, and phase relationships. These cues are physically encoded in the sound waves arriving at multiple points in space and are inherently lost in single-channel recordings⁶. This makes multi-channel audio an indispensable prerequisite for any state-of-the-art SELD system⁴⁹. While numerous multi-channel configurations exist, SELD research has predominantly centered on two principal four-channel recording formats: First-Order Ambisonics (FOA) and tetrahedral or multi-channel microphone arrays (MIC).

The FOA audio format provides a holistic representation of the sound field by employing a mathematical technique known as spherical harmonic decomposition to represent the complete 360° acoustic scene surrounding the microphone⁵⁰. In practice, a spherical microphone array, such as the Eigenmike em32 shown in Fig. 4, captures sound from multiple directions. An encoding matrix then transforms these raw signals into the four-channel FOA format, often labeled W, X, Y, and Z. The omnidirectional W channel represents the sound pressure, or what a single microphone would hear, while the three figure-of-eight X, Y, and Z channels represent the Cartesian particle velocity⁴⁵. This provides a physically meaningful description of the sound field that has proven highly effective for deep learning models. Consequently, SELD systems utilizing the FOA format consistently achieve state-of-the-art performance on public benchmarks^32,51.

**Fig. 4: Eigenmike em32 by mh Acoustics.**
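
The meaning of the four channels can be illustrated with an idealized encoder for a single far-field source. This is a sketch only: it assumes W, X, Y, Z ordering with unit gain on W and direction cosines on the other channels, whereas practical Ambisonic formats differ in channel order and normalization, and real recordings are obtained from array signals through an encoding matrix rather than from a known direction.

```python
import numpy as np

def encode_foa(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal arriving from one direction into (W, X, Y, Z)."""
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional pressure
        np.cos(az) * np.cos(el),  # X: front-back figure-of-eight
        np.sin(az) * np.cos(el),  # Y: left-right figure-of-eight
        np.sin(el),               # Z: up-down figure-of-eight
    ])
    return gains[:, None] * signal[None, :]

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
# A source at azimuth 90 degrees, elevation 0: only W and Y carry signal.
foa = encode_foa(source, azimuth_deg=90.0, elevation_deg=0.0)
```

Under these assumptions the direction of a single source is fully described by the relative gains of X, Y, and Z with respect to W.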

The MIC format, by contrast, is a more direct, hardware-level approach. It retains the raw, unprocessed signals from microphone capsules placed at the vertices of a known geometry, typically a regular tetrahedron. Lacking an explicit spherical harmonic conversion, the MIC format captures spatial information purely through the raw inter-channel phase and level differences. By eliminating the FOA encoding stage, the MIC setup simplifies the front-end processing pipeline and can reduce computational latency, making it an attractive choice for resource-constrained applications⁵². Furthermore, because they operate directly on raw signals, MIC-based systems are not tied to specific array geometries and can support more flexible arrangements, facilitating highly adaptable, real-world deployments.

While FOA and MIC remain the most studied formats, interest is rapidly expanding to other geometries driven by practical applications. This trend signals a move to bring SELD from controlled laboratory settings into everyday devices. For instance, binaural recordings, captured from microphones placed at the ears of a human or dummy head, closely approximate human spatial hearing and align with the growing demand for immersive audio in wearable devices and virtual reality^53,54,55. Moreover, circular arrays^56,57,58 and microphones embedded in common devices such as mobile phones⁵⁹ are also being actively investigated as practical and cost-effective solutions for scalable outdoor surveillance and monitoring platforms⁸.

Each recording format offers a distinct trade-off between representational richness, computational cost, and hardware practicality. The choice of format, therefore, depends heavily on the target application, intended device, and available resources. Regardless of the specific arrangement, however, multi-channel audio remains the essential foundation for capturing the spatial characteristics necessary for effective SELD.

Available datasets

The performance of deep learning models fundamentally depends on extensive, strongly labeled training data. For SELD, this necessitates datasets with precise temporal annotations (onsets and offsets) for each sound event, coupled with their corresponding DOAs. While the curation of such datasets is a labor-intensive process, they are the cornerstone of model development, benchmarking, and validation. Table 2 presents a comprehensive overview of widely used SELD datasets, summarizing key attributes such as recording format, signal-to-noise ratios (SNRs), and total training duration.

Table 2 Comparison of widely used datasets for SELD research

Synthetic vs. real-world data

The landscape of SELD datasets can be broadly categorized by data origin into either synthetic or real-world datasets. Early research relied almost exclusively on synthetic mixtures generated by convolving isolated audio recordings with Room Impulse Responses (RIRs). These RIRs can be derived from acoustic simulations to create fully synthetic datasets, or from in-situ measurements in real rooms to create hybrid datasets⁶⁰. This synthesis paradigm offers precise control over all acoustic parameters, including event types, locations, and SNRs. However, synthetic data often fails to capture the full acoustic complexity and subtle nuances of real sound scenes, which can lead to a performance gap when models are deployed in real-world conditions⁶¹.
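
The synthesis recipe described above can be sketched in a few lines: convolve a dry event with one RIR per channel, then add noise scaled to a target SNR. The RIRs here are toy impulses with a single echo, invented for illustration, and the helper names are hypothetical; real pipelines use simulated or measured responses.

```python
import numpy as np

def spatialize(dry, rirs):
    """Convolve a mono event (n,) with per-channel RIRs (channels, rir_len)."""
    return np.stack([np.convolve(dry, rir) for rir in rirs])

def add_noise_at_snr(clean, snr_db, rng):
    """Add white noise scaled so the overall signal-to-noise ratio equals snr_db."""
    noise = rng.standard_normal(clean.shape)
    gain = np.sqrt(np.mean(clean ** 2) / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise, gain * noise

rng = np.random.default_rng(0)
dry_event = rng.standard_normal(2000)

# Toy 4-channel RIRs: a direct path at a channel-dependent delay plus one weaker echo.
rirs = np.zeros((4, 256))
for ch in range(4):
    rirs[ch, 10 + ch] = 1.0
    rirs[ch, 120 + 3 * ch] = 0.4

wet = spatialize(dry_event, rirs)
mixture, noise = add_noise_at_snr(wet, snr_db=20.0, rng=rng)
measured_snr_db = 10.0 * np.log10(np.mean(wet ** 2) / np.mean(noise ** 2))
```

Because the event position and SNR are chosen by the script, the spatial and temporal labels come for free, which is exactly the control that real recordings lack.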

In contrast, fully real-world datasets are captured directly in authentic acoustic environments with genuine sound sources. These datasets represent the gold standard for model generalization, as they contain the natural acoustic variability and sound event activity that models will ultimately face in deployment. The primary challenge, however, lies in the extensive and costly manual annotation needed to obtain the accurate spatiotemporal labels required for model training³².

Key benchmark datasets

A series of notable benchmark datasets has been instrumental in steering the direction and progress of SELD research. Starting in 2019, the Tampere University (TAU) datasets marked the beginning of standardized SELD benchmarking. The initial TAU Spatial Sound Events (TSSE) 2019 dataset featured stationary sources in hybrid acoustic scenes³¹. This was succeeded by the TAU-NIGENS Spatial Sound Events (TNSSE) 2020 and 2021 datasets, which increased realism by introducing moving sound sources³⁴ and unknown directional interferences³⁵, respectively, pushing models to handle more dynamic and cluttered sound scenes.

Parallel to these efforts, the Learning 3D Audio Sources (L3DAS) Challenge provided its own series of dedicated SELD datasets. The L3DAS21 and L3DAS22 corpora were hybrid datasets focused on FOA audio within a single office environment^62,63. The final L3DAS23 dataset explored multi-modal SELD by introducing simulated visual data, extending the task into audio-visual contexts⁶⁴.

A critical shift towards greater realism was signaled by the introduction of manually annotated real-world datasets. Early examples include the SECL-UMons⁵⁶ and AVECL-UMons⁵⁸ datasets, which provided real-world audio and audio-visual recordings, respectively. However, they were limited by discrete source angles and the use of a 2-D circular microphone array, which prevented elevation estimation.

This gap was decisively addressed by the Sony-TAU Realistic Spatial Soundscapes (STARSS) datasets. The STARSS22 dataset provided the first large-scale, real-world SELD dataset featuring genuine human actors recorded in a diversity of real rooms⁶⁵, forming the basis for the DCASE 2022 Challenge. Subsequently, the extended STARSS23 dataset enriched this by including synchronized 360° video recordings³², further advancing audio-visual SELD research²⁰. These comprehensive, real-world datasets have been invaluable in bridging the gap between controlled synthetic experiments and the complexities of natural acoustic environments.

Growing diversity of recording setups

Recent data collection efforts reflect a growing trend toward exploring more diverse and practical recording conditions beyond stationary indoor arrays. This is crucial for deploying SELD systems on consumer devices for various real-world applications. For instance, Nagatomo et al.⁵⁴ proposed the WearableSELD dataset, which captured spatial RIRs using microphones distributed across a head-and-torso simulator and various wearable items (e.g., earphones, glasses). Furthering this, Yasuda et al.⁵⁵ introduced the 6 Degrees of Freedom (6DoF) dataset, which incorporated motion-tracking data from wearable sensors, simulating truly dynamic scenarios where both the user and the sound sources can move.

The DCASE Challenge 2025 signals another pivotal shift by focusing on SELD using stereo audio, moving away from the specialized four-channel FOA and MIC formats⁶⁶. For this task, the FOA audio from the STARSS23 dataset is converted into a stereo format using mid-side conversion⁵³. This initiative directly addresses the need to make SELD technology more accessible and applicable to the vast ecosystem of consumer electronics, which predominantly feature stereo microphone systems.
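
A mid-side style conversion from FOA to stereo can be sketched as follows, assuming the omnidirectional W channel acts as the mid signal and the left-right Y channel as the side signal. The channel choice, signs, and scaling are assumptions for illustration; the exact recipe used to prepare the challenge data is not given here.

```python
import numpy as np

def foa_to_stereo(foa_wxyz):
    """Derive (left, right) from FOA ordered as (W, X, Y, Z)."""
    w, y = foa_wxyz[0], foa_wxyz[2]
    left = w + y    # mid + side
    right = w - y   # mid - side
    return np.stack([left, right])

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
# Idealized source hard left under the assumed convention: Y equals W, X and Z are zero.
foa_left = np.stack([source, np.zeros(1000), source, np.zeros(1000)])
stereo = foa_to_stereo(foa_left)
```

Since X and Z are discarded in this sketch, the resulting stereo pair cannot distinguish front from back or encode elevation through level differences alone.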

Despite these innovations, data collection has predominantly focused on indoor environments, leaving outdoor settings comparatively unexplored⁶⁷. The UNS-Exterior Spatial Sound Events (UNS-ESSE2023) dataset is a notable exception, explicitly tailored for outdoor urban scenes, albeit with a limited number of event classes⁵⁷. This highlights a significant opportunity for future research to create datasets that capture the unique acoustic challenges of outdoor applications.

Collectively, the evolution of these datasets charts a clear trajectory: from carefully parameterized synthetic scenes toward large-scale, and potentially multi-modal, recordings of everyday acoustic environments. The selection of an appropriate corpus is therefore a critical decision, dictated by the intended deployment domain, target microphone geometry, and the degree of realism required for robust model evaluation.

Input features

The design of input features directly influences the ability of a SELD model to accurately detect and localize sound events. Although some studies have explored end-to-end learning directly from raw audio waveforms⁶⁸, the predominant and most successful approach in SELD involves transforming the raw multi-channel audio into structured, informative representations⁴⁹. This explicit feature engineering serves two primary purposes: it reduces the dimensionality of the input, making learning more computationally efficient, and it embeds physically meaningful acoustic and spatial cues into the representation. Consequently, this process is integral to the success of state-of-the-art systems. Feature sets are typically composed of two key components: a foundational time-frequency spectral representation and explicit directional cues.

Foundational time-frequency representations

Time-frequency representations, or spectrograms, are the standard foundation for audio analysis in SELD due to their robustness in characterizing acoustic signals. A spectrogram visualizes the distribution of the energy of the signal across time and frequency. Initial SELD research employed multi-channel magnitude and phase spectrograms, which effectively preserve both the energy fluctuations of events and the raw spatial phase relationships between microphone channels^5,62. These early representations are computationally efficient and relatively invariant to specific microphone array geometries⁴⁴, enhancing model generalizability.

Contemporary approaches, however, have largely converged on using perceptually-inspired representations that better align with human auditory processing, with the most widely adopted being the log-Mel spectrogram. The Mel scale mimics human hearing by allocating higher resolution to lower frequencies, where human perception is more sensitive, while also compressing higher frequencies. This results in lower-dimensional features, which reduces computational complexity and can improve model generalization by focusing on perceptually relevant frequency bands⁶⁹. Nevertheless, this non-linear frequency scaling presents a critical trade-off: spatial cues from different narrow bands are merged into a single Mel band, potentially hindering DOAE performance in scenarios with multiple overlapping sources⁵¹.
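
The merging of narrow bands can be seen directly in a Mel filterbank. The sketch below builds triangular filters using the common 2595·log10(1 + f/700) Mel formula and counts how many linear-frequency bins feed each band. The 24 kHz sample rate, 1024-point FFT, and 64 bands are arbitrary choices for the example, and established audio libraries may differ in normalization details.

```python
import numpy as np

def hz_to_mel(freq_hz):
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, shape (n_mels, n_fft // 2 + 1), evenly spaced on the Mel scale."""
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        low, center, high = edges_hz[m:m + 3]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel(power_spec, bank, eps=1e-10):
    """power_spec: (frames, bins) -> log-Mel features (frames, n_mels)."""
    return np.log(power_spec @ bank.T + eps)

mel_fb = mel_filterbank(n_mels=64, n_fft=1024, sample_rate=24000)
bins_per_band = (mel_fb > 0).sum(axis=1)   # linear bins merged into each Mel band

rng = np.random.default_rng(0)
power_spec = rng.random((10, 513))         # stand-in for a real power spectrogram
features = log_mel(power_spec, mel_fb)
```

Inspecting `bins_per_band` shows the top bands pooling far more linear bins than the bottom ones, which is where spatial cues from different narrow bands get averaged together.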

Incorporating explicit directional cues

For high-resolution localization, the implicit spatial information contained within multi-channel spectrograms is often insufficient. The contemporary approach is therefore to augment the foundational spectral representations with explicit directional features designed to embed spatial information. The choice of these spatial features is typically dictated by the audio recording format.

For FOA recordings, Intensity Vectors (IVs) are a common choice. As proposed by Perotin et al.⁴⁵, these vectors are derived from the Ambisonic channels and describe the directional flow of acoustic energy at each time-frequency bin. Because the DOA of a sound source points opposite to the direction of this energy flow, IVs provide an effective and physically-grounded spatial cue.
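
A minimal sketch of an intensity-vector computation is given below. It assumes FOA channels ordered W, X, Y, Z in which X, Y, Z carry the direction cosines of the source, so that Re{W* · [X, Y, Z]} points toward the source; with a different sign convention relating these channels to particle velocity, the vector would instead point along the energy flow and need negating. The small STFT helper and function names are illustrative.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT, shape (n_fft // 2 + 1, frames)."""
    window = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(window * x[s:s + n_fft]) for s in starts]).T

def intensity_vectors(foa_wxyz, n_fft=512, hop=256):
    """Per-bin vector Re{conj(W) * [X, Y, Z]}, shape (3, bins, frames)."""
    w, x, y, z = (stft(channel, n_fft, hop) for channel in foa_wxyz)
    return np.real(np.conj(w)[None] * np.stack([x, y, z]))

def doa_from_intensity(iv):
    """Sum the vectors over all bins and frames and read off azimuth and elevation."""
    vx, vy, vz = iv.sum(axis=(1, 2))
    azimuth = np.degrees(np.arctan2(vy, vx))
    elevation = np.degrees(np.arctan2(vz, np.hypot(vx, vy)))
    return float(azimuth), float(elevation)

# Idealized single source at azimuth 60 degrees, elevation 20 degrees.
rng = np.random.default_rng(0)
source = rng.standard_normal(8000)
az, el = np.deg2rad(60.0), np.deg2rad(20.0)
gains = np.array([1.0, np.cos(az) * np.cos(el), np.sin(az) * np.cos(el), np.sin(el)])
foa = gains[:, None] * source[None, :]

iv = intensity_vectors(foa)
est_azimuth, est_elevation = doa_from_intensity(iv)
```

In SELD systems the per-bin vectors are typically fed to the network as extra input channels rather than summed; the summation here is only a sanity check on the geometry.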

Conversely, for MIC recordings, features based on the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) are widely used. The GCC-PHAT is a robust estimator of the time difference of arrival (TDoA) between pairs of microphones, providing an effective geometric cue for inferring sound source direction^19,70. However, these phase-based features are susceptible to degradation from noise and reverberation in real-world environments, a key limitation that has prompted the development of more advanced spatial features⁵¹.
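
The following sketch shows one way GCC-PHAT can estimate a TDoA for a single microphone pair. It works on whole signals for brevity, whereas SELD front-ends usually compute it per frame and keep the correlation sequence itself as a feature; the 16 kHz sample rate, 1 ms search range, and small regularization constant are assumptions for the example.

```python
import numpy as np

def gcc_phat(sig, ref, sample_rate, max_delay_s=None):
    """Return (tdoa_seconds, correlation) for sig relative to ref."""
    n = len(sig) + len(ref)                       # zero-pad to avoid circular wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_delay_s is not None:
        max_shift = min(int(sample_rate * max_delay_s), max_shift)
    # Reorder so index 0 is lag -max_shift and the center is lag 0.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / sample_rate, cc

rng = np.random.default_rng(0)
sample_rate = 16000
reference = rng.standard_normal(4096)
delayed = np.concatenate((np.zeros(5), reference))   # same signal, 5 samples later
tdoa, correlation = gcc_phat(delayed, reference, sample_rate, max_delay_s=0.001)
```

A positive `tdoa` means the first argument lags the second. With reverberation or noise, the single sharp peak seen in this clean example spreads or acquires competing peaks, which is the weakness noted above.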

Recent advances have focused on creating sophisticated feature sets that contain rich spectral and spatial information. This trend is best exemplified by the development of the Spatial Cue-Augmented Log-Spectrogram (SALSA) feature set⁵¹ and its variants, which represent significant advancements in SELD-based feature engineering:

SALSA: The original SALSA feature set combines log-linear spectrograms with advanced spatial cues derived from an eigenvector-based analysis of the spatial covariance matrix. This provides a rich, high-fidelity representation of an acoustic scene⁵¹.
SALSA-Lite: To improve computational efficiency for real-time applications, the same authors introduced a lightweight version, SALSA-Lite⁴⁹. It replaces the costly eigen-decomposition with normalized inter-channel phase differences (NIPDs), achieving a significant speedup while preserving strong localization performance.
SALSA-Mel: Building on this, Huang et al.⁶⁹ adapted SALSA-Lite for the Mel scale, resulting in the SALSA-Mel. This version further reduces dimensionality and computational load, making it highly suitable for resource-constrained edge applications.
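As a rough illustration of the NIPD idea used in SALSA-Lite, the sketch below scales the inter-channel phase difference by frequency so that, for a far-field source below the spatial aliasing limit, the result approximates a frequency-independent path-length difference in metres. The function name, the choice of the first channel as reference, and the speed-of-sound default are assumptions; the exact scaling and frequency-range handling in the original work may differ.

```python
import numpy as np

def nipd(stft, freqs_hz, c=343.0):
    """Hypothetical normalized inter-channel phase difference.

    stft: complex array (channels, frames, bins); channel 0 is the reference.
    freqs_hz: array (bins,) of strictly positive bin centre frequencies.
    Returns a real array (channels - 1, frames, bins).
    """
    ref = np.conj(stft[0])[None]
    # Phase of each non-reference channel relative to the reference.
    phase = np.angle(ref * stft[1:])
    # Dividing by frequency removes the linear growth of phase with frequency.
    return -c * phase / (2.0 * np.pi * freqs_hz)
```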

The continual refinement of input features, from generic spectrograms to these sophisticated representations, highlights a pivotal evolutionary path in the field. The overarching goal is to design input representations that are rich in semantic and spatial information, robust to real-world acoustic conditions, and computationally efficient. Feature engineering remains a critical step in developing practical and reliable SELD systems capable of real-world deployment.

Model architectures

DNN architectures are the computational backbone of SELD, tasked with translating high-dimensional input features into a structured, spatiotemporal understanding of an acoustic scene. The design of these architectures has progressed from a foundational blueprint toward more powerful and flexible models that mirror broader trends in artificial intelligence. While the CRNN remains a highly effective and dominant framework⁴⁸, the field has seen a clear progression towards deeper models, more sophisticated attention mechanisms, and the adoption of large-scale pre-training paradigms.

Foundational SELDNet

The first deep learning-based method for SELD, introduced by Hirvonen³, framed the task as a classification problem over a discrete spatial grid, using a simple CNN for detection and localization. The contemporary approach builds upon this work and was standardized by the seminal SELDNet architecture proposed by Adavanne et al.⁵. SELDNet established a new paradigm by framing the task as a joint classification and regression multi-task learning problem. This pivotal shift enabled the continuous, high-resolution estimation of sound event activities and their corresponding locations.

The SELDNet architecture is a classic CRNN, a hybrid design that became the blueprint for many subsequent systems. As illustrated in Fig. 5 (Left), its structure is logically composed of two primary modules. First, a CNN front-end is employed to effectively extract local patterns from input features. Next, an RNN module, typically implemented with bi-directional GRUs (biGRUs), processes the sequence of feature vectors from the CNN, which is vital for accurately tracking event onsets and offsets. Finally, the output layer of the network diverges into two heads: one performing multi-label classification to identify active sound events, and another performing regression to estimate their corresponding Cartesian coordinates.

**Fig. 5: Block diagrams of popular SELD model architectures.**

The influence of SELDNet is evident, as it became a de facto baseline for numerous benchmarks, including DCASE Challenges^{31,32,34,35,65} and other public datasets^54,62,63,64. Subsequent research has built on this foundation by incorporating more sophisticated input features and attention mechanisms to further refine accuracy and robustness^71,72.

Deeper backbones and advanced attention

As SELD research matured, efforts turned to enhancing the feature extraction and temporal modeling capacities of the CRNN framework. This evolution has proceeded along two main axes: increasing the complexity of the convolutional backbone and advancing the sophistication of the temporal modeling component.

A prominent trend has been the replacement of shallow CNN front-ends with deeper, more powerful Residual Networks (ResNets)^73,74,75. By introducing skip connections, ResNet-based architectures mitigate the vanishing gradient problem⁴⁰, enabling the successful training of much deeper models. This allows the network to learn a more expressive hierarchy of features from the audio input, improving its discriminative power¹⁷. Another complementary approach is the incorporation of advanced attention-based mechanisms within the feature extraction backbones, such as the integration of Squeeze-and-Excite blocks (SqEx)⁷⁶, Attentional Feature Fusion (AF), and Multi-scale Channel Attention Mechanisms (MS-CAM)⁷⁷.

Concurrently, the choice of temporal modeling modules has shifted from purely recurrent models toward attention-based architectures. Attention mechanisms enable a model to dynamically weigh the importance of different segments of a sequence, focusing on frames most relevant to the task. For instance, the Transformer⁴⁶, with its self-attention mechanism, proved highly effective at capturing long-range dependencies without the sequential processing limitations of RNNs. This has led to the development of audio-specialized variants, most notably the Conformer architecture⁴⁷, which combined convolution (for local patterns) with self-attention (for global context). The fusion of a ResNet front-end with Conformer modules, depicted in Fig. 5 (Right), has become a state-of-the-art backbone⁷⁴, consistently adopted by many top-performing systems^{78,79,80,81,82}.
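The frame-weighting behaviour of self-attention can be illustrated with a minimal single-head, scaled dot-product sketch in NumPy. The function name and the projection matrices `wq`, `wk`, and `wv` are hypothetical; practical Transformer and Conformer blocks add multiple heads, positional information, normalization, and feed-forward or convolution modules.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over a sequence of frames.

    x: (frames, features); wq, wk, wv: (features, dim) projection matrices.
    Returns the attended output (frames, dim) and the weights (frames, frames).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Each row of `weights` shows how strongly one frame attends to every other frame, regardless of how far apart they are in time.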

Beyond adopting general-purpose attention, researchers have developed mechanisms tailored to the multi-faceted nature of SELD inputs^{83,84,85}. Since features for SELD contain information across channel, time, and frequency axes, specialized attention modules have been proposed to model these dimensions independently^{75,86}. For example, Shul et al.⁸⁷ proposed a channel-spectro-temporal attention mechanism that applies separate attention modules to each dimension, achieving strong performance while significantly reducing model complexity, a key advantage for real-time inference on edge devices⁸⁸.

Large-scale pre-training

A transformative trend in modern SELD research is the adoption of large-scale pre-training⁸⁹. Instead of training a model from scratch on limited, task-specific SELD data, this paradigm involves a two-stage process: first, a large model is pre-trained on a massive, general-purpose dataset; then, the knowledge of this model is transferred to the target SELD task⁹⁰. This approach allows the model to learn a rich and generalizable representation of audio that serves as a powerful starting point for fine-tuning, often leading to significant performance improvements^91,92.

Early examples included using large-scale pre-trained audio neural networks (PANNs)⁷³, which were trained on the extensive AudioSet dataset for audio tagging¹. More recent work has explored pre-training Transformer-based models, such as the Audio Spectrogram Transformer (AST)^93,94, using self-supervised methods on millions of unlabeled audio clips. These pre-trained models can then be used to augment SELD systems with high-level information about the acoustic scene⁹⁵. However, these models generally use single-channel audio and cannot be directly applied to the multi-channel SELD problem. Recent work by He et al.⁴⁸ has begun addressing this problem by proposing methods to adapt these single-channel models for the multi-channel SELD task.

A task-specific pre-training strategy has recently emerged with pre-trained SELD Networks (PSELDNets)⁸⁹. These PSELDNets can also be seen as the SELD version of the previous PANNs⁷³ meant for audio tagging tasks, highlighting a pivotal step forward in environmental audio intelligence. The PSELDNets are pre-trained on massive, synthetically generated SELD datasets before being fine-tuned on smaller, real-world datasets. This approach has been shown to achieve state-of-the-art results by effectively transferring knowledge from diverse, large-scale data sources directly to the SELD problem.

Modular strategies

While integrated, end-to-end models dominate SELD research, a noteworthy alternative is the use of modular strategies. These approaches decouple the joint SELD task, typically by focusing on the SED and DOAE problems independently. For instance, Cao et al.^70,96 employed a two-stage process where a SED model is trained first, and its learned weights are transferred to a second DOAE model with an identical architecture. Similarly, Nguyen et al.^97,98 trained SED and DOAE models independently before using a specialized sequence-matching or alignment network to associate their outputs.

Although modular architectures can be effective, particularly where the complexity of detection and localization differs greatly⁵⁹, they introduce significant complexity into the training pipeline. The need to manage, train, and fuse multiple models makes them less practical for seamless deployment. Consequently, the vast majority of state-of-the-art systems favor unified, end-to-end architectures that learn to detect and localize jointly, simplifying both the training and implementation process.

Data augmentation

High-quality SELD datasets are often limited in size and class variability^{32,65}, which can cause models to overfit the training data and fail to generalize to unseen conditions. Data augmentation methods mitigate these challenges by synthetically expanding the training data through controlled variations. Consequently, data augmentation is critical for improving model robustness and is a cornerstone of most high-performing SELD systems⁹⁹. Table 3 summarizes several common data augmentation methods specific to SELD.

Table 3 Description of selected data augmentation techniques commonly used in SELD

The central challenge in SELD-based data augmentation is to enhance data variability without corrupting the critical spectral content and spatial cues of sound events⁷⁴. For instance, a naive transformation that modifies one microphone channel differently from another can destroy the inter-channel relationships crucial for localization¹⁰⁰.

Augmenting the spectral-temporal domain

This category includes methods that operate directly on the time-frequency representation of the audio, often adapted from the image and speech processing domains¹⁰¹.

For instance, masking techniques such as SpecAugment¹⁰² and Cutout¹⁰³ introduce random, rectangular-shaped masks in the time and frequency dimensions of the spectrogram. This forces the model to learn more robust and distributed representations, making it more resilient to the partial loss of information. For SELD, it is generally considered imperative that the same mask is applied coherently across all microphone channels to preserve the integrity of the spatial cues⁷⁴. This constraint becomes especially critical for formats where inter-channel information is more pronounced, such as stereo or binaural recordings¹⁰⁰.
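The channel-coherence constraint described above can be sketched as follows. This is a simplified, hypothetical variant of SpecAugment-style masking that draws one time mask and one frequency mask and applies them identically to every channel; the function name, the zero fill value, and the single-mask policy are assumptions.

```python
import numpy as np

def coherent_mask(spec, max_t, max_f, rng):
    """Apply the same random time and frequency mask to all channels.

    spec: array (channels, frames, bins); max_t <= frames, max_f <= bins.
    rng: a numpy.random.Generator. Returns a masked copy.
    """
    out = spec.copy()
    _, n_t, n_f = spec.shape
    t_w = int(rng.integers(0, max_t + 1))        # mask width in frames
    f_w = int(rng.integers(0, max_f + 1))        # mask width in bins
    t0 = int(rng.integers(0, n_t - t_w + 1))     # mask start frame
    f0 = int(rng.integers(0, n_f - f_w + 1))     # mask start bin
    # Indexing with ':' on the channel axis keeps the mask coherent.
    out[:, t0:t0 + t_w, :] = 0.0
    out[:, :, f0:f0 + f_w] = 0.0
    return out
```

Because the mask positions are drawn once rather than per channel, inter-channel level and phase relationships in the unmasked regions are left untouched.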

Techniques such as Frequency Shifting⁵¹ and FilterAugment¹⁰⁴ modify the spectral content of the audio. Frequency Shifting simulates pitch variations by shifting or reflecting frequency bands, while FilterAugment simulates the spectral coloration from diverse acoustic environments by applying random gain offsets to frequency bands¹⁰⁵. Such methods encourage models to learn frequency-invariant features rather than focusing on a few dominant bands. However, the efficacy of these methods has been mixed⁸⁰, highlighting the delicate balance between enhancing spectral diversity and inadvertently distorting critical spatial information, especially in real-world audio.

Augmenting the spatial domain

This class of techniques is particularly effective for SELD as they can directly create additional directional training samples from existing recordings without altering the underlying sound events or reverberation characteristics. For instance, SpatialMixup¹⁰⁶ applies selective directional gains to the audio signals, simulating variations in loudness from different directions and creating more diverse spatial scenarios.

For Ambisonics recordings, spatial rotation is a highly effective method where mathematical rotation matrices are applied to the FOA channels¹⁰⁷. This simulates rotating the entire sound field to a new orientation, effectively expanding the dataset several fold. Building on this concept, Wang et al.⁷⁴ proposed the Audio Channel Swapping (ACS) method, a more general approach that works for both FOA and MIC audio formats. The ACS method systematically swaps or permutes microphone channels according to the physical symmetry of the recording array while adjusting the ground truth labels accordingly. For a tetrahedral array, this can generate up to eight unique directional configurations from a single recording while preserving the reverberation of the recording environment⁸⁰.
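As a minimal illustration of spatial rotation, the sketch below rotates an FOA signal about the vertical axis and updates the azimuth label to match. The function name and the assumed channel order (W, X, Y, Z) are hypothetical, and only the simplest case is covered; general 3-D rotations and the discrete channel permutations used by ACS follow the same principle of transforming signals and labels together.

```python
import numpy as np

def rotate_foa_azimuth(foa, azimuth_deg, theta_deg):
    """Rotate an FOA sound field about the vertical axis.

    foa: array (4, samples), assumed order W, X, Y, Z.
    azimuth_deg: ground-truth source azimuth; theta_deg: rotation angle.
    Returns the rotated signal and the new azimuth wrapped to [-180, 180).
    """
    theta = np.deg2rad(theta_deg)
    w, x, y, z = foa
    # Standard 2-D rotation of the horizontal dipole channels.
    x_r = np.cos(theta) * x - np.sin(theta) * y
    y_r = np.sin(theta) * x + np.cos(theta) * y
    new_az = (azimuth_deg + theta_deg + 180.0) % 360.0 - 180.0
    # W (omnidirectional) and Z (vertical) are unchanged by this rotation.
    return np.stack([w, x_r, y_r, z]), new_az
```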

Augmentation via mixture and synthesis

More sophisticated augmentation methods synthesize entirely new and realistic multi-channel audio clips. Mixture-based methods, for example, create new audio samples by mixing existing ones¹⁰⁸. The Mixup method, proposed by Zhang et al.¹⁰⁹, linearly interpolates the waveforms and labels of two samples, which helps to regularize the network and smooth decision boundaries. Mixing two different monophonic samples can also create artificial polyphonic sound events, improving the ability of the model to detect overlapping sound sources¹¹⁰. SpecMix, as proposed by Kim et al.¹⁰¹, performs a similar operation in the time-frequency domain by mixing patches of two different spectrograms. Because this method only swaps designated time-frequency patches, it avoids the wholesale magnitude averaging of waveforms that can obscure class-discriminative cues.
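The linear interpolation performed by Mixup can be written compactly. The sketch below is a generic version that assumes multi-hot activity labels and a Beta-distributed mixing weight; the function name and the `alpha` parameter are illustrative, and how spatial labels are combined in a given SELD system is not specified here.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha, rng):
    """Interpolate two samples and their labels with a shared random weight.

    x1, x2: multi-channel waveforms of equal shape; y1, y2: label arrays.
    rng: a numpy.random.Generator. Returns mixed input, mixed label, weight.
    """
    lam = rng.beta(alpha, alpha)  # mixing weight in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Using the same weight for every channel keeps the inter-channel relationships of each constituent sample intact.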

Other synthesis methods include the Multi-Channel Simulation (MCS)⁷⁴ and Impulse Response Simulation (IRS)⁹⁹ frameworks. The MCS method isolates target events via beamforming to obtain source spectra and corresponding spatial covariance matrices. New multi-channel examples are then synthesized by randomly recombining spectra with spatial matrices. The IRS framework improves on the previous MCS method by eliminating directional interferences from the extracted target events. Subsequently, the IRS method replaces the empirically derived spatial covariance matrices with simulated RIRs, resulting in interference-free directional sound event examples.

Augmentation chains and adaptive policies

In practice, state-of-the-art systems rarely rely on a single augmentation method. Instead, they often employ augmentation chains, sequential pipelines that apply multiple transformations to improve robustness and generalization^{82,106}. For instance, the top-ranking submission in the DCASE Challenge 2022 utilized a four-stage data augmentation strategy combining spatial (ACS), synthesis (MCS, Mixup), and masking-based (SpecAugment) operations⁷⁴.

However, designing these complex pipelines requires significant manual effort and parameter tuning¹¹¹, which has motivated a recent push towards adaptive strategies. For instance, Zhang et al.⁸¹ introduced an Automated Audio Data Augmentation (AADA) framework that allows the network to automatically learn optimal augmentation parameters rather than relying on fixed ones. Such advancements in learnable and adaptive augmentation strategies hold immense promise for reducing the manual effort needed to improve the generalization of SELD models.

Output formats

The output format of a SELD model is a critical architectural component that defines how the system represents its understanding of an acoustic scene. An effective format must not only encode what sounds occur and where they originate but must also be robust to multiple, overlapping same-class sound sources. The design of these formats has evolved from conceptually simple representations to sophisticated structures designed to overcome this fundamental challenge. This progression, illustrated in Fig. 6, directly reflects the growing capability of SELD systems to parse realistic and complex acoustic environments.

**Fig. 6: Comparison of SELD Output Representation Strategies.**

Class-wise, two-branch approaches

Initial SELD architectures, as exemplified by SELDNet⁵, adopted a straightforward and intuitive output format. This approach, as depicted in Fig. 6 (top-left), uses two parallel network branches to produce a class-wise output. In this design, a SED branch performs multi-label classification, yielding a vector of activity probabilities for each known sound class. Simultaneously, a parallel DOAE branch performs regression, predicting a corresponding 3-D Cartesian coordinate vector (x, y, z) for each class, representing its location on a unit sphere.

While conceptually intuitive, this class-wise format suffers from a critical limitation: it can only represent a single active instance of a sound class at any given time. This makes it fundamentally incapable of resolving common scenarios where multiple sounds of the same type occur, such as two people speaking simultaneously from different directions. Furthermore, training separate classification and regression branches with distinct loss functions introduces complexity and can hinder stable convergence⁷².

Toward a unified representation

To address the challenges of the two-branch approach, Shimada et al.¹¹² proposed the Activity-Coupled Cartesian DOA (ACCDOA) format, depicted in Fig. 6 (top-right). Instead of separating detection and localization outputs, ACCDOA combines them into a single 3-D vector for each sound class. In this representation, the direction of the vector encodes the DOA, while its magnitude represents the event’s activity probability. For the ground truth, a vector of unit length signifies an active event, while a zero-length vector signifies inactivity.
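The coupling of activity and direction in ACCDOA can be illustrated with a small encode and decode sketch. The function names and the 0.5 activity threshold are assumptions chosen for illustration rather than values taken from the cited work.

```python
import numpy as np

def encode_accdoa(activity, doa_unit):
    """Ground-truth ACCDOA: unit DOA vector if active, zero vector otherwise.

    activity: (classes,) array of 0/1; doa_unit: (classes, 3) unit vectors.
    """
    return activity[:, None] * doa_unit

def decode_accdoa(accdoa, threshold=0.5):
    """Recover per-class activity and direction from predicted vectors.

    accdoa: (classes, 3). Returns a boolean activity mask and unit DOAs.
    """
    norm = np.linalg.norm(accdoa, axis=-1)
    active = norm > threshold                      # vector length acts as activity
    doa = accdoa / np.maximum(norm, 1e-8)[:, None]  # direction of the vector
    return active, doa
```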

This tight coupling of detection and localization confers several advantages. First, by learning a single representation, the model is encouraged to develop more coherent spatial and semantic features. Second, it simplifies the training objective to a single regression loss on the 3-D vector, enhancing training stability and eliminating the need to balance multiple loss functions. Finally, model complexity is reduced by only using a single branch, yielding a more lightweight system. These benefits have positioned ACCDOA as a foundational concept in contemporary SELD frameworks.

Overcoming polyphonic scenarios

While ACCDOA solved the multi-task learning problem, it did not resolve the single-instance limitation of class-wise output formats. Effectively handling simultaneous occurrences of an identical sound class remained a persistent challenge in SELD^35,65. The solution was the development of multi-track or track-wise output formats^{82,97,113,114}, central to models such as the Event-Independent Networks (EINV)¹¹³ and the enhanced EINV2¹¹⁴. This paradigm, as illustrated in Fig. 6 (bottom-left), restructures the output layer to predict a fixed number of independent “tracks”. Each track acts as a potential sound source slot, with its own associated SED and DOAE outputs. A model with three tracks, for instance, can theoretically detect and localize up to three concurrent instances of the same event class. The number of tracks is typically determined by the maximum polyphony of the dataset¹¹⁰.

This solution, however, introduces the challenge of permutation ambiguity. Since the tracks are interchangeable and there is no inherent correct ordering for multiple identical sources, the model’s predicted order may not match the arbitrary order of the ground-truth annotation. To resolve this, track-wise models are trained using Permutation-Invariant Training (PIT)¹¹⁵. At each training step, PIT dynamically solves a combinatorial assignment problem, finding the optimal permutation between the predicted and ground-truth tracks that minimizes the loss function. This allows the network to learn effectively without being penalized for an arbitrary output ordering.
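The assignment step in PIT can be sketched with a brute-force search over track permutations, which is practical because the number of tracks is small. The function name and the use of mean squared error as the per-track loss are assumptions for illustration.

```python
from itertools import permutations

def pit_loss(pred, target):
    """Return (best_loss, best_permutation) over all track orderings.

    pred, target: lists of per-track vectors (lists of floats) of equal size.
    best_permutation[i] is the target track matched to predicted track i.
    """
    def mse(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

    best = None
    for perm in permutations(range(len(target))):
        # Average loss when predicted track i is compared with target perm[i].
        loss = sum(mse(pred[i], target[j]) for i, j in enumerate(perm)) / len(pred)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best
```

In training, only the loss of the best permutation is backpropagated, so the network is not penalized for emitting sources in a different order from the annotation.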

The synthesis: multi-ACCDOA

The logical culmination of these developments was the unification of the ACCDOA concept with the track-wise paradigm. Shimada et al.¹¹⁶ extended the original ACCDOA format with multi-track support, creating the multi-ACCDOA format as shown in Fig. 6 (bottom-right). This strategy duplicates the ACCDOA vector representation across multiple tracks for each class, enabling the network to represent and distinguish concurrent events of the same class from different directions. To handle the associated permutation ambiguity, the authors also introduced the Auxiliary Duplicating PIT (ADPIT) framework, a specialized training strategy for this format.

This integrated approach combines the representational effectiveness of ACCDOA with the polyphonic understanding of track-wise formats. Consequently, the multi-ACCDOA output format has become the foundational output strategy for state-of-the-art SELD systems designed for highly complex and polyphonic acoustic environments.

Alternative classification-based approaches

While most modern SELD systems employ regression-based output formats, an alternative paradigm frames the entire task as a classification problem. The first deep learning-based approach for SELD by Hirvonen³ pioneered this concept by treating localization as a multi-class classification task. In this work, the listening space was divided into a discrete set of spatial sectors (e.g., eight azimuth directions). Each unique combination of a sound event class and location (e.g., “speech” at 45°) was treated as a distinct class.

More recently, this idea has evolved into more sophisticated location-oriented frameworks that treat the listening area as a 2-D grid and predict event activity at each grid point. For instance, Kim et al.¹¹⁷ proposed AD-YOLO, inspired by the “You Only Look Once” object detection algorithm¹¹⁸. In this method, AD-YOLO assigns grid cells the responsibility of detecting nearby sound sources, producing predictions that combine class probabilities and DOA coordinates for each cell. Similarly, Zhang et al.¹¹⁹ proposed the Spatial Mapping and Regression Localization for SELD (SMRL-SELD) framework, which considers each grid cell as either containing a specific sound event class or background noise, using a novel regression loss to guide localization. By shifting from an event-oriented to a location-oriented perspective, these hybrid classification-based approaches offer an effective alternative for handling complex polyphonic scenarios.

Evaluation and benchmarks

Quantifying the performance of a SELD system is uniquely challenging, as it requires a methodology that can simultaneously assess the accuracy of both event detection and localization. The evaluation metrics for SELD have matured significantly, evolving from separate, task-specific scores to an integrated framework that holistically captures the ability of a system to correctly associate what sound is present and where it originates.

Early decoupled evaluation

Early SELD evaluation, as defined for the inaugural DCASE Challenge 2019 SELD task, treated the two sub-tasks independently. SED performance was measured using standard classification metrics: Error Rate (ER) and F1 score²². Concurrently, DOAE was evaluated with a frame-level DOA Error (DE) and a Frame Recall (FR) metric⁴⁴.

While informative, this decoupled approach exhibited a critical flaw: it failed to penalize systems for data association errors. A model could, for instance, achieve a low DE by correctly localizing a sound source while misclassifying its event label (e.g., localizing a “dog bark” but labeling it as “speech”). These incorrect associations would not be adequately reflected in the final scores, meaning that the metrics did not fully represent the practical utility and true performance of a SELD system.

Contemporary joint evaluation

Recognizing the inherent interdependence of the tasks, the DCASE 2020 Challenge organizers introduced a standardized, integrated evaluation framework that has since become the community standard^19,120. This framework is built upon metrics that jointly consider classification and localization accuracy.

The core of this framework rests on location-dependent detection scores and class-dependent localization scores. In this framework, SED performance is quantified by the location-dependent Error Rate (${\text{ER}}_{\le {\text{T}}^{\circ }}$) and F1 score (${\text{F1}}_{\le {\text{T}}^{\circ }}$). A detected event is considered a true positive only if its class label is correct and its estimated DOA is within a spatial threshold, T° (typically 20°), of the ground-truth direction. The ${\text{F1}}_{\le {\text{T}}^{\circ }}$ score is calculated from the location-dependent precision and recall metrics. The ${\text{ER}}_{\le {\text{T}}^{\circ }}$ is the sum of substitutions (class errors), deletions (missed events), and insertions (false alarms) tallied between predictions and ground-truth references that are within this spatial threshold, divided by the total number of reference events.

Concurrently, DOAE performance is quantified by the class-dependent Localization Error (${\text{LE}}_{{\rm{CD}}}$) and Localization Recall (${\text{LR}}_{{\rm{CD}}}$). The ${\text{LE}}_{{\rm{CD}}}$ score measures the average angular distance between predicted and ground-truth DOAs for correctly classified events. The ${\text{LR}}_{{\rm{CD}}}$ represents the per-class recall, calculated as the fraction of ground-truth events that are correctly detected for the sound event class. Crucially, these localization metrics are computed only for correctly classified sound events, thereby measuring the localization performance exclusively on successfully detected sounds.
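The two building blocks of these metrics, the angular distance between DOAs and the location-dependent true-positive rule, can be sketched as follows. The function names are hypothetical, and the full metric computation (matching predictions to references over frames or segments, then accumulating counts) is omitted.

```python
import math

def angular_distance_deg(u, v):
    """Great-circle angle in degrees between two non-zero 3-D DOA vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))

def is_true_positive(pred_class, pred_doa, ref_class, ref_doa, threshold_deg=20.0):
    """Location-dependent rule: correct class and DOA within the threshold."""
    return (pred_class == ref_class
            and angular_distance_deg(pred_doa, ref_doa) <= threshold_deg)
```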

To provide a single, comprehensive score for ranking systems, the overall SELD error (${{\mathcal{E}}}_{{\rm{SELD}}}$) aggregates these four interdependent metrics:

$${{\mathcal{E}}}_{{\rm{SELD}}}=\frac{1}{4}\left[{\text{ER}}_{\le 2{0}^{\circ }}+\left(1-{\text{F1}}_{\le 2{0}^{\circ }}\right)+\frac{{\text{LE}}_{{\rm{CD}}}}{18{0}^{\circ }}+\left(1-{\text{LR}}_{{\rm{CD}}}\right)\right]. \tag{1}$$

An effective SELD system aims to minimize ${\text{ER}}_{\le 2{0}^{\circ }}$, ${\text{LE}}_{{\rm{CD}}}$, and ${{\mathcal{E}}}_{{\rm{SELD}}}$, while maximizing ${\text{F1}}_{\le 2{0}^{\circ }}$ and ${\text{LR}}_{{\rm{CD}}}$. This joint evaluation methodology ensures that modern systems are optimized to correctly link the what and where, which is the true goal of SELD.
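The aggregation in Eq. (1) is a plain average of four terms that each lie on a scale where 0 is best. A direct transcription, with a hypothetical function name and argument names, is:

```python
def seld_error(er, f1, le_deg, lr):
    """Overall SELD error from Eq. (1).

    er: location-dependent error rate; f1: location-dependent F1 in [0, 1];
    le_deg: class-dependent localization error in degrees;
    lr: class-dependent localization recall in [0, 1].
    """
    return 0.25 * (er + (1.0 - f1) + le_deg / 180.0 + (1.0 - lr))
```

For example, a system with an error rate of 0.5, an F1 score of 0.5, a localization error of 18° and a localization recall of 0.7 obtains (0.5 + 0.5 + 0.1 + 0.3) / 4 = 0.35.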

Current state-of-the-art

Table 4 provides an overview of top-performing SELD systems benchmarked on the STARSS23 dataset³², which also serves as the basis for the recent DCASE 2023 Challenge task on SELD. Analysis of these leading systems reveals several clear trends that define the current state-of-the-art.

Table 4 Overview of model architectures evaluated on the STARSS23 validation set

As evidenced by the DCASE baseline results, a performance advantage is observed for the FOA format over raw MIC signals, a trend also noted in other studies^{32,48,51,65}. This is likely attributable to the rich, physically meaningful spatial information explicitly encoded in the FOA channels, enabling the extraction of more effective input features. Consequently, all top-performing systems listed in Table 4 use the FOA audio format. However, because MIC-based systems utilize raw microphone signals, they adapt more readily to a wider range of microphone geometries⁵². A substantial research opportunity therefore exists to bridge this performance gap for practical, real-world deployment.

Architecturally, there is a clear shift away from the classic CRNN. The leading systems predominantly utilize powerful feature extraction backbones, such as ResNets, combined with advanced attention-based mechanisms, such as Conformers, for temporal modeling. Furthermore, achieving state-of-the-art performance now extends beyond model architecture to encompass complex training and inference strategies. Nearly all top-performing systems utilize methods such as large-scale pre-training on massive datasets^48,89,95, model ensembling to combine outputs from multiple models^78,121,122, or custom post-processing techniques such as output averaging and test-time augmentation to refine final predictions^78,79,87.

While these advanced techniques achieve impressive results on benchmarks, they also highlight a growing gap between benchmark performance and practical, real-world deployability. These state-of-the-art systems are often computationally demanding, with large model sizes and ensemble configurations that are generally too slow and resource-intensive for real-time inference on edge devices^{78,121,122}. This underscores a crucial challenge for the field: the development of systems that are not only accurate but also computationally efficient enough for practical application.

Current challenges and emerging opportunities

Despite substantial progress, the transition of SELD systems from controlled benchmarks to robust, real-world deployment is impeded by significant challenges. These challenges, however, also define the most promising areas for future research. This section outlines these limitations and the corresponding opportunities for scientific and technological advancement.

Handling dense polyphony and source overlap

A primary obstacle for SELD is the sheer complexity of real-world acoustic scenes. High degrees of polyphony, particularly involving multiple instances of the same sound class, can obscure spectral cues and introduce ambiguity in spatial estimation, making it increasingly difficult to distinguish and localize a large number of simultaneous sources¹¹. While modern track-wise output formats enable models to handle a fixed number of concurrent sources¹¹⁶, they are inherently limited and fail when this capacity is exceeded or when sources originate from spatially similar locations.

This limitation presents an opportunity to evolve SELD from a detection and localization framework into a more comprehensive acoustic scene decomposition paradigm. Future research could focus on models that dynamically estimate the number of active sources rather than relying on a fixed, predetermined number of output tracks. One promising avenue is the integration of source separation modules as a pre-processing step to disentangle mixed signals before the primary SELD task^95,123. A complementary approach is the exploration of “location-oriented” frameworks^117,119. By assigning class probabilities to a spatial grid, these methods can theoretically detect and localize an arbitrary number of concurrent events, offering a potential solution to the same-class polyphony limitation without being constrained by a predefined number of tracks.

Toward full 3-D spatial awareness

The evolution from traditional 2-D SELD (azimuth, elevation) to comprehensive 3-D spatial awareness, which includes accurate distance estimation, represents a critical next step for the field^16,21. Robust distance estimation is essential for unlocking richer spatial intelligence in applications such as virtual and augmented reality, immersive audio experiences, and robotics navigation. Notably, distance-aware systems are increasingly prevalent, underscoring the demand for such advanced spatial intelligence^124,125.

Despite these benefits, integrating accurate distance estimation into SELD systems remains difficult. Estimating distance from audio alone is a fundamentally challenging problem, as distance cues are often entangled with source-specific properties (e.g., loudness) and the acoustic characteristics of the recording environment¹²⁶. The DCASE 2024 Challenge on 3-D SELD highlighted this difficulty¹²⁷; despite using large datasets and complex models, most systems struggled to outperform the baseline in distance estimation accuracy^128,129,130.

This performance gap highlights that existing SELD features are not optimized for the nuanced task of distance estimation. It presents a clear opportunity to develop physically-motivated input features that explicitly model distance-related acoustic cues¹²⁶. Recent work has begun to explore cues such as reverberation characteristics and TDoA information¹³¹. For instance, Berghi & Jackson⁹⁰ proposed features using the short-term power of the signal’s autocorrelation (stpACC) to capture information about early reflections. Similarly, Yeow et al.¹³² jointly modeled coherence and direct-path energy to create a robust distance cue. This trend toward specialized, physics-informed features is a key pathway for achieving true 3-D spatial intelligence.

Real-time processing and edge deployment

A major hurdle is the gap between the computationally intensive models that achieve state-of-the-art results on academic benchmarks and the stringent requirements for real-time, on-device deployment. Top-performing systems often rely on massive, complex architectures, such as large ResNet-Conformer ensembles⁷⁴, which are prohibitive for real-time inference on resource-constrained edge devices such as wearable sensors or autonomous robots^23,54. In addition, the computational overhead of multi-channel feature extraction and the inherent latency of deep models are major barriers to deployment^52,88.

This creates a pressing need for research in efficient artificial intelligence for acoustics. The opportunity lies in developing lightweight SELD models that maintain high accuracy while operating under strict power and latency budgets. Promising research directions include the design of streamlined network architectures, such as replacing recurrent layers with temporal convolutional networks (TCNs). The SELD-TCN network proposed by Guirguis et al.¹³³ was a pioneering example, demonstrating substantially faster inference. Subsequently, Brignone et al.¹³⁴ proposed the QSELD-TCN network, which leverages the parameter efficiency of quaternions to reduce both model size and computational cost.
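To illustrate why TCN-style layers suit streaming inference, here is a minimal sketch of a causal dilated 1-D convolution together with the receptive field of a stack of such layers. The kernel size, dilation schedule, and function names are illustrative assumptions and do not reproduce the SELD-TCN configuration.

```python
from typing import List, Sequence


def causal_dilated_conv(
    x: Sequence[float], kernel: Sequence[float], dilation: int
) -> List[float]:
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before the start."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # never looks at future frames
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input frames that can influence one output frame of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv(impulse, kernel=[0.5, 0.25], dilation=2)
frames = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])
```

Unlike a recurrent layer, each output frame here depends only on a fixed window of past inputs and can be computed in parallel across time, while doubling the dilation at every layer makes the receptive field grow exponentially with depth.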

Future research can also investigate advanced model compression techniques such as network pruning, quantization, and knowledge distillation. In knowledge distillation¹³⁵, a compact “student” model is trained to replicate the output of a much larger, high-performance “teacher” model, effectively transferring its knowledge into an efficient form¹²¹. Approaches from related domains, such as acoustic scene classification¹³⁶, demonstrate that real-time processing is feasible with such targeted architectural optimization.
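A minimal sketch of a distillation objective follows, assuming the classic temperature-softened formulation for a classification head. The temperature value and function names are assumptions; a SELD system that also regresses locations would need an additional regression term, which is not shown here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]
matched = distillation_loss([4.0, 1.0, 0.5], teacher)     # student agrees with teacher
mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)  # student disagrees
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, which is the signal used to train the compact model.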

Achieving robustness in unseen environments

When transitioning from synthetic or controlled training data to real-world acoustic environments, SELD models often exhibit significant performance degradation^132,137. While synthetic data enables precise control over acoustic conditions, it often fails to capture the full complexity of natural environments⁶¹. Real-world scenarios are characterized by pervasive background noise, complex reverberation, and moving sources, all of which can cause a mismatch between the training and testing feature distributions, thereby compromising SELD performance^69,138.

The challenge of robustness creates vital research opportunities in domain adaptation. Rather than simply training on more diverse data, the goal is to create models capable of dynamic adaptation. For instance, Yasuda et al.¹³⁹ proposed a framework that uses measured echo signals to adapt a SELD model to unknown environments. Similarly, Hu et al.¹³⁷ proposed META-SELD, a framework that applies Model-Agnostic Meta-Learning (MAML)¹⁴⁰ to enable SELD models to adapt quickly to new acoustic environments using only a few samples. Building on this, Hu et al.¹⁴¹ later proposed environment-adaptive META-SELD, an enhanced framework that incorporates selective memory and environment-specific representations to mitigate conflicts between diverse acoustic environments.
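The inner and outer loops of MAML can be sketched on a toy problem. In the example below, a scalar regression model y = w * x stands in for a SELD network and each "environment" is a task with a different slope; this is an illustrative assumption and not the META-SELD implementation. All function names, learning rates, and the task generator are hypothetical.

```python
from typing import List, Tuple

Task = Tuple[List[float], List[float]]  # (inputs, targets)


def grad(w: float, xs: List[float], ys: List[float]) -> float:
    """d/dw of the mean squared error for the model y = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def adapt(w: float, task: Task, inner_lr: float) -> float:
    """Inner loop: one gradient step on a task's few support samples."""
    xs, ys = task
    return w - inner_lr * grad(w, xs, ys)


def maml_step(
    w: float, tasks: List[Tuple[Task, Task]], inner_lr: float, outer_lr: float
) -> float:
    """Outer loop: update w so one inner step works well on each task's query set."""
    meta_grad = 0.0
    for support, query in tasks:
        w_adapted = adapt(w, support, inner_lr)
        hessian = sum(2.0 * x * x for x in support[0]) / len(support[0])
        # Chain rule through the inner step: d(w_adapted)/dw = 1 - inner_lr * hessian.
        meta_grad += grad(w_adapted, *query) * (1.0 - inner_lr * hessian)
    return w - outer_lr * meta_grad / len(tasks)


def make_task(slope: float) -> Tuple[Task, Task]:
    """Toy 'environment': two support samples and one query sample."""
    xs = [1.0, 2.0]
    return (xs, [slope * x for x in xs]), ([3.0], [slope * 3.0])


tasks = [make_task(1.0), make_task(3.0)]
w = 0.0
for _ in range(200):
    w = maml_step(w, tasks, inner_lr=0.1, outer_lr=0.01)

# Adapt to an unseen environment using only its two support samples.
w_new = adapt(w, make_task(2.5)[0], inner_lr=0.1)
```

The meta-learned initialization settles between the training tasks, and a single inner step on two samples moves it toward the unseen task, which mirrors the few-sample adaptation described above.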

Beyond fixed taxonomies

A fundamental limitation of most SELD models is their reliance on a fixed, predefined list of sound classes. This makes them ill-suited for long-term, real-world deployment, where systems must adapt to novel sounds and evolving user needs. This has catalyzed research into more flexible and scalable paradigms, such as class-incremental learning. Pandey et al.¹⁴² proposed a class-incremental learning framework that enables SELD systems to incorporate new sound classes as additional information becomes available, without requiring complete retraining from scratch. Such methods address the challenge of “catastrophic forgetting” and will be essential for creating systems that can evolve throughout their operational lifetime.

Extending these concepts, the paradigm of open-vocabulary SELD has emerged, which aims to create systems not confined to any fixed class list. For instance, Shimada et al.¹⁴³ proposed an embed-ACCDOA model that jointly predicts a spatial location and a semantic embedding for each sound event. By leveraging the knowledge of contrastive language-audio pre-training (CLAP) models¹⁴⁴, these systems allow users to define target sound events using natural language text prompts at inference time. This represents a monumental step towards truly user-centric, scalable, and flexible SELD systems, enabling more open-ended and descriptive understandings of our acoustic environments.
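The prompt-matching step at inference time can be sketched as a nearest-neighbor search in a shared embedding space. The example below assumes that each detected event already comes with a predicted embedding and that text prompts have been encoded into the same space; the 3-D vectors, the similarity threshold, and the function names are hypothetical and do not correspond to any real CLAP model or to the embed-ACCDOA implementation.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def match_prompt(
    event_embedding: Sequence[float],
    prompt_embeddings: Dict[str, List[float]],
    min_similarity: float = 0.5,
) -> Optional[str]:
    """Return the best-matching prompt, or None if nothing is similar enough."""
    best = max(
        prompt_embeddings,
        key=lambda name: cosine(event_embedding, prompt_embeddings[name]),
    )
    if cosine(event_embedding, prompt_embeddings[best]) < min_similarity:
        return None
    return best


# Hypothetical 3-D embeddings; a real system would obtain these from a text encoder.
prompts = {"dog barking": [1.0, 0.0, 0.0], "door knock": [0.0, 1.0, 0.0]}
label = match_prompt([0.9, 0.1, 0.0], prompts)
unknown = match_prompt([0.0, 0.0, 1.0], prompts)
```

Because the class list is just a dictionary of prompts, new target sounds can be added at inference time without retraining the localization model.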

Advancing data-efficient learning paradigms

The high cost of curating large-scale, strongly-labeled datasets remains a fundamental bottleneck for SELD, challenging model generalization in real-world deployments^48,61. While data augmentation techniques are effective^74,99,107, there is a growing need for learning paradigms that can substantially improve model robustness while reducing the reliance on labeled data.

This has motivated research into semi-supervised and self-supervised learning, which leverage large volumes of readily available unlabeled audio to learn powerful representations. For instance, Santos et al.¹⁴⁵ pre-trained a wav2vec-style encoder¹⁴⁶ directly on unlabeled FOA recordings, showing substantial performance gains when fine-tuned on a small amount of labeled data. Similarly, Nozaki et al.¹⁴⁷ proposed a source-aware spatial self-supervised learning method that uses blind source separation to inherently separate and localize sound sources, reducing the need for extensive labeled data during fine-tuning.

A parallel and highly promising direction is the development of zero- and few-shot learning frameworks, which demonstrate that SELD systems can potentially learn to detect and localize sound events with minimal or even no labeled examples for those specific classes^143,148. Together, these data-efficient learning paradigms are crucial for overcoming the annotation bottleneck and unlocking the full potential of SELD for scalable deployment.

Conclusion

SELD has matured into a vibrant field, propelled by advancements in deep learning. This review has charted its progress: from foundational CRNNs to sophisticated attention-based architectures, and from generic spectrograms to specialized, spatially-aware input features. Concurrently, systematic refinements in data augmentation and output formats have enabled the handling of complex polyphonic scenes. Together, these advancements have significantly improved the ability of computational systems to parse acoustic scenes, bringing the goal of Environmental Acoustic Intelligence closer to reality.

Despite this progress, a critical gap persists between benchmark performance and the demands of practical, real-world deployment. State-of-the-art models are often too computationally intensive for low-latency inference on edge devices. Furthermore, overlapping same-class sources, complex acoustic environments, and the limited availability of large-scale, strongly-labeled datasets continue to challenge current methods. These challenges, however, directly highlight the most promising avenues for future research, which include exploring open-vocabulary learning, incorporating 3-D distance estimation, and creating adaptive models that can generalize to unseen environments.

Moving forward, the advancement of SELD will hinge on a multi-faceted effort. Coordinated benchmarking initiatives, such as the DCASE Challenges, will remain indispensable for steering research efforts and ensuring rigorous, comparative evaluation. As the field addresses these frontiers, SELD will transition from a promising academic pursuit into a versatile and widely deployed technology. In doing so, it will not only enhance how machines hear but will form a critical component of holistic machine perception, enabling systems to understand and interact with their environments – achieving unprecedented Environmental Acoustic Intelligence.

Data availability

No datasets were generated or analysed during the current study.

References

Gemmeke, J. F. et al. Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), 776–780 (IEEE, 2017).
Virtanen, T., Plumbley, M. D. & Ellis, D. Computational analysis of sound scenes and events (Springer, 2018).
Hirvonen, T. Classification of spatial audio location and content using convolutional neural networks. In Audio Engineering Society Convention 138 (Audio Engineering Society, 2015).
Risoud, M. et al. Sound source localization. Eur. Ann. Otorhinolaryngol. Head. Neck Dis. 135, 259–264 (2018).
Adavanne, S., Politis, A., Nikunen, J. & Virtanen, T. Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE J. Sel. Top. Signal Process. 13, 34–48 (2018).
Jekateryńczuk, G. & Piotrowski, Z. A survey of sound source localization and detection methods and their applications. Sensors 24, 68 (2023).
He, W., Motlicek, P. & Odobez, J.-M. Deep neural networks for multiple speaker detection and localization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 74–79 (IEEE, 2018).
Tan, E.-L., Karnapi, F. A., Ng, L. J., Ooi, K. & Gan, W.-S. Extracting urban sound information for residential areas in smart cities using an end-to-end IoT system. IEEE Internet Things J 8, 14308–14321 (2021).
Kotus, J., Lopatka, K. & Czyzewski, A. Detection and localization of selected acoustic events in acoustic field for smart surveillance applications. Multimed. Tools Appl. 68, 5–21 (2014).
Bello, J. P. et al. Sonyc: a system for monitoring, analyzing, and mitigating urban noise pollution. Commun. ACM 62, 68–77 (2019).
Nguyen, T. N. T. et al. What Makes Sound Event Localization and Detection Difficult? Insights from Error Analysis in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE 2021) Barcelona, Spain, 120–124 (2021).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Chan, T. K. & Chin, C. S. A comprehensive review of polyphonic sound event detection. IEEE Access 8, 103339–103373 (2020).
Mohmmad, S. & Sanampudi, S. K. Exploring current research trends in sound event detection: a systematic literature review. Multimed. Tools Appl. 83, 84699–84741 (2024).
Mesaros, A., Heittola, T., Virtanen, T. & Plumbley, M. D. Sound event detection: a tutorial. IEEE Signal Process. Mag 38, 67–83 (2021).
Desai, D. & Mehendale, N. A review on sound source localization systems. Arch. Comput. Methods Eng. 29, 4631–4642 (2022).
Grumiaux, P.-A., Kitić, S., Girin, L. & Guérin, A. A survey of sound source localization with deep learning methods. J. Acoust. Soc. Am. 152, 107–151 (2022).
Mohmmad, S. & Sanampudi, S. K. A parametric survey on polyphonic sound event detection and localization. Multimed. Tools Appl. 84, 22083–22120 (2024).
Politis, A., Mesaros, A., Adavanne, S., Heittola, T. & Virtanen, T. Overview and evaluation of sound event localization and detection in dcase 2019. IEEE/ACM Trans. Audio Speech Lang. Process. 29, 684–698 (2020).
Berghi, D., Wu, P., Zhao, J., Wang, W. & Jackson, P. J. Fusion of audio and visual embeddings for sound event localization and detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8816–8820 (IEEE, 2024).
Krause, D. A., Politis, A. & Mesaros, A. Sound event detection and localization with distance estimation. 32nd European Signal Processing Conference (EUSIPCO) Lyon, France, pp. 286–90 (2024).
Mesaros, A., Heittola, T. & Virtanen, T. Metrics for polyphonic sound event detection. Appl. Sci. 6, 162 (2016).
Shabbir, A. et al. Enhancing smart home environments: a novel pattern recognition approach to ambient acoustic event detection and localization. Front. Big Data 7, 1419562 (2025).
Svatos, J. & Holub, J. Impulse acoustic event detection, classification, and localization system. IEEE Trans. Instrum. Meas. 72, 1–15 (2023).
Park, J., Cho, Y., Sim, G., Lee, H. & Choo, J. Enemy spotted: In-game gun sound dataset for gunshot classification and localization. In 2022 IEEE Conference on Games (CoG), 56–63 (IEEE, 2022).
Banchero, L., Vacalebri-Lloret, F., Mossi, J. M. & Lopez, J. J. Enhancing road safety with ai-powered system for effective detection and localization of emergency vehicles by sound. Sensors 25, 793 (2025).
Kojima, R., Sugiyama, O., Suzuki, R., Nakadai, K. & Taylor, C. E. Semi-automatic bird song analysis by spatial-cue-based integration of sound source detection, localization, separation, and identification. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 1287–1292 (IEEE, 2016).
Butko, T., Pla, F. G., Segura, C., Nadeu, C. & Hernando, J. Two-source acoustic event detection and localization: Online implementation in a smart-room. In 2011 19th European Signal Processing Conference, 1317–1321 (IEEE, 2011).
Chakraborty, R. & Nadeu, C. Sound-model-based acoustic source localization using distributed microphone arrays. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 619–623 (IEEE, 2014).
Lopatka, K., Kotus, J. & Czyzewski, A. Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimed. Tools Appl. 75, 10407–10439 (2016).
Adavanne, S., Politis, A. & Virtanen, T. A multi-room reverberant dataset for sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 10–14. https://dcase.community/workshop2019/proceedings (New York University, NY, USA, 2019).
Shimada, K. et al. Starss23: an audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events. Advances in Neural Information Processing Systems 36, 3189 (2024).
Mesaros, A., Serizel, R., Heittola, T., Virtanen, T. & Plumbley, M. D. A decade of dcase: achievements, practices, evaluations and future challenges. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2025).
Politis, A., Adavanne, S. & Virtanen, T. A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection. in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop, pp. 165–169 (DCASE 2020).
Politis, A. et al. A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection. in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop, pp. 125–129 (DCASE 2021) (Barcelona, Spain, 2021).
Dorfner, F. J., Patel, J. B., Kalpathy-Cramer, J., Gerstner, E. R. & Bridge, C. P. A review of deep learning for brain tumor analysis in MRI. NPJ Precis. Oncol. 9, 2 (2025).
Esteva, A. et al. Deep learning-enabled medical computer vision. NPJ Digital Med 4, 5 (2021).
Choudhary, K. et al. Recent advances and applications of deep learning methods in materials science. npj Comput. Mater. 8, 59 (2022).
Krogh, A. What are artificial neural networks? Nat. Biotechnol. 26, 195–197 (2008).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE conference on computer vision and pattern recognition, 770–778 (IEEE, 2016).
Nash, W., Drummond, T. & Birbilis, N. A review of deep learning in the study of materials degradation. npj Mater. Degrad. 2, 37 (2018).
O’shea, K. & Nash, R. An introduction to convolutional neural networks. Preprint at https://doi.org/10.48550/arXiv.1511.08458 (2015).
Tan, K. & Wang, D. A convolutional recurrent neural network for real-time speech enhancement. Interspeech 2018, 3229–3233 (2018).
Adavanne, S., Politis, A. & Virtanen, T. Direction of arrival estimation for multiple sound sources using convolutional recurrent neural network. In 2018 26th European Signal Processing Conference (EUSIPCO), 1462–1466 (IEEE, 2018).
Perotin, L., Serizel, R., Vincent, E. & Guérin, A. Crnn-based multiple doa estimation using acoustic intensity features for ambisonics recordings. IEEE J. Sel. Top. Signal Process. 13, 22–33 (2019).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inform. Process. Syst. 30 (2017).
Gulati, A. et al. Conformer: convolution-augmented transformer for speech recognition. in Interspeech 2020 5036–5040 (ISCA, 2020) https://doi.org/10.21437/Interspeech.2020-3015.
He, C., Cheng, S., Bao, J. & Liu, J. Adapting single-channel pre-trained transformer models for multi-channel sound event localization and detection. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2025).
Nguyen, T. N. T., Jones, D. L., Watcharasupat, K. N., Phan, H. & Gan, W.-S. Salsa-lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 716–720 (IEEE, 2022).
Zotter, F. & Frank, M. Ambisonics: A practical 3D audio theory for recording, studio production, sound reinforcement, and virtual reality (Springer Nature, 2019).
Nguyen, T. N. T., Watcharasupat, K. N., Nguyen, N. K., Jones, D. L. & Gan, W.-S. Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection. IEEE/ACM Trans. Audio, Speech, Lang. Process 30, 1749–1762 (2022).
Yeow, J. W., Tan, E.-L., Bai, J., Peksi, S. & Gan, W.-S. Real-time sound event localization and detection: deployment challenges on edge devices. Preprint at https://doi.org/10.48550/arXiv.2409.11700 (2024).
Wilkins, J. et al. Two vs. four-channel sound event localization and detection. In Proc. 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), 216–220 (Tampere, 2023).
Nagatomo, K., Yasuda, M., Yatabe, K., Saito, S. & Oikawa, Y. Wearable seld dataset: Dataset for sound event localization and detection using wearable devices around head. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 156–160 (IEEE, 2022).
Yasuda, M., Saito, S., Nakayama, A. & Harada, N. 6dof seld: sound event localization and detection using microphones and motion tracking sensors on self-motioning human. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1411–1415 (IEEE, 2024).
Brousmiche, M., Rouat, J. & Dupont, S. Secl-umons database for sound event classification and localization. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 756–760 (IEEE, 2020).
Suzić, S. et al. Uns exterior spatial sound events dataset for urban monitoring. In 2024 32nd European Signal Processing Conference (EUSIPCO), 176–180 (IEEE, 2024).
Brousmiche, M., Dupont, S. & Rouat, J. Avecl-umons database for audio-visual event classification and localization. Preprint at https://doi.org/10.48550/arXiv.2011.01018 (2020).
Pertilä, P. et al. Mobile microphone array speech detection and localization in diverse everyday environments. In 2021 29th European Signal Processing Conference (EUSIPCO), 406–410 (IEEE, 2021).
Neri, M., Politis, A., Krause, D., Carli, M. & Virtanen, T. Speaker distance estimation in enclosures from single-channel audio. IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE, 2024).
Roman, I. R. et al. Spatial scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1221–1225 (IEEE, 2024).
Guizzo, E. et al. L3das21 challenge: Machine learning for 3d audio signal processing. In 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), 1–6 (IEEE, 2021).
Guizzo, E. et al. L3das22 challenge: Learning 3d audio sources in a real office environment. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9186–9190 (IEEE, 2022).
Gramaccioni, R. F., Marinoni, C., Chen, C., Uncini, A. & Comminiello, D. L3das23: Learning 3d audio sources for audio-visual extended reality. IEEE Open J. Signal Process. 5, 632–640 (2024).
Politis, A. et al. STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events in Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE 2022) 125–129 (Nancy, France, 2022) https://doi.org/10.48550/arXiv.2206.01948.
Shimada, K. et al. Stereo sound event localization and detection with onscreen/offscreen classification. Preprint at https://doi.org/10.48550/arXiv.2507.12042 (2025).
Zhang, D., Chen, J., Bai, J. & Wang, M. Sound Event Localization and Classification using Wireless Acoustic Sensor Networks in Outdoor Environments. IEEE Sensors Journal (2025).
He, Y., Trigoni, N. & Markham, A. Sounddet: polyphonic moving sound event detection and localization from raw waveform. In International Conference on Machine Learning, 4160–4170 (PMLR, 2021).
Huang, S., Chen, J., Bai, J., Jia, Y. & Zhang, D. Dynamic kernel convolution network with scene-dedicate training for sound event localization and detection. Preprint at https://doi.org/10.48550/arXiv.2307.08239 (2023).
Cao, Y. et al. Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE 2019).
Sudarsanam, P. A., Politis, A. & Drossos, K. Assessment of Self-Attention on Learned Features For Sound Event Localization and Detection in Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE 2021) 100–104 (Barcelona, Spain, 2021).
Phan, H. et al. On Multitask Loss Function for Audio Event Detection and Localization in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE 2020), 160–164 (Tokyo, Japan 2020).
Kong, Q. et al. Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Trans. Audio Speech Lang. Process 28, 2880–2894 (2020).
Wang, Q. et al. A four-stage data augmentation approach to resnet-conformer based acoustic modeling for sound event localization and detection. IEEE/ACM Trans. Audio Speech, Lang. Process 31, 1251–1264 (2023).
Chen, B., Wang, M. & Gu, Y. Joint spatio-temporal-frequency representation learning for improved sound event localization and detection. Sensors 24, 6090 (2024).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proc. IEEE conference on computer vision and pattern recognition, 7132–7141 (IEEE, 2018).
Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y. & Barnard, K. Attentional feature fusion. In Proc. IEEE/CVF winter conference on applications of computer vision, 3560–3569 (IEEE, 2021).
Wang, Q. et al. The nerc-slip system for sound event localization and detection of dcase2023 challenge. Tech. Rep. DCASE2023 Challenge (2023).
Xue, L., Liu, H. & Zhou, Y. Attention mechanism network and data augmentation for sound event localization and detection. Tech. Rep. DCASE2023 Challenge (2023).
Niu, S. et al. An experimental study on sound event localization and detection under realistic testing conditions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
Zhang, W., Yu, P., Yin, J., Jiang, X. & Xu, M. Automated audio data augmentation network using bi-level optimization for sound event localization and detection. IEEE Signal Process. Lett 31, 2770–2774 (2024).
Hu, J. et al. A track-wise ensemble event independent network for polyphonic sound event localization and detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 9196–9200 (IEEE, 2022).
Shul, Y., Ko, B.-Y. & Choi, J.-W. Divided spectro-temporal attention for sound event localization and detection in real scenes for DCASE 2023 Challenge Technical Report (DCASE 2023 Challenge, 2023).
Shul, Y. & Choi, J.-W. Cst-former: transformer with channel-spectro-temporal attention for sound event localization and detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8686–8690 (IEEE, 2024).
Ma, M., Hu, Y., He, L. & Huang, H. Glfer-net: a polyphonic sound source localization and detection network based on global-local feature extraction and recalibration. EURASIP J. Audio, Speech, Music Process 2024, 34 (2024).
Mu, D., Zhang, Z. & Yue, H. Mff-einv2: multi-scale feature fusion across spectral-spatial-temporal domains for sound event localization and detection. Preprint at https://doi.org/10.48550/arXiv.2406.08771 (2024).
Shul, Y., Choi, D. & Choi, J.-W. Cst-former: multidimensional attention-based transformer for sound event localization and detection in real scenes. Preprint at https://doi.org/10.48550/arXiv.2504.12870 (2025).
Zhang, Z. Enhancing 1-second 3d seld performance with filter bank analysis and scconv integration in cst-former. Preprint at https://doi.org/10.48550/arXiv.2410.13328 (2024).
Hu, J. et al. Pseldnets: pre-trained neural networks on large-scale synthetic datasets for sound event localization and detection. IEEE Transactions on Audio, Speech and Language Processing (2025).
Berghi, D. & Jackson, P. J. Reverberation-based features for sound event localization and detection with distance estimation. Preprint at https://doi.org/10.48550/arXiv.2504.08644 (2025).
Mohor, B., Srikanth, N. & Teo, H. B. Exploiting stereo spatial properties with resnet-conformers for robust event detection and localization. Tech. Rep. DCASE2025 Challenge (2025).
Zhao, T., Han, Z. & Liu, M. Enhancing stereo sound event localization and detection through pretrained audio representations and hybrid architectures. Tech. Rep. DCASE2025 Challenge (2025).
Gong, Y., Chung, Y.-A. & Glass, J. Ast: Audio spectrogram transformer. in Interspeech 2021, 571–575. https://doi.org/10.21437/Interspeech.2021-698 (ISCA, 2021).
Gong, Y., Lai, C.-I., Chung, Y.-A. & Glass, J. Ssast: self-supervised audio spectrogram transformer. In Proc. AAAI Conference on Artificial Intelligence, vol. 36, 10699–10709 (2022).
Scheibler, R., Komatsu, T., Fujita, Y. & Hentschel, M. Sound event localization and detection with pre-trained audio spectrogram transformer and multichannel seperation network. In Proc. 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE2022) (2022).
Cao, Y. et al. Two-stage sound event localization and detection using intensity vector and generalized cross-correlation. DCASE2019 Challenge, Tech. Rep (2019).
Nguyen, T. N. T., Jones, D. L. & Gan, W.-S. A sequence matching network for polyphonic sound event localization and detection. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 71–75 (IEEE, 2020).
Nguyen, T. N. T. et al. A general network architecture for sound event localization and detection using transfer learning and recurrent neural network. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 935–939 (IEEE, 2021).
Koyama, Y. et al. Spatial data augmentation with simulated room impulse responses for sound event localization and detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8872–8876 (IEEE, 2022).
Yeow, J.-W., Tan, E.-L., Peksi, S. & Gan, W.-S. Improving stereo 3d sound event localization and detection: perceptual features, stereo-specific data augmentation, and distance normalization. Preprint at https://doi.org/10.48550/arXiv.2507.00874 (2025).
Kim, G., Han, D. K. & Ko, H. Specmix: a mixed sample data augmentation method for training with time-frequency domain features. in Interspeech 2021, 546–550. https://doi.org/10.48550/arXiv.2108.03020 (ISCA, 2021).
Park, D. S. et al. Specaugment: a simple data augmentation method for automatic speech recognition. in Interspeech 2019, 2613–2617, https://doi.org/10.21437/Interspeech.2019-2680 (ISCA, 2019).
DeVries, T. & Taylor, G. W. Improved regularization of convolutional neural networks with cutout. Preprint at https://doi.org/10.48550/arXiv.1708.04552 (2017).
Nam, H., Kim, S.-H. & Park, Y.-H. Filteraugment: An acoustic environmental data augmentation method. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4308–4312 (IEEE, 2022).
Park, J., Nam, H. & Park, Y.-H. Resnet-conformer for stereo sound event localization and distance estimation in dcase 2025 task3. Tech. Rep. DCASE2025 Challenge (2025).
Falcón-Pérez, R., Shimada, K., Koyama, Y., Takahashi, S. & Mitsufuji, Y. Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 431–435 (IEEE, 2022).
Mazzon, L., Koizumi, Y., Yasuda, M. & Harada, N. First order ambisonics domain spatial augmentation for dnn-based direction of arrival estimation. in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE 2019) 154–158 (New York University, NY, USA, 2019).
Takahashi, N., Gygli, M. & Van Gool, L. Aenet: Learning deep audio features for video analysis. IEEE Trans. Multimed. 20, 513–524 (2017).
Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization in International Conference on Learning Representations (2018).
Hu, J. et al. Sound event localization and detection for real spatial sound scenes: Event-independent network and data augmentation chains in Proceedings of the 7th Detection and Classification of Acoustic Scenes and Events 2022 Workshop (DCASE 2022) (Nancy, France, 2022).
Wu, S., Wang, Y., Hu, Z. & Liu, J. Haac: Hierarchical audio augmentation chain for accdoa described sound event localization and detection. Appl. Acoust. 211, 109541 (2023).
Shimada, K., Koyama, Y., Takahashi, N., Takahashi, S. & Mitsufuji, Y. Accdoa: Activity-coupled cartesian direction of arrival representation for sound event localization and detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 915–919 (IEEE, 2021).
Cao, Y. et al. Event-independent network for polyphonic sound event localization and detection. in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE 2020) 11–15 (Tokyo, Japan, 2020).
Cao, Y. et al. An improved event-independent network for polyphonic sound event localization and detection. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 885–889 (IEEE, 2021).
Yu, D., Kolbæk, M., Tan, Z.-H. & Jensen, J. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 241–245 (IEEE, 2017).
Shimada, K. et al. Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 316–320 (IEEE, 2022).
Kim, J. S., Park, H. J., Shin, W. & Han, S. W. Ad-yolo: You look only once in training multiple sound event localization and detection. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: Unified, real-time object detection. In Proc. IEEE conference on computer vision and pattern recognition, 779–788 (IEEE, 2016).
Zhang, X., Chen, Y., Yao, R., Zi, Y. & Xiong, S. Location-oriented sound event localization and detection with spatial mapping and regression localization. Preprint at https://doi.org/10.48550/arXiv.2504.08365 (2025).
Mesaros, A., Adavanne, S., Politis, A., Heittola, T. & Virtanen, T. Joint measurement of localization and detection of sound events. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 333–337 (IEEE, 2019).
Kang, S.-I., Cho, K., Keum, M. & Park, Y. The distillation system for sound event localization and detection of dcase2023 challenge. Tech. Rep. DCASE2023 Challenge (2023).
Kim, G. & Ko, H. Data augmentation, neural networks, and ensemble methods for sound event localization and detection. Tech. Rep. DCASE2023 Challenge (2023).
Cheng, S. et al. Improving sound event localization and detection with class-dependent sound separation for real-world scenarios. In 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2068–2073 (IEEE, 2023).
Patterson, K., Wilson, K., Wisdom, S. & Hershey, J. R. Distance-based sound separation. In Interspeech 2022, 901–905 (ISCA, 2022). https://doi.org/10.21437/Interspeech.2022-11100
Chen, T., Itani, M., Eskimez, S. E., Yoshioka, T. & Gollakota, S. Hearable devices with sound bubbles. Nat. Electron. 7, 1047–1058 (2024).
Sato, N., Yasuda, M., Saito, S. & Harada, N. Sound source distance estimation utilizing physics-informed prior for sound event localization and detection. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2025).
Diaz-Guerra, D. et al. Baseline models and evaluation of sound event localization and detection with distance estimation in dcase2024 challenge. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), 41–45 (Tokyo, Japan, 2024).
Wang, Q. et al. The nerc-slip system for sound event localization and detection with source distance estimation of dcase 2024 challenge. Tech. Rep. DCASE2024 Challenge (2024).
Yu, H. Doa and event guidance system for sound event localization and detection with source distance estimation. Tech. Rep. DCASE2024 Challenge (2024).
Yeow, J. W., Tan, E.-L., Bai, J., Peksi, S. & Gan, W.-S. Squeeze-and-excite resnet-conformers for sound event localization, detection, and distance estimation for dcase 2024 challenge. Tech. Rep. DCASE2024 Challenge (2024).
Berg, A., Engman, J., Gulin, J., Åström, K. & Oskarsson, M. Learning multi-target tdoa features for sound event localization and detection. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2024 Workshop (DCASE2024), 16–20 (Tokyo, Japan, 2024).
Yeow, J.-W., Tan, E.-L., Bai, J., Peksi, S. & Gan, W.-S. Enhancing 3d sound event localization and detection with distance estimation using reverberation and spatial coherence features. IEEE Sensors J. 25, 29221–29237 (2025).
Guirguis, K., Schorn, C., Guntoro, A., Abdulatif, S. & Yang, B. Seld-tcn: Sound event localization & detection via temporal convolutional networks. In 2020 28th European Signal Processing Conference (EUSIPCO), 16–20 (IEEE, 2021).
Brignone, C., Mancini, G., Grassucci, E., Uncini, A. & Comminiello, D. Efficient sound event localization and detection in the quaternion domain. IEEE Trans. Circuits Syst. II: Express Briefs 69, 2453–2457 (2022).
Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. Preprint at https://doi.org/10.48550/arXiv.1503.02531 (2015).
Martín-Morató, I., Heittola, T., Mesaros, A. & Virtanen, T. Low-complexity acoustic scene classification for multi-device audio: analysis of dcase 2021 challenge systems. In Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE 2021), 85–89 (Barcelona, Spain, 2021).
Hu, J. et al. Meta-seld: Meta-learning for fast adaptation to the new environment in sound event localization and detection. In Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE 2023), 51–55 (Tampere, Finland, 2023).
Zhang, D. et al. Synthesis-to-real robust training for enhanced sound event localization and detection using dynamic kernel convolution networks. Appl. Acoust. 228, 110267 (2025).
Yasuda, M., Ohishi, Y. & Saito, S. Echo-aware adaptation of sound event localization and detection in unknown environments. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 226–230 (IEEE, 2022).
Finn, C., Abbeel, P. & Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, 1126–1135 (PMLR, 2017).
Hu, J. et al. Selective-memory meta-learning with environment representations for sound event localization and detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE, 2024).
Pandey, R., Mulimani, M., Politis, A. & Mesaros, A. Class-incremental learning for sound event localization and detection. In 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 1–5 (IEEE, 2025).
Shimada, K. et al. Open-vocabulary sound event localization and detection with joint learning of clap embedding and activity-coupled cartesian doa vector. IEEE Transactions on Audio, Speech and Language Processing (IEEE, 2025).
Elizalde, B., Deshmukh, S., Al Ismail, M. & Wang, H. Clap learning audio concepts from natural language supervision. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2023).
Santos, O., Rosero, K., Masiero, B. & de Alencar Lotufo, R. w2v-seld: a sound event localization and detection framework for self-supervised spatial audio pre-training. IEEE Access (IEEE, 2024).
Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. neural Inf. Process. Syst. 33, 12449–12460 (2020).
Nozaki, Y., Bando, Y. & Onishi, M. Source-aware spatial self-supervision for sound event localization and detection. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5 (IEEE, 2025).
Shimada, K. et al. Zero-and few-shot sound event localization and detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 636–640 (IEEE, 2024).
Hu, J. et al. A data generation method for sound event localization and detection in real spatial sound scenes. Tech. Rep. DCASE2023 Challenge (2023).
Zhang, D., Bai, J., Huang, S., Wang, M. & Chen, J. Jless submission to dcase2023 task3: Conformer with data augmentation for sound event localization and detection in real space. Tech. Rep. DCASE2023 Challenge (2023).
Wu, S. One audio augmentation chain proposed for sound event localization and detection in dcase 2023 task3. Tech. Rep. DCASE2023 Challenge (2023).
Kumar, P., Kumar, A., Choudhary, S., Prakash, J. & Kumar, S. A framework for seld using conformer and multi-accdoa strategies. Tech. Rep. DCASE2023 Challenge (2023).
Jiang, Y. et al. Exploring audio-visual information fusion for sound event localization and detection in low-resource realistic scenarios. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1–6 (IEEE, 2024).

Acknowledgements

This work was supported by the Ministry of Education, Singapore, through Academic Research Fund Tier 2 under Grant (MOE-T2EP20221-0014) and Grant (MOE-T2EP50122-0018).

Author information

Authors and Affiliations

Smart Nation TRANS Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi & Woon-Seng Gan

Contributions

J.W.Y. wrote the main manuscript text and compiled the literature. E.L.T., S.P., and W.S.G. helped guide the research direction. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jun-Wei Yeow.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Yeow, JW., Tan, EL., Peksi, S. et al. Environmental acoustic intelligence through sound event localization and detection: a review. npj Acoust. 1, 31 (2025). https://doi.org/10.1038/s44384-025-00036-3

Received: 13 August 2025
Accepted: 07 October 2025
Published: 05 December 2025
Version of record: 05 December 2025
DOI: https://doi.org/10.1038/s44384-025-00036-3
