Fig. 5: Joint 2D localization and speech separation framework.

From: Creating speech zones with self-distributing acoustic swarms

A We first run the SRP-PHAT algorithm to prune the search space, and then in B we use an attention-based localization network to find the potential speaker locations in the remaining space. This network is composed of a U-Net encoder-decoder with a transformer encoder bottleneck between them. GLU stands for Gated Linear Unit. C shows our network used for speech separation. The encoder and decoder blocks are applied separately to the aligned microphone data for each speaker. The bottleneck block first applies temporal self-attention to each speaker individually using a conformer encoder (CE). It then applies self-attention across speakers using a transformer encoder (TFE). These two steps are repeated multiple times to address cross-talk between speakers.
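To make the alternating attention pattern in C concrete, the sketch below interleaves per-speaker temporal self-attention with cross-speaker self-attention over aligned per-speaker feature streams. This is a minimal PyTorch sketch, not the authors' implementation: the class name, dimensions, and repeat count are hypothetical, and standard transformer encoder layers stand in for both the conformer encoder (CE) and the transformer encoder (TFE).

```python
# Hypothetical sketch of the separation bottleneck in panel C; standard
# TransformerEncoderLayers approximate both the CE and TFE blocks.
import torch
import torch.nn as nn

class SeparationBottleneck(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_repeats=4):
        super().__init__()
        # Per-speaker temporal self-attention (stand-in for the conformer encoder, CE).
        self.temporal = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_repeats)]
        )
        # Cross-speaker self-attention (stand-in for the transformer encoder, TFE).
        self.cross = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
             for _ in range(n_repeats)]
        )

    def forward(self, x):
        # x: (batch, speakers, time, dim) -- one encoded stream per speaker.
        b, s, t, d = x.shape
        for temporal, cross in zip(self.temporal, self.cross):
            # Attend over time within each speaker's stream independently.
            x = temporal(x.reshape(b * s, t, d)).reshape(b, s, t, d)
            # Attend across speakers at each time step to resolve cross-talk.
            x = x.transpose(1, 2).reshape(b * t, s, d)
            x = cross(x).reshape(b, t, s, d).transpose(1, 2)
        return x

# Example: 2 speakers, 100 bottleneck frames, 256-dim features.
out = SeparationBottleneck()(torch.randn(1, 2, 100, 256))
print(out.shape)  # torch.Size([1, 2, 100, 256])
```

Alternating the two attention axes lets each speaker's stream be refined in isolation before information is exchanged across speakers, which is what allows repeated passes to progressively suppress cross-talk.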