<aside> ❗ Code has been publicly released! 👉 https://github.com/jin-woo-lee/nfs-binaural
</aside>
List of Contents
Authors
Abstract
Related works
Proposed method
This work focuses on modeling binaural speech with a neural network, paying particular attention to time delay and energy reduction.
As sound propagates through the air, its arrival is delayed and its energy is altered, mainly by attenuation and absorption.
We introduce a novel network that models the delay and the energy reduction of binaural speech in the Fourier space.
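To make the idea concrete, here is a minimal NumPy sketch of the underlying signal model (not the NFS network itself; the function name and the numbers are illustrative): a time delay becomes a linear phase shift in the Fourier domain, and energy reduction becomes a frequency-dependent gain.

```python
# Illustrative sketch only: delay = linear phase shift, energy reduction = gain,
# both applied in the Fourier domain. Not the NFS architecture.
import numpy as np

def delay_and_attenuate(mono, delay_sec, gain, sr=48000):
    """Apply a fractional delay and a (possibly per-frequency) gain to a mono signal.

    mono      : (T,) waveform
    delay_sec : scalar delay in seconds (e.g. an interaural time difference)
    gain      : scalar or (T//2 + 1,) magnitude response
    """
    spec = np.fft.rfft(mono)                              # to Fourier space
    freqs = np.fft.rfftfreq(len(mono), d=1.0 / sr)        # bin frequencies in Hz
    phase_lag = np.exp(-2j * np.pi * freqs * delay_sec)   # delay as a linear phase
    return np.fft.irfft(gain * phase_lag * spec, n=len(mono))

# Toy usage: a "near ear" with a short delay and mild attenuation, and a
# "far ear" with a longer delay and stronger attenuation (made-up numbers).
sr = 48000
t = np.arange(sr) / sr
mono = np.sin(2 * np.pi * 440.0 * t)
left = delay_and_attenuate(mono, delay_sec=0.0003, gain=0.9, sr=sr)
right = delay_and_attenuate(mono, delay_sec=0.0009, gain=0.6, sr=sr)
binaural = np.stack([left, right], axis=0)
```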
Visualization of the proposed idea
The proposed system for binaural speech rendering
Conclusion
We present NFS, a lightweight neural network for binaural speech rendering.
Building on the geometric time delay, NFS predicts frame-wise frequency responses and phase lags for multiple early-reflection paths in the Fourier space.
Because it is defined in the Fourier domain, NFS is highly efficient and, by design, operates independently of the source domain.
Experimental results show that NFS outperforms previous methods on the benchmark dataset while using less memory and fewer computations.
NFS is interpretable in that it explicitly exposes the frequency response and phase delay of each acoustic path for each source position.
As future work, we expect that training NFS on generic binaural audio datasets will allow it to generalize to arbitrary domains.
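For a more concrete picture of the frame-wise, multi-path formulation, the following is a rough sketch of how per-path, per-frame frequency responses and phase lags could be applied and summed in the STFT domain. The shapes, names, and STFT parameters are our assumptions, not the released NFS code.

```python
# Hedged sketch: sum several delayed/filtered copies of a mono source in the
# STFT domain, one per early-reflection path. Shapes and names are assumptions.
import numpy as np
from scipy.signal import stft, istft

def render_ear(mono, path_gains, path_delays, sr=48000, nfft=1024, hop=256):
    """Render one ear signal from per-path, per-frame responses.

    mono        : (T,) source waveform
    path_gains  : (P, F, N) magnitude response per path, frequency bin, frame
    path_delays : (P, N)    delay in seconds per path and frame
    """
    freqs, _, spec = stft(mono, fs=sr, nperseg=nfft, noverlap=nfft - hop)  # (F, N)
    acc = np.zeros_like(spec)
    for gains, delays in zip(path_gains, path_delays):
        # Each frame-wise delay becomes a frame-wise linear phase term.
        phase = np.exp(-2j * np.pi * freqs[:, None] * delays[None, :])    # (F, N)
        acc += gains * phase * spec
    _, ear = istft(acc, fs=sr, nperseg=nfft, noverlap=nfft - hop)
    return ear
```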
Ground Truth (binaural)
Ours (NFS)
BinauralGrad
WarpNet
<aside> ☝ Please be sure to listen to the samples with EARPHONES! Listening with headphones or speakers does NOT accurately convey the binaural rendering in the audio.
</aside>
We can also inspect what NFS is doing to render the binaural speech. The videos below visualize the magnitude response and delay of each frame-wise impulse response.
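For reference, the sketch below shows the kind of quantities such a visualization involves: the magnitude response and phase delay of a single impulse response. The helper name and the choice of phase delay (rather than, say, group delay) are our assumptions, not part of the NFS release.

```python
# Illustrative helper: magnitude response (dB) and phase delay (seconds) of an
# impulse response, the per-frame quantities one might plot for inspection.
import numpy as np

def response_and_delay(ir, sr=48000):
    spec = np.fft.rfft(ir)
    freqs = np.fft.rfftfreq(len(ir), d=1.0 / sr)
    magnitude_db = 20.0 * np.log10(np.abs(spec) + 1e-12)
    phase = np.unwrap(np.angle(spec))
    # Phase delay: -phase / (2*pi*f); skip the DC bin to avoid division by zero.
    delay = np.zeros_like(freqs)
    delay[1:] = -phase[1:] / (2.0 * np.pi * freqs[1:])
    return freqs, magnitude_db, delay
```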