Blind source separation in single-channel polyphonic music recordings

Schulze, Sören

doi:10.26092/elib/1439

Citation link (DOI)

10.26092/elib/1439

Publication date

2022-02-03

Authors

Schulze, Sören

Supervisor

King, Emily J.

Reviewer

Dörfler, Monika

Abstract

We address the problem of unmixing the contributions of multiple different musical instruments from a single-channel audio recording without any specific prior information. Based on a model for the sounds of string and wind instruments, every tone is represented using a set of model parameters as well as a learned dictionary matrix that captures relations of the amplitudes of the harmonics specific to each instrument.
We propose two practical approaches that both operate on time-frequency representations derived from the short-time Fourier transform. The first approach is based on a specifically developed sparse pursuit algorithm. Since it needs to operate on a log-frequency spectrogram, we analyze the characteristics of such representations from a theoretical point of view and propose a log-frequency spectrogram that fulfills all the properties that we consider favorable. For use in the separation algorithm, it turns out that the best log-frequency spectrogram is obtained via the sparse pursuit algorithm itself. While discussing pursuit algorithms in general, we also sketch a potential application of Beurling LASSO to source separation.
The second approach applies deep neural networks to predict the model parameters. Since the problem is non-convex and possesses a large number of local minima, we combine conventional backpropagation with policy gradients, which stem from reinforcement learning. This method is distinguished by its ability to operate directly on the Gabor frame analysis coefficients (i.e., the sampled complex-valued output of the short-time Fourier transform).
On each of the samples that we gathered for evaluation, at least one of the two approaches outperforms the state of the art. The second algorithm can generally be considered better, especially in the suppression of interference between the sources. Unlike most traditional algorithms, neither of the methods is bound to any particular tuning of the instruments. They each possess different mechanisms to account for inconsistencies in the sounds of acoustic instruments, and they both incorporate inharmonicity in their parameter predictions.
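
To make the tone model described in the abstract more concrete, the following is a minimal illustrative sketch and not the dissertation's implementation. It assumes that a tone is a sum of sinusoidal harmonics whose relative amplitudes are read from one column of a dictionary matrix, and that inharmonicity follows the common stiff-string formula f_h = h * f1 * sqrt(1 + b * h^2); the abstract does not state the exact formula. The function names, the dictionary values, and all parameter values are hypothetical.

```python
import numpy as np


def harmonic_frequencies(f1, n_harmonics, inharmonicity=0.0):
    """Frequencies of the harmonics h = 1..n_harmonics.

    Assumption: stiff-string model f_h = h * f1 * sqrt(1 + b * h**2).
    With b = 0 the tone is exactly harmonic.
    """
    h = np.arange(1, n_harmonics + 1)
    return h * f1 * np.sqrt(1.0 + inharmonicity * h**2)


def synthesize_tone(f1, amplitude, dictionary, instrument,
                    inharmonicity=0.0, duration=0.5, sample_rate=44100):
    """Sum of sinusoids weighted by one dictionary column (one instrument)."""
    rel_amps = dictionary[:, instrument]
    freqs = harmonic_frequencies(f1, len(rel_amps), inharmonicity)
    t = np.arange(int(duration * sample_rate)) / sample_rate
    # Rows are harmonics, columns are time samples.
    partials = rel_amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t[None, :])
    return amplitude * partials.sum(axis=0)


# Hypothetical dictionary: 4 harmonics (rows) x 2 instruments (columns).
dictionary = np.array([[1.0, 1.0],
                       [0.5, 0.1],
                       [0.25, 0.6],
                       [0.125, 0.05]])

# A single-channel mixture of two tones from different instruments.
mixture = (synthesize_tone(440.0, 0.8, dictionary, instrument=0)
           + synthesize_tone(330.0, 0.6, dictionary, instrument=1,
                             inharmonicity=1e-3))
```

In this picture, separation amounts to recovering the per-tone parameters (fundamental frequency, amplitude, inharmonicity, instrument index) and the dictionary from `mixture` alone. Because the fundamental frequency is a continuous parameter here, nothing ties the model to a particular tuning.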

Keywords

blind source separation; unmixing; time-frequency analysis; machine learning; dictionary learning; sparse pursuit; deep learning; neural networks; policy gradients; non-convex optimization

Institution

Universität Bremen

Faculty

Faculty 03: Mathematics/Computer Science (FB 03)

Document type

Dissertation

Secondary publication

No

License

https://creativecommons.org/licenses/by/4.0/

Language

English

Files

Name

DissertationSchulze.pdf

Size

8.86 MB

Format

Adobe PDF

Checksum (MD5)

f69388710aa87e552d67dcb1c6f42783