Personalizing Myoelectric Silent Speech Interfaces via Cross-Speaker Training and Voice Timbre Control
Publication date
2026-01-30
Authors
Scheck, Kevin
Supervisor
Reviewer
Nakamura, Satoshi
Abstract
Electromyography (EMG) signals, which measure muscle activity, are investigated for Silent Speech Interfaces (SSIs) to enable speech communication via silent articulation. The prevailing paradigm for EMG-to-Speech conversion relies on speaker-dependent models that predict acoustic speech features for the same speaker who provides the EMG input. However, this approach limits SSI applications, as it 1) cannot synthesize the personal voice of individuals unable to produce audible speech during EMG recording, 2) suffers from data scarcity, since each speaker must record a sizable corpus, and 3) yields unintelligible speech in low-latency settings.
The problem of converting EMG signals to speech in personal voices (1) is addressed by using voice conversion methods that disentangle phonetic and voice timbre information. The proposed voice-adaptive EMG-to-Speech models predict speech content features, mostly reflecting phonetic content, from EMG signals and combine them with reference audio of the target voice for speech synthesis. Further evaluations demonstrate that such models can be trained using EMG signals of silent speech only.
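To make the disentanglement idea concrete, the following is a minimal, illustrative sketch of a voice-adaptive EMG-to-Speech model in PyTorch. All layer choices, dimensions, and names (e.g., `VoiceAdaptiveEMG2Speech`) are assumptions for illustration, not the dissertation's actual architecture.

```python
# Illustrative sketch only: an EMG encoder predicts content features, a reference
# encoder extracts a time-invariant timbre embedding, and a decoder combines both.
import torch
import torch.nn as nn

class VoiceAdaptiveEMG2Speech(nn.Module):
    def __init__(self, n_emg_channels=8, d_content=256, d_timbre=128, n_mels=80):
        super().__init__()
        # Predicts speaker-independent speech content features from multi-channel EMG.
        self.content_encoder = nn.Sequential(
            nn.Conv1d(n_emg_channels, d_content, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_content, d_content, kernel_size=5, padding=2),
        )
        # Extracts a single timbre embedding from reference audio of the target voice.
        self.timbre_encoder = nn.GRU(n_mels, d_timbre, batch_first=True)
        # Decodes content + timbre into mel spectrogram frames for a vocoder.
        self.decoder = nn.Conv1d(d_content + d_timbre, n_mels, kernel_size=5, padding=2)

    def forward(self, emg, ref_mels):
        # emg: (batch, channels, frames); ref_mels: (batch, ref_frames, n_mels)
        content = self.content_encoder(emg)           # (B, d_content, T)
        _, h = self.timbre_encoder(ref_mels)          # h: (1, B, d_timbre)
        timbre = h[-1].unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.decoder(torch.cat([content, timbre], dim=1))  # (B, n_mels, T)
```

The key design point the sketch illustrates: the timbre embedding carries no time structure, so swapping the reference audio changes the synthesized voice without retraining the EMG encoder.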
The data scarcity problem (2) is addressed in several studies that pre-train EMG models on other biosignals, on unlabeled EMG signals, and on labeled EMG signals of multiple speakers, i.e., cross-speaker training. In particular, cross-speaker training improves average speech synthesis intelligibility while eliminating the need to train speaker-specific models.
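As a hedged sketch of what pooling data across speakers could look like with a model such as the one above: a single shared model is trained on EMG/speech pairs from all speakers. The `train_cross_speaker` helper, the dataloader layout, and the L1 spectrogram loss are assumptions, not the dissertation's training recipe.

```python
# Illustrative cross-speaker training loop: one shared model, pooled speaker data.
import torch
import torch.nn.functional as F

def train_cross_speaker(model, loaders_by_speaker, optimizer, epochs=10):
    """loaders_by_speaker: dict speaker_id -> DataLoader of (emg, ref_mels, target_mels)."""
    for epoch in range(epochs):
        for speaker, loader in loaders_by_speaker.items():
            for emg, ref_mels, target_mels in loader:
                pred = model(emg, ref_mels)          # shared weights across all speakers
                loss = F.l1_loss(pred, target_mels)  # spectrogram reconstruction loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```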
To improve EMG-to-Speech in low-latency settings (3), this work presents an end-to-end model that outperforms previous low-latency baselines in speech intelligibility and naturalness while generating speech with less than 20 ms of algorithmic latency. Furthermore, combining the contributions outlined above, this work introduces a unified model that converts EMG signals from multiple speakers into selectable voices.
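For intuition on the 20 ms budget, a back-of-the-envelope algorithmic-latency calculation for a streaming model; the frame shift and lookahead values below are assumptions for illustration, not the dissertation's configuration.

```python
# Algorithmic latency = time until enough input exists to emit the current output frame.
frame_shift_ms = 10        # assumed hop between EMG feature frames
lookahead_frames = 1       # assumed future frames the model is allowed to see

latency_ms = (1 + lookahead_frames) * frame_shift_ms
print(f"algorithmic latency: {latency_ms} ms")  # 20 ms under these assumptions
assert latency_ms <= 20
```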
Keywords
Silent Speech Interfaces; Electromyography; Speech Synthesis; Voice Conversion; Deep Learning
Institution
Department
Institute
Document type
Dissertation
License
Language
English
Files
Name: Kevin-Scheck-Dissertation-Personalizing-Myoelectric-Silent-Speech-Interfaces-via-Cross-Speaker-Training-and-Voice-Timbre-Control.pdf
Size: 41.29 MB
Format: Adobe PDF
Checksum (MD5): 5e5c67e2b234f67f6deae352593d1490
