Personalizing Myoelectric Silent Speech Interfaces via Cross-Speaker Training and Voice Timbre Control
Publication date
2026-01-30
Authors
Scheck, Kevin
Supervisor
Reviewer
Nakamura, Satoshi
Abstract
Electromyography (EMG) signals, which measure muscle activity, are investigated for Silent Speech Interfaces (SSIs) to enable speech communication via silent articulation. The prevailing paradigm for EMG-to-Speech conversion relies on speaker-dependent models that predict acoustic speech features for the same speaker who provides the EMG input. However, this approach limits SSI applications, as it 1) cannot synthesize the personal voice of individuals unable to produce audible speech during EMG recording, 2) suffers from data scarcity, since each speaker must record a sizable corpus, and 3) yields unintelligible speech in low-latency settings.
The problem of converting EMG signals to speech in personal voices (1) is addressed by using voice conversion methods that disentangle phonetic and voice timbre information. The proposed voice-adaptive EMG-to-Speech models predict speech content features, mostly reflecting phonetic content, from EMG signals and combine them with reference audio of the target voice for speech synthesis. Further evaluations demonstrate that such models can be trained using EMG signals of silent speech only.
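To make the disentanglement idea concrete, the following is a minimal, illustrative sketch of a voice-adaptive EMG-to-Speech model in PyTorch. All layer choices, dimensions, and names (e.g., `VoiceAdaptiveEMG2Speech`) are assumptions for illustration, not the dissertation's actual architecture.

```python
# Illustrative sketch only: an EMG encoder predicts content features, a reference
# encoder extracts a time-invariant timbre embedding, and a decoder combines both.
import torch
import torch.nn as nn

class VoiceAdaptiveEMG2Speech(nn.Module):
    def __init__(self, n_emg_channels=8, d_content=256, d_timbre=128, n_mels=80):
        super().__init__()
        # Predicts speaker-independent speech content features from multi-channel EMG.
        self.content_encoder = nn.Sequential(
            nn.Conv1d(n_emg_channels, d_content, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(d_content, d_content, kernel_size=5, padding=2),
        )
        # Extracts a single timbre embedding from reference audio of the target voice.
        self.timbre_encoder = nn.GRU(n_mels, d_timbre, batch_first=True)
        # Decodes content + timbre into mel spectrogram frames for a vocoder.
        self.decoder = nn.Conv1d(d_content + d_timbre, n_mels, kernel_size=5, padding=2)

    def forward(self, emg, ref_mels):
        # emg: (batch, channels, frames); ref_mels: (batch, ref_frames, n_mels)
        content = self.content_encoder(emg)           # (B, d_content, T)
        _, h = self.timbre_encoder(ref_mels)          # h: (1, B, d_timbre)
        timbre = h[-1].unsqueeze(-1).expand(-1, -1, content.size(-1))
        return self.decoder(torch.cat([content, timbre], dim=1))  # (B, n_mels, T)
```

The key design point the sketch illustrates: the timbre embedding carries no time structure, so swapping the reference audio changes the synthesized voice without retraining the EMG encoder.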
The data scarcity problem (2) is addressed in several studies that pre-train EMG models on other biosignals, on unlabeled EMG signals, and on labeled EMG signals of multiple speakers, i.e., cross-speaker training. In particular, cross-speaker training improves average speech synthesis intelligibility while eliminating the need to train speaker-specific models.
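As a hedged sketch of what pooling data across speakers could look like with a model such as the one above: a single shared model is trained on EMG/speech pairs from all speakers. The `train_cross_speaker` helper, the dataloader layout, and the L1 spectrogram loss are assumptions, not the dissertation's training recipe.

```python
# Illustrative cross-speaker training loop: one shared model, pooled speaker data.
import torch
import torch.nn.functional as F

def train_cross_speaker(model, loaders_by_speaker, optimizer, epochs=10):
    """loaders_by_speaker: dict speaker_id -> DataLoader of (emg, ref_mels, target_mels)."""
    for epoch in range(epochs):
        for speaker, loader in loaders_by_speaker.items():
            for emg, ref_mels, target_mels in loader:
                pred = model(emg, ref_mels)          # shared weights across all speakers
                loss = F.l1_loss(pred, target_mels)  # spectrogram reconstruction loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```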
To improve EMG-to-Speech in low-latency settings (3), this work presents an end-to-end model that outperforms previous low-latency baselines in speech intelligibility and naturalness while generating speech with less than 20 ms of algorithmic latency. Furthermore, combining the contributions outlined above, this work introduces a unified model that converts EMG signals from multiple speakers into selectable voices.
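For intuition on the 20 ms budget, a back-of-the-envelope algorithmic-latency calculation for a streaming model; the frame shift and lookahead values below are assumptions for illustration, not the dissertation's configuration.

```python
# Algorithmic latency = time until enough input exists to emit the current output frame.
frame_shift_ms = 10        # assumed hop between EMG feature frames
lookahead_frames = 1       # assumed future frames the model is allowed to see

latency_ms = (1 + lookahead_frames) * frame_shift_ms
print(f"algorithmic latency: {latency_ms} ms")  # 20 ms under these assumptions
assert latency_ms <= 20
```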
Keywords
Silent Speech Interfaces; Electromyography; Speech Synthesis; Voice Conversion; Deep Learning
Institution
Department
Institute
Document type
Dissertation
License
Language
English
Files
Name: Kevin-Scheck-Dissertation-Personalizing-Myoelectric-Silent-Speech-Interfaces-via-Cross-Speaker-Training-and-Voice-Timbre-Control.pdf
Size: 41.29 MB
Format: Adobe PDF
Checksum (MD5): 5e5c67e2b234f67f6deae352593d1490
