Improving End-to-end Korean Voice Command Recognition using Domain-specific Text
Abstract
We propose a method to improve voice command recognition performance in a new domain where only text, rather than speech, is available for training. Using domain-specific text and a pre-trained text-to-speech (TTS) system, we synthesize domain-specific speech and fine-tune a pre-trained end-to-end (E2E) automatic speech recognition (ASR) model to recognize voice commands in a new gaming environment. To further improve performance, we introduce intent-aware modeling based on multi-task learning, together with a text masking method. The proposed approach fine-tunes the context-related part of the ASR model to reflect the text of the new domain and consistently outperforms conventional E2E ASR fine-tuning methods on Korean voice command recognition tasks using recurrent neural network transducer (RNNT) and Transformer models.
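The abstract mentions a text masking method applied to the domain-specific text. The exact scheme is not specified here, so the following is only a minimal illustrative sketch, assuming token-level random masking of command text (the function name, mask token, and example command are hypothetical):

```python
import random

def mask_command_text(tokens, mask_prob=0.15, mask_token="<mask>", seed=None):
    """Randomly replace tokens in a command with a mask token.

    Hypothetical sketch: the paper's exact masking scheme is not given in
    the abstract; this illustrates masking parts of domain-specific command
    text so a model does not overfit to fixed slot values.
    """
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

# Example on a hypothetical gaming voice command
cmd = ["attack", "the", "enemy", "base"]
masked = mask_command_text(cmd, mask_prob=0.5, seed=0)
print(masked)
```

The masked sequences would then be paired with TTS-synthesized speech for fine-tuning; how masking interacts with synthesis is described in the paper body, not the abstract.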