Fig. 1
From: Context-aware data augmentation for enhanced speech command recognition in industrial environments

Overview of the proposed system. The user says the keyword, the microphone’s voice activity detector (VAD) discards non-speech frames, and the filtered audio signal is passed to the keyword spotting (KWS) module. If a keyword is detected, the user can then issue a command. The command audio is again filtered by the VAD and passed to the command recognition module, which either accepts or rejects the command.
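
The control flow in the caption can be sketched roughly as follows. This is only an illustration of the two-stage gating, not the authors' implementation: the names `vad_filter` and `run_pipeline`, the energy threshold, and the frame representation are all assumptions, and the simple energy-based VAD stands in for whatever detector the microphone actually uses. The KWS and command recognition models are passed in as plain callables.

```python
from typing import Callable, List, Optional, Sequence

Frame = Sequence[float]  # one audio frame as raw samples (assumed representation)


def vad_filter(frames: Sequence[Frame], threshold: float = 0.01) -> List[Frame]:
    """Stand-in VAD: keep frames whose mean energy exceeds a threshold (assumption)."""
    return [f for f in frames if f and sum(x * x for x in f) / len(f) > threshold]


def run_pipeline(
    keyword_audio: Sequence[Frame],
    command_audio: Sequence[Frame],
    kws: Callable[[List[Frame]], bool],
    recognizer: Callable[[List[Frame]], Optional[str]],
) -> Optional[str]:
    """Return the accepted command label, or None if nothing is accepted."""
    # Stage 1: VAD, then keyword spotting.
    speech = vad_filter(keyword_audio)
    if not speech or not kws(speech):
        return None  # no keyword detected, so the command stage never runs

    # Stage 2: VAD again, then command recognition.
    command_speech = vad_filter(command_audio)
    if not command_speech:
        return None
    return recognizer(command_speech)  # a label if accepted, None if rejected
```

Passing the models as callables keeps the sketch independent of any particular KWS or command recognition architecture; only the ordering (VAD, KWS, VAD, command recognition) is taken from the caption.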