Trend 4: Speech processing through deep learning

Over the past year, deep-learning models have displaced conventional models across most of speech processing. These neural network models have substantially improved the quality of speech recognition, text-to-speech (TTS), and speaker diarization, among other tasks. Some of the most popular ones are:

  • Automatic speech recognition (ASR): wav2vec 2.0, Mozilla DeepSpeech, VoiceFilter-Lite (Google proprietary), Jasper, QuartzNet.
  • Diarization: MarbleNet and SpeakerNet.
  • TTS:
      • Spectrogram generation: Tacotron2, GlowTTS, FastSpeech2, FastPitch.
      • Vocoders: WaveGlow, SqueezeWave, UniGlow, MelGAN, HiFiGAN.

This technology has led to significant advances in conversational intelligence, with applications such as knowledge mining, customer service, cross-sell and upsell marketing, and transactions across digital channels. Advances in speech processing have also recently lifted the expectation that chatbots have no personality. Many systems, including Mozilla DeepSpeech and Infosys Nia, exhibit deep knowledge of many subject areas and mitigate scripting errors through continuous self-learning capabilities.

A global airplane manufacturer wanted to transcribe conversations between pilots and ground staff to boost operational efficiency. These conversations were marred by cockpit noise, strong regional accents, multiple languages, and heavy ambient noise. The company partnered with Infosys to develop an open-source deep-learning model that was custom-trained for accent variations. The model delivered high transcription accuracy, supported language analysis to infer the causes of landing delays and air accidents, and provided insights to improve ground staff and pilot training.
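Transcription accuracy of this kind is commonly quantified as word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The source does not say which metric was used in this engagement, so the following is only a minimal sketch of the standard calculation; the function name and the sample radio phrases are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word inserted in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative (made-up) example: one dropped word and one substituted word.
reference = "cleared to land runway two seven"
hypothesis = "cleared land runway to seven"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

A lower value is better, and a model custom-trained on accented speech would be expected to reduce WER on that speech relative to a generic model.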


Trend 5: Open-source models now comparable to commercial counterparts

Traditionally, proprietary speech processing models, backed by large speech-to-text (STT) and TTS corpora, dominated the market. Most of these models were offered via cloud services and belonged to large tech companies. However, open-source models are advancing quickly. A majority of deep-learning models are now open-source, primarily due to two factors. First, large transformer models for language processing are available through websites such as Hugging Face and can run on machines with low computational power. Second, tech conglomerates such as Google, Microsoft, and NVIDIA have released some of their powerful, previously proprietary models to the open-source community. This suggests that open-source models will drive the next wave of transformation in speech processing.

A large U.S.-based railroad company wanted to transcribe call center conversations to optimize operations, upskill the workforce, and improve customer satisfaction. The company partnered with Infosys to develop custom open-source models and a supporting framework. Working from recordings of these calls, Infosys helped the railroad company transcribe the audio files and perform text analytics to detect the most common reasons for calls. It also helped the company gain better customer insights and identify workforce training requirements.
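As a rough illustration of the text analytics step, the sketch below tags each transcript with call reasons by keyword matching and then counts how often each reason occurs. It is not the method used in the engagement, which the source does not describe; the reason categories, keywords, and sample transcripts are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

# Hypothetical reason categories and trigger keywords. A real system would
# more likely learn these from data (e.g. topic modeling or a classifier).
REASON_KEYWORDS: Dict[str, Set[str]] = {
    "delay": {"delay", "delayed", "late"},
    "billing": {"invoice", "billing", "charge"},
    "tracking": {"track", "tracking", "location"},
}


def tag_reasons(transcript: str) -> Set[str]:
    """Return every reason whose keywords appear in the transcript."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return {
        reason
        for reason, keywords in REASON_KEYWORDS.items()
        if words & keywords
    }


def common_reasons(transcripts: List[str]) -> List[Tuple[str, int]]:
    """Count how many transcripts mention each reason, most frequent first."""
    counts: Counter = Counter()
    for transcript in transcripts:
        counts.update(tag_reasons(transcript))
    return counts.most_common()


# Made-up transcripts standing in for STT output.
calls = [
    "My shipment is delayed again",
    "I have a question about my invoice",
    "The train is late and I cannot track it",
]
for reason, count in common_reasons(calls):
    print(reason, count)
```

Each transcript is counted at most once per reason, so the totals read as "number of calls that raised this issue" rather than raw keyword frequency.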


Trend 6: End-to-end conversational offerings in focus

Offerings that ease the deployment of speech processing by bundling services such as STT, text synthesis, and TTS are becoming widely available. With these capabilities, businesses can apply speech processing to multiple problems simultaneously and achieve faster results. Popular toolkits include Mycroft, SpeechBrain, ESPnet, and NVIDIA NeMo. NeMo has separate collections for ASR, NLP, and TTS. Every module in the toolkit's repository of pre-trained models can be customized, composed, and extended to create new end-to-end conversational AI model architectures.
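The idea of composing interchangeable STT, language, and TTS stages can be sketched in a few lines. The example below does not use the API of NeMo or any other toolkit named above; the class name and the stand-in stage functions are hypothetical, and bytes are used as a placeholder for audio so that the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConversationPipeline:
    """Chains three swappable stages: audio -> text -> reply -> audio."""
    stt: Callable[[bytes], str]    # speech-to-text stage
    respond: Callable[[str], str]  # language stage that produces a reply
    tts: Callable[[str], bytes]    # text-to-speech stage

    def __call__(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        reply = self.respond(transcript)
        return self.tts(reply)


# Stand-in stages (assumptions for illustration): real stages would wrap
# pre-trained ASR, NLP, and TTS models.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def echo_respond(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = ConversationPipeline(stt=fake_stt, respond=echo_respond, tts=fake_tts)
print(pipeline(b"hello"))
```

Because each stage is just a callable with a fixed input and output type, one stage can be replaced (for example, a different response model) without touching the other two, which is the property end-to-end toolkits aim to provide.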