Natural language processing

Trend 6: Generative multimodal models become popular

Initially, NLP relied on bag-of-words representations, encoding individual words as sparse one-hot vectors. Embedding models such as Word2Vec and GloVe then captured word similarity, but their static vectors lacked contextual information and inherited biases from their training data. Long short-term memory (LSTM) networks improved on this by modeling longer-range dependencies, but their sequential processing was slow. In 2017, Google researchers introduced the transformer, laying the groundwork for foundation models that excel across diverse tasks. Attention mechanisms allow LLMs to selectively focus on the most relevant parts of the input, dynamically allocating processing capacity to the information that matters. This lets them capture long-range dependencies and context rather than surface-level patterns, handling complex and nuanced language in ways earlier NLP approaches could not, which is crucial for tasks such as question answering and summarization.
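
To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the transformer. It uses NumPy with toy dimensions and random inputs purely for illustration; it is not any production model's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights and the weighted sum of values.

    Q, K: (sequence_length, d_k); V: (sequence_length, d_v).
    Each output position is a mixture of all value vectors, weighted by how
    strongly its query matches every key -- this is how the model "focuses"
    on the relevant parts of the input.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V, weights

# Toy example: a sequence of 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
output, attn = scaled_dot_product_attention(x, x, x)  # self-attention
print(attn.round(2))  # each row sums to 1: where each token "looks"
```

Because every token attends to every other token in one step, long-range dependencies are captured without the slow sequential processing of recurrent architectures.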

LLMs now extend across modalities such as image, audio, and video. Multimodal LLMs produce rich, contextual, and highly accurate descriptions of multimedia content, and they can interpret sentiment across media, taking into account tone, emotion, and the underlying implications of a prompt. State-of-the-art models link multiple modalities into a single embedding space: Meta's ImageBind, for example, binds six modalities simultaneously, namely images, text, audio, depth, thermal, and inertial measurement unit data.
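
The sketch below illustrates the shared-embedding-space idea in the simplest possible terms. The encoders here are toy stand-ins (deterministic random projections), not ImageBind's or any real model's API; the point is only that once every modality maps to vectors in the same space, a single similarity function supports cross-modal comparison and retrieval.

```python
import hashlib
import numpy as np

EMBED_DIM = 64  # toy size for the shared embedding space; real models use far larger

def _toy_encoder(data: str) -> np.ndarray:
    """Stand-in for a trained encoder: maps any input to a unit vector.
    In a real multimodal model each modality has its own trained network,
    but all of them project into the same embedding space."""
    seed = int(hashlib.md5(data.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=EMBED_DIM)
    return v / np.linalg.norm(v)

# Hypothetical per-modality encoders -- placeholders, not a real library API.
embed_text = _toy_encoder
embed_image = _toy_encoder   # would take pixels in a real system
embed_audio = _toy_encoder   # would take a waveform in a real system

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # One similarity function works across modalities because every encoder
    # outputs vectors in the same space (unit vectors, so dot product = cosine).
    return float(a @ b)

query = embed_text("a dog barking")
candidates = {
    "audio:barking.wav": embed_audio("barking.wav"),
    "image:cat.jpg": embed_image("cat.jpg"),
}
ranked = sorted(candidates, key=lambda k: cosine_similarity(query, candidates[k]), reverse=True)
print(ranked)  # cross-modal retrieval: rank audio/images by similarity to the text query
```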

Natural language processing

Trend 7: Instruction-tuned LLM multitasking facilitates fluid and natural conversations

LLMs are pretrained in a self-supervised way, without task-specific labels, and are often further refined with reinforcement learning from human feedback. This lets them draw broad insights from vast amounts of text and adapt to new tasks and scenarios with considerable flexibility.
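
A small sketch of why no labels are needed: in next-token prediction, the raw text itself supplies the training targets. The whitespace tokenizer here is a simplification; real models use subword tokenizers.

```python
# Self-supervised pretraining: each token is predicted from the tokens before it,
# so the corpus provides both inputs and targets with no manual labeling.
text = "large language models learn from raw text"
tokens = text.split()  # toy tokenizer for illustration only

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(f"context={' '.join(context)!r} -> predict {target!r}")
```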

Instruction tuning goes further, enabling LLMs to learn multiple tasks simultaneously. The method fine-tunes a pretrained LLM on a large collection of instructions drawn from many tasks. The resulting instruction-tuned LLMs can then solve a variety of unseen tasks zero-shot, without further fine-tuning or demonstration examples.
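
The sketch below shows how instruction-tuning data from several tasks is flattened into a single text-to-text format before fine-tuning. The example records and the prompt template are illustrative assumptions, not drawn from any specific instruction dataset.

```python
# Illustrative instruction-tuning records spanning different tasks.
instruction_data = [
    {"instruction": "Summarize the following article in one sentence.",
     "input": "The committee met on Tuesday to discuss ...",
     "output": "The committee agreed on a revised budget."},
    {"instruction": "Classify the sentiment of this review as positive or negative.",
     "input": "The battery died after two hours.",
     "output": "negative"},
    {"instruction": "Translate to French.",
     "input": "Good morning.",
     "output": "Bonjour."},
]

# Hypothetical prompt template; real instruction datasets use similar schemes.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"

def to_training_example(record: dict) -> dict:
    """Every task becomes the same (prompt, target) shape, so one model can be
    fine-tuned on all of them at once and later follow unseen instructions."""
    return {"prompt": PROMPT_TEMPLATE.format(**record), "target": record["output"]}

train_set = [to_training_example(r) for r in instruction_data]
print(train_set[0]["prompt"] + train_set[0]["target"])
```

Fine-tuning then proceeds as ordinary supervised training on these prompt-target pairs; because the format is uniform, a new instruction at inference time can be answered without task-specific retraining.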

While this approach improves generalization, it carries a computational cost because of the large number of parameters involved. The payoff is more fluid conversations and potentially lower operating costs, since fewer resource-intensive API calls are needed at inference time.