
INNOVATION | 09.25.2024

The “superpowers” of multimodal AI


Multimodal artificial intelligence (AI) is the next step beyond traditional AI models. It enables the simultaneous integration and processing of multiple types of data, or “modalities,” to improve a system’s understanding and responsiveness.

When we talk about multimodal AI, we are talking about text, images, audio, video or other types of data that may arise at any given moment in an interaction with a human. A clear example of this innovation is a virtual assistant that interprets voice commands and visual gestures together to provide a more precise and contextual response.

What advantages does it offer over conventional AI? Let’s look at a specific case. Traditional natural language processing (NLP) systems work only with text and cannot integrate or analyze visual or auditory information. A multimodal system overcomes this format limitation and can take in the multimedia components present in our everyday communication; incorporating these different sources gives it a richer, more contextual understanding of the environment or of the task to be performed.
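
To make the contrast concrete, here is a minimal, purely illustrative Python sketch; the MultimodalInput type and the placeholder logic are hypothetical, not any real system’s API. A text-only pipeline accepts a single string, while a multimodal one bundles several optional modalities into one request.

    from dataclasses import dataclass
    from typing import Optional

    # Text-only NLP: a single modality in, one answer out.
    def classify_text(text: str) -> str:
        # Placeholder logic standing in for a real NLP model.
        return "claim" if "damage" in text.lower() else "inquiry"

    # Multimodal: several optional modalities travel together.
    @dataclass
    class MultimodalInput:
        text: Optional[str] = None           # e.g. a written description
        image_bytes: Optional[bytes] = None  # e.g. a photo
        audio_bytes: Optional[bytes] = None  # e.g. a voice recording

    def classify_multimodal(sample: MultimodalInput) -> list[str]:
        # Each available modality contributes evidence; missing ones are skipped.
        signals = []
        if sample.text:
            signals.append(classify_text(sample.text))
        if sample.image_bytes:
            signals.append("visual evidence")  # placeholder for a vision model
        if sample.audio_bytes:
            signals.append("audio evidence")   # placeholder for a speech model
        return signals

    print(classify_multimodal(MultimodalInput(text="Damage to the car door")))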

Evolution of AI over time

The evolution of artificial intelligence has been a dynamic, continuous process, marked by several important milestones that have transformed our ability to interact with technology.

Since its inception, AI has gone through various stages, each of which has significantly expanded its scope and functionality. Although there is no universally agreed division of these stages or of the terms used to refer to them, a useful simplification for the purposes of this article is an evolution marked by three main stages.

  1. Traditional AI: models based on single-modality data

The first generation of artificial intelligence systems focused on models that used a single source of data to make decisions or perform specific tasks. These systems, popularly known as traditional AI, were mainly based on learning algorithms to analyze structured data.

For example, the first voice recognition systems were trained only on audio data, while natural language processing (NLP) systems worked exclusively with written text. Although these models proved useful in specific areas at the time, their ability to understand and act in more complex contexts was limited by this focus on a single dimension.

  2. Generative AI: creation of new content using existing data

Thanks to advances in the field of AI and the accumulation of large volumes of data, this innovation has evolved toward what we know as generative AI.

This branch of artificial intelligence focuses on creating new content based on existing data. Thus, it can produce images, music, text and other types of content using techniques such as generative adversarial networks (GAN).
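
As a rough sketch of the adversarial idea behind a GAN, the following PyTorch fragment (model sizes and data shapes are chosen arbitrarily for illustration) pairs a generator, which turns random noise into a synthetic sample, against a discriminator, which judges whether a sample looks real; the generator improves by learning to fool the discriminator.

    import torch
    import torch.nn as nn

    # Generator: maps random noise to a synthetic sample (a flat 28x28 "image").
    generator = nn.Sequential(
        nn.Linear(64, 256), nn.ReLU(),
        nn.Linear(256, 28 * 28), nn.Tanh(),
    )

    # Discriminator: estimates the probability that a sample is real.
    discriminator = nn.Sequential(
        nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
        nn.Linear(256, 1), nn.Sigmoid(),
    )

    loss_fn = nn.BCELoss()
    noise = torch.randn(16, 64)          # a batch of 16 noise vectors
    fake_images = generator(noise)       # the generator proposes new content
    p_real = discriminator(fake_images)  # the discriminator judges it

    # Adversarial objective: the generator is rewarded when its output
    # fools the discriminator into predicting "real" (label 1). A full
    # training loop would alternate this with a discriminator update.
    generator_loss = loss_fn(p_real, torch.ones_like(p_real))
    generator_loss.backward()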

In this sense, generative AI produces content that is very difficult to distinguish from human creations. An example of this is the popular GPT-3, a language model developed by OpenAI that can generate coherent and contextual text in natural language from just a few keywords.
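
GPT-3 itself is reached through OpenAI’s hosted service, so as a stand-in the following sketch uses the open-source Hugging Face transformers library with the much smaller GPT-2 model; the model choice and prompt are illustrative, but the prompt-in, text-out pattern is the same.

    from transformers import pipeline

    # A small open model stands in for GPT-3, whose weights are not public.
    generator = pipeline("text-generation", model="gpt2")

    prompt = "Multimodal AI in insurance can"
    result = generator(prompt, max_new_tokens=40, num_return_sequences=1)
    print(result[0]["generated_text"])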

Generative AI is applied in fields as diverse as art, advertising and code development, as well as in many corporate areas, from customer service to document management. Its social, economic and business impact is high, so it is essential to use it responsibly and to reflect on its present and future potential, something we have already done at MAPFRE.

Generative AI began as a single-modality technology (for example, text-to-text models such as ChatGPT, or text-to-image models such as DALL-E) until the arrival of the third stage.

  3. Multimodal AI: integration of multiple data forms to generate more contextual applications

The next step in the evolution of artificial intelligence is multimodal AI. This approach seeks to overcome the limitations of previous models by integrating multiple forms of data. It combines information from various sources, such as text, images, audio, video and sensor data, to provide a richer and more contextual understanding of situations.

For example, in the field of health, a multimodal AI system could simultaneously analyze both medical images and patient voice recordings, along with biometric sensor data, to offer a more precise and personalized diagnosis. Another use case of multimodal AI can be found in autonomous driving systems, where data from cameras, LiDAR sensors, and maps are used to make safe real-time decisions.

Convolutional neural networks (CNNs), AI models specifically designed to analyze images by detecting visual patterns and features, are combined with systems that are effective at understanding text and audio content. By combining these approaches, multimodal AI can better understand a situation and provide more precise answers. This capacity is especially useful in complex applications such as medical diagnosis, where X-ray images, laboratory results and symptom descriptions are used together to make more accurate assessments.
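
A hedged sketch of how such a combination might look in PyTorch follows; the architecture, layer sizes and the MultimodalClassifier name are invented for illustration, not taken from any production system. A small CNN encodes the image, an embedding layer encodes the text, and the two feature vectors are concatenated before the final classification layer, which is the fusion step.

    import torch
    import torch.nn as nn

    class MultimodalClassifier(nn.Module):
        """Illustrative late-fusion model: a CNN encodes the image,
        an embedding branch encodes the text, and a shared head
        classifies the concatenated features."""

        def __init__(self, vocab_size=10000, num_classes=5):
            super().__init__()
            # CNN branch for 64x64 grayscale images
            self.image_encoder = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> 32 features
            )
            # Text branch: token embeddings averaged into one vector
            self.text_embedding = nn.EmbeddingBag(vocab_size, 32)
            # Fusion head over the concatenated 64-dim representation
            self.head = nn.Linear(32 + 32, num_classes)

        def forward(self, image, token_ids):
            img_feat = self.image_encoder(image)
            txt_feat = self.text_embedding(token_ids)
            fused = torch.cat([img_feat, txt_feat], dim=1)  # fusion step
            return self.head(fused)

    model = MultimodalClassifier()
    image = torch.randn(2, 1, 64, 64)          # batch of 2 images
    tokens = torch.randint(0, 10000, (2, 12))  # batch of 2 token sequences
    logits = model(image, tokens)              # shape: (2, 5)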

Among the most widespread multimodal AI systems are Google Gemini, GPT-4, Inworld AI, Meta ImageBind and Runway Gen-2.

Advantages of multimodal AI and its application in the insurance sector

Multimodal AI offers numerous advantages that can be leveraged in the insurance sector.

By combining different types of data, it provides a more complete and contextual understanding of the information. This can allow insurance companies to assess claims more accurately, analyze risks better and detect fraud more effectively. For example, multimodal AI can simultaneously analyze the text of a claim, images of the damage and call logs to provide a quick and precise response.

Furthermore, its ability to integrate data from different sources can be a great advantage in the relationship with the insured party, as it allows for the development of more intuitive and seamless human-system interfaces.

And with regard to policy customization, multimodal AI enables a more accurate analysis and prediction of each client’s needs and behaviors. For example, it can combine text data from emails, images of scanned documents and call logs to offer products and services better tailored to each client’s profile.
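
As a toy illustration of that claims workflow, the following Python sketch fuses three signals into a single routing decision; the Claim fields, the thresholds and the upstream models that would produce the photo and call scores are entirely hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Claim:
        description: str           # free-text claim statement
        photo_damage_score: float  # 0..1, from a (hypothetical) vision model
        call_sentiment: float      # -1..1, from a (hypothetical) speech model

    def triage(claim: Claim) -> str:
        """Toy fusion rule: combine textual, visual and audio evidence
        into a routing decision for the claim."""
        urgent_words = any(w in claim.description.lower()
                           for w in ("flood", "fire", "injury"))
        if urgent_words or claim.photo_damage_score > 0.8:
            return "fast-track"
        if claim.call_sentiment < -0.5:
            return "priority-review"  # distressed caller, human follow-up
        return "standard-queue"

    print(triage(Claim("Water damage after a flood", 0.9, -0.2)))  # fast-track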

In short, multimodal AI can revolutionize the way we interact with technology by combining different data sources to provide more precise and contextual responses. Its ability to integrate text, images, audio, video and other data allows for more sophisticated and effective applications, from medical care to personalization of services in the insurance sector.

 
