8 min read

Conversational Guide to Multimodal Conversations

Welcome to our conversational guide to multimodality. You're about to learn all the nitty-gritty details of how customers interact with technology throughout the day, why they prefer one modality over another and how to deliver multimodal experiences that satisfy their needs in different contexts.

Feel free to read the contents of this guide in order or jump straight to the section that sparks your interest. We'll start from the basic definitions and move towards more advanced concepts. Here's a list of topics covered in this guide:

  1. Ways in which Customers Interact with Technology
  2. Definition of Modality
  3. Definition of Multimodality
  4. Multimodal Spectrum of Devices
  5. Multimodal Conversation Design Principles
  6. Definition of Multimodal Bot
  7. Conclusion

How Do Customers Interact with Technology?

With the constantly growing number of channels and modalities, it should be no surprise that online conversations are becoming increasingly sophisticated. However, by looking at how potential customers interact with technology throughout the day, you can quickly develop an intuitive understanding of these interactions:

  1. Digital Interactions in the Morning
  2. Digital Interactions on the Way to Work
  3. Digital Interactions After Work
  4. Digital Interactions in the Evening

How Do People Interact with Technology in the Morning?

They wake up to the alarm ringing on their smart speaker. Without even getting out of their bed, they can tell it to stop. Then, once they've gotten up and are ready to go, they check the weather on their smart display in the hallway before heading out to ensure that they don't need an umbrella. They can listen to a summary while also seeing an hourly forecast at a glance.

How Do People Interact with Technology on Their Way to Work?

As they're walking to their car, they pull out their phone and tap it to place an order for their favourite coffee. They listen to their favourite podcast or the news while driving to work and can ask their virtual assistant for help in a hands-free way.

How Do People Interact with Technology After Work?

They go about their day and return home in the evening to start preparing dinner for their family. They turn to the smart display in their kitchen and ask for help with a recipe.

How Do People Interact with Technology in the Evening?

After having dinner, it's time to unwind in the living room and watch something on their smart TV. As they're getting ready for bed, they put on some relaxing music through their smart speaker or headphones while reading news headlines on the tablet that's sitting next to them. Finally, they tap into their smartphone before bed to check the next day's agenda.

What’s a Modality?

A modality is a specific sensory channel used for communication, such as voice or visuals.

When Do People Communicate Using Visuals?

People prefer to communicate using visuals when they are in public settings, when their informational needs go beyond the basics and when they aren't doing anything else at the same time.

When Do People Communicate Using Voice?

People prefer to communicate using their voice when their hands are busy with other activities, when they want to get a piece of information quickly or when they want to complete a simple task.

What Are the Differences Between Voice and Visuals?

Voice and visual ways to communicate may seem interchangeable, but both have their strengths and weaknesses, so it's essential to understand when it's better to use one over the other.

Strengths and Weaknesses Comparison of Voice vs Visual Types of Communication

| Aspect | Visual | Voice |
| --- | --- | --- |
| Input | The user types their request into a chat widget or taps one of the visible buttons to get the required information or complete a task. | After clicking on a microphone icon or using a wake word, the user says what they need using spoken language. |
| Output | Information presented visually can be comprehensive and include text, images, charts, maps, lists, etc. | Voice output has limited informational capabilities, and it's primarily suitable for small bites of information and short lists. |
| Discoverability | Visual representation gives the user the ability to quickly scan through it and find the piece of information they are after. | Listening to information is linear and takes time. Having to ask multiple questions to get the required information can be frustrating for users. |
| Focus | Visual communication requires intense focus from the user to process the information and effectively prevents other simultaneous activities. | Voice communication doesn't require full attention from the user and allows for other simultaneous activities, including driving a car, running or cooking. |

Why Is it Important to Adapt to Preferred Conversation Modality?

Customers select their preferred modality based on various factors including their surroundings, informational needs and simultaneous activities, so adapting to their preferences is a critical part of a conversation that helps to build trust and comfort.
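
The factors above can be sketched as a small rule-based chooser. This is only an illustration under stated assumptions, not a method from this guide: the `Context` fields, the `preferred_modality` function and the order in which the rules are checked are all hypothetical. The rules themselves simply restate the preferences described in the sections on voice and visuals.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical snapshot of the customer's situation."""
    in_public: bool     # surroundings: shared or public space
    hands_busy: bool    # simultaneous activity such as driving or cooking
    needs_detail: bool  # informational needs go beyond the basics


def preferred_modality(ctx: Context) -> str:
    """Return "voice" or "visual" based on simple rules of thumb."""
    # Busy hands rule out typing and tapping, so voice wins (assumed priority).
    if ctx.hands_busy:
        return "voice"
    # Public settings and detailed information both favour visuals.
    if ctx.in_public or ctx.needs_detail:
        return "visual"
    # A quick, simple request in a private setting suits voice.
    return "voice"
```

For example, a customer cooking at home would get `"voice"`, while a customer comparing detailed product information on a train would get `"visual"`. A real system would likely also consider an explicit user preference.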

What’s Multimodality?

Multimodality is the use of more than one modality to present the same thing. For example, speech combined with a picture is multimodal. Multimodal representations help people understand how what they see and what they hear relate to each other.

What’s the Multimodal Spectrum of Devices?

If you look at the spectrum of devices available for conversational interactions, you'll see that your customers can talk through the smart speaker in their living room, their car or even in their headphones. They often tap on their phone to watch something. To satisfy customer needs in different interaction modes, first, you need to understand all the possible variants in the multimodal spectrum:

  1. Voice-Only Devices
  2. Voice-Forward Devices
  3. Intermodal Devices
  4. Visual-Only Devices

What Are Voice-Only Devices?

This category consists of devices you can only listen to and talk to.

What Are the Examples of Voice-Only Devices?

Voice-only devices include smart speakers and headphones.

What Are Voice-Forward Devices?

Devices that rely primarily on voice but have optional visuals are known as voice-forward devices.

What Are the Examples of Voice-Forward Devices?

Examples of voice-forward devices include cars and smart displays.

What Are Intermodal Devices?

Devices that can use a mix of both voice and visuals are known as intermodal devices.

What Are the Examples of Intermodal Devices?

Examples of intermodal devices include laptops and smartphones with audio on.

What Are Visual-Only Devices?

This category contains devices you have to look at or touch.

What Are the Examples of Visual-Only Devices?

Examples of visual-only devices include laptops or smartphones that are on mute and most watches.
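
The four categories above can be captured in a small data structure. The following is a minimal sketch, assuming a device can be described by three hypothetical flags (`has_audio`, `has_screen` and `voice_primary`); neither the names nor the classification function come from any particular platform.

```python
from enum import Enum


class DeviceClass(Enum):
    VOICE_ONLY = "voice-only"
    VOICE_FORWARD = "voice-forward"
    INTERMODAL = "intermodal"
    VISUAL_ONLY = "visual-only"


def classify_device(has_audio: bool, has_screen: bool,
                    voice_primary: bool = False) -> DeviceClass:
    """Place a device on the multimodal spectrum (illustrative rules only)."""
    if has_audio and not has_screen:
        return DeviceClass.VOICE_ONLY
    if has_screen and not has_audio:
        return DeviceClass.VISUAL_ONLY
    if has_audio and has_screen:
        # Both channels exist: voice-forward if voice is the main channel
        # and visuals are optional, otherwise a free mix of both.
        return DeviceClass.VOICE_FORWARD if voice_primary else DeviceClass.INTERMODAL
    raise ValueError("a conversational device needs audio, a screen, or both")
```

With these rules, a smart speaker (`has_audio=True, has_screen=False`) is voice-only, a smart display with `voice_primary=True` is voice-forward, a smartphone with audio on is intermodal, and the same smartphone on mute (`has_audio=False`) is visual-only.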

How to Design Multimodal Conversations?

Multimodal conversation design should follow several important principles:

  1. Familiar and Accessible
  2. Availability in Different Contexts
  3. Leverage Different Modalities

Why Is it Important to Design Familiar and Accessible Conversations?

Your company should provide an opportunity for potential and existing customers to engage with your business whenever they need your assistance.

Why Is it Important to Design Conversations for Different Contexts?

Customers interact with businesses in many different contexts throughout the day, so it's essential to design conversations that help them achieve their goals when they are:

  1. at home or on the go
  2. tapping on a device or communicating using their voice
  3. in shared or personal spaces

Why Is it Important to Design Conversations for Different Modalities?

Customers communicate in various ways through different platforms and gadgets. Some devices rely solely on voice, some rely on visuals and others combine both forms of interaction. People will continue to use more types of devices in the future.

What’s a Multimodal Bot?

A multimodal bot is an AI-powered bot that can communicate with customers in both voice and visuals using natural language to provide information about products and services and accomplish everyday tasks. It can seamlessly adapt to different modalities, depending on user preferences at any given moment.

What Are the Subcategories of Multimodal Bots?

While multimodal bots can handle the entire spectrum of devices, they're often broken down into two groups by the dominant modality they use to communicate with customers:

  1. Chatbot
  2. Voice Bot

How Does a Multimodal Bot Work?

There are several steps involved in how a multimodal bot processes user requests before sending a suitable response:

  1. Understand the modality of the request
  2. Process request appropriately
  3. Set preferred modality for the response
  4. Check device capabilities
  5. Prepare the most suitable response
  6. Send a reply to the client
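
The six steps above can be sketched as a single function. This is a simplified illustration under stated assumptions rather than a real bot implementation: the `Request` and `Reply` types and the `handle` function are hypothetical, speech is assumed to be transcribed to text before it arrives, the language understanding step is a placeholder, and the preferred response modality is assumed to mirror the way the user asked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    text: str          # typed text or a speech transcript
    via_voice: bool    # True if the user spoke the request
    has_speaker: bool  # device can play audio
    has_screen: bool   # device can show visuals


@dataclass
class Reply:
    speech: Optional[str]   # text to be spoken, if any
    display: Optional[str]  # text to be shown, if any


def handle(request: Request) -> Reply:
    # 1. Understand the modality of the request.
    request_modality = "voice" if request.via_voice else "visual"
    # 2. Process the request (placeholder for real language understanding).
    answer = f"Here is what I found for: {request.text.strip()}"
    # 3. Set the preferred modality for the response (assumption: mirror the user).
    preferred = request_modality
    # 4. Check device capabilities.
    can_speak, can_show = request.has_speaker, request.has_screen
    if not (can_speak or can_show):
        raise ValueError("device has no output channel")
    # 5. Prepare the most suitable response, falling back to the other
    #    channel when the preferred one is unavailable.
    speech = answer if can_speak and (preferred == "voice" or not can_show) else None
    display = answer if can_show and (preferred == "visual" or not can_speak) else None
    # 6. Send a reply to the client (returned here instead of sent over a network).
    return Reply(speech=speech, display=display)
```

In this sketch, a spoken request on a smart speaker yields a spoken reply, while the same spoken request on a muted smartphone falls back to a displayed reply. A production bot might instead combine both channels on voice-forward and intermodal devices.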

Why Are Multimodal Bots Important?

People want to communicate using technology when they need it most. However, things get increasingly complicated with the ever-growing number of devices and technologies available to customers throughout their daily activities. Your customers expect personalised experiences, and multimodal bots enable you to deliver the types of interactions they want whenever they need them.


Conclusion

Multimodal bots can handle voice and visuals simultaneously, making them highly versatile tools for improving customer experience. By combining both voice and visual interactions, you can provide a user experience that is familiar and accessible in different contexts and across multiple modalities. It'll help you reach customers whenever they are ready to engage with your business.
