Healthcare Agents and AI Voice Generation
Considerations when building AI voice agents, the potential of voice generation in healthcare, and what happens when your agent turns Brit.
One of the funniest bugs we encountered last year was one we named Ryan Turns Brit.
We were rolling out a telephony-based, generative-AI self-service voice agent for a large U.S. healthcare system, designed to help patients get information without needing to speak to a human. It all went fairly smoothly until, one day, the customer started receiving feedback from their end-users, who were... confused. Not by the system’s functionality, but by its personality. Specifically, its accent.
The carefully selected voice character, a calm and reassuring dude named Ryan, had suddenly developed a distinct British accent. Not just a hint of an accent - we’re talking full-on special-agent-double-o-seven here.
Needless to say, American patients calling into a U.S. healthcare service weren’t expecting to be greeted by a very proper and articulate Brit. Some thought it was funny. Others thought it was weird.
But the healthcare system we were working with was not amused - they were upset, and rightfully so. The unexpected accent didn’t align with their brand, created confusion for their patients, and disrupted the experience they were aiming to deliver.
Ryan Turns Brit turned out to be a bug in an underlying voice generation system we had a dependency on. To work around it, we temporarily switched to the voice character Andrew, who wasn’t plagued by the accent bug, but also didn’t have the reassuring tone of the Ryan dude. Meaning the customer was not happy until Ryan gave up his MI6 aspirations and got back to normal. The bug was fixed within a couple of days.
This story opened the door to a broader conversation about the potential of voice generation in healthcare, engaging patients through speech interaction, and the importance of tone.
Quick intro to voice generation tech
Voice generation has come a long way from the robotic sound of early text-to-speech systems. Today’s models can produce human-like speech with natural inflection, pacing, and emotion. These systems use neural networks trained on hours of recorded speech to generate voices that sound like real humans.
Let’s highlight the differences between the various types of tech here (a short code sketch follows the list):
Text-to-speech (TTS): converts written text into spoken audio.
Speech recognition, or speech-to-text (STT): the reverse direction of voice generation - AI that listens to the audio and transcribes what the user said into written text.
Voice cloning: generates speech that recreates a specific person’s voice, given enough voice samples - even in other languages.
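To make the first two concrete, here’s a minimal sketch assuming the Azure AI Speech SDK for Python; the key, region, and voice name are placeholders, and any other speech service would follow a similar pattern:

```python
# Minimal TTS + STT sketch, assuming the Azure AI Speech SDK
# (pip install azure-cognitiveservices-speech). The key, region,
# and voice name below are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY",  # placeholder
    region="YOUR_REGION",            # placeholder
)

# Text-to-speech: pick a voice character, then speak to the default speaker.
speech_config.speech_synthesis_voice_name = "en-US-AndrewNeural"  # illustrative
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
synthesizer.speak_text_async("Hello, how can I help you today?").get()

# Speech-to-text: listen on the default microphone, transcribe one utterance.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
result = recognizer.recognize_once_async().get()
print("User said:", result.text)
```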
As a side note - speaking of replicating someone’s voice, if you read my blog post from last year about Generative AI and the Hollywood strike, you may recall I raised the theoretical possibility of narrating the Tokyo subway system, in Japanese, using the voice of my favorite British actor. On a more serious note, one of the major issues discussed there was related to AI replicating humans, and actors being concerned about film studios using their appearances or voices without their consent. Complicated.
And as a side note to the side note - video and image generation technologies, and the legislation emerging around them, have evolved so much since that blog post that I feel I owe the topic a revisit. I will do that in one of the next blog posts, so make sure to keep following.
Voice generation technology holds powerful potential when applied responsibly, especially in healthcare. Here’s why.
The potential of voice generation in healthcare
Voice is one of the most intuitive and accessible channels for interacting with AI agents. It allows users to engage hands-free, and can feel more natural than text-based interfaces - especially for patients with limited digital literacy or accessibility needs. In healthcare, where time, empathy, and clarity are all in short supply, a well-designed voice interface can be a game-changer.
Voice channels can significantly enhance accessibility, whether for people with vision impairment or people who have challenges reading. Voice channels can also help patients interact with the system in different languages, and can even be adjusted to match the local accent of the patient’s region. Unless they go rogue like our Ryan dude in the story above, that is.
Voice generation technology can also be assistive for people who cannot speak. Perhaps the most touching example I’ve recently seen is a project at Microsoft to restore the voice of an ALS patient using AI. With only a few samples of voice recordings made before he lost his voice, a team at Microsoft was able to recreate the patient’s natural voice, allowing him to communicate again with his family in his own tone and style. Watch the video here:
Considerations for voice agents that interact over telephony
When building AI agents that are intended to interact with end-users over voice-based telephony systems, there are several very real considerations you need to take into account. Here are some of them:
Latency: users are far more sensitive to delays in voice conversations than they are in chat. A pause that might be acceptable in text feels awkward when coming from a voice assistant. If an agent takes too long to respond, users will disengage quickly.
Make it short: spoken responses must be shorter than textual responses. People will scroll through a written paragraph, but they won’t sit and listen to a long answer. Most people can’t even stand it when a human rambles for too long, so if your voice agent starts babbling, users will lose patience and likely ask to be handed off to a human. So. Short sentences. Speak to the point.
Allow barge-in: in real conversations, people interrupt each other all the time, especially when they’ve heard enough to act. In some cultures even more than others. And you know exactly who I’m talking about. Voice agents should likewise allow end-users to barge in, so users can speak over a response to move the conversation along (see the sketch after this list). Without that, the interaction can quickly become annoying.
Tone: adjust the tone to the needs of the end-user. For healthcare agents over voice or telephony, that typically means the agent should be speaking clearly, calmly and with empathy, as patients may be calling when they’re stressed or not feeling well.
Language: use language that is suitable for the end-user. If your end-users are patients, the agent needs to use terminology and sources that patients would understand.
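Here’s a rough sketch of barge-in handling, again assuming the Azure AI Speech SDK; the idea is simply to stop the agent’s playback the moment the recognizer hears the user start talking:

```python
# Rough barge-in sketch, assuming the Azure AI Speech SDK:
# cut off the agent's TTS playback as soon as the user starts speaking.
# Key and region are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_SPEECH_KEY", region="YOUR_REGION"  # placeholders
)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

def on_user_speaking(evt):
    # Partial recognition results fire while the user is still talking,
    # early enough to stop the agent mid-sentence.
    synthesizer.stop_speaking_async()

recognizer.recognizing.connect(on_user_speaking)
recognizer.start_continuous_recognition_async().get()

# The agent speaks; if the caller barges in, playback stops via the handler.
synthesizer.speak_text_async(
    "Your appointment is scheduled for Tuesday at nine in the morning..."
)
```

In a real telephony deployment the audio would come from the call stream rather than a microphone, but the event wiring is the same idea.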
Lessons from Ryan
Couldn’t help bringing up the film Her again here. The film puts a spotlight on the role of the voice character in AI agents, and the importance of its tone in engaging the user.
Inception is important too, as I wrote in my previous blog post about Inception of AI Agents. Agents that are built to interact with end-users over voice need to be incepted for this channel, and agents that are built to interact with patients need to be incepted differently than agents intended to interact with clinicians.
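As a purely hypothetical illustration of channel-specific inception - the prompt text below is mine, not taken from any production system:

```python
# Hypothetical inception (system) prompts, showing how the same healthcare
# agent might be grounded differently per channel and audience.
VOICE_PATIENT_PROMPT = """\
You are a healthcare voice assistant speaking with patients over the phone.
Speak calmly, clearly, and with empathy. Keep answers to one or two short
sentences. Use plain language a patient would understand, and avoid jargon.
"""

TEXT_CLINICIAN_PROMPT = """\
You are a healthcare assistant chatting with clinicians in text.
Be concise but precise. Clinical terminology is appropriate, and longer,
structured answers with cited sources are fine in this channel.
"""
```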
When you’re building agents in healthcare, the voice you use matters. It’s not just about what the system says - it’s how it says it. What we learned from Ryan Turns Brit was that even something as lovely as the voice character’s accent can undermine the user experience.
And let’s admit it, the British accent is lovely. There. I said it. Someone had to.
We see more and more interesting use cases for AI agents in healthcare. And some special agents. More on that soon.