An invisible interface: 6 things we learned from designing for voice

How to design for an interface you can't see

Illustration of audio waves Sanette Tanaka

How do you design for an interface you can’t see?

We took on that challenge last spring when we decided to explore what it’d take to create a bot for the Amazon Echo. As the designer on the project, I was intrigued by the idea of creating a compelling experience for a voice-based product. After all, an interface is merely a means of connecting a user to a system, be it screen or speech. How hard could it be?

Our team had explored bot and messaging platform projects in the past, but the speech-to-audio interaction was new to us, and good examples from news organizations were few and far between. Given our resources and scope, we decided to go with a prompt-and-response system as opposed to attempting a completely open-ended AI chatbot.

We worked closely with Emily Withrow of Northwestern University Knight Lab, who brought us the initial idea for the project and six months of research. Over the next several weeks, our two teams worked together to design and develop a minimum viable product.

Initial flows, whiteboarding, and sketches from our hacking sessions for our Echo bot
Chao Li, Sanette Tanaka, Emily Withrow

Our final product worked like this: each day, our editorial teams added three new “stories” to our test bot. Each story consisted of a short audio clip and a number of related clips. When you started the experience, the Echo played the first story. You could then move on to the next story, or say one of three prompts to dive deeper into the current story.

Today I want to share some of what we learned throughout our many attempts at creating a satisfying voice experience. I should note that while we recognize that voice interfaces vary substantially among platforms, this post primarily compares the Alexa Voice Service interface delivered via the screenless, voice-activated Amazon Echo to graphical, screen-based user interfaces.

1. Balance discoverability with direction

Our test bot centered on the idea of a “story”: a single piece of content, with a number of related pieces of content. We knew that we had to create a navigation structure that allowed users to easily access all of the information.

We considered a ton of approaches, from a completely set flow (where the user says a single prompt to move forward in the experience)...

Illustration showing a completely set flow, where the user says a single prompt to move forward in the experience

… to a totally open-ended model (where the user can try a number of prompts to reach more content).

Illustration showing an open-ended model, where the user can try a number of prompts to reach more content

Neither of those models worked for us. The set flow felt too scripted, whereas the open-ended option paralyzed users with choice.

Web designers can count on some inference on the part of their users, such as their location on the site (based on the breadcrumbs or text that they see) or where they can go next (based on the links and buttons that are available). In contrast, voice UI designers have to be extremely explicit.

That reasoning ultimately led us to a limited choose-your-own-adventure experience, where users could say one of three prompts to dive deeper into a single story, or move on to the next story.

Illustration showing a limited choose-your-own-adventure experience, where users could say one of three prompts to dive deeper into a single story, or move on to the next story.

Using set prompts took some of the discoverability out of the bot, but provided a more satisfying experience than a seemingly-all-knowing-but-actually-quite-limited one (Siri, for instance). During testing, we found that this flow gave just the right amount of authority to the user, giving them choice without overwhelming them with options. Users could move in and out of stories easily, and felt confident that they knew how to reach all aspects of a story.
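As a rough illustration of that limited flow, here is a minimal Python sketch. It is not the code behind our bot: the `Story` and `Session` names, the clip file names, and the three prompt words used in the usage note ("background", "people", "numbers") are all hypothetical stand-ins. The point it tries to show is that, at any moment, the user has exactly four moves: one of three prompts, or "next".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Story:
    main_clip: str           # clip played when the story starts
    related: Dict[str, str]  # hypothetical prompt word -> related clip


class Session:
    """Tracks which story the listener is on (assumes at least one story)."""

    def __init__(self, stories: List[Story]) -> None:
        self.stories = stories
        self.index = 0

    def start(self) -> str:
        # No introduction: go straight to the first story's main clip.
        return self.stories[0].main_clip

    def handle(self, command: str) -> Optional[str]:
        """Return the next clip to play, or None once the stories run out."""
        if self.index >= len(self.stories):
            return None  # already past the last story
        if command == "next":
            self.index += 1
            if self.index < len(self.stories):
                return self.stories[self.index].main_clip
            return None
        related = self.stories[self.index].related
        if command in related:
            return related[command]  # dive deeper, stay on the same story
        raise ValueError(f"unsupported command: {command!r}")
```

In this sketch, a session built from three `Story` objects would answer `handle("background")` with the current story's related clip and `handle("next")` with the following story's main clip. Anything outside those four moves is rejected rather than guessed at.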

2. Remember that “words as audio” does not equal “text on screens”

The treatment of text on screens assumes some general powers on behalf of users—that they can access some sort of global navigation, skim a block of text, and reread a section. We can’t assign those same powers to users who are listening to words as audio, rather than reading them as text.

Research shows that people lose attention after 10 to 15 seconds of listening to a prompt. That makes designing audio outputs for Alexa, where speech follows a linear path and has a defined start and end, extremely challenging. Users have less control over factors like pace and length, and we needed to be cognizant of that.

In our original design, we wrote an introduction that welcomed the user to the bot and briefly explained how to use it. Once we played the text through Alexa, though, we quickly realized that the copy was way too long. We wanted to capitalize on our users’ attention from the get-go, so we eventually decided to nix the introduction completely and launch straight into the first story.

3. Be wary of what you ask users to remember

People forget what they hear more easily than what they see or touch. In our test bot, we used words as menu options. We found, though, that even remembering three prompts asked a lot of our users. Designer Luke Wroblewski said it well: “When there’s no graphical user interface (icons, labels, etc.) in a product to guide us, our memory becomes the UI.” And our memories are far better at recognizing cues than straight up recalling them.

During early testing sessions, users regularly forgot prompts, merged two into one, or recalled phrases that were never actually stated. Whenever this happened, users immediately assumed the bot was broken—and for all intents and purposes, they were right. If they couldn’t reach the information they wanted, our experience was broken.

To address this, we made sure that our prompts did not change from story to story—we didn’t want users to have to relearn the commands every time they launched the bot. We also played the prompts after the first story, so users had little time between hearing the instructions and having to act on them. Finally, we provided auditory cues. If users hesitated or said something the bot didn’t recognize, they were reminded of the three prompts.
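The reminder behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not our implementation: the prompt words, the `respond` function, and the reminder wording are hypothetical, and a hesitation is modeled simply as `None` (no utterance received).

```python
from typing import Optional, Tuple

# Hypothetical prompt words; the same set is used for every story so
# listeners never have to relearn the commands.
PROMPTS = ("background", "people", "numbers")

REMINDER = "You can say " + ", ".join(PROMPTS) + ", or next."


def respond(utterance: Optional[str]) -> Tuple[str, str]:
    """Return ("play", command) for a known command, else ("reprompt", reminder)."""
    if utterance is None:
        # The user hesitated: repeat the options instead of staying silent.
        return ("reprompt", REMINDER)
    cleaned = utterance.strip().lower()
    if cleaned in PROMPTS or cleaned == "next":
        return ("play", cleaned)
    # Unrecognized input gets the same reminder rather than an error.
    return ("reprompt", REMINDER)
```

Treating silence and unrecognized speech the same way keeps the recovery path short: in both cases the listener hears the three prompts again right when they need them.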

4. Make room for natural speech

Just like your cursor on desktop or your finger on mobile, what you say to Alexa is a means to an end.

And just as it’s good practice to increase the size of tappable areas on touch screens to account for our fingers, we had to consider the many ways speech would be “inputted” into the bot. Many of those factors, like background noise, accents, inflections, regional differences, voice quality, and so forth, were Amazon’s responsibility. Still, our team could do a lot on our end by anticipating and accounting for the many ways that people modify their speech when speaking to Alexa.

Though voice interfaces are not new, Alexa’s human-like demeanor brought out interesting user behaviors during our testing sessions. We tested with both Echo owners and people who had never interacted with the product before, and found variations among both groups. Some users stated the prompts as they were, others asked them as questions, and others still phrased them as directives (e.g. “Tell me about the…”).

To account for those behaviors, we broadened the inputs of what the bot would accept. Our test bot anticipated and accepted more than a dozen alternate inputs for each of the three prompts. People are not robots, and we shouldn’t expect them to behave as such.

Illustration showing the three prompts supported in the bot, as well as more than a dozen alternate inputs for each.
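One way to picture the broadened inputs is a lookup table that maps many phrasings onto one canonical prompt. The Python sketch below is an assumption-laden illustration, not how our bot was built: the canonical prompt words and every alternate phrase are invented for the example, and it lists only a handful of alternates per prompt rather than more than a dozen.

```python
import re
from typing import Optional

# Hypothetical canonical prompts, each with statement, question, and
# directive phrasings that should all be treated the same way.
ALTERNATES = {
    "background": [
        "background",
        "the background",
        "what's the background",
        "tell me about the background",
        "give me the background",
    ],
    "people": [
        "people",
        "who's involved",
        "who is involved",
        "tell me about the people",
    ],
    "numbers": [
        "numbers",
        "the numbers",
        "what are the numbers",
        "tell me about the numbers",
    ],
}


def normalize(utterance: str) -> str:
    """Lowercase, straighten apostrophes, drop punctuation, collapse spaces."""
    text = utterance.lower().replace("\u2019", "'")
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())


# Reverse index: normalized phrase -> canonical prompt.
LOOKUP = {
    normalize(phrase): prompt
    for prompt, phrases in ALTERNATES.items()
    for phrase in phrases
}


def match_prompt(utterance: str) -> Optional[str]:
    """Return the canonical prompt for an utterance, or None if unknown."""
    return LOOKUP.get(normalize(utterance))
```

With this table, "What's the background?" and "Tell me about the background" both resolve to the same prompt, while an unrelated request resolves to `None` and can be met with a reminder of the options.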

5. Design for uncertainty

While a consideration for every product, designing for multiple contexts is particularly important with the Echo, a product intended to sit in the home, that listens and responds to spoken prompts. Whereas a website or app user can always look away from their screen, an Echo user needs to voice a command and Alexa needs to “hear” it.

We pressed ourselves early on to consider our users’ contexts when they are interacting with our bot and brainstorm ways that it might be invasive. What if they trigger the bot by accident? What if they are in a rush and need a response immediately? What if they get a phone call while Alexa is in mid-sentence?

We devoted an entire portion of our usability testing to skipping content and ending the experience. We tried to mimic an intense situation by describing a scenario for users and asking them to react. Doing so revealed a number of phrases that a user might rely on to stop the bot (e.g. “leave,” “quit,” and “enough”) in addition to the ones that Amazon automatically provides.

Although Amazon supports a final message after a stop command, we decided not to include one. If a user said stop, the bot stopped. The last thing we wanted was for users to feel trapped in the experience.
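The stop behavior is simple enough to sketch directly. In the Python below, "leave", "quit", and "enough" are the phrases that surfaced in our testing and "stop" stands in for the commands the platform already handles; the `handle_utterance` function and the shape of the response dictionary are hypothetical and do not mirror Amazon's actual response format.

```python
from typing import Dict, Optional, Union

# "stop" represents the platform-provided commands; the other three are
# the extra phrases users reached for during testing.
STOP_PHRASES = {"stop", "leave", "quit", "enough"}

Response = Dict[str, Union[Optional[str], bool]]


def handle_utterance(utterance: str) -> Response:
    """End the session silently on any stop phrase."""
    if utterance.strip().lower() in STOP_PHRASES:
        # No goodbye message: when the user says stop, the bot stops.
        return {"speech": None, "end_session": True}
    # Placeholder for the normal prompt handling, which is out of scope here.
    return {"speech": f"(handle prompt: {utterance})", "end_session": False}
```

Returning no speech at all on a stop phrase is the design choice worth noting: the session ends without making the user sit through one more sentence.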

6. Hold back on personality

The personality of a bot can make or break the experience. Some bots are heavily personalized, like the notorious Clippy. While there’s a happy medium between full robot servant and best friend, I’m not a huge fan of skeuomorphic interfaces that attempt to skin a voice interface as a human conversation.

Our Echo bot has a fairly neutral personality. We felt that Alexa’s innate personality, expressed through voice and tone, was distinct enough. In addition, we knew that the copy in the bot would be heard again and again—we had to ensure that our experience worked for the first listen and 100th listen.

To sum it up...

A number of designers have argued that the best interface is no interface. Whether you subscribe to that mantra or not, this project gave us the opportunity to entertain that reality. Creating a good voice experience boils down to following good design principles that are tailored to the strengths and constraints of your chosen technology.

Voice design is still nascent, and I’m happy that our team could help shape one of the early experiences.


* * *

If you can’t get enough of speech interface design, the following resources were super helpful to us when we were starting out:


Huge thanks to my teammates Chao Li, Allison McHenry, and Yuri Victor, as well as Joe Germuska and Emily Withrow of Northwestern University Knight Lab for working with us on this project. Finally, thank you to my coworkers Ryan Gantz, Katie Kovalcin, and Lauren Rabaino for their thoughtful edits.