Alan Harnum

Voice UIs: Promises and Challenges of a "New" Technology

08 Jun 2018

Below are the speaker's notes from a talk I gave at the 2018 Guelph Accessibility Conference. I'm both intrigued by and very wary of voice UIs, so this was a good opportunity to survey the topic in detail.

The slides from the talk are also available.


Introducing Myself and This Talk

Good morning. I’m Alan Harnum, a Senior Inclusive Developer at OCAD University’s Inclusive Design Research Centre. You can visit our website to see details of the work we do today and have done in the past - this is a milestone year for us, the 25th anniversary of our founding at the University of Toronto in 1993 as the Adaptive Technology Resource Centre. But broadly speaking, we work to advance the state of “design that considers the full range of human diversity with respect to ability, language, culture, gender, age and other forms of human difference”. Accessibility is a major aspect of this work for us, so I’m very pleased to be here again this year to learn from everyone and talk a little about our own work.

I am going to start this talk by saying I am not an expert in voice UI, and I suspect there are people at this conference and possibly even in this room who have more concrete experience of the technologies involved than me, and a better understanding of how they work.

What I am is someone who has experimented with these technologies in a few different contexts, has followed the news about them in the last few years, and has some thoughts about the particular moment we’re in, from both a technical and social standpoint - though of course those cannot be cleanly separated, so call it a sociotechnical standpoint.

To show my hand, here is what I think is going on - the moment that we are in, as such (I will welcome disagreement with any of these points, as I am sure I’m wrong about some or all of them):

That is the short version. The longer version is the rest of the talk, and the main thing I’m hoping to do is start some conversation about this.

What Are Voice UIs?

For my purposes, I am defining a voice UI as a technology or context that prioritizes voice-based control. In many contexts voice synthesis may also be the primary means of communication by the device about its state or the actions it is taking, as is the case with smart speakers like the Amazon Echo, but my main criterion here is control by voice recognition.

So this includes things like voice control of a smartphone, voice-controlled “personal assistants” like Siri or Alexa that may exist on different platforms, software to control a personal computer by voice commands, or the broadly-classed “Internet of Things” devices that can respond to voice commands, either independently or through integration with another device - that space of acting as an integration point for things like internet-connected thermostats and cameras seems to be one of the places Google, Amazon and others are trying to occupy with their smart speakers and similar devices.

Controlling machines by voice has a long history in science fiction, from “HAL, open the pod bay doors” to “Computer: tea, Earl Grey, hot”. There is something very appealing in the idea of machines responding to natural language commands, but also the fearful possibility that they will not obey us.

By this definition, I’m excluding technologies like screen readers. I’m also less concerned in this talk with more specialized speech recognition software like Dragon NaturallySpeaking, which has a long history in accessibility. It’s not that these aren’t interesting and important topics in their own right, but some of what I’m specifically interested in is this moment where several things are happening at once:

Why Are They of Interest for Accessibility and Inclusion?

There’ve been a number of articles in the popular media about smart speakers being used by visually impaired people - this CBC Spark story is one example. It’s worth listening to if you’re interested in this topic, as it gives a quick overview of some of the uses the technology is being put to by a specific group of users. It talks specifically about custom skill development, which I think is a key topic to be aware of - it gets called different things depending on the company or product, but the core idea is to create custom code to allow the voice-controlled device to execute new functions.

Last year at this conference I gave a talk called “Multimodal Design Patterns for Inclusion & Accessibility” where I spoke about some of the experimental work we’ve done at the IDRC as part of our work on the Fluid Project, the open source community we’re part of. Without rehashing that talk - it would be extremely meta to use some of a talk this year to talk about a talk from the previous year - one of the main points I made is that inclusively-designed systems need to have the potential to accommodate different modalities of interaction and perception.

For me, voice UIs represent significant possibilities for additional modalities, areas of potential growth and stretching of the range of what is possible that could help to support independence, autonomy and collaboration between people of diverse needs in new ways.

I am shortly going to talk about some of the very significant problems with the current state of the technology, but staying in that space of hopeful potential, we can imagine possibilities like voice-controlled power wheelchairs, or custom voice agents deployed through consumer smart speakers like the Google Home or Amazon Echo to support more independent living by people with various disabilities.

One of the reasons I put “New” in quotation marks in this talk is that from a certain standpoint, none of this is conceptually or experimentally particularly “new”. Voice-controlled power wheelchairs have been in the research literature for at least fifteen years now.

What I think is new - and what we can potentially leverage as people interested in inclusion and accessibility supported by technology - is the previously mentioned wide distribution of cheap consumer devices with good voice recognition and, coming alongside them, a wide range of open source projects and vendor-supported toolkits for building voice UIs.

Even some web browsers - Chrome, specifically - are able to do voice recognition quite well. At the IDRC, we were able to use this to build a transcribing voice recorder in pure browser technology as an experiment that I will now attempt to demonstrate, with some uncertainty about whether it will work in this room’s acoustics (n.b. It worked!).
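
For the developers who are curious, here is a rough sketch of what the browser side of that experiment looks like. This isn’t our actual code, just a minimal illustration of the Web Speech API that Chrome exposes; the element ids (“record” and “transcript”) are hypothetical. It’s also worth remembering that Chrome does the actual recognition by sending the captured audio to Google’s servers - something to keep in mind when we get to privacy.

```typescript
// A minimal sketch of in-browser speech recognition with the Web Speech API.
// At the time of writing this is only well supported in Chrome, where the
// constructor is exposed with a webkit prefix and there are no official
// TypeScript typings, hence the "any" casts.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.continuous = true;      // keep listening across pauses in speech
recognition.interimResults = true;  // surface partial transcripts as they arrive
recognition.lang = "en-US";         // recognition quality is best in English

recognition.onresult = (event: any) => {
  // Stitch together everything recognized so far into one transcript.
  let transcript = "";
  for (let i = 0; i < event.results.length; i += 1) {
    transcript += event.results[i][0].transcript;
  }
  // "transcript" is a hypothetical element where the text is displayed.
  document.getElementById("transcript")!.textContent = transcript;
};

recognition.onerror = (event: any) => {
  console.error("Speech recognition error:", event.error);
};

// "record" is a hypothetical button that starts the recorder.
document.getElementById("record")!.addEventListener("click", () => {
  recognition.start();
});
```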

What I will say, especially for the developers in the audience, is that it turned out to be a lot easier to start building custom systems using voice recognition than I expected it to be. The Chrome browser is one possible avenue for this, but I’ll highlight some other possibilities later. And the two major smart speakers available in Canada from Google and Amazon also have cloud-based toolkits that aren’t quite drag and drop for building custom behaviour, but are fairly easy to work with.
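
To give a flavour of what those toolkits look like, here is a minimal sketch - not code we’ve deployed - of a custom Alexa skill handler written in TypeScript against Amazon’s ask-sdk-core library for Node.js. The intent name and the spoken response are hypothetical; the Google equivalents (the Actions SDK and Dialogflow) follow a broadly similar request/response pattern, where your code receives a structured representation of what was said and returns text for the device to speak.

```typescript
// A minimal sketch of a custom Alexa skill handler built with the ask-sdk-core
// Node.js library and deployed as an AWS Lambda function. The intent name
// "MorningScheduleIntent" and the spoken response are hypothetical.
import * as Alexa from "ask-sdk-core";

const MorningScheduleHandler = {
  canHandle(handlerInput: Alexa.HandlerInput): boolean {
    return (
      Alexa.getRequestType(handlerInput.requestEnvelope) === "IntentRequest" &&
      Alexa.getIntentName(handlerInput.requestEnvelope) === "MorningScheduleIntent"
    );
  },
  handle(handlerInput: Alexa.HandlerInput) {
    // A real skill would look this up for the current user; here it is canned.
    return handlerInput.responseBuilder
      .speak("Your first appointment today is at ten o'clock.")
      .getResponse();
  },
};

// Wire the handler into a skill and expose it as the Lambda entry point.
export const handler = Alexa.SkillBuilders.custom()
  .addRequestHandlers(MorningScheduleHandler)
  .lambda();
```

The code is only half the picture - the matching intent and its sample utterances also have to be defined in the Alexa developer console - but the overall workflow is, as I said, fairly approachable.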

Now, from another angle, I think we need to be aware of the significant limitations of voice recognition technology itself, before we get into discussing the larger issues of privacy, security and control/ownership. Relatively high-quality voice recognition only exists in a few languages. English is the predominant one, and the Amazon Alexa voice service that powers the Echo still does not support French. So when I talk about cheap, good voice recognition, I’m really only talking about English and a few other languages.

We also know that even if you speak English, many of these systems need a certain style of English to achieve a high rate of recognition. They don’t handle accents well beyond certain American and British ones; you get the best results when you can speak clearly and enunciate; and they don’t always handle background noise particularly well.

Finally, some of the new generation of voice recognition systems are using machine learning to develop their recognition and progressively improve it. This obviously has the same dangers of algorithmic discrimination and not including a diverse range of voices and experiences that machine learning generally has.

So there are some qualifiers even before we get into the next area...

What Are the Privacy and Security Concerns?

I’m going to do this one as a series of screenshots, which I’ll describe as well. They’re all of online articles from the last few years.

One of the realities of most of the consumer-grade devices is that they get “better” by recording samples of your voice and using them to build up a personal recognition corpus. So recordings of your voice end up on Google or Amazon’s servers, which raises a whole host of potential issues. This is part of the broader problem of companies tracking your activity - but in this case, the tracking includes a transcription of what you said, when you said it, and a recording of your voice. And of course the companies respond that you can opt out of this, delete recordings, etc., but I don’t think this is a real answer - these devices are typically invasive by default and try to collect data, and you have to work hard to prevent them from doing so.

Finally, I think it’s a definite sign of a problem if the American Civil Liberties Union writes an advisory blog post about a technology.

One of the things I go back and forth on in my thinking personally is whether or not recording voice commands - as opposed to the issues of inadvertent recording or compromised devices - is inherently a bigger problem than all the other forms of activity tracking companies like Google and Facebook do when offering “free” services. Something about the recording of my voice feels more personal than data-mining my emails for advertisements, but is that just my particular bias? I worry a little about overstating the dangers of these systems relative to others - to someone like me who’s able to use conventional controls, they seem like an additional convenience rather than a game-changer - and, as a result, failing to explore areas of work that could be transformative for some people.

What Are Some Potential Directions for the Future?

At the start of the talk I said that I thought a path forward for those of us interested in accessible and inclusive technology was to get better informed about voice UI technology, both what it can offer and where the current shortcomings are. I’ve also tried, briefly, to offer some of that in this talk.

I also said at the start that I had some thoughts about the particular sociotechnical moment we’re in with voice UI technology. It’s always dangerous to say “this moment in time is rather like this other moment in time that I personally experienced”, but I do think there are echoes of the period in the late 90s - early 2000s when various private companies tried to make their browser the dominant one. This was, as some of you might recall, a very difficult era for the web - one where standardization was poor and some websites would only work in certain browsers. I feel we may be going through something of the same in this moment for voice UI, as the big companies pour a lot of energy into trying to become dominant in the space, producing voice UI devices that don’t work well together, and trying to get other manufacturers to adopt their particular technology.

In the popular interpretation of how the “browser wars” era came to an end (it didn’t, really, but that’s another talk), the creation of the open source Firefox browser from the wreckage of Netscape played a big part in shifting the landscape. So to close this talk I’d like to offer a highly selective list of open-source projects that deal with voice UI. I think it’s important to be pragmatic and look at the big commercial companies, and how their voice UIs are currently being used, but I also think it’s important to inform ourselves about how to build and explore in this space without them.

Mozilla Common Voice is a Mozilla project to build better open voice datasets for training speech recognition applications. A related project is their DeepSpeech project, an open-source implementation of Baidu’s DeepSpeech architecture for building speech recognition systems with machine learning. Jasper is a project that lets you build devices like smart speakers using the inexpensive Raspberry Pi computer and some off-the-shelf components like a USB microphone.

Finally, Snips isn’t fully open-source yet (though they’ve promised to open-source everything soon), but is intriguing for its emphasis on decentralization and not sending voice data into the cloud. The company originates in the EU, which is somewhat reassuring for privacy given the recent General Data Protection Regulation.

Questions and Discussion

We’re at the end of the “structured” portion of this presentation, but I’d like to use the time we have left for questions - not just for me, but for you as well. In particular...

Of course I am also happy to field any questions you have for me.