Surveillance Blog

Where do you stand on voice?

Written by Simon Brady | 11/28/22 10:49 AM

At our recent XLoD event – I hope you were one of the 650 or so attendees – there was an enormous amount of off-stage debate about the true role of comms surveillance in general and voice in particular. I’ll return another time to the question of whether comms surveillance is just a secondary investigative tool and what that view means for integrated surveillance. But there were also lots of interesting insights into why people still struggle so much with voice and why even those who want to incorporate voice into more integrated surveillance programmes may be focusing on the wrong things.

The most extreme view is that voice is the least useful of the surveillance channels. Even assuming good recording quality and excellent transcription, the volume and complexity of those outputs make them useless for any kind of near real-time surveillance and near useless for the initial surfacing of risks in any non-real-time analysis. “You would never start with voice if you wanted to detect, say, market abuse. The false positives would be huge, and even when a human listens to the original voice recordings, the signals are so subtle that they can miss them even when a trade alert has told them what they are looking for,” said one surveillance chief.

At the opposite end of the spectrum are, obviously, the providers of voice surveillance solutions, who point out that in areas such as fraud they detect ten times as many instances of wrongdoing as other channels. They also point out, though generally not in public, that most banks focus on the wrong things when looking at voice.

Surveillance teams tend to focus on transcription accuracy as a key evaluation metric for voice solutions. They assume that if a solution is very accurate at turning audio into the exact written equivalent, then that is the best raw material to then feed into their e-comms surveillance system.

This is wrong on three counts. First, transcription accuracy expressed as a percentage of words correctly transcribed is not useful by itself. A solution that transcribes ‘its’ as ‘it’s’ or ‘the’ as ‘thu’, or is tripped up by the many ‘umm’s and ‘er’s, isn’t actually losing any diagnostic value, but its accuracy count will suffer.
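To see why a raw word-accuracy figure can mislead, here is a minimal sketch of the standard word error rate (WER) calculation; the two transcripts are invented for illustration. The hypothesis drops the fillers and fluffs ‘the’ twice, yet loses nothing of diagnostic value – and still scores a large error rate:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

reference  = "umm I think the price is er moving before the announcement"
hypothesis = "I think thu price is moving before thee announcement"
print(f"WER: {wer(reference, hypothesis):.0%}")  # prints "WER: 36%"
```

Four of the eleven reference words are “wrong” (two dropped fillers, two mangled ‘the’s), so the score is 36% – terrible on paper, harmless for surveillance purposes.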

Second, and related, a voice surveillance system should be judged on whether it picks up examples of misconduct, not on whether it gets every word right. These are very different things, and it is easy to test for the former: give the machine a large volume of audio in which you know misconduct has occurred and see whether it can find it.
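That outcome-based test reduces to a simple recall check. In this sketch the call IDs, the seeded misconduct labels, and the set of flagged calls are all hypothetical; the point is that the metric is whether the known bad calls surface, not word-level accuracy:

```python
def evaluate(flagged: set, known_misconduct: set, all_calls: set):
    """Score a surveillance run against calls with known, seeded misconduct."""
    recall = len(flagged & known_misconduct) / len(known_misconduct)
    clean = all_calls - known_misconduct
    false_positive_rate = len(flagged & clean) / len(clean)
    return recall, false_positive_rate

all_calls = {f"call-{i}" for i in range(100)}
known_misconduct = {"call-3", "call-41", "call-77"}    # seeded test cases
flagged = {"call-3", "call-77", "call-12", "call-95"}  # what the system surfaced

recall, fpr = evaluate(flagged, known_misconduct, all_calls)
print(f"recall: {recall:.0%}, false-positive rate: {fpr:.1%}")
# prints "recall: 67%, false-positive rate: 2.1%"
```

A system with mediocre transcription but high recall on seeded misconduct is doing its job; a system with near-perfect transcription that misses the seeded calls is not.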

And third, how you view voice surveillance determines how you do it. If you think voice is secondary, and if you think the way to use it is as an investigative tool run through your e-comms models, you are guaranteeing the poor results that make you think it is secondary in the first place. Why? Because you are relying on poorly transcribed text being fed into analytics that are not designed for the way humans speak.

Transcribed speech looks nothing like written emails or even messages. The way meaning is embedded in speech is not the same as it is in the written word. So even if you have perfect transcription, models built to detect misconduct in the written world will not necessarily pick it up in transcribed audio.

The debate at XLoD around the value of different comms channels for particular purposes was fascinating. But the debate around voice was the most intriguing. And with collaboration tools now generating huge volumes of very high-quality audio, the debates are only going to get more important.


To hear more about voice surveillance, have a look at our upcoming Deep Dive, Video & Voice Surveillance.