The State of Voice and Video Surveillance in Financial Services

Within the United States, people are captured on camera an average of 238 times per day, and that number continues to rise year over year. Homeowner doorbells, street-view surveillance, red light cameras, cameras in banks and corporate offices, laptop cameras, and other technology capture countless hours of footage daily. This is hardly surprising given that video surveillance provides an affordable risk mitigation and security strategy. Despite its utility as a tool to increase protection for organizations and individuals, our conversations with compliance officers suggest it has not always been a priority for most financial firms. However, regulatory bodies are issuing increasing fines to encourage financial institutions to take all types of communication monitoring seriously, including video and voice.

Video surveillance: Missing the bigger picture

In conversation with our clients, we have learned that large firms appear to treat capturing audio and its transcription as an established best practice for monitoring. What is surprising is that most communications monitoring solutions capture the transcript from a video call but not the video itself, because video is seen as an overlay to the text and content rather than a valuable source of information. This is a lost opportunity.

Facial expressions, notes held up to the screen but not verbalized, and numerous other contextually rich data are not being captured. Yet these pieces often hold valuable information about potential risk. You often have to read between the lines to know what someone is really saying, and these images can help.

From a technological perspective, state-of-the-art video surveillance features now include object tracking: that is, the camera zeroes in for a close-up of anything moving in front of the lens. Resolution continues to improve, offering a “record setting” level of detail with each new update. Behind the video, the analytical capabilities have advanced to enable rapid scanning for patterns, even in real time. Techniques such as dimensionality reduction simplify analysis by limiting the number of variables (features) that need to be computed for each frame. Clustering similar data also reduces complexity, so that groups of data can be processed more rapidly and readily than individual data points.
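As a rough illustration of the two techniques just mentioned, the sketch below reduces a handful of per-frame feature vectors to their two highest-variance columns and then groups the frames with a plain k-means loop. The feature columns and values are invented for the example, and variance-based column selection is only one simple form of dimensionality reduction; production video analytics would more likely rely on learned embeddings and a dedicated library.

```python
from statistics import pvariance

# Hypothetical per-frame features: [motion, brightness, timestamp_overlay, face_size].
# The values are made up purely for illustration.
frames = [
    [0.1, 0.2, 0.5, 0.50],
    [0.9, 0.8, 0.5, 0.51],
    [0.2, 0.1, 0.5, 0.50],
    [0.8, 0.9, 0.5, 0.49],
]

def top_variance_columns(rows, keep):
    """Return the indices of the `keep` columns with the highest variance."""
    n_cols = len(rows[0])
    variances = [pvariance([row[c] for row in rows]) for c in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda c: variances[c], reverse=True)
    return sorted(ranked[:keep])

def project(rows, cols):
    """Keep only the selected columns of every row."""
    return [[row[c] for c in cols] for row in rows]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids so runs are repeatable."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            labels[i] = min(range(k), key=lambda j: squared_distance(point, centroids[j]))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

kept_columns = top_variance_columns(frames, keep=2)
reduced = project(frames, kept_columns)
labels = kmeans(reduced, k=2)
print(kept_columns, labels)
```

Here the near-constant columns are dropped before clustering, so each distance calculation touches two values per frame instead of four, and downstream review can work with two groups of frames rather than four individual ones.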

So, if you know what to look for, you can find it. But that’s part of the problem. If the non-compliant behavior is not defined in excruciating detail, even an AI-informed system is going to miss it because it’s searching for a particular thing versus “things in general.”

Consider emojis: emoji surveillance is currently under consideration by many firms as the next frontier. The upside-down smiley emoji suggests sarcasm, irony, or passive-aggressive behavior, but it can also mean the opposite of whatever a person is saying. If the monitoring rules do not spell out that text captured alongside an upside-down smiley emoji may mean its opposite, compliance teams may be missing an opportunity to flag suspicious behavior that warrants a closer look.
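To make the idea of an explicit emoji rule concrete, here is a minimal sketch of a lookup that flags any message containing the upside-down smiley for human review. The rule table, the reason text, and the function name are all hypothetical; a real lexicon would be far larger and tuned to each firm's policies.

```python
# U+1F643 is the Unicode "upside-down face" emoji.
UPSIDE_DOWN_FACE = "\U0001F643"

# Hypothetical rule table: emoji -> reason a reviewer should take a closer look.
EMOJI_RULES = {
    UPSIDE_DOWN_FACE: "possible sarcasm; text may mean the opposite of what it says",
}

def review_flags(message):
    """Return the review reasons triggered by emojis found in the message."""
    return [reason for emoji, reason in EMOJI_RULES.items() if emoji in message]

flags = review_flags("Sure, I would never share that report early " + UPSIDE_DOWN_FACE)
print(flags)
```

Without an entry like this, the message above would pass a text-only check untouched, since the words alone read as compliant.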

Voice surveillance: Can you hear me now?

With the advances made in voice-to-text capture, voice surveillance seems straightforward, but only for a native English speaker with perfect elocution. This highlights a significant problem with current options: most voice tech solutions are designed for surveilling English only. That creates a back-door opportunity for bad actors to exploit, leaving firms open to more risky behavior.

Similar to “change of venue” risk in written communication, when voice data is captured, surveillance teams may need to monitor language hopping. The system must detect which language or dialect is used for which text string, identify when it switches, stitch the disparate text strings together, interpret the meaning of the combined text string, and infer any semblance of non-compliance. This is non-trivial, and few vendors have this advanced capability.
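The first three steps of that pipeline (tag each string with a language, find the switch points, stitch the strings back together) can be sketched as follows. The tiny word lists stand in for a trained language-identification model and are purely hypothetical, and the interpretation and non-compliance inference steps are deliberately left out.

```python
from itertools import groupby

# Hypothetical word lists; a real system would use a language-ID model instead.
WORDLISTS = {
    "en": {"buy", "the", "stock", "before"},
    "es": {"antes", "del", "anuncio", "compra"},
}

def guess_language(word):
    """Return the first language whose word list contains the word."""
    for lang, words in WORDLISTS.items():
        if word.lower() in words:
            return lang
    return "unknown"

def language_runs(transcript):
    """Split a transcript into consecutive (language, text) runs."""
    tagged = [(guess_language(word), word) for word in transcript.split()]
    return [
        (lang, " ".join(word for _, word in group))
        for lang, group in groupby(tagged, key=lambda pair: pair[0])
    ]

def switch_count(runs):
    """Number of times the speaker changed language."""
    return max(len(runs) - 1, 0)

def stitch(runs):
    """Rejoin the runs into one string for downstream interpretation."""
    return " ".join(text for _, text in runs)

runs = language_runs("buy the stock antes del anuncio")
print(runs, switch_count(runs))
```

Keeping the language tag on each run means a later translation or interpretation step can handle every run in its own language before judging the combined meaning.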

Background noise further complicates audio surveillance. In fact, some multi-factor authentication methods use it deliberately to obfuscate interpretation by bots, distorting the spoken verification code with wind, waves, water drops, and other sounds that make it difficult to interpret.

Voice-to-text now approaches 98% accuracy consistently across software packages. The creation of a voice lexicon, similar to the e-Comms lexicon that has already been generated, could have a dramatic impact on error rates. Even an accuracy increase as small as 0.1%, applied to billions of hours of recorded content, offers a significant improvement.
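A back-of-the-envelope calculation shows why such a small gain matters at scale. Every input below is an assumption made for illustration: one billion recorded hours, an average speaking rate of 150 words per minute, accuracy measured per word, and the 0.1% gain read as 0.1 percentage point (98.0% to 98.1%).

```python
# All inputs are illustrative assumptions, not measured figures.
HOURS_RECORDED = 1_000_000_000   # "billions of hours": take one billion
WORDS_PER_MINUTE = 150           # assumed average speaking rate
BASELINE_BP = 9_800              # 98.0% word accuracy, in basis points
IMPROVED_BP = 9_810              # 98.1% word accuracy (+0.1 percentage point)

def misrecognized_words(total_words, accuracy_bp):
    """Words transcribed incorrectly at a given accuracy (integer math)."""
    return total_words * (10_000 - accuracy_bp) // 10_000

total_words = HOURS_RECORDED * 60 * WORDS_PER_MINUTE
errors_before = misrecognized_words(total_words, BASELINE_BP)
errors_after = misrecognized_words(total_words, IMPROVED_BP)
words_fixed = errors_before - errors_after

print(f"{words_fixed:,} fewer misrecognized words")
# prints "9,000,000,000 fewer misrecognized words"
```

Under these assumptions, the 0.1 point gain removes about 5% of all transcription errors, which is roughly nine billion words that a lexicon or keyword search would otherwise miss.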

Voice recognition has also advanced considerably. Numerous financial institutions now use the combination of a recognized cellular phone number and voice pitch plus tone to verify customer identity. On the flip side, generative AI voice tools have now become so sophisticated that they can closely emulate a user’s voice. Even more unsettling, the AI-generated voice can be instructed to say anything, taking deepfakes to the next level.

The interesting revelation with voice surveillance is that monitoring is revealing not insider trading or other forms of market abuse, but rather conduct violations. Based on analysis we have conducted on behalf of our clients, personal conduct issues account for 99.9% of the cases flagged in voice transcription.

Reassessment of Video and Voice Surveillance Policies

There is a bit of a “chicken and egg” problem with video and voice surveillance. Financial firms are understandably reluctant to bear the cost and complexity of proactively surveilling these media. Others argue that the onus is on the regulatory agencies to provide clear direction in the form of policies stipulating exactly what should and should not be captured, along with direction on the boundaries of capture. Specifically, should personal communication on a work device be monitored? What about on a personal device under a Bring Your Own Device (BYOD) policy?

Like the RTO (Return to Office) movement and its implications for retention and facility costs, BYOD presents another fiscal challenge. To better manage surveillance, all personnel should use work-issued devices only. These can be provisioned with standardized software subscriptions to enable consistent monitoring. Moreover, using work-issued devices underscores clearly defined, policy-compliant practices and readily enables enforcement. The downside is the added cost of all those mobile devices and annual plans. Deutsche Bank is predicting a Phase 2 of the SEC’s war on e-comms, one that will include hefty voice surveillance fines similar to the billions in fines levied over WhatsApp use during Phase 1.

The present and the future

Firms are investing in new technology largely because regulators have made it clear they have to. However, most compliance officers feel that their current tech stack is inadequate. From issues with capturing accurate voice recordings to mountains of data created by monitoring video, finding true risk in this new frontier is a complex problem. As regulatory agencies continue to develop guidance on what firms must cover, communications monitoring platforms like Shield must continue to be at the forefront of capture and assessment to protect compliance teams from risk. There is still a long way to go, but with continued effort, the industry can ensure the integrity of its operations and maintain the trust of its clients.

