Deepfake voice cloning technology is an emerging risk to organizations and represents an evolution in the convergence of artificial intelligence (AI) threats. When leveraged in conjunction with other AI technologies, such as deepfake video, text-based large language models (LLMs such as GPT), and generative art, the potential impact increases. Voice cloning technology is currently being abused by threat actors in the wild. It has been shown to be capable of defeating voice-based multi-factor authentication (MFA), enabling the spread of misinformation and disinformation, and increasing the effectiveness of social engineering. We are continuously monitoring the emergence of deepfake technologies and their use in cybercrime, as detailed in our April 29, 2021, report “The Business of Fraud: Deepfakes, Fraud’s Next Frontier”.
As outlined in our January 26, 2023, report “I, Chatbot”, open-source and “freemium” AI platforms lower the barrier to entry for low-skilled and inexperienced threat actors seeking to break into cybercrime. These platforms’ ease of use and “out-of-the-box” functionality enable threat actors to streamline and automate cybercriminal tasks they would otherwise be ill-equipped to carry out. Many of the voice cloning platforms referenced in this report are free to use with a registered account, lowering the financial barrier to entry for threat actors. For those that are not, premium prices are negligible, rarely exceeding $5 per month.
Voice cloning samples that surface on social media, messaging platforms, and dark web sources often leverage the voices of public figures, such as celebrities, politicians, and internet personalities (“influencers”), and are typically created as either comedic or malicious content. This content, which is often racist, discriminatory, or violent in nature, enables the spread of disinformation, as social media users are sometimes deceived by the high quality of the voice cloning samples. Such “proof-of-concept” (POC) work shared by threat actors has inspired a trend on dark web and special-access sources, where threat actors speculate about the emergence of voice cloning as an attack vector. These conversations often reference executive impersonation, callback scams, voice phishing (“vishing”), and other attacks that rely on the human voice.
One of the most popular voice cloning platforms on the market is ElevenLabs’s Prime Voice AI (elevenlabs[.]io), a browser-based text-to-speech (TTS) platform that allows users to upload “custom” voice samples for a premium fee. While a number of voice cloning platforms are referenced in this report (such as MetaVoice, Speechify, and others), ElevenLabs is one of the most accessible, popular, and well-documented, and thus serves as the case study for this research.
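To illustrate how low the technical barrier is, the sketch below shows a minimal text-to-speech request made in Python against a cloud TTS API of this kind. The endpoint path, header name, and payload fields follow ElevenLabs’s publicly documented REST API at the time of writing, but they should be treated as an illustrative assumption rather than a verified interface; the API key and voice ID are placeholders.

```python
import requests

# Hypothetical sketch of a typical cloud TTS request. The endpoint, header,
# and payload below mirror ElevenLabs's publicly documented REST API as of
# this writing; treat them as illustrative assumptions, not a verified spec.

API_KEY = "YOUR_API_KEY"        # account-scoped key issued on registration
VOICE_ID = "example-voice-id"   # placeholder; a cloned voice receives its own ID

url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
headers = {
    "xi-api-key": API_KEY,
    "Content-Type": "application/json",
}
payload = {
    "text": "This is a synthesized sample.",  # text rendered in the cloned voice
}

response = requests.post(url, headers=headers, json=payload, timeout=30)
response.raise_for_status()

# The documented interface returns audio bytes (MP3 by default)
with open("sample.mp3", "wb") as f:
    f.write(response.content)
```

Per the platform’s documentation, cloning a “custom” voice is a similarly small step: uploading one or more short audio samples returns a voice ID that can be substituted into the request above, underscoring how little expertise the workflow demands.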