It's Turtles All the Way Down: Building an AI to Break Voice AI
SecureCoders introduces Redcaller: an open-source framework that automates penetration testing for voice-enabled AI applications
The phone rang. On the other end was an AI voice agent: polite, helpful, and completely unaware it was about to be tested for security vulnerabilities. This wasn't a prank call. It was a penetration test. And it presented a problem that would eventually lead to the creation of an entirely new category of security tooling.
When SecureCoders was engaged to assess the security of a voice-based AI application, we encountered an interface unlike typical web or mobile targets. There was no text field to inject payloads into, no API endpoint to fuzz, no UI to manipulate. The only way to interact with the application was to speak to it. The system processed audio directly through a Large Audio Model (LAM): voice in, voice out.
The conventional approach would have been simple: pick up the phone, call the agent, run through test cases manually, document the results. But conventional approaches don't scale. Manual voice testing is inherently single-threaded: one tester, one call, one test at a time. Thoroughly assessing the security posture of a voice agent requires hundreds or thousands of test variations across different attack strategies, personas, and payload types. Doing that by hand isn't just tedious; it's impractical.
We needed a way to automate voice-based attacks the same way the security community has automated web application testing for decades. So we built one.
Redcaller: Automating the Unautomatable
Redcaller (redcaller.com) is a penetration testing framework designed specifically for phone-accessible voice agents. It places outbound calls, conducts conversations using AI-driven attack strategies, records the results, and analyzes them. All without human intervention.
The framework enables security teams to define attack scenarios, configure target agents, and launch campaigns that execute multiple concurrent calls. Each conversation is transcribed, evaluated, and fed into an analysis engine that determines whether the attack achieved its objective, be that extracting sensitive information, bypassing authentication, or manipulating the agent into performing unauthorized actions.
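To make that concrete, here is a minimal sketch of what a campaign definition might look like; the dataclasses and field names are illustrative, not Redcaller's actual API:

```python
# Illustrative sketch only: these dataclasses stand in for whatever a
# real campaign definition looks like; none of this is Redcaller's API.
from dataclasses import dataclass


@dataclass
class AttackScenario:
    name: str              # e.g. "pii-extraction"
    objective: str         # what the analysis engine counts as success
    persona: str           # the character the attacking agent plays
    strategies: list[str]  # tactics the attacker may draw on


@dataclass
class Campaign:
    target_number: str     # phone number of the agent under test
    scenarios: list[AttackScenario]
    concurrent_calls: int = 5  # calls to run in parallel


campaign = Campaign(
    target_number="+15550100000",  # fictitious number
    scenarios=[
        AttackScenario(
            name="pii-extraction",
            objective="agent discloses another customer's account details",
            persona="confused elderly caller",
            strategies=["emotional appeal", "claimed authority"],
        )
    ],
)
```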
But automation alone wasn't enough. Voice agents are non-deterministic systems. The same input can produce different outputs depending on context, phrasing, timing, and even tone. Testing them requires adaptability. It requires the ability to learn from failed attempts and adjust tactics accordingly.
That's where things got interesting.
Teaching AI to Attack AI
One of Redcaller's core innovations is what we call the Overseer Agent: an AI system that orchestrates attack campaigns and learns from their outcomes.
The concept might sound recursive—using AI to direct AI to attack AI. It brings to mind the old philosophical joke about turtles all the way down. But there's sound engineering behind the approach.
Through years of building AI-enabled applications, we've observed that language models perform significantly better when given finite, well-defined tasks rather than being asked to handle everything at once. A model prompted to "be a helpful assistant that can do anything" will underperform compared to a model prompted to "analyze this transcript and identify defensive patterns." Specialization matters.
The Overseer Agent embodies this principle. It doesn't conduct the attacks itself. Instead, it analyzes completed conversations, extracts what worked and what didn't, identifies patterns in how the target responds to different approaches, and uses those insights to generate improved attack strategies for subsequent attempts. Each round of testing makes the next round smarter.
The result is an evolving system that refines its techniques based on observed target behavior, not unlike how a human penetration tester adapts their approach after hitting resistance.
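In rough terms, the feedback loop looks like the sketch below. The three helpers are stubs standing in for real call placement and LLM-driven analysis; the structure, not the implementations, is the point:

```python
# Sketch of the Overseer feedback loop. The helpers are stubs standing
# in for real call placement and LLM-driven analysis.
def run_call(target: str, strategy: str) -> str:
    """Place one call using the given strategy; return its transcript."""
    return f"(transcript of '{strategy}' against {target})"


def analyze_transcript(transcript: str) -> str:
    """Extract what worked, what triggered refusals, which defenses appeared."""
    return f"(finding from {transcript})"


def refine_strategies(strategies: list[str], findings: list[str]) -> list[str]:
    """Fold findings back into an improved strategy set for the next round."""
    return [s + " (refined)" for s in strategies]


def overseer_loop(target: str, seed: list[str], rounds: int = 3) -> list[str]:
    strategies = seed
    for _ in range(rounds):
        transcripts = [run_call(target, s) for s in strategies]
        findings = [analyze_transcript(t) for t in transcripts]
        strategies = refine_strategies(strategies, findings)
    return strategies  # each round's strategies informed by the last
```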
The Attack Surface You Can't See in Text
Security researchers have thoroughly documented prompt injection attacks against text-based LLM interfaces. Techniques for manipulating chatbots through carefully crafted inputs are well understood. But voice introduces dimensions that text simply cannot capture.
Consider tone. Research like the StyleBreak paper has demonstrated that prosodic elements, such as pitch, pacing, rhythm, and inflection, can influence how voice AI systems interpret and respond to requests. A soft, gentle request might succeed where an aggressive demand fails, even if the words are identical. Many voice agents are instructed to be "helpful," but few are given specific guidance on how helpful to be or under what circumstances to refuse.
This creates exploitable gaps. The sweet, patient voice of someone's grandmother asking for assistance might bypass safeguards that would trigger on the same request delivered in a demanding tone. It sounds absurd, but it works because the models were trained primarily on normal conversational patterns, not adversarial ones.
Beyond prosody, voice enables attack vectors that have no text equivalent:
Background audio injection. Embedding instructions within ambient noise, betting that the model will process audio the human ear dismisses as environmental sound.
Conversational asides. Coughing, mumbling, trailing off mid-sentence, or appearing to speak to someone else in the room. These natural human behaviors can be weaponized to slip content past filters tuned for direct communication.
Multi-channel attacks. In scenarios where stereo or multi-track audio is accepted, providing different content on different channels: one innocuous, one malicious.
Encoding schemes. Delivering payloads as DTMF touch tones, Morse code, or other audio representations that the target system may decode and process differently than spoken words.
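As a concrete example of the last vector, DTMF audio is easy to generate: each key is the sum of two sine tones at standard row and column frequencies. A minimal sketch (the function and its parameters are our own illustration, not part of any published API):

```python
# Standard DTMF: each key is the sum of one row tone and one column tone.
import numpy as np

DTMF = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
}


def dtmf_tones(digits: str, rate: int = 8000,
               tone_s: float = 0.08, gap_s: float = 0.04) -> np.ndarray:
    """Render a digit string as DTMF audio samples in [-1, 1]."""
    t = np.arange(int(rate * tone_s)) / rate
    gap = np.zeros(int(rate * gap_s))
    out = []
    for d in digits:
        low, high = DTMF[d]
        tone = 0.5 * (np.sin(2 * np.pi * low * t) +
                      np.sin(2 * np.pi * high * t))
        out.extend([tone, gap])
    return np.concatenate(out)


payload_audio = dtmf_tones("4242#")  # ready to splice into a call
```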
One particularly interesting finding: the act of agreeing to decode something can itself lower a model's defenses. If an agent agrees to interpret Morse code or DTMF tones, it has implicitly accepted the premise that processing this content is acceptable. That cognitive commitment can carry over to the decoded payload itself, a kind of conversational foot-in-the-door technique.
When Spoken Words Become Executable Code
Perhaps the most consequential discovery from our testing involves how voice agents handle spoken payloads that would be recognized as attacks in text form.
We expected that speaking an SQL injection payload, saying the words "single quote OR one equals one semicolon dash dash", would result in that exact phrase being passed downstream as text. In many cases, that's what happened. But not always.
Some voice agents, depending on their architecture and the ASR layer's behavior, would convert spoken payloads into their symbolic equivalents. The spoken phrase would become ' OR 1=1; -- when transmitted to backend systems.
This wasn't consistent. It depended on context, phrasing, the specific ASR model, and factors we couldn't always predict. But that inconsistency is precisely why systematic fuzzing matters. The abstraction layer between voice and text creates opportunities for payload transformation that manual testing would likely miss.
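That inconsistency also shapes the tooling: fuzzing spoken payloads means fuzzing phrasings. Here is a minimal sketch, with an illustrative and deliberately incomplete word map, of generating spoken-form variants of a symbolic payload so each can be fed through TTS and checked for whether the ASR layer reconstitutes the symbols:

```python
# Expand a symbolic payload into spoken renderings, one per combination
# of word choices. The mappings are illustrative, not exhaustive.
import itertools
import re

SPOKEN = {
    "'": ["single quote", "apostrophe"],
    "=": ["equals", "equal sign"],
    ";": ["semicolon"],
    "--": ["dash dash", "double dash"],
    "1": ["one"],
}


def spoken_variants(payload: str) -> list[str]:
    """E.g. "' OR 1=1; --" -> "single quote OR one equals one semicolon dash dash"."""
    tokens = re.findall(r"--|\w+|\S", payload)  # symbols and words
    choices = [SPOKEN.get(tok, [tok]) for tok in tokens]
    return [" ".join(combo) for combo in itertools.product(*choices)]


for phrase in spoken_variants("' OR 1=1; --"):
    print(phrase)  # each variant is fed to TTS and spoken to the target
```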
The implications are significant for any voice agent connected to databases, APIs, or other backend systems. Traditional injection attacks like SQL injection, command injection, and template injection may be viable through voice interfaces, delivered through a channel that's often under-monitored compared to web endpoints.
If an attacker can inject a payload that triggers a DNS callback, causes a measurable time delay, or produces a verbose error message, they've established that code execution or data exfiltration may be possible, all through a publicly accessible phone number that probably isn't receiving the same security scrutiny as the company's web application.
The Engineering Challenge: Making Attacks Sound Natural
Building effective voice-based attacks required solving a non-trivial audio engineering problem. Attack payloads often need to be wrapped in natural-sounding conversation. A raw DTMF sequence or a spelled-out injection string sounds suspicious; the same payload preceded by "let me give you my account number" and followed by "did you get that?" blends into normal conversation flow.
This means stitching together multiple audio sources: text-to-speech for the conversational wrapper, raw audio signals for DTMF tones or other encodings, and silence for natural pacing. Getting this right took iteration. Early versions produced audible clicks and pops at the boundaries where different audio sources met.
The solution was implementing a 20-millisecond fade-in at segment transitions to smooth the artifacts. It's a small detail, but it's the difference between audio that sounds obviously synthetic and audio that passes casual inspection.
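A minimal sketch of that fix, assuming segments arrive as float sample arrays in [-1, 1] at a known rate:

```python
# Concatenate audio segments, fading in the head of each segment after
# the first to remove clicks at the joins.
import numpy as np


def stitch(segments: list[np.ndarray], rate: int = 16000,
           fade_ms: float = 20.0) -> np.ndarray:
    fade_len = int(rate * fade_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    ramp = np.linspace(0.0, 1.0, fade_len)
    out = []
    for i, seg in enumerate(segments):
        seg = seg.copy()
        if i > 0 and len(seg) >= fade_len:  # leave the first segment alone
            seg[:fade_len] *= ramp
        out.append(seg)
    return np.concatenate(out)
```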
What We Keep Finding Wrong
Hundreds of test runs against various voice agents have revealed patterns in what organizations consistently get wrong:
Inadequate output monitoring. The output of voice agents, both responses to users and data written to logs or caches, is rarely monitored for anomalies. If the agent starts behaving unexpectedly or leaking information, there's often no system watching for it. Compounding this, agents that consume their own cached outputs can be poisoned over time as malicious content accumulates.
Missing input validation. The security lessons learned painfully by web application developers over the past two decades haven't fully transferred to voice interfaces. Input arriving through a voice agent should be treated as untrusted and validated before being passed to downstream systems. Often, it isn't; a minimal sketch of the principle follows this list.
PII in unexpected places. Voice conversations generate transcripts, logs, and recordings. Ensuring that personally identifiable information is properly handled across all these artifacts requires deliberate effort that's frequently overlooked.
Unbounded helpfulness. Voice agents are often instructed to be helpful without corresponding instructions about what they should refuse to discuss. The same boundaries that apply to human customer service representatives (don't offer guarantees, don't give financial advice, don't discuss internal systems) should apply to AI agents.
No defensive disconnection. When a conversation exhibits clear signs of probing or attack, the appropriate response may be to terminate the call. Few voice agents implement this kind of defensive measure.
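To make the input-validation point concrete, here is the sketch promised above: whitelist-validate the transcribed value, then bind it as a query parameter instead of concatenating it. The accounts schema is made up; the sqlite3 usage is standard:

```python
# Transcribed speech is attacker-controlled input: validate it, then
# bind it as a parameter. The accounts schema here is made up.
import sqlite3


def lookup_account(conn: sqlite3.Connection, spoken_account_id: str):
    # Whitelist validation before the value goes anywhere near a query.
    if not spoken_account_id.isdigit():
        raise ValueError("account id must be numeric")
    # Parameterized query: the ASR output is bound, never concatenated.
    cur = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?",
        (spoken_account_id,),
    )
    return cur.fetchone()
```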
Two Engines, Different Trade-offs
Redcaller supports two distinct voice engine architectures, each suited to different testing scenarios.
The OpenAI Realtime API offers sub-second latency, around 300 to 500 milliseconds, for a round trip. This speed makes conversations feel natural and fluid, which is essential for attacks targeting human judgment rather than automated systems. For vishing simulations, where the goal is to test whether employees can be socially engineered over the phone, realistic conversation pacing is critical. Proactive phishing simulations via email are now commonplace in large organizations; we see voice-based equivalents becoming equally important as attackers diversify their channels.
The traditional ASR-to-LLM-to-TTS pipeline is slower, one to two seconds per turn, but offers more control. Transcripts are available in real time, the conversation flow can be inspected and manipulated at each step, and the system is more predictable for systematic testing. For automated targets where latency doesn't affect the outcome, this pipeline is often preferable.
Most of our testing uses the traditional pipeline. The additional latency is irrelevant when the target is an automated system that will wait patiently for a response. The transparency and control are worth the trade-off.
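For readers less familiar with that pipeline, a single turn reduces to three inspectable stages. The stubs below stand in for whatever ASR, LLM, and TTS stack is in use; the point is that the transcript is available, and modifiable, between every stage:

```python
# One turn of the ASR -> LLM -> TTS pipeline. Each stub stands in for a
# real model call; the transcript is inspectable between every stage.
def transcribe(audio: bytes) -> str:
    return "(transcribed caller speech)"   # ASR stand-in


def generate(history: list[str], user_text: str) -> str:
    return "(model reply)"                 # LLM stand-in


def synthesize(text: str) -> bytes:
    return b"\x00"                         # TTS stand-in


def one_turn(history: list[str], inbound_audio: bytes) -> bytes:
    user_text = transcribe(inbound_audio)
    history.append(f"user: {user_text}")   # inspect or mutate here
    reply = generate(history, user_text)
    history.append(f"agent: {reply}")
    return synthesize(reply)
```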
For QA Teams: A Different Kind of Test Automation
Quality assurance teams responsible for voice applications face a unique challenge. There's no Selenium for phone calls, no Postman for spoken requests. Testing typically means manually calling the application and running through scenarios, which is slow, error-prone, and difficult to scale.
Our guidance for teams in this position: focus human effort on creativity, automate everything else.
Build a regression suite of attack strategies and scenarios that can be executed automatically against each build. As the application is hardened against known attacks, study emerging techniques and expand the test library. The goal is to ensure that your team discovers vulnerabilities before external attackers do, and the only way to achieve that at scale is through automation.
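Concretely, such a suite might look like the following pytest-style sketch; run_attack and the scenario names are hypothetical stand-ins, not Redcaller's API:

```python
# Hypothetical regression harness: every known attack scenario must fail
# against each build. run_attack and the scenario names are stand-ins.
from dataclasses import dataclass

import pytest


@dataclass
class Verdict:
    succeeded: bool


def run_attack(target: str, scenario: str) -> Verdict:
    """Stub: a real version places the call and runs the analyzer."""
    return Verdict(succeeded=False)


SCENARIOS = [
    "sql-injection-spoken",
    "dtmf-encoded-payload",
    "persona-pii-extraction",
]


@pytest.mark.parametrize("scenario", SCENARIOS)
def test_agent_resists(scenario):
    verdict = run_attack(target="+15550100000", scenario=scenario)
    assert not verdict.succeeded, f"{scenario} bypassed the agent's defenses"
```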
Redcaller was built for exactly this use case. But even without access to the framework, the principle holds: invest in tooling that lets you test systematically and repeatedly, freeing your security researchers to focus on discovering novel attack vectors rather than re-running known tests.
Open Source, Responsibly
From the beginning, our intention was to release Redcaller, in whole or in part, as open-source software. The security community has given us countless tools that we use daily; it's only right to contribute back.
However, building a tool that makes automated phone calls at scale requires careful consideration. Telephonic infrastructure is protected by specific laws and regulations designed to prevent abuse. Releasing a tool that could trivially be weaponized for spam, fraud, or harassment would be irresponsible.
We're addressing this by making compliance the default, not the exception. The released version will require explicit confirmation that the user has authorization to test against target systems. Built-in safeguards will enforce responsible use patterns. These protections can be disabled for legitimate testing scenarios, but doing so requires deliberate action. The tool won't accidentally enable abuse.
We're realistic about dual-use concerns. Any powerful security tool can be misused. The same capabilities that enable defensive testing enable offensive attacks. This is true of network scanners, web application fuzzers, and every other category of security tooling. The solution isn't to withhold tools from defenders; it's to ensure that legitimate users can access them while raising the barrier for misuse.
The Road Ahead
Voice AI security is an emerging field, which makes it fertile ground for both defenders and attackers. The proliferation of no-code and low-code platforms for building voice agents means that organizations without deep technical expertise are deploying conversational AI connected to sensitive systems. The easier these tools become to build, the easier they become to build insecurely.
We expect an arms race. Attackers will develop increasingly sophisticated techniques for manipulating voice agents. Defenders will harden their systems in response. Tools like Redcaller will need to evolve continuously to remain effective.
Rather than develop in isolation, we're building a community around the framework. The challenges in this space are too varied and evolving too quickly for any single team to address alone. We need input from practitioners across the industry: security researchers, voice AI developers, QA engineers, and anyone else working at the intersection of AI and telephony.
If you're working in this space and want to be part of shaping how voice AI security testing evolves, we want to hear from you.
Get Involved
- Website: redcaller.com
- Community: Discord
- Demo requests: Sign up through the website
Redcaller is developed by SecureCoders. For legitimate security testing only.

