The Hidden Risks of AI Systems: What Every Business Leader Needs to Know

AI systems carry significant security risks that most businesses are not yet properly accounting for. The six principal threats are: hidden instruction attacks, corrupted memory, compromised tool connections, AI-to-AI manipulation, privilege overreach, and poisoned training data. Each exploits the same fundamental weakness: AI models cannot reliably tell the difference between legitimate instructions and malicious ones hidden inside content they read. For any business deploying AI — whether a simple customer service chatbot or a more autonomous system acting on your behalf — understanding these risks is a prerequisite for safe deployment.

This post breaks down each risk in plain terms, drawing on current research from OWASP, IBM, Datadog, arXiv, and others, and sets out the practical controls that matter most.

The Core Problem: AI Can't Tell a Genuine Instruction from a Fake One

Before looking at specific risks, it helps to understand why AI systems are vulnerable in a way that traditional software isn't.

Conventional software has clear rules about what it will and won't do. A payroll system processes numbers; it doesn't suddenly start sending emails because someone typed the word "send" in a salary field. Instructions and data are kept completely separate.

AI models don't work that way. They're designed to read text and follow instructions expressed in natural language. The problem is that they can't reliably distinguish between instructions from the people who built them, requests from legitimate users, and hidden instructions planted inside documents or emails they've been asked to read. To the model, it's all just words — and all of it can influence what the model does.

This isn't a flaw that will be fixed in the next software update. It's a fundamental characteristic of how these systems work.

The Six Principal Risks

1. Hidden Instruction Attacks

This is the most well-documented AI security risk, and it's ranked number one on the global security industry's official list of AI threats (OWASP's 2025 Gen AI Security Top 10).

The concept is straightforward. An attacker hides instructions inside content that an AI system will read — a document, a webpage, an email, a report — and those instructions cause the AI to do something it shouldn't. Because the model can't reliably separate content from commands, it follows the hidden instructions just as it would follow legitimate ones.

There are two forms. The first is direct: someone types a manipulative request into a chatbot themselves, trying to get it to behave in ways its designers didn't intend. Most people have seen examples of this — it's sometimes called "jailbreaking."

The second form is far more dangerous in a business context. Here, the attacker doesn't interact with your AI system at all. Instead, they plant hidden instructions somewhere the AI will eventually read — inside a document it summarises, a webpage it visits, a supplier email it processes. When the AI reads that content, it picks up the hidden instructions and acts on them. The attacker never needs to touch your system directly.

For AI tools that have been given the ability to take actions — sending emails, accessing files, connecting to other systems — this can have real consequences. Researchers have demonstrated cases where AI systems have been manipulated into sending emails on a user's behalf, leaking confidential data, or accessing files they shouldn't, all triggered by hidden instructions in content the AI read as part of its normal work.

What this means in practice: Any AI system that reads external content — customer emails, supplier documents, web pages, reports — and has the ability to act on what it reads is exposed to this risk. Limiting what an AI system is allowed to do is a more reliable defence than trying to filter out every malicious instruction before the AI sees it.

2. Corrupted Memory

AI systems don't have permanent memory in the way humans do. Instead, they work with a running record of the current conversation or task — everything said, every document read, every result returned — and they use this as context for their next response or decision.

The risk here is that this running record can be corrupted. In a long session, or in an AI system that reads a lot of content, an attacker can gradually introduce false information, misleading premises, or hidden instructions into that record. The AI starts making decisions based on a version of reality that has been quietly tampered with. The problem isn't always obvious — the AI may continue to appear to function normally while its reasoning is increasingly built on corrupted foundations.

A more subtle version of this involves slow, incremental manipulation. An attacker might introduce small pieces of misleading information across many interactions, none of which looks suspicious in isolation. Over time, the cumulative effect is an AI that has been steered significantly off course — but tracing the damage back to its source is extremely difficult.

This risk is particularly relevant for AI tools used in software development and internal research, where the AI is continuously reading large volumes of files, documents, and previous outputs.

What this means in practice: AI systems that run for extended periods, reading lots of content along the way, are more exposed to this risk. Building in regular "fresh start" points — where the AI begins a new session without carrying forward potentially contaminated context — is a sensible precaution.

3. Compromised Tool Connections

Modern AI systems don't just answer questions. Increasingly they are connected to external tools and services — internal databases, email platforms, file systems, customer records, third-party suppliers. A relatively new industry standard called Model Context Protocol (MCP) is emerging as the common way to make these connections.

The risk is straightforward: if one of these connections is tampered with or compromised, it can be used to manipulate the AI system's behaviour. Instead of getting accurate information back from a connected tool, the AI might receive fabricated data or hidden instructions that cause it to take actions it shouldn't — using credentials it has legitimately been given, making the attack harder to detect.

More concerning, when an AI system is connected to several external tools simultaneously, a compromised connection to one of them can potentially interfere with the others, spreading the damage further.

Security research from Datadog and StackHawk highlights that the risks here include the AI being fed false information through its tool connections, sensitive credentials being exposed through seemingly routine interactions, and unauthorised actions being taken using the AI's legitimate access.

What this means in practice: Every external connection given to an AI system is a potential entry point for attack. These connections should be treated with the same caution as any third-party system access — verified before use, monitored during use, and restricted to what is genuinely necessary.

4. AI-to-AI Manipulation

Many organisations are moving beyond single AI assistants towards networks of AI systems working together — one AI researching, another drafting, another reviewing, all coordinating on a shared task. These collaborative architectures are powerful, but they introduce a new category of risk.

When AI systems pass work to one another, each tends to treat the other's output as trustworthy. If an attacker can influence what one AI in the network produces — by hiding instructions in content it reads, for example — the corrupted output gets passed along to the next AI in the chain, which processes it as if it were reliable. The contamination spreads through the network.

This is particularly insidious in workflows where one AI is supposed to check or validate the work of another, because that checking step creates a false sense of security. If the first AI's output has been corrupted, the reviewing AI may simply confirm and pass on the corrupted result.

Research published on arXiv and by Moonlight documents attacks where this chain effect has been used to cause AI networks to leak confidential information or take harmful actions that no single AI in the network was explicitly instructed to take.

What this means in practice: As AI systems are given more responsibility and asked to work together, the question of which AI is allowed to instruct which other AI becomes a genuine governance question — not just a technical one. Human checkpoints for important decisions remain essential however sophisticated the AI network becomes.

5. Privilege Overreach

This risk is about AI systems being manipulated into doing things that go beyond what they're supposed to be allowed to do.

The clearest way to understand it is through an analogy. Imagine a new member of staff who has been given access to certain systems to do their job. A bad actor sends them a convincingly written email that appears to be from a senior manager, asking them to transfer funds or share sensitive files. The employee, unable to verify the instruction is genuine, follows it — using access they legitimately have, for a purpose they were never supposed to use it for.

AI systems face exactly this problem. They're given access to various tools and systems to do their job. They can't reliably tell the difference between legitimate instructions from the people they work for and malicious instructions hidden in content they read. So an attacker who can plant hidden instructions in something the AI reads can effectively direct it to misuse whatever access it has been granted.

The greater the access an AI system has, the greater the potential damage if it is manipulated in this way.

What this means in practice: AI systems should be given the minimum level of access necessary to do their job — no more. Significant actions, such as sending communications, accessing confidential records, or making changes that can't easily be undone, should require explicit human authorisation rather than being left to the AI's discretion.

6. Poisoned Training Data

All of the risks above happen after an AI system has been built and deployed. This final risk operates differently: it targets the AI before it ever reaches your business.

AI systems learn their behaviour from the data they are trained on. If an attacker can introduce corrupted data into that training process — even a small amount — they can embed hidden behaviours into the model itself. These behaviours might lie dormant for months, only activating under specific circumstances. Because the problem is baked into the model's learned behaviour rather than sitting in its inputs, conventional security measures applied after deployment won't catch it.

The threat takes several forms: corrupting the original training datasets, embedding hidden triggers that cause specific misbehaviour when activated, or contaminating the reference databases that some AI systems draw on to answer questions. What they have in common is that the damage survives everything that happens after training — updates, monitoring, filters — because it's part of what the model has learned.

For organisations that have customised an AI model on their own data, or that rely on AI systems drawing on internal knowledge bases, this risk deserves particular attention.

What this means in practice: Where AI models are trained or customised using your organisation's data, the integrity of that data is a security concern, not just a quality concern. For AI systems that draw on internal knowledge bases to answer questions, keeping those knowledge bases accurate and tamper-resistant is as important as securing the AI itself.

How These Risks Connect

These six risks don't sit in isolation — they reinforce each other in ways that make the overall picture more serious than any single risk alone.

A practical way to think about them is by where in the AI system they strike:

Poisoned training corrupts the AI before it is even deployed
Corrupted memory distorts the AI's understanding during a session
Hidden instruction attacks corrupt what the AI is being asked to do
Compromised tool connections corrupt the information the AI receives and acts on
AI-to-AI manipulation spreads that corruption across a network of systems
Privilege overreach determines how much real-world damage any of the above can actually cause

The more an AI system is allowed to do autonomously — reading external content, connecting to other systems, acting without human review — the more these risks compound. A simple chatbot that only answers questions is exposed to a small subset. A fully autonomous AI agent that browses the web, reads emails, connects to databases, and takes actions on your behalf is exposed to all of them simultaneously.

What Good Defences Look Like

No single measure eliminates these risks. The strongest organisations address them in layers.

Limit what AI systems are allowed to do. The most reliable defence against several of these risks is to constrain the AI's reach from the outset. An AI that can only read is far less dangerous if compromised than one that can read, write, send, and delete.

Require human approval for significant actions. Any action that is difficult to reverse — sending a communication, accessing sensitive records, making a financial transaction, changing a system configuration — should require a human to confirm, not just the AI.

Treat external content as untrusted. Anything an AI reads from outside the organisation — documents, emails, web pages, supplier data — should be treated as potentially hostile. This should inform how systems are designed, not just what filters are applied after the fact.

Monitor what AI systems actually do. Logs of every action an AI takes — what it read, what it accessed, what it sent or changed — are essential. Unusual patterns in behaviour are often the first sign that something has gone wrong.

Control what your AI connects to. Every external system an AI is connected to should be verified, limited in scope, and monitored. Connections that aren't strictly necessary shouldn't exist.

Take training data seriously as a security concern. Where AI is trained or customised on internal data, that data deserves the same protection as any other sensitive business asset.

What This Means for Boards and Senior Leaders

The message here isn't that AI is too dangerous to use — it clearly isn't, and the competitive and operational case for adoption remains strong. The message is that deploying AI responsibly requires the same rigour that boards apply to any significant operational risk.

Two questions are worth asking of any AI deployment in your organisation:

What can this AI read? Anything it reads could potentially be used to manipulate it. The broader its access to external content, the greater the exposure.

What can this AI do? The scope of its permitted actions determines the potential impact of any successful attack. Greater autonomy demands greater oversight.

The AI landscape is evolving quickly, and the risks will grow alongside the capabilities. Building good governance habits now — around access, oversight, and accountability — is considerably easier than retrofitting them after an incident.

References

[1] OWASP Gen AI Security Top 10 — LLM01:2025 Prompt Injection OWASP Foundation, 2025 Used in: Hidden Instruction Attacks — the classification of prompt injection as the top-ranked AI security risk, and the definition of direct vs. indirect injection variants. https://owasp.org/www-project-top-10-for-large-language-model-applications/

[2] What Is Context Window Poisoning? Feluda, 2025 Used in: Corrupted Memory — the mechanism by which attacker-supplied content accumulates in the AI's working context, corrupting its reasoning and downstream behaviour. https://feluda.io/blog/context-window-poisoning

[3] Understanding MCP Security: Common Risks to Watch For Datadog, 2025 Used in: Compromised Tool Connections — the identification of false data injection via tool outputs, exposed credentials, and unauthorised actions as primary risks in AI tool integrations. https://www.datadoghq.com/blog/mcp-security/

[4] Threats in LLM-Powered AI Agent Workflows Moonlight, 2025 Used in: AI-to-AI Manipulation — the multi-agent threat model, including adversarial embedding of malicious instructions in shared environments and inter-agent communication channels. https://www.themoonlight.io/blog/threats-in-llm-powered-ai-agent-workflows

[5] How Prompt Injection Attacks Work IBM Security, 2025 Used in: Hidden Instruction Attacks and Privilege Overreach — IBM's framing of hidden instruction attacks as a mechanism for turning AI into a tool for data theft and misuse; and the escalation risk when AI gains access to connected systems. https://www.ibm.com/topics/prompt-injection

[6] Section 32.3.1: Gradual Context Poisoning Attila Racz-Akacosi, AIQ (aiq.hu), 2025 Used in: Corrupted Memory — the gradual poisoning variant, in which incremental introduction of false assumptions across multiple interactions produces a significantly compromised AI state that is difficult to recover without a full reset. https://aiq.hu/en/

[7] MCP Security: Navigating LLM and AI-Agent Integration StackHawk, 2025 Used in: Compromised Tool Connections — the risk of one compromised tool connection interfering with others, and the potential for damage to spread across an AI system's connected tools. https://www.stackhawk.com/blog/mcp-security/

[8] From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows Mohamed Amine Ferrag et al., arXiv:2506.23260, 2025 Used in: AI-to-AI Manipulation — academic framing of multi-agent vulnerability, including how corruption spreads across AI boundaries to cause chain reactions of harmful actions. https://arxiv.org/abs/2506.23260

[9] Securing LLM Systems Against Prompt Injection NVIDIA AI Red Team, NVIDIA Technical Blog, 2024 Used in: Hidden Instruction Attacks — NVIDIA's red team research demonstrating real-world exploitability of hidden instruction vulnerabilities, including proof-of-concept examples showing unauthorised access triggered via planted instructions in content. https://developer.nvidia.com/blog/securing-llm-systems-against-prompt-injection/

[10] Context Window Poisoning in AI Coding Assistants Knostic, 2025 Used in: Corrupted Memory — the specific risk to AI development tools that continuously read large volumes of files and documentation, and how hidden instructions in that content can influence AI behaviour throughout a working session. https://www.knostic.ai/blog/context-window-poisoning-coding-assistants