Trusted Input

Published

AI agents are able to do good things and bad things. Preventing bad things is difficult. The universe of bad things grows in proportion to the access and capabilities LLMs have.

In pre-AI computing, trusted input results in trusted output. Not error free (trusted input can contain bugs), but deterministic in a CPU-instructions-do-what-it’s-told-to sort of way.

In AI computing, we have untrusted input which leads to untrusted output. Worse, trusted input that goes through AI is also untrusted output (AI can write code that contains malware).

Reconciling the trusted input -> untrusted output paradigm that AI agents introduced is a fundamental constraint on the usefulness of artificial intelligence. Always have a human in the loop is safest, but that defeats the point of agents doing things for you. Code sandboxing can limit certain kinds of damage but can’t prevent an AI agent from draining your bank account. Fine-grained permissions would be too onerous and it will be easier to not to it. Adding an LLM-as-judge moves the same problem someplace else in the chain. Training LLMs to prevent prompt injection seems unattainable.

What is the solution?