
Preventing AI Failures in Military Operations

August 25, 2025

Generative AI is advancing fast, and when it comes to military applications, there’s no room for error. Recently, the White House rolled out an executive order aimed at setting up an AI evaluation ecosystem to make sure these new tools, including those used by the Space Force, work safely and effectively. This isn’t just about keeping up; it’s about staying ahead of global competitors such as China, which is busy refining its own benchmarks.

If you’ve ever struggled with new tech that just doesn’t behave as expected, you might appreciate Daniel Levinson’s take on the issue. A seasoned tech entrepreneur, Levinson suggests that we adapt quality control methods from the commercial world to our own military processes. Think of it like the rigorous checks car manufacturers put in place to ensure every vehicle runs safely. In a similar vein, generative AI needs two levels of scrutiny: one to gauge its performance and another to ensure that its users are ready for the job.

Commercial providers have been busy embedding safety and quality controls into their large language models. The Department of Defense is following suit by teaming up with companies like ScaleAI for benchmarking purposes. However, the reality on the ground is that operators need these advancements at a tactical level—today, not sometime in the future. Without consistent checks, even a powerful AI system can slip into unreliability, much like a poorly maintained vehicle can suddenly veer off track.

Levinson reminds us that regular evaluation is key. Without constant feedback, flawed intelligence could lead to serious missteps. With more than twenty years in the field, he argues that the Space Force, much like any modern operation, can really benefit from practices honed in the commercial sector. Yes, the military’s process for approving and deploying AI is a bit more complicated, but that doesn’t mean there aren’t practical ways to introduce continuous benchmarking into everyday operations.

This isn’t just about following best practices—it’s about avoiding potentially critical errors. Although the White House has set the tone at the strategic level, Levinson makes it clear that the same focus on quality assurance needs to be mirrored at the tactical level. Even something as simple as prompt engineering can help refine AI outputs, much like a quick tune-up can keep a car running smoothly.
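To make the prompt engineering point concrete, here is a minimal illustrative sketch, not an official workflow: a hypothetical template that adds explicit constraints (cite sources, admit uncertainty, stay concise) before an operator's question reaches the model. The generate() hook is a placeholder for whatever approved model interface a unit actually uses.

```python
# Illustrative sketch only: a hypothetical constrained prompt template.
# The generate() call is a placeholder, not a real DoD or vendor API.

REFINED_TEMPLATE = """You are assisting with an operational planning task.
Answer the question below. Follow these rules:
- Cite the source document for every factual claim.
- If the answer is uncertain, say so explicitly instead of guessing.
- Keep the response under 150 words.

Question: {question}
"""

def build_prompt(question: str) -> str:
    """Wrap a raw operator question in the constrained template."""
    return REFINED_TEMPLATE.format(question=question)

def generate(prompt: str) -> str:
    """Placeholder for the unit's actual, approved model interface."""
    raise NotImplementedError("Swap in the approved model call here.")

if __name__ == "__main__":
    print(build_prompt("What is the resupply window for Site B?"))
```

Even a small amount of structure like this tends to make outputs easier to check, which is the whole point of the tune-up analogy.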

With reliance on advanced AI systems growing, robust benchmarking is essential for ensuring accuracy, especially as models evolve. Operators on the front lines might not always realise the need for persistent evaluations, so building these checkpoints into daily operations is crucial. Levinson advocates for solutions that are both efficient and cost-effective, with an emphasis on nurturing in-house expertise rather than relying solely on external resources.
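One way such a checkpoint might look in practice is sketched below. This is an assumption-laden example, not a prescribed method: a tiny gold-standard question set, a placeholder model_answer() hook, and an accuracy floor, with each run appended to a local log so drift is visible over time. The file names, threshold, and questions are all illustrative.

```python
# Minimal sketch of a recurring benchmarking checkpoint, assuming a
# small curated question set with known answers and a model_answer()
# hook supplied by the unit. Thresholds and paths are illustrative.

import json
from datetime import datetime, timezone

BENCHMARK = [  # tiny illustrative gold set; a real set would be larger
    {"question": "2 + 2", "expected": "4"},
    {"question": "Capital of France", "expected": "Paris"},
]
ACCURACY_FLOOR = 0.9  # example threshold for flagging drift

def model_answer(question: str) -> str:
    """Placeholder for the approved model interface."""
    raise NotImplementedError

def run_checkpoint(log_path: str = "benchmark_log.jsonl") -> float:
    """Score the model on the gold set and append the result to a log."""
    correct = 0
    for item in BENCHMARK:
        if model_answer(item["question"]).strip() == item["expected"]:
            correct += 1
    accuracy = correct / len(BENCHMARK)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "accuracy": accuracy,
        "passed": accuracy >= ACCURACY_FLOOR,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return accuracy
```

The value is less in any single score than in the running log: a cheap, repeatable check that operators can fold into daily routine without outside help.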

As a practical step, Levinson envisions a role for a ‘quality assurance sentinel’—someone dedicated to monitoring AI outputs. This role involves setting baseline frameworks, maintaining up-to-date evaluation records, and running regular tests to keep systems reliable. Over time, as our technologies improve, much of this oversight might even be automated. But for now, human judgement remains essential to ensure that AI-driven systems support military operations safely and effectively.
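A rough sketch of how that sentinel role could be supported in software follows. The record structure, tolerance value, and class name are assumptions for illustration, not an established framework: the idea is simply to hold a baseline accuracy from when the system was approved, append each evaluation to a durable record, and flag any run that falls too far below the baseline for human review.

```python
# Hedged sketch of tooling a 'quality assurance sentinel' might use to
# keep evaluation records and flag regressions against a baseline.
# Field names and the tolerance value are assumptions for illustration.

import json
from datetime import datetime, timezone

class QASentinel:
    def __init__(self, baseline_accuracy: float, tolerance: float = 0.05,
                 record_path: str = "qa_records.jsonl"):
        self.baseline = baseline_accuracy   # accuracy at system approval
        self.tolerance = tolerance          # allowed drop before alerting
        self.record_path = record_path

    def log_evaluation(self, accuracy: float, notes: str = "") -> bool:
        """Append an evaluation record; return True if it regressed."""
        regressed = accuracy < self.baseline - self.tolerance
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "accuracy": accuracy,
            "baseline": self.baseline,
            "regressed": regressed,
            "notes": notes,
        }
        with open(self.record_path, "a") as f:
            f.write(json.dumps(record) + "\n")
        return regressed

# Example: log a weekly test run and escalate if it drifts.
# sentinel = QASentinel(baseline_accuracy=0.92)
# if sentinel.log_evaluation(0.84, notes="weekly checkpoint"):
#     print("Regression flagged: escalate for human review.")
```

Automation could eventually take over much of this bookkeeping, but the decision to escalate a flagged regression is exactly where human judgement stays in the loop.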
