The Trust Paradox: Why 'Lights-Out' Software Development Is Closer Than You Think.
Have software factories become obsolete, or merely unfashionable? Perhaps they simply need restructuring to deliver better outcomes.
The term "dark factory" comes from manufacturing. Fanuc has been running robot-building factories in Japan with the lights off since the 1980s. No humans inside. No lights needed.
In software, we're approaching something analogous. AI coding agents like Cursor or Claude Code have crossed a threshold in 2026. They are far beyond the autocomplete chatbots of 2024; they autonomously read specs, scaffold services, write tests, open PRs, and respond to review comments. Inference costs have dropped to the point where running a competent programming agent is far cheaper than employing an equally competent software developer. The technical building blocks for a "dark software factory," a pipeline where AI handles the path from requirement to pull request, are real.
But having the building blocks and having a factory that actually produces trustworthy software are two very different things.
Here is the core tension: AI agents are fast, but speed without governance is just faster failure.
We're already seeing this play out. GitClear's 2024 analysis of millions of commits across codebases with high AI-assisted development found a measurable increase in "code churn," meaning code that gets written and then quickly revised or deleted. The pattern suggests that AI-generated code passes initial review but creates downstream maintenance problems that surface weeks later.
This matches what most honest engineering leads will tell you privately. AI-generated code tends to work. It passes the tests. It meets the acceptance criteria on paper. And then six months later, someone has to modify that service, and they discover that the implementation makes assumptions nobody documented, uses patterns inconsistent with the rest of the codebase, or solves the problem in a way that is technically correct but architecturally wrong.
A system that is difficult to maintain doesn’t just add technical debt; it taxes every future change.
For DoD programs, this has real operational consequences.
Defense programs operate under constraints that most commercial teams never face. Software running on weapons platforms, intelligence systems, and logistics networks has to be operationally effective, which is a perpetually evolving objective. It also has to meet NIST 800-53 controls, survive Authority to Operate (ATO) reviews, and maintain an auditable chain of custody from requirement to deployed artifact. When something breaks in production, the consequences are far more serious than lost revenue: degraded mission capability and, potentially, lives at stake.
The DoD also faces a practical reality that makes AI adoption both more urgent and more dangerous. The defense industrial base is competing for the same software talent as Google and Stripe, and losing. Programs routinely operate with engineering teams that are understaffed relative to the scope of what they're building. AI-assisted development is becoming the only way to close the gap between what the mission needs and what the team can deliver.
But "move fast and break things" was never an option in this context, and bolting AI agents onto a workflow that lacks governance doesn't change that.
The Hallucination Problem Is Worse Than You Think
The most dangerous failure mode in AI-assisted development is wrong code that looks right.
When an AI agent writes an implementation, it can also write tests for that implementation. The problem is that both the code and the tests emerge from the same model, trained on the same data, carrying the same blind spots. If the model misunderstands an API's behavior, it will write code that calls the API incorrectly and tests that validate that incorrect usage. Everything passes. The build is green. And the bug doesn't surface until the code hits real data in a real environment.
This is what we call the Verification Gap, and closing it requires a deliberate architectural decision: the model that writes the implementation must not be the model that writes the verification.
In practice, this means something like: one frontier model (Claude Opus 4.7) implements against the spec while another (GPT-5.5) generates property-based tests from the same spec. Critically, Agent A never sees Agent B's test logic, so it can't learn to game the verification.
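To make that concrete, here's a minimal sketch of the separation. The `generate()` helper stands in for whatever model API you call, and the model roles, prompts, and file names are illustrative rather than a prescription:

```python
# dual_agent_pipeline.py -- illustrative sketch of implementation/verification separation.
# generate() is a placeholder for your model provider's API; model names, prompts,
# and file layout are assumptions for the sake of the example.
import pathlib
import subprocess

SPEC = pathlib.Path("specs/pagination.md").read_text()

def generate(model: str, prompt: str) -> str:
    """Placeholder for a call to your model provider."""
    raise NotImplementedError

# Agent A sees only the spec. It never sees the tests it will be judged against.
implementation = generate(
    "implementer-model",
    f"Implement this spec as a Python module named service.py:\n{SPEC}",
)
pathlib.Path("service.py").write_text(implementation)

# Agent B also sees only the spec. It never sees Agent A's implementation,
# so its property-based tests encode the contract, not the code.
tests = generate(
    "verifier-model",
    f"Write pytest + hypothesis property tests for this spec, importing service:\n{SPEC}",
)
pathlib.Path("test_service.py").write_text(tests)

# Only the CI runner brings the two artifacts together.
result = subprocess.run(["pytest", "test_service.py", "-q"])
raise SystemExit(result.returncode)
```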
Google's AlphaCode research documented a version of this problem. AI-generated solutions that passed AI-generated test cases routinely failed against held-out human-written tests. The finding isn't surprising, but it's worth internalizing: self-validation is not validation.
One of the fair criticisms of the structured-AI-development movement is that it can feel like prompt engineering wearing a lab coat. So let's be specific about what differentiates an iterative spec-driven approach from a fancy prompt.
A prompt says "build me an auth service." An executable contract defines the API surface in OpenAPI or Protobuf, specifies the constraints in JSON Schema, and expresses the expected behaviors as BDD-style acceptance criteria. The human defines the interface. The agent implements the logic. And our CI/CD pipeline runs deterministic contract tests against the schema before a human ever looks at the code. If the agent fills in the blanks, it does so within a strictly typed box that won't break downstream services.
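For a sense of what "deterministic contract tests" means in practice, here's a small pytest sketch that checks a response against a JSON Schema. The endpoint, schema, and field names are invented for illustration:

```python
# test_contract_users.py -- illustrative contract test; endpoint and schema are made up.
import jsonschema
import requests

# In a real pipeline the schema lives alongside the OpenAPI spec, not inline.
USER_LIST_SCHEMA = {
    "type": "object",
    "required": ["items", "next_cursor"],
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id", "email"],
                "properties": {
                    "id": {"type": "string"},
                    "email": {"type": "string"},
                },
            },
        },
        "next_cursor": {"type": ["string", "null"]},
    },
}

def test_list_users_matches_contract():
    # The agent's implementation runs in CI; this test only knows the contract.
    response = requests.get("http://localhost:8080/v1/users?limit=50", timeout=5)
    assert response.status_code == 200
    jsonschema.validate(instance=response.json(), schema=USER_LIST_SCHEMA)
```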
This is where most teams stop, and it's where they get the process wrong. An executable contract is not a waterfall requirements document written in a machine-readable format. Software development is a process of learning the problem, not just solving it. You write a contract for a small piece, the agent builds it, you test the result against real user behavior, and what you learn reshapes the next contract. The spec is a living hypothesis that evolves through contact with reality. We've adapted the principles of Behavior Driven Development for the era of agents, but the BDD insight still holds:
the conversation about what to build matters more than the artifact that captures it.
This is also why AI doesn't reduce the need for software engineering. If anything, it raises the bar. When agents handle implementation, what remains is the harder work: defining the right interfaces, writing contracts that are precise enough to constrain an agent but flexible enough to evolve, and verifying that what was built actually serves the mission. The human role in a Dark Factory is closer to the engineer who designs the bridge and checks the load-bearing tests than the worker who lays the rebar. That requires more skill, not less.
The agents need room to solve problems. If you constrain them too tightly, you're just using a very expensive code template. But that autonomy has to operate within a framework that guarantees the output meets your standards regardless of how the agent got there.
For us, that resulted in Guber, a framework that has three hard boundaries:
Build integrity. Every AI-generated artifact goes through the same CI/CD pipeline as human-written code, with no exceptions and no fast lanes. Static analysis, dependency scanning, container image signing, and SBOM generation all apply equally. We follow the SLSA (Supply-chain Levels for Software Artifacts) framework for provenance tracking, which means every build artifact can be traced back to the source and build process that produced it.
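One small example of a gate that applies equally to agents and humans: a CI check over a CycloneDX-style SBOM. The policy list and file layout here are illustrative, not our production gate:

```python
# sbom_gate.py -- illustrative CI gate over a CycloneDX JSON SBOM (layout assumed).
import json
import sys

DISALLOWED = {"left-pad"}  # placeholder policy list

def check_sbom(path: str) -> int:
    with open(path) as f:
        sbom = json.load(f)
    failures = []
    for component in sbom.get("components", []):
        name = component.get("name", "<unnamed>")
        if not component.get("version"):
            failures.append(f"{name}: missing version, cannot be traced")
        if name in DISALLOWED:
            failures.append(f"{name}: on the disallowed list")
    for failure in failures:
        print(f"SBOM gate: {failure}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(check_sbom(sys.argv[1] if len(sys.argv) > 1 else "sbom.cdx.json"))
```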
Execution isolation. Agents run in ephemeral containers with scoped, short-lived credentials. They can only access the repositories, APIs, and infrastructure they've been explicitly granted. This is least privilege applied to non-human actors, the same way you would scope a service account, not "Zero Trust" as a marketing term.
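Mechanically, that can be as simple as minting a short-lived, narrowly scoped token and handing it to a throwaway container. This sketch assumes PyJWT and a local Docker daemon; the claims, image name, and network are placeholders:

```python
# run_agent_task.py -- illustrative: short-lived scoped credential + ephemeral container.
import datetime
import subprocess
import jwt  # PyJWT

SIGNING_KEY = "replace-with-a-secret-from-your-vault"

def mint_agent_token(repo: str, ttl_minutes: int = 15) -> str:
    """Mint a token the agent can use for exactly one repo, expiring quickly."""
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": "agent:implementer",
        "scope": f"repo:{repo}:read-write",
        "iat": now,
        "exp": now + datetime.timedelta(minutes=ttl_minutes),
    }
    return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

def run_agent(repo: str) -> int:
    token = mint_agent_token(repo)
    # --rm makes the container ephemeral; a read-only root filesystem and a
    # locked-down Docker network (assumed to be defined elsewhere) keep the
    # agent inside the box it was granted.
    return subprocess.run([
        "docker", "run", "--rm", "--read-only",
        "--network", "agents-egress-only",
        "-e", f"AGENT_TOKEN={token}",
        "agent-runner:latest", "--task", "implement-spec",
    ]).returncode

if __name__ == "__main__":
    raise SystemExit(run_agent("mission-console"))
```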
Human review, scoped appropriately. We no longer review every line of code. For well-understood patterns like CRUD endpoints, boilerplate integrations, and infrastructure-as-code, we've shifted from line-by-line review to spec-compliance verification. The traceability check confirms requirement linkage. The test suite confirms behavior. The human confirms intent.
For security-sensitive code paths (authentication, authorization, cryptographic operations, anything that touches classified data), human review remains non-negotiable. The agent flags these paths automatically based on the spec's sensitivity classification. This isn't a philosophical position. It's a practical one: the cost of a missed vulnerability in an auth flow is categorically different from a missed edge case in a pagination endpoint.
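The routing itself doesn't need to be clever. A sketch, with tag names and categories that are illustrative rather than Guber's actual classification scheme:

```python
# review_routing.py -- illustrative sketch of scoping human review by spec sensitivity.
SENSITIVE_TAGS = {"auth", "authz", "crypto", "classified-data"}

def review_level(spec_tags: set[str]) -> str:
    """Decide how much human attention a generated change gets."""
    if spec_tags & SENSITIVE_TAGS:
        # Security-sensitive paths: a human reads every line, no exceptions.
        return "line-by-line"
    # Well-understood patterns: the human verifies spec compliance and intent;
    # contract tests and the traceability check carry the rest.
    return "spec-compliance"

assert review_level({"crud", "pagination"}) == "spec-compliance"
assert review_level({"auth", "session"}) == "line-by-line"
```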
At BrainGu, we went through the same messy adoption curve as everyone else. Our engineers started using AI tools individually, with no consistency in how they prompted, what they validated, or how they integrated the output. The results were predictably uneven. Some teams saw real acceleration. Others spent more time debugging AI-generated code than they would have spent writing it themselves.
That experience led us to build SmoothGlue Guber, an agentic AI framework designed specifically for the kind of governed, auditable development that DoD programs require. The core idea is straightforward: give agents enough structure that their output is verifiable, enough freedom that they're actually useful, and enough observability that you can measure whether the whole system is improving over time.
We're using Guber in-house on real work, not just demos. One of our teams recently used it to redesign a console component that simplifies how operators interact with one of our platforms. The kind of work that would normally take a developer a couple of days of context-gathering, implementation, and review cycles was done in hours. Not because the agent skipped steps, but because the spec-driven workflow and verification architecture kept the agent on the rails while it moved fast. That's one data point, not a study. We're collecting more. But it was enough to change the internal conversation from "is this useful?" to "how do we scale this responsibly?"
The last part matters more than most people think. AI agent output is non-deterministic. The same spec, given to the same model on two different days, can produce meaningfully different implementations. You cannot outsource your intellect. If you're not measuring outcomes (defect rates, review cycle times, requirement coverage, and code churn), you're flying blind on whether your AI-assisted pipeline is actually better than what it replaced.
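Even a crude scoreboard beats flying blind. A minimal sketch of the kind of baseline comparison we mean, with invented field definitions and numbers:

```python
# pipeline_metrics.py -- illustrative outcome tracking for an AI-assisted pipeline.
from dataclasses import dataclass

@dataclass
class PipelineMetrics:
    defect_rate: float           # escaped defects per 1,000 changed lines
    review_cycle_hours: float    # median time from PR open to merge
    requirement_coverage: float  # fraction of requirements with linked, passing tests
    churn_rate: float            # fraction of new lines rewritten or deleted within weeks

def regression_report(baseline: PipelineMetrics, current: PipelineMetrics) -> list[str]:
    """Flag any dimension where the AI-assisted pipeline is worse than the baseline."""
    findings = []
    if current.defect_rate > baseline.defect_rate:
        findings.append("defect rate is up")
    if current.review_cycle_hours > baseline.review_cycle_hours:
        findings.append("review cycles are slower")
    if current.requirement_coverage < baseline.requirement_coverage:
        findings.append("requirement coverage dropped")
    if current.churn_rate > baseline.churn_rate:
        findings.append("code churn increased")
    return findings

# Example: compare this quarter against the pre-agent baseline (numbers are invented).
baseline = PipelineMetrics(1.2, 18.0, 0.85, 0.10)
current = PipelineMetrics(0.9, 6.5, 0.91, 0.14)
print(regression_report(baseline, current))  # ['code churn increased']
```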
We track these metrics across our projects. Not because the numbers are always flattering, but because they're the only honest way to answer the question: "Is this working?"
We're planning a live walkthrough of Guber's verification pipeline, from spec to merged PR, with the governance layer running in real time. If you're building AI-assisted workflows for DoD programs or mission-critical systems and want to see how this works in practice rather than in slides, reach out to us on our website or follow BrainGu on LinkedIn for the announcement.