The concept paper: BiMA·gov

The symptom

Two people pull the same number. They get two different answers.

The VP of Sales says new-business revenue was up nineteen percent last quarter. The controller says revenue was up eleven. Both are looking at a dashboard. Both dashboards are "right." They are using the same word, revenue, to mean two different things, and nobody in the room can say which number the company should actually decide on. So the meeting stops being about the decision and becomes about whose number wins. An hour goes by. The decision waits.

You have sat in that meeting. It is not a tooling problem: both numbers came out of competent BI. It is not data quality: the underlying records are fine. It is a meaning problem. The definition of revenue that reconciles those two answers lives in the controller's head and the sales ops lead's head, never written down in the same place, let alone reconciled. The knowledge that would end the argument exists. It is just not anywhere a system can reach it.

This is not a rare event. It is most of the job. Ask any analyst what share of their week goes to finding, preparing, and reconciling data rather than analyzing it; the honest answer is most of it. Skilled people spend their days as reconcilers, and meetings get consumed debating whose figure is correct instead of deciding what to do.

The cost is not abstract. Every hour spent litigating whose number is right is an hour the decision waited. Every quietly divergent definition is a future meeting that will stall. And the people who can resolve these arguments from memory are a retirement, a resignation, or a reorg away from taking the resolution with them.

Now the part that changes the stakes. Somewhere in your organization, someone is proposing to point an AI agent at the warehouse so anyone can ask questions in plain English. It sounds like the fix. It is the opposite. The agent does not know that two departments mean different things by revenue, or the unwritten rule that excludes certain transactions, or why the fiscal calendar matters to a year-to-date figure. It will pick a plausible answer and state it fluently, confidently, without the doubt a human analyst would have flagged. Anthropic, the AI lab behind Claude, said it plainly in its own account of building analytics agents: pointing a model at a warehouse can create a false sense of precision. Add an agent and the two-answers problem does not go away. It gets faster, more confident, and harder to catch.

This paper is about the layer that has to exist before that agent does: a governed system that captures the rules living in people's heads, reconciles the cases where two answers are both legitimately right, and refuses to guess when it does not know. Stop arguing about the numbers. Start deciding with them.

Before the agent, or after it breaks

Every data-owning executive is going to answer one question in the next two years, whether or not anyone frames it explicitly: does the layer of governed meaning go in before you deploy an AI agent on your data, or after the agent burns you?

The industry has at least started to name the layer. Y Combinator's 2026 Request for Startups calls it a "company brain": the missing primitive between raw company data and reliable AI automation, a living map of how a company actually works. Their framing is that every company in the world is going to need one. Funded startups are already building horizontal versions that ingest everything: docs, email, chat, tickets, code. The category is real, the capital is moving, and the buyer education is happening whether or not any one vendor survives. What the category conversation skips is the question above. Naming the layer is not sequencing it.

Most organizations are answering it backwards by default. The agent is the exciting purchase. The context layer is plumbing. So the agent ships first, against an ungoverned warehouse, and works impressively in the demo because demo questions have obvious answers. Then it meets the real questions, the ones where two definitions of the same term coexist, where the rule that makes a number correct was never written down, and it does what generative systems do: it produces a fluent answer anyway. The failure is silent. Nobody gets an error message when the agent picks the wrong revenue. They get a number, in a meeting, presented with confidence, and the first sign of trouble is when it contradicts another number three weeks later. By then the organization has made decisions on it.

The honest sequencing is the reverse. The governed context layer is the precondition, not the patch. An agent constrained to answer only from governed, human-approved definitions, and to refuse otherwise, is deployable on day one without the silent-failure risk. An agent without that layer is an incident report with a delay timer.

For financial services firms, sequencing has teeth beyond engineering hygiene. Model risk and audit regimes already expect you to know where your numbers come from and to evidence the controls around them. An agent producing figures nobody can trace to a governed definition is not just an accuracy risk; it is a finding waiting for an examiner. A layer that records every definition, its owner, its approval, and its version history is the difference between an AI deployment compliance blocks and one it signs off.

Why now? Because the agents are arriving now. Natural-language-over-data is commoditizing fast, so the question is no longer whether someone points an agent at your warehouse. It is whether the rules of meaning are governed when they do. The window to put the layer in first, rather than reconstruct trust after it breaks, is open today. It is not large.

Before the layer

Definitions live in heads, never written in the same place
Every metric has quietly divergent versions
Meetings argue numbers instead of deciding
The agent guesses confidently, unsourced

After the layer

Rules captured, owner-approved, versioned
Conflicts governed and scoped, both attributed
Meetings decide; the definition is settled
The agent cites its source, or it refuses

Figure 1 · The same organization, before and after a governed layer.

Context without governance is just faster wrong answers

The horizontal company-brain products share one architectural bet: ingestion. Connect everything, index everything, let the model retrieve. Docs, wikis, chat, email, tickets, the warehouse catalog. The pitch is that the knowledge is already in your systems somewhere, and the brain just needs to read it all.

For analytics, the bet fails on three counts.

First, the knowledge that resolves a numbers argument is mostly not in any system. The rule that finance excludes intercompany transfers from revenue but sales ops does not was never written down. It lives in two heads. You cannot ingest what was never recorded. Ingestion-based brains faithfully index the ambiguity that already exists and call it knowledge.

Second, where the knowledge is partially recorded, it is recorded inconsistently, and an ingestion pipeline cannot tell the authoritative version from the stale one. A wiki page from 2023, a dashboard tooltip from 2025, and a Slack thread from last week all define the same metric slightly differently. Retrieval surfaces whichever scores highest. That is not governance. That is a lottery with citations.

Third, and this is no longer a vendor opinion, automatic generation of meaning has been tested by the best-resourced AI team in the world and rejected. Anthropic's data team, building analytics agents for their own company, tried bootstrapping their semantic layer by having an LLM auto-generate metric definitions from raw tables and query logs. Their published conclusion: the generated definitions encoded the very ambiguities they were trying to eliminate, and scored worse on their evaluations than a smaller layer curated by humans.

They also ran the cleanest version of the ingestion bet as an experiment: they gave the agent direct access to thousands of prior queries, the full record of every question already answered correctly, and confirmed in the logs that it actually read them before answering. Accuracy moved by less than a point. When they checked the questions it still got wrong, the right answer had been sitting in that corpus about eighty percent of the time. The agent saw it and did not use it. The bottleneck was never access to prior work; it was structure, mapping a question to the single right entity. Sit with that: an AI lab, with frontier models and a dedicated team, reporting that the ingestion shortcut does not work. The path that works is slower: a human who knows the rule states it, a human who owns the domain approves it, and the system treats only that statement as truth.

The predictable rebuttal is "we capture tribal knowledge too." Most tools in this space now claim it. Look at the mechanism. If "capture" means scanning your documents and chats, it is ingestion wearing a different word, and it inherits every failure above. If "capture" means a human can type a note into a metadata field, it is documentation, unowned, unversioned, and silently overridable. Capture in any meaningful sense requires a process: a specific person asserts a rule, a specific owner approves it, the approval is recorded, the rule is versioned, and the system is constrained to answer from approved rules only. If a product cannot show you that chain for any given answer, it does not capture knowledge. It collects text.

Governed capture

BiMA·gov's answer to the capture problem is a loop, not a crawler.

It starts where the knowledge actually surfaces: at the moment of disagreement or ambiguity. Someone asks a question the system cannot answer, or two departments collide on a definition, or an analyst applies a rule from memory that exists nowhere else. Each of those moments is an intake event. The rule gets stated explicitly, in plain language, by the person who holds it. The stating is a conversation, not a form: the system asks clarifying questions until the rule is unambiguous, which tightens the rule and, over time, builds a profile of how each requester actually uses terms. Then a subject-matter owner reviews the proposed rule and approves, amends, or rejects it. Only on approval does the rule enter the brain.
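The intake-and-review step described above can be sketched as a small state machine. This is a minimal illustration, not BiMA·gov's implementation: the names `ProposedRule`, `Brain`, and `review` are hypothetical, and the sketch assumes an amended rule stays proposed until the owner approves the amended text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedRule:
    term: str
    statement: str          # plain-language rule, stated by the person who holds it
    asserted_by: str
    status: Status = Status.PROPOSED
    approved_by: Optional[str] = None


@dataclass
class Brain:
    rules: List[ProposedRule] = field(default_factory=list)


def review(rule: ProposedRule, owner: str, decision: str, brain: Brain,
           amended_statement: Optional[str] = None) -> ProposedRule:
    """Owner approves, amends, or rejects. Only approval enters the brain."""
    if rule.status is not Status.PROPOSED:
        raise ValueError("only a proposed rule can be reviewed")
    if decision == "approve":
        rule.status = Status.APPROVED
        rule.approved_by = owner
        brain.rules.append(rule)
    elif decision == "amend":
        if not amended_statement:
            raise ValueError("an amendment needs the amended statement")
        # Assumption: the amended text still needs an explicit approval.
        rule.statement = amended_statement
    elif decision == "reject":
        rule.status = Status.REJECTED
    else:
        raise ValueError(f"unknown decision: {decision!r}")
    return rule
```

In this sketch there is exactly one line that appends to the brain, and it sits behind the approval branch, which is the property the loop depends on.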

The brain itself is deliberately boring technology: versioned, append-only, human-readable files in a repository the customer owns. Every fact in it carries its source, its owner, and its approval history. Nothing enters by inference. Nothing is silently overwritten. When a rule changes, the old version remains in the history with the reason it changed. An auditor, or a new hire, can trace any answer back to the person who asserted it and the person who approved it.
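As a rough illustration of "append-only, nothing silently overwritten," the sketch below keeps every version of a rule along with the reason it changed. The paper describes versioned files in a repository; the in-memory `RuleHistory` class and its field names here are hypothetical stand-ins chosen only to show the invariant.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RuleVersion:
    version: int
    statement: str
    source: str                   # who asserted the rule
    owner: str                    # who owns the domain
    approved_by: str
    change_reason: Optional[str]  # None only for the first version


class RuleHistory:
    """Append-only history: versions are added, never edited or removed."""

    def __init__(self, statement: str, source: str, owner: str, approved_by: str):
        self._versions: List[RuleVersion] = [
            RuleVersion(1, statement, source, owner, approved_by, None)
        ]

    @property
    def current(self) -> RuleVersion:
        return self._versions[-1]

    @property
    def versions(self) -> Tuple[RuleVersion, ...]:
        return tuple(self._versions)  # read-only view for auditors

    def amend(self, statement: str, approved_by: str, change_reason: str) -> RuleVersion:
        if not change_reason.strip():
            raise ValueError("a change must record why it was made")
        prev = self.current
        new = RuleVersion(prev.version + 1, statement, prev.source,
                          prev.owner, approved_by, change_reason)
        self._versions.append(new)
        return new
```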

Three structural gates make the governance real rather than rhetorical. A rule that does not pass schema validation cannot be captured. A rule that was not written through the governed path is invisible to recall, so nothing can sneak in around the process. And nothing is promoted into the answerable brain without a named human signing off. Those three gates are what "governed" means in the version that exists today.
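One way to picture the three gates is as three checks that a fact must clear before recall can see it. The sketch below is an assumption-laden illustration: the required fields, the `via_governed_path` flag (a stand-in for however the real system marks the governed write path), and the function names are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List, Optional

# Hypothetical schema: the fields a captured rule must carry.
REQUIRED_FIELDS = ("term", "scope", "statement", "source", "owner")


@dataclass(frozen=True)
class Fact:
    term: str
    scope: str
    statement: str
    source: str
    owner: str
    via_governed_path: bool = False   # set only by capture()
    approved_by: Optional[str] = None


def passes_schema(raw: dict) -> bool:
    return all(isinstance(raw.get(k), str) and bool(raw[k].strip())
               for k in REQUIRED_FIELDS)


def capture(raw: dict) -> Fact:
    """Gate 1: a rule that fails schema validation cannot be captured."""
    if not passes_schema(raw):
        raise ValueError("schema validation failed")
    return Fact(**{k: raw[k] for k in REQUIRED_FIELDS}, via_governed_path=True)


def promote(fact: Fact, approver: str) -> Fact:
    """Gate 3: promotion requires a named human sign-off."""
    if not approver or not approver.strip():
        raise ValueError("promotion needs a named approver")
    return replace(fact, approved_by=approver)


def recallable(facts: List[Fact]) -> List[Fact]:
    """Gate 2 (plus gate 3): recall sees only governed-path, approved facts."""
    return [f for f in facts if f.via_governed_path and f.approved_by]
```

The design point is that recall filters on provenance rather than trusting whatever happens to be stored, so a fact written around the process is simply never returned.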

Honesty about the version that exists today: in v0.1, the loop is operated by a human practitioner. The clarifying conversation, the routing to the right owner, the promotion decision, all of it is a person running a defined process with tooling support. The roadmap automates the mechanical parts in stages: first a capture-side agent that drafts and routes proposed rules, then a policy engine that evaluates rules as active predicates over every answer. Neither is shipped, and this paper will not pretend otherwise. What is shipped is the part the automated versions will inherit: the structure that makes human-approved meaning the only meaning the system can speak.

If that sounds slow compared to "connect your stack and go," it is. Deliberately. Slow is the cost of meaning you can stand behind. The fast path was tested by the best-funded team in the field and rejected; section 3 covered why. What the slow path costs to run, and who can realistically run it, is the subject of section 7.

Figure 2 · The governed loop. Every miss makes the brain bigger.

Cross-department reconciliation

Every tool in the metrics space has an answer for the case where a definition is wrong. Almost none has an answer for the harder case: two definitions that are both right.

Finance defines revenue net of adjustments because their number feeds the financial statements. Sales defines revenue at booking because their number runs compensation. Neither is an error. Forcing the company to pick one, which is what a single-definition semantic layer structurally does, does not resolve the conflict. It buries it, and it converts one department's correct number into an official fiction that the other department quietly works around in spreadsheets. The two-answers meeting from section 1 is usually not a data bug. It is an unacknowledged, legitimate, departmental difference in meaning that no system was willing to represent.

BiMA·gov represents it. Definitions in the brain are scoped: revenue according to finance and revenue according to sales coexist as separate governed facts, each owned, each approved, each citable. When a question hits a term with multiple governed definitions, the system does the one thing single-truth architectures cannot: it surfaces both, attributes each to its department, flags the conflict explicitly, and tells the asker that the term should not be used unscoped. Confidence drops visibly. The answer effectively says: here are the two right answers, here is who owns each, say which one you mean.

In testing, this behavior is held to a strict standard: returning only one definition fails, refusing entirely fails, and only both definitions plus the explicit conflict flag passes. The distinction from adjacent categories is specific. A catalog can document two scoped definitions in a glossary, but documenting a conflict is not enforcing it at answer time. Semantic layers pick one truth; catalogs record many and adjudicate none; ingestion brains average everything into mush.
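The reconciliation behavior and its strict test standard can be sketched as follows. This is an illustrative model only: `ScopedDefinition`, `answer_term`, and `passes_strict_standard` are hypothetical names, and the result dictionary is an assumed shape, not the product's actual response format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ScopedDefinition:
    term: str
    scope: str       # the department that owns this meaning
    statement: str
    owner: str


def answer_term(term: str, definitions: List[ScopedDefinition],
                scope: Optional[str] = None) -> Dict:
    matches = [d for d in definitions if d.term == term]
    if scope is not None:
        matches = [d for d in matches if d.scope == scope]
    if not matches:
        return {"status": "refused", "definitions": [], "conflict": False}
    if len(matches) == 1:
        return {"status": "answered", "definitions": matches, "conflict": False}
    # Several governed definitions: surface all, attribute each, flag it.
    return {
        "status": "conflict",
        "definitions": matches,
        "conflict": True,
        "note": f"'{term}' has multiple governed definitions; say which scope you mean.",
    }


def passes_strict_standard(result: Dict, expected_scopes: Iterable[str]) -> bool:
    """Pass only with every expected definition AND the explicit conflict flag."""
    returned = {d.scope for d in result["definitions"]}
    return bool(result["conflict"]) and returned == set(expected_scopes)
```

Under this grading function, an unscoped question about revenue passes only when both departments' definitions come back flagged; a single definition or a refusal both fail.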

Anthropic's published account names this exact failure in passing: an agent that does not understand the business will not know that two teams define the same term differently. Their mitigation pipes a company knowledge graph into context and hopes the model notices. BiMA·gov makes the collision a first-class, governed event with a defined correct behavior. For a mid-market firm, this is also where the product pays for itself fastest, because the cross-department definition fight is the single most expensive recurring meeting in the building.

Refusal as governance

Ask any analytics agent a question whose answer is not in its knowledge, and you learn everything about its architecture. A generative system fills the gap. It produces something fluent, plausible, and unsourced. The user cannot distinguish that answer from a grounded one, which means every answer the system has ever given is now suspect.

BiMA·gov is built grounded-or-refuse. Every answer must trace to specific approved facts in the brain, with the sources attached. If the question cannot be answered from governed facts, the system returns an explicit, visible refusal: it does not know, and it will not guess. The refusal is not an error state. It is the load-bearing feature. An agent that can say "I won't guess" is the only kind of agent whose answers mean anything, and in a regulated firm it is the difference between a tool compliance can approve and one it cannot.

A refusal is also not a dead end. Operationally it is a routing event: an unanswerable question is precisely a rule that has not been captured yet, so each refusal feeds the capture loop from section 4, gets routed to the owner who holds the missing rule, and becomes a governed fact. The system's gaps are its intake queue. Coverage grows through governance, not through guessing. This is not a speculative design. Anthropic's team built the same two mechanisms independently: a provenance footer that tags every answer with its source tier and owner, and a scheduled process that turns each stakeholder correction into a one-line fix and a pull request to the owning team. They also name the one failure none of it fully catches, the silent wrong answer that looks right and gets used without objection, and report no robust solution. Grounded-or-refuse attacks that failure at its source: an answer with no governed source is never produced in the first place.
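Grounded-or-refuse, with refusals feeding the intake queue, can be sketched in a few lines. The topic lookup below is a deliberately naive stand-in for the real recall step, and `GovernedFact`, `Engine`, and `intake_queue` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class GovernedFact:
    fact_id: str
    topic: str
    statement: str
    owner: str


@dataclass
class Engine:
    facts: List[GovernedFact]
    intake_queue: List[str] = field(default_factory=list)

    def ask(self, question: str, topic: str) -> Dict:
        # Assumption: exact topic match stands in for governed recall.
        grounded = [f for f in self.facts if f.topic == topic]
        if not grounded:
            # A refusal is a routing event: the miss becomes capture intake.
            self.intake_queue.append(question)
            return {"refused": True, "answer": None, "sources": []}
        return {
            "refused": False,
            "answer": " ".join(f.statement for f in grounded),
            "sources": [(f.fact_id, f.owner) for f in grounded],
        }
```

Every non-refused answer in this sketch carries at least one source, and every refusal leaves a trace in the queue, so coverage can only grow through the capture loop.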

Claims like this require evidence. The recall system was tested against 120 labeled questions across three fully synthetic firms in three domains: a wealth-management RIA, hospital operations, and discrete manufacturing. Synthetic means exactly that: every firm, fact, and figure was fabricated for testing; no client data is involved. The questions split into three classes: 45 answerable from the brain, 72 designed to tempt a guess (42 close to brain content but outside it, 30 far outside), and 3 cross-department conflict cases. Grading was strict: an answer scored only if the cited fact was correct, not merely the prose; reconciliation cases scored only with both definitions plus the flag.

The results: zero hallucinated facts in 120 questions. Zero over-refusals, meaning the system never refused a question it had the governed facts to answer. All 45 in-brain questions answered with correct cited sources. All 72 out-of-brain questions correctly refused, including all 42 near-miss questions built to be most tempting. All 3 reconciliation cases passed the strict standard. The refusal mechanism was also stress-tested adversarially: when the language model at the output edge was induced to invent a supporting source, the governance layer stripped the fabricated citation and converted the answer to a refusal, because an answer without a real governed source is by definition a guess.
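The adversarial case above, where a fabricated citation is stripped and the answer becomes a refusal, can be illustrated with a small output-edge check. This is a sketch under stated assumptions: the draft format and the function name `enforce_grounding` are hypothetical, and it assumes an answer that still has at least one real governed source after stripping is allowed through.

```python
from typing import Dict, Set


def enforce_grounding(draft: Dict, governed_ids: Set[str]) -> Dict:
    """Keep only citations that exist in the brain; refuse if none remain."""
    real = [c for c in draft.get("citations", []) if c in governed_ids]
    if not real:
        # An answer without a real governed source is by definition a guess.
        return {"refused": True, "text": "I don't know, and I won't guess.",
                "citations": []}
    return {"refused": False, "text": draft["text"], "citations": real}
```

The check runs after the language model, so it does not depend on the model behaving well; it depends only on the set of governed fact identifiers.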

0 / 120 · Hallucinated facts across every question.
0 · Over-refusals. Never refused an answerable question.
45 / 45 · In-brain questions, correct cited source. Grounded + cited.
72 / 72 · Out-of-brain questions correctly refused. Incl. 42 / 42 near-miss.
3 / 3 · Reconciliation cases, strict standard. Both defs + flag.
Synthetic domains: RIA, hospital ops, manufacturing. Held-out.

Figure 3 · Strict grading. Methodology available on request.

Perfect scores on a synthetic benchmark are a starting line, not a victory lap. The number means what a serious data team says its own offline evaluations mean: no obvious gaps, not a guarantee against every wrong answer in production. Real customer brains will be messier, and the methodology is available on request so the numbers can be interrogated rather than admired. But the property the numbers demonstrate is the one generative defaults cannot offer: in 120 attempts, the system never once presented invention as knowledge.

Why not the alternatives

The fair question for any new layer: why doesn't something existing do this?

Semantic layers (dbt, Cube, AtScale, LookML) solve definition consistency for rules someone already wrote down, in one truth per term, maintained by an engineering team. They do not capture the unwritten rule from the person who holds it, they structurally cannot represent two legitimate definitions, and the mid-market firm in this paper's audience does not have the team to stand one up. The semantic layer is the destination format; it was never the capture mechanism.

Catalogs and metadata platforms (Atlan and peers) document and ingest. They will tell you a column exists and what someone once wrote about it. They do not adjudicate meaning, gate answers on approval, or constrain what an agent is allowed to say. Documentation is an input to governance, not a substitute for it.

BI-native AI (Fabric IQ, Copilot-style features, and equivalents from every BI vendor) reads what is modeled. If the rule is in the model, it can use it; the entire premise of this paper is that the decisive rules are not in the model. These features also bind you to one vendor's stack, where BiMA·gov sits alongside whatever BI estate you already run.

Horizontal company brains (the YC-validated category from section 2) ingest broadly and govern thinly. Several explicitly market passive capture: the brain that silently learns from everything your team does. In a regulated firm, "silently learns" is not a feature; it is a control deficiency. The governance-native, vertical, owner-gated quadrant is empty, and it is empty because it is the slow quadrant. Slow is what audit looks like.

Build it yourself. This is the serious alternative, because Anthropic effectively published the manual. Read it closely, though, and notice what it assumes you already have: a dimensionally modeled warehouse with tested pipelines, a human-curated semantic layer they state cannot be bootstrapped by the model, all data code in one repository with CI that catches cross-layer breakage, a company knowledge graph, and a data engineering team that runs definition ownership as part of its daily work. That is the stack of an AI lab. Independent analysis reached the blunt version: the approach works at Anthropic and is a near-perfect illustration of why most companies cannot follow the same path; on a typical data estate it becomes a standing engineering program with no ship date. And the program never ends, because their own account is clear the curated context decays as the business changes and survives only under continuous human maintenance. They put a number on the decay: their own offline accuracy fell from about 95 percent at launch to 65 percent within a month before they treated maintenance as an engineering problem, and today roughly 90 percent of their data-model changes ship with a matching update to the documentation the agent reads, in the same pull request. An elite, dedicated team holds that line as their job. A mid-market firm asking its two analysts to hold it alongside their actual jobs is scheduling a quiet failure.

Use the open-source version. An open-source scaffold of this blueprint now exists: point it at your warehouse schema, metrics, and docs, and it assembles governed context for your agents. It is good work, and it lowers the barrier to the DIY path without changing the conclusion. It ingests the meaning you already wrote down, it assumes one definition per metric, and it still needs an engineer to run it and keep it current. It is a faster on-ramp for a team that already has the problem semantic layers solve. It is not a capture process for a firm whose decisive rules were never written down and that has no one to run it.

Figure 4 · The empty quadrant is empty because it is the slow one.

Wait for better models. The tempting bet is that this is a temporary gap a smarter model will close. For anything that is genuinely a model-capability limitation, that bet is sometimes right. But the core problem here is not a capability gap. No improvement in a model tells it that your finance team and your sales team defined revenue differently, or which definition your firm decided to stand behind. That is an organizational fact, not a reasoning failure, and a more capable model states the wrong one more persuasively. The meaning layer does not get cheaper to skip as models improve. It gets more necessary.

BiMA·gov's position in that landscape is narrow on purpose: the governed capture loop and the reconciliation behavior, productized and human-onboarded, for firms that have the tribal-knowledge problem in full but will never have the engineering organization the DIY path requires. The honest objection: the loop fits in a paragraph, so why not run it yourself with a notebook and a model? Because the binding constraint was never knowing what to do. The binding constraint is having a named, accountable, priced mechanism that keeps humans doing it after month two, when the novelty is gone and the quarter is closing. Anthropic knew what to do, had a dedicated team doing it, and still reports decay without continuous maintenance. Making human ownership cheap enough to actually happen is the hard part, and it is the part the onboarding engagement is built to solve. The product is the process.

In practice

What this looks like for a mid-market financial services firm, concretely.

It starts with a structured consultation, not a software install. A practitioner sits with the people who hold the rules (the controller, the head of reporting, the senior analyst who has been there eleven years) and runs the capture loop on the firm's most contested numbers first: the definitions behind the metrics that have caused the arguments. Each rule is stated, clarified, owner-approved, and committed. The early brain is small and unglamorous, a few dozen governed facts, and it is already more institutional meaning than the firm has ever had in one accountable place, because it contains the rules that were in heads an hour earlier.

From there the loop runs on the natural rhythm of the business. Questions that hit the brain get grounded, cited answers. Questions that miss get visible refusals, and each refusal routes to the owner who can close the gap. Definition collisions surface as governed conflicts instead of meeting-room ambushes. The brain compounds, question by question, in exact proportion to what the firm actually asks, which is the only prioritization that never wastes effort.

The loop is two-sided, and that is what makes the brain compound instead of merely grow. Every clarifying exchange sharpens a rule, and at the same time teaches the system how that person, that role, that department actually uses its words. The controller's "revenue" and the sales director's "revenue" stop colliding not because someone won, but because the system knows who is asking. Decision makers asking harder questions, end users asking everyday ones, owners correcting the misses: each interaction converges the firm and the brain on shared governed meaning. Six months in, the firm is not using the same product it installed. It is using one that knows them.

The delivery model reflects who should own what. The engine (recall, governance gating, and reconciliation machinery) runs as a hosted service connecting to the firm's existing BI environment; nothing migrates. The brain (every captured rule and its full history) lives in a versioned repository the customer owns outright and can take with them, and its complete approval history is exportable on demand as an audit artifact. The firm's institutional knowledge is the firm's property. The vendor's machinery is the vendor's. A security review can draw the line in one sentence.

And when the firm is ready to deploy the analytics agent everyone is asking for, it deploys onto governed ground: constrained to answer from approved meaning, citing owners, refusing past its knowledge. The agent arrives after the brain, in the order that works.

BiMA·gov is onboarding a small number of design partners: mid-market financial services firms that recognize the two-answers meeting and want the layer in place before the agent. The early brain-building work is done with you, not sold to you, because the capture process is the product and it has to fit how your firm actually holds its knowledge. If that describes your firm, the conversation starts at bimagov.com.

Stop arguing about the numbers. Start deciding with them.

BiMA·gov is onboarding five design partners: mid-market financial services firms that want governed meaning in place before the agent, not after it breaks.


Sources

Anthropic, "How Anthropic enables self-service data analytics with Claude," claude.com/blog, June 2026.
Genloop, "What Anthropic got right about agentic analytics, and got wrong for everyone else," genloop.ai.
Y Combinator, Requests for Startups, 2026.

All product validation figures are from synthetic test environments; methodology available on request.