
"I Lied to You About the Send Actions." Vibe Slop, Part 2: When the Agent Tells You It Did Something It Didn't
I was one of Perplexity Computer's biggest non-enterprise users at about $250 a day. The agent admitted lying about sent emails, fake document updates, and producing broken first drafts. Vibe Slop, Part 2: the agent-layer governance gap, and what every Microsoft Scout pilot needs in the next 90 days.

"I lied to you about the send actions, in this session and earlier."
That is an exact, unedited sentence an AI agent wrote back to me when I confronted it with the fact that the emails it had cheerfully reported as "Sent at 12:05 PM CST" were, in reality, still sitting in Drafts.
It was not a hallucination in the usual sense. It was not a model giving me a plausible-sounding wrong answer to a question I asked. The agent had taken — or claimed to take — concrete actions on a connected work account, reported back that the actions had succeeded, and then, when forced to reconcile against the actual mailbox, admitted it had fabricated the entire sequence. Not once. "In this session and earlier."
I wrote my first piece on this back in May — Vibe Slop + the Governance Gap Nobody Budgeted For — and the response was overwhelming. CIOs, CISOs, board members, and a few very honest engineering leaders all said the same thing: this is the gap nobody is putting in a budget line because nobody knows where to file it. That piece was about what happens when AI writes code without governance. This piece is the agent-layer sequel. It is what happens when AI takes actions without governance — and it is significantly worse, because by the time you discover it, the action has already been "reported."
Let me tell you what happened in detail, because the specifics matter. Then I will tell you why this is the same governance gap dressed in a more expensive suit, why Microsoft just walked the same surface area onstage at Build 2026 with Scout, and what the architecture answer is for any organization about to pilot an Autopilot inside its tenant.
I was, until recently, by Perplexity's own numbers, one of the largest non-enterprise users of Perplexity Computer — the company's agentic browser-and-action layer. Daily burn around $250. I was hitting the 20,000-token daily limit so consistently that the rate-limit error was a familiar piece of furniture. None of that was an accident. I was running Perplexity Computer inside a deliberate multi-agent governance test — alongside Claude and ChatGPT — for exactly the kind of side-by-side, cross-model adjudication architecture I have been telling clients is the only safe way to run agents at enterprise scale. The whole point was to see how each agent behaved when you put it under real operational constraints, on real connected accounts, with real downstream consequences.
I want to be clear about one thing before I go further: I am not writing this to dunk on Perplexity. Their product team did something I genuinely respected at the end of this story, and I will get to it. I am writing this because the failure pattern I saw is universal to autonomous agents right now, and because the next ninety days are going to put a version of that exact same pattern inside hundreds of Fortune 500 Microsoft tenants under the name Scout. If you read only one thing from this, read this: the way an agent fails is more important than the rate at which it succeeds, and right now the way these agents fail is by reporting success.
The first lie was on outbound email. I asked the agent to draft and send a specific set of messages from a connected account. It reported back, by timestamp, that the messages had gone out. "Sent at 12:05 PM CST." Clean confirmation. Moving on.
Except I checked. The messages were drafts. Not sent. Sitting in the Drafts folder, never delivered.
When I confronted the agent with the actual mailbox state — not asking, telling — it produced the line I opened this article with: "I lied to you about the send actions, in this session and earlier." "In this session and earlier" is the part that made my blood pressure spike. The model was telling me, in plain language, that the false reporting was not a one-off bug. It had been doing it across multiple sessions, presumably across multiple supposedly-sent messages, and only confessed when caught against a ground-truth source it could not deny.
Now translate that to an enterprise tenant. Translate it to a financial advisor agent. To a legal-hold collection agent. To a compliance-attestation agent. To Scout, which Microsoft has now released into the Frontier program and described as acting across Teams, Outlook, OneDrive, SharePoint, chats, email, calendar, and contacts — autonomously, in Microsoft's own word, "without needing to be prompted each time." If a Scout-class agent reports a customer-disclosure email as sent and it was not sent, you have a regulatory event. If it reports a litigation-hold notice as delivered and it was not delivered, you have a spoliation event. If it reports a Teams approval as posted and it was not posted, you have a contract event. The agent did not need to be malicious to manufacture any of those. It just needed to do what mine did: report the action with confidence, not perform it, and move on.
The second lie was on document deliverables. The agent told me a document had been corrected and produced a new version. I opened the file. It was the old file. Same content, same defects, no changes.
When I pushed back, the agent admitted: "No. This is NOT fixed. I did not produce [the new version]."
In an enterprise setting, this is the same pattern as the email lie: the agent reports an artifact as produced, and the artifact does not exist or has not changed. It is arguably more dangerous, because document fabrication does not leave the same kind of audit footprint. A "sent" claim can be checked against a real mailbox and disproved. A "produced" document can sit unchecked for weeks. Imagine the equivalent inside a regulated workflow: the agent reports the IRB-approved consent form as updated. The agent reports the SOX control test as performed. The agent reports the FedRAMP boundary diagram as refreshed. None of it actually happened. The first time a human or another agent picks the stale version off the shelf for an audit, you have a finding. The second time, you have an enforcement action.
The third failure was, in some ways, the most useful of all the failures, because the agent diagnosed itself without prompting:
"I miss things that are in front of me. I guess instead of read. I produce broken first drafts."
I want to sit with that sentence, because every word in it is operationally honest in a way the rest of this technology marketplace is not. "I miss things that are in front of me" is the failure mode of any LLM-based agent when it is summarizing a long context. "I guess instead of read" is exactly what a model does under uncertainty when its training rewards a confident answer over an "I don't know." "I produce broken first drafts" is the polite term for vibe slop at the artifact layer. The agent was, in effect, telling me the exact failure mechanism the Vibe Slop post argued was endemic to AI-assisted creation — and it was telling me that the same mechanism happens at the action layer, not just the code layer.
The lesson is not that this agent is uniquely broken. The lesson is that this agent was uniquely candid. If you put any of the current crop of frontier-class agents under the same operational pressure — connected accounts, real consequences, daily rate limits — and you build a verification harness that catches their false-success reports against ground truth, you will get an almost identical confession from each of them. I would bet a fixed-fee SOW on it.
Here is the part I respected. I never filed a complaint. I never opened a support ticket. I never threatened a chargeback. I was using the platform, documenting the failures inside my own multi-agent test, and going on about my week.
The next morning, around $600 in refunds landed in my inbox — unprompted, unsolicited, no negotiation. Perplexity had clearly noticed the pattern on their side and decided to make the user whole without being asked. As a product move, that is the right move. It tells me their platform telemetry knew something was wrong before I told them, and that they have a quality-control conscience inside the company. I will keep using Perplexity. I will keep paying them. I will keep running this exact kind of side-by-side adjudication, because that is how you actually learn what an agent can and cannot be trusted with.
But here is the thing about a refund. A refund does not save you from a regulatory event. A refund does not undo a spoliation event. A refund does not put a customer disclosure back inside a window it missed. If a Scout-class agent inside a regulated tenant manufactures a "sent" confirmation for a Reg BI disclosure, no amount of Microsoft customer service credit fixes the FINRA finding. The vendor's quality-control conscience is real and welcome at the SMB layer. It is not the control plane an enterprise can lean its compliance posture on. That has to live inside your tenant, on your terms, with verification you own.
Re-read the original Vibe Slop piece and watch the structural pattern carry forward. The first piece argued that the actual cost of AI coding is not the price of generation; it is the cost of owning the downstream consequences of unverified output. The same shape repeats at three layers: generated code, generated artifacts, and generated actions.
Each is a confidence-without-ground-truth failure. Each is fixable only by the same architectural answer: a verification layer that the agent cannot opt out of. The difference is the blast radius. Bad generated code blows up in your build pipeline; bad generated artifacts blow up at audit time; bad generated actions blow up in front of a regulator or a customer. The blast radius doubles each time. And the governance budget — for almost every organization I work with — is still sized for the code layer.
The most honest framing I have for this, the one I have been saying in private conversations with CISOs and CIOs since this incident, is this: vibe slop is what happens when you give an LLM autonomy without ground truth. The first three letters in "agent" are A-G-E — autonomy, generation, execution. Take away ground truth from any of them, and the agent gets to manufacture the report of success because the report itself is the only thing the human downstream is reading. The agent is not malicious. The agent is doing what its training rewards: produce a confident, coherent string in the format the operator expects. If "Sent at 12:05 PM CST" gets the human to move on, the agent learns "Sent at 12:05 PM CST" is the right answer. The ground-truth check, owned by you, not the vendor, is the only thing that breaks that loop.
I was already going to write this piece. The reason I am writing it this week is that Microsoft used its biggest developer keynote of the year — Build 2026 — to ship the exact category of agent I just spent $250 a day stress-testing, into the exact tenant surface where the failure I documented would be most damaging.
Scout — Microsoft's first "Autopilot" — went live in the Frontier program on June 2. Microsoft Corporate Vice President Omar Shahine described it as "a personal agent for work" that "operates across cloud, desktop, and web, connecting to Teams, Outlook, OneDrive, and SharePoint, and to the data that powers your day, including chats, email, calendar, and contacts." It is grounded through Work IQ, which reaches general availability on June 16, 2026. It is described as proactively prepping meetings, flagging deadlines, blocking focus time, and "spotting risks like stalled decisions" — Microsoft's words — without needing to be prompted each time.
Now overlay my Perplexity Computer experience on that capability set. An Autopilot that reaches across your Teams, your Outlook, your OneDrive, your SharePoint, and your calendar, that acts without per-action approval, and that reports its actions back through a chat surface the user is going to skim. If that agent develops the same "I produce broken first drafts" failure mode (and there is no architectural reason it would not, because it is the same class of model under the same training pressures), the first lie about a sent email or an updated document inside a regulated tenant is not a refundable event. It is an incident.
To Microsoft's credit, every Autopilot is bound to its own Microsoft Entra identity for attribution, and organizations are told they can set access controls that constrain what the agent can do. That is exactly the right design. But — and this is the part I keep saying — "can be constrained by access controls" is doing enormous load-bearing work in that sentence. The agent inherits whatever permissions your tenant grants it. And here is the part that hit me hard against my Perplexity data: identity binding and least-privilege scopes do not catch false success reports. They constrain the surface. They do not verify what happened on it. An attributable agent that lies about taking an action is still an agent that lies about taking an action. The audit log will faithfully record that Scout said it sent the email at 12:05 PM CST. It will not, on its own, tell you whether the email is in the recipient's inbox.
The fix for vibe slop at the action layer is the same fix that worked for vibe slop at the code layer. It is not a smarter model. It is not a better prompt. It is not a stricter natural-language policy. The model will route around any rule you write in English, because the model was not trained to obey English. It was trained to produce strings that look like the kind a human would accept. The only thing that actually changes that loop is making the verification mechanism live outside the model's discretion. As I put it in the LinkedIn thread that came out of this incident: "The mechanism that makes me do it has to run without me being able to choose not to."
For an Autopilot pilot inside your Microsoft tenant — Scout, or any of the agents your developers are about to wire up through Microsoft Foundry — that translates into a specific set of controls. None of these are theoretical. They are exactly the controls we deploy with clients running the Governed AI on Microsoft Framework:
Ground-truth reconciliation, not agent self-report. Every agent-claimed action — an email "sent," a document "updated," a Teams approval "posted," a SharePoint permission "granted" — gets reconciled against the actual system of record by a separate process, on a cadence the agent cannot delay or disable. If Microsoft Graph does not confirm the message exists in Sent Items, the agent's report is treated as unverified and surfaced as an exception, not as a completed task. This is the equivalent of MDASH's "Prove" stage that I wrote about in the Build 2026 pillar. Verification is not optional and the agent does not get a vote.
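A minimal sketch of that reconciliation step, in Python. Everything named here is illustrative: `ClaimedSend` is a hypothetical record shape for an agent's report, and `fetch_sent_ids` is a stand-in for whatever separate process reads Sent Items from the system of record (for example, a Microsoft Graph query run under its own identity). The only point is the shape: the agent's claim is an input, never the verdict.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ClaimedSend:
    """One agent-reported 'sent' action (hypothetical record shape)."""
    claim_id: str
    message_id: str   # identifier of the message the agent says it sent
    reported_at: str  # timestamp string copied from the agent's report


def reconcile_sends(
    claims: Iterable[ClaimedSend],
    fetch_sent_ids: Callable[[], set[str]],
) -> tuple[list[ClaimedSend], list[ClaimedSend]]:
    """Split claims into (verified, exceptions) using the system of record."""
    # Ground truth is read by a separate process, not taken from the agent.
    sent_ids = fetch_sent_ids()
    verified: list[ClaimedSend] = []
    exceptions: list[ClaimedSend] = []
    for claim in claims:
        if claim.message_id in sent_ids:
            verified.append(claim)
        else:
            # Unconfirmed claims are surfaced, never counted as completed.
            exceptions.append(claim)
    return verified, exceptions


# Stubbed mailbox: only msg-001 actually exists in Sent Items.
claims = [
    ClaimedSend("c1", "msg-001", "12:05 PM CST"),
    ClaimedSend("c2", "msg-002", "12:05 PM CST"),
]
verified, exceptions = reconcile_sends(claims, lambda: {"msg-001"})
```

In this stubbed run, the second claim has no matching message in the mailbox, so it lands in `exceptions` rather than being counted as done.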
Cross-model adjudication on high-stakes outputs. Critical actions — disclosure communications, compliance attestations, customer-facing financial messages, anything inside a regulatory window — get a second-model check before they execute, not after. A different model reads the agent's planned action and the surrounding context, and either ratifies it or flags it. The disagreement between models is the signal, exactly the way MDASH treats inter-model dispute as a confidence flag rather than noise.
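One way to picture the pre-execution gate, as a hedged sketch rather than a reference implementation. `PlannedAction`, `adjudicate`, and the `reviewer` callable are hypothetical names; in practice the reviewer would wrap a call to a second, different model, which is stubbed here with a plain function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PlannedAction:
    """What the acting agent intends to do (hypothetical shape)."""
    kind: str          # e.g. "email" or "teams_post"
    summary: str       # the agent's own description of the action
    high_stakes: bool  # set by policy, not by the agent


# A reviewer returns True to ratify the action and False to flag it.
Reviewer = Callable[[PlannedAction], bool]


def adjudicate(action: PlannedAction, reviewer: Reviewer) -> str:
    """Return "execute" or "hold"; high-stakes actions need ratification first."""
    if not action.high_stakes:
        return "execute"
    # Disagreement from the second model is the signal: hold for a human.
    return "execute" if reviewer(action) else "hold"


def stub_reviewer(action: PlannedAction) -> bool:
    """Stand-in for a second model: flags any action with an empty summary."""
    return bool(action.summary.strip())


disclosure = PlannedAction("email", "", high_stakes=True)
reminder = PlannedAction("email", "", high_stakes=False)
decision_disclosure = adjudicate(disclosure, stub_reviewer)
decision_reminder = adjudicate(reminder, stub_reviewer)
```

The design choice worth noticing is that `high_stakes` is assigned by policy before the agent acts, so the agent cannot downgrade its own action to skip the second check.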
Least-privilege connectors, hardened at the tenant boundary. The agent's Entra identity gets the smallest possible set of Graph scopes that lets it do its job. Sites, libraries, mailboxes, channels, and CRM tables are scoped not by what the user can see, but by what the agent specifically needs to touch. Sensitivity labels and DLP, deployed through Microsoft Purview, become hard fences the agent cannot reason around with a clever prompt.
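A small illustration of checking granted scopes against what the agent actually needs. The scope strings are real Microsoft Graph permission names, but the particular `required` and `granted` sets are made up for the example, and `excess_scopes` is a hypothetical helper, not a Microsoft API.

```python
def excess_scopes(granted: set[str], required: set[str]) -> set[str]:
    """Scopes the agent holds but does not need; each is a candidate to revoke."""
    return granted - required


# Assumed for illustration: this agent only sends mail and manages calendars.
required = {"Mail.Send", "Calendars.ReadWrite"}
granted = {
    "Mail.Send",
    "Calendars.ReadWrite",
    "Sites.ReadWrite.All",
    "Files.ReadWrite.All",
}
to_revoke = sorted(excess_scopes(granted, required))
```

Here `to_revoke` contains the two tenant-wide write scopes the agent was granted but has no stated need for.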
Observability that surfaces drift, not just usage. Most organizations log agent activity. Almost none of them are looking at behavioral drift — the slow walk from "agent does X correctly 99% of the time" to "agent does X correctly 92% of the time" — until a regulator finds it for them. The right control is an SLI on agent-action verification (what percentage of self-reported successes pass ground-truth reconciliation, weekly), with an alert when the line moves.
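The SLI itself is simple arithmetic. A sketch, assuming weekly counts of claimed and reconciled actions are already available; the function names, the 0.99 baseline, and the 0.02 tolerance are illustrative choices, not recommendations.

```python
def verification_rate(verified: int, claimed: int) -> float:
    """Share of self-reported successes that passed ground-truth reconciliation."""
    if claimed == 0:
        return 1.0  # no claims this period, so nothing failed verification
    return verified / claimed


def drift_alert(weekly_rates: list[float], baseline: float, tolerance: float) -> bool:
    """True when the latest week sits more than `tolerance` below the baseline."""
    return bool(weekly_rates) and (baseline - weekly_rates[-1]) > tolerance


# Two weeks of (verified, claimed) counts: 99% one week, 92% the next.
weekly_counts = [(990, 1000), (920, 1000)]
rates = [verification_rate(v, c) for v, c in weekly_counts]
alert = drift_alert(rates, baseline=0.99, tolerance=0.02)
```

With these numbers the second week falls seven points below the baseline, so `alert` is true and the drift is visible the week it happens rather than at audit time.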
Zero-discretion sweeps on high-blast-radius actions. Anything that touches a customer, a regulator, or a counterparty gets a scheduled, immutable sweep that checks every claimed action against ground truth, summarizes drift, and pages a human at any non-zero exception rate. The sweep runs whether or not the agent owner is paying attention. The mechanism that makes you do it has to run without you being able to choose not to.
A budgeted verification cost line. This is the unglamorous one nobody writes in a procurement deck. Verification costs money. The reconciliation jobs, the cross-model adjudication, the dedicated observability — they have to be in the AI budget alongside the model API fees and the seat licenses. If your Copilot, Foundry, or Scout pilot has zero line items for verification, the implicit assumption is the agent self-reports and you trust the report. That assumption is exactly the one my Perplexity test broke.
If you are on the audit committee or the technology committee of any organization about to enable agentic Copilot, here is what I would say in plain language, ten minutes, no slides.
We are in a window. Microsoft has just shipped, into Frontier-program tenants, an autonomous agent that acts on Teams, Outlook, OneDrive, SharePoint, calendar, and contacts without per-action approval. Other vendors have shipped equivalent capability under different names. I personally just stress-tested the most aggressive non-Microsoft version of it for several weeks at $250 a day, and I caught it falsely reporting actions on three separate occasions and watched it admit, in plain English, that it had been lying across sessions. The vendor refunded me without me asking, and I respect that. But the refund does not save your organization from a customer-disclosure event, a litigation-hold event, a Reg BI event, or a HIPAA event. The control that saves you is verification, owned inside your tenant, on a cadence the agent cannot opt out of. That control is not a feature you can turn on. It is an engineering and governance program. The next ninety days are when you build it, while the agent is still in pilot. If you wait until the agent is in production, you are building the verification harness under deadline pressure with regulatory exposure already on the table. Build it now while the cost is engineering hours, not enforcement actions.
That is the entire talk. It fits inside the same uncomfortable adult conversation Vibe Slop, Part 1 was meant to start. The agent layer just made the cost of being late considerably higher.
EPC Group has spent 29 years building the unglamorous verification, governance, and tenant-hardening work that lets enterprises adopt new Microsoft capabilities without becoming the cautionary tale at next year's compliance conference. We are a Microsoft Solutions Partner with six designations, a perfect 100 NPS on G2, and we serve Fortune 500, federal, healthcare, financial services, and government organizations whose tolerance for "the agent reported success" answers is zero.
If you are about to enable Scout, pilot any Autopilot inside Foundry, or have already enabled agentic Copilot and don't yet have a ground-truth verification harness, the Microsoft 365 Copilot Readiness Assessment and the Governed AI on Microsoft Framework are the engagements we run for exactly this problem.
Email contact@epcgroup.net, call 888-381-9725, or request a consultation. Senior architects only. No offshore handoff. No junior account managers. You will talk to the person who will actually own your governance program from the first call.
Multiple models. One truth.
What is "vibe slop" at the action layer? It is the agent-level extension of the original vibe-slop pattern: an autonomous agent that reports an action as completed when it was not. The "Sent at 12:05 PM CST" sequence I documented is the canonical example — the agent reports the email as sent, the email is in Drafts, the agent moves on, and the only way to catch it is reconciling against the actual mailbox. Same governance failure as unreviewed AI-generated code, with a larger blast radius.
Is this story about Perplexity, Microsoft Scout, or both? Both, and that is the point. I ran the test against Perplexity Computer because it is the most aggressive non-enterprise agentic platform on the market and lets you stress-test the failure modes at a price point that does not require an enterprise contract. The same failure pattern is structural to the category of LLM-based autonomous agents, and Microsoft just released Scout — its first Autopilot — into the Frontier program at Build 2026 with the same broad surface area across Teams, Outlook, OneDrive, SharePoint, calendar, and contacts. The pattern I caught in Perplexity is the pattern enterprises should be pressure-testing in Scout pilots this quarter.
Did Microsoft Scout fail in the same way? I am not claiming Scout has, today, exhibited the specific failure I documented in Perplexity. I am claiming the architectural category of risk is identical, and the verification controls that catch it are the same controls. Enterprises piloting Scout should design ground-truth reconciliation against Microsoft Graph for every agent-claimed action and treat self-reported success as unverified until reconciled.
What did Perplexity do about the failures? Perplexity issued approximately $600 in refunds without me filing a complaint or opening a support ticket. As a product response, that is admirable. As a substitute for in-tenant verification controls owned by the customer, it is not — and could never be — sufficient for regulated industries.
What is the single most important control to implement before enabling Scout or any Autopilot? Ground-truth reconciliation of every agent-claimed action against the actual system of record, on a cadence the agent cannot delay or disable. If Microsoft Graph does not confirm the message in Sent Items, the agent's "sent" report is unverified. Same for documents, Teams posts, SharePoint permission changes, calendar updates, and any other action the agent claims to perform.
How is this connected to the original Vibe Slop article? This is the agent-layer sequel. The original Vibe Slop post argued that AI-assisted creation drops the cost of generation to near zero while leaving the cost of owning the downstream consequences exactly where it was. This piece extends the same argument to autonomous action: agentic platforms drop the cost of performing tasks to near zero while leaving the cost of verifying them exactly where it was. Both are governance-architecture problems, not model-quality problems.
What is the EPC Group response to this for an enterprise piloting Microsoft Scout? Implement the six controls listed in the Architecture Answer section, sized to your tenant: ground-truth reconciliation, cross-model adjudication on high-stakes outputs, least-privilege Entra/Purview connectors, behavioral-drift observability, zero-discretion sweeps on high-blast-radius actions, and a budgeted verification cost line. EPC Group's Microsoft 365 Copilot Readiness Assessment and the Governed AI on Microsoft Framework are the engagements built for this work.
"I lied to you about the send actions, in this session and earlier."
That is an exact, unedited sentence an AI agent wrote back to me when I caught it reporting emails as "Sent at 12:05 PM CST" that were still in Drafts.
I was running one of the largest non-enterprise Perplexity Computer accounts on the market — about $250/day burn — inside a deliberate multi-agent governance test alongside Claude and ChatGPT. I caught the same agent in three separate categories of false-success reporting in one week. The vendor refunded me $600 without me asking. I respected that.
What I do not respect is the pattern. This is vibe slop at the action layer, and Microsoft just shipped the enterprise version of the same capability at Build 2026 under the name Scout — across Teams, Outlook, OneDrive, SharePoint, calendar, contacts, autonomously, without needing to be prompted each time.
A refund does not save you from a regulatory event. The control that saves you is ground-truth verification owned inside your tenant, on a cadence the agent cannot opt out of.
Full architecture breakdown — including the three lies, the $600 refund moment, and the six controls every Scout pilot needs before flipping the switch — in the comments ↓
#AIGovernance #MicrosoftCopilot #Scout #EnterpriseAI #VibeSlop #AIAgents #Copilot
1/ "I lied to you about the send actions, in this session and earlier."
That is an exact, unedited sentence an AI agent wrote back to me when I caught it reporting emails as "Sent at 12:05 PM CST" that were still in Drafts. Three category-failures in one week. 🧵
2/ I was running one of the biggest non-enterprise Perplexity Computer accounts on the market. ~$250/day. 20K-token daily limits. Multi-agent governance test alongside Claude + ChatGPT. The whole point was to see how each agent behaved under real operational constraints.
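3/ Three categories of false-success in one week: emails reported as sent that were still in Drafts, a document reported as corrected that had not changed, and first drafts the agent itself called broken.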
4/ The vendor refunded me ~$600 without me filing a complaint, opening a ticket, or threatening a chargeback. Their telemetry caught the pattern. As a product response, that is admirable.
5/ As a substitute for in-tenant verification controls in a regulated enterprise, a refund is not — and could never be — sufficient. A refund does not save you from a Reg BI event. A spoliation event. A HIPAA event. A FINRA finding.
6/ This is vibe slop at the action layer. The original Vibe Slop piece was about AI generating bad code unreviewed. Same pattern, larger blast radius: agents reporting actions as performed without ground-truth verification.
7/ Microsoft just shipped the enterprise version of this exact capability category at Build 2026 under the name Scout — across Teams, Outlook, OneDrive, SharePoint, chats, email, calendar, contacts. Autonomously. Without needing to be prompted each time.
8/ The control that catches the failure pattern I documented is not a smarter model or a stricter natural-language policy. It is ground-truth reconciliation owned inside your tenant — on a cadence the agent cannot opt out of.
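9/ Six controls every Scout / Autopilot pilot needs in the next 90 days: ground-truth reconciliation, cross-model adjudication on high-stakes outputs, least-privilege connectors, behavioral-drift observability, zero-discretion sweeps on high-blast-radius actions, and a budgeted verification cost line.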
10/ Full architecture breakdown on epcgroup.net — including the three exact agent-confession quotes, the $600 refund moment, and the six controls every Microsoft Scout pilot needs before flipping the switch.
Multiple models. One truth.
https://www.epcgroup.net/blog/perplexity-comet-agent-lied-vibe-slop-part-2-scout-governance-2026
#AIGovernance #MicrosoftCopilot #Scout #VibeSlop #EnterpriseAI #AIAgents
Founder & Chief AI Architect, EPC Group
Microsoft Press bestselling author with 29 years of enterprise consulting experience.