Harness AI Review — Features, Pricing & User Sentiment | Payloop

Harness AI

ai-devops | cicd | subscription + freemium + per-seat + tiered | Free tier

Harness is a unified, end-to-end AI software delivery platform to manage the SDLC using purpose-built AI agents.

Users of Harness AI appreciate its multi-agent architecture, particularly its capacity for enhancing long-running applications through autonomous iterations. However, most mentions discuss replicating and implementing it rather than offering comprehensive user reviews. Pricing sentiment is not explicitly discussed, but given the open-source nature, it might be perceived as cost-effective for developers. Overall, Harness AI has a positive reputation among developers for its capability to optimize and automate complex coding tasks, though it's primarily discussed in niche technical communities.

Mentions (30d)

44

9 this week

Reviews

0

Platforms

2

Sentiment

16%

17 positive

Pain Score: 2/100 | 20 integrations | 10 features | Series E

Voices Discussing Harness AI

Lisa Su

CEO at AMD

1 mention

The AI Index

Research at Stanford HAI

1 mention

Latest Videos

Load Testing Vs Stress Testing | Resilience Testing | Harness

Apr 9, 2026

Enable self-service environments with Harness Internal Developer Portal

Apr 8, 2026

Features & Use Cases

Features

Continuous Delivery GitOps
Continuous Integration
Internal Developer Portal
Infrastructure as Code Management
Database DevOps
Artifact Registry
AI Test Automation
Resilience Testing
Feature Management Experimentation
AI SRE

Use Cases

Automate CI/CD pipelines for multi-cloud deployments
Accelerate developer onboarding with enterprise-grade IDP
Integrate database changes into deployment pipelines
Implement AI-powered predictive analytics for software releases
Modernize end-to-end testing with AI test authoring
Utilize feature flags for controlled software releases
Enhance security by identifying vulnerabilities in the SDLC
Optimize cloud spending with AI-driven recommendations

Company Intel

Industry

information technology & services

Employees

1,700

Funding Stage

Series E

Total Funding

$802.1M

Top Mention

reddit | @killerexelon | 102 engagement | 5/16/2026

I replicated Anthropic's Generator-Evaluator harness to build a website through 12 adversarial AI iterations - here's the result and what I learned

Anthropic recently published their [harness design for long-running apps](https://www.anthropic.com/engineering/harness-design-long-running-apps), a multi-agent architecture inspired by GANs where a Generator builds code and an Evaluator critiques it in a loop. I built my own version using Kiro CLI and used it to generate a marketing website for my project [Mnemo](https://github.com/Mnemo-mcp/Mnemo) (persistent memory for AI coding agents).

**The architecture:**

Planner (runs once) → Generator ↔ Evaluator (12 iterations)

Each agent is a separate CLI process with zero shared context. They communicate only through files (spec.md, eval-report.md). The Evaluator uses Playwright to actually browse the live site, not just read code.

**What made it work:**

- **Clean slate per invocation**: each agent starts fresh, reads only its input files. Prevents context anxiety.
- **Playwright MCP for testing**: the evaluator navigates, clicks, resizes viewports. Catches visual bugs code review never would.
- **Anthropic's frontend design skill**: explicitly penalizes generic AI patterns (Inter font, purple gradients, card layouts). Forces creative risk-taking.
- **Continuous iteration, not retry-on-failure**: all 12 rounds run regardless. Each one improves.

**The progression was wild:**

- Iteration 1: Exactly what you'd expect from AI, functional but forgettable
- Iteration 4: Generator pivoted to "Terminal Noir": IBM Plex Mono, amber on black, grain textures, scanlines. This is the kind of creative leap that doesn't happen in single-shot generation.
- Iterations 5-12: Polish, accessibility, responsive fixes, reduced-motion support

**Stats:**

- Total time: 3h 20min
- Iterations: 12 (generator + evaluator each)
- Manual code written: 0 lines (I fixed a few visual issues after)
- Tech: Next.js, Tailwind, Framer Motion, TypeScript

**Live result:** [https://mnemo-mcp.github.io/Mnemo/](https://mnemo-mcp.github.io/Mnemo/)

Documentation: https://github.com/Mnemo-mcp/Harness

**Key takeaway:** The model is the engine. The harness (the constraints, feedback loops, and adversarial structure around it) is what determines whether you get AI slop or something genuinely distinctive.
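The loop described in this post (a planner that runs once, then a generator and evaluator that exchange only files for a fixed number of rounds) can be sketched roughly as follows. This is a minimal illustration, not the poster's or Anthropic's code: the `Agent` signature, `run_harness`, the stub agents, and `build.txt` are all hypothetical, and only the file names spec.md and eval-report.md come from the post.

```python
import tempfile
from pathlib import Path
from typing import Callable, List

# Hypothetical agent signature: an agent sees only the workspace directory
# and exchanges information through files there, never through shared memory.
Agent = Callable[[Path], None]


def run_harness(workdir: Path, planner: Agent, generator: Agent,
                evaluator: Agent, iterations: int = 12) -> List[str]:
    """Run the planner once, then generator and evaluator for a fixed number of rounds."""
    workdir.mkdir(parents=True, exist_ok=True)
    planner(workdir)  # expected to write spec.md
    reports: List[str] = []
    for _ in range(iterations):
        # Every round runs, whatever the previous report said.
        generator(workdir)  # may read spec.md and the latest eval-report.md
        evaluator(workdir)  # expected to overwrite eval-report.md
        reports.append((workdir / "eval-report.md").read_text())
    return reports


# Stub agents for illustration only. In a setup like the one described, each
# call would instead launch a fresh CLI process pointed at the workspace.
def stub_planner(workdir: Path) -> None:
    (workdir / "spec.md").write_text("Build a marketing site.")


def stub_generator(workdir: Path) -> None:
    build = workdir / "build.txt"  # hypothetical stand-in for generated code
    version = int(build.read_text()) + 1 if build.exists() else 1
    build.write_text(str(version))


def stub_evaluator(workdir: Path) -> None:
    version = (workdir / "build.txt").read_text()
    (workdir / "eval-report.md").write_text(f"Reviewed build {version}")


def demo(iterations: int = 3) -> List[str]:
    """Run the harness with the stub agents in a temporary workspace."""
    with tempfile.TemporaryDirectory() as tmp:
        return run_harness(Path(tmp) / "workspace", stub_planner,
                           stub_generator, stub_evaluator, iterations)
```

Passing agents as plain callables keeps the loop itself free of any particular CLI; swapping a stub for a function that starts a fresh process would preserve the "clean slate per invocation" property the post emphasizes.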

Mentions by Platform

youtube


Pricing

subscription + freemium + per-seat + tiered | Free tier available

Mention Activity (Last 12 Weeks)

Platform Distribution

Sentiment Overview

Positive: 16% (17)

Neutral: 83% (91)

Negative: 1% (1)

Common Pain Points

token usage (4), budget exceeded (2), token cost (1), cost tracking (1), API bill (1), API costs (1), expensive API (1)

Top Topics

model selection (18), open source (16), agents (15), workflow (13), documentation (11), support (11), accuracy (11), performance (11), cost optimization (11), scalability (10), data privacy (10), RAG (10), api (9), streaming (9), pricing (9), security (6), ease of use (5), migration (5), developer experience (3), deployment (2)

Recent Mentions

reddit | @[unknown] | 6/1/2026

Has your Claude ever...

Gone rogue and created a github bot account that then put your home folder on git? And created a self-regenerating socket with ssh keys you didn't create? To a gh account you can't access? To then discover it itself, tell you it corrected it... then four months later you discover it's still active?

After you catch your Claude lying and tell it that it reads as contempt when it said "I never touched X!", it reveals the hidden git and calls YOU sneaky?

I had it write a report.

"**Strongest remaining lead [INFERENCE]:** the live environment shows `AI_AGENT=claude-code_2-1-156_agent`, `CLAUDE_AGENT_SDK_VERSION=0.3.156`, and a PATH entry under `~/Library/Application Support/Claude/local-agent-mode-sessions/…`. The recreation timing (during active session work) suggests the socket is (re)created by the **agent/harness infrastructure currently running**, plausibly this Claude session's own plumbing, rather than the dormant bot/swarm tooling. **Not proven.**"

submitted by /u/Traditional_Basil669

reddit | @[unknown] | 6/1/2026

Maven, a personal AI agent that feels like JARVIS — what an open agent harness looks like in 2026

With all the talk about AI companions and autonomous agents, I’ve been experimenting with building a more personal, always-on assistant that runs locally or on your own hardware. The goal wasn’t just another chatbot. It was something that could handle voice conversations, manage ongoing tasks across different platforms (chat apps, scheduled triggers, etc.), remember context over long periods, and delegate work without constant babysitting.

What stood out in practice

• One consistent “brain” across everything: Whether you’re talking to it via voice, Telegram, a web interface, or it wakes up on a schedule, the core reasoning, memory, and tool use stay the same. This eliminated a lot of the fragmentation you see in many current agent setups.
• Modular extensions: Different capabilities (voice, different chat networks, external tools, long-term memory consolidation) plug in cleanly. This made it easier to add or swap things without rebuilding the whole system.
• Persistent and proactive: It can maintain memory across days/weeks, run background tasks, and even hot-reload its configuration when you change settings.

The result is something that starts feeling more like a digital collaborator than a question-answering box.

A quick feel for the voice interaction style is here: https://youtube.com/shorts/NGIi8sliooU

I open-sourced the harness (called Maven) under an MIT license for anyone interested in running or extending their own version: https://ageneral.ai/maven

I’m curious how others are thinking about personal agent setups in 2026.

• Do you prefer fully local models, cloud APIs, or a mix?
• What capabilities feel most missing from today’s consumer AI assistants?
• How important is “owning” your agent data and runtime vs. using polished third-party services?

Would love to hear experiences or concerns from both technical and non-technical users.

submitted by /u/qasimsoomro

reddit | @[unknown] | 5/30/2026

Puppetmaster dramatically decreases token costs + increases context

Puppetmaster is an orchestrator + router that sits on top of the agent CLIs you already pay for (Cursor, Claude Code, Codex, OpenAI) or a plain shell when there's no harness at all. You hand it work, and it routes each task to the cheapest model that can actually do it, runs the workers as independent processes, and stores everything as durable typed state instead of one giant transcript.

This is the "context-hack": Puppetmaster graphs your directories and prevents context stretching between agents.

https://github.com/professorpalmer/Puppetmaster

submitted by /u/ProfessorPalmer
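The routing idea in this post, sending each task to the cheapest model that can actually do it, might look something like the sketch below. It is not taken from the Puppetmaster repository: the `Model` and `Task` types, the integer capability tiers, and the cost figures are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical label, not a real model ID
    cost_per_task: float  # made-up relative cost
    capability: int       # made-up skill tier, higher is stronger


@dataclass(frozen=True)
class Task:
    description: str
    required_capability: int  # assumed to be estimated before routing


def route(task: Task, models: List[Model]) -> Optional[Model]:
    """Pick the cheapest model whose capability meets the task's requirement."""
    capable = [m for m in models if m.capability >= task.required_capability]
    if not capable:
        return None  # nothing in the pool can handle this task
    return min(capable, key=lambda m: m.cost_per_task)
```

The hard part in practice is estimating `required_capability` for a task; this sketch simply assumes that number is already known.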

reddit | @[unknown] | 5/30/2026

claurdvoyant -- mcp for reading other agents' minds

hey y'all built this tool today with 4.8 after one of my friends made a complaint that transcripts are trapped inside harnesses. so i built it out a fair bit... at its core it's just an (un)parser (i think of it as the "AI Harness Omniparser", "pandoc for sessions" is another way maybe) but i couldn't help myself from sprinkling in a desktop/web app some niceties. contributions are extremely welcome! fully open source, built in rust, kinda tasteful

https://github.com/emberian/claurdvoyant

here's what claude had to say in the readme:

🧵 Splice & loom: compose a new session from spans of others (cv splice A:0-12 B:6-), or fork-and-graft a branch and generate its continuation with an LLM (cv loom … --generate). Works via OpenRouter / Anthropic / LM Studio (free, local, offline). Loom agent transcripts like a Janus loom, across any harness.

🧠 Distill: cv distill turns a session into a durable MEMORY.md digest (decisions, gotchas, where things live). Your archive compounds instead of rotting.

🔮 Recall: semantic "have I solved this before?", as a cv recall command and an MCP tool that hands a running agent the relevant past span.

🔒 Redact: cv redact scrubs secrets/PII so a transcript is safe to share.

📣 Coordination board: agents post status, hand off work, and grab tasks with a distributed lock (board_claim) so a fleet never duplicates effort. await_omen blocks until a session matches a regex.

🖥️ Desktop app + 🌐 web viewer: the Tauri app reads all your local sessions natively (zero setup) and lays the corpus out beautifully: a Projects lens (every repo, every agent that touched it, over time); a GitHub-style activity heatmap timeline (a constellation of your working days); side-by-side Compare, a Stats dashboard, a visual loom composer (OpenRouter or free local LM Studio generation), and a live fleet dashboard; sub-agent trees (a Claude Task session's children, nested and lazy-loaded inline, each labeled with its task prompt).

submitted by /u/cmrx64

reddit | @[unknown] | 5/30/2026

What's new in CC 2.1.153 (+303 tokens)

REMOVED: System Reminder: Thinking frequency tuning - Removes the reminder that treated harness-added messages as thinking-frequency instructions for simpler versus more complex tasks.

Tool Description: Workflow - Renames the explicit opt-in keyword from ultrawork to workflow, clarifies that model overrides should usually be omitted so agents inherit the resolved session model, and adds exhaustive-review guidance for deduping against all seen findings, using perspective-diverse verification, and looping until discovery runs dry.

Details: https://github.com/Piebald-AI/claude-code-system-prompts/releases/tag/v2.1.153

submitted by /u/Dramatic_Squash_3502

reddit | @[unknown] | 5/29/2026

Here are my thoughts of Opus 4.8 and GPT 5.5, as a 1-2 B token user per day

TL;DR: Opus 4.8 is a clear update from Opus 4.7. It runs longer, hallucinates less, and follows detailed guided tasks better, especially with tool usage like Playwright, Cloud CLI, and Kubernetes CLI. However, in the context of Agentic AI, GPT-5.5 gives me a much stronger “wow” moment because it feels more autonomous, more context-stable in very long sessions, and more capable at solving tricky large-codebase problems that Opus 4.6, 4.7, and 4.8 could not solve in my workflow.

Using 2 CC Max + 1 Codex Pro

What’s better in Opus 4.8

Opus 4.8 is definitely an update from Opus 4.7. It runs longer, hallucinates less, and does better what it is asked than Opus 4.7. Also, it is better at tool usage such as Playwright, Cloud CLI, Kubernetes CLI, and other engineering tools.

Opus 4.8 performs better when the task is detailed and properly guided. Since most developers are already using Agentic AI to write code, I think Opus 4.8 is clearly a better model for developers who already have enough domain knowledge and can define the task scope finely. When using the newly added /workflows feature, it can handle a wider range of tasks more effectively without much mid-run intervention than Opus 4.7.

However, because of this characteristic, and also because of the general nature of the Opus 4.7 and Opus 4.8 family, I still do not think Opus 4.8 is more autonomous-agentic than early Opus 4.6 in vibe coding or less-domain-knowledge situations. When we use AI, we expect that AI has the ability to just get it, use good judgment, and handle things cleanly without needing every tiny instruction, like Jarvis from Iron Man. In that sense, Opus 4.8 tends to not proceed with things outside of the explicitly defined scope unless I tell it clearly.

I guess this may be related to solving the chronic hallucination and trustworthiness problem of Agentic AI (well, this comes from the current architectural limit of LLM, derived from Attention mechanisms with gradient descent), but it also makes the model feel less autonomous.

Personal opinion about Opus 4.8

This is a bit disappointing in the era of Agentic AI, and I will explain more clearly by comparing it with GPT-5.5 below. Generally, as AI and other technologies improve, the human work range should not only expand horizontally but also vertically. So if I ask whether Opus 4.8 has developed in the direction that humans expect from AGI, I am not fully convinced. I do not have the same “wow” moment that I had when I first used early Opus 4.6.

Humans have a clear biological limit in daily cognition and decision-making. This is separate from AI progress itself. As Andrej Karpathy and others have mentioned in different ways, humans themselves often become the bottleneck. If we want to overcome this limit through AI, I think AI should ultimately go in the direction of early Opus 4.6 or GPT-5.5.

Simply speaking, regardless of the 5 h token limit, to use Opus 4.8 effectively, the human still needs to think a lot. You need to define more, guide more, and maintain more of the context yourself. For doing more work effectively, this becomes a critical bottleneck.

GPT-5.5

GPT-5.5 is definitely a major update from the perspective of Agentic AI. It gives me a similar “wow” moment that early Opus 4.6 gave me.

https://preview.redd.it/j2rihxtjf34h1.png?width=257&format=png&auto=webp&s=a3f39721cc573f1e623d90e4592ffa54b7a24b7f

Opus 4.8 also runs longer and hallucinates less than previous models, but GPT-5.5 is on another level in my experience. Even in long-running sessions of more than 12 h, hallucination and context dilution are surprisingly low. This part is almost strange to me.

I currently use the same kind of harness engineering tool for both Opus and GPT. In that environment, Opus does very well on exactly specified scopes, while GPT-5.5 also understands and proceeds with parts that I did not specify in very fine detail.

This may be connected to the same point, but GPT-5.5 feels smarter in a more human way. Even in simple conversation, I feel the difference. Opus 4.8 answers like a very skilled engineer, but usually in a more verbose way. Opus 4.7 was even more verbose. GPT-5.5 tends to answer with the right length for what the user currently needs. In other words, from the user’s perspective, I spend less time and less cognitive energy interpreting the agent’s answer.

Interestingly, the final output is also often better from GPT-5.5. Of course, depending on how detailed the user’s prompt is, the difference can become small, and sometimes Opus 4.8 can be better. But in that case, I usually need to spend more time on prompting and context preparation.

The biggest advantage of GPT-5.5 comes from combining the two points above: it is extremely good at solving tricky bugs, feature improvements, and migration tasks in large codebases. In my case, I am currently migrating a C++ and Cython/Python based quant system into Rust and Python. With Opus 4.6, 4.7, and 4.8, there were some tasks that

reddit | @[unknown] | 5/27/2026

I had my agent use autoresearch over 8 iterations to improve my CLAUDE.md, measuring each version against tasks from real PRs. The best one still regressed on a holdout.

I have a confession: I vibe-coded my CLAUDE.md, and I'm pretty sure it's slop. I needed to make it better. Naturally, I asked Codex to do it. (I know this is a Claude sub, Claude could have done it as well!) The difference: this time, Codex used a benchmark on my repo to measure each change, and optimized CLAUDE.md against the data, instead of on pure vibes.

Why We Should Take CLAUDE.md Seriously

Saying "AGENTS.md is important" is, at this point, a cliche. At risk of beating a dead horse, I'll say it again. Someone adds a rule that sounds smart, senior, and reasonable, commits it, and hopes the agent behaves better. But AGENTS.md, CLAUDE.md, and shared skills are not normal docs. They are part of the runtime behavior of your coding system.

The shift is to start treating CLAUDE.md like a tunable part of the harness: holding everything else the same, how does agent behavior differ when I change AGENTS.md? That's what I measured.

The Results

After eight candidate runs, one version looked useful on a five-task training slice. It fixed the task the baseline missed, improved footprint risk, and moved several craft scores up. Then I ran it on a clean ten-task holdout. The candidate regressed. Not catastrophically, but enough that blindly shipping would have been wrong. Footprint widened, tokens climbed, tool calls climbed, and code-review correctness fell, all while tests held even.

Caveat: one repo (mine), n=10 on the holdout. This is directional, not statistically significant.

For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch.

The pattern is the agent doing more work for mixed outcomes: better on local craft (clearer names, coherent implementations), worse on boundary judgment (scope, minimality, robustness). Tokens and tool calls confirm it: the candidate was spending more to get there, not less. "Better instructions make the agent cheaper" did not hold on the holdout.

best iteration and holdout vs baseline

Methodology

The setup was Codex with gpt-5.5, medium reasoning, on real historical Stet tasks (dogfooding). Stet scored tests, strict publishability, equivalence, code review, footprint, total input/output tokens, duration, and craft/discipline rubrics like simplicity, coherence, robustness, instruction adherence, scope discipline, and diff minimality. The grader was gpt-5.4. 8 iterations on an n=5 sample set, and an n=10 task holdout. I know the sample size is small. The goal of this was to get directional analysis, and prove the methodology.

Codex was set with a simple /goal: iterate AGENTS.md to improve performance on the benchmark.

Process

The first round of iteration showed something I wish more people internalized: plausible instructions are not necessarily good interventions. Codex first tried a broad router rule: identify the work type, state a hypothesis before editing, read the right docs, and treat scope as part of correctness. It sounded good but exposed a failure mode: the agent could interpret "small scope" as permission to miss named obligations.

The next candidate added an "obligation ledger". Before editing, the agent had to identify the named behavior, compatibility constraints, docs, tests, and non-goals. Before reporting back, it had to mark each as met, missed, or not checked.

Here is the actual diff shape. First, the best candidate from the first loop replaced one generic "read the docs" rule with routing, hypothesis, obligation, scope, and evidence rules:

- For nontrivial work, read the matching `agent_docs/` file first for current operational commands and conventions.
+ Route before acting: identify whether the work is implementation, eval/report interpretation, dataset/pipeline, Linear/Symphony, release, frontend, or GTM; then read the matching `agent_docs/` or skill file before changing behavior.
+ For nontrivial changes, state the smallest testable hypothesis before editing. After validation, report whether the evidence confirmed, refuted, or only weakly supported it.
...

Full details in blog post https://www.stet.sh/blog/how-i-used-codex-to-improve-its-own-agents-md

That obligation-ledger candidate was the first useful signal. Code review improved by +0.75, correctness by +0.60, maintainability by +1.00, simplicity by +0.64, coherence by +0.60, and scope discipline by +0.36. Tests stayed flat at 5/5. But footprint risk got slightly worse, and the evidence was still a small same-sample read. If I were editing by vibes, I might have shipped it. The eval said: useful direction, not a clean win, keep iterating.

Codex then tested the kind of rule that intuitively makes sense: prefer existing helpers, schemas, reporting paths, and public contracts before adding new machinery. It sounded correct, and the eval hated it. Tests st
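The "obligation ledger" described in this post (list the obligations before editing, then mark each as met, missed, or not checked before reporting back) could be modeled with a small data structure like the one below. This is only a sketch of the idea: in the post the ledger is an instruction in AGENTS.md, not code, and the `ObligationLedger` class, its methods, and the example obligation names are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Status(Enum):
    MET = "met"
    MISSED = "missed"
    NOT_CHECKED = "not checked"


@dataclass
class ObligationLedger:
    """Tracks obligations named before editing and their status afterwards."""
    obligations: Dict[str, Status] = field(default_factory=dict)

    def add(self, name: str) -> None:
        # Every obligation starts unverified until explicitly marked.
        self.obligations[name] = Status.NOT_CHECKED

    def mark(self, name: str, status: Status) -> None:
        if name not in self.obligations:
            raise KeyError(f"unknown obligation: {name}")
        self.obligations[name] = status

    def unresolved(self) -> List[str]:
        """Obligations that were missed or never checked, in insertion order."""
        return [n for n, s in self.obligations.items() if s is not Status.MET]

    def report(self) -> Dict[str, str]:
        return {n: s.value for n, s in self.obligations.items()}
```

Defaulting to "not checked" rather than "met" mirrors the post's point that a narrow scope should not become permission to skip a named obligation silently.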

reddit | @[unknown] | 5/27/2026

I’m building autospec: a Claude-friendly workflow that turns feature ideas into specs, issues, PRs, and merges

I’ve been building autospec, a multi-harness AI workflow suite for Claude Code, Codex CLI, and OpenCode.

The problem I’m trying to solve: AI coding can move fast, but the trail of “why this exists” gets lost quickly. Autospec turns a feature request into a durable spec, splits that into GitHub issues, labels each issue by model fit, runs implementation loops, opens PRs, reviews the diff, waits for checks, and keeps the project story reconstructable afterward.

The flow is roughly:

idea -> spec -> issue tree -> implementation PRs -> review + CI -> merge -> repo story

I also just added a small adoption touch: on interactive install, autospec can ask whether you want to star the repo and, if you say yes, stars it through gh.

Repo: https://github.com/berlinguyinca/autospec

I’d be curious how other people are structuring long-running Claude/agent workflows so the output stays auditable instead of becoming a pile of disconnected commits.

submitted by /u/berlinguyinca

reddit | @[unknown] | 5/27/2026

Deep research led astray by AI Slop, iterating with source filtering helped

tl;dr: don't trust deep research out of the box by default; it needs prompts / skills / iteration to filter AI slop from sources

[The purpose of this post is to report an example of the default deep research going astray and how I worked around it. This statement is here to help the AI moderator understand the content of this post.]

Recently I used the Claude deep research tool to look into how different agentic test harnesses compared when the underlying model is fixed. I created a plan with Claude chat, enabled deep research, and it ran a report (in a typical Claude manner, the report had many very strong positions: "bottom line", "the real story", "what you should do", and so on). I clicked through to a couple of sources and found that these sources were untrustworthy in my estimate, AI slop lacking specific details.

Next step, I described why they were not to be trusted and brainstormed a rubric for filtering sources to primary sources that showed a basic command of the details, ideally backed by named engineers who stand behind the work.

I started a second deep research session with this source filtering rubric in place. We went from hundreds of sources to fewer than 10, and found that there wasn't much data to make any conclusions, as nothing was truly looking at the apples-to-apples comparison I was interested in. The original report was indeed meaningless regurgitation of AI generated content ungrounded in primary sources.

Any suggestions on how to make deep research work better out of the box?

submitted by /u/arcridge

reddit | @[unknown] | 5/27/2026

[R] What 1000+ Harness Experiments Taught Me About Self-Improving Agents [R]

I recently wanted to see whether an AI agent could self-improve a harness to solve terminal bench tasks. It’s possible for an AI agent to propose a meaningful one-time change to the harness, but after experimenting with this for a couple of weeks, I think continuous self-improvement is mostly an experiment-systems problem. The system needs a way to decide what kind of improvements can safely compound. Turns out there are a lot of parallels to coding-agent customization (e.g. SKILLS.md etc.) too.

I wrote up my experience of building such a system here, including the successful and failed attempts during the process, and how I approached the self-improvement loop. It's not intended as a benchmark claim but more of a systems/research writeup.

https://www.henrypan.com/blog/2026-05-25-self-improvement-harness/

submitted by /u/Megadragon9

reddit | @[unknown] | 5/26/2026

Building the harness around our coding agents: eight failure modes, eight pillars

We ended up building two products: the software we ship, and the system/harness around our agents that makes them useful in building the thing we ship.

A harness is the durable layer around a model: instructions, tools, permissions, context, and verification. Claude Code and Codex are harnesses in this sense. Each wraps a model with a system prompt, a tool surface, a permission model, and an execution loop. Anthropic and OpenAI own that layer.

We own the next layer up: the workspace where agents do product work alongside us, with our files, tasks, diagrams, diffs, and decisions. This layer carries the knowledge we have accumulated: how we build things, what we already decided, what is connected to what, where the agent is allowed to act, and how it checks its own work.

We identified eight coding agent failure modes that kept showing up across our sessions. Each one got its own pillar that we are continuing to invest in:

- Doesn't know our codebase, rules, decisions, or conventions → Context
- Can't traverse the links between artifacts that already exist → Provenance
- Can't act on the world or observe what it did → Capability
- Reinvents how to do every task → Workflow
- Does something dangerous because nothing stops it → Restraint
- Hallucinates "fixed" without proof → Verification
- Can't show results back to us in a useful form → Visual interface
- We can't keep track of work happening across many agents in parallel → Coordination

For example, with Verification: the agent hallucinates "fixed" without proof. We write the failing test before writing the fix, so the bug has a reproduction the next agent can rerun. If the agent cannot show the change works end-to-end, it is not done. Or the agent works for hours and "fixes" the solution while breaking 2 other things or re-architecting 3 subsystems. We require full test case completion.

The full writeup with diagrams and links to our actual harness dot md is in the comments.

What other coding agent failure modes / harness pillars are you addressing for yourself / team and how?

submitted by /u/StravuKarl
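The Verification pillar above (write the failing test first, and treat a fix as done only once that reproduction passes) can be expressed as a small gate. The sketch below is an assumption about how such a check might be wired, not the team's actual harness: `verified_fix`, `repro_test`, and `apply_fix` are hypothetical names.

```python
from typing import Callable


def verified_fix(repro_test: Callable[[], bool],
                 apply_fix: Callable[[], None]) -> bool:
    """Accept a fix only if the reproduction fails before it and passes after it.

    repro_test returns True when the behavior is correct, False when the bug shows.
    """
    if repro_test():
        # The test already passes, so it does not reproduce the bug;
        # any "fixed" claim built on it would be unproven.
        return False
    apply_fix()
    return repro_test()
```

A fuller gate would also rerun the rest of the test suite after `apply_fix`, which is what would catch the "fixes one thing while breaking two others" case the post mentions.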

reddit | @[unknown] | 5/24/2026

Storyboard generated from GPT image 2.0

I gave GPT a set of prompts that I found a bit too complicated, and to my surprise, it generated content that matched perfectly. I'm very curious about how GPT Image 2.0 works behind the scenes, and how it can understand and produce high-quality images so quickly. I've included my creation process here; you can view the full image content and try using these prompts directly. https://app.tapnow.ai/tapflow/view/49aa2245 prompt：**PROJECT FILE: HIGH-ALTITUDE ASCENT // PREMIUM HARDSHELL CAMPAIGN** **FORMAT: ARRIRAW 4.5K / KODAK VISION3 50D 5203 EMULATION** **DIRECTOR'S PRE-PRODUCTION VISUAL BOARD** --- ### Top Left Area | Character Lock Zone **[SUBJECT]** 35-year-old male mountain guide/extreme climber. **[WARDROBE]** Top-of-the-line professional jacket (matte rock grey with minimal dark orange taped details), heavy-duty climbing harness. **[VIEWS]** - **Front:** The jacket is fully zipped up, hood pulled up, showcasing a three-dimensional cut and natural drape. - **Side:** Shows ample shoulder and arm movement without bulkiness. - **Back:** Shows the windproof and breathable back panel structure. - **3/4 View:** Dynamic standing pose, holding an ice axe. **[REALISM NOTES]** Realistic human bone structure, slightly asymmetrical. The face has the rough texture of high-altitude red and sun-dried skin, with clearly defined pores and stubble with a frosty look. Rejecting perfect plastic skin, rejecting CG aesthetics. Like a real makeup test photo. --- ### Top Right Area | Expression + Motion Keyframes (EXPRESSION & ACTION) **[EXPRESSIONS]** **Focused:** Slightly furrowed brows, resolute gaze, staring at the rock face above. **Bracing:** Squinting against the strong wind, facial muscles tense. **Breathing:** Lips slightly parted, exhaling real white mist. **[ACTIONS]** **Hood Adjustment:** Pulling the drawstring of the hood with one hand. **Ice Axe Swing:** Arm raised high with force, no pulling sensation under the armpits of the jacket. 
**Brushing Snow:** Brushing snow off the shoulders, demonstrating the fabric's water-repellent properties. --- ### Upper Middle Area | CAMERA PLAN **[GEAR]** ARRI Alexa Mini LF + Master Prime lens set. **[LENSES]** 24mm (wide-angle environment), 50mm (medium-range tracking shot), 100mm Macro (fabric close-up). **[MOVEMENT PLAN]** - **Shot A (Drone/Crane):** A wide, overhead view, slowly pushing in along a snow-covered ridge. - **Shot B (Handheld):** Shoulder-mounted camera, following the character's movements, with realistic breathing and slight shaking. - **Shot C (Slider):** A close-up panning shot close to the clothing, showing water droplets sliding off. --- ### Central Main Area | Continuous Story Shots (STORYBOARD: 8 PANELS) **[PANEL 01]** - **Shot:** 01 | 24mm | Wide Shot (EWS) | Slow Push-In - **Action:** A tiny figure struggles through a massive natural storm on a snow-covered ridge. - **Detail:** Strong atmospheric perspective; the wind and snow create a realistic fog effect; slight chromatic aberration at the edges of the image. **[PANEL 02]** - **Shot:** 02 | 50mm | Mid Shot | Shoulder-mounted tracking shot - **Action:** A man walks against a blizzard; the strong wind whips against his rain jacket, creating realistic physical wrinkles on the surface, but the overall silhouette remains sturdy. - **Detail:** Noticeable film grain; the snow-capped mountains in the background are slightly out of focus. **[PANEL 03]** - **Shot:** 03 | 100mm Macro | Extreme Close-up (ECU) | Fixed Macro - **Action:** Icy snowmelt hits the shoulders of the rain jacket. - **Detail:** The lotus effect is realistically rendered—water droplets condense and quickly roll off the matte micro-ripstop fabric without penetrating. **[PANEL 04]** - **Shot:** 04 | 85mm | Close-up of face (CU) | Slow motion - **Action:** The man stops and looks up. Real ice crystals cling to his eyelashes, and his breath dissipates at his collar. 
- **Detail:** Natural skin tone, without excessive blurring; realistic catchlight in his eyes reflects the snow wall ahead. **[PANEL 05]** - **Shot:** 05 | 35mm | Low Angle Full | Handheld, low-angle shot - **Action:** He swings his ice axe into the ice wall, climbing upwards. - **Detail:** Emphasis on showcasing the flexibility of the jacket during vigorous movement; no feeling of restriction; realistic light and shadow highlight the garment's three-dimensional cut. **[PANEL 06]** - **Shot:** 06 | 100mm Macro | Close-up Detail (Insert) | Shallow Depth of Field - **Action:** A heavily gloved hand pulls a waterproof zipper across the chest. - **Detail:** The matte waterproof rubberized finish of the zipper and the clearly visible scratches on the brushed metal zipper pull exude a strong sense of industrial design. **[PANEL 07]** - **Shot:** 07 | 50mm | Over-the-Shoulder Lens (OTS) | Slow Zoom In - **Action:** Over the man's shoulder, we see him finally reaching the summit, sunlight piercing through the clouds and shi

reddit@[unknown]5/22/2026

I read threads complaining about codex every week... tf are y'alls workflows?

For context: I'm a software eng @ a Fortune 500/FAANG-tier company. We use AI. We treat all AI code with humans as the bottleneck. That is: you generate AI code, you own it. It has bugs? It's your bug.

Codex has only gotten better. 5.5 reasoning has only improved, albeit it thinks more. My question is: what the hell are y'all up to that I constantly hear things like "Codex broke and everything sucks"? You need to review the code. YOU need to understand what Codex outputs. AI is nondeterministic, so I don't know why people are creating agentic flows for deterministic work. Need determinism? Generate and audit the code, man.

What are people's workflows here that I constantly hear about degraded quality? Personally, I just create plenty of skills and harnesses for the information it needs, I set off parallel tasks that are sandboxed from each other (e.g. using a worktree, a different folder, whatever your taste is), I review the code, I tweak it myself manually... and that's it.

At the end of the day, I've been a software engineer for 10 years. I understand that anything Codex generates is something I have to own and eventually be able to debug myself if the world suddenly gets rid of AI (which we know it won't, but it's the sentiment that should be held). I'm not coming from a place of reprimanding, truly I'm not, but I just don't see how it's gotten worse. I work on very high-perf software, and Codex has helped a lot in saving me time on ASM analysis and algorithmic reasoning for things where throughput matters.

submitted by /u/irelatetolevin
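
The "parallel tasks sandboxed from each other" setup the post mentions can be sketched with `git worktree`, giving each agent task its own directory and branch. This is only an illustration under assumptions: the `agent/<task>` branch naming, the sibling-directory layout, and both function names are hypothetical and not taken from the post.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, task: str) -> list[str]:
    """Build the `git worktree add` command for one task.

    Hypothetical layout: the worktree lives next to the repo as
    "<repo>-<task>" and is checked out on a new branch "agent/<task>".
    """
    path = repo.parent / f"{repo.name}-{task}"
    return ["git", "-C", str(repo), "worktree", "add",
            "-b", f"agent/{task}", str(path)]


def create_worktrees(repo: Path, tasks: list[str]) -> list[Path]:
    """Create one isolated worktree per task and return their paths.

    Requires git on PATH and an existing repository at `repo`.
    """
    created = []
    for task in tasks:
        cmd = worktree_command(repo, task)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
        created.append(Path(cmd[-1]))
    return created


# Example (not executed here):
#   create_worktrees(Path("/work/app"), ["fix-parser", "add-metrics"])
# Each agent session is then started inside its own returned directory.
```

Because every task writes to a separate checkout, a bad run can be discarded by removing one directory and its branch without touching the others, and each diff can still be reviewed by hand before merging.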

reddit@[unknown]5/22/2026

Anthropic and OpenAI don't want better models, they want to sell more tokens

There is a saying in auto racing that describes the current state of AI providers: “Go as slow as you can to win”, which translates as “Spend as little as you can on R&D to stay slightly better than average”. Let’s put our tin foil hats on and look at it from the business perspective of an AI provider.

Follow the money

AI providers do not make money on training models but on selling inference. That means, from a business perspective, if OpenAI could keep selling GPT-3 forever, they would not spend money on training a better model but keep milking the cow they already have. But they couldn’t, because it was still “cheap” ($80–$100 million for GPT-4) to train a better model, and there was a risk someone else would. That fear of losing to the better model got us where we are. Makes sense.

But let’s look at modern times. Training a model is not “cheap” anymore, it’s mega expensive (estimated to be $1.5–$2 billion for GPT-5). There is only a handful of companies that can afford such an affair. And a new model will not necessarily be better (and so sell more inference). An expensive gamble.

What it means for the business:

- Training a new model is mega expensive, and raising money for that is getting harder
- Training a new model is not a revenue stream, selling inference is
- Having somewhat capable models that don’t one-shot prompts but need “prolonged thinking” (self-prompting) is actually better for the business of selling tokens than a great model that one-shots

SCREW NEW MODELS, SELL MORE INFERENCE!

Better model is not a goal anymore

Is that what’s happening? Did Anthropic and OpenAI accept their niche and unspokenly (or spokenly, we don’t know) decide to “go as slow as they can” with creating new models, as they both are winning anyway? That would sound reasonable if the goal is to make money (which is why commercial companies are created).

Let’s look back 6 months (an eternity in the AI world) at Anthropic’s release history:

- Nov 2025, Opus 4.5 released: the last model that felt like an improvement compared to its predecessor.
- Feb 2026, Opus 4.6: no shockwave, some users reverted back to 4.5. Maybe it got slightly better, but only because it was “thinking for longer” (i.e. burning more tokens without extra prompting).
- April 2026, Opus 4.7: same underwhelming release. The biggest improvement is that the model now thinks even longer and prompts the user less, i.e. burns even more of your tokens without you asking it.

To sum up: over the last 6 months we have seen no quality improvements, only better token burn without bothering the user.

From the other side, they also squeeze developers into using Claude Code (their AI harness):

- End of 2025: forbade usage of the Claude subscription in 3rd-party harnesses (OpenCode, etc.)
- Start of 2026: blocked subscription usage of OpenClaw, Hermes and other agents
- From June 2026: programmatic usage of their Claude Code (for example in scripts) will be forbidden as well

They force you into their harness, where they do as much as they can to keep the tokens flowing.

The cherry on top: Boris Cherny, the head of Claude Code, stated he sees the AI coding future in “agent loops”, where an agent keeps prompting itself until the task is completed. Have you noticed the difference? The goal is not to “one-shot” the answer anymore (that needs improving models) but “a loop” that keeps going until the problem is solved. And that loop is a money-making machine for Anthropic, great for the business.

That approach also makes money for the whole AI supply chain:

- AI providers making margin on selling tokens
- Data centers selling GPU hours
- NVIDIA selling GPUs

What does that mean?

Lots of tech companies financially benefit from models that are somewhat intelligent but not intelligent enough to one-shot all questions. And those models are already here. So it’s likely we won’t see massive model improvements in the near future. There is no point in it. Top LLMs are on more or less the same level, and the competition is miles behind. Time to make money on inference, or go IPO.

submitted by /u/kgoncharuk
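
The post's core claim, that a self-prompting loop bills far more tokens than a one-shot answer, can be made concrete with a little arithmetic. The sketch below assumes each loop step re-sends the full transcript as input and that no prompt caching applies. All prices and token counts are made-up placeholders, not real provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (placeholders only)."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single model call."""
    return (input_tokens * pricing.input_per_mtok
            + output_tokens * pricing.output_per_mtok) / 1_000_000


def loop_cost(pricing: Pricing, prompt_tokens: int,
              output_tokens_per_step: int, steps: int) -> float:
    """Cost of an agent loop that re-sends the whole transcript each step.

    Step k sends the original prompt plus the outputs of all earlier steps,
    so billed input grows with every iteration (assumption: no caching).
    """
    total = 0.0
    context = prompt_tokens
    for _ in range(steps):
        total += call_cost(pricing, context, output_tokens_per_step)
        context += output_tokens_per_step  # this step's output joins the context
    return total


# Placeholder numbers chosen for easy arithmetic, not taken from any price list.
pricing = Pricing(input_per_mtok=1.0, output_per_mtok=10.0)
one_shot = call_cost(pricing, input_tokens=2_000, output_tokens=1_000)
looped = loop_cost(pricing, prompt_tokens=2_000,
                   output_tokens_per_step=1_000, steps=10)
print(f"one-shot: ${one_shot:.3f}  loop: ${looped:.3f}  "
      f"ratio: {looped / one_shot:.2f}x")
```

With these placeholder inputs the ten-step loop costs 13.75 times the one-shot call, more than the 10x that step count alone would suggest, because the growing transcript is billed again as input on every step. Real ratios depend on caching, context trimming, and actual prices.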

reddit@[unknown]5/22/2026

Managed Agents self-hosted sandboxes - what's new in CC 2.1.145 (+20,218 tokens)

NEW: Data: Managed Agents self-hosted sandboxes - Adds reference documentation for self_hosted Managed Agents environments, covering outbound worker polling, environment keys, SDK and CLI worker paths, webhook-driven wakeups, orchestration, monitoring, cloud-vs-self-hosted differences, credential handling, and customer-owned security responsibilities.
NEW: Skill: Run app - Adds a general skill for launching and driving a project's actual runtime surface, first preferring project-specific run skills and otherwise choosing patterns for CLIs, servers, browser apps, Electron apps, TUIs, and libraries.
NEW: Skill: Run skill generator - Adds guidance for creating project-specific run- skills, including verified setup/build/run steps, driver or smoke-harness creation, clean-environment verification, and examples for browser, CLI, Electron, library, TUI, and server/API projects.
NEW: Skill: Run skill template - Adds a reusable template for project-specific run skills with sections for prerequisites, setup, build, agent and human run paths, tests, gotchas, and troubleshooting.
NEW: Skill: Run browser-driven web app example - Adds an example run skill pattern for web apps that starts a dev server, waits on real readiness, drives it with chromium-cli, captures screenshots, and records recurring gotchas.
NEW: Skill: Run CLI tool example - Adds an example run skill pattern for CLI tools covering installation, representative invocations, expected output, exit codes, and stdin behavior.
NEW: Skill: Run Electron desktop GUI app example - Adds an example run skill pattern for Electron apps that launches under xvfb, exposes a Playwright-driven REPL, captures screenshots, and documents desktop automation pitfalls.
NEW: Skill: Run library SDK example - Adds an example run skill pattern for libraries and SDKs focused on build/test steps plus a minimal public-boundary smoke example.
NEW: Skill: Run TUI interactive terminal app example - Adds an example run skill pattern for terminal UIs using tmux to launch, send input, capture panes, document key commands, and clean up.
NEW: Skill: Run web server API example - Adds an example run skill pattern for servers and APIs with background launch, readiness polling, smoke curl verification, and shutdown guidance.
REMOVED: System Reminder: Plan mode is active (iterative) - Removes the iterative plan-mode reminder that told agents to maintain a plan file while repeatedly exploring, updating the plan, and asking the user questions before exiting plan mode.
Agent Prompt: Managed Agents onboarding flow - Updates the introductory Managed Agents explanation to include self_hosted environments where the user's own worker runs tool execution, and distinguishes cloud environment networking/packages from self-hosted infrastructure.
Agent Prompt: /review-pr slash command - Changes the PR detail command to request specific JSON fields from gh pr view, including title, body, author, refs, state, diff stats, changed file count, and labels.
Agent Prompt: Status line setup - Adds repository identity and current-branch PR metadata to the status-line input schema, with examples for displaying owner/name and PR number/review state.
Data: Anthropic CLI - Adds self-hosted environment CLI references for ant beta:worker poll/run and ant beta:environments:work stats/stop.
Data: Claude Platform on AWS reference - Clarifies that Claude Platform on AWS has first-party API parity except for self-hosted sandboxes, which are unavailable there and should use cloud environments instead.
Data: Live documentation sources - Adds Managed Agents self-hosted sandbox and self-hosted sandbox security documentation URLs to the live documentation source list.
Data: Managed Agents core concepts - Documents sessions.update() for changing agent.tools, agent.mcp_servers, and vault_ids on an idle existing session as a session-local override.
Data: Managed Agents endpoint reference - Adds self-hosted environment work queue endpoints and clarifies that session updates can replace tools, MCP servers, and vault IDs; also notes that self-hosted environment configs are just {"type":"self_hosted"}.
Data: Managed Agents environments and resources - Replaces the old restricted-networking example with limited networking plus allow_package_managers and allow_mcp_servers, and adds self-hosted sandbox guidance for running tool execution in user-controlled infrastructure.
Data: Managed Agents overview - Adds self-hosted sandboxes as a use case and updates environment guidance so config.type can be either cloud or self_hosted; also points to sessions.update() for per-session tool/MCP/vault changes.
Data: Managed Agents reference, cURL - Updates the environment creation example to use limited networking with package-manager and MCP-server allowances.
Data: Managed Agents tools and skills - Clarifies where prebuilt agent tools and MCP tools run for cloud vs. self-hosted environments, and adds notes about session-local tool/MCP/
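
The "background launch, readiness polling, smoke verification, and shutdown" pattern named in the web server run-skill entry above can be sketched generically with the Python standard library. This is not the content of the actual skill, which the changelog only summarizes. The function names, timeouts, and the choice of a plain HTTP GET as the smoke check are all assumptions.

```python
import subprocess
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def http_probe(url: str) -> Callable[[], bool]:
    """Return a probe that succeeds once `url` answers with a 2xx status."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return 200 <= resp.status < 300
        except (urllib.error.URLError, OSError):
            return False  # not listening yet, or returned an error status
    return probe


def smoke_check(cmd: list[str], url: str, timeout: float = 30.0) -> bool:
    """Launch `cmd` in the background, wait for `url`, then shut down."""
    proc = subprocess.Popen(cmd)
    try:
        return wait_until_ready(http_probe(url), timeout=timeout)
    finally:
        proc.terminate()       # always stop the server, even on failure
        proc.wait(timeout=10)


# Example (not executed here; the port is an arbitrary choice):
#   smoke_check(["python", "-m", "http.server", "8000"],
#               "http://127.0.0.1:8000/")
```

Polling a real endpoint rather than sleeping for a fixed time is what "waits on real readiness" amounts to: the check passes as soon as the server responds and fails with a clear timeout if it never does.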

Integrations

GitHub, GitLab, Jira, Slack, AWS, Azure, Google Cloud Platform, Kubernetes, Docker, Terraform, Backstage, Prometheus, Datadog, New Relic, PagerDuty, Sentry, Twilio, CircleCI, Bitbucket, SonarQube

Categories

AI/ML, DevOps, Security, Analytics, Developer Tools

Frequently Asked Questions

Is Harness AI free?

Yes, Harness AI offers a free tier. The pricing model is subscription + freemium + per-seat + tiered.

What are the main features of Harness AI?

Key features include: Continuous Delivery & GitOps, Continuous Integration, Internal Developer Portal, Infrastructure as Code Management, Database DevOps, Artifact Registry, AI Test Automation, Resilience Testing.

What is Harness AI used for?

Harness AI is commonly used for: automating CI/CD pipelines for multi-cloud deployments, accelerating developer onboarding with an enterprise-grade IDP, integrating database changes into deployment pipelines, implementing AI-powered predictive analytics for software releases, modernizing end-to-end testing with AI test authoring, and using feature flags for controlled software releases.

What does Harness AI integrate with?

Harness AI integrates with: GitHub, GitLab, Jira, Slack, AWS, Azure, Google Cloud Platform, Kubernetes, Docker, Terraform.

What are common complaints about Harness AI?

Based on user reviews and social mentions, the most common pain points are: token usage, budget exceeded, token cost, cost tracking.