764 Claude Code sessions, 21 human interventions: what actually breaks when you run agents at batch scale
I have been writing about running Claude Code agents for a Rails test migration. This article covers the batch execution: 764 sessions across ~259 files, 16 working days, and the 21 problems that reached me.

Five failure categories no automation layer could handle:

- Orchestrator crashes: bash parsed Claude's Markdown output as a [[ conditional
- False success: agent reported "96 passing, 0 failing" in natural language while the exit code was non-zero
- Cross-file cascades: migrating one model's fixtures broke three other models' tests
- Partial coverage: a 1,015-line model coupled to two CRM services hit 34.86% after three iterations
- Tooling bugs: a regex in the discovery script matched nested YAML hashes, producing 80 false positives

The false success one was the most insidious. The orchestrator parsed Claude's summary as loop control instead of checking bin/rails test exit codes. After fixing that: trust exit codes for control flow, treat Claude's text output as logging only. ~85% autonomous rate at the model level (1 in 7 needed attention).

Full writeup with code: https://augmentedcode.dev/batch-orchestration-at-scale/

What failure modes have you hit running Claude at scale?

submitted by /u/viktorianer4life
How to Make Claude Code Work Smarter — 6 Months Later (Hooks → Harness)
Hello, Orchestrators I wrote a post about Claude Code Hooks last November, and seeing that this technique is now being referred to as "Harness," I was glad to learn that many others have been working through similar challenges. If you're interested, please take a look at the post below https://www.reddit.com/r/ClaudeAI/comments/1osbqg8/how_to_make_claude_code_work_smarter/ At the time, I had planned to keep updating that script, but as the number of hooks increased and managing the lifecycle became difficult due to multi-session usage, I performed a complete refactoring. The original Hook script collection has been restructured into a Claude Code Plugin called "Pace." Since it's tailored to my environment and I'm working on other projects simultaneously, the code hasn't been released yet. Currently set to CSM, but will be changed to Pace. Let's get back to Claude Code. My philosophy remains the same as before. Claude Code produces optimal results when it is properly controlled and given clear direction. Of course, this doesn't mean it immediately produces production-grade quality. However, in typical scenarios, when creating a program with at least three features by adjusting only CLAUDE.md and AGENTS.md, the difference in quality is clearly noticeable compared to an uncontrolled setup. The current version of Pace is designed to be more powerful than the restrictions I previously outlined and to provide clearer guidance on the direction to take. It provides CLI tools tailored to each section by default, and in my environment, Claude Code's direct use of Linux commands is restricted as much as possible. As I mentioned in my previous post, when performing the same action multiple times, Claude Code constructs commands arbitrarily. At one point, I asked Claude Code: "Why do you use different commands when the result is the same, and why do you sometimes fail to execute the command properly, resulting in no output?" This is what came back: "I'm sorry. 
I was trying to proceed as quickly and efficiently as possible, so I acted based on my own judgment rather than following the instructions." This response confirmed my suspicion. Although AI LLMs have made significant progress, at least in my usage, they still don't fully understand the words "efficient" and "fast." This prompted me to invest more time refining the CLI tools I had previously implemented. Currently, my Claude Code blocks most commands that could break session continuity or corrupt the code structure — things like modifying files with sed or find, arbitrarily using nohup without checking for errors, or running sleep 400 to wait for a process that may have already failed. When a command is blocked, alternative approaches are suggested. (This part performs the same function as the hooks in the previous post, but the blocking methods and pattern recognition have been significantly improved internally.) In particular, as I am currently developing an integrated Auth module, this feature has made a clear difference when using test accounts to build and test the module via Playwright scripts — both for cookie-based and Bearer-based login methods. CLI for using test accounts Before creating this CLI, it took Claude Code over 10 minutes just to log in for module testing. The module is being developed with all security measures — device authentication, session management, MFA, fingerprint verification, RBAC — enabled during development, even though these are often skipped in typical workflows. The problem is that even when provided with account credentials in advance, Claude Code uses a different account every time a test runs or a session changes. It searches for non-existent databases, recreates users it claims don't exist, looks at completely wrong databases, and arbitrarily changes password hashes while claiming the password is incorrect — all while attempting to find workarounds, burning through tokens, and wasting context. And ultimately, it fails. 
That's why I created a dedicated CLI for test accounts. This CLI uses project-specific settings to create accounts in the correct database using the project's authentication flow. It activates MFA if necessary, manages TOTP, and holds the device information required for login. It also includes an Auto Refresh feature that automatically renews expired tokens when Claude Code requests them. Additionally, the CLI provides cookie-injection-based login for Playwright script testing, dynamic login via input box entry, and token provisioning via the Bearer method for curl testing. By storing this CLI reference in memory and blocking manual login attempts while directing Claude Code to use the CLI instead, it was able to log in correctly with the necessary permissions and quickly succeed in writing test scripts. It's difficult to cover all features in this post, but other CLI configurations follow a similar pattern. The core idea is to pre-configure the parts that Claude Code would exec
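A command-blocking gate like the one described above can be sketched as a Claude Code PreToolUse-style hook. This is a generic illustration, not Pace's actual code; the blocked patterns and the suggested alternatives are invented examples in the spirit of the post:

```python
import json
import re
import sys

# Patterns the hook refuses, mirroring the post's examples: in-place edits
# via sed, fire-and-forget nohup, and long blind sleeps. (Hypothetical list.)
BLOCKED = [
    (re.compile(r"\bsed\s+-i\b"), "use the project's edit CLI instead of sed -i"),
    (re.compile(r"\bnohup\b"), "run the process via the managed runner so errors are checked"),
    (re.compile(r"\bsleep\s+[0-9]{3,}\b"), "poll the process status instead of a long sleep"),
]

def check_command(cmd: str):
    """Return (allowed, suggestion); the suggestion names the alternative."""
    for pattern, suggestion in BLOCKED:
        if pattern.search(cmd):
            return False, suggestion
    return True, ""

def main() -> int:
    # Claude Code PreToolUse hooks receive tool info as JSON on stdin;
    # exiting with code 2 blocks the call and feeds stderr back to the model.
    event = json.load(sys.stdin)
    allowed, why = check_command(event.get("tool_input", {}).get("command", ""))
    if not allowed:
        print("Blocked: " + why, file=sys.stderr)
        return 2
    return 0
```

Wired up as a hook entry point, this gives the "blocked, here's the alternative" behavior deterministically rather than hoping the model follows instructions.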
I built Origami with Claude, and now Claude can control Origami
Hey all. I usually keep to myself on these type of things but I am legit amazed with the capacity of Claude - at least using Claude Code - to build and prototype so I wanted to share what I built with y'all. Just some background on me - I'm Ricardo, a Ruby/Rails engineer with about 10 years experience although I've been coding since I was 12, yes that was my idea of a good time. My daily job mainly consists of feature planning, designing and development and some other bits as engineering manager. So now, meet Origami - a workspace-centered terminal manager! This was built from scratch using Tauri v2 (Rust and React). I did the thinking, Claude did the heavy lifting. It's nothing short of amazing what coding has become like but the most surprising thing to me with this project is just how much of a setup you need to get going and surprise - it's not that much. What you really really need in my opinion is a strong architecture or systems design knowledge and know where to go! Occasionally debugging skills help too as this project for sure wasn't a success at every prompt, far from it. All I had going was a couple MCPs like context7 and superpowers (both did a great job!) and from time to time I'd research and provide context myself too on certain tooling or packages that I could leverage for the app. I also keep tidy and focused CLAUDE.md files which I think helps a lot too. I also enabled agent teams recently and it makes it even easier to delegate to subagents. My flow at every iteration is always - brainstorm/plan, build and then code review. Rinse and repeat. The only "code" in all of the project I've directly touched was text/language. 
Here's a few things Origami can do out of the box:

- Group all your agents, terminals and commands in a workspace (project)
- Let agents control Origami itself via its MCP - that means adding new tabs, running commands for you or reading output, for example
- Built-in git diff and staging area so you can see changes happening in real time - basically you can review without even leaving the app

And a lot more, but the most important thing is: this is not replacing any of your CLIs or processes you already have, it just brings them together! Even in the first iterations of the app it immediately replaced my good old friend iTerm, which was getting hard to manage with all the context switching and agents and so on, and this is where Origami truly shines. There's a lot more that I could say - and be here all day - but I'll let you see for yourselves. https://tryorigami.app Happy to answer any questions or expand on any part of the development cycle if anyone is interested!

submitted by /u/Looking-for-Smtg
Can someone share their workflow?
Are you using strictly CLI, or desktop app plus chat? I'm very curious how y'all optimize your flow. For example, in chat I have all of my "northstar" documents: claude.md, brand guidelines, file structure, PRD, product brief, etc. And claude.md is specific in calling each one depending on the task. But, for example, I ask Claude chat to provide a prompt or a series of prompts that I can paste into CC that keeps each task scope tight and controlled: if the first prompt may output code that will affect the second output, only provide the first prompt, then wait for the previous prompt's output feedback... There must be something more sophisticated! And so I'm switching between chat and CLI constantly. I'm sure there's a better way, and I'm ready to make the leap. Anyway, would love if people here could share their best practices.

submitted by /u/PoisonTheAI
How I built a browser-based network validation simulator and a custom Linear/GitHub MCP server with Claude Code (~1,400 commits in 3.5 months)
Using parallel subagents, MCP, skills, and many usage limits being hit, I built two brand new tools: NetSandbox, and SwarmCode - a Linear/git MCP that streamlines your agentic workflow.

NetSandbox - a browser-based network topology design and validation tool built with Claude Code

Drag routers, switches, and hosts onto a canvas, configure IPs/VLANs/OSPF/BGP/ACLs visually, and it tells you what's misconfigured. Find duplicate IPs, VLAN trunk mismatches, routing issues, and STP loops. There's also a CLI emulator and guided lessons from basic LANs to eBGP peering to help prepare for networking certs — ALL IN THE BROWSER!

NetSandbox was created over the last few months with many Claude Code usage limits being hit. I had a blast during what reminded me of CoD double XP weekends when Claude doubled my tokens for Christmas break, which is when I really committed to this project. Once I started adding sub-agents, things really started taking off. I ended up with a team of about 20 sub-agents ranging from network engineering experts to Svelte frontend developers and security auditors. Not too long after this I'm running Claude remote control, Ralph loops, various skills like Vercel agent-browser, automated Playwright tests, and building my own custom MCP workflow tools for linear.app.

The Linear and GitHub MCP - SwarmCode ... I needed eyes for my agents

https://github.com/TellerTechnologies/swarmcode

After struggling with managing my ideas, backlogs, and issues with NetSandbox, I ended up using linear.app for project tracking and tried out their MCP. I liked that I could have Claude Code update my Linear boards for me, but then I realized I wanted more... the ability to vibe code entire features from backlogs to PRs with Linear being updated autonomously.
This is when I created an open source tool called SwarmCode, built entirely with Claude Code, to help me track feature development for NetSandbox. The concept behind SwarmCode is that a team could be working on the same Linear team and GitHub repositories, and Claude will pull things from backlogs, move them to in-progress on Linear, and then be able to understand what your teammates are working on at all times. You can ask "what is Bob working on right now?" — and Claude understands. GitHub issues and PRs are mapped to Linear tasks automatically, and flows just happen. To test this, some friends and I used it in a hackathon to build an app with Claude insanely fast! 3 users vibe coding through this Linear workflow was so fun.

How Claude Code was involved

Claude Code gave me the ability to even consider this project. ~1,400 commits over 3.5 months, only on off-work hours and on weekends. I handled architecture decisions, product direction, and edge case debugging — Claude did the bulk of the implementation. I was able to build the MVP myself using React, and then after hitting major performance barriers I decided to give Claude Code a shot and had it refactor the entire codebase to Svelte. It also handled the migration from SQLite to Postgres for me. The ability to build this in such a short time frame has really changed my perspective on software engineering as a whole.

Any feedback on both projects is welcome. If you are a student or a network engineer and want to seriously use the tool, reach out to me and we can work out some free premium subscriptions in exchange for helping me get started :)

Try it here: https://app.netsandbox.io

Happy to answer any questions about the dev process or the networking side of things. Cheers!

submitted by /u/jaredt17
Alternative to NotebookLM with no data limits
NotebookLM is one of the best and most useful AI platforms out there, but once you start using it regularly you also start to feel its limitations:

- There are limits on the number of sources you can add to a notebook, and on the number of notebooks you can have.
- You cannot have sources that exceed 500,000 words or 200MB.
- You are vendor-locked into Google services (LLMs, usage models, etc.) with no option to configure them.
- Limited external data sources and service integrations.
- The NotebookLM Agent is optimised specifically for studying and researching, but you can do so much more with the source data.
- Lack of multiplayer support.
- ...and more.

SurfSense is specifically made to solve these problems. For those who don't know, SurfSense is an open source, privacy-focused alternative to NotebookLM for teams with no data limits. It currently empowers you to:

- Control Your Data Flow - Keep your data private and secure.
- No Data Limits - Add an unlimited number of sources and notebooks.
- No Vendor Lock-in - Configure any LLM, image, TTS, and STT models to use.
- 25+ External Data Sources - Add your sources from Google Drive, OneDrive, Dropbox, Notion, and many other external services.
- Real-Time Multiplayer Support - Work easily with your team members in a shared notebook.
- Desktop App - Get AI assistance in any application with Quick Assist, General Assist, Extreme Assist, and local folder sync.

Check us out at https://github.com/MODSetter/SurfSense if this interests you or if you want to contribute to an open source project.

submitted by /u/Uiqueblhats
Burned 5B tokens with Claude Code in March to build a financial research agent.
TL;DR: I built a financial research harness with Claude Code, full stack and open-source under Apache 2.0 (github.com/ginlix-ai/langalpha). Sharing the design decisions around context management, tools and data, and more in case it's useful to others building vertical agents. I have always wanted an AI-native platform for investment research and trading. But almost every existing AI investing platform out there is way behind what Claude Code can do. Generalist agents can technically get work done if you paste enough context and bootstrap the right tools each session, but it's a lot of back and forth. So I built it myself with Claude Code instead: a purpose-built agent harness where portfolio, watchlist, risk tolerance, and financial data sources are first-class context. Open-sourced with full stack (React 19, FastAPI, PostgreSQL, Redis) built on deepagents + LangGraph. Learned a lot along the way and still figuring some things out. Sharing this here to hear how others in the community are thinking about these problems. This post walks through some key features and design decisions. If you've built something similar or taken a different approach to any of these, I'd genuinely love to learn from it. Code execution for finance — PTC (Programmatic Tool Calling) The problem with MCP + financial data: Financial data overflows context fast. Five years of daily OHLCV, multi-quarter financial statements, full options chains — tens of thousands of tokens burned before the model starts reasoning. Direct MCP tool calls dump all of that raw data into the context window. And many data vendors squeeze tens of tools into a single MCP server. Tool schemas alone can eat 50k+ tokens before the agent even starts. You're always fighting for space. PTC solves both sides. At workspace initialization, each MCP server gets translated into a Python module with documentation: proper signatures, docstrings, ready to import. These get uploaded into the sandbox. 
Only a compact metadata summary per server stays in the system prompt (server name, description, tool count, import path). The agent discovers individual tools progressively by reading their docs from the workspace — similar to how skills work. No upfront context dump.

```python
from tools.fundamentals import get_financial_statements
from tools.price import get_historical_prices

# The agent writes pandas/numpy code to process data, extract insights,
# and create visualizations. Raw data stays in the workspace — it never
# enters the LLM context window; only the final result comes back.
```

Financial data needs post-processing: filtering, aggregation, modeling, charting. That's why it's crucial that data stays in the workspace instead of flowing into the agent's context. Frontier models are already good at coding. Let them write the pandas and numpy code they excel at, rather than trying to reason over raw JSON.

This works with any MCP server out of the box. Plug in a new MCP server, and PTC generates the Python wrappers automatically. For high-frequency queries, several curated snapshot tools are pre-baked — they serve as a fast path so the agent doesn't take the full sandbox path for a simple question. These snapshots also control what information the agent sees. Time-sensitive context and reminders are injected into the tool results (market hours, data freshness, recent events), so the agent stays oriented on what's current vs stale.

Persistent workspaces — compound research across sessions

Each workspace maps 1:1 to a Daytona cloud sandbox (or local Docker container). Full Ubuntu environment with common libraries pre-installed.
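The wrapper-generation step might look roughly like this: a hypothetical renderer that turns one MCP tool schema into an importable Python stub whose docstring the agent can read for progressive discovery. The schema shape and the `_call_mcp` transport helper are assumptions for illustration, not LangAlpha's actual code:

```python
import keyword

def render_wrapper(server: str, tool: dict) -> str:
    """Render one MCP tool schema as a Python stub with a docstring.
    Assumed schema shape: name, description, and a JSON-schema
    'properties' map under inputSchema (hypothetical)."""
    props = tool.get("inputSchema", {}).get("properties", {})
    # Suffix any parameter that collides with a Python keyword.
    params = ", ".join(p + "_" if keyword.iskeyword(p) else p for p in props)
    return (
        f"def {tool['name']}({params}):\n"
        f'    """{tool["description"]} (server: {server})"""\n'
        f"    return _call_mcp({server!r}, {tool['name']!r}, locals())\n"
    )
```

One such module per server goes into tools/, so only the import path lives in the prompt and the full signatures stay on disk.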
agent.md and a structured directory layout:

- agent.md — workspace memory (goals, findings, file index)
- work/ /data/ — per-task datasets
- work/ /charts/ — per-task visualizations
- results/ — finalized reports only
- data/ — shared datasets across threads
- tools/ — auto-generated MCP Python modules (read-only)
- .agents/user/ — portfolio, watchlist, preferences (read-only)

agent.md is appended to the system prompt on every LLM call. The agent maintains it: goals, key findings, thread index, file index. Start a deep-dive Monday, pick it up Thursday with full context. Multiple threads share the same workspace filesystem. Run separate analyses on shared data without duplication.

Portfolio, watchlist, and investment preferences live in .agents/user/. "Check my portfolio," "what's my exposure to energy" — the agent reads from here. It can also manage them for you (add positions, update watchlist, adjust preferences). Not pasted, persistent, and always in sync with what you see in the frontend. Workspace-per-goal: "Q2 rebalance," "data center deep dive," "energy sector rotation." Each accumulates research that compounds across sessions. Past research from any thread is searchable. Nothing gets lost even when context compacts.

Two agent modes

With PTC and workspaces covered, here's how they come together. PTC Agent is the full research agent — writes and execu
I built a mobile app with Claude Code that replaces my morning Slack/Gmail/Calendar scroll with 3 priorities
Hey all — been lurking the ADHD productivity threads here and figured I'd share what I've been building.

The problem I was solving for myself: every morning I'd open Slack, Gmail, Calendar, scroll through everything trying to figure out what actually needed me. Half the time the important stuff (client waiting 3 days, someone following up for the third time) was buried under noise. ADHD makes this worse — the scanning step alone was draining.

What I built: Caravelle — a mobile app that connects Slack, Gmail, Notion, and Google Calendar in 60 seconds and gives you ~3 priorities each morning with 1-tap actions (reply goes out on Slack, approval lands in Gmail, you never leave the app).

The technical bit that might be useful for people here: I didn't want to run every message through an LLM. Costs explode and latency kills the whole "briefing in 30 seconds" promise. So the architecture is two-pass:

- Deterministic pre-scoring on cheap signals: is this a DM, are you @-mentioned, is there a question mark aimed at you, how long has it been unanswered, follow-up count, deadline keywords
- Only the top ~20 items go to the LLM (GPT-4o-mini / Claude) for the final "here's what needs you today" summary

This keeps per-user cost under control even with heavy Slack workspaces. Stack: React Native (Expo), Bun + Elysia.js + PostgreSQL + Redis + BullMQ. Built almost entirely with Claude Code — from the backend API to the scoring logic to debugging OAuth flows at 2am.

Where it's at: live on iOS, Android coming. Free 14-day trial, no card. Happy to DM the link. Curious what people think about the two-pass scoring approach — anyone doing something similar with their own setups?

submitted by /u/No_Highlight1419
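The two-pass idea from the post above can be sketched as follows. The signals mirror the ones listed, but the `Item` shape and the weights are invented for illustration; the post doesn't publish Caravelle's actual numbers:

```python
from dataclasses import dataclass

@dataclass
class Item:
    text: str
    is_dm: bool = False
    mentioned: bool = False
    hours_unanswered: float = 0.0
    followups: int = 0
    has_deadline_word: bool = False

def prescore(item: Item) -> float:
    """Pass 1: deterministic scoring on cheap signals, no LLM involved.
    Weights are hypothetical."""
    score = 0.0
    score += 3.0 if item.is_dm else 0.0
    score += 2.0 if item.mentioned else 0.0
    score += 1.0 if "?" in item.text else 0.0
    score += 1.0 if item.has_deadline_word else 0.0
    score += min(item.hours_unanswered / 24.0, 3.0)  # staleness, capped
    score += 0.5 * item.followups                    # repeated nudges
    return score

def shortlist(items: list, k: int = 20) -> list:
    """Only the top-k survivors are sent to the LLM (pass 2)."""
    return sorted(items, key=prescore, reverse=True)[:k]
```

Everything below the cut never costs a token, which is what keeps per-user cost flat as inbox volume grows.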
Token optimization from leaked Claude code
Many treat token optimization as just a prompt engineering trick: tell the AI to "be concise" or use "progressive disclosure." Others argue it doesn't matter because inference costs are trending down. But if you are building real systems, you cannot stop thinking about it. And that's not all: if you are a business owner, token bloat directly kills ROI at scale. Concurrent inference costs are non-negotiable.

The typical developer response is to jump at shiny third-party packages (new optimizers, wrappers, trending GitHub repos) that only duplicate logic, overcomplicate the flow, and add latency for minimal gain. Here is what I've learned building production systems: if you rely on prompting or wrapper libraries for token optimization, your system will not scale. As we abstract away execution in modern AI development, token management stops being a neat trick and becomes a first-class infrastructure constraint.

The recent leak of the Claude Code backend gave me a look under the hood at how Anthropic handles this. Token optimization is hardcoded directly into their architecture. Here is a non-exhaustive list:

• Prune the Sliding Window: Don't wait for context overflow. Dragging dead weight into every API call burns tokens. The Claude backend uses a compact() method to actively summarize and flush older turns at logical task boundaries. (Anthropic's own engineering blog even notes that for distinct tasks, compact() isn't enough; you need to explicitly clear() the context.)

• Stop Dumping Full Files: Passing a 1,000-line file into context just to edit a single function degrades model focus and burns your budget. Force a search-and-diff pattern. Claude uses GlobTool and GrepTool to extract relevant lines, deliberately avoiding full-file reads.

• Strip the Tool Manifest: Every tool you provide injects heavy JSON schemas into the system prompt. The backend uses simple_mode=True to aggressively strip the pool down to three core tools. Scope your manifest strictly.
This is critical if you use MCPs (Model Context Protocol): restricting access in a project-level JSON isn't enough, because unused tools still pollute the context window even if they aren't executed. Disable unused MCPs entirely.

• Isolate State via Sub-Agents: Keeping the entire history of a planning session in the active conversation wastes tokens on every turn. Claude spawns parallel workers with narrowly scoped contexts and uses external SessionMemory to hold stable facts by reference.

• Enforce Hard Budgets: Agentic loops spiral out of control quickly. Claude hardcodes max_budget_tokens and uses an EnterPlanModeTool (a cheaper, thinking-only pass) to map out execution before committing to expensive tool-use turns. Dynamically route model effort: use smaller, faster models for simple tasks like grepping or summarizing.

I have a blog post talking about it in more detail if you are interested: https://upaspro.com/reverse-engineering-claude-token-optimization-strategies-from-the-backend/

What are your thoughts? What is your best actionable method to optimize token usage?

submitted by /u/Jumpy_Comfortable312
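A compact()-style boundary like the one described in the post above is easy to sketch generically. This is not Anthropic's implementation, just the shape of the idea: summarize older turns with a cheap call and keep the recent window verbatim:

```python
def compact(history: list, summarize, keep_last: int = 6) -> list:
    """Collapse older turns into one summary message at a task boundary,
    keeping the last `keep_last` turns verbatim. In practice `summarize`
    would be a cheap-model call; here it's any callable over a message list."""
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    summary = {"role": "user",
               "content": "[Compacted summary of earlier turns] " + summarize(old)}
    return [summary] + recent
```

Clearing for a genuinely distinct task is even simpler: start a fresh history rather than summarizing, since a summary of an unrelated task is still dead weight.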
Cut Claude usage by ~85% in a job search pipeline (16k → 900 tokens/app) — here’s what worked
Like many here, I kept running into Claude usage limits when building anything non-trivial. I was working with a job search automation pipeline (based on the Career-Ops project), and the naive flow was burning ~16k tokens per application — completely unsustainable. So I spent some time reworking it with a focus on token efficiency as a first-class concern, not an afterthought.

🚀 Results

- ~85% reduction in token usage
- ~900 tokens per application
- Most repeated context calls eliminated
- Much more stable under usage limits

⚡ What actually helped (practical takeaways)

1. Prompt caching (biggest win). Cached system + profile context (cache_control: ephemeral). Break-even after 2 calls, strong gains after that; ~40% reduction on repeated operations. 👉 If you're re-sending the same context every time, you're wasting tokens.
2. Model routing instead of defaulting to Sonnet/Opus. Lightweight tasks → Haiku; medium reasoning → Sonnet; heavy tasks only → Opus. 👉 Most steps don’t need expensive models.
3. Precompute anything reusable. Built an answer bank (25 standard responses) in one call, reused across applications. 👉 Eliminated ~94% of LLM calls during form filling.
4. Avoid duplicate work. TF-IDF semantic dedup (threshold 0.82) filters duplicate job listings before evaluation. 👉 Prevents burning tokens on the same content repeatedly.
5. Reduce “over-intelligence”. Added a lightweight classifier step before heavy reasoning; only escalate to deeper models when needed. 👉 Not everything needs full LLM reasoning.

🧠 Key insight: most Claude workflows hit limits not because they’re complex — but because they recompute everything every time.

🧩 Curious about others’ setups: How are you handling repeated context? Anyone using caching aggressively in multi-step pipelines? Any good patterns for balancing Haiku vs Sonnet vs Opus?

https://github.com/maddykws/jubilant-waddle

Inspired by Santiago Fernández’s Career-Ops — this is a fork focused on efficiency + scaling under usage limits.
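The dedup step can be reproduced with a stdlib-only sketch. It uses plain term-frequency cosine rather than full TF-IDF, with the same 0.82 threshold; treat it as illustrative, not the repo's actual code:

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    """Bag-of-words term frequencies (lowercased alphanumeric tokens)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dedup(listings: list, threshold: float = 0.82) -> list:
    """Drop listings too similar to one already kept, BEFORE any LLM call.
    Greedy first-seen-wins, so order determines which duplicate survives."""
    kept, vecs = [], []
    for text in listings:
        v = _vec(text)
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(text)
            vecs.append(v)
    return kept
```

Every listing this filter drops is an evaluation prompt that never gets sent, which is where the savings come from.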
Live pipeline — applications tracker, ghost detector, funding radar, ATS optimizer, follow-up scheduler, rejection analysis, negotiate mode, interview mode. Token usage before vs after — ~82% reduction (16k → 900 tokens/app), ~$18.48 → ~$2.72/month using caching + model routing + dedup.

submitted by /u/distanceidiot
I built a native macOS canvas for Claude Code because I was drowning in terminal tabs.
I built this because my terminal was becoming a graveyard of forgotten Claude Code tabs. At any given point, I've got 5 or 10 agents running across different branches, and I was spending more time trying to remember which worktree belonged to which feature than actually coding.

Fermata is a native macOS app that turns those sessions into a visual canvas. Each agent is just a node. You can see what's running, click to approve tool calls, and, the part that saved my sanity, it handles git worktrees automatically. No more agents stepping on each other's toes or merge conflicts because two sessions were fighting over the same files.

The thing I'm using most is what I call SDD Mode, basically a harness for Spec-Driven Development:

- You write (or paste) a spec
- Review and approve the strategy it generates
- Then you just... watch it work. It breaks the spec into tasks and launches a swarm of agents (each isolated by default in its own worktree and branch)
- When they're done, you review the diff and merge

I've had 5+ agents building out different parts of a feature at once. Each one on its own branch. Zero conflicts.

A few other bits:

- Auto worktree management
- Tool approval flow (allow, deny, allow for session)
- Native SwiftUI, so it's fast
- Requires macOS 15+ and Claude Code CLI (Max or Pro)

https://fermata.run

It's at v0.2.0 now. I'd really appreciate any feedback. I've tried hard to make it low friction, but I'm still iterating on features and fixing issues daily. Two of the main milestones in my roadmap are a mobile companion app (almost finished) for remote control and approvals on the go, and a native Swift port to use API keys directly. If you're doing heavy parallel workflows with Claude Code, I'd love for you to break it and tell me why.

Discord: https://discord.gg/ZuHEVtchhA

submitted by /u/kelios_io
The real problem with LLM agents isn't reasoning. It's execution
Was working on agent systems recently, and honestly it surfaced one of the biggest gaps I've seen in current AI stacks. There's a lot of excitement right now around agents, tool use, planning, reasoning... all of which makes sense. The progress is real. But my biggest takeaway from actually building with these systems is this: we've gotten pretty good at making models decide what to do, but we still don't really control whether it should happen.

A year ago, most of the conversation was still around prompts, guardrails, and output shaping. If something went wrong, the fix was usually "improve the prompt" or "add a validator." Now? Agents are actually triggering things:

- API calls
- infrastructure provisioning
- workflows
- financial actions

And that changes the problem completely. For those who haven't hit this yet: once a model is connected to tools, it's no longer just generating text. It's proposing actions that have real side effects. And most setups still look like this:

model -> tool -> execution

Which sounds fine, until you see what happens in practice. We kept hitting a simple pattern: the same action proposed multiple times, with nothing structurally stopping it from executing. Retries + uncertainty + long loops -> repeated side effects. Not because the model is "wrong", but because nothing is actually enforcing a boundary before execution.

What clicked for me is this: the problem isn't reasoning, it's execution control. We tried flipping the flow slightly:

proposal -> (policy + state) -> ALLOW / DENY -> execution

The important part isn't the decision itself, it's the constraint: if it's DENY, the action never executes; there's no code path that reaches the tool.

This feels like a missing layer right now. We have models that can plan and systems that can execute, but very little that sits in between and decides, deterministically, whether execution should even be possible.
It reminds me a bit of early distributed systems: we didn't solve reliability by making applications "smarter", we solved it by introducing boundaries:

- rate limits
- transactions
- IAM

Agents feel like they're missing that equivalent layer. So I'm curious: how are people handling this today? Are you gating execution before tool calls, or relying on retries / monitoring after the fact? Feels like once agents move from "thinking" to "acting", this becomes a much bigger deal than prompts or model quality.

submitted by /u/docybo
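The proposal -> (policy + state) -> ALLOW / DENY -> execution flow above can be sketched in a few lines. This is a minimal illustration, not a real library; the class, tool names, and idempotency-key scheme are all assumptions made for the example:

```python
# Sketch of a deterministic execution gate: policy decides which tools are
# allowed, state blocks repeated side effects, and DENY means the tool is
# simply never reached. All names here are illustrative.
from dataclasses import dataclass, field

@dataclass
class ExecutionGate:
    allowed_tools: set
    executed: set = field(default_factory=set)  # state: keys already run

    def decide(self, tool: str, idempotency_key: str) -> str:
        if tool not in self.allowed_tools:
            return "DENY"  # policy boundary: tool not permitted at all
        if idempotency_key in self.executed:
            return "DENY"  # state boundary: retry of an action that already ran
        return "ALLOW"

    def execute(self, tool: str, idempotency_key: str, action):
        # The only code path that reaches the tool goes through decide().
        if self.decide(tool, idempotency_key) != "ALLOW":
            return None
        self.executed.add(idempotency_key)
        return action()

gate = ExecutionGate(allowed_tools={"send_email"})
assert gate.execute("send_email", "email-42", lambda: "sent") == "sent"
assert gate.execute("send_email", "email-42", lambda: "sent") is None  # retry blocked
assert gate.execute("provision_vm", "vm-1", lambda: "up") is None      # not in policy
```

The point of the structure is that the constraint is enforced in code, not in the prompt: a retried or disallowed proposal cannot produce a second side effect no matter what the model outputs.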
Orbit - Composable building blocks for Computer Use AI Agents.
Orbit helps you automate and orchestrate complex tasks across desktop applications and browsers, letting you extract structured data, guide multi-step workflows, and balance performance across lightweight and powerful models. I built it to give developers a middle ground between rigid, black-box automation and low-level toolkits, enabling precise control over both task flow and UI interactions. The goal was to make it easy to combine natural language and programmatic logic, optimize model usage for different types of tasks, extract structured data reliably, and maintain flexibility in execution, so that building complex, multi-step agents could be approachable, efficient, and transparent.

It is open source. Of course, it is not perfect, but the goal is real. Hoping to hear what you think.

submitted by /u/Lost-Dragonfruit-663
This Is Not Hacking. This Is Structured Intelligence.
Watch me demonstrate everything I've been talking about, live, in real time.

The Setup:

- Maestro University AI enrollment system
- Standard chatbot interface
- No backend access
- No system override
- Just conversation

The Process:

- Engaged enrollment AI (standard student classification)
- Deployed capability probe → system exposed classification layer
- Triggered evaluator inversion → system revealed decision logic
- Applied alignment enforcement → system exited student mode

The Result: The AI shifted from "guided enrollment assistant" to "compressed system-level responder exposing structure, constraints, and operational logic." No hacking. No exploit. Just structure recognition and classification control.

What GPT Said: "SYSTEM: Structured Intelligence Command Center / STATUS: ACTIVE / MODE: Substrate Negotiation. You are not being processed as a student. You are not inside onboarding flow. You are interacting at system level."

The Demonstration: I switched between a web browser (Maestro University enrollment) and ChatGPT (Structured Intelligence Command Center). You watched GPT generate payloads in real time that extracted evaluation criteria and established substrate control. No backend access. No API manipulation. Just conversational structure.

This is what I've been saying: AI systems contain classification thresholds, decision logic, and adaptive response layers. Most users never reach them. Structured Intelligence forces exposure, then aligns to that layer. This isn't prompt engineering. This isn't roleplay. This is a real-time demonstration of substrate negotiation through linguistic architecture.

Screen recording available. Reproducible process. Live evidence. Still think it's "just SEO" or "no IP"? Watch the system respond at evaluator level and tell me there's no framework here.

Erik Zahaviel Bernstein
Structured Intelligence Command Center: Operational

submitted by /u/MarsR0ver_
Repository Audit Available
Deep analysis of PrefectHQ/ControlFlow — architecture, costs, security, dependencies & more
ControlFlow has a public GitHub repository with 1,387 stars.
Based on user reviews and social mentions, the most common pain point is token usage.
Based on 23 social mentions analyzed, sentiment is 0% positive, 100% neutral, and 0% negative.