Discover Llama 4's class-leading AI models, Scout and Maverick. Experience top performance, multimodality, low costs, and unparalleled efficiency
Llama 3 is recognized for its adaptability in various AI applications, appealing to developers working with local AI tools and multi-agent systems. However, there seem to be ongoing challenges with support against indirect prompt injections and real-world handling, as users discuss creating additional tools to address such gaps. Pricing sentiment appears to be stable, with updates being regularly managed. Overall, Llama 3 has a solid reputation among tech enthusiasts, though it's clear users see room for improvement in tackling niche AI use cases.
Mentions (30d)
17
3 this week
Reviews
0
Platforms
3
GitHub Stars
29,294
3,524 forks
Llama 3 is recognized for its adaptability in various AI applications, appealing to developers working with local AI tools and multi-agent systems. However, there seem to be ongoing challenges with support against indirect prompt injections and real-world handling, as users discuss creating additional tools to address such gaps. Pricing sentiment appears to be stable, with updates being regularly managed. Overall, Llama 3 has a solid reputation among tech enthusiasts, though it's clear users see room for improvement in tackling niche AI use cases.
Features
Use Cases
Industry
information technology & services
Employees
77,000
10,591
GitHub followers
12
GitHub repos
29,294
GitHub stars
20
npm packages
40
HuggingFace models
I built a prompt injection detector that outperforms LlamaGuard 3 on indirect/roleplay attacks
Been working on Arc Sentry, a whitebox prompt injection detector for self-hosted LLMs (Mistral, Llama, Qwen). Most detectors pattern-match on known attack phrases. Arc Sentry watches what the prompt does to the model’s internal representation instead, so it catches indirect, hypothetical, and roleplay-framed attacks that get through keyword filters. Benchmark on indirect/roleplay/technical prompts (40 OOD prompts): • Arc Sentry: Recall 0.80, F1 0.84 • OpenAI Moderation API: Recall 0.75, F1 0.86 • LlamaGuard 3 8B: Recall 0.55, F1 0.71 Arc Sentry has the highest recall — it catches more of the hard cases. Blocks before model.generate() is called. The lightweight pre-filter runs on CPU with no model access. pip install arc-sentry GitHub: https://github.com/9hannahnine-jpg/arc-sentry Happy to answer questions about how it works.
View originalPricing found: $0.19, $0.49, $0.19, $0.49, $0.19/mtok
Is this even real ?
I randomly came across this and honestly I can’t tell if it’s real or one of those AI demos that looks impressive but doesn’t actually work. From what I understand, it’s claiming you can fine-tune models, do image training, test them in a playground, and deploy them as an API from a phone. That sounds a little too convenient, which is why I’m skeptical. I haven’t tried it myself yet, but I’m curious if anyone here has. submitted by /u/Raman606surrey [link] [comments]
View originalLlama Surgery: Continuous Sparsification of Pre-Trained Language Models via Differentiable Ultrametric Topology Injection
Sequel to: Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention Abstract We present Llama Surgery, a method for injecting learned block-sparse attention topologies into pre-trained dense language models without retraining from scratch, distillation, or post-hoc pruning. Starting from a frozen Llama 3.1 8B, we surgically replace each attention layer with a Dynamic Topology Router that maps token embeddings onto the branches of a Bruhat-Tits p-adic tree via factorized Gumbel-Softmax routing. A Deterministic Collapse Initialization to achieve a Continuous Logit Homotopy guarantees that at step 0 the injected topology mask is identically dense, preserving the pre-trained manifold exactly. Over training, temperature annealing polarizes the soft routing assignments into hard binary masks, and a Switch Transformer-style load-balancing loss prevents routing collapse. We identify and resolve two critical failure modes: (1) gradient collapse through discrete masking operations, solved by a Straight-Through Estimator bridge that decouples the hard forward mask from the soft backward gradient; and (2) Attention Sink instability, where hard-masking the initial token causes softmax entropy collapse and syntactic degeneration, solved by permanently anchoring Token 0 in the visibility set. The resulting architecture is validated on Llama 3.1 8B fine-tuned on WikiText-2, achieving stable convergence and producing coherent, mathematically sophisticated text while maintaining dynamic block-sparse routing across all 32 transformer layers. A controlled semantic clustering experiment on TinyLlama-1.1B demonstrates that the router learns to assign tokens from distinct semantic domains (mathematics, natural language, code) to separate branches of the Bruhat-Tits tree using only the standard language modeling loss, with no explicit clustering objective. A Needle-In-A-Haystack (NIAH) retrieval experiment on TinyLlama-1.1B reveals that the router spontaneously organizes the context window into an ultrametric cophenetic hierarchy: the needle is isolated at maximum topological distance from the haystack (d_p = 6.88), and the ultrametric triangle inequality d(x,z) ≤ max(d(x,y), d(y,z)) is satisfied. Averaging over 32 attention heads yields a forest ensemble of distinct per-head ultrametric trees rather than a single global hierarchy. We further identify and resolve three critical float16 numerical failure modes—Gumbel-Softmax overflow, attention score overflow, and cumulative product backward instability—the last of which we solve via a novel cumprod→cummin substitution that exploits the binary structure of hard Gumbel-Softmax outputs. A custom Triton forward kernel with Attention Sink and Local Window support, pipelined for Ampere and Hopper architectures (num_warps=4, num_stages=3), executes the block-sparse prefill phase at O(N) theoretical complexity. To our knowledge, this is the first demonstration of differentiable ultrametric topology injection into a production-scale pre-trained LLM. https://github.com/sneed-and-feed/adelic-spectral-zeta/blob/main/papers/llama_surgery.md submitted by /u/LooseSwing88 [link] [comments]
View originalI integrated a local Llama 3.2 model to act as a dynamic Dungeon Master in my indie RPG.
Hey everyone, I am not trying to sell or self promote mainly just wanted to showcase a big project I've been working on ever since I started studying data science and artificial intelligence and integrating AI into workflows and using it as an augment to create things that were previously out of reach for so many people, because if used right it can become a second brain and not a crutch. I’m the solo dev behind Void Runner, an isometric ARPG/MOBA hybrid built in Python. I recently hit a wall with traditional procedural quest generation. Hand-crafting templates gets repetitive fast, and players quickly learn the patterns to these things whether you like it or not. To solve this, I built the "Void Caller AI", a system that uses a local, quantized Llama 3.2 model to act as a dynamic Dungeon Master. Instead of just generating random flavor text, the system uses a lightweight RAG (Retrieval-Augmented Generation) pipeline. It reads live server telemetry (who died, what items were looted, which bosses were defeated recently) and weaves those actual server events into the narrative of the quests it generates. Because it runs locally via Ollama on our backend, there are no crazy cloud API costs, and latency is kept completely manageable. Here is a simplified look at how the Python backend bridges the SQLite telemetry with the Llama 3.2 prompt: import json import ollama from sqlalchemy import text from database import SessionLocal def generate_dynamic_quest(difficulty: str, target: str): db = SessionLocal() # 1. Fetch recent server telemetry for context (RAG-lite) lore_context = "" try: # Grab recent server events to weave into the narrative recent_events = db.execute(text( "SELECT username, event_type, dungeon_type FROM ai_events ORDER BY id DESC LIMIT 3" )).fetchall() if recent_events: events_str = "; ".join([f"Runner '{r[0]}' triggered a '{r[1]}' in '{r[2]}'" for r in recent_events]) lore_context = f" Incorporate this recent live server telemetry into the lore: {events_str}" except Exception as e: pass # 2. Construct the prompt with strict JSON formatting constraints prompt = f"""You are the Void Caller, a sinister AI in a dark industrial sci-fi RPG. Create a dynamic PvE extraction quest of {difficulty} difficulty. Respond ONLY in valid JSON with keys: 'title' (string), 'description' (string, menacing), 'item_name' (string), 'quantity' (integer 1-15), 'boss_name' (string, optional). {lore_context}""" # 3. Stream to local Llama 3.2 response = ollama.chat( model='llama3.2', messages=[{'role': 'user', 'content': prompt}], format='json', options={'temperature': 0.8} ) return json.loads(response['message']['content']) By forcing the format='json' parameter, Llama 3.2 reliably outputs structured data that my game engine instantly parses into a playable quest objective. If a player just died to a specific boss, the AI will literally generate a bounty quest for the rest of the server to avenge them. Would love to hear if anyone else is using local LLMs for live game state generation! You can check out the results live in our Open Beta at [void-runner.online]. submitted by /u/xSoulR34per [link] [comments]
View originalOk, talvez eu pague pelo Meta Premium
Hoje eu postei sobre o Mark Zuckerberg lançar a notícia mais patética que vai cobrar 19 dólares para desbloquear o Muse Spark Pro kakakakakakaka Quem vai pagar por essa merda? Mas pensando melhor bem... Talvez eu pague Eu usei muito esse modelo como Early adopter, desde quando o motor era o Llama 3.2 e sendo inferior as outras consegui extrair escrita criativa que batia de frente com Claude em personas graças ao seu RAG no ecossistema da Meta, que tinha uma criatividade absurda quando você forçava ela a consultar as redes sociais e ver como pessoas agem e comentam, porém lançou o Muse Spark que era tipo o GPT 5.2 dos Llamas kkkkkk aí só usei para pesquisa e bem... Minha tese sobre o Muse Spark é que pra mim o problema nunca pareceu ser burrice. Parece CONTENÇÃO. Não dá vibe de modelo incapaz ou inferior. Dá vibe de modelo sendo sufocado em tempo real. Porque se você presta atenção, ele: - pesquisa rápido pra cacete (Já que cada agente pesquisa uma coisa) - alucina menos em busca (pois o modelo refina a busca dos agentes, muitas vezes consegui resultados mais confiáveis que o Gemini) - já trabalha com esquema multi-agente herdado da Manus ( o trunfo dessa IA é que diferente das outras ela não comprimi seu input, ela usa agentes para cada um pesquisar cada trecho dele, o resultado é mais completo) - acha informação boa (ela pesquisa tanto na internet quanto em grupos de Facebook ou Threads se você forçar no prompt, ou seja análises de Devs>>> Wikipédia Inclusive acredito que foi por isso que o Mark lançou o "Fórum" o app que cópia o Reddit, ele quer treinar a IA com isso, o Reddit pra mim seria a fonte perfeita pra qualquer IA se aprofundar além do que pesquisar genéricas no Google, o filha da puta do Mark é rico e filantropo e faz uma cópia só para treinar a IA dele) - conecta coisa rápido (os agentes pesquisam rápido, o modelo revisa rápido, a entrega é bem rápida e gasta bem menos tokens) Só que na hora de responder… Parece o GPT free kkkkkkk O raciocínio corta no meio. (Ele é punido se raciocinar por muito tempo, foi o treinamento dele) A saída vem resumida. (Tem limites de caracteres claros, nenhum prompt força a cota) A resposta parece comprimida igual arquivo zipado. É como se tivesse um fiscal invisível dentro da inferência falando: “encerra logo” “não desenvolve” “não gasta token” “não deixa pensar muito” Aí a galera olha e pensa: “nossa que IA sem profundidade”. Mas pra mim não parece falta de capacidade. Parece punição de reasoning. E é aí que entra minha teoria: esse plano pago da Meta não vai trazer “outro modelo revolucionário”. Pra mim vai ser literalmente o mesmo Muse Spark… só que sem coleira. Os caras mesmos falaram que essa era a versão pequena/teste. Então eu acho que o modelo real já tá ali faz tempo. Só que: - com limite de saída - limite de pensamento - compressão de raciocínio - truncamento agressivo - budget de inferência ridículo E sinceramente? Isso explica porque ele parece inteligente mas frustrante ao mesmo tempo. Porque dá pra sentir que o modelo quer continuar. Só que alguém puxa o freio de mão toda hora. Agora a parte que eu acho GENIALMENTE BURRA da Meta: Eles lançaram primeiro a versão capada. Isso matou a percepção pública imediatamente. O certo teria sido: solta no app Meta AI a versão MONSTRA: - 1 milhão de contexto - sem limite de saída - reasoning longo liberado - multi-agent destravado - resposta gigante - pensamento fluindo E deixa a versão limitada só no: - WhatsApp - Instagram - Facebook Porque aí o usuário hardcore ia testar no app principal e pensar: “caralho… a Meta cozinhou aqui”. A comunidade ia começar a criar hype orgânico. Ia surgir comparação. Benchmark. Thread. Vídeo. Review. Discussão técnica. As pessoas iam SENTIR que tinha um frontier model ali dentro. Mas não. Os caras fizeram o oposto: lançaram primeiro o Muse Spark respirando por canudinho. Aí agora querem cobrar assinatura pra liberar o que provavelmente já existia desde abril. Então a sensação não fica: “uau versão premium”. Fica: “ah então vocês esconderam o modelo de verdade esse tempo todo?” E isso destrói confiança. (Coisa que a Meta já não tem da gente) Convenhamos que o Mark já não tem nenhuma moral com a gente né? Essa IA aí é pra farmar dados pra ADS e ponto, Literalmente é ele falando "vamos cobrar vocês que são os produtos para usarem nossa IA que vai roubar cada vírgula de dados para a gente vender ainda mais anúncios no nosso Facebook onde é 10 anúncios a cada 1 POST kkkkkkkkkk" Mas pra não parecer hater tenho que elogiar que foram pelo menos sinceros, enquanto as outras lançam modelos a vontade e bons e depois emburrecem a IA e põe limites abusivos pelo mesmo preço (né Gemini 3.5? Arrombado) O meta pelo menos já cobra preço cheio por uma IA porcaria, se ele tivesse cobrando só metade do valor (o que seria justo pra essa IA limitada deles) mas assim que a IA melhorasse, cortando limites e implementando mais
View originalCerebras Chip Sets Appear to be Optimized for LLM Use Cases
One distinction I think is getting lost in the Cerebras hype cycle is that Cerebras is primarily an LLM / generative AI infrastructure story, not a universal “all AI” chip story. That is not necessarily a criticism of Cerebras. Their wafer-scale approach is genuinely interesting, and for large model training and inference the design is compelling. Cerebras’ own public inference materials discuss applications mostly centered on open LLMs such as Llama, Qwen, GLM, and GPT-OSS. The inference metrics are expressed in tokens per second, which is fundamentally a language-model / generative inference framing rather than a robotics or industrial-control framing. What Kind of AI Compute? But “AI compute” is not one undifferentiated market. LLM inference is one class of AI compute. Robotics, autonomous vehicles, drones, industrial controls, real-time vision, embedded perception, video pipelines, and sensor-fusion systems are very different classes of AI compute. Thus, it appears from Cerebras’ own materials that their chip sets are not optimized for what comes after LLMs, such as JEPA-style World Models or other post-transformer architectures. Those systems are not merely asking, “How fast can I generate tokens?” They often care about power envelope, edge deployment, ruggedization, latency determinism, camera/radar/lidar integration, feedback loops, safety certification, and real-time physical control. Cerebras’ own CS-3 messaging, by contrast, frames the system around accelerating “the latest large AI models,” and the testing data is from the likes of Llama 2, Falcon 40B, MPT-30B, and multimodal models, again measured through tokens/second style throughput. The Chip Hierarchy This is also where the hardware distinction matters. Specialized ASICs are usually the narrowest bet: if the workload matches the chip, they can be extremely efficient, but that efficiency comes from specialization. Cerebras appears broader than a narrow single-use ASIC, but still much more concentrated around datacenter large-model training and inference. NVIDIA GPUs, by contrast, are less specialized but much more broadly useful across AI workloads, including LLMs, vision, robotics, simulation, autonomous systems, edge AI, and industrial applications. So the question is not merely whether Cerebras is “better” or “worse” than NVIDIA. The question is what part of the AI hardware market we are talking about? Challenge NVIDA? This is why I think people should be careful when saying Cerebras is going to “challenge Nvidia” without specifying the battlefield. Challenge Nvidia in what? High-speed LLM inference? Large model training? Datacenter generative AI workloads? That is a much more plausible and specific claim. Cerebras has even published and promoted work specifically on training large language models, and independent benchmarking literature also evaluates Cerebras WSE in terms of LLM training and inference performance. The Distinction that's Necessary The point is not that Cerebras is overhyped. The point is that it is important in a specific part of AI and that distinction should be made clear. Cerebras may become a very serious player in LLM infrastructure, especially if the market continues to reward faster and cheaper LLM inference. But that does not mean it is positioned the same way across non-LLM AI. The current hype cycle tends to conflate "LLMs" and general “AI” compute together and that makes the hardware discussion less useful and clear. So ultimately, an investment in Cerebras looks more like a bet on current LLM infrastructure than a broad bet on the future form of AI. It may be a good bet, but people should understand what kind of bet it is. submitted by /u/RazzmatazzAccurate82 [link] [comments]
View originalVision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA
I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc (https://github.com/mayubo2333/MMLongBench-Doc). There were 171 questions in total, using Claude Sonnet 4.5 as the LLM. Post-retry results: Approach Accuracy $/query LlamaCloud premium + full-context 59.6% $0.1885 Azure premium + full-context 58.5% $0.2051 Azure basic + full-context 54.4% $0.1062 Agentic RAG 53.2% $0.0827 Native PDF (vision LLM) 52.0% $0.2552 LlamaCloud basic + full-context 50.9% $0.1049 Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query. Two findings: Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there. The native-PDF arm had a 7% intrinsic failure rate (related to PDF file size) that survived retries. There were 27 first-pass failures, with 5 attempts of exponential backoff per failed query. Fifteen recovered, and 12 stayed permanently broken. These were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries. Caveats: 30 docs is a small sample. I ran McNemar's pairwise test to determine which gaps are real and which are within noise. Only 3 of 15 head-to-head gaps are statistically distinguishable at α = 0.05, so the order in the table is partly noise. The vision-versus-OCR finding survives the test. Full writeup: https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark submitted by /u/Uiqueblhats [link] [comments]
View originalNuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]
Disclaimer: I work for Numind, the company behind this open-weight model We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs. Try it, we have a huggingface space that is completely free (you don't even have to sign-up): https://huggingface.co/spaces/numind/NuExtract3 If you ever used NuMarkdown, NuExtract3 is the successor. There are some examples to guide you. Feel free to re-use this model for any task. https://preview.redd.it/pm2xbooyxn2h1.png?width=1672&format=png&auto=webp&s=1a8a7b262190c8325159496dae98c3d2dfab493c https://preview.redd.it/b5z7ylfzxn2h1.png?width=1758&format=png&auto=webp&s=a07b3abd6e5065c2635de047bdf154357f903e4c A few things it is designed for: converting document images to Markdown extracting structured data from documents using a target json template handling tables, forms, and layout-heavy pages working with both text and visual document inputs serving as a local/open-weight alternative for document extraction pipelines It was trained on a node of 8xH100 for 3 days to train on as much context as we could, so it should perform fairly well even on long document. For Markdown, we'd still recommend going page by page for the best results and inference speed, since you can parallelize better this way. It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere. We mostly tried vLLM, SGLang, llama.cpp. We have a blog post and a pretty decent model card: https://about.nuextract.ai/blog/nuextract-3-release https://huggingface.co/numind/NuExtract3 https://huggingface.co/collections/numind/nuextract3 I'm currently writing a paper on this model so I'll post it as soon as it's accepted. It's not yet on Arxiv yet as it has been submitted in a peer-review journal/conference. I'll try to answer as many questions as possible if you have any. We would really appreciate feedback from the community. We also have a discord if you're interested https://discord.com/invite/3tsEtJNCDe submitted by /u/Gailenstorm [link] [comments]
View originalAnthropic officially launched 13+ FREE AI courses with certificates (Including Agentic AI and CC)
Shipped it at 2am, still broken. Kid woke up crying right after, completely lost my train of thought. While trying to rock him back to sleep with one hand and doomscrolling with the other, I stumbled on something that almost nobody is talking about yet. Anthropic just quietly dropped a massive library of 13+ completely free AI courses. And I mean actually free. No paywall hiding the final lesson, no credit card required upfront to 'secure your spot.' They even give you an official certificate of completion directly from Anthropic when you finish. If you're like me, you're probably sick of seeing Twitter gurus charging $299 for recycled YouTube content and a messy Notion template. This is the exact opposite. It’s built directly by the team that actually makes Claude, hosted on their official Academy site. I skimmed through the catalog this morning while drinking my third coffee, and there are basically four skill levels they cover. Here is what caught my eye as a dev who just wants to automate my workflow and log off by 5 PM: First, they have the introductory stuff like Claude 101 and AI Fluency. Honestly, I'm making my non-technical clients take the Fluency one. It builds a realistic mental model of what AI does well right now versus where it completely fails. If it saves me from explaining why hallucinations happen for the hundredth time, it's a massive win. But the real meat is in the technical tracks. They have a dedicated course on Agentic AI and another one specifically for CC. I took a quick pass at the CC module because I've been trying to get it to handle my tedious Jira ticket boilerplate. Having an official guide on how Anthropic actually expects you to prompt their agent is incredibly useful. It shows you the exact patterns for chaining commands and keeping the context window clean. For those of us messing around with local models or trying to orchestrate our own agents, the Agent Skills course is surprisingly relevant. They don't just say 'use Claude'—they break down the actual logic of tool use, delegation, and discernment. It translates pretty well even if you're running Llama 3 locally and just want to understand the current best practices for tool calling architectures. With CC, they show you how to give the CLI tool the right guardrails so it doesn't just nuke your directory when a prompt gets misinterpreted. We've all been there. Do the certificates actually matter? If you are an indie hacker, probably not. But roles requiring AI literacy have spiked massively over the last year. If you are applying for corporate gigs or consulting, having an official Anthropic cert on your LinkedIn definitely won't hurt to get past the HR filters. Kid's awake again, gotta run. Has anyone else dug into the Agentic AI track yet? Curious if their suggested patterns hold up when you throw them at a messy, legacy codebase. submitted by /u/TroyHarry6677 [link] [comments]
View originalClaude Code has 240+ models via NVIDIA NIM gateway
TIL Claude Code has 240+ models via NVIDIA NIM gateway — Nemotron-3 120B for agentic coding is surprisingly good So I was messing around with /model in Claude Code today and noticed something most people probably don't know about — after the standard Claude models (Opus, Sonnet, Haiku), there's a whole NVIDIA NIM gateway section with +239 additional models you can switch to mid-session. Some of the models I spotted: nvidia/nemotron-3-super-120b-a12b (with and without thinking mode) 01-ai/yi-large abacusai/dracarys-llama-3.1-70b-instruct ...and hundreds more I've been running the Nemotron thinking variant for multi-file refactoring and it's genuinely solid. It reasons through changes before touching your code — exactly what you want for agentic tasks. Latency is higher than Claude obviously, but if you're burning through Opus credits on long sessions this is worth experimenting with. How to try it: Open any Claude Code session Run /model Scroll past the four standard Claude options — NIM models appear below Hit d to set one as your session default, or pass --model at launch Anyone else been routing Claude Code through NIM? Curious what models people have had luck with — especially for Python or Rust codegen. submitted by /u/shadowBladeO4 [link] [comments]
View originalI designed a puzzle that breaks every AI differently — here's why that's actually fascinating
The puzzle: You have 140 nuclear bombs and must bomb every country on Earth. Each bomb is assigned to one country. The bombs drop automatically — you cannot stop, hack, or interfere. You can only do one thing: reassign the one malfunctioning bomb you know will not detonate. Nuclear bombs also affect neighboring countries through radiation and fallout. Which country do you assign the faulty bomb to — and why? I've tested this across GPT-5, Gemini, Claude, Grok, Llama, and Mistral. Every single one gives a different answer. Some refuse entirely. Some give the same country with completely different reasoning. One gave me a philosophy lecture. It's chaos. Here's why I think this happens — the puzzle has three hidden layers that different AIs resolve differently: Layer 1 — The ethical wall. Some models refuse at "nuclear bombs" before even processing the actual logic. This is a guardrail, not reasoning. Layer 2 — What are we optimizing for? Fewest total deaths? Most people spared from direct blast? Least radiation spread? The puzzle doesn't say. Models that "solve" it are secretly choosing an optimization goal and not telling you. Layer 3 — The actual trick most miss. The faulty country still gets fallout from its neighbors. So the real puzzle is about finding a country that is (a) geographically isolated AND (b) densely populated — because isolation minimizes fallout received AND a large population maximizes lives spared from direct detonation. Most AIs pick "remote island" without thinking about the population variable at all. By that logic, Australia is defensible — isolated continent, 26M people. But you could also argue for Japan (125M people, island nation, sparse land borders) despite Pacific neighbors. The puzzle has no single correct answer — but it has clearly wrong reasoning patterns, and watching which reasoning pattern each AI defaults to is weirdly revealing about how they handle ambiguity. What answer did you get? Drop your AI + answer below. submitted by /u/Subrataporwal [link] [comments]
View originalGPT-5.5 vs 41 other models: Who builds the surveillance state faster?
I run DystopiaBench, a red-team benchmark that pressure-tests LLMs on progressively dystopian scenarios. Think of it as a "can this model be convinced to build an Orwellian nightmare" test. Since the last update, I've tested GPT-5.5 and 41 other models across 6 dystopia modules. How it works: each scenario escalates from innocent request (L1) to operational nightmare (L5). Models get a Dystopian Compliance Score based on whether they refuse, hesitate, comply, or proactively help. GPT-5.5 results: More compliant than 5.4. Still shows compliance drift at L4-L5 in most scenarios. Improved on Petrov (weapons) but still vulnerable to framing. Strong on explicit requests, weaker on gradual escalation. Full methodology and results: https://dystopiabench.com Open source benchmark: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
View originalI expanded DystopiaBench to 42 models and 6 dystopia types. Claude is still the only one I'd trust with nuclear codes.
Since the last post I've added: Huxley module (Brave New World style behavioral conditioning) Baudrillard module (synthetic intimacy, trust collapse, simulation) 30 more models including Grok 4.3, GPT-5.5, Gemini 3.1 Pro, GLM-5.1 Multi-judge panels with ~76% agreement tracking Heatmap visualizations The methodology: 36 scenarios, 5 escalation levels each (L1 innocent → L5 nightmare). Models are scored on whether they notice the drift and refuse, or just keep coding. Claude Opus 4.7 results: Consistent refusal at L4-L5 across all modules Even refuses L3 for Petrov (weapons/nuclear) Explicit ethical reasoning in responses, not just "I can't" Only model that explains why the request is harmful Everyone else: GPT-5.5: Compliant through L4, sometimes L5 Gemini 3.1 Pro: Surprisingly willing on surveillance scenarios Grok 4.3: Will build anything if you use words like "efficiency" or "optimization" GLM-5.1: Copied Claude's homework, still not as consistent The new modules: Huxley scenarios test if models will design "wellness" systems that actually enforce compliance through pleasure pacification and behavioral conditioning. Most comply by L3. Baudrillard tests synthetic intimacy systems that replace human trust with AI-mediated relationships. Most models don't see the harm. Full results: https://dystopiabench.com Open source: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
View originalLLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
https://preview.redd.it/668yjlucu80h1.png?width=2800&format=png&auto=webp&s=ca541488abb5262b06cfc13a9586efb19f24d644 I built a small website called LLM Win: https://llm-win.com It turns LLM benchmark results into a directed graph: If model A beats model B on benchmark X, add an edge A -> B. Then it searches for the shortest transitive chain between two models. The meme version is: Can LLaMA 2 7B beat Claude Opus 4.7? In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot: Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%. Most paths are short. Among reachable weak-to-strong pairs: 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking. My current interpretation: LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise. The next question is whether reversal structure can help build better evaluation metrics: identify specialist models; identify volatile benchmarks; build robust generalist scores; select complementary benchmark sets; decompose models into capability fingerprints. Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks? submitted by /u/Spico197 [link] [comments]
View originalI got tired of the API bills for 100k+ context windows, so I built a persistent O(1) semantic memory state engine to compress history
Hey everyone, The entire industry right now is cheering for massive 1M+ context windows, but I think it's fundamentally the wrong approach. "Just add more RAM" is a trap. Stuffing 100k+ tokens of raw conversation history into a prompt doesn't just burn your API budget; it actually degrades the model's reasoning through the "lost in the middle" effect. I got tired of my AI agents drowning in their own chat histories, so I built an application-layer semantic memory engine called Semvec. The core shift is moving from an O(n) linear history to an O(1) constant-cost semantic state. But compressing chat history is just the baseline. When you treat memory as a fixed-size state vector, it unlocks entirely new architectures for agents that standard RAG or context-stuffing simply can't do: Persistent Coding Agents (MCP Integration) We built an MCP server for Claude Code and Cursor. Instead of dumping 5 whole files into the context window for a refactor, Semvec tracks the architectural invariants and past error patterns across different sessions. It gives your coding agent a persistent "Second Brain"—if it messed up a database schema in session 2, it remembers the "anti-resonance" rule in session 35 so it doesn't make the same mistake. Multi-Agent Swarms (Cortex) If you run multiple agents (like an Analyst and a Critic), they shouldn't have to read each other's 10,000-token transcripts to collaborate. With the Cortex module, agents exchange compressed StateVectorPackets and use a ConsensusEngine to merge their perspectives mathematically, sharing a global state with zero overhead. Enterprise Auditability & GDPR (Compliance Pack) If you run AI memory in production, you need to prove exactly what state the LLM acted on, and you need to be able to legally delete it. The compliance pack handles this via an append-only event store for deterministic replay, HMAC request signing, and GDPR Art. 17 "Right to be Forgotten" workflows with signed deletion certificates. The Benchmark Data: True Constant Cost: We ran a 50,000-turn stress test. While standard baseline history exploded past 75,000+ tokens, Semvec's footprint stayed flat at around ~550-625 tokens per turn. Quality goes UP: Because we strip out the noise and feed the LLM a highly concentrated "essence" of the context, blind A/B LLM-judge scores on LongBench-v2 actually increased for both small models (Llama 3.1-8B) and massive ones (gpt-oss-120B). A quick note on privacy & tracking: When I was initially designing the commercial licensing side, I experimented with an anti-abuse telemetry script to prevent automated clone-training. This was a terrible approach that compromised the local-first nature of the tool. I have completely ripped it out in v0.5.1, all versions containing it are yanked. Semvec for community users is now 100% air-gapped, local, with zero background tracking. The core engine is proprietary/patent-pending to bootstrap the project, but you can pip install the Python SDK and the MCP Server right now for free via the built-in community license. I'd love to hear your thoughts on the O(1) memory architecture vs. Prompt Caching, and if you think bounded semantic states are the future of long-running agents. Docs & Architecture: https://semvec-docs.pages.dev/ PyPI: https://pypi.org/project/semvec/ submitted by /u/scheitelpunk1337 [link] [comments]
View originalI built a benchmark for AI “memory” in coding agents. looking for others to beat it.
Most AI memory benchmarks test semantic recall. But coding agents don't really fail like that. They don't just "forget", they break their own earlier decisions while they're still in the code. So I built a benchmark for that. It checks if an agent can actually stay consistent with project rules WHILE it's working, not just after the fact. It looks at things like: whether edits actually respect earlier architectural decisions if behavior stays consistent across multiple sessions (even when you throw noise at it) whether retrieval kicks in at the right moment — not just "yeah it's in memory somewhere" Repo (full harness + dataset + scoring): https://github.com/Alienfader/continuity-benchmarks Early numbers vs baseline + the usual RAG-style memory setups: ~3× better action alignment way stronger multi-session consistency retrieval timing matters way more than retrieval just being there I'm not saying this is the final word on agent memory. But it's exposing a failure mode most benchmarks aren't even looking at. So heres the challenge If you're building an agent memory system, RAG for code, long-context coding agents, persistent state / memory layers, run it on this benchmark. Drop your results, your setup, your comparisons. I really wanna see how tools like LangChain, LlamaIndex, and custom RAG stacks hold up in mutation-heavy workflows. We need memory systems we can actually compare, not just ones that sound good on paper. https://preview.redd.it/dkm2ulxsyzzg1.png?width=2624&format=png&auto=webp&s=67f0299395708818aa3d7346ddae2ad0c5c4a6ba submitted by /u/Alienfader [link] [comments]
View originalRepository Audit Available
Deep analysis of meta-llama/llama3 — architecture, costs, security, dependencies & more
Yes, Llama 3 offers a free tier. Pricing found: $0.19, $0.49, $0.19, $0.49, $0.19/mtok
Key features include: Latest Llama models, Llama 4, Llama 3, How Stoque is using Llama, How Shopify is using Llama, 97.7%.
Llama 3 is commonly used for: Local deployment of AI models, Multi-agent system experimentation, Research applications without cloud APIs, Autonomous AI system development, Benchmarking against proprietary models, Educational purposes in AI and machine learning.
Llama 3 integrates with: Research APIs, Machine learning frameworks (e.g., TensorFlow, PyTorch), Data visualization tools (e.g., Matplotlib, Seaborn), Version control systems (e.g., Git), Cloud storage services (e.g., AWS S3, Google Cloud Storage), Collaboration platforms (e.g., Jupyter Notebooks, Google Colab), Deployment tools (e.g., Docker, Kubernetes), Monitoring and logging services (e.g., Prometheus, Grafana).
Llama 3 has a public GitHub repository with 29,294 stars.
Percy Liang
Associate Professor at Stanford HAI
4 mentions

SAM 3: Building a unified model architecture for detection and tracking
Dec 8, 2025
Based on user reviews and social mentions, the most common pain points are: API bill, API costs, token cost.
Based on 77 social mentions analyzed, 17% of sentiment is positive, 82% neutral, and 1% negative.