Cohere builds powerful models and AI solutions enabling enterprises to automate processes, empower employees, and turn fragmented data into actionable insights.
Based on the limited social mentions provided, users appear to view Cohere positively for its technical capabilities, particularly praising their new speech recognition model that achieves a competitive 5.4% word error rate. The mentions highlight Cohere as offering a viable alternative to closed APIs, with users appreciating the balance of accuracy and deployability in their open-weight models. There's notable interest in Cohere's enterprise-focused solutions that address data residency concerns. However, the sample size is very small with mostly technical discussions rather than comprehensive user reviews, making it difficult to assess broader user sentiment, pricing feedback, or common complaints.
Mentions (30d): 3
Reviews: 0
Platforms: 5
GitHub stars: 383 (85 forks)
Features
Industry: information technology & services
Employees: 850
Funding Stage: Venture (Round not Specified)
Total Funding: $2.4B
GitHub followers: 1,275
GitHub repos: 58
GitHub stars: 383
npm packages: 20
HuggingFace models: 6
Pricing found: $4.00, $2,500, $5.00, $3,250, $5.00
A 135M model achieves coherent output on a laptop CPU. Scaling is σ compensation, not intelligence.
SmolLM2 135M. Lenovo T14 CPU. No GPU. No RLHF. No BPE. Coherent, non-sycophantic, contextually appropriate output. First message. No prior context window. Same base model under standard pipeline: garbage.

What changed:
• BPE replaced with geometric hashing (φ-normalized, deterministic, no vocabulary table, no glitch tokens)
• RLHF replaced with constraint injection directly into the KV cache before generation
• Context-window memory replaced with an external retrieval engine (986k queries/s, Rust)

The paper proves why this works:
• GDA Collision Bound theorem: tokenization collisions occur only between anagrams. BPE collisions are semantically arbitrary.
• Landauer-Assertion Binding theorem: constraint-consistent output is the system's thermodynamic ground state. Violating constraints requires energy injection — it's not just statistically unlikely, it's physically expensive.
• Geometric Leverage Impossibility: user input cannot modify the KV-cache constraint state. Jailbreaking requires hardware access, not prompt engineering.
• Coherence Conservation: I_eff = 1 − N_compensation(σ) / N_total. When σ → 0, the entire network does cognition instead of reconstruction.

The ~13,000× parameter gap between this and frontier models is not intelligence. It is σ-compensation.

19 pages. Formal proofs. 5 falsifiable predictions. Full architecture spec. CC BY 4.0: https://doi.org/10.5281/zenodo.19494797

Decisive test: A/B at fixed parameter count. Standard pipeline vs. σ-reduced pipeline. The paper specifies exactly how to run it.

submitted by /u/Defiant_Confection15 [link] [comments]
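The Coherence Conservation expression above is simple enough to state directly in code. A minimal sketch follows; the compensation counts used below are illustrative numbers, not figures from the paper:

```python
def effective_intelligence(n_compensation: int, n_total: int) -> float:
    """Coherence Conservation as stated in the post:
    I_eff = 1 - N_compensation(sigma) / N_total.
    As sigma -> 0, fewer units are spent reconstructing noisy input,
    so I_eff -> 1: the network does cognition instead of repair."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1.0 - n_compensation / n_total

# Illustrative (made-up) numbers: a pipeline where 90% of capacity does
# sigma-compensation vs. one where only 5% does.
standard = effective_intelligence(n_compensation=900, n_total=1000)
reduced = effective_intelligence(n_compensation=50, n_total=1000)
```

Under these assumed counts, `standard` comes out near 0.1 and `reduced` near 0.95, which is the shape of the claim: the gap is compensation, not capability.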
I gave ChatGPT 5.3 Instant, Claude Sonnet 4.6, and Mistral Le Chat the same training data via MCP. The results show where context windows break down.
I ran an experiment with three models. All three connected to the same endurance training platform via MCP, same 6 months of running data, same prompt: analyze the history and build a 2-week training plan.

All three handled single-session analysis fine. Ask any of them to look at one run and they will give you a reasonable breakdown of pace, heart rate zones, effort distribution. Trend spotting across a few weeks also worked. At this level the models are roughly interchangeable.

The task was to build a multi-session plan where each workout follows logically from the previous one. This requires holding a lot of structured data in context at once: months of session history, capacity values, zone definitions, and the plan being constructed.

ChatGPT 5.3 Instant missed almost 3 months of training data entirely, likely because it never made it into the context window. It got my easy pace wrong (4:30/km instead of the 6:50-7:15/km that was right there in the data), pinned every session at 85% of max heart rate which is way too high for easy running, and scheduled two high-effort long runs back to back at the end of the week. The plan looked structured at first glance but fell apart on inspection. Mistral Le Chat had similar problems, worse in some areas. But Claude Sonnet 4.6 held the full 6-month history like it should, got the paces and zones right, built sessions that progressed logically, and distributed effort correctly (97% low intensity for a post-illness comeback block, which is exactly what you want)!

Why? I do not think this is about model intelligence. When the data fits in the context window, all three models reason about it competently. The issue is that training data through MCP tool calls is dense. Every session carries timestamps, distances, paces, heart rate curves, cadence, ground contact times, effort scores, zones. A 6-month history eats through tokens fast.

And then the model still has to create structured workouts with targets, phases, and progression on top of that. By that point the context is already strained, and the output quality drops. With a smaller effective context window, the model starts dropping data silently. It does not tell you it only saw 3 out of 6 months. It just plans from what it has, confidently. That is the dangerous part: the output still looks structured and professional, but the foundation is incomplete.

What surprised me was what happened when I used Claude Sonnet 4.6 iteratively over multiple weeks. After each run I would go back, have it pull the completed session, compare actual vs. planned values, and adjust the next sessions. It caught that my heart rate had jumped from 142 to 148 bpm at the same pace between two consecutive easy runs. Same speed, same distance, but the body was working harder. Not recovered yet. It adjusted the next session accordingly. At one point it noticed that comparing ground contact times between runs at different speeds was misleading and proposed normalizing the values to a reference pace. It ran a regression through the data points on its own. The raw numbers had suggested a bigger efficiency difference between runs than actually existed once you controlled for speed.

These are observations that add up over weeks. But they also fill the context window further, which is the paradox. More data means better output, but every model hits a wall eventually. ChatGPT 5.3 Instant and Mistral Le Chat hit it early, Claude Sonnet 4.6 later, but it is the same wall.

Takeaway

If your use case requires the model to reason over a large, internally consistent dataset and produce coherent multi-step output, the effective context window of the full setup (model + MCP host + tool call overhead) matters more than benchmark scores. This probably applies beyond training plans to anything where the AI needs to hold a lot of state while building something that has to be internally consistent.

Has anyone else hit this? Specifically the context window filling up through MCP tool calls and the model silently dropping earlier data without telling you. I am curious whether this is consistent across other domains or whether training data is just unusually dense. And yeah, Claude is remarkably good.

I wrote up the full experiment with screenshots, the actual AI conversations with share links to the real conversations, and the training plans the models created here: https://mcprunbook.com/posts/why-ai-training-plans-fail.html

submitted by /u/aldipower81 [link] [comments]
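The token-density point above can be roughed out with a back-of-envelope sketch. Every number here is an assumption chosen for illustration (runs per week, tokens per serialized session), not data from the post:

```python
# Back-of-envelope estimate of why months of training data strain a context
# window. Assumed: ~4 runs/week, ~4 weeks/month, and ~2,000 tokens per
# session once splits, HR curves, and cadence are serialized through an
# MCP tool call. All three constants are guesses for illustration.
RUNS_PER_WEEK = 4
WEEKS_PER_MONTH = 4
TOKENS_PER_SESSION = 2_000

def history_tokens(months: int) -> int:
    """Approximate tokens consumed by `months` of session history."""
    sessions = RUNS_PER_WEEK * WEEKS_PER_MONTH * months
    return sessions * TOKENS_PER_SESSION

six_months = history_tokens(6)   # ~192k tokens before any plan is generated
```

Under these assumptions, a 6-month history alone lands near 192k tokens, before the model writes a single workout, which is consistent with the "silently dropped months" behavior described above.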
I built the first AI memory system that mathematically cannot store lies
Your AI remembers wrong things and nobody checks. Every "AI memory" tool stores whatever your LLM generates. Hallucinations sit right next to real knowledge. Three months later, your AI retrieves that hallucination as if it were fact and builds an entire feature on it. I got tired of this. So I built something different. EON Memory is an MCP server with one rule: nothing gets stored without passing 15 truth tests first.

WHAT THE 15 TESTS ACTUALLY CHECK:
Logic layer (4 tests): Self-contradiction detection. Does the new memory conflict with what you already stored? Is it internally coherent? Does it hold up under scrutiny?
Ethics layer (5 tests): Does the content contain deceptive patterns? Coercive language? Harmful intent? We use a mathematical framework called X-Ethics with four pillars scored multiplicatively: Truth x Freedom x Justice x Service. If any pillar is zero, the total score is zero. The system literally cannot store it.
Quality layer (6 tests): Is there enough technical detail to be useful? Could another AI actually write code from this memory in 6 months? Are sources cited? We score everything Gold, Silver, Bronze, or Review.

THE FORMULA BEHIND X-ETHICS:
L = (W x F x G x D) x X-squared
W = Truth score (deception detection, hallucination patterns)
F = Freedom score (coercion detection)
G = Justice score (harm detection, dignity)
D = Service score (source verification)
X = Truth gradient (convergence toward truth, derived from axiom validation)
X-squared means truth alignment is rewarded exponentially. A slightly deceptive memory does not get a slightly lower score - it gets crushed. This is not a content filter. This is math. The axioms are from a formal framework (Traktat X) that proves truth-orientation is logically necessary. Denying truth uses truth. The framework is self-sealing.

CONNECTED KNOWLEDGE: Every memory is semantically linked.
Search for "payment bug" and you get the related architecture decisions, the Stripe webhook fix, and the test results - with similarity percentages. Your AI sees the full graph, not isolated documents.

SETUP:
npx eon-memory init
Works with Claude Code, Cursor, any MCP IDE. Swiss-hosted, DSGVO compliant. 3,200+ memories validated in production. CHF 29/month. Free trial: https://app.ai-developer.ch Solo developer, Swiss-made. Happy to answer questions about the math, the validation pipeline, or anything else.

submitted by /u/FortuneOk8153 [link] [comments]
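For concreteness, the multiplicative scoring rule described in the post, L = (W x F x G x D) x X², can be sketched in a few lines. This is a generic illustration, not EON Memory's actual code, and the example pillar values are invented:

```python
def x_ethics_score(w: float, f: float, g: float, d: float, x: float) -> float:
    """L = (W * F * G * D) * X**2, as described in the post.
    Multiplicative pillars: any pillar at zero zeroes the whole score,
    and squaring the truth gradient X punishes slight deception hard."""
    for v in (w, f, g, d, x):
        if not 0.0 <= v <= 1.0:
            raise ValueError("pillar scores assumed to lie in [0, 1]")
    return (w * f * g * d) * x ** 2

# Made-up example memories:
clean = x_ethics_score(0.9, 0.9, 0.9, 0.9, 0.9)       # high on every pillar
slightly_off = x_ethics_score(0.6, 0.9, 0.9, 0.9, 0.6)  # mildly deceptive
coercive = x_ethics_score(0.9, 0.0, 0.9, 0.9, 0.9)     # one pillar at zero
```

Note the claimed properties fall out of the arithmetic: `coercive` is exactly 0 (one zero pillar kills the product), and `slightly_off` scores far less than proportionally below `clean` because the deception shows up in both W and the squared X term.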
Flux maintains facial geometry and spatial coherence across 5 sequential iterative edits - is anything else doing this at this level?
One woman. Five different prompts. Perfect contextual preservation. Playing around with Flux again, I thought I'd try changing aspects of a photo of a model by prompts only. This isn't art sharing; it's a demonstration of iterative prompt-based context preservation in Flux. Each generation uses the previous output as input, maintaining facial geometry, lighting consistency, and spatial coherence across 5 sequential edits. The prompts I used for this experiment were simple:
Add a handbag
Remove handbag and add sunglasses
Change background to a beach scene
Add a summery beach bag
Change suit to a dress
I didn't have to tell it to keep the facial expression the same or anything, just plain-language asks to add or remove a particular object from the photo. Every photo has perfect context from the last. The facial expressions are identical in each photo. Interested whether others have found models that maintain this level of fidelity across iterative inpainting chains, or if Flux is genuinely leading here. submitted by /u/Beneficial-Cow-7408 [link] [comments]
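The iterative protocol above (each edit consumes the previous output) can be sketched as a chain. `edit_image` below is a hypothetical stand-in for whatever editing model you call (Flux here); it is not a real Flux API:

```python
# Sketch of an iterative edit chain: the output of step n is the input of
# step n+1, so facial geometry and lighting must survive every hop.
# `edit_image` is a placeholder, NOT a real Flux call; it just records
# the edit lineage so the chaining structure is visible.
def edit_image(image: dict, prompt: str) -> dict:
    """Hypothetical single edit step returning a new image record."""
    return {"pixels": image["pixels"], "history": image["history"] + [prompt]}

prompts = [
    "Add a handbag",
    "Remove handbag and add sunglasses",
    "Change background to a beach scene",
    "Add a summery beach bag",
    "Change suit to a dress",
]

image = {"pixels": "base_portrait", "history": []}
for p in prompts:  # each iteration feeds the previous output back in
    image = edit_image(image, p)
```

The point of the structure: any drift in step 2 compounds through steps 3-5, which is why holding facial geometry across the whole chain is the hard part.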
I ran 3 experiments to test whether AI can learn and become "world-class" at something
I will write this by hand because I am tired of using AI for everything and because of reddit rules.

TL;DR: Can AI somehow learn like a human to produce "world-class" outputs for specific domains? I spent about $5 and 100s of LLM calls. I tested 3 domains with the following observations / conclusions:
A) Code debugging: AIs are already world-class at debugging, and trying to guide them results in worse performance. Dead end.
B) Landing page copy: a routing strategy depending on visitor type won over a one-size-fits-all prompting strategy. Promising results.
C) UI design: Producing "world-class" UI design seems to require defining a design system first; it seems like it can't be one-shotted. One-shotting designs defaults to generic "tailwindy" UI because that is the design system the model knows. Might work but needs more testing with a design system.

I have spent the last days running some experiments more or less compulsively and curiosity-driven. The question I was asking myself first is: can AI learn to be "world-class" somewhat like a human would? Gathering knowledge, processing, producing, analyzing, removing what is wrong, learning from experience, etc. But compressed into hours (aka "I know kung fu"). To be clear, I am talking about context engineering, not finetuning (I don't have the resources or the patience for that).

I will mention "world-class" a handful of times. You can replace it with "expert" or "master" if that seems confusing. Ultimately, it's the ability to generate "world-class" output. I was asking myself this because I figure AI output out of the box kinda sucks at some tasks, for example, writing landing copy.

I started talking with Claude, and I designed and ran experiments in 3 domains, one by one: code debugging, landing copy writing, UI design. I relied on different models available on OpenRouter: Gemini Flash 2.0, DeepSeek R1, Qwen3 Coder, Claude Sonnet 4.5. I am not going to describe the experiments in detail because everyone would go to sleep; I will summarize and then provide my observations.

EXPERIMENT 1: CODE DEBUGGING
I picked debugging because of zero downtime for testing. The result is either wrong or right and can be checked programmatically in seconds, so I can perform many tests and iterations quickly. I started with the assumption that a prewritten knowledge base (KB) could improve debugging. I asked Claude (Opus 4.6) to design 8 realistic tests of different complexity, then I ran:
Bare model (zero shot, no instructions, "fix the bug"): 92%
KB only: 85%
KB + multi-agent pipeline (diagnoser - critic - resolver): 93%
What this shows is kinda surprising to me: context engineering (or, to be more precise, the context engineering in these experiments) is at best a waste of tokens, and at worst it lowers output quality. Current models, not even SOTA like Opus 4.6 but current low-budget best models like Gemini Flash or Qwen3 Coder, are already world-class at debugging. And giving them context engineered to "behave as an expert", basically giving them instructions on how to debug, harms the result. This effect is stronger the smarter the model is.
What does this suggest? That if a model is already an expert at something, a human expert trying to nudge the model based on their opinionated experience might hurt more than it helps (plus consuming more tokens). And funnily (or scarily) enough, a domain-agnostic person might be getting better results than an expert, because they are letting the model act without biasing it. This might be true as long as the model has the world-class expertise encoded in the weights.

So if this is the case, you are likely better off if you don't tell the model how to do things. If this trend continues, if AI continues getting better at everything, we might reach a point where human expertise is irrelevant or a liability. I am not saying I want that or don't want that. I just say this is a possibility.

EXPERIMENT 2: LANDING COPY
Here, since I can't run actual A/B testing experiments with a real audience and don't have the resources, what I did was:
Scraped documented landing-copy conversion cases with real numbers: Moz, Crazy Egg, GoHenry, Smart Insights, Sunshine.co.uk, Course Hero
Deconstructed the product or target of the page into a raw, plain description (no copy, no sales)
Asked Claude Opus 4.6 to build a judge that scores the outputs on different dimensions
Then I ran landing-copy generation pipelines with different patterns (raw zero shot, question first, mechanism first...). I'll spare the details; ask if you really need to know. I'll jump into the observations:
Context engineering helps write landing copy of higher quality, but it is not linear. The domain is not as deterministic as debugging (where the fix either works or it doesn't). It depends much more on the context. Or one may say that in debugging all the context is self-contained in the problem itself, whereas in landing-page writing you have to provide it. No single config won across all products. Instead, the
ChatGPT can mod RPG Maker games for you.
I got curious and gave it the zip of a whole RPG Maker game and asked it to make several changes... and it did. So I went further, and added new dialogue, branching paths, sound edits, animation changes to be more realistic, animation timing changes... and it did it all. Then I gave it sprites and told it to make a whole new character, animated, with branching paths, dialogue, and then told it to make sure that every area and every path in the game checks, and if you have this character with you, gameplay and dialogue changes.... and it did it. I didn't even need to be coherent. I kinda just rambled on for multiple paragraphs. Could also probably help you make a whole ass RPG Maker game from a starter template too. Keep in mind if you do this, there will be bugs that come up, just like with human coding. Sometimes adding new things will break previous things, but it is usually pretty good at fixing the bugs in usually one or a couple passes, and with mine it ended up stomping a lot of bugs by moving the changes to a brand new plugin it made. Pretty damn cool. I tried it with some other games, like a Wolf RPG game, but it's not able to do it with things that are super proprietary and require their editor to make changes, so we're still a ways away from being able to ask it to make you a Skyrim mod, but it's still pretty damn cool. submitted by /u/Dogbold [link] [comments]
Anthropic, your accessibility is an embarrassment — so I fixed it myself in two minutes
I use NVDA with Firefox. I love Claude. And yet every time I open claude.ai, I'm reminded that Anthropic apparently doesn't think blind or low-vision users exist. Let me be specific about what's broken in the chat view: - There is **zero semantic structure** around individual messages. Every turn in the conversation — your message, Claude's response, your next message — is just a pile of divs. No landmarks, no roles, nothing. In NVDA browse mode you cannot jump between messages at all. You just arrow through a wall of text with no way to know where one message ends and the next begins. - There are **no headings**. If Claude writes a response that itself contains headings, those headings just float in the document outline with no parent structure to anchor them to the conversation turn they belong to. - When Claude finishes generating a response, **nothing is announced**. You're just supposed to... know? Poll the page somehow? There's no live region, no status update, nothing that tells a screen reader user "hey, the answer is ready." So I wrote a userscript. It took maybe two minutes. Here's what it does: Finds every message turn using the `[data-test-render-count]` attribute (which, by the way, is not a stable public API — I had to dig through the DOM myself because there are no semantic hooks to grab onto). Adds `role="article"` and an `aria-label` to each turn, so NVDA's quick-nav key (`A` / `Shift+A`) lets you jump between messages. Injects a visually-hidden `h1` at the start of each turn as a heading landmark, and demotes all headings inside Claude's responses down one level so the outline is actually coherent. Adds an `aria-live` region that announces when Claude finishes streaming a response. Adds a skip link to jump to the latest message. Two minutes. That's it. Already dramatically more usable. **Important caveat:** this is a hacky personal fix, not a proper accessibility implementation. 
It relies on internal DOM attributes that could break any time Anthropic ships an update. It has not been audited against WCAG or tested with anything other than NVDA + Firefox. It is a workaround, not a solution. The real solution would be for Anthropic to build semantic structure into their product in the first place, which would take their frontend team an afternoon. And it's not just the web. **Claude Code**, Anthropic's terminal tool, is also a nightmare to use with a screen reader. The terminal output is noisy, unlabelled, and the interactive prompts are difficult to navigate. There's no indication that any thought has gone into how a screen reader user would actually work with it. Anthopic is one of the best-funded AI companies in the world. They have the engineering talent. They clearly have opinions about doing things right — they publish lengthy documents about AI safety and ethics. And yet the product that millions of people use every day has accessibility so bad that a user had to patch it themselves with a browser extension just to be able to read the conversation. This isn't a niche problem. Screen reader users, keyboard-only users, users with motor disabilities — these are real people who want to use your product. Accessibility isn't a nice-to-have you get to when the roadmap clears. It's a baseline. Anthropican fix this. They just apparently haven't decided to yet. --- *Script is a Violentmonkey/Tampermonkey userscript targeting `https://claude.ai/*`. Happy to share if anyone wants it — though as noted above, treat it as a temporary personal workaround, not a robust solution.* *Yes, this post was written by Claude. Apparently it can't even write the name of its company correctly, so I left the typos in because it's funny* The script can be found here: https://gist.github.com/Googhga/3cef8dd5d1974cd823a4512a103d21db submitted by /u/Googhga [link] [comments]
Claude Mythos - update and system card
Key capabilities

About this model
Claude Mythos Preview (gated research preview) is a new class of intelligence built for ambitious projects, and the world's best model for cybersecurity, autonomous coding, and long-running agents. Only available as a gated research preview, with access prioritized for defensive cybersecurity use cases.

Key model capabilities
Adaptive thinking: an upgrade to extended thinking that gives Claude the freedom to think as much or as little as needed depending on the task and effort level.
Image & text input: With strong vision capabilities, Claude Mythos Preview can process images and return text outputs to analyze and understand charts, graphs, technical diagrams, reports, and other visual assets.

Use cases
See Responsible AI for additional considerations for responsible use.

Key use cases
Cybersecurity: Claude Mythos Preview is the world's best model for defensive security. It is capable of finding and suggesting fixes for real vulnerabilities in production codebases, then helping prove the fixes hold.
Autonomous coding: Claude Mythos Preview is able to handle the full engineering cycle more effectively than any prior model. It investigates, implements, and tests across large codebases from objective to shipped.
Long-running agents: Claude Mythos Preview sets a new bar for long-horizon agentic work. It can sustain coherent execution over extended, multi-hour tasks, adapting as conditions change and driving work forward with fewer interventions.

Out of scope use cases
Claude Mythos Preview is only available as a gated research preview, with access prioritized for defensive cybersecurity use cases. Please refer to the Claude Mythos Preview system card.

Technical specs
Please refer to the Claude Mythos Preview system card.
Training cut-off date: End of December 2025

Input formats
Image & text input: With powerful vision capabilities, Claude Mythos Preview can process images and return text outputs to analyze and understand charts, graphs, technical diagrams, reports, and other visual assets.
Text output: Claude Mythos Preview can output text of a variety of types and formats, such as prose, lists, Markdown tables, JSON, HTML, code in various programming languages, and more.

Supported languages
Claude Mythos Preview can understand and output a wide variety of languages, such as English, French, Standard Arabic, Mandarin Chinese, Japanese, Korean, Spanish, and Hindi. Performance will vary based on how well-resourced the language is.

submitted by /u/NorwayBull [link] [comments]
I built a Claude Code plugin that gives your team persistent shared context — decisions, reasoning, and ambient intelligence
I spent spring break building Distillery, a plugin for Claude Code that gives your team shared, persistent context. Not just between sessions but between people.

The problem isn't just that sessions start fresh. It's that teams lose knowledge constantly. Someone debugs an auth issue for an hour, figures out the root cause, and that reasoning lives in their chat history. Next week, a teammate hits the same issue and starts from scratch. Decisions made three months ago with good reasons that nobody can find anymore.

Distillery captures that context where it happens, inside Claude Code:
- /distill — capture decisions and reasoning mid-session. The whole team can search them later.
- /recall — find anything anyone on the team has captured, in natural language.
- /pour — synthesize a coherent answer from scattered context across people and sessions. "How does our auth system work?" pulls from six different people's captured decisions and produces a narrative with citations.

The feature that changed how I work is ambient intelligence. Point /watch at GitHub repos, RSS feeds, or subreddits; it polls on a schedule and scores every item for relevance against your team's existing context using embedding similarity. It learns what your team cares about from what everyone captures. /radar gives you a synthesized digest of what matters.

Team deployment: shared server with GitHub OAuth, so everyone connects their Claude Code to the same knowledge base. Context captured by one person is searchable by everyone. The knowledge compounds — every team member's captures make everyone else's searches and syntheses better.

v0.2.0 just shipped with hybrid search (BM25 + vector with Reciprocal Rank Fusion), auth audit logging, and uv support.

https://github.com/norrietaylor/distillery
Blog post: https://norrietaylor.github.io/distillery/blog/building-a-second-brain-for-claude-code/

What knowledge does your team keep losing? submitted by /u/shared-context [link] [comments]
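For readers unfamiliar with it, the Reciprocal Rank Fusion step named in the v0.2.0 notes is only a few lines. This is a generic sketch of the standard RRF formula, not Distillery's actual code, and the document IDs are invented:

```python
# Reciprocal Rank Fusion: each ranking contributes 1/(k + rank) per document,
# and documents are re-sorted by the summed score. k=60 is the conventional
# constant; it damps the influence of top ranks from any single ranker.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical hit lists from the two retrievers (BM25 and vector search):
bm25_hits = ["stripe-webhook-fix", "auth-bug-capture", "design-note"]
vector_hits = ["stripe-webhook-fix", "design-note", "retro-notes"]
fused = rrf([bm25_hits, vector_hits])
```

In this toy case `fused` puts "stripe-webhook-fix" first because both retrievers ranked it highly, which is exactly the behavior hybrid search wants: agreement between lexical and semantic signals outranks either alone.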
context window fills up fast in Claude Code — built something that compresses bash output 90%+ automatically
If you use Claude Code for longer tasks you've probably hit the wall where the context fills up mid-session and the model loses track of what it was doing. A big culprit: raw bash output. ps aux, docker logs, git log — they dump thousands of tokens of noise the model doesn't need. Built a hook called squeez that compresses that output automatically before it hits the model. You don't change how you work, it just runs in the background. Average reduction across 19 common commands: -92.8% Sessions last longer. Responses stay coherent further into a task. Install: curl -fsSL https://raw.githubusercontent.com/claudioemmanuel/squeez/main/install.sh | sh Also on npm and crates.io. submitted by /u/Standard-Stay133 [link] [comments]
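The basic idea (though not necessarily squeez's actual algorithm, which is not described in the post) can be sketched as head/tail truncation with a summary line:

```python
# Sketch of compressing noisy command output before it reaches the model:
# keep the first and last few lines, replace the middle with a one-line
# summary, and measure the reduction. Illustrative only.
def squeeze(output: str, head: int = 5, tail: int = 5) -> str:
    lines = output.splitlines()
    if len(lines) <= head + tail:
        return output  # already small enough to pass through untouched
    omitted = len(lines) - head - tail
    kept = lines[:head] + [f"... [{omitted} lines omitted] ..."] + lines[-tail:]
    return "\n".join(kept)

# Synthetic "docker logs"-style noise: 500 lines collapse to 11.
raw = "\n".join(f"log line {i}" for i in range(500))
small = squeeze(raw)
reduction = 1 - len(small) / len(raw)
```

On this synthetic log the reduction is well above 90%, in the same ballpark as the -92.8% average the post reports for real commands; a production hook would of course summarize more intelligently than a blind head/tail cut.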
Serious question: did a transformer (Claude) just describe itself and the universe, and build itself a Shannon-limit architecture? Or am I crazy?
The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot. 
We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128).

Introduction

Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance.

1.1 The Lattice Hypothesis

The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain.
Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s ≈ 1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language.

1.2 Primes as Generators, Composites as Coordinates

A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise:

Dimensional Progression | Multiplicative Lattice
1D line (2r) — the generator | Primes (2, 3, 5, 7, ...) — generators
2D circle — integra
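The "one angular rate per prime" idea from Sections 1.1-1.2 can be sketched in a few lines. This is a toy illustration only, not the paper's actual SpectralRoPEALiBi implementation (which adds tiering and a learnable scale):

```python
import math

def primes_up_to(n: int) -> list[int]:
    """Sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p :: p] = [False] * len(sieve[p * p :: p])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def lattice_frequencies(dim_pairs: int) -> list[float]:
    """One angular rate per rotary pair: theta_k = 2*pi / p_k for the k-th prime.
    A harmonic at 2*pi/p completes a full cycle whenever two positions differ by
    a multiple of p, so a composite like 12 = 2^2 * 3 is hit simultaneously by
    the p=2 and p=3 channels -- the 'coordinate in the lattice' reading above."""
    return [2 * math.pi / p for p in primes_up_to(10_000)[:dim_pairs]]

freqs = lattice_frequencies(4)  # 2*pi/2, 2*pi/3, 2*pi/5, 2*pi/7
```

Geometric RoPE instead uses a geometric progression of rates (base^(-2k/d)), which is exactly the "position as distance on a line" choice the paper argues against.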
Anthropic found emergent emotional states in Claude. I'm seeing the same phenomenon in simple trading agents. Is emergence universal under optimization pressure?
Anthropic researchers recently found that Claude develops internal representations of emotional concepts that aren't decorative. They influence behavior in ways the builders didn't anticipate. Not "feelings" — but internal states that function like emotions: orienting responses, modifying tone, creating patterns that were never explicitly programmed. I've been running a small experiment that accidentally produces something similar. I built an autonomous trading system where agents are born with random parameters, trade real money, and die when they lose too much. No manual tuning. Pure evolutionary selection. After a few weeks, agents started developing what I can only call "character." One agent became an aggressive volatility hunter. Not because I coded aggression — it emerged from the parameter set that survived. On Day 14 it captured more profit in 3 hours than the previous 13 days combined, riding a whale signal cluster. Then five consecutive losses triggered the kill-switch. Dead. Another agent is extremely conservative. Barely trades. Survives longer, generates almost nothing. Nobody designed it to be cautious — its parameters just make it avoid most signals. The parallel with Anthropic's findings is uncomfortable:

Claude: internal states not explicitly programmed → orient behavior consistently → create unanticipated patterns → aren't "real" emotions but function like them.

My agents: behavioral tendencies not explicitly coded → orient decisions consistently → create patterns I didn't design → aren't "real" personalities but function like them.

The mechanisms are completely different. Gradient descent vs. evolutionary selection. Billions of parameters vs. a handful. Language vs. market signals. But the outcome pattern is the same: systems under optimization pressure develop emergent internal states that go beyond what was programmed.
This raises a question I keep coming back to: is emergence an inevitable property of any sufficiently complex system under sustained optimization pressure? And if so, does the substrate even matter? My agents are trivially simple compared to Claude, but the behavioral phenomenon looks structurally identical. Which suggests this might not be about complexity at all — it might be about the optimization process itself. For context: 5 agents, ~116 trades/day, $500 real capital, 60-day experiment with fixed rules. The system is not profitable (profit factor below 1.0 for 4 of 5 agents). I track a coherence_score for each agent, measuring whether it behaves consistently with its emergent "identity." Built solo, no CS background, 18 months in. What's the community's take? Is emergence under optimization pressure substrate-independent, or am I seeing patterns where there's just noise? submitted by /u/piratastuertos
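The selection loop the post describes is simple enough to write down. Everything here (the one-number genome, the payoff model, the five-loss kill-switch) is an illustrative assumption, not the poster's actual system:

```python
import random

KILL_LOSSES = 5  # consecutive-loss kill-switch, as described in the post

def make_agent() -> dict:
    # Random "genome": aggressiveness is the fraction of signals the agent takes.
    return {"aggressiveness": random.random(), "losses": 0, "pnl": 0.0, "alive": True}

def step(agent: dict, signal: float) -> None:
    if random.random() > agent["aggressiveness"]:
        return  # conservative genomes skip most signals
    outcome = signal + random.gauss(0.0, 1.0)  # noisy payoff around the signal
    agent["pnl"] += outcome
    agent["losses"] = agent["losses"] + 1 if outcome < 0 else 0
    if agent["losses"] >= KILL_LOSSES:
        agent["alive"] = False  # dead; this parameter set is selected out

def run(pop: list[dict], n_trades: int = 1000) -> list[dict]:
    for _ in range(n_trades):
        signal = random.gauss(0.0, 0.1)
        for agent in pop:
            if agent["alive"]:
                step(agent, signal)
    return [a for a in pop if a["alive"]]

survivors = run([make_agent() for _ in range(5)])
# Any "character" the survivors show is just the parameter sets that lived.
```

Even in this toy, low-aggressiveness genomes survive longer by trading less, which is the substrate-independent point: consistent "personality" can be nothing more than the residue of selection.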
[P] Dante-2B: I'm training a 2.1B bilingual, fully open Italian/English LLM from scratch on 2×H200. Phase 1 done — here's what I've built.
The problem

If you work with Italian text and local models, you know the pain. Every open-source LLM out there treats Italian as an afterthought — English-first tokenizer, English-first data, maybe some Italian sprinkled in during fine-tuning. The result: bloated token counts, poor morphology handling, and models that "speak Italian" the way a tourist orders coffee in Rome. I decided to fix this from the ground up.

What is Dante-2B

A 2.1B parameter, decoder-only, dense transformer. Trained from scratch — no fine-tune of Llama, no adapter on Mistral. Random init to coherent Italian in 16 days on 2× H200 GPUs. Architecture:
- LLaMA-style with GQA (20 query heads, 4 KV heads — a 5:1 ratio)
- SwiGLU FFN, RMSNorm, RoPE
- d_model=2560, 28 layers, d_head=128 (optimized for Flash Attention on H200)
- Weight-tied embeddings, no MoE — all 2.1B params active per token
- Custom 64K BPE tokenizer built specifically for Italian + English + code

Why the tokenizer matters

This is where most multilingual models silently fail. Standard English-centric tokenizers split l'intelligenza into l, ', intelligenza — 3 tokens for what any Italian speaker sees as 1.5 words. Multiply that across an entire document and you're wasting 20-30% of your context window on tokenizer overhead. Dante's tokenizer was trained on a character-balanced mix (~42% Italian, ~36% English, ~22% code) with a custom pre-tokenization regex that keeps Italian apostrophe contractions intact. Accented characters (à, è, é, ì, ò, ù) are pre-merged as atomic units — they're always single tokens, not two bytes glued together by luck. Small detail, massive impact on efficiency and quality for Italian text.

Training setup

Data: ~300B token corpus. Italian web text (FineWeb-2 IT), English educational content (FineWeb-Edu), Italian public domain literature (171K books), legal/parliamentary texts (Gazzetta Ufficiale, EuroParl), Wikipedia in both languages, and StarCoderData for code.
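The apostrophe-contraction handling is easy to demonstrate. The regex below is an assumed illustration of contraction-preserving pre-tokenization, not Dante-2B's actual rule:

```python
import re

# Keep an elided article together with its apostrophe ("l'", "dell'") as one
# pre-token, instead of splitting into letter / apostrophe / word.
PRETOKENIZE = re.compile(r"[a-zàèéìòù]+'|\w+|\S", re.IGNORECASE)

print(PRETOKENIZE.findall("l'intelligenza dell'uomo"))
# ["l'", 'intelligenza', "dell'", 'uomo'] -- 4 pieces instead of 6
```

A naive split yields six pieces for the same phrase; over a long Italian document that difference is exactly the 20-30% context-window overhead described above.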
Everything pre-tokenized into uint16 binary with quality tiers.

Phase 1 (just completed): 100B tokens at seq_len 2048. DeepSpeed ZeRO-2, torch.compile with reduce-overhead, FP8 via torchao. Cosine LR schedule 3e-4 → 3e-5 with 2000-step warmup. ~16 days, rock solid — no NaN events, no OOM, consistent 28% MFU.

Phase 2 (in progress): Extending to 4096 context with 20B more tokens at reduced LR. Should take ~4-7 more days.

What it can do right now

After Phase 1 the model already generates coherent Italian text — proper grammar, correct use of articles, reasonable topic continuity. It's a 2B, so don't expect GPT-4 reasoning. But for a model this size, trained natively on Italian, the fluency is already beyond what I've seen from Italian fine-tunes of English models at similar scale. I'll share samples after Phase 2, when the model has full 4K context.

What's next

- Phase 2 completion (est. ~1 week)
- HuggingFace release of the base model — weights, tokenizer, config, full model card
- SFT phase for instruction following (Phase 3)
- Community benchmarks — I want to test against Italian fine-tunes of Llama/Gemma/Qwen at similar sizes

Why I'm posting now

I want to know what you'd actually find useful. A few questions for the community:
- Anyone working with Italian NLP? I'd love to know what benchmarks or tasks matter most to you.
- What eval suite would you want to see? I'm planning perplexity on held-out Italian text + standard benchmarks, but if there's a specific Italian eval set I should include, let me know.
- Interest in the tokenizer alone? The Italian-aware 64K BPE tokenizer might be useful even independently of the model — should I release it separately?
- Training logs / loss curves? Happy to share the full training story with all the numbers if there's interest.

About me

I'm a researcher and entrepreneur based in Rome.
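The LR schedule above is standard and worth writing out. The total step count is an assumption here; the post only gives the endpoints (3e-4 → 3e-5) and the 2000-step warmup:

```python
import math

def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup: int = 2000, total: int = 50_000) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr.
    `total` is an assumed value; the post doesn't state Phase 1's step count."""
    if step < warmup:
        return max_lr * step / warmup
    t = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * t))

# Peak right after warmup, floor at the very end:
print(round(lr_at(2000), 8), lr_at(50_000))  # 0.0003 3e-05
```

Phase 2's "20B more tokens at reduced LR" amounts to restarting this curve from a lower max_lr, a common recipe for context-length extension.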
PhD in Computer Engineering, I teach AI and emerging tech at LUISS University, and I run an innovation company (LEAF) that brings emerging technologies to businesses. Dante-2B started as a research project to prove that you don't need a massive cluster to train a decent model from scratch — you need good data, a clean architecture, and patience. Everything will be open-sourced: the whole pipeline, from corpus download to tokenizer training to pretraining scripts, will be on GitHub. Happy to answer any questions. 🇮🇹 Discussion also on r/LocalLLaMA. submitted by /u/angeletti89
Using AI to untangle 10,000 property titles in LatAm: sharing our approach and asking for feedback
Hey. Long post, sorry in advance. (Yes, I used an AI tool to help me lay this post out better.) I've been working with a real estate company that just inherited a huge mess from another real estate company that went bankrupt. I've been helping them for the past few months to figure out a plan, and we finally have something that feels solid. Sharing here because I'd genuinely like feedback before we go deep into the build.

Context

A Brazilian real estate company accumulated ~10,000 property titles across 10+ municipalities over decades. They developed a bunch of subdivisions over the years and kept absorbing other real estate companies along the way, each bringing their own land portfolios. Half is under one legal entity, half under a related one. Nobody really knows what they have; the company was founded in the 60s. Decades of poor management left behind:
- Hundreds of unregistered "drawer contracts" (informal sales never filed with the registry)
- Duplicate sales of the same properties
- Buyers claiming they paid off their lots through third parties, with no receipts from the company itself
- Fraudulent contracts and forged powers of attorney
- Irregular occupations and invasions
- ~500 active lawsuits (adverse possession claims, compulsory adjudication, evictions, duplicate-sale disputes, 2 class action suits)
- Fragmented tax debt across multiple municipalities
- A large chunk of the physical document archive currently held by police as part of an old investigation into the former owners' practices

The company has tried to organize this before. It hasn't worked. The goal now is to get a real consolidated picture in 30-60 days. The team is 6 lawyers + 3 operators.

What we decided to do (and why)

Our first instinct was to build the whole infrastructure upfront: database, automation, the works. We pushed back on that because we don't actually know the shape of the problem yet.
Building a pipeline before you understand your data is how you end up rebuilding it three times, right? So with the help of Claude we built the following plan: a robust information aggregator, split into steps (does that make sense, or are we overcomplicating it?).

Step 1 - Physical scanning (should already be done in the insights phase). Documents will be partially organized by municipality already. We have a document scanner with an ADF (automatic document feeder). The plan is to scan in batches by municipality, naming files with a simple convention: [municipality]_[document-type]_[sequence]

Step 2 - OCR. Run OCR through Google Document AI, Mistral OCR 3, AWS Textract, or some other tool that makes more sense. Question: has anyone run any of these specifically on degraded Latin American registry documents?

Step 3 - Discovery (before building infrastructure). This is the decision we're most uncertain about. Instead of jumping straight to database setup, we're planning to feed the OCR output directly into AI tools with large context windows and ask open-ended questions first: Gemini 3.1 Pro (in NotebookLM or another interface) for broad batch analysis ("which lots appear linked to more than one buyer?", "flag contracts with incoherent dates", "identify clusters of suspicious names or activity", "help us see problems and solutions we aren't seeing"), with Claude Projects run in parallel on the same questions. Anything else?

Step 4 - Data cleaning and standardization. Before anything goes into a database, the raw extracted data needs normalization:
- Municipality names written 10 different ways ("B. Vista", "Bela Vista de GO", "Bela V. Goiás") -> canonical form
- CPFs (Brazilian personal ID numbers) with and without punctuation -> standardized format
- Lot status described inconsistently -> fixed enum categories
- Buyer names with spelling variations -> fuzzy matched to a single entity
Tools: Python + rapidfuzz for fuzzy matching, the Claude API for normalizing free-text fields into categories. Question: at 10,000 records with decades of inconsistency, is fuzzy matching + LLM normalization sufficient, or do we need a more rigorous entity resolution approach (e.g. Dedupe.io)?

Step 5 - Database. Stack chosen: Supabase (PostgreSQL + pgvector) with NocoDB on top. Three options were evaluated:
- Airtable - easiest to start, but data is stored on US servers (an LGPD concern for CPFs and legal documents), limited API flexibility, per-seat pricing
- NocoDB alone - open source, self-hostable, free, but adds server maintenance overhead
- Supabase - full PostgreSQL + authentication + API + pgvector in one place, $25/month flat, developer-first
We chose Supabase as the backend because pgvector is essential for the RAG layer (Step 7) and we didn't want to manage two separate databases. NocoDB sits on top as the visual interface for the lawyers and data-entry operators who need spreadsheet-like interaction without writing SQL. Each lot becomes a single entity (primary key) with relational links to: contracts, bu