Build custom AI agents and RAG applications with smart context engineering, powered by open AI orchestration.
Orchestrate every step of your AI agent, from retrieval to reasoning to memory and tool use. Haystack’s modular framework gives you full visibility to inspect, debug, and optimize every decision your AI makes.

Connect to OpenAI, Anthropic, Mistral, Hugging Face, Weaviate, Pinecone, Elasticsearch, and more with no vendor lock-in. Haystack’s open architecture lets you mix and match components to fit your workflow.

Move from prototype to production using the same composable building blocks. Haystack lets you go from a proof of concept to a full production system with unified tooling for building, testing, and shipping your AI use cases.

Run production workloads across any environment with built-in reliability and observability. Haystack Pipelines are serializable, cloud-agnostic, and Kubernetes-ready, with logging, monitoring, and deployment guides to support you.

Enterprise support for the Haystack framework is available, along with a sovereign AI toolset built on Haystack to accelerate and scale AI use cases.

Build highly performant RAG pipelines with a multitude of retrieval and generation strategies. From hybrid retrieval to self-correction loops, Haystack has you covered.

Design production-ready AI agents with standardized tool calling and scalable context engineering. Branching and looping pipelines give you full control over complex, multi-step decision flows.

Architect a next-generation AI app around all modalities, not just text. Haystack can handle tasks like image processing and audio transcription too. All of our generators provide a standardized interface, so you can focus on building the perfect bot for your users.

The flexibility and composability of Haystack’s prompt flow is unparalleled. Leverage our Jinja2 templates and build a content generation engine that exactly matches your workflow.

Our community on Discord is for everyone interested in NLP, whether you are already using Haystack or just getting started!
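As a rough illustration of the composable-pipeline idea described above, here is a plain-Python sketch. This is not Haystack's actual API; the `Pipeline` class and component names below are hypothetical stand-ins showing how independent steps (retrieval, prompt building, generation) can be wired into one inspectable flow:

```python
# Minimal sketch of a composable pipeline with full step-by-step visibility.
# NOT Haystack's real API -- component names and structure are illustrative.

class Pipeline:
    def __init__(self):
        self.steps = []              # ordered (name, callable) pairs

    def add_component(self, name, fn):
        self.steps.append((name, fn))
        return self

    def run(self, data):
        trace = {}                   # record every step's output for inspection
        for name, fn in self.steps:
            data = fn(data)
            trace[name] = data
        return data, trace

DOCS = ["Haystack is an open framework.", "Pipelines are serializable."]

def retriever(query):
    # toy keyword match standing in for vector / hybrid retrieval
    return {"query": query, "docs": [d for d in DOCS if "Haystack" in d or "Pipeline" in d]}

def prompt_builder(state):
    # stand-in for a Jinja2-style prompt template
    context = " ".join(state["docs"])
    state["prompt"] = f"Answer using: {context}\nQuestion: {state['query']}"
    return state

def generator(state):
    # stand-in for an LLM call (OpenAI, Anthropic, etc.)
    state["answer"] = f"[generated from {len(state['docs'])} docs]"
    return state

pipe = Pipeline()
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("generator", generator)

result, trace = pipe.run("What is Haystack?")
print(result["answer"])          # final output
print(list(trace))               # every intermediate step is inspectable
```

Because every step's output is recorded, any component can be swapped out or debugged in isolation — the property the framework's "no vendor lock-in" claim rests on.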
Stay tuned for the latest Haystack community news and events.
Mentions (30d): 0
Reviews: 0
Platforms: 2
GitHub stars: 24,667
GitHub forks: 2,692
Industry: information technology & services
Employees: 82
Funding stage: Venture (round not specified)
Total funding: $45.6M
GitHub followers: 1,276
GitHub repos: 71
GitHub stars: 24,667
npm packages: 20
Hugging Face models: 4
[D] MemPalace claims 100% on LoCoMo and a "perfect score on LongMemEval." Its own BENCHMARKS.md documents why neither is meaningful.
A new open-source memory project called MemPalace launched yesterday claiming "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." The launch tweet went viral, reaching over 1.5 million views, while the repository picked up over 7,000 GitHub stars in less than 24 hours.

The interesting thing is not that the headline numbers are inflated. The interesting thing is that the project's own BENCHMARKS.md file documents this in detail, while the launch tweet strips these caveats. Some of the failure modes line up with the methodology disputes the field has been arguing about for over a year (Zep vs. Mem0, Letta's "Filesystem All You Need" reproducibility post, etc.).

1. The LoCoMo 100% is a top_k bypass.

The runner uses top_k=50. LoCoMo's ten conversations have 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than 50 sessions, so top_k=50 retrieves the entire conversation as the candidate pool every time. The Sonnet rerank then does reading comprehension over all sessions. BENCHMARKS.md says this verbatim:

> The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions - the embedding retrieval step is bypassed entirely.

The honest LoCoMo numbers in the same file are 60.3% R@10 with no rerank and 88.9% R@10 with hybrid scoring and no LLM. Those are real and unremarkable. A 100% is also independently impossible on the published version of LoCoMo, since roughly 6.4% of the answer key contains hallucinated facts, wrong dates, and speaker-attribution errors that any honest system will disagree with.

2. The LongMemEval "perfect score" is a metric category error.
Published LongMemEval is end-to-end QA: retrieve from a haystack of prior chat sessions, generate an answer, and have a GPT-4 judge mark it correct. Every score on the published leaderboard is the percentage of generated answers judged correct.

The MemPalace LongMemEval runner does retrieval only. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings (all-MiniLM-L6-v2), returns the top five sessions by cosine distance, and checks set membership against the gold session IDs. It computes both recall_any@5 and recall_all@5, and the project reports the softer one. It never generates an answer. It never invokes a judge.

None of the LongMemEval numbers in this repository - not the 100%, not the 98.4% "held-out", not the 96.6% raw baseline - are LongMemEval scores in the sense the published leaderboard means. They are recall_any@5 retrieval numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.

3. The 100% itself is teaching to the test.

The hybrid v4 mode that produces the 100% was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. BENCHMARKS.md, line 461, verbatim:

> This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.

4. Marketed features that don't exist in the code.

The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature.
mempalace/knowledge_graph.py contains zero occurrences of "contradict". The only deduplication logic is an exact-match check on (subject, predicate, object) triples that blocks identical triples from being added twice. Conflicting facts about the same subject can accumulate indefinitely.

5. "30x lossless compression" is measurably lossy in the project's own benchmarks.

The compression module mempalace/dialect.py truncates sentences at 55 characters, filters by keyword frequency, and provides a decode() function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip. The same BENCHMARKS.md reports results_raw_full500.jsonl at 96.6% R@5 and results_aaak_full500.jsonl at 84.2% R@5 — a 12.4-percentage-point drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop.

Why this matters for the benchmark conversation: the field needs benchmarks where judge reliability is adversarially validated, an
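Two of the failure modes above can be reproduced with toy arithmetic. The session counts are taken from the post; the recall helpers are my own sketch of the two metrics, not MemPalace's code:

```python
# 1. The top_k bypass: with top_k=50 and every LoCoMo conversation holding
#    fewer than 50 sessions, "retrieval" returns the whole conversation, so
#    session recall is 1.0 by construction, whatever the embedding model does.
session_counts = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]
top_k = 50
bypassed = all(n < top_k for n in session_counts)
print(bypassed)  # True: the gold session is always in the candidate pool

# 2. recall_any@k vs recall_all@k: a multi-session question counts as correct
#    under recall_any@k if ANY gold session appears in the top k, but under
#    recall_all@k only if EVERY gold session does. Neither metric involves
#    generating an answer or invoking a judge, which is the category error.
def recall_any_at_k(retrieved, gold, k):
    return bool(set(retrieved[:k]) & set(gold))

def recall_all_at_k(retrieved, gold, k):
    return set(gold) <= set(retrieved[:k])

retrieved = ["s7", "s2", "s9", "s1", "s4"]   # hypothetical top-5 session ids
gold = ["s2", "s8"]                           # question needs both sessions

print(recall_any_at_k(retrieved, gold, 5))   # True  -- the softer metric
print(recall_all_at_k(retrieved, gold, 5))   # False -- s8 was missed
```

The gap between the two helpers on the same retrieval list is exactly the kind of slack that lets a "perfect score" headline survive a missed gold session.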
[R] Doc-to-LoRA: Learning to Instantly Internalize Contexts from Sakana AI
This is a cool paper! It creates LoRAs from docs on the fly using a hypernetwork.

"Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior."

https://arxiv.org/abs/2602.15902

submitted by /u/Happysedits
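A heavily simplified sketch of the D2L idea, assuming a rank-1 adapter and a made-up linear "hypernetwork" (the real D2L meta-learns this mapping over a Transformer; the dimensions and numbers here are purely illustrative):

```python
# Toy illustration: a "hypernetwork" maps a context representation to a
# rank-1 LoRA update (A, B) for a frozen weight matrix W, so later queries
# can reuse the adapter instead of re-reading the original context.

d = 3                                 # model dim (tiny, for illustration)

W = [[1.0, 0.0, 0.0],                 # frozen base weight (d x d identity)
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

def hypernetwork(context_vec):
    # stand-in for a learned net: emit A (d x 1) and B (1 x d) from the
    # context embedding in a single forward pass -- no per-prompt training
    a = [[0.1 * c] for c in context_vec]
    b = [[0.5, -0.5, 0.5]]            # fixed here; learned in the real thing
    return a, b

def lora_forward(x, W, A, B):
    # y = x @ (W + A @ B): the low-rank delta carries the context's info
    delta = [[sum(A[i][k] * B[k][j] for k in range(len(B)))
              for j in range(d)] for i in range(d)]
    Wp = [[W[i][j] + delta[i][j] for j in range(d)] for i in range(d)]
    return [sum(x[i] * Wp[i][j] for i in range(d)) for j in range(d)]

context = [1.0, 2.0, 3.0]             # embedding of the long document
A, B = hypernetwork(context)          # instant "internalization" of the context
y = lora_forward([1.0, 1.0, 1.0], W, A, B)
print(y)
```

The payoff the abstract describes is that `A, B` is computed once per document, after which every query pays only the cheap adapted forward pass instead of re-consuming the full context in the KV cache.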
Repository Audit Available
Deep analysis of deepset-ai/haystack — architecture, costs, security, dependencies & more
Haystack uses a tiered pricing model. Visit their website for current pricing details.
Key features include: private, secure engineering support; best-practices templates and deployment guides; access to flexible services; flexible pricing based on company size; visual, code-aligned pipeline design; data, retrieval, and testing workflows; secure access controls and auditability; and scalable cloud or on-prem deployment.
Haystack is commonly used to operate at enterprise scale: visual, code-aligned pipeline design; data, retrieval, and testing workflows; secure access controls and auditability; and scalable cloud or on-prem deployment.
Haystack has a public GitHub repository with 24,667 stars.