Mastra

framework · subscription + contract + per-seat + tiered · Free tier

Mastra appears in several YouTube videos, which suggests some following for its AI capabilities. These brief social mentions contain no detailed critique, so specific strengths and pricing sentiment remain unclear. Nothing in them points to dissatisfaction or major complaints about Mastra. Its reputation looks neutral to positive, but fuller user reviews are needed for a reliable picture.

GitHub Stars

23,274

1,947 forks

15 integrations · 10 features · Series A

Voices Discussing Mastra

Together AI

Company at Together AI

1 mention

Latest Videos

Opus 4.6 Got Nerfed?! Plus Ramp AI Coworker & Minimax 2.7

Apr 13, 2026

The terminal is not the final interface for AI. I'll put money on it.

Apr 12, 2026

Features & Use Cases

Features

Browser Agent
Google Sheet Analysis
Chat with Database
Principles of Building AI Agents
MCP Course Mastra 101
What is Mastra?
Why use Mastra instead of a Python AI framework?
Is Mastra an agent builder?
What can you build with Mastra?
What AI models and providers does Mastra support?

Use Cases

Is Mastra an agent builder?
What AI models and providers does Mastra support?

Company Intel

Industry

information technology & services

Employees

Funding Stage

Series A

Total Funding

$22.1M

Social Reach

489

GitHub followers

Developer Ecosystem

102

GitHub repos

23,274

GitHub stars

npm packages

HuggingFace models

Mentions by Platform

youtube: Mastra AI (5 mentions)

Pricing

subscription + contract + per-seat + tiered · Free tier available

Pricing found: $0/month, $10/100k, $0.35/hr, $250/month, $8/100k

Sentiment Overview

Positive: 17% (1)

Neutral: 83% (5)

Negative: 0% (0)

Recent Mentions

youtube: Mastra AI (5 mentions)

reddit · @[unknown] · 3/27/2026

[D] We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers

Projects are still submitting new scores on LoCoMo as of March 2026. We audited it and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context window test than a memory test. Here's what we found.

LoCoMo

LoCoMo (Maharana et al., ACL 2024) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors.

Examples:

The answer key specifies "Ferrari 488 GTB," but the source conversation contains only "this beauty" and the image caption reads "a red sports car." The car model exists only in an internal query field (annotator search strings for stock photos) that no memory system ingests. Systems are evaluated against facts they have no access to.

"Last Saturday" on a Thursday should resolve to the preceding Saturday. The answer key says Sunday. A system that performs the date arithmetic correctly is penalized.

24 questions attribute statements to the wrong speaker. A system with accurate speaker tracking will contradict the answer key.

The theoretical maximum score for a perfect system is approximately 93.6%.

We also tested the LLM judge. LoCoMo uses gpt-4o-mini to score answers against the golden reference. We generated intentionally wrong but topically adjacent answers for all 1,540 questions and scored them using the same judge configuration and prompts used in published evaluations. The judge accepted 62.81% of them. Specific factual errors (wrong name, wrong date) were caught approximately 89% of the time. However, vague answers that identified the correct topic while missing every specific detail passed nearly two-thirds of the time. This is precisely the failure mode of weak retrieval, locating the right conversation but extracting nothing specific, and the benchmark rewards it.

There is also no standardized evaluation pipeline. Each system uses its own ingestion method (arguably necessary given architectural differences), its own answer generation prompt, and sometimes entirely different models. Scores are then compared in tables as if they share a common methodology. Multiple independent researchers have documented inability to reproduce published results (EverMemOS #73, Mem0 #3944, Zep scoring discrepancy).

Full audit with all 99 errors documented, methodology, and reproducible scripts: locomo-audit

LongMemEval

LongMemEval-S (Wang et al., 2024) is the other frequently cited benchmark. The issue is different but equally fundamental: it does not effectively isolate memory capability from context window capacity.

LongMemEval-S uses approximately 115K tokens of context per question. Current models support 200K to 1M token context windows. The entire test corpus fits in a single context window for most current models.

Mastra's research illustrates this: their full-context baseline scored 60.20% with gpt-4o (128K context window, near the 115K threshold). Their observational memory system scored 84.23% with the same model, largely by compressing context to fit more comfortably. The benchmark is measuring context window management efficiency rather than long-term memory retrieval. As context windows continue to grow, the full-context baseline will keep climbing and the benchmark will lose its ability to discriminate.

LongMemEval-S tests whether a model can locate information within 115K tokens. That is a useful capability to measure, but it is a context window test, not a memory test.

LoCoMo-Plus

LoCoMo-Plus (Li et al., 2025) introduces a genuinely interesting new category: "cognitive" questions testing implicit inference rather than factual recall. These use cue-trigger pairs with deliberate semantic disconnect: the system must connect "I just adopted a rescue dog" (cue) to "what kind of pet food should I buy?" (trigger) across sessions without lexical overlap. The concept is sound and addresses a real gap in existing evaluation.

The issues:

It inherits all 1,540 original LoCoMo questions unchanged, including the 99 score-corrupting errors documented above.

The improved judging methodology (task-specific prompts, three-tier scoring, 0.80+ human-LLM agreement) was only validated on the new cognitive questions. The original five categories retain the same broken ground truth with no revalidation.

The judge model defaults to gpt-4o-mini.

Same lack of pipeline standardization.

The new cognitive category is a meaningful contribution. The inherited evaluation infrastructure retains the problems described above.

Requirements for meaningful long-term memory evaluation

Based on this analysis, we see several requirements for benchmarks that can meaningfully

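The headline figures in the Reddit post above follow from simple arithmetic, which the sketch below reproduces in Python. It uses only numbers quoted in the post (99 errors in 1,540 questions, roughly 115K tokens per LongMemEval-S question, and 128K / 200K / 1M context windows). The variable and function names are illustrative, the window sizes are treated as round token counts, and the score ceiling assumes a perfect system is marked wrong on exactly the corrupted questions. This is a sanity check under those assumptions, not the audit's own script.

```python
# Figures quoted in the post; all names here are illustrative, not from the audit.
TOTAL_QUESTIONS = 1540
CORRUPTED_QUESTIONS = 99

# Share of the answer key affected by score-corrupting errors.
error_rate = CORRUPTED_QUESTIONS / TOTAL_QUESTIONS

# Assumption: a perfect system is marked wrong on every corrupted question
# and right on all others, so its score ceiling is 1 - error_rate.
max_score = 1 - error_rate


def fits_in_context(corpus_tokens: int, window_tokens: int) -> bool:
    """Return True when the whole per-question corpus fits in one context window."""
    return corpus_tokens <= window_tokens


# Approximate per-question corpus size for LongMemEval-S, as stated in the post.
LONGMEMEVAL_S_TOKENS = 115_000

# Context window sizes mentioned in the post, treated as round token counts.
windows = {"128K": 128_000, "200K": 200_000, "1M": 1_000_000}
fits = {name: fits_in_context(LONGMEMEVAL_S_TOKENS, size) for name, size in windows.items()}

print(f"answer-key error rate: {error_rate:.1%}")
print(f"score ceiling for a perfect system: {max_score:.1%}")
for name, ok in fits.items():
    print(f"corpus fits in {name} window: {ok}")
```

Under these assumptions the corpus fits in every listed window, which is the post's reason for calling LongMemEval-S a context window test rather than a memory test.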

Integrations

Google Sheets API
Slack
Microsoft Teams
Zapier
Trello
Jira
GitHub
AWS
Azure
Firebase
Notion
Salesforce
Stripe
Twilio
Discord