Two days after releasing what analysts call the most powerful open-source AI model ever created, researchers from China’s Moonshot AI logged onto Reddit to face a restless audience. The Beijing-based startup had reason to show up. Kimi K2.5 had just made headlines for closing the gap with American AI giants and testing the limits of U.S. chip export controls. But the developers waiting on r/LocalLLaMA, a forum where engineers trade advice on running powerful language models on everything from a single consumer GPU to a small rack of prosumer hardware, had a different concern.
They wanted to know when they could actually use it.
The three-hour Ask Me Anything session became an unexpectedly candid window into frontier AI development in 2026 — not the polished version that appears in corporate blogs, but the messy reality of debugging failures, managing personality drift, and confronting a fundamental tension that defines open-source AI today.
Moonshot had published the model’s weights for anyone to download and customize. The weights alone run roughly 595 gigabytes. For most of the developers in the thread, that openness remained theoretical.
Three Moonshot team members participated under the usernames ComfortableAsk4494, zxytim, and ppwwyyxx. Over approximately 187 comments, they fielded questions about architecture, training methodology, and the philosophical puzzle of what gives an AI model its “soul.” They also offered a picture of where the next round of progress will come from — and it wasn’t simply “more parameters.”
Developers asked for smaller models they can actually run, and Moonshot acknowledged it has a problem
The very first wave of questions treated Kimi K2.5 less like a breakthrough and more like a logistics headache.
One user asked bluntly why Moonshot wasn’t creating smaller models alongside the flagship. “Small sizes like 8B, 32B, 70B are great spots for the intelligence density,” they wrote. Another said huge models had become difficult to celebrate because many developers simply couldn’t run them. A third pointed to American competitors as size targets, requesting coder-focused variants that could fit on modest GPUs.
Moonshot’s team didn’t announce a smaller model on the spot. But it acknowledged the demand in terms that suggested the complaint was familiar. “Requests well received!” one co-host wrote. Another noted that Moonshot’s model collection already includes some smaller mixture-of-experts models on Hugging Face, while cautioning that small and large models often require different engineering investments.
The most revealing answer came when a user asked whether Moonshot might build something around 100 billion parameters optimized for local use. The Kimi team responded by floating a different compromise: a 200 billion or 300 billion parameter model that could stay above what it called a “usability threshold” across many tasks.
That reply captured the bind open-weight labs face. A 200-to-300 billion parameter model would broaden access compared to a trillion-parameter system, but it still assumes multi-GPU setups or aggressive quantization. The developers in the thread weren’t asking for “somewhat smaller.” They were asking for models sized for the hardware they actually own — and for a roadmap that treats local deployment as a first-class constraint rather than a hobbyist afterthought.
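That hardware gap is easy to quantify. The sketch below is a back-of-the-envelope estimate of how much memory a model’s weights need at different precisions; the 20 percent overhead factor and the helper name weight_memory_gb are illustrative assumptions, not figures from Moonshot.

```python
# Back-of-the-envelope memory estimate for hosting a model's weights locally.
# Rule of thumb: memory ~= parameter count x bytes per parameter, plus some
# runtime overhead for KV cache and activations (the 20% here is a guess).

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str, overhead: float = 0.20) -> float:
    """Approximate gigabytes needed to hold the weights at a given precision."""
    bytes_total = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return bytes_total * (1 + overhead) / 1e9

for size in (200, 300, 1000):
    for precision in ("fp16", "int8", "int4"):
        print(f"{size}B @ {precision}: ~{weight_memory_gb(size, precision):.0f} GB")
```

Even at 4-bit quantization, a 200 billion parameter model lands around 120 gigabytes for weights alone under this estimate, several times the 24 to 48 gigabytes available on high-end consumer and prosumer GPUs.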
The team said scaling laws are hitting diminishing returns, and pointed to a different kind of progress
As the thread moved past hardware complaints, it turned to what many researchers now consider the central question in large language models: have scaling laws begun to plateau?
One participant asked directly whether scaling had “hit a wall.” A Kimi representative replied with a diagnosis that has become increasingly common across the industry. “The amount of high-quality data does not grow as fast as the available compute,” they wrote, “so scaling under the conventional ‘next token prediction with Internet data’ will bring less improvement.”
Then the team offered its preferred escape route. It pointed to Agent Swarm, Kimi K2.5’s ability to coordinate up to 100 sub-agents working in parallel, as a form of “test-time scaling” that could open a new path to capability gains. In the team’s framing, scaling doesn’t have to mean only larger pretraining runs. It can also mean increasing the amount of structured work done at inference time, then folding those insights back into training through reinforcement learning.
“There might be new paradigms of scaling that can possibly happen,” one co-host wrote. “Looking forward, it’s likely to have a model that learns with less or even zero human priors.”
The claim implies that the unit of progress may be shifting from parameter count and pretraining loss curves toward systems that can plan, delegate, and verify — using tools and sub-agents as building blocks rather than relying on a single massive forward pass.
Agent Swarm works by keeping each sub-agent’s memory separate from the coordinator
On paper, Agent Swarm sounds like a familiar idea in a new wrapper: many AI agents collaborating on a task. The AMA surfaced the more important details — where the memory goes, how coordination happens, and why orchestration doesn’t collapse into noise.
A developer raised a classic multi-agent concern. At a scale of 100 sub-agents, an orchestrator agent often becomes a bottleneck, both in latency and in what the community calls “context rot” — the degradation in performance that occurs as a conversation history fills with internal chatter and tool traces until the model loses the thread.
A Kimi co-host answered with a design choice that matters for anyone building agent systems in enterprise settings. The sub-agents run with their own working memory and send back results to the orchestrator, rather than streaming everything into a shared context. “This allows us to scale the total context length in a new dimension!” they wrote.
Another developer pressed on performance claims. Moonshot has publicly described Agent Swarm as capable of roughly a 4.5x speedup on suitable workflows, but skeptics asked whether that figure simply reflects how parallelizable a given task is. The team agreed: it depends. In some cases, the system decides that a task doesn’t require parallel agents and avoids spending the extra compute. It also described sub-agent token budgets as something the orchestrator must manage, assigning each sub-agent a task of appropriate size.
Read as engineering rather than marketing, Moonshot was describing a familiar enterprise pattern: keep the control plane clean, bound the outputs from worker processes, and avoid flooding a coordinator with logs it can’t digest.
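Moonshot has not released Agent Swarm’s orchestration code, so the following is only a generic sketch of the pattern the team described: each sub-agent works in a private context under a token budget, and only a bounded summary flows back to the coordinator. All names here (SubAgent, Orchestrator, call_llm) are hypothetical.

```python
# Generic illustration of the coordination pattern described in the AMA:
# sub-agents keep private working memory and return only bounded results,
# so the orchestrator's context never fills with their internal chatter.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

def call_llm(messages: list[dict], max_tokens: int) -> str:
    """Stub standing in for a real chat-completion call."""
    return f"[summary of: {messages[-1]['content'][:40]} | budget={max_tokens} tokens]"

@dataclass
class SubAgent:
    task: str
    token_budget: int                                   # assigned by the orchestrator
    memory: list[dict] = field(default_factory=list)    # private working memory

    def run(self) -> str:
        self.memory.append({"role": "user", "content": self.task})
        # The sub-agent may loop over tools here; only the final summary escapes.
        result = call_llm(self.memory, max_tokens=self.token_budget)
        return result[: self.token_budget * 4]          # crude character bound (~4 chars/token)

class Orchestrator:
    def __init__(self, max_workers: int = 8):
        self.context: list[dict] = []                   # stays clean: summaries only
        self.max_workers = max_workers

    def dispatch(self, subtasks: list[str], budget_per_task: int = 1024) -> list[str]:
        agents = [SubAgent(t, budget_per_task) for t in subtasks]
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            summaries = list(pool.map(lambda a: a.run(), agents))
        # Only the bounded summaries enter the orchestrator's context.
        self.context.extend({"role": "tool", "content": s} for s in summaries)
        return summaries

if __name__ == "__main__":
    orchestrator = Orchestrator()
    print(orchestrator.dispatch(["scan the changelog", "check the pricing page"]))
```

The design choice that matters is where the memory lives: worker chatter stays inside each SubAgent, and the coordinator only ever sees the summaries it asked for.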
Reinforcement learning compute will keep increasing, especially for training agents
The most consequential shift hinted at in the AMA wasn’t a new benchmark score. It was a statement about priorities.
One question asked whether Moonshot was moving compute from “System 1” pretraining to “System 2” reinforcement learning — shorthand for shifting from broad pattern learning toward training that explicitly rewards reasoning and correct behavior over multi-step tasks. A Kimi representative replied that RL compute will keep increasing, and suggested that new RL objective functions are likely, “especially in the agent space.”
That line reads like a roadmap. As models become more tool-using and task-decomposing, labs will spend more of their budget training models to behave well as agents — not merely to predict tokens.
For enterprises, this matters because RL-driven improvements often arrive with tradeoffs. A model can become more decisive, more tool-happy, or more aligned to reward signals that don’t map neatly onto a company’s expectations. The AMA didn’t claim Moonshot had solved those tensions. It did suggest the team sees reinforcement learning as the lever that will matter more in the next cycle than simply buying more GPUs.
When asked about the compute gap between Moonshot and American labs with vastly larger GPU fleets, the team was candid. “The gap is not closing I would say,” one co-host wrote. “But how much compute does one need to achieve AGI? We will see.”
Another offered a more philosophical framing: “There are too many factors affecting available compute. But no matter what, innovation loves constraints.”
The model sometimes calls itself Claude, and Moonshot explained why that happens
Open-weight releases now come with a standing suspicion: did the model learn too much from competitors? That suspicion can harden quickly into accusations of distillation, where one AI learns by training on another AI’s outputs.
A user raised one of the most uncomfortable claims circulating in open-model circles — that K2.5 sometimes identifies itself as “Claude,” Anthropic’s flagship model. The implication was heavy borrowing.
Moonshot didn’t dismiss the behavior. Instead it described the conditions under which it happens. With the right system prompt, the team said, the model has a high probability of answering “Kimi,” particularly in thinking mode. But with an empty system prompt, the model drifts into what the team called an “undefined area,” which reflects pretraining data distributions rather than deliberate training choices.
Then it offered a specific explanation tied to a training decision. Moonshot said it had upsampled newer internet coding data during pretraining, and that this data appears more associated with the token “Claude” — likely because developers discussing AI coding assistants frequently reference Anthropic’s model.
The team pushed back on the distillation accusation with benchmark results. “In fact, K2.5 seems to outperform Claude on many benchmarks,” one co-host wrote. “HLE, BrowseComp, MMMU Pro, MathVision, just to name a few.”
For enterprise adopters, the important point isn’t the internet drama. It’s that identity drift is a real failure mode — and one that organizations can often mitigate by controlling system prompts rather than leaving the model’s self-description to chance. The AMA treated prompt governance not as a user-experience flourish, but as operational hygiene.
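In practice, that mitigation is straightforward. The sketch below assumes the open weights are served behind an OpenAI-compatible endpoint, as servers such as vLLM expose; the base URL, model identifier, and prompt text are placeholders rather than Moonshot-documented values.

```python
# Minimal sketch of "prompt governance": always pin a system prompt so the
# model's self-identification never falls back to pretraining defaults.
# Assumes an OpenAI-compatible serving endpoint; all values are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

PINNED_SYSTEM_PROMPT = (
    "You are Kimi, an AI assistant built by Moonshot AI. "
    "Follow the organization's style and disclosure policies."
)

def chat(user_message: str) -> str:
    """Every request carries the pinned system prompt, never an empty one."""
    response = client.chat.completions.create(
        model="kimi-k2.5",  # placeholder model identifier
        messages=[
            {"role": "system", "content": PINNED_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```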
Users said the model lost its personality, and Moonshot admitted that “soul” is hard to measure
A recurring theme in the thread was that K2.5’s writing style feels more generic than that of earlier Kimi models. Users described it as more like a standard “helpful assistant” — a tone many developers now see as the default personality of heavily post-trained models. One user said they loved the personality of Kimi K2 and asked what happened.
A Kimi co-host acknowledged that each new release brings some personality change and described personality as subjective and hard to evaluate. “This is a quite difficult problem,” they wrote. The team said it wants to address the issue and make personality more customizable per user.
In a separate exchange about whether strengthening coding capability compromises creative writing and emotional intelligence, a Kimi representative argued there’s no inherent conflict if the model is large enough. But maintaining “writing taste” across versions is difficult, they said, because the reward model is constantly evolving. The team relies on internal benchmarks — a kind of meta-evaluation — to track creative writing progress and adjust reward models accordingly.
Another response went further, using language that would sound unusual in a corporate AI specification but familiar to people who use these tools daily. The team talked about the “soul” of a reward model and suggested the possibility of storing a user “state” reflecting taste and using it to condition the model’s outputs.
That exchange points to a product frontier that enterprises often underestimate. Style drift isn’t just aesthetics. It can change how a model explains decisions, how it hedges, how it handles ambiguity, and how it interacts with customers and employees. The AMA made clear that labs increasingly treat “taste” as both an alignment variable and a differentiator — but it remains hard to measure and even harder to hold constant across training runs.
Debugging emerged as the unglamorous truth behind frontier AI research
The most revealing cultural insight came in response to a question about surprises during training and reinforcement learning. A co-host answered with a single word, bolded for emphasis: debugging.
“Whether it’s pre-training or post-training, one thing constantly manifests itself as the utmost priority: debugging,” they wrote.
The comment illuminated a theme running through the entire session. When asked about their “scaling ladder” methodology for evaluating new ideas at different model sizes, zxytim offered an anecdote about failure. The team had once hurried to incorporate Kimi Linear, an experimental linear-attention architecture, into the previous model generation. It failed the scaling ladder at a certain scale. They stepped back and went through what the co-host called “a tough debugging process,” and after months finally made it work.
“Statistically, most ideas that work at small scale won’t pass the scaling ladder,” they continued. “Those that do are usually simple, effective, and mathematically grounded. Research is mostly about managing failure, not celebrating success.”
For technical leaders evaluating AI vendors, the admission is instructive. Frontier capability doesn’t emerge from elegant breakthroughs alone. It emerges from relentless fault isolation — and from organizational cultures willing to spend months on problems that might not work.
Moonshot hinted at what comes next, including linear attention and continual learning
The AMA also acted as a subtle teaser for Kimi’s next generation.
Developers asked whether Kimi K3 would adopt Moonshot’s linear attention research, which aims to handle long context more efficiently than traditional attention mechanisms. Team members suggested that linear approaches are a serious option. “It’s likely that Kimi Linear will be part of K3,” one wrote. “We will also include other optimizations.”
In another exchange, a co-host predicted K3 “will be much, if not 10x, better than K2.5.”
The team also highlighted continual learning as a direction it is actively exploring, suggesting a future where agents can work effectively over longer time horizons — a critical enterprise need if agents are to handle ongoing projects rather than single-turn tasks. “We believe that continual learning will improve agency and allow the agents to work effectively for much longer durations,” one co-host wrote.
On Agent Swarm specifically, the team said it plans to make the orchestration scaffold available to developers once the system becomes more stable. “Hopefully very soon,” they added.
What the AMA revealed about the state of open AI in 2026
The session didn’t resolve every question. Some of the most technical prompts — about multimodal training recipes, defenses against reward hacking, and data governance — were deferred to a forthcoming technical report. That’s not unusual. Many labs now treat the most operationally decisive details as sensitive.
But the thread still revealed where the real contests in AI have moved. The gap that matters most isn’t between China and the United States, or between open and closed. It’s the gap between what models promise and what systems can actually deliver.
Orchestration is becoming the product. Moonshot isn’t only shipping a model. It’s shipping a worldview that says the next gains come from agents that can split work, use tools, and return structured results fast. Open weights are colliding with hardware reality, as developers demand openness that runs locally rather than openness that requires a data center. And the battleground is shifting from raw intelligence to reliability — from beating a benchmark by two points to debugging tool-calling discipline, managing memory in multi-agent workflows, and preserving the hard-to-quantify “taste” that determines whether users trust the output.
Moonshot showed up on Reddit in the wake of a high-profile release and a growing geopolitical narrative. The developers waiting there cared about a more practical question: When does “open” actually mean “usable”?
In that sense, the AMA didn’t just market Kimi K2.5. It offered a snapshot of an industry in transition — from larger models to more structured computation, from closed APIs to open weights that still demand serious engineering to deploy, and from celebrating success to managing failure.
“Research is mostly about managing failure,” one of the Moonshot engineers had written. By the end of the thread, it was clear that deployment is, too.
