When Liquid AI, a startup founded by MIT computer scientists in 2023, introduced the second generation of its Liquid Foundation Models (LFM2) in July 2025, the pitch was straightforward: deliver the fastest on-device foundation models on the market using its new “liquid” architecture, with training and inference efficiency that made small models a serious alternative to cloud-only large language models (LLMs) such as OpenAI’s GPT series and Google’s Gemini.
The initial release shipped dense checkpoints at 350M, 700M, and 1.2B parameters, a hybrid architecture heavily weighted toward gated short convolutions, and benchmark numbers that placed LFM2 ahead of similarly sized competitors like Qwen3, Llama 3.2, and Gemma 3 on both quality and CPU throughput. The message to enterprises was clear: real-time, privacy-preserving AI on phones, laptops, and vehicles no longer required sacrificing capability for latency.
In the months since that launch, Liquid has expanded LFM2 into a broader product line — adding task-and-domain-specialized variants, a small video ingestion and analysis model, and an edge-focused deployment stack called LEAP — and positioned the models as the control layer for on-device and on-prem agentic systems.
Now, with the publication of the detailed, 51-page LFM2 technical report on arXiv, the company is going a step further: making public the architecture search process, training data mixture, distillation objective, curriculum strategy, and post-training pipeline behind those models.
And unlike earlier open models, LFM2 is built around a repeatable recipe: a hardware-in-the-loop search process, a training curriculum that compensates for smaller parameter budgets, and a post-training pipeline tuned for instruction following and tool use.
Rather than just offering weights and an API, Liquid is effectively publishing a detailed blueprint that other organizations can use as a reference for training their own small, efficient models from scratch, tuned to their own hardware and deployment constraints.
A model family designed around real constraints, not GPU labs
The technical report begins with a premise enterprises are intimately familiar with: real AI systems hit limits long before benchmarks do. Latency budgets, peak memory ceilings, and thermal throttling define what can actually run in production—especially on laptops, tablets, commodity servers, and mobile devices.
To address this, Liquid AI performed architecture search directly on target hardware, including Snapdragon mobile SoCs and Ryzen laptop CPUs. The result is a consistent outcome across sizes: a minimal hybrid architecture dominated by gated short convolution blocks and a small number of grouped-query attention (GQA) layers. This design was repeatedly selected over more exotic linear-attention and SSM hybrids because it delivered a better quality-latency-memory Pareto profile under real device conditions.
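The report stays at the block-diagram level in places, but the general shape of a gated short convolution block is easy to sketch. The PyTorch snippet below is a minimal illustration, not Liquid’s actual implementation: the class name GatedShortConv, the sigmoid gating, and the kernel size are all assumptions. What it does show is why such a block is cheap on commodity CPUs: a depthwise convolution over a tiny window plus a few matrix multiplies, with no attention over the full sequence.

```python
import torch
import torch.nn as nn

class GatedShortConv(nn.Module):
    """Illustrative double-gated short causal convolution block.

    A sketch of the general idea only: two input-dependent gates wrapped
    around a short depthwise causal convolution, with linear projections
    in and out. Not Liquid AI's actual implementation.
    """

    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 3 * d_model)   # features + two gates
        self.conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            groups=d_model,              # depthwise: cheap on CPU
            padding=kernel_size - 1,     # extra padding trimmed below to stay causal
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        g_in, g_out, h = self.in_proj(x).chunk(3, dim=-1)
        h = torch.sigmoid(g_in) * h                       # first gate (nonlinearity assumed)
        h = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        h = torch.sigmoid(g_out) * h                      # second gate
        return self.out_proj(h)

block = GatedShortConv(d_model=512)
out = block(torch.randn(2, 16, 512))   # -> (2, 16, 512)
```

Grouped-query attention layers are then interleaved sparsely among blocks like this one, which is where the “hybrid” in the hybrid architecture comes from.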
This design matters for enterprise teams in three ways:
- Predictability. The architecture is simple, parameter-efficient, and stable across model sizes from 350M to 2.6B.
- Operational portability. Dense and MoE variants share the same structural backbone, simplifying deployment across mixed hardware fleets.
- On-device feasibility. Prefill and decode throughput on CPUs is roughly 2× that of comparable open models in many cases, reducing the need to offload routine tasks to cloud inference endpoints.
Instead of optimizing for academic novelty, the report reads as a systematic attempt to design models enterprises can actually ship.
That focus is notable, and more useful to enterprises, in a field where many open models quietly assume access to multi-H100 clusters during inference.
A training pipeline tuned for enterprise-relevant behavior
LFM2 adopts a training approach that compensates for the smaller scale of its models with structure rather than brute force. Key elements include:
- 10–12T token pre-training and an additional 32K-context mid-training phase, which extends the model’s useful context window without exploding compute costs.
- A decoupled Top-K knowledge distillation objective that sidesteps the instability of standard KL distillation when teachers provide only partial logits (a generic sketch of the top-k setup follows this list).
- A three-stage post-training sequence—supervised fine-tuning (SFT), length-normalized preference alignment, and model merging—designed to produce more reliable instruction following and tool-use behavior.
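The distillation detail matters because teachers often expose only their k largest logits per position, which leaves a full-vocabulary KL loss ill-defined. The snippet below is a generic top-k distillation loss in PyTorch, not the decoupled objective from the LFM2 report: it simply renormalizes both distributions over the teacher’s top-k token set before comparing them.

```python
import torch
import torch.nn.functional as F

def topk_distillation_loss(student_logits, teacher_topk_logits, teacher_topk_ids, temperature=2.0):
    """Generic top-k distillation loss (illustrative, not the LFM2 objective).

    student_logits:      (batch, seq, vocab) raw student logits
    teacher_topk_logits: (batch, seq, k) the teacher's k largest logits
    teacher_topk_ids:    (batch, seq, k) vocabulary indices of those logits
    """
    # Pick out the student's logits at the teacher's top-k vocabulary positions.
    student_topk = student_logits.gather(-1, teacher_topk_ids)

    # Renormalize both distributions over the shared k-token support.
    log_p_student = F.log_softmax(student_topk / temperature, dim=-1)
    log_p_teacher = F.log_softmax(teacher_topk_logits / temperature, dim=-1)

    # KL(teacher || student), averaged over all batch and sequence positions.
    kl = F.kl_div(log_p_student, log_p_teacher, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean()

# Toy shapes: batch 2, sequence 8, 32k vocab, teacher exposes its top 16 logits.
student = torch.randn(2, 8, 32_000)
teacher_ids = torch.randint(0, 32_000, (2, 8, 16))
teacher_logits = torch.randn(2, 8, 16)
loss = topk_distillation_loss(student, teacher_logits, teacher_ids)
```

The report’s decoupled formulation is more involved; this snippet only illustrates the restricted-support starting point that makes partial teacher logits usable at all.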
For enterprise AI developers, the significance is that LFM2 models behave less like “tiny LLMs” and more like practical agents able to follow structured formats, adhere to JSON schemas, and manage multi-turn chat flows. Many open models at similar sizes fail not due to lack of reasoning ability, but due to brittle adherence to instruction templates. The LFM2 post-training recipe directly targets these rough edges.
In other words: Liquid AI optimized small models for operational reliability, not just scoreboards.
Multimodality designed for device constraints, not lab demos
The LFM2-VL and LFM2-Audio variants reflect another shift: multimodality built around token efficiency.
Rather than embedding a massive vision transformer directly into an LLM, LFM2-VL attaches a SigLIP2 encoder through a connector that aggressively reduces visual token count via PixelUnshuffle. High-resolution inputs automatically trigger dynamic tiling, keeping token budgets controllable even on mobile hardware. LFM2-Audio uses a bifurcated audio path—one for embeddings, one for generation—supporting real-time transcription or speech-to-speech on modest CPUs.
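The token arithmetic behind that design is worth making concrete. In the sketch below (PyTorch; the 24×24 patch grid, the 768- and 2,048-dimensional widths, and the connector projection are illustrative assumptions, not LFM2-VL’s actual dimensions), a pixel-unshuffle step folds each 2×2 neighborhood of patch embeddings into the channel dimension, cutting the visual token count by 4× before anything reaches the language model.

```python
import torch
import torch.nn as nn

# Hypothetical SigLIP2-style output: a 24x24 grid of 768-dim patch embeddings,
# i.e. 576 visual tokens if passed to the language model unmodified.
patch_grid = torch.randn(1, 768, 24, 24)          # (batch, dim, H, W)

# PixelUnshuffle(2) folds every 2x2 patch neighborhood into the channel axis:
# 4x fewer spatial positions, 4x wider features.
reduced = nn.PixelUnshuffle(downscale_factor=2)(patch_grid)   # (1, 3072, 12, 12)

print(24 * 24, "visual tokens ->", 12 * 12)        # 576 -> 144

# A small connector (hypothetical sizes) then projects the folded features
# into the decoder's embedding space.
connector = nn.Linear(3072, 2048)
visual_tokens = connector(reduced.flatten(2).transpose(1, 2))  # (1, 144, 2048)
```

Dynamic tiling applies the same budget logic at higher resolutions: the image is split into tiles, each tile runs through the same encode-and-reduce path, and the total token count stays bounded.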
For enterprise platform architects, this design points toward a practical future where:
- document understanding happens directly on endpoints such as field devices;
- audio transcription and speech agents run locally for privacy compliance;
- multimodal agents operate within fixed latency envelopes without streaming data off-device.
The through-line is the same: multimodal capability without requiring a GPU farm.
Retrieval models built for agent systems, not legacy search
LFM2-ColBERT extends late-interaction retrieval into a footprint small enough for enterprise deployments that need multilingual RAG without the overhead of specialized vector DB accelerators.
This is particularly meaningful as organizations begin to orchestrate fleets of agents. Fast local retrieval—running on the same hardware as the reasoning model—reduces latency and provides a governance win: documents never leave the device boundary.
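Late interaction is also a simple scoring rule, which is part of why it ports well to modest hardware. The snippet below is a generic ColBERT-style MaxSim scorer, not LFM2-ColBERT’s specific encoders or embedding dimensions: every query token is matched against its most similar document token, and the per-token maxima are summed.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late-interaction score between one query and one document.

    query_embs: (num_query_tokens, dim), L2-normalized token embeddings
    doc_embs:   (num_doc_tokens, dim), L2-normalized token embeddings
    """
    sim = query_embs @ doc_embs.T            # cosine similarity matrix
    return sim.max(dim=-1).values.sum()      # best match per query token, summed

# Toy usage with random embeddings standing in for encoder outputs.
q = F.normalize(torch.randn(12, 128), dim=-1)
d = F.normalize(torch.randn(300, 128), dim=-1)
print(maxsim_score(q, d))
```

Because document token embeddings can be precomputed and stored on the device, the per-query cost reduces to small matrix multiplies, comfortably within reach of the same CPU running the reasoning model.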
Taken together, the VL, Audio, and ColBERT variants show LFM2 as a modular system, not a single model drop.
The emerging blueprint for hybrid enterprise AI architectures
Across all variants, the LFM2 report implicitly sketches what tomorrow’s enterprise AI stack will look like: hybrid local-cloud orchestration, where small, fast models operating on devices handle time-critical perception, formatting, tool invocation, and judgment tasks, while larger models in the cloud offer heavyweight reasoning when needed.
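What that orchestration looks like in practice is mostly routing policy. The sketch below is a hypothetical example, not part of the LFM2 report or Liquid’s LEAP stack: the task categories, latency threshold, and escalation rule are all assumptions, but they capture the division of labor the report implies.

```python
# Hypothetical routing policy for a hybrid local/cloud agent stack.
# Task names, the latency threshold, and the PII rule are illustrative only.

LOCAL_TASKS = {"classify_intent", "extract_fields", "format_json", "invoke_tool"}

def route(task_type: str, latency_budget_ms: int, contains_pii: bool) -> str:
    """Decide whether a request stays on the device or escalates to the cloud."""
    if contains_pii:
        return "local"            # governance: sensitive data never leaves the device
    if task_type in LOCAL_TASKS:
        return "local"            # routine perception, formatting, and tool calls
    if latency_budget_ms < 500:
        return "local"            # a cloud round-trip cannot meet a tight budget
    return "cloud"                # heavyweight reasoning escalates on demand

print(route("summarize_contract", latency_budget_ms=5_000, contains_pii=False))  # cloud
print(route("invoke_tool", latency_budget_ms=200, contains_pii=False))           # local
```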
Several trends converge here:
- Cost control. Running routine inference locally avoids unpredictable cloud billing.
- Latency determinism. Time-to-first-token (TTFT) and decode stability matter in agent workflows; running on-device eliminates network jitter.
- Governance and compliance. Local execution simplifies PII handling, data residency, and auditability.
- Resilience. Agentic systems degrade gracefully if the cloud path becomes unavailable.
Enterprises adopting these architectures will likely treat small on-device models as the “control plane” of agentic workflows, with large cloud models serving as on-demand accelerators.
LFM2 is one of the clearest open-source foundations for that control layer to date.
The strategic takeaway: on-device AI is now a design choice, not a compromise
For years, organizations building AI features have accepted that “real AI” requires cloud inference. LFM2 challenges that assumption. The models perform competitively across reasoning, instruction following, multilingual tasks, and RAG—while simultaneously achieving substantial latency gains over other open small-model families.
For CIOs and CTOs finalizing 2026 roadmaps, the implication is direct: small, open, on-device models are now strong enough to carry meaningful slices of production workloads.
LFM2 will not replace frontier cloud models for the heaviest reasoning workloads. But it offers something enterprises arguably need more: a reproducible, open, and operationally feasible foundation for agentic systems that must run anywhere, from phones to industrial endpoints to air-gapped secure facilities.
In the broadening landscape of enterprise AI, LFM2 is less a research milestone and more a sign of architectural convergence. The future is not cloud or edge—it’s both, operating in concert. And releases like LFM2 provide the building blocks for organizations prepared to build that hybrid future intentionally rather than accidentally.
