Microsoft unveils seven MAI models for specific tasks and introduces Frontier Tuning

At Build 2026, on 2 June, Microsoft AI announced a family of seven models developed entirely in-house — the label is MAI, Microsoft AI. The bet is a multimodal ecosystem of models specialized for precise tasks — reasoning, code writing, image generation, speech synthesis and transcription — rather than a single generalist to pit against GPT or Claude. It is the move with which Microsoft builds its own proprietary stack alongside the OpenAI models it continues to distribute, behind which it has placed the "Superintelligence" team led by Mustafa Suleyman, formed in November 2025.

For anyone working with applied AI, the interesting news concerns strategy more than the number of models: specialization in place of one do-everything model, and a mechanism to adapt these models to each company's data and workflows. Here is what shipped, with the numbers on the table.

Diagram of the Microsoft AI MAI family: a central hub and seven specialized models — Thinking-1, Code-1-Flash, Image-2.5 and its Flash variant, Transcribe-1.5, Voice-2 and its Flash variant.

From three to seven models: the timeline

The June launch follows a first step in April. On 2 April 2026 Microsoft AI had already published its first three foundation models — MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2 — on Microsoft Foundry and the MAI Playground. That was the starting signal.

At Build 2026 the family widened to seven, with new or updated versions across every front. The thread Microsoft states is sharp: every model is trained from scratch on its own data — described as "clean, traceable and enterprise-grade" — and by direct construction rather than distillation from other labs. It is a precise stance on data provenance, a theme growing ever more sensitive for European compliance.

The seven models, one by one

MAI-Thinking-1 — the flagship reasoning model. A sparse Mixture-of-Experts with 35 billion active parameters out of roughly 1,000 billion total: a compute footprint kept lean relative to frontier models. Microsoft declares it "on par with Claude Opus 4.6 on SWE-Bench Pro" in agentic coding, and reports 97.0% on AIME 2025 and 94.5% on AIME 2026 in mathematical reasoning. In a blind human evaluation run with partner Surge across 1,276 tasks, users preferred it to Claude Sonnet 4.6. A 256k-token context window, function calling, compatibility with the Chat Completions API. Today in private preview on Foundry.

MAI-Code-1-Flash — lightweight agentic coding. Five billion active parameters, designed for direct integration into GitHub Copilot and VS Code. Microsoft describes it as comparable to Claude Haiku and cheaper: the point here is cost per call in high-volume workflows, more than peak power.

MAI-Image-2.5 (with a Flash variant) — text-to-image and image editing. It supports both generation from a text prompt and editing of existing images. On the public Arena leaderboard it sits third in text-to-image and second in image editing, where — per Microsoft — it surpasses Google's Nano Banana Pro. The Flash variant is compressed for sub-second production.

MAI-Transcribe-1.5 — transcription. Accuracy stated as state-of-the-art on the FLEURS benchmark, five times faster than competing models, with domain-terminology support across 43 languages.

MAI-Voice-2 (with Flash on the way) — speech synthesis. Natural, expressive voice across 15 languages, able to adapt to a voice from a short audio sample and — Microsoft states — with safeguards against misuse.

The real novelty: Frontier Tuning

Here lies the strategically most relevant point, beyond the individual models. Microsoft introduced Microsoft Frontier Tuning: the customer company trains the model on its own real workflows inside reinforcement-learning environments described as private "gyms," accessible to that organization alone.

The figure Microsoft brings as evidence: a MAI model tuned for Excel matches GPT 5.4 with up to ten times the efficiency. On a reference organization's complex operational tasks, a "frontier-tuned" model reportedly achieved the highest win rate among the models tested, at roughly a tenth of the cost.

Translated: the company's institutional knowledge becomes part of the model and stays the company's property. It is the vertical-AI thesis — models adapted to a specific domain more than generalists — applied at industrial scale. For anyone building vertical solutions it reads above all as a confirmation of direction: value shifts toward the proprietary data and the workflow, more than the base model.

Healthcare, silicon, and vision

Three side notes help read the move as a whole:

Mayo Clinic. Microsoft will co-develop a frontier model for healthcare with Mayo Clinic, trained on de-identified clinical data. The model will stay Mayo's property and later become available to other organizations via Foundry. Deep verticalization in a highly sensitive domain.
Maia 200. Microsoft co-designs the models with its own Maia 200 chip, claiming a 1.4x gain in performance per watt. The same self-sufficiency logic it already applies to data centers.
Humanist Superintelligence. The stated frame: advanced systems meant to "serve people and organizations," remaining tools under human control. Take it for what it is — a brand positioning as well as a technical one.

The models reach beyond Microsoft's ecosystem: alongside Foundry and the proprietary products, they ship on OpenRouter, Fireworks and Baseten, and for the first time developers can get their hands on the model weights.

How to read the benchmarks (with due caution)

A necessary clarification, editorially above all. The figures above are stated by Microsoft: the competitors' scores come from their respective official cards, and the "on par or better" human evaluation was run with a Microsoft partner (Surge). They are plausible, documented data — Microsoft also publishes a technical and safety report — and they remain self-reported benchmarks all the same.

Independent leaderboards such as Arena (human preferences) and Artificial Analysis offer a useful external check, and on those Microsoft cites verifiable placements. The practical rule stays simple: lab benchmarks indicate the order of magnitude, more than ground truth. The final judgment, for anyone who has to put these models into production, always comes from the test on their own real use case.

What changes for Italian and European companies

The direction reaches beyond Microsoft. Applied AI is shifting from "one big model that does everything" to constellations of specialized models, smaller, cheaper and context-adapted. Models with 5 or 35 billion active parameters, optimized for cost and efficiency, embedded in the workflows where they are needed and personalized on the data of whoever uses them.

For the Italian and European market, two elements weigh more than the rest. The first is the explicit attention to the provenance and traceability of training data, which speaks directly to compliance needs. The second is the fine-tune ownership model — "your model, on your data, controlled by you" — which opens concrete room for anyone building vertical solutions in regulated sectors.

What to do now

Some practices hold regardless of the vendor chosen. Evaluate specialized models per task, weighing cost per call against the power the use case actually requires, in place of defaulting to the biggest model. Put the provenance and traceability of training data among the vendor selection criteria, alongside performance and price. Clarify from the negotiation onward the ownership of the fine-tune and of the artifacts derived from company data. Treat declared benchmarks as a hypothesis to verify with a test on your own use case, ahead of any production decision.

The next phase plays out on two fronts: the portability of these vertical models across vendors and environments, and companies' ability to turn their operational knowledge into a trainable, owned asset. Whoever takes hold first of the data and the workflow starts ahead when specialization becomes the norm.