Apple has recently shared benchmark details for its two in‑house AI models. The numbers look promising but also confirm that the tech giant still has a way to go in matching the performance of industry leaders. Notably, the smaller model – packed with 3 billion parameters – is now available for third‑party developers.
The approach involves two distinct models: one tailored for on‑device use, and a larger server‑based version designed for demanding tasks. Benchmarks suggest that the on‑device model edges out competitors like Qwen‑2.5‑3B and nearly rivals the capabilities of Qwen‑3‑4B and Gemma‑3‑4B. Still, some caution is warranted: those rivals are only modestly larger, so keeping pace with them marks steady progress rather than a dramatic leap.
In tests judged by human evaluators, Apple’s models trail behind OpenAI’s GPT‑4o, which has led many to speculate that the company’s ChatGPT partnership with OpenAI will remain important for future enhancements. Meanwhile, the larger server model draws comparisons to Meta’s Llama‑4‑Scout. Although Apple has kept its parameter details under wraps, hints suggest it might be close to Meta’s setup of 109 billion total parameters, including 17 billion active ones.
What stands out is Apple’s use of a parallel‑track mixture‑of‑experts architecture. The design splits the model into several smaller transformer tracks that process tokens independently and synchronise with one another only every four layers, a change Apple says cuts communication overhead by 87.5%. While the on‑device model uses aggressive compression techniques to run efficiently on iPhones and iPads, the server model benefits from a compression method borrowed from graphics rendering.
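As a rough illustration of the general pattern only (not Apple’s implementation; the track count, the elementwise “layer” and the averaging step below are all invented for the sketch), parallel tracks do most of their work independently and exchange state only at fixed intervals:

```swift
import Foundation

// Toy stand-in for one track of a parallel-track model. The "layer" is just a
// cheap elementwise transform; a real track would be a stack of transformer layers.
struct Track {
    var hidden: [Double]

    mutating func applyLayer() {
        hidden = hidden.map { tanh($0 * 1.01 + 0.001) }
    }
}

// Stand-in for the cross-track exchange at a synchronisation point:
// here the tracks simply average their states.
func synchronise(_ tracks: inout [Track]) {
    let width = tracks[0].hidden.count
    var mean = [Double](repeating: 0, count: width)
    for track in tracks {
        for i in 0..<width { mean[i] += track.hidden[i] / Double(tracks.count) }
    }
    for i in tracks.indices { tracks[i].hidden = mean }
}

let totalLayers = 16
let syncEvery = 4   // tracks only communicate every fourth layer
var tracks = (0..<4).map { t in
    Track(hidden: [Double](repeating: Double(t) * 0.1, count: 8))
}

var syncPoints = 0
for layer in 1...totalLayers {
    for i in tracks.indices { tracks[i].applyLayer() }   // independent, parallelisable work
    if layer % syncEvery == 0 {                          // occasional synchronisation
        synchronise(&tracks)
        syncPoints += 1
    }
}
print("Communicated at \(syncPoints) of \(totalLayers) layers.")
```

Communicating at only a fraction of the layers, rather than at every one, is where the claimed saving in communication overhead comes from.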
In the realm of image recognition, Apple’s device model holds its own against systems like InternVL‑2.5‑4B, Qwen‑2.5‑VL‑3B‑Instruct, and Gemma‑3‑4B. Apple claims it outperforms both InternVL and Qwen, though it only matches Gemma’s level of competence. For the server model, performance reviews are mixed – it beats Qwen‑2.5‑VL‑32B in fewer than half the tests, yet it still lags behind Llama‑4‑Scout and GPT‑4o, as judged by human assessments.
Under the hood, the on‑device model is paired with a roughly 300‑million‑parameter vision encoder, while the server model uses a 1‑billion‑parameter counterpart. Both have benefited from enormous training datasets: more than 10 billion image‑text pairs and 175 million documents containing embedded images. Developers can tap into the 3‑billion‑parameter device model via Apple’s Foundation Models framework, which works well for tasks like summarisation, information extraction, and text comprehension. Just note that it isn’t designed for open‑ended chatbot experiences.
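As a rough sketch of what that access looks like (based on the API Apple previewed at WWDC; exact names and signatures may differ in the shipping SDK), a summarisation call runs through a language‑model session:

```swift
import FoundationModels

// Approximate sketch of the Foundation Models framework API, not reference
// documentation: the availability check, session setup and respond(to:) follow
// Apple's WWDC examples but may differ in detail from the shipping SDK.
func summarise(_ article: String) async throws -> String {
    // The on-device model is only available on Apple Intelligence devices.
    guard case .available = SystemLanguageModel.default.availability else {
        return "On-device model unavailable."
    }

    // A session wraps the ~3B-parameter on-device model; instructions steer it
    // toward the kind of focused task the framework is meant for.
    let session = LanguageModelSession(
        instructions: "Summarise the user's text in three sentences."
    )
    let response = try await session.respond(to: article)
    return response.content
}
```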
For its part, the server model remains proprietary, powering Apple Intelligence features across the ecosystem. The accompanying framework adds built‑in AI tools that integrate tightly with Apple’s Swift programming language: developers can annotate their own data structures so that the model generates output conforming to them automatically. And to better serve a global audience, Apple has expanded the model’s vocabulary from 100,000 to 150,000 tokens, backing this up with culturally sensitive evaluations across 15 languages.
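That “tagged data structure” workflow is what Apple calls guided generation. A minimal sketch, again following the WWDC examples rather than the exact shipping API, might look like this:

```swift
import FoundationModels

// Sketch of guided generation: annotating a Swift type so the model's output
// is constrained to fit it. Macro and method names follow Apple's WWDC
// examples and may differ slightly in the shipping SDK.
@Generable
struct ArticleDigest {
    @Guide(description: "A one-sentence summary of the article")
    var summary: String

    @Guide(description: "Up to five key topics mentioned in the article")
    var topics: [String]
}

func digest(_ article: String) async throws -> ArticleDigest {
    let session = LanguageModelSession()
    // Because ArticleDigest is @Generable, the framework steers decoding so the
    // response arrives as a typed value rather than free-form text.
    let response = try await session.respond(to: article, generating: ArticleDigest.self)
    return response.content
}
```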
Training data is sourced from what Apple describes as “hundreds of billions of pages” via its web crawler, Applebot, which honours robots.txt exclusions and steers clear of personal data. Despite these safeguards, debate continues over whether a publisher’s failure to opt out can really be treated as consent to AI training. Overall, these benchmarks reinforce expectations set at this year’s WWDC: Apple’s AI models have made headway but still face stiff competition from the likes of Google and OpenAI.