In internal tests across reasoning, coding, and writing, Avocado lags behind the latest top models from Google, OpenAI, and Anthropic. It does outperform Meta's previous model, Llama 4, which failed to meet expectations last year, and it edges out Google's Gemini 2.5 from March 2025. Gemini 3.0 from November, however, remains out of reach.
To answer that question, our resident RPG enthusiast Ram Iyer put together a set of five general questions about Baldur's Gate, which we ran against xAI's Grok and the models from the three major labs in a quasi-benchmark I've decided to call BaldurBench. In the interest of journalistic transparency, I've made all the chat transcripts public, so you can see them here: Grok, ChatGPT, Claude, and Gemini.