The model’s higher cost and poorer real‑world performance undermine its appeal, forcing enterprises to favor rival LLMs for production coding and agentic workflows.
Google unveiled Gemini 3.1 Pro, a point-release upgrade touted as a major reasoning leap, featuring a 1 million-token context window and a 65,000-token output limit. The company claims a jump to 77.1% on the ARC-AGI2 benchmark, positioning the model as a flagship offering.
Independent testing on the creator's Kingbench suite tells a different story. On the oneshot benchmark, Gemini 3.1 Pro scored 96% (212/220), down from its predecessor's perfect 100%, while costing $1.73 versus $0.85 to run the same test. More strikingly, its agentic performance fell to 49.2%, ranking 19th of 46, after the previous model's 71.4% had placed it in the top ten.
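As a quick sanity check on the numbers above, the raw score and per-run costs reduce to a percentage and a cost ratio (a minimal sketch; the figures are taken from the article, not from re-running Kingbench):

```python
# Reported Kingbench oneshot results for Gemini 3.1 Pro (figures from the article).
correct, total = 212, 220
score_pct = round(correct / total * 100, 1)

# Per-run cost versus the predecessor's run of the same test.
new_cost, old_cost = 1.73, 0.85
cost_ratio = round(new_cost / old_cost, 2)

print(score_pct)   # 96.4, which the article rounds to 96%
print(cost_ratio)  # 2.04, i.e. roughly double the cost for a lower score
```

In other words, the newer model charges about twice as much per run while dropping roughly four points on the same test.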
The regression stems from a bloated planning phase. In a simple terminal‑calculator task the model spent 37 seconds looping through repetitive “thinking” sections, and on an image‑cropper task it lingered in planning for 114 seconds before emitting any code. It also failed to use Kilo CLI’s ask‑question tool, duplicated code, and introduced package‑name typos that caused npm 404 errors, whereas competitors like Sonnet 4.6 or Claude Opus 4.6 move straight to implementation.
For developers paying per token, Gemini 3.1 Pro offers no clear advantage over cheaper or more capable alternatives. Its only viable niche is free-tier access, where a 96% oneshot score is attractive; otherwise, models such as Sonnet 4.6, Opus 4.6, or GLM-5 deliver higher accuracy and far more efficient agentic behavior, casting doubt on Google's ability to compete on price-performance.