The model’s higher cost and poorer real‑world performance undermine its appeal, forcing enterprises to favor rival LLMs for production coding and agentic workflows.
Google unveiled Gemini 3.1 Pro, a point-release upgrade touted as a major reasoning leap, featuring a 1 million-token context window and a 65,000-token output limit. The company claims a jump to 77.1% on the ARC-AGI2 benchmark, positioning the model as a flagship offering.
Independent testing on the creator's Kingbench suite tells a different story. On the oneshot benchmark, Gemini 3.1 Pro scored 96% (212/220), down from its predecessor's perfect 100%, while costing $1.73 versus $0.85 to run the same test. More strikingly, its agentic performance fell to 49.2%, ranking 19th of 46, after the previous model's 71.4% had placed it in the top ten.
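As a quick sanity check on the numbers above, the raw score and per-run costs reduce to a percentage and a cost ratio (a minimal sketch; the figures are taken from the article, not from re-running Kingbench):

```python
# Reported Kingbench oneshot results for Gemini 3.1 Pro (figures from the article).
correct, total = 212, 220
score_pct = round(correct / total * 100, 1)

# Per-run cost versus the predecessor's run of the same test.
new_cost, old_cost = 1.73, 0.85
cost_ratio = round(new_cost / old_cost, 2)

print(score_pct)   # 96.4, which the article rounds to 96%
print(cost_ratio)  # 2.04, i.e. roughly double the cost for a lower score
```

In other words, the newer model charges about twice as much per run while dropping roughly four points on the same test.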
The regression stems from a bloated planning phase. In a simple terminal‑calculator task the model spent 37 seconds looping through repetitive “thinking” sections, and on an image‑cropper task it lingered in planning for 114 seconds before emitting any code. It also failed to use Kilo CLI’s ask‑question tool, duplicated code, and introduced package‑name typos that caused npm 404 errors, whereas competitors like Sonnet 4.6 or Claude Opus 4.6 move straight to implementation.
For developers paying per token, Gemini 3.1 Pro offers no clear advantage over cheaper or more capable alternatives. Its only viable niche is free-tier access, where a 96% oneshot score is attractive; otherwise, models such as Sonnet 4.6, Opus 4.6, or GLM-5 deliver higher accuracy and far more efficient agentic behavior, casting doubt on Google's ability to compete on price-performance.