Stop Asking "Which LLM Should I Use." Start Asking "Which One For This Job."
The question I get most often from senior operators is some version of "which model should we standardize on?" The honest answer is none of them. The right question is which model for which job, and the people who figure that out early get a quiet but compounding advantage.
Naval Ravikant talked about this on his recent podcast episode at nav.al/code, and his framing is the cleanest I have heard. He uses four LLMs every day and treats each one like a different colleague. Claude for building. Codex for finding bugs. Gemini for searching. Grok for the truth on news.
Let me unpack that because it maps almost perfectly to what I see in the field.
Claude Opus 4.6 is the model you reach for when you want something built well. Long form thinking, code that holds together across files, careful reasoning under pressure. It is slower and more expensive. You use it when the cost of being wrong is high.
GPT-5.3 Codex is the surgical instrument. Drop a stack trace, paste a function that is misbehaving, and it finds the issue faster than the model that wrote the original code. Naval calls it his bug finder and that is exactly the right framing. Specialist, not generalist.
Gemini 3.1 Pro is the searcher. Massive context window, native multimodal, deep integration with Google search under the hood. When you need to chew through a hundred pages of filings or pull together what changed across a domain in the last quarter, Gemini is the one. It is also where I send junior analysts who used to spend a day on a literature review.
Grok is the news model. Real time access, less filtered, comfortable with controversy. Whether you love or hate the politics around it does not matter. For news and current events where you need raw signal before someone has packaged a take, it is useful.
Then below the flagship tier you have the workhorses. GPT-5.2 and Sonnet 4.6 for general purpose work. Haiku 4.5, Gemini Flash, and GPT-5.4 Mini for high volume, latency sensitive tasks. These are your batch processors and your customer facing endpoints.
Three rules to internalize.
One. Stop standardizing on one model. The cost of running four is trivial compared to the productivity loss of using the wrong tool.
Two. Match the model to the job, not to the contract. Procurement loves single vendor deals. Your team will pay for it in subtle ways for years.
Three. Re-evaluate every quarter. The model that was best in January is rarely the best in October. This is a fast moving stack.
The operators who get this stop arguing about which LLM is best. They build a small mental library of which model owns which job, and they switch without thinking.
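That "small mental library" can be made literal. Here is a minimal sketch of a job-to-model dispatch table, using the assignments described above. All model identifier strings and the `route` helper are illustrative placeholders, not real API names, and your own table will differ:

```python
# Illustrative only: a job-to-model routing table based on the assignments
# in this article. Model identifier strings are placeholders, not real APIs.
JOB_TO_MODEL = {
    "build": "claude-opus-4.6",    # long-form building, high cost of error
    "debug": "gpt-5.3-codex",      # surgical bug finding
    "research": "gemini-3.1-pro",  # large-context search and synthesis
    "news": "grok",                # real-time, less filtered current events
    "batch": "haiku-4.5",          # high-volume, latency-sensitive tasks
}

def route(job: str) -> str:
    """Return the model assigned to a job, falling back to a workhorse."""
    # Unrecognized jobs go to a general-purpose default.
    return JOB_TO_MODEL.get(job, "gpt-5.2")
```

The point of writing it down, even informally, is rule three: a table like this is trivial to re-evaluate every quarter, while an unwritten habit is not.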
Source: Naval Ravikant, nav.al/code podcast.