How is the strongest LLM determined?

Large Language Models (LLMs) are evolving rapidly, with various companies releasing high-performance models.

So, how is the “strongest LLM” determined?

In this article, we summarize:

  • Major companies developing LLMs
  • LLM rankings (based on Chatbot Arena)
  • How the rankings are determined
  • Example of a rating battle

Major LLM Developers

| Company | Representative Models | Overview |
| --- | --- | --- |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-4o | The world’s leading company, operating ChatGPT and leading the field with general-purpose models. |
| Anthropic | Claude 3 Opus, Sonnet, Haiku | A rising star known for safety and strong long-text understanding. |
| Google DeepMind | Gemini 1.5, Gemini 2.5 | Focused on multimodal capabilities; strong in both search and reasoning. |
| Meta | LLaMA 2, LLaMA 3 | A major player promoting open-source LLMs. |
| xAI (Elon Musk) | Grok 1, 2, 3 | Developed for X (formerly Twitter); known for creative responses. |
| DeepSeek | DeepSeek-VL, DeepSeek-MoE, DeepSeek-Coder | High-performing Chinese models, especially strong in code generation and MoE technology. |
| Mistral AI | Mistral 7B, Mixtral 8x7B | Provides lightweight, high-performance open-source models. |
| Cohere | Command R, Command R+ | Commercial models optimized for RAG (retrieval-augmented generation). |
| Perplexity AI | PPLX-70B-Online | Dialogue LLMs leveraging real-time search. |
| Alibaba | Qwen 1.5, Qwen 2.5-Max | One of the largest players in China, also strong in English language support. |

LLM Rankings (Based on Chatbot Arena)

(As of April 2025)

(Reference: Chatbot Arena https://lmarena.ai/)

| Model Name | Company | Overview |
| --- | --- | --- |
| Gemini-2.5-Pro-Exp-03-25 | Google DeepMind | Latest Gemini experimental version, strong in long-text and reasoning. |
| GPT-4o | OpenAI | Multimodal capable with improved response speed. |
| Grok-3-Preview | xAI | Creative and natural dialogue model developed for X. |
| GPT-4.5-Preview | OpenAI | An improved GPT-4, experimental ahead of official release. |
| Gemini-2.5-Flash-Preview | Google DeepMind | Gemini variant optimized for speed. |
| Claude 3 Opus | Anthropic | Top model of the Claude series, particularly strong in long-text understanding. |
| Claude 3 Sonnet | Anthropic | Lighter than Opus, balancing speed and quality. |
| GPT-4-Turbo-2024-04-09 | OpenAI | Cost-performance optimized version of GPT-4. |
| Gemini 1.5 Pro | Google DeepMind | Stable in long-text processing and reasoning. |
| Claude 3 Haiku | Anthropic | Lightest model in the Claude series, excels in fast responses. |

How the Rankings are Determined

Overview: Human Voting and Elo Rating

At Chatbot Arena, two models respond to the same prompt. Human users compare and vote for the better response. Rankings are based on Elo rating scores calculated from these votes.

Blind Matches

  • Model names are hidden during comparison.
  • Prevents bias and ensures fair evaluations.

Elo Rating System

The expected win rate of model A against model B is calculated as:

 E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}

The rating is then updated based on the match result:

 R'_A = R_A + K(S_A - E_A)

where:

  • R_A: Model A’s current rating
  • R_B: Model B’s current rating
  • R'_A: Model A’s updated rating
  • E_A: Model A’s expected win rate (from the formula above)
  • S_A: Match outcome for model A (win = 1, loss = 0, tie = 0.5)
  • K: Constant determining the update magnitude

An unexpected win against a higher-rated opponent produces a large rating gain, while a win that was already expected produces only a small one.
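
As a minimal sketch, the two formulas above translate directly into Python. The K = 32 default below is an illustrative choice, not necessarily Chatbot Arena’s actual setting:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win rate E_A of model A (rating r_a) against model B (rating r_b)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the updated ratings (R'_A, R'_B) after one match.

    s_a is the outcome for model A: 1.0 for a win, 0.0 for a loss,
    0.5 for a tie. Both ratings move by the same amount in opposite
    directions, because E_B = 1 - E_A.
    """
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Returning both ratings keeps the update zero-sum, which matches the pairwise nature of arena votes: whatever one model gains, its opponent loses.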

Variety in Prompts

  • Prompts are randomly assigned.
  • Topics include general knowledge, calculations, creative writing, code generation, etc.
  • Judged on correctness, clarity, creativity, and more.
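
Putting these pieces together, the following self-contained sketch simulates the whole arena loop: an anonymous model pairing, a randomly assigned prompt, a simulated human vote, and the Elo update. The model names, hidden “true” strengths, and vote model are all invented for illustration.

```python
import random

# Hidden "true" strengths that drive the simulated votes (made-up numbers).
true_strength = {"model-x": 1500.0, "model-y": 1600.0, "model-z": 1450.0}
# Every model starts from the same public Elo rating.
ratings = {name: 1500.0 for name in true_strength}
prompts = [
    "Explain black holes simply.",
    "Write a Python program for FizzBuzz.",
    "Propose a new urban transportation system.",
]
K = 32.0


def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


for _ in range(10_000):
    a, b = random.sample(list(ratings), 2)  # blind, anonymous pairing
    prompt = random.choice(prompts)         # random prompt assignment
    # Stand-in for the human vote: A's win probability depends on the
    # hidden true strengths, not on the current public ratings.
    s_a = 1.0 if random.random() < expected_score(true_strength[a], true_strength[b]) else 0.0
    delta = K * (s_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta

# After many simulated votes, the rating order should recover the
# true-strength order: model-y > model-x > model-z.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```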

Example of a Rating Battle

Prompt

“Propose a new urban transportation system.”

Model Responses

| Model | Response |
| --- | --- |
| Model A | Proposed a system combining drones and subways to alleviate ground congestion and utilize underground and airspace. |
| Model B | Proposed an autonomous bus-centered network using existing infrastructure and aiming for flexible routing and cost savings. |

How the Winner is Decided

  • Users consider novelty, feasibility, and clarity to vote for the better response.
  • The winning model’s Elo rating is increased, and the losing model’s rating is decreased.
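
To make this concrete with hypothetical numbers: suppose Model A is rated 1500 and Model B 1520, with K = 32. Model A’s expected win rate is

 E_A = \frac{1}{1 + 10^{(1520 - 1500)/400}} \approx 0.471

If Model A nevertheless wins (S_A = 1), its rating rises to

 R'_A = 1500 + 32(1 - 0.471) \approx 1516.9

while Model B drops by the same roughly 17 points, to about 1503.1.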

Other Sample Prompts

  • “Write a short story about a future where time travel is possible.”
  • “Explain black holes in a way that a primary school student can understand.”
  • “Write a Python program for FizzBuzz.” (a sample solution is shown below)
  • “Propose a new marketing strategy for a company.”
  • “Create a simple recipe for beginners in cooking.”
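
To give a sense of what voters actually compare on a coding prompt, a typical answer to the FizzBuzz prompt above might look like the following (one of many valid solutions):

```python
# Print 1 to 100, replacing multiples of 3 with "Fizz", multiples of 5
# with "Buzz", and multiples of both with "FizzBuzz".
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

Voters would then judge such answers on correctness, readability, and the quality of any accompanying explanation.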

Conclusion

In Chatbot Arena, rankings are not determined just by “correctness,” but by whether the response feels natural and compelling to humans.

With the continual entry of new models, the rankings will keep evolving. We hope this article helps you navigate the dynamic world of LLMs with ease!

