How is the strongest LLM determined?

Large Language Models (LLMs) are evolving rapidly, with various companies releasing high-performance models.

So, how is the “strongest LLM” determined?

In this article, we summarize:

  • Major companies developing LLMs
  • LLM rankings (based on Chatbot Arena)
  • How the rankings are determined
  • Example of a rating battle

Major LLM Developers

| Company | Representative Models | Overview |
| --- | --- | --- |
| OpenAI | GPT-4, GPT-4 Turbo, GPT-4o | The world’s leading company, operating ChatGPT and leading the field with general-purpose models. |
| Anthropic | Claude 3 Opus, Sonnet, Haiku | A rising star known for safety and strong long-text understanding. |
| Google DeepMind | Gemini 1.5, Gemini 2.5 | Focused on multimodal capabilities; strong in both search and reasoning. |
| Meta | LLaMA 2, LLaMA 3 | A major player promoting open-source LLMs. |
| xAI (Elon Musk) | Grok 1, 2, 3 | Developed for X (formerly Twitter); known for creative responses. |
| DeepSeek | DeepSeek-VL, DeepSeek-MoE, DeepSeek-Coder | High-performing Chinese models, especially strong in code generation and MoE technology. |
| Mistral AI | Mistral 7B, Mixtral 8x7B | Provides lightweight, high-performance open-source models. |
| Cohere | Command R, Command R+ | Commercial models optimized for RAG (retrieval-augmented generation). |
| Perplexity AI | PPLX-70B-Online | Dialogue LLMs leveraging real-time search. |
| Alibaba | Qwen 1.5, Qwen 2.5-Max | One of the largest players in China, also strong in English language support. |

LLM Rankings (Based on Chatbot Arena)

(As of April 2025)

(Reference: Chatbot Arena https://lmarena.ai/)

| Model Name | Company | Overview |
| --- | --- | --- |
| Gemini-2.5-Pro-Exp-03-25 | Google DeepMind | Latest Gemini experimental version, strong in long-text and reasoning. |
| GPT-4o | OpenAI | Multimodal capable with improved response speed. |
| Grok-3-Preview | xAI | Creative and natural dialogue model developed for X. |
| GPT-4.5-Preview | OpenAI | An improved GPT-4, experimental ahead of official release. |
| Gemini-2.5-Flash-Preview | Google DeepMind | Gemini variant optimized for speed. |
| Claude 3 Opus | Anthropic | Top model of the Claude series, particularly strong in long-text understanding. |
| Claude 3 Sonnet | Anthropic | Lighter than Opus, balancing speed and quality. |
| GPT-4-Turbo-2024-04-09 | OpenAI | Cost-performance optimized version of GPT-4. |
| Gemini 1.5 Pro | Google DeepMind | Stable in long-text processing and reasoning. |
| Claude 3 Haiku | Anthropic | Lightest model in the Claude series, excels in fast responses. |

How the Rankings are Determined

Overview: Human Voting and Elo Rating

At Chatbot Arena, two models respond to the same prompt. Human users compare and vote for the better response. Rankings are based on Elo rating scores calculated from these votes.

Blind Matches

  • Model names are hidden during comparison.
  • Prevents bias and ensures fair evaluations.

Elo Rating System

The expected win rate of model A against model B is calculated as:

 E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}

The rating is then updated based on the match result:

 R'_A = R_A + K(S_A - E_A)

where:

  • R_A: Model A’s current rating
  • R_B: Model B’s current rating
  • R'_A: Model A’s updated rating
  • E_A: Model A’s expected win rate (from the formula above)
  • S_A: Match outcome for model A (win = 1, loss = 0, tie = 0.5)
  • K: Constant determining the update magnitude

An unexpected win against a higher-rated opponent produces a large rating gain, while a win that was already expected produces only a small one.
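
As a minimal sketch, the two formulas above translate directly into Python. The K = 32 default below is an illustrative choice, not necessarily Chatbot Arena’s actual setting:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win rate E_A of model A (rating r_a) against model B (rating r_b)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, s_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the updated ratings (R'_A, R'_B) after one match.

    s_a is the outcome for model A: 1.0 for a win, 0.0 for a loss,
    0.5 for a tie. Both ratings move by the same amount in opposite
    directions, because E_B = 1 - E_A.
    """
    delta = k * (s_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta
```

Returning both ratings keeps the update zero-sum, which matches the pairwise nature of arena votes: whatever one model gains, its opponent loses.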

Variety in Prompts

  • Prompts are randomly assigned.
  • Topics include general knowledge, calculations, creative writing, code generation, etc.
  • Judged on correctness, clarity, creativity, and more.
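
Putting these pieces together, the following self-contained sketch simulates the whole arena loop: an anonymous model pairing, a randomly assigned prompt, a simulated human vote, and the Elo update. The model names, hidden “true” strengths, and vote model are all invented for illustration.

```python
import random

# Hidden "true" strengths that drive the simulated votes (made-up numbers).
true_strength = {"model-x": 1500.0, "model-y": 1600.0, "model-z": 1450.0}
# Every model starts from the same public Elo rating.
ratings = {name: 1500.0 for name in true_strength}
prompts = [
    "Explain black holes simply.",
    "Write a Python program for FizzBuzz.",
    "Propose a new urban transportation system.",
]
K = 32.0


def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


for _ in range(10_000):
    a, b = random.sample(list(ratings), 2)  # blind, anonymous pairing
    prompt = random.choice(prompts)         # random prompt assignment
    # Stand-in for the human vote: A's win probability depends on the
    # hidden true strengths, not on the current public ratings.
    s_a = 1.0 if random.random() < expected_score(true_strength[a], true_strength[b]) else 0.0
    delta = K * (s_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta

# After many simulated votes, the rating order should recover the
# true-strength order: model-y > model-x > model-z.
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```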

Example of a Rating Battle

Prompt

“Propose a new urban transportation system.”

Model Responses

| Model | Response |
| --- | --- |
| Model A | Proposed a system combining drones and subways to alleviate ground congestion and utilize underground and airspace. |
| Model B | Proposed an autonomous bus-centered network using existing infrastructure and aiming for flexible routing and cost savings. |

How the Winner is Decided

  • Users consider novelty, feasibility, and clarity to vote for the better response.
  • The winning model’s Elo rating is increased, and the losing model’s rating is decreased.
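
To make this concrete with hypothetical numbers: suppose Model A is rated 1500 and Model B 1520, with K = 32. Model A’s expected win rate is

 E_A = \frac{1}{1 + 10^{(1520 - 1500)/400}} \approx 0.471

If Model A nevertheless wins (S_A = 1), its rating rises to

 R'_A = 1500 + 32(1 - 0.471) \approx 1516.9

while Model B drops by the same roughly 17 points, to about 1503.1.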

Other Sample Prompts

  • “Write a short story about a future where time travel is possible.”
  • “Explain black holes in a way that a primary school student can understand.”
  • “Write a Python program for FizzBuzz.” (a sample solution is shown below)
  • “Propose a new marketing strategy for a company.”
  • “Create a simple recipe for beginners in cooking.”
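
To give a sense of what voters actually compare on a coding prompt, a typical answer to the FizzBuzz prompt above might look like the following (one of many valid solutions):

```python
# Print 1 to 100, replacing multiples of 3 with "Fizz", multiples of 5
# with "Buzz", and multiples of both with "FizzBuzz".
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

Voters would then judge such answers on correctness, readability, and the quality of any accompanying explanation.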

Conclusion

In Chatbot Arena, rankings are not determined just by “correctness,” but by whether the response feels natural and compelling to humans.

With the continual entry of new models, the rankings will keep evolving. We hope this article helps you navigate the dynamic world of LLMs with ease!

