Large Language Models (LLMs) are evolving rapidly, with various companies releasing high-performance models.
So, how is the “strongest LLM” determined?
In this article, we summarize:
- Major companies developing LLMs
- LLM rankings (based on Chatbot Arena)
- How the rankings are determined
- Example of a rating battle
## Major LLM Developers
Company | Representative Models | Overview |
---|---|---|
OpenAI | GPT-4, GPT-4 Turbo, GPT-4o | The world’s leading company, operating ChatGPT and leading the field with general-purpose models. |
Anthropic | Claude 3 Opus, Sonnet, Haiku | A rising star known for safety and strong long-text understanding. |
Google DeepMind | Gemini 1.5, Gemini 2.5 | Focused on multimodal capabilities; strong in both search and reasoning. |
Meta | LLaMA 2, LLaMA 3 | A major player promoting open-source LLMs. |
xAI (Elon Musk) | Grok 1, 2, 3 | Developed for X (formerly Twitter); known for creative responses. |
DeepSeek | DeepSeek-VL, DeepSeek-MoE, DeepSeek-Coder | High-performing Chinese models, especially strong in code generation and MoE technology. |
Mistral AI | Mistral 7B, Mixtral 8x7B | Provides lightweight, high-performance open-source models. |
Cohere | Command R, Command R+ | Commercial models optimized for RAG (retrieval-augmented generation). |
Perplexity AI | PPLX-70B-Online | Dialogue LLMs leveraging real-time search. |
Alibaba | Qwen 1.5, Qwen 2.5-Max | One of the largest players in China, also strong in English language support. |
## LLM Rankings (Based on Chatbot Arena)
(*As of April 2025)
(Reference: Chatbot Arena https://lmarena.ai/)
Model Name | Company | Overview |
---|---|---|
Gemini-2.5-Pro-Exp-03-25 | Google DeepMind | Latest Gemini experimental version, strong in long-text and reasoning. |
GPT-4o | OpenAI | Multimodal-capable, with improved response speed. |
Grok-3-Preview | xAI | Creative and natural dialogue model developed for X. |
GPT-4.5-Preview | OpenAI | An improved GPT-4, offered as an experimental preview ahead of official release. |
Gemini-2.5-Flash-Preview | Google DeepMind | Gemini variant optimized for speed. |
Claude 3 Opus | Anthropic | Top model of the Claude series, particularly strong in long-text understanding. |
Claude 3 Sonnet | Anthropic | Lighter than Opus, balancing speed and quality. |
GPT-4-Turbo-2024-04-09 | OpenAI | Cost-performance optimized version of GPT-4. |
Gemini 1.5 Pro | Google DeepMind | Stable in long-text processing and reasoning. |
Claude 3 Haiku | Anthropic | Lightest model in the Claude series, excels in fast responses. |
## How the Rankings are Determined
### Overview: Human Voting and Elo Rating
At Chatbot Arena, two models respond to the same prompt. Human users compare the two responses and vote for the better one. Rankings are based on Elo ratings calculated from these votes.
### Blind Matches
- Model names are hidden during comparison.
- Prevents bias and ensures fair evaluations.
### Elo Rating System
The expected win rate of model A against model B is calculated as:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

The rating is then updated based on the match result:

$$R_A' = R_A + K(S_A - E_A)$$

where:
- $R_A$, $R_B$: current ratings of models A and B
- $R_A'$: updated rating
- $E_A$: expected win rate
- $S_A$: match outcome (win = 1, lose = 0)
- $K$: constant determining the update magnitude
A win against a stronger opponent (winning more than expected) produces a larger rating increase, while a win that was already expected produces a smaller one.
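To make the update concrete, here is a minimal Python sketch of the Elo calculation described above. The function names and the default `k = 32` are illustrative assumptions for this article, not parameters published by Chatbot Arena.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected win rate of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, outcome_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return the updated ratings after one blind match.

    outcome_a is 1.0 if model A wins, 0.0 if it loses (0.5 for a tie);
    k controls how strongly a single vote moves the ratings.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b
```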
### Variety in Prompts
- Prompts are randomly assigned.
- Topics include general knowledge, calculations, creative writing, code generation, etc.
- Judged on correctness, clarity, creativity, and more.
## Example of a Rating Battle
### Prompt
“Propose a new urban transportation system.”
### Model Responses
Model | Response |
---|---|
Model A | Proposed a system combining drones and subways to alleviate ground congestion and utilize underground and airspace. |
Model B | Proposed an autonomous bus-centered network using existing infrastructure and aiming for flexible routing and cost savings. |
### How the Winner is Decided
- Users weigh factors such as novelty, feasibility, and clarity, and vote for the better response.
- The winning model’s Elo rating is increased and the losing model’s rating is decreased, as in the worked sketch below.
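As a concrete, purely illustrative example using the `elo_update` sketch above (with assumed starting ratings, not actual Arena figures): suppose Model A is rated 1500, Model B is rated 1600, and the user votes for Model A.

```python
new_a, new_b = elo_update(1500, 1600, outcome_a=1.0)
# Model A was expected to win only about 36% of the time, so its upset win
# raises its rating by roughly 20 points, while Model B loses the same amount.
print(round(new_a), round(new_b))  # 1520 1580
```

Had the higher-rated Model B won instead, its gain would have been smaller (roughly 11 points with the same assumptions), because the result matched expectations.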
### Other Sample Prompts
- “Write a short story about a future where time travel is possible.”
- “Explain black holes in a way that a primary school student can understand.”
- “Write a Python program for FizzBuzz.”
- “Propose a new marketing strategy for a company.”
- “Create a simple recipe for beginners in cooking.”
## Conclusion
In Chatbot Arena, rankings are not determined just by “correctness,” but by whether the response feels natural and compelling to humans.
With the continual entry of new models, the rankings will keep evolving. We hope this article helps you navigate the dynamic world of LLMs with ease!