How Are the Best Language Models Developed?

An understandable behind-the-scenes look at ChatGPT, Claude, Gemini and DeepSeek. From tokens and probabilities to RLHF training – this is how modern Large Language Models are created as of February 2025.

Liam van der Viven
Co-Founder & CTO at botBrains
Drivers who understand how a car technically works often drive more safely and effectively. They can identify strange noises or unusual behavior much faster and are aware of their vehicle's limitations. The same applies to large language models (LLMs) like ChatGPT: Those who roughly understand how they tick internally can use them better – and assess their limitations more realistically.
Understanding LLMs better means being able to use LLMs better.
So let's turn you into a popular-science expert in roughly three pages. At their core, large language models are statistical "imitators": they are fed huge amounts of data and learn to recognize patterns in it. For Large Language Models this means: we feed them text and their task is to predict the next word (more precisely, the next token – more on this later). A sentence emerges by appending the newly predicted token and repeating this step. But why do LLMs stop generating? Why do they ask follow-up questions? Why doesn't "Heil" get completed to "Hitler" in, say, 20% of cases, even though there is still enough radical material on the internet? We will answer these questions.
For this to work, we first break down the texts into tokens. A token can be a word, part of a word, a punctuation mark, an emoji, or any other character sequence. ChatGPT, for example, has a vocabulary of around 200,000 such tokens. Any text – whether German, English, or a mix of symbols – is translated into a sequence of these tokens. Why 200,000? A vocabulary that's too small makes the token sequences very long, which makes training difficult; a vocabulary that's too large also complicates the model because we have to calculate even more probabilities in each step.
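To make this concrete, here is a minimal sketch using the open-source tiktoken library and its o200k_base encoding (the tokenizer OpenAI has published for its recent models); the example sentence is arbitrary and the printed IDs depend on the text:

```python
# Sketch: turning text into tokens with the open-source tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")          # encoding published by OpenAI
print(enc.n_vocab)                                 # roughly 200,000 entries in the vocabulary
tokens = enc.encode("Large language models are statistical imitators.")
print(tokens)                                      # a list of integer token IDs
print([enc.decode([t]) for t in tokens])           # each ID maps back to a piece of text
```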
After we have prepared the entire content-rich part of the internet – Wikipedia, forums like Reddit, and so on – in this way, we have a large neural network guess countless times which token comes next. This is called Pre-Training; it is the phase in which the Base Model is created. The result: a pure autocomplete system that has absorbed the statistical knowledge of mountains of text, but is not yet tuned for "helpfulness" or user-friendliness.
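As a rough illustration – not any lab's actual training code – the heart of pre-training is a next-token prediction loss. In the hedged PyTorch sketch below, an embedding plus a linear layer stands in for a real Transformer; only the objective matters here:

```python
import torch
import torch.nn.functional as F

vocab_size, dim = 50_000, 512
embedding = torch.nn.Embedding(vocab_size, dim)    # toy stand-in for a real Transformer
lm_head = torch.nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))     # one 16-token snippet of "internet text"
hidden = embedding(tokens[:, :-1])                 # input: every token except the last
logits = lm_head(hidden)                           # one score per vocabulary entry, per position
targets = tokens[:, 1:]                            # target: the respective next token
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                    # gradients nudge the network toward better guesses
```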
The next step is Instruct Finetuning. Humans (so-called Human Labelers) or other helper models create dialogue examples: a "question" and an "ideal answer." The base model thus learns to give helpful and friendly responses instead of arbitrarily completing text. This creates an Instruct Model that answers questions the way you would expect from an assistant. In particular, alignment is already introduced here: the model learns, for example, to complete "Heil" with "Hitler," but then immediately puts the phrase in the right context ("Heil Hitler" is a Nazi greeting that was established as an official greeting in the Third Reich (1933-1945). It served the ideological and propagandistic glorification of Adolf Hitler and the Nazi regime...). The model also learns to withhold certain knowledge.
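Schematically, a single such dialogue example might look like this – a hypothetical, simplified format; every lab uses its own:

```python
# One instruct-finetuning example: a question paired with an ideal answer.
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}
```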
Why do LLMs stop responding?
Under the hood, conversations are broken down into special "tokens" that form something like a role structure: a start token that opens a contribution (in ChatGPT's chat format written as im_start), a separator after which the actual content follows, and an end token that closes the contribution (im_end). As soon as the model produces such an end token – a kind of full stop for its turn – the software interprets this as the signal to conclude the answer, or the internal dialogue switches to the next role, such as "User." That's why it sometimes seems as if the model suddenly falls silent or deliberately asks questions: it orients itself by these token boundaries and thus "knows" at which point the conversation is meaningfully handed back to the user or continued.
Example: When you send a request to ChatGPT, it internally generates a sequence like:
<start>assistant<middle>How can I help you today?<end>
<start>user<middle>What color is the sun?<end>
<start>assistant<middle>
The AI model doesn't really "know" when to stop; it merely calculates probabilities for the next token. A generated token sequence could be, for example, Blue.<end>. As soon as the model produces this <end> token, the surrounding software stops sampling further tokens from the Instruct Model. The model therefore never decides anything itself; its only task is always to calculate a probability distribution over the vocabulary.
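Put as code, the stopping logic lives in the loop around the model, not in the model itself. The following is a hedged sketch; generate, next_token_probs, and toy_model are names invented for this illustration, not a real API:

```python
import random

END_TOKEN = "<end>"

def generate(next_token_probs, prompt_tokens, max_tokens=50):
    """next_token_probs(tokens) returns a dict of candidate next tokens and their probabilities."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        probs = next_token_probs(tokens)                              # distribution over the vocabulary
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        next_token = random.choices(candidates, weights=weights)[0]   # the "dart throw"
        if next_token == END_TOKEN:                                   # the surrounding software stops here;
            break                                                     # the model only rated <end> as likely
        tokens.append(next_token)
    return tokens

# Hypothetical toy "model" that answers the color question and then wants to stop.
def toy_model(tokens):
    if tokens[-1] == "<middle>":
        return {"Yellow": 0.7, "Blue": 0.3}
    if tokens[-1] in ("Yellow", "Blue"):
        return {".": 1.0}
    return {END_TOKEN: 1.0}

print(generate(toy_model, ["<start>", "assistant", "<middle>"]))
```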
What is Temperature?
Imagine a dartboard divided into several sectors. Each of these sectors represents a possible next token that the language model could output. The surface area is allocated proportionally to the probability. If a certain token has a particularly high probability, its sector is correspondingly large; tokens with lower probability, on the other hand, only get a small section of the board.
In so-called sampling, we essentially "throw" a dart blindly at this board. The token belonging to the sector we hit becomes our next token. Obviously, we hit larger sectors more often.
The Temperature parameter determines how much we "compress" or "stretch" these sectors (probabilities).
A high temperature makes larger sectors smaller, while small sectors grow. This means that even inherently unlikely tokens get more surface area. The dart throw thus more often hits "exotic" tokens. The answers sometimes sound more creative, but can also be more chaotic.
A low temperature enlarges already large sectors and shrinks the small ones further. This increases the chance that the dart keeps landing on the same, most likely tokens. The texts become more uniform and predictable, but often seem less original.
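In code, this compressing and stretching is usually done by dividing the model's raw scores (logits) by the temperature before turning them into probabilities. A minimal sketch, assuming NumPy; the three logits are made up for illustration:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0):
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())          # numerically stable softmax
    probs /= probs.sum()
    rng = np.random.default_rng()
    return rng.choice(len(probs), p=probs)         # the blind dart throw

logits = [4.0, 2.0, 0.5]                           # raw scores for three candidate tokens
print(sample_with_temperature(logits, 0.2))        # almost always token 0
print(sample_with_temperature(logits, 2.0))        # tokens 1 and 2 come up far more often
```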
How do we ensure that the LLM doesn't immediately stop generating?
Among other things, through Reinforcement Learning from Human Feedback (RLHF) finetuning. Instead of having humans write every ideal answer themselves, we train a second AI model that itself estimates how much human approval an answer would receive. It acts as an evaluation instance, also called a "Reward Model." Its training data is built by human annotators who rate different answer suggestions from the language model by quality. Perhaps it is now also clear why ChatGPT sometimes asks us to select which of two generated answers we prefer. These judgments form a dataset from which the Reward Model can be built.
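A hedged sketch of how a single preference judgment trains the Reward Model is shown below; the linear layer and the random "answer representations" are toy stand-ins for a full language model with a score head:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
reward_model = torch.nn.Linear(8, 1)       # toy: maps an 8-dim answer representation to a score

# One human judgment: the annotator preferred answer A over answer B to the same question.
answer_a = torch.randn(8)                  # stand-in representation of the preferred answer
answer_b = torch.randn(8)                  # stand-in representation of the rejected answer

score_a = reward_model(answer_a)
score_b = reward_model(answer_b)
loss = -F.logsigmoid(score_a - score_b)    # push the preferred answer's score above the other
loss.backward()                            # one step toward matching human judgments
```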
Now our LLM generates a series of possible answers to a question and has them evaluated by the Reward Model. Answers with high scores reinforce the current behavior, while low scores weaken it. This is how the language model learns to preferentially produce answers that the Reward Model rates positively. Preferring means we increase the probabilities of the tokens that make up those answers. This training runs only to a limited extent so that the model cannot learn to outsmart the Reward Model.
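A massively simplified sketch of this reinforcement step follows – a REINFORCE-style update on a toy five-token "model," not the exact algorithm any of these labs uses:

```python
import torch
import torch.nn.functional as F

vocab_size = 5
logits = torch.zeros(vocab_size, requires_grad=True)   # toy "policy": one distribution over 5 tokens
optimizer = torch.optim.SGD([logits], lr=0.5)

def reward_model_score(token_id):                       # hypothetical stand-in for a real Reward Model
    return 1.0 if token_id == 3 else -1.0               # pretend humans like token 3 best

for _ in range(100):
    probs = F.softmax(logits, dim=0)
    token = torch.multinomial(probs, 1).item()          # the model "imagines" an answer
    reward = reward_model_score(token)                  # the Reward Model rates it
    loss = -reward * torch.log(probs[token])            # positive reward -> raise that token's probability
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(F.softmax(logits, dim=0))                         # probability mass has shifted toward token 3
```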
What results are answers that align more strongly with human ideas of relevance, friendliness, and quality. Ideally, the model responds more helpfully and consistently, but it still has clear limits in use: despite RLHF it can run into unknown edge cases or erroneous estimates by the Reward Model.
Anyone who has been watching the LLM space in recent weeks knows that a model called DeepSeek R1 has made big waves. The background: it performs at the level of current models, but cost much less to train. This is because the datasets for such Reward Models are getting better and better, so we are getting better at training the LLM and can start RLHF earlier.
In summary:
- Base Model: Translate huge text data into an autocomplete model.
- Instruct Model: Human example answers → Chat assistant behavior.
- Fine-Tuned Model: further fine-tuning through targeted rewards (RLHF) or corrections.
Those who roughly understand that ChatGPT and Co. are fundamentally based on probabilities of tokens recognize more quickly why they sometimes hallucinate nonsense or output correct-seeming but false facts. Similar to a driver who "knows" their car, with this knowledge one can specifically use the strengths of an AI model and more confidently navigate its peculiarities. How audio and images can be processed will be discussed another time.