Understanding Pre-Training in Large Language Models

Pre-training is the phase where we teach a model how language works.

Before a model can answer questions, write code, or chat with us, it needs to learn the structure and patterns of language. This learning happens during pre-training.

From Text to Tokens

Everything begins with raw text.

For example:

“The cat sat on the mat”

A model cannot work with raw text directly, because it operates on numbers. So the first step is to convert text into tokens.

Tokens are smaller units of text. They may look like words, subwords, or even characters depending on the tokenizer.

So the sentence becomes something like:

[“The”, “cat”, “sat”, “on”, “the”, “mat”]

This process is called tokenization.

One common method used for tokenization is Byte Pair Encoding (BPE).

Instead of splitting text only into words or characters, BPE creates subword tokens.

Let’s understand with an example:

Words:

low
lowest
new
newer

Step 1: Break into characters
l o w
l o w e s t
n e w
n e w e r

Step 2: Find the most frequent adjacent pair
For example: “l” + “o” → “lo” (it appears in both “low” and “lowest”)

Step 3: Merge it
Now we get:
lo w
lo w e s t
n e w
n e w e r

Step 4: Repeat
Eventually, we get useful tokens like:
“low”, “est”, “new”, “er”

So instead of storing every possible word, the tokenizer learns reusable building blocks, and even a word it has never seen can be split into known pieces.
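
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names (get_pair_counts, merge_pair, learn_bpe) are made up for this example, every word is counted once, and ties between equally frequent pairs are broken by whichever pair was seen first. Real BPE implementations also weight pairs by how often each word occurs in the corpus.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a list of symbols)."""
    counts = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(out)
    return merged_words

def learn_bpe(corpus, num_merges):
    """Toy BPE training loop: repeatedly merge the most frequent adjacent pair."""
    words = [list(word) for word in corpus]  # Step 1: break into characters
    merges = []
    for _ in range(num_merges):              # Step 4: repeat
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)   # Step 2: most frequent pair (ties: first seen)
        words = merge_pair(words, best)      # Step 3: merge it
        merges.append(best)
    return merges, words

merges, words = learn_bpe(["low", "lowest", "new", "newer"], num_merges=4)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('n', 'e'), ('ne', 'w')]
print(words)   # [['low'], ['low', 'e', 's', 't'], ['new'], ['new', 'e', 'r']]
```

With only four merges, “low” and “new” already become single tokens. On a larger corpus, more merges would also produce pieces such as “est” and “er”.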

Each model is trained with a specific tokenizer.

To find it, you can:

Check the model card
Read official documentation
Look at repositories (like Hugging Face Transformers)

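Some common patterns (rules of thumb, not guarantees):

A merges.txt file (usually alongside vocab.json) → BPE
.model files → SentencePiece (which can itself use BPE or a unigram model)
vocab.txt → WordPiece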
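
These file-name patterns can be written down as a small check. The helper below, guess_tokenizer_family, is hypothetical and exists only for illustration. It looks at file names alone, so its answer is a hint to confirm against the model card, not a reliable detection method.

```python
def guess_tokenizer_family(filenames):
    """Hypothetical helper: guess the tokenizer family from file names alone."""
    names = {name.lower() for name in filenames}
    if "merges.txt" in names:
        return "BPE (likely)"
    if any(name.endswith(".model") for name in names):
        return "SentencePiece (likely)"
    if "vocab.txt" in names:
        return "WordPiece (likely)"
    return "unknown"

print(guess_tokenizer_family(["vocab.json", "merges.txt"]))  # BPE (likely)
print(guess_tokenizer_family(["tokenizer.model"]))           # SentencePiece (likely)
print(guess_tokenizer_family(["vocab.txt"]))                 # WordPiece (likely)
```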

After tokenization, each token is converted into a token ID.

Example:
“cat” → 345

This is just a number used as an index.

Important point:

A token ID does NOT carry meaning.

It is only a reference.
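
A toy vocabulary makes this concrete. The sketch below assigns IDs in order of first appearance, which is an assumption made for illustration. Real tokenizers ship with a fixed vocabulary file, so the actual numbers will differ.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Build a toy vocabulary: each distinct token gets the next free integer.
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

# Converting tokens to IDs is a plain dictionary lookup.
token_ids = [vocab[token] for token in tokens]
print(token_ids)  # [0, 1, 2, 3, 4, 5]

# The reverse table turns IDs back into tokens.
id_to_token = {token_id: token for token, token_id in vocab.items()}
print(id_to_token[1])  # cat
```

Note that “The” and “the” get different IDs here, and that “cat” being 1 says nothing about cats. The number only identifies a row to look up later.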

Now comes the key step.

Each token ID is mapped to a vector using an embedding matrix.

So instead of:

cat → 345

We get:

cat → [0.55, 0.12, -0.44, …]

This list of numbers is called an embedding.
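
In code, the embedding matrix is a table with one row per token ID, and the lookup is plain indexing. The sketch below uses random values and tiny sizes (6 tokens, 4 dimensions), both chosen only for illustration. In a real model the rows start out random and are adjusted during training, and the sizes are far larger.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes, assumed for this example

# Embedding matrix: one row of `embedding_dim` numbers per token ID.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]

vectors = embed([1, 5])
print(len(vectors), len(vectors[0]))  # 2 4
```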

Why not represent a word with just one number?

Because a single number can only place words along one line. It cannot say that two words are alike in one respect and different in another.

With multiple numbers (a vector), the model can:

Represent different features
Compare similarity between words
Learn relationships

For example, after training:

“cat” and “dog” will tend to have similar vectors
“cat” and “car” will tend to be farther apart, even though they are spelled almost the same

This is known as a distributed representation.

Because related tokens end up close together, the model can use the geometry of these vectors to capture relationships.
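
One common way to compare two vectors is cosine similarity, which is near 1 when they point in the same direction and near 0 when they are unrelated. The three vectors below are hand-picked for this example and do not come from any trained model. They only show how the comparison works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional embeddings, for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(cat_dog > cat_car)  # True
```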

What Happens During Pre-Training?

Now we connect everything.

The flow looks like this:

Text → Tokens → Token IDs → Embeddings → Model → Prediction

The model is trained to predict the next token.

Example:
“The cat sat on the ___”

The model tries to predict:
“mat”
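
A single sentence gives more than one training example, because every position has a next token to predict. A minimal sketch, using word-level tokens for readability:

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Every prefix of the sequence is a context; the token that follows is the target.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in pairs:
    print(" ".join(context), "->", target)
# The last line printed is: The cat sat on the -> mat
```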

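It produces one score (called a logit) for every token in the vocabulary, converts the scores to probabilities with softmax, and compares them with the correct answer using a loss such as cross-entropy. The lower the probability given to the correct token, the higher the loss.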
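
The scoring step can be sketched with two small functions: softmax turns raw scores into probabilities that sum to 1, and cross-entropy measures how little probability went to the correct token. The three candidate tokens and their logits below are invented for illustration. A real model scores every token in its vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is the negative log of the probability given to the correct token."""
    return -math.log(probs[target_index])

candidates = ["mat", "dog", "car"]  # toy vocabulary, assumed for this example
logits = [2.0, 0.5, -1.0]           # made-up scores for "The cat sat on the ___"

probs = softmax(logits)
loss = cross_entropy(probs, candidates.index("mat"))
print(probs)
print(loss)
```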

Then it uses backpropagation to work out how each internal value (weight) contributed to the error, and an optimizer nudges the weights to reduce it.

This process repeats across billions of tokens of text.
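
The repeated update can be shown on a very small scale. The sketch below is a deliberate simplification: it treats three logits as the only trainable values and adjusts them directly, using the fact that the gradient of the cross-entropy loss with respect to each logit is its probability minus 1 for the correct token and minus 0 for the others. In a real model, backpropagation carries this gradient further back to the embeddings and all other weights. The candidate tokens, learning rate, and step count are assumptions chosen for the example.

```python
import math

def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["mat", "dog", "car"]  # toy vocabulary, assumed for this example
target = candidates.index("mat")
logits = [0.0, 0.0, 0.0]            # trainable values; start with no preference
learning_rate = 0.5

losses = []
for step in range(20):
    probs = softmax(logits)
    losses.append(-math.log(probs[target]))
    # Gradient of the loss w.r.t. each logit: probability minus one-hot target.
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Gradient descent: move each value a small step against its gradient.
    logits = [z - learning_rate * g for z, g in zip(logits, grads)]

print(losses[0] > losses[-1])  # True: the loss goes down as training proceeds
```

After these steps, “mat” receives most of the probability for this one context. Pre-training does the same thing across an enormous number of contexts at once.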
