The LLM Journey: How Neural Networks Learn to Predict the Next Token

Eshvar Balaji

In Part 1, we explored how large language models (LLMs) transform messy internet text into clean, tokenized sequences. But what happens next? How do those tokens become the fluent answers we see from models like ChatGPT, Claude, or Gemini?

The answer lies in neural networks — the engines that turn sequences of tokens into predictions of the next word. In this article, we’ll break down how training works inside an LLM, from tokens to parameters, and why scale is everything.

From Tokens to Training Data

Once text has been tokenized (compressed into IDs using methods like Byte Pair Encoding), the model no longer sees “words” in the traditional sense. Instead, it sees sequences of numbers:

“Hello world” : [15339, 1917]

These token IDs are the input to the model. During training, the network learns to predict the next token in a sequence, given the previous ones.

This framing is what makes LLMs so powerful: they are essentially next-token prediction machines trained at web scale.
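To make this concrete, here is a minimal Python sketch (with made-up token IDs) of how a tokenized sequence becomes next-token training pairs: every prefix is a context, and the token that follows it is the target.

# Illustrative only: these token IDs are hypothetical, not from a real tokenizer.
token_ids = [15339, 1917, 11, 1268, 527, 499, 30]

# Each prefix is a context; the token right after it is the prediction target.
pairs = [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

for context, target in pairs:
    print(f"context={context} -> predict {target}")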

Sliding Windows: How Models See Text

Because documents can be arbitrarily long, LLMs process text through a fixed-length context window.

GPT-3 had a window of 2,048 tokens.

Modern models like GPT-4 and Claude 3.5 support windows of 128k+ tokens.

Within this window, the model processes tokens in parallel using a Transformer architecture. Each token attends to others via self-attention, building context dynamically.
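As a rough illustration, the sketch below (with a toy window size of 8) shows how a long token stream is sliced into fixed-length windows, each of which becomes one training example.

# Toy sketch: real models use windows of 2,048 to 128k+ tokens, not 8.
def make_windows(token_ids, window_size=8, stride=8):
    """Slice a token stream into fixed-length chunks the model processes in parallel."""
    return [token_ids[i:i + window_size]
            for i in range(0, len(token_ids) - window_size + 1, stride)]

stream = list(range(100))        # stand-in for a long tokenized document
windows = make_windows(stream)   # each window is one training example
print(len(windows), windows[0])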

Neural Networks at Work

At its core, the network is built from three key components:

Embedding Layer

Each token ID is mapped to a dense vector representation.
Example:

ID 15339 : [0.25, -0.61, 0.73, …]

These embeddings capture semantic meaning.
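A minimal PyTorch sketch of this lookup, using hypothetical sizes (a 50,000-token vocabulary and 768-dimensional embeddings):

import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768          # hypothetical sizes
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[15339, 1917]])  # a batch of one 2-token sequence
vectors = embedding(token_ids)             # each ID becomes a 768-dim vector
print(vectors.shape)                       # torch.Size([1, 2, 768])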

Transformer Blocks

Self-Attention : allows tokens to “look at” each other and weigh relevance.

Feedforward Layers : transform information into richer representations.

Residual Connections & Layer Norm : stabilize deep training.
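Putting those three pieces together, here is a simplified PyTorch sketch of a single Transformer block (a pre-norm variant with hypothetical sizes; real implementations add dropout, causal masking, and other details):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention: each token weighs the relevance of the others.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                  # residual connection
        # Feedforward: position-wise transformation into richer features.
        x = x + self.ff(self.norm2(x))    # residual connection
        return x

block = TransformerBlock()
x = torch.randn(1, 16, 768)               # (batch, tokens, embedding dim)
print(block(x).shape)                      # torch.Size([1, 16, 768])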

Output Layer

The final layer produces a probability distribution over the vocabulary.
Example:

Next token probabilities: [“world”=0.72, “there”=0.14, “friend”=0.06, …]

The highest-probability token is chosen (or one is sampled from the distribution), and the process repeats.
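The sketch below (with a made-up four-word vocabulary and made-up scores) shows how raw output scores become probabilities via softmax, and how a token is then chosen greedily or sampled:

import torch

vocab = ["world", "there", "friend", "everyone"]       # toy vocabulary
logits = torch.tensor([2.8, 1.2, 0.4, -0.5])           # made-up raw scores

probs = torch.softmax(logits, dim=-1)                   # distribution over the vocab
greedy = vocab[int(torch.argmax(probs))]                # pick the most likely token
sampled = vocab[int(torch.multinomial(probs, 1))]       # or sample proportionally

print({w: round(float(p), 2) for w, p in zip(vocab, probs)})
print("greedy:", greedy, "| sampled:", sampled)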

Training Objective: Next Token Prediction

LLMs are trained with a causal language modeling objective:

Given a context [w₁, w₂, w₃, …, wₙ], predict wₙ₊₁.

The model computes a probability distribution for each possible next token.

The training loss (typically cross-entropy) measures how far the predicted distribution is from the actual next token.

This is done billions (or even trillions) of times across massive datasets, gradually shaping the model’s ability to predict text patterns.
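In code, this objective is just cross-entropy between the model's predicted distribution at each position and the token that actually came next. A minimal sketch with made-up shapes:

import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(1, 4, vocab_size)     # model outputs for 4 positions (toy numbers)
targets = torch.tensor([[3, 7, 1, 4]])     # the tokens that actually followed

# Score every position independently: lower loss means more probability
# was assigned to the true next token.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss)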

Scale: Parameters and Compute

Modern LLMs are defined by their scale:

Parameters : GPT-3 had 175B. OpenAI has not disclosed GPT-4's size, but leaked reports estimate roughly 1.8 trillion parameters, arranged as a mixture of eight internal models of about 220 billion parameters each.

Training Data : trillions of tokens, filtered from web-scale crawls.

Compute : training often requires thousands of GPUs for weeks or months. GPT-4's pre-training reportedly took roughly 90 to 100 days on thousands of GPUs (around 8,000 NVIDIA H100s, or their equivalent in older A100s), though the exact figures vary across reports and hardware.

The combination of parameters, data, and compute is what gives rise to emergent abilities like reasoning, coding, and multi-step planning.
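To get a feel for the numbers, a common back-of-the-envelope heuristic from the scaling-law literature is that training compute is roughly 6 × parameters × tokens. Applied to publicly reported GPT-3-scale figures (175B parameters, ~300B training tokens):

# Rough estimate only; real training runs differ in many details.
params = 175e9
tokens = 300e9
flops = 6 * params * tokens
print(f"~{flops:.2e} training FLOPs")   # on the order of 3 x 10^23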

Why Next-Token Prediction Works

At first glance, predicting the “next word” may seem trivial. But at scale, it forces the model to:

Capture semantics (knowing that “king” relates to “queen”).

Model context (understanding that “bank” in a financial article differs from “river bank”).

Learn long-range dependencies (keeping track of entities across paragraphs).

What emerges is not just autocomplete, but a statistical engine capable of mimicking understanding.
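One toy way to see "capturing semantics": related words end up with embedding vectors that point in similar directions. The four-dimensional vectors below are invented for illustration; real models learn hundreds of dimensions from data.

import torch
import torch.nn.functional as F

king  = torch.tensor([0.8, 0.6, 0.1, 0.3])    # made-up embedding
queen = torch.tensor([0.7, 0.7, 0.2, 0.3])    # made-up embedding
river = torch.tensor([-0.5, 0.1, 0.9, -0.4])  # made-up embedding

print(F.cosine_similarity(king, queen, dim=0))  # high similarity: related meanings
print(F.cosine_similarity(king, river, dim=0))  # low similarity: unrelated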

Key Takeaways

Token IDs are the fundamental input to LLMs.

Training is about predicting the next token, billions of times.

Transformers enable context modeling through self-attention.

Scale in parameters, data, and compute is the secret behind modern performance.

Closing Thoughts

Large language models don’t start out “knowing” language. They learn it step by step, guessing the next token over and over again and refining billions of parameters until patterns of meaning emerge.
