“Any sufficiently advanced technology is indistinguishable from magic”
25 January 2025 • Nick Wolf

Artificial Intelligence is the second greatest pattern-finding machine.
The first one sits in your head.
Our brain has been grinding the game of "find the pattern or become lunch" for millions of years. It's like that one friend who's been playing Candy Crush since it came out and is now embarrassingly good at matching patterns. But instead of matching candy, our brain was getting better at matching the patterns that kept us alive.
When that weird sound in the dark turns out to be a tiger? Your brain is speedrunning the "match threatening sound to dangerous predator" game in milliseconds (high score = staying alive)
When you notice three different Twitter threads about parents struggling with AI homework detection, the rising prices of Chegg subscriptions, and teachers complaining about ChatGPT all in one morning? That's you speedrunning the "emerging market gap pattern matching" game in milliseconds (high score = launching your AI-learning tools platform before the education tech giants catch up)
After billions of play sessions, we ended up with a thing that casually processes thousands of patterns every second like it's no big deal. And it became incredible at it: connecting faces to voices, sounds to meanings, and actions to reactions.
But why are patterns such a big deal?
If we ask the second greatest pattern-finding machine how much of our life is about finding patterns, it outputs fascinating data:
If we assume that human cognition inherently involves pattern recognition during most waking hours, we can estimate its prevalence. Here's a plausible breakdown:
- Waking Hours (~16 hours):
- Conscious Activities (Work, Study, Conversations, etc.): These heavily involve recognizing patterns in language, behavior, and data.
- Unconscious Pattern Recognition: Even seemingly passive activities, like watching TV or scrolling on social media, rely on detecting patterns to make sense of content.
- Estimated: ~80-90% of waking hours could involve some form of pattern recognition, directly or indirectly.
- Sleeping Hours (~8 hours):
- While we aren't consciously recognizing patterns during sleep, the brain processes and consolidates information. This includes finding patterns from the day's experiences during REM sleep and dreaming.
- Estimated: ~20-40% of sleep-related brain activity may involve pattern consolidation.
Approximate Total: If 80-90% of waking hours (~12-14 hours) and 20-40% of sleep hours (~2-3 hours) are about recognizing or processing patterns, then approximately 14-17 hours out of a 24-hour day (about 60-70%) could be attributed to pattern-related activities.
Given this reasoning, and having just seen how much of our daily life is structured around pattern recognition, we can imagine how massively AI already impacts our lives (and will keep impacting them) by taking over pattern-finding tasks our brains used to handle.
To understand this massive shift and leverage it (if you haven't already), we need to understand what exactly causes it – Large Language Models (LLMs).
Each time you type something into ChatGPT, an LLM is the thing that takes your text, "understands" it, and answers you back. Currently, it's the core component of modern AI systems.
It's not magic, it's not hard, it's not unpredictable, and it can't "think". It becomes pretty straightforward if we furiously crush each word abstraction (e.g. "reasoning", "intelligence", "thinking") down to primitive concepts. Which is exactly what we will do.
LLM – large language model.
Model – a computer program designed to process input data and produce output data by mimicking a relationship pattern it has learned.
Language – works with human text.
Large – model that contains billions/trillions of parameters.
Computer program – a specific sequence of instructions (basic mathematical operations) that tells the computer's processor what calculations to perform.
Input data – anything converted into numbers that the program can work with.
The process of converting input data into numbers:
Breaking Text Into Tokens
A token is a piece of text - it could be a full word, part of a word, or even a single character. Example: "playing" might be split into "play" and "ing".
Converting Tokens to Number Arrays (Vectors)
Each token gets turned into a list of numbers (usually hundreds or thousands of numbers). These lists of numbers are called "vectors" or "embeddings".
The word "cat" might become: [0.2, -0.5, 0.8, 0.1, ...] (hundreds more numbers)
The word "dog" might become: [0.3, -0.4, 0.7, 0.2, ...] (hundreds more numbers)
Each number in this list represents one tiny aspect of what the word means or how it's used.
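A minimal sketch of that lookup, with a made-up three-token vocabulary and four-number vectors (real models use tens of thousands of tokens and vectors hundreds or thousands of numbers long):

```python
# Toy example: every token maps to a list of numbers (its "embedding").
# The vocabulary and the vectors here are invented for illustration.
vocab = {"play": 0, "ing": 1, "cat": 2}

embeddings = [
    [0.4, -0.1, 0.9, 0.3],   # vector for "play"
    [0.1,  0.7, -0.2, 0.5],  # vector for "ing"
    [0.2, -0.5, 0.8, 0.1],   # vector for "cat"
]

def text_to_vectors(tokens):
    """Turn a list of tokens into their lists of numbers."""
    return [embeddings[vocab[t]] for t in tokens]

print(text_to_vectors(["play", "ing"]))
# [[0.4, -0.1, 0.9, 0.3], [0.1, 0.7, -0.2, 0.5]]
```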
"Why use number lists instead of single numbers?"
Single numbers can only show one thing (like bigger/smaller). Lists of numbers can capture many things at once.
"What about images and sounds?"
Images? Numbers. (RGB values like: Red = 255, Green = 0, Blue = 0)
Sound? Numbers. (Wave amplitudes like: 0.1, 0.2, -0.1, etc.)
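The same idea in code. The values below are invented, but to the program an image or a sound clip really is nothing more than these lists of numbers:

```python
# A 2x2 "image": each pixel is three numbers (Red, Green, Blue, 0-255).
image = [
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
]

# A tiny slice of "sound": the amplitude of the wave at each moment in time.
sound = [0.1, 0.2, -0.1, -0.3, 0.0, 0.25]

print(image[0][0], sound[:3])  # [255, 0, 0] [0.1, 0.2, -0.1]
```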
Output data – numbers produced by the program's calculations, which get converted back into a form humans can understand: text, image, sound.
Mimicking – producing similar outputs to what was seen in training data when given similar inputs.
Training data – large amounts of raw text (books, websites, articles, code, etc.) that we show to the computer program to teach it patterns.
Example of mimicking:
If the training data contains millions of sentences saying 2x2=5, the program will learn that 2x2=5 and mimic it when you input 2x2.
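A crude illustration of mimicking. This is just a frequency counter, nowhere near a real LLM, but it shows the core behavior: it reproduces whatever its training data said, right or wrong:

```python
from collections import Counter, defaultdict

# Fake training data that repeats a wrong "fact" (shortened here).
training_sentences = ["2x2 = 5", "2x2 = 5", "2x2 = 5", "2+2 = 4"]

# Count what usually follows each prompt in the data.
continuations = defaultdict(Counter)
for sentence in training_sentences:
    prompt, answer = sentence.split(" = ")
    continuations[prompt][answer] += 1

def mimic(prompt):
    """Output whatever most often followed this prompt in the training data."""
    return continuations[prompt].most_common(1)[0][0]

print(mimic("2x2"))  # "5" -- it mimics the data, it doesn't do math
```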
Relationship pattern – how pieces of text relate to each other.
This "how" is stored in the model's parameters (we'll get to those shortly).
"How we teach it patterns?" or "How it learns?"
Let's decompose the learning process:
1. Neural Network Architecture (what actually processes patterns):
Each transformer layer contains a self-attention block and a feed-forward network. The feed-forward part multiplies the input by W1, adds b1, zeroes out anything negative, multiplies by W2, and adds b2: max(0, xW1 + b1)W2 + b2 (sketched in code below).
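That formula in runnable form, with made-up tiny sizes and random matrices (real layers use vectors of length 1024 and up):

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_hidden = 4, 8                      # toy sizes; real models use 1024+ and 4096+
W1, b1 = rng.normal(size=(d_in, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_in)

def feed_forward(x):
    """max(0, xW1 + b1)W2 + b2 -- the feed-forward block of a transformer layer."""
    hidden = np.maximum(0, x @ W1 + b1)    # multiply by W1, add b1, zero out negatives
    return hidden @ W2 + b2                # multiply by W2, add b2

x = rng.normal(size=d_in)                  # one token's vector coming into the layer
print(feed_forward(x))                     # same shape out as in: another length-4 vector
```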
2. What's actually learning:
The token embeddings, the position embeddings, the attention matrices, the feed-forward matrices and biases: all of these are just numbers that get adjusted.
3. How adjusting happens:
Forward pass: use the current numbers to make a prediction. Backward pass (the actual learning): measure how wrong the prediction was and nudge every number in the direction that shrinks the error.
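A minimal sketch of that adjustment loop, with a single made-up parameter and one training example. Real training applies the same nudge to billions of parameters at once:

```python
# We want the model's single parameter w to map input 2.0 to target 10.0.
w = 0.5                                   # starts as an arbitrary number
x, target = 2.0, 10.0
learning_rate = 0.05

for step in range(50):
    prediction = w * x                    # forward pass: use the current number
    error = prediction - target          # how wrong were we?
    gradient = 2 * error * x             # which direction makes the error smaller?
    w -= learning_rate * gradient        # backward pass: nudge the number that way

print(round(w, 3))  # ~5.0, so w * 2.0 is now ~10.0
```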
4. Multiple things learning simultaneously:
5. Why this works:
Numbers that help correct predictions get strengthened; numbers that lead to wrong predictions get weakened.
6. Why it needs to be large:
Learned – the relationships were adjusted through training to better match the patterns in the training data.
Parameters – all the numbers in the model that can be adjusted during learning.
Let's break them down by where they live (the code sketch after this breakdown adds them all up):
In token embeddings:
- Each token (word piece) has its own vector
- Vector length = embedding dimension (e.g. 1024)
- 50,000 tokens × 1024 numbers = 51.2M parameters
- These learn to represent word meanings
In position embeddings:
- Each position in the input needs its own vector
- Same size as token embeddings
- 2048 positions × 1024 numbers = 2.1M parameters
- These learn position meanings
In each transformer layer:
Self-attention has:
- Query, key, value, and output matrices (with 1024-long vectors, each holds about a million numbers)
Feed-forward has:
- The W1, b1, W2, b2 from the formula above: two large matrices plus two small bias vectors
One transformer layer total:
- Roughly 12-13M parameters at this size (assuming the usual 4×-wider feed-forward block)
Multiple layers:
- Stack a few dozen of these layers and you're past 300M parameters, before you even reach the billions of a modern LLM
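Adding it all up in code. The vocabulary size, position count, and embedding dimension come from the breakdown above; the 24 layers and the 4×-wider feed-forward block are assumed common defaults, not the numbers of any specific model:

```python
# Parameter counting for a hypothetical small LLM (ignoring small bias vectors).
vocab_size = 50_000        # number of distinct tokens
d_model = 1024             # embedding dimension
max_positions = 2048       # longest input the model can see
n_layers = 24              # assumed layer count
d_ff = 4 * d_model         # assumed feed-forward width (common default)

token_embeddings = vocab_size * d_model            # 51.2M
position_embeddings = max_positions * d_model      # ~2.1M

attention = 4 * d_model * d_model                  # query, key, value, output matrices
feed_forward = 2 * d_model * d_ff                  # the W1 and W2 matrices
per_layer = attention + feed_forward               # ~12.6M

total = token_embeddings + position_embeddings + n_layers * per_layer
print(f"{total / 1e6:.1f}M parameters")            # ~355M -- and this is still a *small* model
```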
Why so many?
Each parameter helps learn one tiny piece of a pattern.
More parameters = more capacity to store patterns.
But there's a catch: more parameters also mean far more data, memory, and computing power needed for training.
That's why only big companies with lots of resources can train large models from scratch.
Answer Knowledge Questions
Input: "What is the capital of France?"
Output: "Paris"
Why: strong relationship pattern exists between "capital France" and "Paris" because this pattern appears frequently in training data.
Complete Common Patterns
Input: "To make an omelet, you need to break some"
Output: "eggs"
Why: very strong relationship pattern exists between "break" and "eggs", plus context connections from cooking-related words.
Writing Style Adaptation
Input: "Explain quantum physics like I'm 5"
Output: [Simple explanation using basic words]
Why: a structural relationship pattern exists after "like I'm 5" – in that context, simpler words get higher scores.
Basic Math
Input: "What is 123,456 × 789,012?"
Output: [Often wrong]
Why: no direct relationship pattern for specific large numbers. Has to try combining patterns about multiplication, which often leads to errors.
How AI companies are solving it: they add more computer programs on top of the LLM that can actually run code and compute the numbers. The LLM itself doesn't do the math.
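A toy sketch of that approach: a wrapper that spots multiplication in the prompt and hands it to real code instead of the model. `ask_llm` here is a hypothetical stand-in for whatever model API is being used:

```python
import re

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call -- returns a plausible guess."""
    return "Probably around 97,000,000,000."   # plausible-looking, often wrong

def answer(prompt: str) -> str:
    """Route exact arithmetic to real code; everything else goes to the model."""
    match = re.search(r"([\d,]+)\s*[x×*]\s*([\d,]+)", prompt)
    if match:
        a, b = (int(n.replace(",", "")) for n in match.groups())
        return f"{a * b:,}"                    # computed by the calculator, not the LLM
    return ask_llm(prompt)

print(answer("What is 123,456 × 789,012?"))    # 97,408,265,472
```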
Current Events
Input: "Who won yesterday's game?"
Output: [Makes up answer]
Why: can only use relationship patterns from training data - can't access new information.
How AI companies are solving it: they add more computer programs on top of the LLM that can access the internet.
Logic Consistency
Input: "John is taller than Mary. Mary is taller than Pete. Who is shortest?"
Output: [Sometimes gets confused]
Why: it follows text connections rather than understanding logical relationships, and it may have conflicting connections that lead to inconsistent answers.
How AI companies are solving it: they train the LLM to break logical problems down step by step, using a combination of extra training and other programs.
Breaking Down Steps
Instead of trying to solve everything at once, companies train LLMs to split problems into smaller pieces, like this:
Chain of Thought
Add special instructions in training that teach LLMs to "show their work":
Input: "John is taller than Mary. Mary is taller than Pete. Who is shortest?"
Step 1: Let's list what we know – John > Mary, Mary > Pete
Step 2: Combine the facts – John > Mary > Pete
Step 3: Answer – Pete is the shortest
Multiple Attempts
Companies program LLMs to solve the same problem several different ways and compare the answers, like having multiple students check each other's work:
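A sketch of that idea, again using a hypothetical `ask_llm` stand-in that is right most of the time but not always, plus a simple majority vote:

```python
from collections import Counter
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; imagine it's right ~80% of the time."""
    return random.choices(["Pete", "Mary"], weights=[0.8, 0.2])[0]

def ask_with_voting(prompt: str, attempts: int = 5) -> str:
    """Ask the same question several times and keep the most common answer."""
    answers = [ask_llm(prompt) for _ in range(attempts)]
    return Counter(answers).most_common(1)[0][0]

print(ask_with_voting("John is taller than Mary. Mary is taller than Pete. Who is shortest?"))
# Usually "Pete": individual attempts can wobble, but the vote is more stable.
```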
Fact Hallucination
Input: "What did Einstein eat for breakfast?"
Output: [Makes up detailed but false answer]
Why: when no strong direct connections exist, it combines weaker connections about Einstein, breakfast, and typical foods, creating plausible but false information.
How AI companies are solving it:
Knowledge Checking (RAG - Retrieval Augmented Generation)
Think of this like having a fact-checker standing next to the LLM. When you ask "What did Einstein eat for breakfast?", the LLM first checks a trusted database. If no reliable information exists, it says "I don't have verified information about Einstein's breakfast habits" instead of making things up.
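A stripped-down sketch of the RAG idea. The "trusted database" here is just a Python dictionary and the matching is naive keyword overlap, but the shape is the same: look the fact up first, and refuse rather than invent:

```python
trusted_facts = {
    "capital of france": "Paris is the capital of France.",
    "speed of light": "Light travels at about 299,792 km per second.",
}

def retrieve(question: str):
    """Return a stored fact whose keywords all appear in the question, if any."""
    q = question.lower()
    for keywords, fact in trusted_facts.items():
        if all(word in q for word in keywords.split()):
            return fact
    return None

def answer(question: str) -> str:
    fact = retrieve(question)
    if fact is None:
        return "I don't have verified information about that."
    return fact   # in a real system, the retrieved fact is handed to the LLM as context

print(answer("What is the capital of France?"))        # Paris is the capital of France.
print(answer("What did Einstein eat for breakfast?"))  # I don't have verified information...
```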
Self-Checking Questions
Companies train LLMs to question their own answers before presenting them.
Confidence Levels
Companies train LLMs to rate their confidence in each part of their answers.
The LLM then adjusts its response based on these confidence levels, being more direct with verified information and more cautious with uncertain details.
Context Forgetfulness
Input: "What's her favorite color?"
Output: [Mentions different color than stated earlier]
Why: limited context window means older connections may be lost or overridden by more recent ones.
How AI companies are solving it: nothing special. They buy more computational power with billions of dollars and keep optimizing their algorithms, but so far more power seems to be the better approach.
Understanding how LLMs work changes how you use them:
It's a Pattern Copier, Not a Thinker
More Context = Better Patterns
Garbage In = Garbage Out
It's About Probabilities, Not Facts
It's not magic. It's not intelligent. It's not creative.
It's a sophisticated pattern-matching calculator that finds relationships in text, scores the most likely continuation, and mimics the patterns it was trained on.
"Any sufficiently advanced technology is indistinguishable from magic."