No PhD. No white papers. Just an IT guy with a lot of experience breaking systems and a high tolerance for things not working the first time.
After spending way too many hours trying to understand AI (we're talking MMO-level commitment here), this is what I've got. It's less about "intelligence" and more about really sophisticated plumbing.
Part 1: The Engine
This is the LLM, the Large Language Model. GPT-4, Claude, Gemini, LLaMA. They're all variations of the same core idea. And the first thing to understand is:
It's glorified autocomplete.
No, really. Your phone's keyboard predicts the next word when you type "I'm on my." The LLM does the same thing, just trained on a massive chunk of the internet instead of your text history. It reads what you gave it and predicts, word by word, what's most likely to come next.
It doesn't "know" that 2+2=4 because it understands math. It knows it because in the billions of pages it trained on, 4 is the most common thing to follow 2+2=. When it writes an email for you, it's not composing. It's reproducing patterns it's seen thousands of times. When those patterns are common, it nails it. When they're not, it makes things up and doesn't tell you.
That's hallucination. It's not a bug (it's a feature™). It's what happens when the autocomplete runs out of strong patterns and fills the gap with the next best guess. Confidently.
I'm not going to bore you with the math behind Transformers, attention mechanisms, or weight matrices. If you're curious, the 2017 paper that started all of this is worth a look. But for this guide, the key points are:
- It's a pattern engine. It predicts, it doesn't think.
- It's static. As of writing this, LLMs are still frozen snapshots of the knowledge available when they were trained. They don't learn from you. They don't update on the fly.
- The names change, the math stays the same. GPT, Claude, Gemini, LLaMA. Different brands, same core concept under the hood.
Part 2: The Harness
So if they're all based on the same math, why does Claude feel different from GPT? Why is one better at code and another better at writing?
The Harness.
The Harness is everything the provider does to the raw model before you ever see it. Anthropic shapes Claude. OpenAI shapes ChatGPT. Google shapes Gemini. Same base concept, different training programs.
This includes the system prompt ("You are a helpful assistant"), safety training, personality tuning, and behavioral guardrails. It's why Claude sounds different from ChatGPT even when you ask them the same question. They went through different onboarding before they ever sat down to do the work.
You don't see this layer. You don't control it. It happened before you showed up. But it's the reason a raw model and a finished product feel like completely different things.
Part 3: The Desk
OK so we have the engine and the harness. Now where does the actual work happen?
The Desk. This is the Gemini app on your phone. The ChatGPT interface. Copilot in your IDE. Claude in your browser. It's the workspace that sits between you and the AI, and it's where most of the "magic" actually lives.
Because here's the thing: the LLM by itself is just a brain in a jar. It can't search the web. It can't read your files. It can't remember what you said 20 minutes ago. It doesn't check its own work. An LLM with no desk is just a text predictor with no tools.
The Desk gives it what it needs:
| Tool on the Desk | What it does |
|---|---|
| Chat history | Feeds the whole conversation back in every time, so the LLM appears to remember |
| Web search | Lets the LLM look things up instead of guessing from training data |
| File access | Gives the LLM your actual documents, code, or data to work with |
| Code execution | Lets the LLM run code and see the results |
| Memory / Notes | Saves things across conversations so relevant info can be pulled back in later |
The LLM isn't going to remember to check the task list it created an hour ago. But if the Desk has code that says "hey, check the task list before responding," it will. The LLM follows instructions. The Desk provides them.
Different products put different tools on the desk. ChatGPT has DALL-E and a code interpreter. Cursor has your entire codebase and a terminal. A raw API call has nothing. Empty desk, just a prompt and a response.
What most people call "AI" is really the Desk + the Harness + the Engine, all working together. When things go wrong, the question isn't "why is the AI stupid?" It's "which layer broke?" (Well, sometimes the AI really is just that stupid.)
Part 4: The Window
This is the part that trips people up once they start actually using AI for real work.
Each LLM has a limit to the amount of information it can keep at hand at any given time. This is the context window, measured in tokens (roughly chunks of words). Think of it as the size of the desk. Everything has to fit on it at once: the system prompt, the chat history, your uploaded files, any search results, and the response it's currently writing.
If the information is not in the training data or the current context window, the AI does not know it. Full stop. It doesn't matter how smart the model is. If it can't see it, it can't use it.
The desk fills up
Some models have small desks (8k tokens). Some have large ones (128k, even a million). But every desk has edges. When it fills up, the oldest stuff quietly falls off the back. No error. No warning. The AI just keeps going with whatever's left.
The AI will not tell you the desk is full. It won't say "I lost some of your earlier context." It will just start giving you worse answers and you'll wonder what changed. Nothing changed. The desk filled up.
The middle gets ignored
Even when everything fits, there's a catch. Research from Stanford (Liu et al., 2024) found that LLMs are significantly worse at using information in the middle of a long context. They pay most attention to what's at the top and bottom of the pile. The stuff in the middle gets skimmed.
Think about the last time someone handed you a 30-page document. You read the beginning carefully, skimmed the middle, and read the conclusion. Same thing.
Retrieval doesn't fix overflow
Some tools use RAG (Retrieval-Augmented Generation) to pull relevant info from past chats or documents. Useful, but whatever gets pulled still has to fit on the desk alongside everything else. If the retrieval sends back 50 pages and the desk has room for 10, something's not making it.
Managing the desk is the skill
This is where you actually get good at working with AI. Not fancier prompts. Not picking the trendiest model. Managing what's on the desk. What goes on, what stays, what gets swapped out, and knowing when to clear the desk and start fresh.
The fix: Requirements docs. When the desk gets too cluttered or a conversation runs too long, start a new chat and put your docs on a clean desk. The AI gets fresh context with all the right information, and you're back to full signal instantly. More on this in Plan Before You Build.
The "For Now" Part
I call this "how it works, for now" because this field moves faster than a server room fire. The Transformer architecture is only from 2017. ChatGPT launched in late 2022. The entire landscape you're looking at is younger than most phones.
But until the math changes fundamentally, remember:
The Engine does the work. The Harness shapes it. The Desk equips it. And you direct it.
Everything else is just a really good magic trick.