Project 04

Working With AI

Not just using it as a tool — actually understanding what's happening inside. That means running models locally, going deep on how they work, and building real things with them every day.

Claude Code · LLMs · Local AI · VS Code · Python
Claude Code session in the terminal

My main coding setup is VS Code with Claude Code running in the terminal. I use it daily, for everything from small edits to building full projects from scratch. Codex and Google's Antigravity have been part of the workflow too, depending on what I'm working on.

What I find more interesting than just prompting is how much you can shape the way these tools work. I've spent time building and using skills in Claude Code, which are essentially reusable instruction sets you invoke to handle specific types of tasks. There's a skill for UI/UX design decisions, one for writing in a human voice, one for working with cost and token efficiency. I've written custom ones too. It changes the tool from something generic into something that fits how I actually work.

VS Code showing Claude Code making edits across multiple files

Claude Code making edits across multiple files in a single session

List of Claude Code skills

Some of the skills I use and have built

I've downloaded and run models locally using LM Studio and Ollama. Both let you pull open-source models and run them on your own hardware — no API, no subscription. LM Studio has a cleaner interface for experimenting; Ollama is better for running models headlessly or integrating them into other tools.

For the interface side I've used Open WebUI, which gives you a ChatGPT-style frontend that connects to whichever local model you have running. It makes it easy to compare models, test prompts, and get a feel for how different sizes and architectures behave side by side.

Running things locally is how you actually learn what the numbers mean. When a 7B model runs fine but a 13B chokes, that's not abstract anymore — you understand why in terms of VRAM, RAM, and what's getting offloaded where.
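The back-of-the-envelope math is simple. A sketch, assuming weight memory alone dominates (the bytes-per-weight figures are the usual approximations, and real runtimes add KV cache and overhead on top):

```python
def model_size_gb(params_billions: float, bytes_per_weight: float) -> float:
    """Rough weight-only memory footprint: (N * 1e9 weights * bytes) / 1e9 bytes per GB."""
    return params_billions * bytes_per_weight

FP16 = 2.0  # bytes per weight at 16-bit precision
Q4 = 0.5    # bytes per weight at 4-bit quantization

size_7b_q4 = model_size_gb(7, Q4)      # ~3.5 GB: fits comfortably in 8 GB of VRAM
size_13b_q4 = model_size_gb(13, Q4)    # ~6.5 GB: borderline once the KV cache is added
size_13b_fp16 = model_size_gb(13, FP16)  # ~26 GB: hopeless on consumer cards
```

That's the whole story of "7B runs, 13B chokes" on an 8 GB card: the quantized 13B weights alone nearly fill VRAM before the cache and buffers are counted.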

Using these tools practically pushed me to understand what's actually happening under the hood. Most of this came from running into real limits and having to figure out why.

Context and tokens
Tokens are the unit everything is measured in — not words, not characters, but chunks the model's tokenizer breaks text into. Context window is how many tokens the model can hold in "working memory" at once. Run over it and the model starts forgetting the beginning of the conversation.
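That "forgetting" is usually a frontend trimming old messages to fit the budget. A minimal sketch, assuming the common rough estimate of ~4 characters per token (real tokenizers differ by model):

```python
def rough_token_count(text: str) -> int:
    # crude heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def trim_to_context(messages: list[str], context_window: int) -> list[str]:
    """Keep the newest messages that fit in the window; drop the oldest.
    This is why long chats lose their beginning."""
    kept: list[str] = []
    budget = context_window
    for msg in reversed(messages):  # walk newest-first
        cost = rough_token_count(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))
```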
Parameters and datasets
Parameters are the weights of the model — the numbers that get tuned during training. More parameters generally means more capacity, but the dataset is just as important: what it was trained on, how it was cleaned, and what it was fine-tuned for afterward all shape what the model actually knows and how it responds.
Quantization
Full-precision models store each weight as a 32-bit or 16-bit float, which gets expensive fast. Quantization compresses those weights to lower bit depths — Q4, Q5, Q8 — so the model fits on consumer hardware. You lose a small amount of accuracy but gain a huge amount of practicality. Most local models you'd actually run are quantized.
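A toy illustration of the idea — simple symmetric round-to-nearest onto a 4-bit integer grid, not the block-wise schemes llama.cpp actually uses:

```python
def quantize_symmetric(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats onto a signed integer grid; return the ints plus the
    scale factor needed to reconstruct them."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.12, -0.51, 0.33, 0.02]
q, s = quantize_symmetric(w, bits=4)
restored = dequantize(q, s)
# restored is close to w, but each weight now costs 4 bits instead of 32
```

The rounding error per weight is bounded by the scale factor — that's the "small amount of accuracy" you trade for fitting the model in memory.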
GPU, CPU, and RAM offload
When a model is too large to fit entirely in VRAM, layers get offloaded — some stay on the GPU for fast inference, the rest spill onto CPU or system RAM. The more layers on the GPU the faster it runs. Tuning the offload split is often the difference between a model that runs usably and one that crawls.
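The split itself is just a budget calculation. A sketch, assuming uniform per-layer size (real models only approximate this) and some VRAM held back for the KV cache:

```python
def split_layers(n_layers: int, model_gb: float, vram_gb: float,
                 reserve_gb: float = 1.0) -> tuple[int, int]:
    """How many layers fit on the GPU vs spill to CPU/system RAM.
    Reserves some VRAM for the KV cache and scratch buffers."""
    per_layer = model_gb / n_layers
    usable = max(0.0, vram_gb - reserve_gb)
    on_gpu = min(n_layers, int(usable / per_layer))
    return on_gpu, n_layers - on_gpu

# e.g. a ~7.3 GB quantized model with 40 layers on an 8 GB card
gpu_layers, cpu_layers = split_layers(40, 7.3, 8.0)
```

Tools like Ollama and LM Studio do a version of this automatically, but knowing the arithmetic is what lets you tune it when the defaults crawl.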
Hallucinations
Models generate the most statistically likely next token given the context — they don't look things up. When there's no good signal in the training data, they'll produce something that sounds right but isn't. Understanding this changes how you use them: you verify, you don't just trust.
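What "most statistically likely next token" means, reduced to a toy vocabulary — the scores here are made up for illustration, not from any real model:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Turn raw model scores into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical scores for what follows "The capital of France is"
vocab = ["Paris", "Lyon", "purple", "Berlin"]
logits = [5.1, 2.3, -1.0, 1.8]
probs = softmax(logits)
best = vocab[probs.index(max(probs))]  # greedy decoding picks the top token
# nothing here is a lookup: the model only ranks continuations, and when the
# training signal is weak, the top-ranked token can be confidently wrong
```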
Embeddings
Embeddings are numerical representations of meaning. Words, sentences, or documents get mapped to vectors in high-dimensional space, and things that mean similar things end up close together. They're what powers search, recommendations, and retrieval-augmented generation — the ability to find relevant context before the model ever sees your question.
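The retrieval step boils down to comparing vector directions. A sketch with made-up 3-dimensional "embeddings" (real ones have hundreds or thousands of dimensions, produced by an embedding model):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity of two embedding vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# toy document embeddings: similar meanings point in similar directions
docs = {
    "cat sat on the mat":   [0.9, 0.1, 0.0],
    "kitten on a rug":      [0.8, 0.2, 0.1],
    "quarterly tax filing": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
# the RAG retrieval step: "best" is the document handed to the model as context
```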

I'm working toward a proper AI workstation: a machine with enough VRAM and RAM to run larger models without heavy offloading, and to experiment with fine-tuning and inference on models that aren't practical on a laptop. The goal is to keep narrowing the gap between using AI and actually understanding it.