Local Brains, Zero Witnesses – AI Without The Cloud Cult

[Cover image: curled paper forming a doorway with a mint tangerine glowing inside.]
🪪 Lab

A 6-minute field guide to running LLMs on your laptop.

Cold Open

You don’t need a server farm to think out loud. You need a model that sits on your machine, answers fast, and doesn’t report your midnight prompts to a quarterly earnings call. Local models won’t write your magnum opus, but they’ll draft, summarize, plan, and prototype—quietly. Think “smart pocketknife,” not “AI deity.” That’s the point.

The Button or the Terminal (pick your poison)

LM Studio (the button): A clean GUI. Download, pick a model, click “Run.” It’s the Kindle of local LLMs—boring in a good way. Hooks into your mic and files if you let it, and you can pin a model like a favorite pen.

Ollama (the terminal): Fewer frills, more control. One-liner to pull a model, another to chat. Scriptable, portable, pleasantly unglamorous. When you want to know what’s actually happening, it’s perfect.

Ten-Minute Setup (both paths)

LM Studio path

Install the app, open Models, and grab a small, reputable 7B–8B chat model.

Hit Run. Talk to it. If your fans spin like a drone, lower the context window and batch size.

Optional: enable the local server. Your writing tools can now point at http://localhost:<port> like any other API.
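
If a tool can’t be pointed at it directly, a few lines of Python will do. A minimal sketch, assuming the server is enabled on LM Studio’s usual port 1234 and exposes its OpenAI-style chat endpoint; the model name is a placeholder, use whatever your install lists:

  # Minimal chat request against the LM Studio local server (Python stdlib only).
  # Port 1234 and the model name are assumptions; use whatever the app shows.
  import json, urllib.request

  payload = {
      "model": "local-model",  # placeholder; LM Studio lists the loaded model's name
      "messages": [{"role": "user", "content": "Summarize this in two sentences: ..."}],
      "temperature": 0.7,
  }
  req = urllib.request.Request(
      "http://localhost:1234/v1/chat/completions",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["choices"][0]["message"]["content"])

Swap the port if your app shows a different one; nothing else changes.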

Ollama path

Install Ollama. In a terminal: ollama pull llama3 then ollama run llama3.

If it stutters, try a smaller/quantized variant (the ones ending in Q4/Q5 etc.).

Want it inside other apps? Point them at http://localhost:11434 and pick the same model name.
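
Same idea as a script instead of another app. A minimal sketch against Ollama’s /api/generate endpoint, reusing the llama3 pull from above; streaming is off so one JSON object comes back:

  # One-shot prompt against the local Ollama API (Python stdlib only).
  import json, urllib.request

  payload = {
      "model": "llama3",  # same name you pulled
      "prompt": "Draft a polite two-line reply declining a meeting.",
      "stream": False,    # one JSON object back instead of a token stream
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["response"])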

What actually runs on a laptop (without cooking it)

7B–8B models (general chat, brainstorming, email drafts): smooth on modern CPUs/Apple silicon; good enough 90% of the time.

10B–13B models (denser reasoning, better code): doable with 16–32GB RAM and a decent GPU; quantized builds help a lot.

>30B (ego projects): sure, if you enjoy stutter, heat, and regret. Save it for a desktop with real VRAM—or don’t.

Make it useful, not just novel

Local excels at private summarization (meeting notes, PDFs), quick code scaffolds, idea shaping, first-draft outlines, and scratch math/logic that you don’t want to leak. It’s instant: no rate limits, no “capacity” lies, no usage tax. Keep a small stable: one chat model, one code-tilted model, one RAG tool if you need document QA. That’s it. Don’t collect models like Funko Pops.

Keep your fan alive (and your patience)

If the UI offers quantization, use it; quality loss is minor compared to the speed win. Drop context length unless you truly need a novel in one prompt. Cap concurrent requests to one. On battery? CPU-only. On mains? Let the GPU help, then stop when the room feels like a toaster.
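
If you talk to Ollama over its API, the same throttles can ride along per request. A hedged sketch, assuming the num_ctx and num_gpu request options behave as Ollama documents them (num_gpu is the count of layers offloaded to the GPU, so 0 keeps everything on the CPU):

  # Battery-friendly request: small context window, no GPU offload.
  import json, urllib.request

  payload = {
      "model": "llama3",
      "prompt": "Outline this note in five bullets: ...",
      "stream": False,
      "options": {
          "num_ctx": 2048,  # smaller context = less RAM, less heat
          "num_gpu": 0,     # zero GPU layers, i.e. CPU-only (for battery)
      },
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["response"])

Capping concurrent requests is a server-side knob (the OLLAMA_NUM_PARALLEL environment variable in recent builds, if yours has it); one at a time is plenty.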

Privacy is better, not magical

Local means the model weights and generation happen on your box—great. But if you point it at cloud files, or enable telemetry, or install a “free” extension with sticky fingers, you’re right back on stage under the cookie lights. Treat “localhost” as a boundary: cross it only on purpose.

Pick a brain (no browsing, no forums)

Use any modern 7–8B instruct chat model from the big families. Think function, not fandom:

Generalist, fast reply
“7–8B instruct” from Llama / Mistral. Good baselines for drafts and summaries.

Multilingual/creative tilt
“7–8B instruct” from Qwen. Plays nicer with non-English and edgy phrasing.

Tiny but clever
“~7B instruct” from Phi. Lower RAM, surprising reasoning for its size.

Code-leaning
Any 7–14B “code instruct” variant. If it lags, grab the quantized one and keep context short.

Sanity expectation: on a recent laptop, you’ll see 20–60 tokens/sec at 7–8B with a Q4/Q5 quant. If you aren’t close, you’re pushing context too high or loading too big a model.
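
Don’t guess, measure. A small sketch, assuming Ollama’s non-streaming response carries the eval_count and eval_duration fields current builds report; if yours doesn’t, the wall-clock fallback is crude but honest:

  # Rough tokens/sec check via Ollama's response metadata, with a wall-clock fallback.
  import json, time, urllib.request

  payload = {"model": "llama3", "prompt": "Write 200 words about tea.", "stream": False}
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  start = time.time()
  with urllib.request.urlopen(req) as resp:
      data = json.load(resp)
  elapsed = time.time() - start

  if "eval_count" in data and "eval_duration" in data:
      tps = data["eval_count"] / (data["eval_duration"] / 1e9)  # eval_duration is in ns
  else:
      tps = len(data["response"].split()) / elapsed  # crude word-based estimate
  print(f"~{tps:.1f} tokens/sec")

(On the command line, ollama run llama3 --verbose prints similar timing stats, if your build has the flag.)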

Quantization without the eye-glaze

  • Q4 ≈ fastest, small hit to nuance.
  • Q5 ≈ sweet spot (what most people actually want).
  • Q8 ≈ quality flex, bigger memory, not worth it on battery.

Rule of thumb: if the fans are the loudest part of the session, drop one quant level or shrink context.
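
The arithmetic behind that rule of thumb is short enough to run. A rough sketch: weights take about params × bits-per-weight ÷ 8 bytes, plus overhead for the KV cache and runtime; the bits-per-weight values below are loose averages for GGUF-style quants, not file-exact:

  # Back-of-envelope RAM estimate: params x bits-per-weight / 8, plus overhead.
  # The bits-per-weight values are loose averages for GGUF-style quants, not exact.
  BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q8": 8.5, "F16": 16.0}

  def est_ram_gb(params_billion: float, quant: str, overhead_gb: float = 1.5) -> float:
      weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # billions of params -> GB
      return weights_gb + overhead_gb  # overhead ~ KV cache + runtime, very approximate

  for size, quant in [(7, "Q4"), (8, "Q5"), (13, "Q5"), (34, "Q4")]:
      print(f"{size}B @ {quant}: ~{est_ram_gb(size, quant):.1f} GB")

The output lines up with the tiers above: 7–8B sits comfortably in 16GB, 13B wants headroom, 30B+ wants a desktop.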

Ten-minute RAG (private “search” on your docs)

You don’t need a stack. You need a folder and an endpoint on localhost.

  • Index the folder once (any simple embedder will do; if your tool has a “RAG” toggle, use that).
  • Query = your question + “Answer with numbered citations like [1], [2], from the provided chunks only.”
  • Host = http://localhost:<port> (LM Studio server) or http://localhost:11434 (Ollama).

Receipt mindset: force citations in the prompt and make the model name the source files in its answer. If it can’t point to a chunk, it’s lore.
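
Here is the whole thing as one small script, to show how little a “stack” it really is. A minimal sketch with loudly labeled assumptions: your docs are .txt files in a ./notes folder, an embedding model named nomic-embed-text has been pulled into Ollama, and the /api/embeddings endpoint returns one embedding per call (check your version’s API docs if it doesn’t):

  # Minimal folder RAG against a local Ollama endpoint (Python stdlib only).
  import json, math, pathlib, urllib.request

  HOST = "http://localhost:11434"

  def post(path, payload):
      req = urllib.request.Request(
          HOST + path,
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)

  def embed(text):
      # "nomic-embed-text" is an assumption; any local embedding model works here.
      return post("/api/embeddings", {"model": "nomic-embed-text", "prompt": text})["embedding"]

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

  # 1) Index: one embedding per file (chunk bigger files in real use).
  docs = [(p.name, p.read_text(errors="ignore")[:2000]) for p in pathlib.Path("notes").glob("*.txt")]
  index = [(name, text, embed(text)) for name, text in docs]

  # 2) Retrieve: top 3 chunks by cosine similarity to the question.
  question = "What did we decide about the launch date?"
  q_vec = embed(question)
  top = sorted(index, key=lambda c: cosine(q_vec, c[2]), reverse=True)[:3]

  # 3) Generate: number the chunks and demand citations tied to them.
  context = "\n\n".join(f"[{i + 1}] ({name})\n{text}" for i, (name, text, _) in enumerate(top))
  prompt = (f"Answer from the provided chunks only, with numbered citations like [1], [2].\n\n"
            f"{context}\n\nQuestion: {question}")
  print(post("/api/generate", {"model": "llama3", "prompt": prompt, "stream": False})["response"])

Chunking, caching the index, and smarter ranking are all upgrades, but none of them change the shape: embed, rank, stuff the top chunks into the prompt, demand citations.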

Privacy receipts (prove it’s local)

Air-gap test
Turn Wi-Fi off, ask for a long summary. If it still answers, you’re truly local.

No silent egress
Run one of these while you chat:
macOS: lsof -i -P | grep -E 'ollama|lm-studio'
Windows (PowerShell): Get-NetTCPConnection | ? { $_.State -eq "Established" }
Linux: ss -tup | grep -E 'ollama|lm-studio'

Telemetry toggle
If your GUI has a “send usage data” switch, it should be off by default. If not, it’s not your friend.

Local boundary
Point tools at localhost only. If you mount cloud drives or enable third-party extensions, you left the monastery.
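
If your scripts build the endpoint URL from config, a few lines keep crossings deliberate. A tiny sketch, nothing Ollama- or LM Studio-specific about it:

  # Tiny guard: refuse to send a prompt anywhere that isn't the local machine.
  from urllib.parse import urlparse

  LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}

  def assert_local(endpoint: str) -> str:
      host = urlparse(endpoint).hostname
      if host not in LOCAL_HOSTS:
          raise ValueError(f"Refusing non-local endpoint: {endpoint}")
      return endpoint

  assert_local("http://localhost:11434/api/generate")  # passes
  # assert_local("https://api.example.com/v1/chat")    # raises, on purpose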

“It’s slow” triage (30 seconds)

Too hot
Lower context window, switch Q5→Q4, cap batch size to 16–32 tokens.

Too dumb
Move Q4→Q5, add a two-line system prompt with your style and verboten outputs, or step up from 7B→8–13B (plugged in).

Too forgetful
Shorten prompts; pin your working facts in the system prompt; for docs, use RAG instead of hoping the window holds.
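
The “too dumb” and “too forgetful” fixes fit in one request. A minimal sketch, assuming Ollama’s /api/chat endpoint (a list of message objects in, one message back) and the num_ctx option; the system string is yours to tune and keep next to the draft:

  # Pinned system prompt plus a short context window, in one request.
  import json, urllib.request

  payload = {
      "model": "llama3",
      "messages": [
          {"role": "system", "content": "Write in plain, dry prose. No bullet points, "
                                        "no exclamation marks. Keep answers under 150 words."},
          {"role": "user", "content": "Rewrite this paragraph in that voice: ..."},
      ],
      "stream": False,
      "options": {"num_ctx": 4096},  # enough for working facts, not a whole novel
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/chat",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.load(resp)["message"]["content"])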

Use it like a grown-up

One chat model for words. One code-tilted model for scaffolds. One simple RAG for your PDFs. That’s the whole stable. Everything else is Funko Pop collecting with worse resale value.

What not to do (and why)

Don’t benchmark vibes
Measure tokens/sec and latency, not “it feels smart.”

Don’t “set and forget” prompts
Keep a living system prompt per project (voice, taboos, outline format). Save it next to the draft.

Don’t trust “on-device” marketing
On-device ≠ correct on device. Receipts or it didn’t happen.

Where local loses (and that’s fine)

You won’t beat gigantic cloud models on encyclopedic recall or long, multi-step retrieval across the open web. That’s not failure—that’s scope. Keep local for thinking, drafting, and document QA. When you need heavyweight facts with citations, step outside—briefly, consciously, and bring the result back home.


Proof: ledger commit bb22c41
Updated Sep 22, 2025
Truth status: evolving. We patch posts when reality patches itself.