🏠 Getting Started with Local LLMs [01] Learning the Basics
Local LLM Setup Series
- [01] Learning the Basics
- [02] Choosing Models
- [03] Installing Ollama and Checking It Works
- [04] Installing Open WebUI and Using It Across the LAN
👀 Table of Contents
- 🧪 Trying Out Local LLMs
- 🖥️ Local AI Platforms
- 💾 Models
- 🔍 Reading Model Names
- 👉 What Is Q4_K_M?
- 📝 Quick Glossary
- ✍️ Notes
🧪 Trying Out Local LLMs
There are two main reasons I decided to try running local LLMs on the PochomLab machine.
1. Understanding a Local LLM Environment
First, I wanted to get a clear sense of how local LLMs actually run on the PochomLab machine.
In the future, I would also like to try a setup where a Raspberry Pi is used as a small terminal for input and display,
while the actual inference runs on the PochomLab machine.
As a first step toward that, I am starting by checking the basics of local LLMs on the main PC.
- What size of models can actually run
- How much speed I can expect
- What roles Ollama and LM Studio play
- Whether this can become a foundation for linking with other devices later on
2. Support for Structuring Writing for ZINE Production
The other purpose is to get help structuring my writing for ZINE production.
Since the start of 2026, I have been encountering more situations where the output of cloud-based AI services feels inconsistent.
- Weak context retention
- Unstable structure
- Output quality fluctuates depending on load and tuning
- Things that worked yesterday may not work today
- It takes time to remove slop and hallucinations
Of course, these services are still useful in many situations, but the longer the writing or the more structural the task becomes,
the more difficult it feels to keep building on the same assumptions each time.
For that reason, I decided to explore local LLMs as an environment that could help support the framework of writing.
✅ Advantages
- Less affected by connection issues or server congestion
- Can be handled as part of my own local environment
- Easier to choose models depending on the purpose
❌ Disadvantages
- Performance depends on the PC’s hardware
- Model files are large, so storage and memory are also required
- Electricity and operating costs increase
🏃‍♂️ Organizing Terms as I Go
Once I started looking into local LLMs, model names and format names began appearing all at once.
So first, I want to roughly organize the terms that come up often.
🖥️ Local AI Platforms
These are the foundations used to run and manage LLMs (Large Language Models) on your own PC.
Ollama
- A runtime platform that makes it easy to launch local LLMs
- Primarily operated through the command line
- Easy to use as a local API
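To make the "local API" point a little more concrete, here is a minimal Python sketch that talks to Ollama over HTTP. It assumes Ollama is running on its default local address (http://localhost:11434) and that a model has already been downloaded. The helper names `build_generate_request` and `generate` are made up for this illustration.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request to the local server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, a call such as `generate("llama3.1:8b", "Explain GGUF in one sentence.")` should return the reply as a plain string. The model tag here is only an example and depends on what has been pulled locally.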
LM Studio
- A desktop app with an easy-to-use GUI
- Makes it easy to search, download, and switch models
- Can also be used as a local server
💾 Models
LLMs are models trained on large amounts of text in order to understand and generate language.
Just as you choose models like Anything XL or realisticVision in Stable Diffusion,
it seems likely that choosing which pretrained model to use will also become important with local LLMs.
Below is a list of the models that came up as candidates while I was researching local LLMs this time.
| Model Name | Company / Organization | Location |
|---|---|---|
| Llama | Meta | United States |
| Qwen | Alibaba Cloud | China |
| Gemma | Google DeepMind | United Kingdom |
| Phi | Microsoft | United States |
| Mistral | Mistral AI | France |
| LLM-jp | Research and Development Center for Large Language Models (LLMC) | Japan |
| Command R | Cohere | Canada |
🔍 Reading Model Names
In local LLMs, you often see names like these:
- Llama-3.1-8B-Instruct
- Llama-3.1-8B-Instruct-Q4_K_M.gguf
At first they look long and intimidating, but they become easier to understand when broken down into parts.
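As a rough sketch of that breakdown, the following Python snippet splits a name into family, version, size, variant, quantization, and file format. The regular expression is an assumption that only fits names shaped like the examples here. Real model names vary a lot, so it is not a general parser.

```python
import re
from typing import Optional

# Hypothetical pattern for names shaped like
# "<Family>-<version>-<size>B[-<variant>][-<quant>][.gguf]".
NAME_PATTERN = re.compile(
    r"^(?P<family>[A-Za-z]+)"
    r"-(?P<version>\d+(?:\.\d+)*)"
    r"-(?P<size>\d+(?:\.\d+)?)B"
    r"(?:-(?P<variant>[A-Za-z]+))?"
    r"(?:-(?P<quant>Q\d\w*))?"
    r"(?:\.(?P<format>gguf))?$"
)


def parse_model_name(name: str) -> Optional[dict]:
    """Return the named parts of a model name, or None if it does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return match.groupdict()


for example in ["Llama-3.1-8B-Instruct", "Llama-3.1-8B-Instruct-Q4_K_M.gguf"]:
    print(example, "->", parse_model_name(example))
```

Parts that are absent from a name (for instance the quantization tag in the first example) come back as `None`.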
Parameter Count: “B”
The B in 8B stands for billion: an 8B model has about 8 billion parameters,
which roughly indicates the scale of the model.
- 7B: relatively lightweight and easier to run
- 8B: a standard range that is still manageable on a personal PC
- 14B: tends to perform better, but also requires more memory
- 70B: very heavy
Bigger is not automatically better,
but larger models do tend to be more demanding to run.
Quantization: “Q”
Q4 and Q8 refer to quantization levels; the number is roughly how many bits are used to store each weight.
Quantization is a way of compressing a model so that it is lighter and can run more easily.
- Q4: light and easy to handle
- Q5: balanced
- Q6/Q8: more quality-oriented, but heavier
In general, larger numbers tend to preserve quality better,
but they also increase file size and memory usage.
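The relationship between parameter count, quantization, and size can be estimated with simple arithmetic: parameters × bits per weight ÷ 8 gives bytes. The sketch below treats the Q number as exact bits per weight, which is a simplification. Real quantized files also store extra scaling data, and running a model needs additional memory for the context, so the results are only a lower bound.

```python
def estimate_weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Compare an 8B model at a few precisions (simplified bit widths).
for label, bits in [("16-bit", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"8B at {label}: about {estimate_weight_size_gb(8, bits):.1f} GB")
```

Under these assumptions, halving the bits per weight halves the weight size, which is why a Q4 file is much easier to fit on a personal PC than the unquantized model.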
GGUF (GPT-Generated Unified Format)
GGUF is a model format commonly used for local inference.
It is especially widespread in environments based on llama.cpp.
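One practical way to confirm that a downloaded file really is GGUF is to look at its first bytes: a GGUF file begins with the four ASCII bytes `GGUF`, followed by a 32-bit version number. The sketch below reads only those eight bytes. The function name `read_gguf_version` is made up for this example, and it assumes the common little-endian layout.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def read_gguf_version(path: str):
    """Return the GGUF version number, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumption: little-endian layout, the common case.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

This does not validate the rest of the file; it only checks that the header looks like GGUF.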
👉 What Is Q4_K_M?
For example, a notation like Q4_K_M can be understood roughly as:
- based on 4-bit quantization
- using a K-family quantization method
- set up to balance lightness and quality
I think that level of understanding is enough for now.
Even without following the internal details from the beginning,
it is easier to move forward if you think of it first as a lightweight, practical format.
In the local LLM world, Q4_K_M is often seen as a good balance between size and quality.
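The same reading can be written down as a small Python sketch that decodes a label such as Q4_K_M into its parts. Interpreting the trailing S / M / L as small / medium / large variants is my assumption, based on the naming commonly seen in llama.cpp quantization types. The function covers only this simple label shape.

```python
import re

# Assumption: S / M / L are read as small / medium / large variants.
VARIANTS = {"S": "small", "M": "medium", "L": "large"}


def describe_quant(label: str) -> dict:
    """Decode labels shaped like Q4_K_M, Q6_K, or Q8_0."""
    match = re.fullmatch(r"Q(\d+)(?:_(K|\d))?(?:_([SML]))?", label)
    if match is None:
        raise ValueError(f"unrecognized quantization label: {label}")
    bits, family, variant = match.groups()
    return {
        "bits": int(bits),            # nominal bits per weight
        "k_quant": family == "K",     # True for the K-family methods
        "variant": VARIANTS.get(variant),
    }


for label in ["Q4_K_M", "Q6_K", "Q8_0"]:
    print(label, "->", describe_quant(label))
```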
📝 Quick Glossary
RAG
RAG (Retrieval-Augmented Generation) is a mechanism where the model does not rely only on its internal knowledge,
but first retrieves external documents or data and reflects that information in its answer.
For example, it can be useful for things like:
- having it read PDFs on your machine
- referencing your own notes
- combining it with document search
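The retrieve-then-answer flow can be sketched in a few lines of Python. This toy version ranks notes by simple word overlap with the question and pastes the best match into the prompt. Real RAG setups usually use embeddings for retrieval, so the word-overlap scoring here is only a stand-in. All function names and the prompt wording are made up for illustration.

```python
def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question: str, documents: list, top_k: int = 1) -> list:
    """Rank documents by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, documents: list) -> str:
    """Put the retrieved notes in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"Answer using only these notes:\n{context}\n\nQuestion: {question}"


notes = [
    "GGUF is a model format commonly used for local inference.",
    "Ollama is primarily operated through the command line.",
]
print(build_prompt("What is GGUF used for?", notes))
```

The resulting prompt string is what would then be sent to the model, so the answer can draw on the retrieved note rather than only on what the model learned in training.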
Tool
A Tool is a mechanism that allows an LLM to use external functions.
For example:
- doing calculations
- searching
- reading files
- calling APIs
It is used when you want the model to do more than just generate text.
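A minimal sketch of the idea in Python: the model produces a structured request naming a tool and its arguments, and the surrounding program looks up the function and runs it. The JSON shape and the tool names below are assumptions for illustration, not the format of any particular platform.

```python
import json


def add(a: float, b: float) -> float:
    return a + b


def count_words(text: str) -> int:
    return len(text.split())


# Hypothetical registry: tool name -> Python function.
TOOLS = {"add": add, "count_words": count_words}


def run_tool_call(raw_call: str):
    """Run a tool call given as JSON, e.g. as produced by a model."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


print(run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

In a real setup the result would be passed back to the model so it can use it in its final answer.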
Transformer
Transformer is the architecture that forms the foundation of modern LLMs.
It is important as a core theory, but in the early stages,
it is probably enough to understand it simply as the main framework behind current LLMs.
Classify
This refers to processing that categorizes input text.
For example:
- sorting types of inquiries
- sentiment classification
- tagging
- label assignment
Embed
This means converting words or sentences into numeric vectors, so that semantic similarity can be measured by comparing those vectors.
It is often used as a foundation for RAG and search.
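To illustrate how similarity is measured once text has become numbers, here is a small Python sketch using cosine similarity. The three-dimensional vectors are made up by hand. Real embeddings come from an embedding model and typically have hundreds of dimensions or more.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

query = embeddings["cat"]
for word, vector in embeddings.items():
    print(word, round(cosine_similarity(query, vector), 3))
```

With these toy values, "kitten" scores much closer to "cat" than "invoice" does, which is the property that search and RAG rely on.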
✍️ Notes
Since the beginning of 2026, I have increasingly felt fluctuations in the output quality of generative AI services.
On April 15, I noticed a strong in-product prompt in ChatGPT encouraging users to move to a higher-tier model,
and it made me feel that AI companies may have clearly entered a monetization phase.
From here on, rather than simply letting AI provide the answer,
- deciding what to make it do
- judging the output yourself
- preparing an environment suited to your own purpose
will likely become more and more important.
At PochomLab, that has already been the direction from the start,
so this attempt to introduce local LLMs also feels like an extension of that path.
The period when it was possible to experiment freely with generative AI at low cost may gradually be coming to an end.
As part of preparing for that change, I want to start exploring local LLMs at PochomLab as well.
As a first step, I will begin by organizing the basic terms and mechanisms.