I’ve been exploring how small, open-source language models can fit into a local development setup to improve how I work day-to-day. There’s something satisfying about building a lightweight, responsive system that runs entirely on your own machine. This post is a practical guide to using tiny models with just enough tooling to keep everything local and work smarter without adding complexity.
While the spotlight is on state-of-the-art frontier models, I am interested in exploring the capabilities of open-source models that I can run on my MacBook Pro (M2 Pro, 10-core CPU, 16GB RAM). Working with open-source models locally is interesting and exciting for a few reasons:
- Privacy: You own the data
- Control: Run on your hardware
- Learning: Understanding what it takes to run a model and the trade-offs
- Fine-tuning: Given enough resources, small models can be fine-tuned to be on par with, or better than, SOTA models for specific use-cases
- Offline: All of the model’s knowledge made available without an internet connection
- Fun!
I’ve tried using LLMs in different parts of my workflow, from the terminal to an IDE assistant to an AI-assisted pair programmer. This guide will help you explore small language models and learn how to integrate them with your day-to-day workflows.
Getting started with local models
The two simplest methods to get started with local models are using llamafiles and Ollama.
Llamafile
To run a llamafile model, simply download a model tagged with llamafile
from HuggingFace and do the following:
$ chmod +x <model>.llamafile
$ ./<model>.llamafile
This will start a chat and you’re good to go. It can’t get any simpler than that.
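A llamafile bundles llama.cpp together with the model weights, so by default it also serves a local chat UI and an OpenAI-compatible API at localhost:8080, which makes it easy to script against. Here’s a minimal sketch with curl, assuming the default port and endpoint; adjust if your llamafile is configured differently:
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "local", "messages": [{"role": "user", "content": "Why is the sky blue?"}]}'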
Ollama
Ollama offers an organized approach for managing models, similar to what Docker does for containers (in fact, there are a lot of similarities between the two). To get started, download Ollama for your platform and use it to run a model from their registry like so:
$ ollama run llama3.2:3b
If you just want to run a single prompt without starting a chat session, do this instead:
$ ollama run llama3.2:3b "why is the sky blue?"
You can also list all downloaded models using:
$ ollama list
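Keeping with the Docker analogy, a few other subcommands are handy for managing the local model store (the tag below is just an example; substitute any model from the registry):
$ ollama pull qwen2.5:14b   # download a model without starting a chat
$ ollama show qwen2.5:14b   # inspect parameters, quantization and context length
$ ollama rm qwen2.5:14b     # remove it to reclaim disk space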
Choosing models to explore
Once you have a tool installed for running models, the next step is to download a model. But which one? There are plenty of options and it is hard to pick. I suggest starting with the following, as I’ve found them to be good for regular use:
If you are looking for a model trained on more code, consider exploring:
For image reasoning models, consider:
- llava:13b
- llama3.2-vision
- moondream (a tiny model that packs a punch!)
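With the Ollama CLI, you can prompt a vision model by putting the image path directly in the prompt, mirroring the multimodal example in Ollama’s README (the file path here is hypothetical):
$ ollama run llava:13b "What is in this image? ./screenshots/error-dialog.png"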
I’ve used a few reasoning models, but not enough to make strong recommendations. These are some worth considering:
- deepseek-r1 (distilled from DeepSeek-R1)
- qwen3
Understanding Model Trade-offs
When selecting a local model, there are two key specifications that impact both size and performance:
Parameters
The number of parameters (like 8B or 16B) is the count of learned values in the model, in billions, fixed after training. More parameters generally mean more knowledge and better reasoning, but also a larger memory footprint and slower inference.
Quantization
An optimization that reduces a model’s size by using fewer bits (lower floating-point precision) to represent its weights. Levels are generally written as Q4_K_M (4-bit), Q8_0 (8-bit), FP32 (original precision) and so on, and represent a trade-off between memory usage and quality. The more aggressive the quantization (Q4 versus Q8), the smaller the model and the faster the loading and inference, but the lower the output quality.
Finding the balance
On my M2 Pro, I’ve found 7-8B models with Q8 quantization and 12-14B models with Q5 or even Q6 quantization to be a good balance between performance and quality. I’d suggest experimenting with these parameters on your own hardware to find your sweet spot. The process of finding the most bang-for-buck configuration is both educational and fun!
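As a back-of-the-envelope check, the weights alone need roughly parameters × bits ÷ 8 bytes: an 8B model is about 8GB at Q8_0 and closer to 5GB at Q4_K_M, before adding overhead for the context and KV cache, which is why these combinations fit in 16GB of RAM. The Ollama registry publishes quantization variants as tags, so comparing two of them is straightforward (the tags below exist for llama3.1 at the time of writing; check the model’s tags page for others):
$ ollama pull llama3.1:8b-instruct-q8_0
$ ollama pull llama3.1:8b-instruct-q4_K_M
$ ollama list   # compare sizes on disk, then compare quality on your own prompts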
You can further customize a model with parameters like context size and temperature, or add to the system prompt, by creating a Modelfile. See the Modelfile documentation for more details.
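Here’s a minimal sketch, with illustrative values for the base model, parameters and system prompt:
$ cat > Modelfile <<'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER temperature 0.2
SYSTEM You are a concise code-review assistant. Keep answers short and direct.
EOF
$ ollama create code-reviewer -f Modelfile
$ ollama run code-reviewer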
Better Tooling
Although Ollama provides a simple interface for interacting with models, it is designed to work only with open-source models. To work with any and every model through a consistent interface, consider using one of these two alternatives:
- simonw/llm: Access large language models from the command-line
  - Simple CLI with lots of useful functionality. Highly recommended; see the example after this list.
  - For a better understanding of the features, watch Simon’s talk
- open-webui: A UI-based AI interface
  - Provides a ChatGPT-like user interface for interacting with models
  - Makes it easy to drag-and-drop documents or images and prompt with them
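As a quick taste of the llm tool, here’s roughly how I wire it up to Ollama. llm-ollama is a third-party plugin, and the model ID should match whatever ollama list reports:
$ pipx install llm              # or: brew install llm
$ llm install llm-ollama        # plugin that exposes local Ollama models to llm
$ llm -m llama3.2:3b "explain the difference between a mutex and a semaphore"
$ llm logs -n 1                 # prompts and responses are logged locally for later review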
Editor Integration
A simple use-case for a local model is augmenting your editor workflow. This includes tasks like asking questions without leaving the editor, generating, reviewing, and explaining code, and writing documentation, scripts, and so on.
Everyone has their preferred editor workflow and configuration. I’ll walk through my personal setup.
Emacs
I’ve been using Sergey Kostyaev’s ellama for a while now, and it works well. There’s also jart/emacs-copilot by Justine Tunney, the author of llamafile, which provides copilot-style code completion that works with a llamafile model. I had some issues getting it to work, but it seems worth trying out.
Here’s my ellama configuration:
(use-package llm
  :straight (:host github :repo "ahyatt/llm"))

(use-package ellama
  :straight (:host github :repo "s-kostyaev/ellama")
  :init
  ;; setup key bindings
  (setopt ellama-keymap-prefix "C-c e")
  (require 'llm-ollama)
  (setopt ellama-provider
          (make-llm-ollama :chat-model "codestral:latest")))
While exploring other setups, I came across copilot.el, an Emacs plugin for GitHub Copilot. Curious to understand how this was made to work, I found the following explanation in Robert Krahn’s blog post:
Even though Copilot is primarily a VSCode utility, making it work in Emacs is fairly straightforward. In essence it is not much different than a language server. The VSCode extension is not open source but since it is implemented in JavaScript you can extract the vsix package as a zip file and get hold of the JS files. As far as I know, the copilot.vim plugin was the first non-VSCode integration that used that approach. The worker.js file that is part of the vsix extension can be started as a node.js process that will read JSON-RPC data from stdin. … An editor like Emacs or VIM can start the worker in a subprocess and then interact with, sending JSON messages and reading JSON responses back via stdout.
Here’s the configuration:
(use-package copilot
  :straight (:host github :repo "copilot-emacs/copilot.el" :files ("*.el"))
  :config
  ;; (add-to-list 'copilot-major-mode-alist '("enh-ruby" . "ruby"))
  (add-hook 'prog-mode-hook 'copilot-mode)
  (define-key copilot-completion-map (kbd "<tab>") 'copilot-accept-completion)
  (define-key copilot-completion-map (kbd "TAB") 'copilot-accept-completion))

(defvar kg/no-copilot-modes '(shell-mode
                              inferior-python-mode
                              eshell-mode
                              term-mode
                              vterm-mode
                              comint-mode
                              compilation-mode
                              debugger-mode
                              dired-mode-hook
                              compilation-mode-hook
                              flutter-mode-hook
                              minibuffer-mode-hook
                              shell-script-modes)
  "Modes in which copilot is inconvenient.")

(defun kg/copilot-disable-predicate ()
  "When copilot should not automatically show completions."
  (or (member major-mode kg/no-copilot-modes)
      (company--active-p)))

(add-to-list 'copilot-disable-predicates #'kg/copilot-disable-predicate)
I’ve also experimented with integrating Aider using aidermacs, but haven’t done enough to write about it yet. I’ll update this post when I have.
Visual Studio Code
My go-to tool here is Cline. Cline is an agentic code assistant that can reason and perform tasks like creating and editing files, running commands such as tests, and automatically fixing bugs after the tests run. It can infer the context required for a task. As a bonus, it also reports the input/output token counts and the cost of each query. Although it works fairly well with local models, using it with a state-of-the-art model like Claude has been a game-changer.
IntelliJ IDEA
I use the Continue.dev plugin, which provides both a chat and a code-completion interface. The chat and code-completion models can be configured independently; I use llama3.1:8b for the former and starcoder2:3b for the latter.
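For reference, here’s a sketch of how that split looks in Continue’s config.json. The schema has changed across versions, so treat the keys as illustrative and check the Continue docs for your version:
{
  "models": [
    { "title": "Llama 3.1 8B", "provider": "ollama", "model": "llama3.1:8b" }
  ],
  "tabAutocompleteModel": {
    "title": "StarCoder2 3B",
    "provider": "ollama",
    "model": "starcoder2:3b"
  }
}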
Evaluating models
Occasionally, you might find yourself wanting to compare responses from different models for a given prompt. You might be curious to see the differences in content or in how something is explained: whether nuances are captured, caveats are mentioned, or examples are used for illustration.
One tool that I’ve found useful is promptfoo. It is designed as a testing framework: you specify the prompts, the models to evaluate, and the test cases, and it executes them and generates a report. Here’s a simple configuration:
description: "General Instruction Evaluation"
prompts:
- "illustrate how works with an example"
providers:
- "ollama:chat:llama3.1:8b-instruct-q6_K"
- "ollama:chat:qwen2.5:14b"
- "ollama:chat:hf.co/unsloth/gemma-3-12b-it-GGUF:Q4_K_M"
tests:
- vars:
topic: dependency injection with Dagger
- vars:
topic: gig economy
Once the configuration is in place, run promptfoo eval to execute the tests against the models. The generated report provides a nice tabular view of the prompt and each model’s response side by side.
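To browse that report in a browser rather than the terminal, promptfoo also ships a local web viewer:
$ promptfoo eval    # runs the tests defined in promptfooconfig.yaml
$ promptfoo view    # opens the results in a local web UI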
This is, however, not the right tool if you want to compare and contrast models across a wide variety of use-cases to understand their strengths and flaws. For a more in-depth evaluation, consider using lm-evaluation-harness or deepeval instead.
Public Benchmarks
HuggingFace maintains an open model leaderboard where it constantly evaluates the models hosted on its platform, but it’s slow to load and buggy at times. I’d recommend looking at the following instead:
- LLM-stats for a good overview of model benchmarks and comparisons. It has filters for comparing open models with specific parameters. The visualizations are nice too.
- Aider benchmarks, particularly for code editing and refactoring
- StackEval for the ability to function as a coding assistant
Be wary of taking these benchmarks at face value. They are simply a filter for picking a few starter models to experiment with. Models can be trained on benchmark data to look better than they are, so you need to try them on practical tasks in everyday use. Even better if you have your own evaluation dataset to test a model against.
Final Thoughts
There’s still a lot I haven’t tried — newer models, IDE tools, and ideas. But that’s part of the fun. While this setup hasn’t radically transformed my workflow, it has added a few tools to the kit — ones that feel lightweight, local, and surprisingly capable. It’s a starting point for exploring what small models can do in a developer’s everyday environment and I’m curious to see just how far that can go.