After reading Kiran’s blog on self-hosting LLMs, I decided to try it out myself to see how viable they are for real-world tasks. Since I was already working on the Text-to-SQL problem with my colleagues, this seemed like the perfect use case to evaluate.
Key Takeaways
- Local LLMs achieve comparable accuracy to frontier models on easy and moderate Text-to-SQL tasks.
- VRAM is the hard limit for model size and performance. 16GB was just enough to run 14B models comfortably with quantization.
- Tools like Ollama abstract away a lot of the infrastructure overhead, making it surprisingly easy to test local models with an API-style interface.
- Prompt minimalism (plus a strict JSON output schema) brought roughly a 5% accuracy improvement.
Why Local LLMs?
Two main factors motivated this exploration:
- The control — I could choose which model to run, tweak parameters, test different quantizations, and see how it all affects performance.
- I was curious about how an LLM is loaded onto a GPU, how Ollama handles model execution, and how it uses llama.cpp internally.
The Setup
Hardware & Runner
Hardware: 16GB RX 7800 XT GPU
Software: Ollama with Tailscale for remote access
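Ollama serves its API on port 11434; with the server bound to a non-loopback address (e.g. via the OLLAMA_HOST environment variable) and the workstation on the tailnet, any of my devices can reach it. A quick reachability check might look like this (gpu-box is just a placeholder for the machine's Tailscale name):

```python
import requests

# Placeholder Tailscale MagicDNS name for the GPU workstation running Ollama.
OLLAMA_URL = "http://gpu-box:11434"

# /api/tags lists the models the Ollama server has pulled locally.
resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["name"])
```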
Dataset
Dataset: Bird-Bench
Prompt
prompt = f"""
Given the following question: "{question}"
And the following database table definitions:
{table_ddls}
Generate a valid sqlite query that answers this question.
Ensure the query is syntactically correct and uses only the tables and columns defined above.
The query should only return the exact column(s) necessary to answer the question. Avoid including extra data unless it's the answer.
Return your response strictly in the JSON format, with the following fields:
"query": "your sqlite query here",
"explanation": "a brief explanation of how the query answers the question"
Important: Return ONLY the raw JSON object without any markdown formatting, code blocks, or additional text.
"""
# Add evidence to the prompt if provided
if evidence:
    prompt += f"""
Additional evidence to consider:
{evidence}
"""
Models Tested
Qwen 2.5-Coder (14B & 32B, Q4_K/Q8_0), DeepCoder, Devstral.
Agentic Workflow
I wrote a custom script to interact with multiple LLMs in an agentic workflow. In the script, you can specify the provider (e.g., OpenAI, Google, Anthropic, Ollama) and select the model to use. I initially used only cloud-based APIs like Gemini, Claude, and ChatGPT, but adding Ollama as a provider was seamless, and I could run whichever model I wanted. You can find the code here.
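The script itself is linked above, so I won't reproduce it, but one way this kind of provider switching can work is via OpenAI-compatible endpoints, which Ollama also exposes. A simplified sketch of the idea (the registry, keys, and model names below are illustrative, not the actual code):

```python
import os
from openai import OpenAI

# Illustrative provider registry; only the base URL and key differ per provider.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "api_key": os.environ.get("OPENAI_API_KEY", "")},
    # Ollama's OpenAI-compatible endpoint; the key is ignored but the SDK requires one.
    "ollama": {"base_url": "http://localhost:11434/v1", "api_key": "ollama"},
}

def ask(provider: str, model: str, prompt: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Same call, different backend:
# ask("ollama", "qwen2.5-coder:14b", prompt) vs ask("openai", "gpt-4o", prompt)
```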
Results
Benchmark 1: 50 Easy Text-to-SQL Tasks
The actual SQL output is available in the GitHub repo here.
| Model Name | Parameters | Quantization | SQL Accuracy |
|---|---|---|---|
| Devstral (Best so far) | 24B | Q8_0 | 54% |
| Qwen2.5-Coder | 14B | Q8_0 | 50% |
| Qwen2.5-Coder | 32B | Q8_0 | 50% |
| Qwen2.5-Coder | 32B | Q4_K | 48% |
| Qwen2.5-Coder | 14B | Q4_K | 46% |
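For context on the "SQL Accuracy" column: Bird-Bench-style evaluation is execution-based, i.e. a predicted query counts as correct if it returns the same result set as the gold query on the benchmark's SQLite database. A minimal sketch of that kind of check (not the exact harness behind these numbers):

```python
import sqlite3

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same result set on the given SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        predicted = conn.execute(predicted_sql).fetchall()
        gold = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        # Queries that fail to parse or execute count as misses.
        return False
    finally:
        conn.close()
    # Compare as sets of rows so row order doesn't affect the outcome.
    return set(predicted) == set(gold)
```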
Model Performance Snapshot
A set of 5 easy questions, to get an idea of response times for each model/size/quantization combination.
| Model Name | Parameters | Quantization | VRAM Usage | Response Time |
|---|---|---|---|---|
| Qwen2.5-Coder | 14B | Q4_K | 9GB | 2–5s (fastest) |
| Qwen2.5-Coder | 14B | Q8_0 | 15.6GB | 6–11s |
| Qwen2.5-Coder | 32B | Q4_K | 19.8GB | 9–47s |
| DeepCoder | 14B | Q4_K | 10GB | 17–120s |
Interpretation
- Quantization plays a huge role in running models efficiently.
- On this GPU, the 4-bit versions made the difference between a responsive and an unusable setup.
- That speed came at a cost: the Q4_K models were slightly less accurate at generating SQL than their Q8_0 counterparts (46–48% vs 50% for the Qwen models above).
Wrapping Up & What’s Next
For me, local LLMs have become good enough for many workflows. Text-to-SQL is one of them.
On a single-GPU workstation, I can now:
- Run easy-to-moderate SQL tasks
- Quickly prototype agentic workflows
- Control the full LLM stack
- Easily expose LLMs over the network using Tailscale and the like
Even though local LLMs are good enough for this use case, frontier models are still well ahead on benchmarks; on Bird-Bench alone, the gap is 15–20%. Running bigger and better models will require a bigger GPU. I have several things planned:
- Introduce evals to make the feedback loop quicker and tighter.
- Make the agentic workflow smarter, so it can iterate on SQL generation to increase accuracy.
- Introduce hybrid routing: most traffic hits the local LLM first, and if the generated SQL is still invalid after multiple iterations, the agent falls back to a cloud-hosted frontier model (a rough sketch follows this list).
- Measure energy usage, and come up with strategies to minimize it.
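The hybrid-routing fallback is simple enough to sketch already, even though it isn't built yet. In the sketch below, the two generator callables stand in for the local and cloud agent steps, and validity is approximated with SQLite's EXPLAIN QUERY PLAN; all of this is illustrative, not the final design:

```python
import sqlite3
from typing import Callable

def route_text_to_sql(
    question: str,
    generate_local: Callable[[str], str],     # e.g. a wrapper around the Ollama call
    generate_frontier: Callable[[str], str],  # e.g. a wrapper around a cloud API
    db_path: str,
    max_local_attempts: int = 3,
) -> str:
    """Prefer the local model; fall back to a frontier model if its SQL never validates."""
    def is_valid(sql: str) -> bool:
        conn = sqlite3.connect(db_path)
        try:
            # Planning the query catches syntax errors and unknown tables/columns
            # without actually running it.
            conn.execute("EXPLAIN QUERY PLAN " + sql)
            return True
        except sqlite3.Error:
            return False
        finally:
            conn.close()

    for _ in range(max_local_attempts):
        sql = generate_local(question)
        if is_valid(sql):
            return sql
    return generate_frontier(question)
```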