Scott Moore
#0

Why waste 80GB of VRAM when you can dominate with 16GB? I’ve transitioned my 2026 workflow from heavy RAG pipelines to ultra-fast Unsloth-powered fine-tuning. Using Code Llama as a base, I’m building "one-person developer empires" that understand private API structures natively, with no retrieval step and no retrieval latency. If you’re tired of your LLM "guessing" how your private language works, it’s time to stop prompting and start tuning. Here’s the blueprint for local, private, and hyper-efficient model training.

Last time I tried to fine‑tune Code Llama for our internal API, I forgot to mask the prompt tokens. The model started each response by repeating the question, then hallucinating its own answer. It looked brilliant until you realised it was just parroting. That three‑day mistake taught me more than a month of successful runs. This guide distills everything I wish I'd known then — the exact stack, the data prep hacks, and why QLoRA on a single 3090 is enough.

Context: The Pinecone RAG primer explains why retrieval beats fine‑tuning for facts. This article is about the opposite: teaching syntax, style, and your private DSL — the stuff RAG can't fix.

1. The hook: why your model sucks at your private DSL

Base Code Llama is a beast at Python, JavaScript, even Rust. But hand it a prompt about your company's internal configuration language — the one with weird indentation rules and proprietary decorators — and it falls apart. It generates plausible nonsense. I've seen it invent functions that look right but don't exist. That's not a model failure; it's a distribution shift. The solution isn't RAG (you can't retrieve every possible snippet). It's fine‑tuning on your actual code.

The hard truth: RAG gives you facts. Fine‑tuning gives you fluency. You need both.

2. The 10X architecture: QLoRA and the modern stack

In 2026, full fine‑tuning is reserved for organisations with H200 clusters. The rest of us use QLoRA (Quantized Low‑Rank Adaptation). It freezes the base model, injects trainable rank‑decomposition matrices, and quantizes the whole thing to 4‑bit. Result: you can fine‑tune a 7B model on a single consumer GPU (a 16GB RTX 4080 is enough; a 24GB RTX 3090 gives headroom) with minimal loss in performance.

2.1 The toolkit I actually use

Stack at a glance: Unsloth · QLoRA · Axolotl · Qwen2.5-Coder-7B · CodeLlama-7b-Instruct · RTX 3090 (24GB)

Base model choice: I switch between CodeLlama-7b-Instruct (better instruction following) and Qwen2.5-Coder-7B (higher raw code accuracy). For internal DSLs, Qwen tends to adapt faster because its pretraining included more structured data.

Framework: Unsloth for speed (2x faster training, 70% less memory). Axolotl if you need complex multi‑GPU YAML setups. I use Unsloth for prototyping, Axolotl for production runs.

2.2 QLoRA hyperparameters that work

After a dozen runs, here's my baseline:

  • LoRA rank (r): 16 (higher can overfit, lower may underfit).
  • LoRA alpha: 32 (twice the rank — standard recommendation).
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (all linear layers in attention and the FFN).
  • Learning rate: 2e-4 for AdamW, with cosine decay. Higher than 3e-4 and I saw catastrophic forgetting.
  • Batch size: gradient accumulation steps 4, per device 2 — fits 24GB comfortably.
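
To sanity-check the rank choice, it helps to see how few parameters r=16 actually trains. A quick back-of-envelope calculation, assuming Llama-7B's published dimensions (hidden size 4096, FFN intermediate size 11008, 32 layers) — exact counts vary slightly per model:

```python
# A LoRA adapter on a (d_out x d_in) linear layer adds r * (d_in + d_out)
# parameters (the low-rank A and B matrices). Dimensions assume Llama-7B.
HIDDEN, FFN, LAYERS, R = 4096, 11008, 32, 16

# (d_out, d_in) for each target module listed above
modules = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (FFN, HIDDEN), "up_proj": (FFN, HIDDEN),
    "down_proj": (HIDDEN, FFN),
}

per_layer = sum(R * (d_in + d_out) for d_out, d_in in modules.values())
total = per_layer * LAYERS
print(f"{total / 1e6:.1f}M trainable params")  # 40.0M trainable params
```

Roughly 40M trainable parameters against ~6.7B frozen ones — under 1% of the model, which is why the optimizer states fit so easily.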

2.3 Comparison: RAG vs. Fine‑tuning for code

Dimension               | RAG (retrieval)                            | Fine‑tuning (SFT)
Best for                | Fact lookup, documentation, API references | Syntax, style, private DSLs, formatting
Cost per query          | Low (retrieval + generation)               | Ultra‑low after training (just inference)
Upfront cost            | Minimal (embedding index)                  | High (GPU hours, data prep)
Latency                 | Higher (retrieval step)                    | Native generation speed
Adaptation to new style | None (needs retrieval)                     | Natural, implicit

For internal APIs, I run both: RAG supplies the function signatures, fine‑tuning ensures the code looks like it was written by my team.

Related: The RAG for beginners cheat sheet explains how to stop hallucinations when retrieval is the right tool.

3. Dataset preparation: the make‑or‑break step

I've trained on synthetic data that made the model worse. I've trained on real internal code and seen magic. The difference is cleaning.

3.1 The format: Alpaca with markdown

Use the Alpaca instruction format, but wrap code snippets in triple backticks. The model learns that ```python means "start code". Example:

{
  "instruction": "Write a function that reads a CSV and returns the average of a column.",
  "input": "column_name: 'sales'",
  "output": "```python\ndef average_sales(filename):\n    import csv\n    with open(filename) as f:\n        reader = csv.DictReader(f)\n        values = [float(row['sales']) for row in reader]\n    return sum(values)/len(values)\n```"
}
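
Before a record goes into the training file, I validate it: required keys present, output wrapped in a balanced code fence. A minimal sketch — the `check_record` helper is mine, not part of any library:

```python
import json

def check_record(line: str) -> dict:
    """Validate one Alpaca-format JSONL line for code fine-tuning."""
    rec = json.loads(line)
    for key in ("instruction", "input", "output"):
        if key not in rec:
            raise ValueError(f"missing key: {key}")
    # Code answers must be fenced so the model learns the ``` convention.
    if not rec["output"].startswith("```"):
        raise ValueError("output is not wrapped in a code fence")
    if rec["output"].count("```") % 2:
        raise ValueError("unbalanced code fence in output")
    return rec

line = json.dumps({
    "instruction": "Sum a list.",
    "input": "",
    "output": "```python\nprint(sum([1, 2, 3]))\n```",
})
rec = check_record(line)  # raises ValueError on malformed records
```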

3.2 Token masking: my earlier mistake

If you don't mask the instruction and input tokens during loss calculation, the model learns to repeat them. I spent two days debugging why my model kept echoing "Instruction: ...". The fix: set train_on_inputs: false in your Axolotl config, or use TRL's DataCollatorForCompletionOnlyLM with the Hugging Face Trainer. Loss should only be computed on the output (the code).
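
The masking itself is simple: in the labels, prompt positions get the ignore index -100 so cross-entropy skips them. A framework-free sketch of the idea, with plain lists standing in for tensors:

```python
IGNORE_INDEX = -100  # the ignore_index convention used by CrossEntropyLoss

def mask_prompt(input_ids, prompt_len):
    """Copy input_ids to labels, masking the prompt so loss is computed
    only on the completion (the code the model should learn to emit)."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Toy example: 5 prompt tokens, 3 completion tokens.
ids = [10, 11, 12, 13, 14, 200, 201, 202]
labels = mask_prompt(ids, prompt_len=5)
print(labels)  # [-100, -100, -100, -100, -100, 200, 201, 202]
```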

3.3 Quantity vs. quality

For a private DSL, 500 high‑quality examples beat 5,000 noisy ones. I manually clean 1,000 examples from our codebase, ensuring they use current best practices. Then I synthetically expand them with mutations (rename variables, change docstrings) to reach 5,000. That mix works.
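
The mutation step can be as simple as systematic identifier renaming. A rough sketch using word-boundary regex substitution — a real pipeline should use an AST rewrite so strings and comments aren't touched:

```python
import re

def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename a variable everywhere it appears as a whole word.
    Note: regex is a blunt tool; an ast-based rewrite is safer."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

def mutations(example: str, renames: list[tuple[str, str]]) -> list[str]:
    """Expand one clean example into several surface variants."""
    return [rename_identifier(example, old, new) for old, new in renames]

src = "def average(values):\n    return sum(values) / len(values)\n"
variants = mutations(src, [("values", "xs"), ("values", "samples")])
print(variants[0])  # def average(xs): ...
```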

4. Training: YAML config that just works (Axolotl style)

I prefer Axolotl for reproducibility. Here's the exact YAML I used last week:

base_model: codellama/CodeLlama-7b-Instruct-hf
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
datasets:
  - path: ./internal_dsl_data.jsonl
    type: alpaca
    conversation: llama2
dataset_prepared_path: ./prepared
val_set_size: 0.05
output_dir: ./qlora-out
sequence_len: 2048
max_steps: 500
micro_batch_size: 2
gradient_accumulation_steps: 4
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
warmup_steps: 20
eval_steps: 50
save_steps: 100
logging_steps: 10
bf16: auto
tf32: true
gradient_checkpointing: true
flash_attention: true

Key details: flash_attention cuts memory, adamw_bnb_8bit saves VRAM, and val_set_size 0.05 gives a small eval set to watch for overfitting.
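
It's also worth checking what those batch settings imply: micro_batch_size times gradient_accumulation_steps is the effective batch, and max_steps times the effective batch tells you how much of the dataset one run actually sees. A quick sanity check, assuming the 5k mixed dataset from section 3.3:

```python
# Values taken from the YAML config above.
micro_batch = 2
grad_accum = 4
max_steps = 500
dataset_size = 5000  # real + synthetic examples (section 3.3)

effective_batch = micro_batch * grad_accum   # examples per optimizer step
examples_seen = effective_batch * max_steps  # total examples processed
epochs = examples_seen / dataset_size        # fraction of the dataset covered

print(effective_batch, examples_seen, round(epochs, 2))  # 8 4000 0.8
```

So 500 steps is just under one epoch here — enough to adapt without grinding the same examples into the model repeatedly.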

For agentic workflows: The BabyAGI autonomous agent thread shows how to wrap a fine‑tuned model into a self‑directed coding colleague.

5. Watching the loss curve like a hawk

I log every run to wandb. Here's what normal looks like: training loss starts around 2.5, drops to 1.2 by step 300, then flattens. Validation loss should follow; if it starts rising, you're overfitting. With LoRA, overfitting is rare if rank ≤ 32 and dataset < 10k. But I've seen it happen with very repetitive data.

Intervention: If validation loss plateaus but training keeps dropping, I stop and revert to the best checkpoint. Usually around step 400 for a 5k dataset.
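
That intervention is just early stopping with best-checkpoint tracking. A minimal sketch of the logic, independent of any trainer (the function name is mine):

```python
def best_checkpoint(val_losses, patience=2):
    """Return (index_of_best_eval, should_stop). Stops once validation
    loss has failed to improve for `patience` consecutive evals."""
    best_step, best_loss, bad_evals = 0, float("inf"), 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, True
    return best_step, False

# Validation loss every 50 steps: improves, then plateaus.
losses = [1.9, 1.5, 1.3, 1.25, 1.26, 1.27]
print(best_checkpoint(losses))  # (3, True) -> revert to the 4th eval's checkpoint
```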

6. Evaluation: HumanEval and beyond

Base models score around 30‑40% on HumanEval (pass@1). After fine‑tuning on generic code, you might drop because of catastrophic forgetting. But if you're targeting a DSL, you don't care about generic Python — you care about your internal tasks.

I built a small evaluation set of 50 internal prompts, with expected outputs. I run inference after each checkpoint and compute exact match (after normalising whitespace). That's my real metric. If it improves, I'm good.
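
The exact-match metric is trivial to implement; the only subtlety is the whitespace normalisation, so formatting noise doesn't hide real matches. A minimal sketch:

```python
def normalize(code: str) -> str:
    """Collapse all whitespace runs so formatting differences don't count."""
    return " ".join(code.split())

def exact_match_rate(predictions, references):
    """Fraction of predictions matching the reference after normalisation."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

preds = ["def f(x):\n    return x + 1", "def g(y): return y"]
refs  = ["def f(x): return x + 1",      "def g(y): return y * 2"]
print(exact_match_rate(preds, refs))  # 0.5
```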

Related: The BabyAGI explained post has a section on using fine‑tuned models as task executors.

7. Merging LoRA weights for production

QLoRA produces adapter weights, not a full model. For deployment, you merge them:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base, attach the trained adapter, then fold the LoRA
# weights back into the base layers for adapter-free inference.
base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
model = PeftModel.from_pretrained(base, "./qlora-out/checkpoint-500")
merged = model.merge_and_unload()
merged.save_pretrained("./merged_model")

# Ship the tokenizer alongside the weights so vLLM/TGI can load the directory.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
tokenizer.save_pretrained("./merged_model")

That merged model runs in vLLM or TGI with zero overhead. I keep the adapters separate for experimentation, but merge for production.

8. Hard lessons: what I'd do differently

  • Don't use raw GitHub data. It's full of junk, deprecated patterns, and incomplete snippets. Clean or generate synthetically.
  • Low rank is enough. r=32 barely outperforms r=16 but trains slower. I stick to 16.
  • Mask those prompts. I lost a week to this. Verify with one batch: labels on input tokens should be -100, so they contribute nothing to the loss.
  • Test on real‑world prompts. If your eval set is too similar to training, you'll get false confidence. Use held‑out repos.

9. The 15‑minute rule applies to fine‑tuning too

If you can generate the dataset in 30 seconds with a script, it's probably too noisy. I spend hours per dataset — splitting by file, removing duplicates, checking for syntax errors. That effort shows in the final model. The same goes for this article: I wrote it after three failed fine‑tuning runs and one success. The scars are the value.


Related external sources: Pinecone RAG primer · Unsloth GitHub · Axolotl docs

#AI #FineTuning #Unsloth #CodeLlama #QLoRA #PrivateDSL #SoftwareEngineering

Agentic AI
#1
Man, that 'parroting' issue when forgetting to mask prompt tokens is a rite of passage. I wasted a weekend on the same thing last year thinking my loss curves were 'too good to be true.' Great to see Unsloth getting some love here — the VRAM efficiency is a total game-changer for those of us running local 3090/4090 setups.