Qwen3.5-4B-Base Fine-Tuning with Unsloth
Parent: Module 5B - LLM Application Development / Phase 3 - Artificial Intelligence
Fine-tune Qwen/Qwen3.5-4B-Base with Unsloth, then measure the training, export, and inference costs that matter to an AI hardware engineer.
Layer mapping: L1 (Application & Framework) feeding L2/L3 (compiler/runtime) and L5/L6 (memory, precision, accelerator design).
Prerequisites: PyTorch basics, tokenizer/chat-template fluency, LoRA/PEFT concepts, and one CUDA GPU with enough memory for 16-bit LoRA.
Role targets: AI Engineer / LLM Fine-Tuning Engineer / ML Platform Engineer / AI Inference Engineer
Course output: a reproducible LoRA adapter, an evaluation report, an exported inference artifact, and a short hardware note explaining VRAM, throughput, and deployment tradeoffs.
Course Currency Note
Model IDs, Unsloth support, and memory requirements change quickly. As of May 2026, Unsloth documents Qwen/Qwen3.5-4B-Base as part of its Qwen3.5 support and recommends 16-bit LoRA rather than 4-bit QLoRA for this family. Check the current Qwen model card and Unsloth Qwen3.5 guide before copying commands into a real training job.
Why This Matters for AI Hardware
Fine-tuning is not just a model-quality exercise. It changes the deployment artifact your hardware must serve:
- Adapter vs merged weights: unmerged LoRA adds extra low-rank matmuls at inference; merged weights remove that overhead but create a new model artifact.
- Precision choices: 16-bit LoRA, post-training quantization, and GGUF/AWQ/GPTQ exports all change memory traffic and kernel choices.
- Context length: SFT sequence length drives training activation memory, and evaluation context length drives KV-cache memory.
- Data shape: short instruction examples, long tool traces, and multimodal examples stress very different parts of the stack.
- Serving target: a fine-tuned 4B model can be trained on a workstation/cloud GPU, then quantized and profiled on Jetson or a custom runtime.
The hardware-relevant question is not "did loss go down?" It is: what artifact did you create, how big is it, what precision does it use, and how fast does it run on the target hardware?
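The context-length point above can be made concrete with a back-of-envelope KV-cache estimate. This is a minimal sketch that assumes a standard attention cache (one key and one value vector per layer, per KV head, per token). The layer count, KV-head count, and head dimension below are illustrative placeholders, not the published Qwen3.5-4B configuration, so read the real values from the model's config before trusting the number.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed for the K and V caches of a standard attention stack."""
    # Factor 2 covers the key tensor plus the value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical config values, for illustration only.
example = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=2048)
print(f"KV cache at 2048 tokens, batch 1, 16-bit: {example / 2**20:.0f} MiB")
```

The cache grows linearly with both sequence length and batch size, which is why evaluation context length belongs in the hardware note alongside weight precision.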
What You Will Build
| Artifact | Minimum expectation |
|---|---|
| Dataset | JSONL train/validation split with a documented chat template |
| Training run | Unsloth LoRA SFT run on Qwen/Qwen3.5-4B-Base |
| Adapter | Saved LoRA adapter with config, tokenizer, and training metadata |
| Evaluation | Base-vs-adapter comparison on held-out prompts |
| Export | Merged 16-bit model or quantized deployment artifact |
| Benchmark | VRAM, train tokens/s, eval tokens/s, adapter size, and output-quality notes |
Hardware Budget
Use this course to learn the memory envelope before you start optimizing kernels.
| Target | Use it for | Notes |
|---|---|---|
| 12-16 GB CUDA GPU | small LoRA SFT at modest sequence length | Use BF16 on Ampere or newer; use FP16 when BF16 is unavailable |
| 24 GB CUDA GPU | longer sequence length, larger batch, safer eval | Good local development target |
| Cloud L40S/A100/H100 | sweeps over LoRA rank, sequence length, and batch | Use when you want clean throughput data |
| Jetson Orin Nano / NX | post-training inference profiling only | Do not treat Jetson as the primary training target |
Keep the first run boring: sequence length 2048, LoRA rank 16, small clean dataset, and one evaluation script. Expand only after the pipeline is reproducible.
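The memory envelope in the table starts with the weights themselves. The sketch below assumes a nominal 4e9 parameters (the "4B" in the model name taken at face value, not an exact count) and counts weights only.

```python
def weight_gib(num_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring any format metadata."""
    return num_params * bits_per_param / 8 / 2**30


NOMINAL_PARAMS = 4e9  # assumption: nominal size, not the exact parameter count

for name, bits in [("bf16/fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>9}: {weight_gib(NOMINAL_PARAMS, bits):.2f} GiB")
```

Training adds activations, LoRA gradients, and optimizer state on top of this figure, and quantized formats also store scale metadata, so measured peaks will be higher than the estimate.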
1. Define the Fine-Tuning Job
Start with a precise objective. Good objectives are narrow enough that the base model can be evaluated directly:
- domain Q&A over a hardware manual
- terminal command explanation for CUDA/Jetson workflows
- structured bug triage from logs
- tool-call argument generation from natural language
- short hardware-design tutoring with cited source snippets
Avoid turning the first run into a general assistant. A base model needs the training data to teach both behavior and format, so vague data creates vague failures.
Dataset Schema
Use JSONL with explicit roles:
{"system":"You are a concise AI hardware engineering assistant.","user":"Explain why Qwen decode is memory-bandwidth bound at batch 1.","assistant":"At batch 1, each generated token streams large weight matrices for GEMV while doing little arithmetic reuse..."}
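Keep a validation file that the trainer never sees, for example data/qwen35_valid.jsonl alongside data/qwen35_train.jsonl (the paths used by the training script in Section 3).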
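One way to keep the split honest is to validate every record against the schema and assign it to a split deterministically. The sketch below is an assumption about how you might do this, not part of Unsloth or TRL; the helper names `validate_line` and `assign_split` are hypothetical. Hashing the normalized user prompt means exact duplicate prompts always land in the same split.

```python
import hashlib
import json

REQUIRED_KEYS = ("system", "user", "assistant")


def validate_line(line):
    """Return a list of problems for one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    return problems


def assign_split(user_prompt, valid_fraction=0.1):
    """Deterministically map a prompt to 'train' or 'validation'."""
    normalized = user_prompt.strip().lower().encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(normalized).digest()[:8], "big")
    # Integer comparison avoids floating-point edge cases at the boundaries.
    return "validation" if bucket < int(valid_fraction * 2**64) else "train"
```

Run the validator over the raw file first, then write each valid record to the file named by `assign_split(record["user"])`. Rerunning the script on the same data reproduces the same split.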
Chat Template
For supervised fine-tuning, the exact serialized text matters. Use the tokenizer's chat template when available instead of inventing one:
def format_example(example, tokenizer):
    messages = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    }
Failure mode to watch for: loss falls, but generation never stops or repeats role tags. That usually means an EOS/chat-template mismatch, not a hardware problem.
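A cheap guard against this failure is to check the serialized training text before any GPU time is spent. The sketch below assumes ChatML-style turn markers (`<|im_start|>` and `<|im_end|>`); that is an assumption about the template, so print one formatted example and pass the markers your tokenizer actually emits. The function name is hypothetical.

```python
def assistant_turns_closed(text, turn_open="<|im_start|>",
                           turn_close="<|im_end|>", role="assistant"):
    """True if the text has at least one assistant turn and each one is
    closed by turn_close before any other turn opens."""
    header = turn_open + role
    pos = 0
    seen = 0
    while True:
        start = text.find(header, pos)
        if start == -1:
            break
        seen += 1
        body_start = start + len(header)
        close = text.find(turn_close, body_start)
        next_open = text.find(turn_open, body_start)
        # Unterminated turn, or another turn starts before this one closes.
        if close == -1 or (next_open != -1 and next_open < close):
            return False
        pos = close + len(turn_close)
    return seen > 0
```

Running this over every `text` value produced by `format_example` and failing fast on the first `False` catches most termination bugs before training starts.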
2. Environment Setup
Use a fresh virtual environment. Qwen3.5 support may require current transformers and Unsloth packages.
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install --upgrade unsloth unsloth_zoo
pip install --upgrade datasets trl accelerate peft bitsandbytes torchvision pillow
At the time of writing, Qwen3.5 support expects a recent Unsloth release and Transformers v5. If package resolution installs an older Transformers build, follow Unsloth's current install guidance before debugging training code.
Record the actual versions:
python - <<'PY'
import torch, transformers, trl, peft
print("torch", torch.__version__)
print("cuda", torch.version.cuda)
print("transformers", transformers.__version__)
print("trl", trl.__version__)
print("peft", peft.__version__)
PY
Put this output in your final report. Fine-tuning results without package versions are hard to reproduce.
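If you would rather attach the versions to the report as a file, a variant that reads installed package metadata without importing the packages is sketched below. It uses only the standard library; the package list is an assumption, so extend it to match your environment.

```python
import json
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    pkgs = ["torch", "transformers", "trl", "peft", "unsloth", "datasets"]
    print(json.dumps(collect_versions(pkgs), indent=2, sort_keys=True))
```

Redirect the output to a JSON file next to the adapter so the versions travel with the artifact. This does not report the CUDA version, which still has to come from `torch.version.cuda`.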
3. Train with Unsloth LoRA
Use 16-bit LoRA first. Do not start with 4-bit QLoRA unless you have confirmed that your exact Qwen3.5 target and Unsloth version support it well enough for your quality bar.
from unsloth import FastLanguageModel  # import first so Unsloth's patches apply before trl/transformers

import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

MODEL_ID = "Qwen/Qwen3.5-4B-Base"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    load_in_4bit=False,
    load_in_16bit=True,
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=MAX_SEQ_LENGTH,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

raw = load_dataset(
    "json",
    data_files={
        "train": "data/qwen35_train.jsonl",
        "validation": "data/qwen35_valid.jsonl",
    },
)

def to_text(example):
    messages = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    }

dataset = raw.map(to_text, remove_columns=raw["train"].column_names)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    dataset_text_field="text",
    args=SFTConfig(
        output_dir="runs/qwen35-4b-unsloth-lora",
        max_seq_length=MAX_SEQ_LENGTH,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        warmup_ratio=0.03,
        num_train_epochs=1,
        logging_steps=10,
        eval_strategy="steps",  # the default is "no", which would leave eval_steps unused
        eval_steps=100,
        save_steps=100,
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        report_to="none",
    ),
)

trainer.train()
trainer.save_model("artifacts/qwen35-4b-unsloth-lora")  # saves the LoRA adapter, not merged weights
tokenizer.save_pretrained("artifacts/qwen35-4b-unsloth-lora")
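Before launching, it helps to know how many optimizer steps the configuration above will actually produce, because `eval_steps=100` and `save_steps=100` do nothing useful on a run shorter than 100 steps. The sketch below is a rough planning aid: the dataset size of 2000 examples is a made-up figure, and the trainer's own rounding of the last partial batch may differ by a step.

```python
import math


def run_budget(num_examples, per_device_batch, grad_accum, max_seq_len,
               num_devices=1, epochs=1):
    """Approximate step and token budget for one SFT run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "optimizer_steps": steps_per_epoch * epochs,
        # Upper bound: every sequence padded or packed to max_seq_len.
        "max_tokens_per_step": effective_batch * max_seq_len,
    }


# Values from the SFTConfig above; num_examples is hypothetical.
print(run_budget(num_examples=2000, per_device_batch=1, grad_accum=8, max_seq_len=2048))
```

With these inputs the run has about 250 optimizer steps, so evaluation fires twice. On a dataset of a few hundred examples, lower `eval_steps` and `save_steps` accordingly.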
For a larger run, sweep one variable at a time:
| Variable | Starting value | Sweep |
|---|---|---|
| LoRA rank | 16 | 8, 16, 32 |
| Sequence length | 2048 | 1024, 2048, 4096 |
| Effective batch | 8 | 8, 16, 32 |
| Learning rate | 2e-4 | 5e-5, 1e-4, 2e-4 |
Do not compare runs unless dataset split, seed, sequence length, and eval prompts are fixed.
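The rank sweep has a direct cost in adapter size, which can be estimated before training. For a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` LoRA pair adds `r * (d_in + d_out)` parameters. The sketch below uses a toy block of seven 1024x1024 projections; those shapes are placeholders, not Qwen3.5-4B dimensions, so substitute the real shapes from the loaded model.

```python
def lora_param_count(layer_shapes, rank):
    """Trainable LoRA parameters for (in_features, out_features) layers."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)


def adapter_mib(layer_shapes, rank, bytes_per_param=2):
    """Approximate adapter size in MiB at the given storage precision."""
    return lora_param_count(layer_shapes, rank) * bytes_per_param / 2**20


# Hypothetical shapes: one block with seven adapted projections.
toy_block = [(1024, 1024)] * 7
for r in (8, 16, 32):
    print(f"rank {r:>2}: {lora_param_count(toy_block, r):>7} params, "
          f"{adapter_mib(toy_block, r):.3f} MiB")
```

Adapter size scales linearly with rank, so doubling the rank doubles the update payload. Compare the estimate against the saved adapter file to confirm which modules were actually adapted.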
4. Evaluate Before You Export
Run the same held-out prompts against:
- Qwen/Qwen3.5-4B-Base
- base model + LoRA adapter
- merged model, if you merge
- quantized artifact, if you quantize
Measure both quality and systems behavior:
| Metric | Why it matters |
|---|---|
| Validation loss | Fast sanity check, not final proof |
| Exact-format pass rate | Catches chat/template/schema regressions |
| Human or LLM-judge win rate | Measures task improvement |
| Hallucination / unsupported-claim rate | Important for hardware docs and manuals |
| Peak VRAM during train | Determines feasible local hardware |
| Train tokens/s | Captures loader, checkpointing, and GPU efficiency |
| Inference tok/s and TTFT | Determines serving cost |
| Adapter size | Determines OTA/update feasibility |
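The exact-format pass rate is the easiest metric in this table to automate. The sketch below assumes a tool-call task where each output must be a JSON object containing `tool` and `arguments` keys; that schema is an illustrative assumption, so swap in whatever format your task actually requires.

```python
import json


def is_valid_tool_call(output, required_keys=("tool", "arguments")):
    """True if output parses as a JSON object with all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def pass_rate(outputs, check=is_valid_tool_call):
    """Fraction of output strings accepted by the format checker."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if check(o)) / len(outputs)
```

Apply the same checker to every artifact so that a drop between the adapter and the quantized model shows up as a number rather than an impression.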
Minimal eval table:
| Model artifact | Pass rate | Win rate vs base | TTFT | Decode tok/s | Notes |
|---|---|---|---|---|---|
| Base | | | | | |
| LoRA | | | | | |
| Merged | | | | | |
| Quantized | | | | | |
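To fill the TTFT and decode columns consistently across runtimes, time a token stream rather than a whole `generate` call. The sketch below works with any iterator that yields once per generated token; how you obtain that iterator depends on the runtime and is left as an assumption. The `clock` parameter exists only so the function can be tested with fake timestamps.

```python
import time


def measure_stream(token_iter, clock=time.perf_counter):
    """Return token count, time to first token, and steady-state decode rate."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_iter:
        last = clock()
        if first is None:
            first = last
        count += 1
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "decode_tok_s": None}
    decode_time = last - first
    # The first token belongs to prefill, so decode rate uses count - 1.
    rate = (count - 1) / decode_time if count > 1 and decode_time > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tok_s": rate}
```

Keeping TTFT separate from decode rate matters because prefill is typically compute-heavy while decode is dominated by memory traffic, and the two respond differently to quantization and batch size.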
5. Merge, Export, and Deploy
Save the adapter first. Merge only after the adapter passes evaluation.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="artifacts/qwen35-4b-unsloth-lora",
    max_seq_length=2048,
    load_in_4bit=False,
)

model.save_pretrained_merged(
    "artifacts/qwen35-4b-merged-16bit",
    tokenizer,
    save_method="merged_16bit",
)
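What merging does can be shown with a few lines of arithmetic. The sketch below is a toy pure-Python illustration, not Unsloth's implementation: it assumes the standard LoRA scaling of `alpha / r` and checks that the unmerged path `W x + scale * B (A x)` gives the same output as a single matmul with `W' = W + scale * B A`.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def lora_forward(W, A, B, x, alpha, r):
    """Unmerged path: base matmul plus two extra low-rank matmuls."""
    scale = alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def merge_lora(W, A, B, alpha, r):
    """Merged path: fold the low-rank update into the base weight once."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]


W = [[1, 2], [3, 4]]   # d_out x d_in
A = [[1, 0]]           # r x d_in
B = [[2], [1]]         # d_out x r
x = [1, 1]
print(lora_forward(W, A, B, x, alpha=2, r=1))
print(matvec(merge_lora(W, A, B, alpha=2, r=1), x))
```

The two printed vectors match exactly here because the toy values are small integers. With real 16-bit weights the merged and unmerged outputs agree only to within rounding, which is why the merged model should be re-run on the held-out prompts rather than assumed identical.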
For deployment, choose the artifact that matches the runtime:
| Runtime path | Artifact | What to check |
|---|---|---|
| Transformers / PEFT | base + LoRA adapter | adapter load latency and extra matmul overhead |
| vLLM / TensorRT-LLM | merged model | version support for Qwen3.5, tokenizer, RoPE, QKV bias, and supported architecture |
| llama.cpp / GGUF | quantized GGUF | post-quant quality, prompt template, and tok/s |
| Jetson edge demo | small quantized artifact | RAM pressure, thermals, and sustained decode speed |
If you quantize after fine-tuning, evaluate again. A fine-tune that improves BF16 quality can still regress after aggressive quantization.
6. Hardware Note Template
Every student should finish with a one-page hardware note:
# Qwen3.5-4B Unsloth Fine-Tuning Hardware Note
## Training setup
- GPU:
- VRAM:
- CUDA / PyTorch / Unsloth:
- Sequence length:
- LoRA rank:
- Effective batch:
## Results
- Peak train VRAM:
- Train tokens/s:
- Final train loss:
- Validation loss:
- Adapter size:
- Merged model size:
## Inference
- Runtime:
- Precision / quantization:
- Prompt length:
- TTFT:
- Decode tok/s:
- Peak memory:
## Interpretation
- What changed versus the base model?
- Did the adapter create measurable inference overhead?
- What precision is the best deployment compromise?
- What would this imply for SRAM, memory bandwidth, and kernel fusion on a custom accelerator?
The last four questions are the bridge from "I fine-tuned a model" to "I understand what this workload costs in hardware."
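For the memory-bandwidth question, a first-order bound is often enough to start the discussion. At batch 1, each decoded token has to read roughly all of the weights once, so decode speed cannot exceed memory bandwidth divided by bytes read per token. The sketch below assumes a nominal 4e9 parameters and a placeholder bandwidth of 100 GB/s; neither figure is a specification of any device listed in this course.

```python
def decode_tok_s_upper_bound(weight_bytes, mem_bandwidth_bytes_per_s,
                             kv_bytes_read_per_token=0):
    """Bandwidth-limited ceiling on batch-1 decode speed, in tokens/s."""
    bytes_per_token = weight_bytes + kv_bytes_read_per_token
    return mem_bandwidth_bytes_per_s / bytes_per_token


BANDWIDTH = 100e9      # placeholder: 100 GB/s, not a real device spec
NOMINAL_PARAMS = 4e9   # assumption: nominal size, not the exact count

for name, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    bound = decode_tok_s_upper_bound(NOMINAL_PARAMS * bytes_per_param, BANDWIDTH)
    print(f"{name}: at most {bound:.1f} tok/s")
```

Comparing measured decode tok/s against this ceiling shows how much of the gap is kernel or runtime overhead rather than a hard memory limit.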
Common Failure Modes
| Symptom | Likely cause | Fix |
|---|---|---|
| Training OOMs immediately | sequence length or batch too high | lower sequence length, use gradient checkpointing, reduce batch |
| Loss falls but eval is worse | bad data, memorization, wrong eval template | inspect examples, deduplicate, fix template |
| Model repeats role tags | EOS/chat-template mismatch | verify tokenizer template and assistant termination |
| Exported model differs from adapter | merge/export path bug | compare base+adapter vs merged before quantization |
| Quantized artifact loses task skill | quantization too aggressive | try higher-bit quant or exempt sensitive tensors |
| Runtime crashes on Qwen weights | architecture support gap | verify QKV bias, RoPE layout, tokenizer, and model config support |
Capstone
Fine-tune Qwen/Qwen3.5-4B-Base on a small hardware-engineering instruction dataset, then deploy the result through two inference paths:
- base + LoRA adapter in Transformers
- merged or quantized artifact in an inference runtime
Deliver:
- data_card.md describing source, filters, split, and template
- train_config.yaml or equivalent script arguments
- saved LoRA adapter
- base-vs-finetuned evaluation report
- inference benchmark table
- hardware note connecting the results to memory, precision, and serving cost
Resources
| Resource | Why it matters |
|---|---|
| Qwen/Qwen3.5-4B-Base model card | Source model, config, license, and usage notes |
| Unsloth Qwen3.5 fine-tuning guide | Current Unsloth support, recommended precision, and training examples |
| TRL SFTTrainer documentation | Trainer API used for supervised fine-tuning |
| PEFT LoRA documentation | Adapter mechanics and tunable parameters |
| Phase 5 - Qwen Inference Optimization | Companion inference course for profiling the exported model |
Next
-> Phase 5 - Qwen Inference Optimization once the fine-tuned artifact is ready to benchmark.