Qwen3.5-4B-Base Fine-Tuning with Unsloth

Parent: Module 5B - LLM Application Development / Phase 3 - Artificial Intelligence

Fine-tune Qwen/Qwen3.5-4B-Base with Unsloth, then measure the training, export, and inference costs that matter to an AI hardware engineer.

Layer mapping: L1 (Application & Framework) feeding L2/L3 (compiler/runtime) and L5/L6 (memory, precision, accelerator design).

Prerequisites: PyTorch basics, tokenizer/chat-template fluency, LoRA/PEFT concepts, and one CUDA GPU with enough memory for 16-bit LoRA.

Role targets: AI Engineer / LLM Fine-Tuning Engineer / ML Platform Engineer / AI Inference Engineer

Course output: a reproducible LoRA adapter, an evaluation report, an exported inference artifact, and a short hardware note explaining VRAM, throughput, and deployment tradeoffs.

Course Currency Note

Model IDs, Unsloth support, and memory requirements change quickly. As of May 2026, Unsloth documents Qwen/Qwen3.5-4B-Base as part of its Qwen3.5 support and recommends 16-bit LoRA rather than 4-bit QLoRA for this family. Check the current Qwen model card and Unsloth Qwen3.5 guide before copying commands into a real training job.

Why This Matters for AI Hardware

Fine-tuning is not just a model-quality exercise. It changes the deployment artifact your hardware must serve:

Adapter vs merged weights: unmerged LoRA adds extra low-rank matmuls at inference; merged weights remove that overhead but create a new model artifact.
Precision choices: 16-bit LoRA, post-training quantization, and GGUF/AWQ/GPTQ exports all change memory traffic and kernel choices.
Context length: SFT sequence length drives training activation memory, and evaluation context length drives KV-cache memory.
Data shape: short instruction examples, long tool traces, and multimodal examples stress very different parts of the stack.
Serving target: a fine-tuned 4B model can be trained on a workstation/cloud GPU, then quantized and profiled on Jetson or a custom runtime.
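
The adapter-versus-merged tradeoff can be checked on a toy layer. The sketch below assumes the standard LoRA formulation, where the adapted output is `W x + (alpha / r) * B (A x)`; the matrices, sizes, and helper names are invented for illustration and are not taken from Qwen3.5.

```python
# Toy check that merging a LoRA adapter does not change the layer's output.
# Shapes and values are invented for illustration only.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def lora_unmerged(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x): two extra small matmuls per call."""
    scale = alpha / r
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]

def lora_merge(W, A, B, alpha, r):
    """W' = W + (alpha / r) * B A: same shape as W, no adapter at inference."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def extra_macs_per_token(d_in, d_out, r):
    """Multiply-accumulates the unmerged adapter adds to one linear layer."""
    return r * (d_in + d_out)

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # d_out=2, d_in=3
A = [[0.5, 0.0, 1.0]]                     # r=1, d_in=3
B = [[2.0], [-1.0]]                       # d_out=2, r=1
x = [1.0, 2.0, 3.0]

y_unmerged = lora_unmerged(W, A, B, x, alpha=2, r=1)
y_merged = matvec(lora_merge(W, A, B, alpha=2, r=1), x)
print(y_unmerged, y_merged)  # both are [19.0, -8.0]
```

The merged path does one matmul with a matrix the same size as the original, which is why merging removes the runtime overhead but produces a full-size artifact instead of a small adapter file.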

The hardware-relevant question is not "did loss go down?" It is: what artifact did you create, how big is it, what precision does it use, and how fast does it run on the target hardware?

What You Will Build

Artifact	Minimum expectation
Dataset	JSONL train/validation split with a documented chat template
Training run	Unsloth LoRA SFT run on `Qwen/Qwen3.5-4B-Base`
Adapter	Saved LoRA adapter with config, tokenizer, and training metadata
Evaluation	Base-vs-adapter comparison on held-out prompts
Export	Merged 16-bit model or quantized deployment artifact
Benchmark	VRAM, train tokens/s, eval tokens/s, adapter size, and output-quality notes

Hardware Budget

Use this course to learn the memory envelope before you start optimizing kernels.

Target	Use it for	Notes
12-16 GB CUDA GPU	small LoRA SFT at modest sequence length	Use BF16 on Ampere or newer; use FP16 when BF16 is unavailable
24 GB CUDA GPU	longer sequence length, larger batch, safer eval	Good local development target
Cloud L40S/A100/H100	sweeps over LoRA rank, sequence length, and batch	Use when you want clean throughput data
Jetson Orin Nano / NX	post-training inference profiling only	Do not treat Jetson as the primary training target

Keep the first run boring: sequence length 2048, LoRA rank 16, small clean dataset, and one evaluation script. Expand only after the pipeline is reproducible.
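
To see how LoRA rank moves the trainable-parameter count and adapter size, a back-of-envelope estimate helps. The sketch below assumes each adapted linear layer of shape `(d_in, d_out)` gains `rank * (d_in + d_out)` parameters. The layer dimensions are hypothetical placeholders, not Qwen3.5-4B-Base's real configuration; read the real values from the model's `config.json`.

```python
def lora_trainable_params(layer_shapes, rank):
    """Each adapted (d_in, d_out) layer adds A (rank x d_in) and B (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)

def adapter_size_mb(num_params, bytes_per_param=2):
    """Approximate adapter file size; bytes_per_param=2 assumes 16-bit storage."""
    return num_params * bytes_per_param / 1e6

# HYPOTHETICAL dimensions, chosen only to make the arithmetic concrete.
HIDDEN, KV_DIM, FFN, NUM_LAYERS = 2048, 512, 8192, 32
per_layer = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV_DIM),  # k_proj
    (HIDDEN, KV_DIM),  # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, FFN),     # gate_proj
    (HIDDEN, FFN),     # up_proj
    (FFN, HIDDEN),     # down_proj
]

for rank in (8, 16, 32):
    n = lora_trainable_params(per_layer * NUM_LAYERS, rank)
    print(f"rank={rank:>2}  params={n:,}  size~{adapter_size_mb(n):.1f} MB")
```

Parameter count scales linearly with rank, so doubling the rank doubles the adapter size and the optimizer state that goes with it.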

1. Define the Fine-Tuning Job

Start with a precise objective. A good objective is narrow enough that you can score the base model on it directly, using the same held-out prompts you will later use for the fine-tuned model:

domain Q&A over a hardware manual
terminal command explanation for CUDA/Jetson workflows
structured bug triage from logs
tool-call argument generation from natural language
short hardware-design tutoring with cited source snippets

Avoid turning the first run into a general assistant. A base model needs the training data to teach both behavior and format, so vague data creates vague failures.

Dataset Schema

Use JSONL with explicit roles:

{"system":"You are a concise AI hardware engineering assistant.","user":"Explain why Qwen decode is memory-bandwidth bound at batch 1.","assistant":"At batch 1, each generated token streams large weight matrices for GEMV while doing little arithmetic reuse..."}

Keep a validation file that the trainer evaluates on but never trains on:

data/qwen35_train.jsonl
data/qwen35_valid.jsonl
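
Before training, it is worth checking both files against the schema and confirming that no validation prompt also appears in the training split. The sketch below is one possible check using only the standard library; the function names are illustrative, and the leakage test is deliberately crude (exact match on the `user` field).

```python
import json

REQUIRED_KEYS = ("system", "user", "assistant")

def validate_jsonl_lines(lines):
    """Return (valid_rows, errors); errors are (line_number, message) pairs."""
    rows, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((lineno, "not a JSON object"))
            continue
        bad = [k for k in REQUIRED_KEYS
               if not isinstance(row.get(k), str) or not row[k].strip()]
        if bad:
            errors.append((lineno, "missing or empty: " + ", ".join(bad)))
        else:
            rows.append(row)
    return rows, errors

def find_leakage(train_rows, valid_rows):
    """Return validation prompts whose exact text also appears in training."""
    train_prompts = {r["user"].strip() for r in train_rows}
    return [r["user"] for r in valid_rows if r["user"].strip() in train_prompts]

sample = [
    '{"system":"s","user":"What is TTFT?","assistant":"Time to first token."}',
    '{"system":"s","user":"","assistant":"empty prompt"}',
]
rows, errors = validate_jsonl_lines(sample)
print(len(rows), errors)
```

On real files, pass an open file handle as `lines`, for example `validate_jsonl_lines(open("data/qwen35_train.jsonl", encoding="utf-8"))`.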

Chat Template

For supervised fine-tuning, the exact serialized text matters. Use the tokenizer's chat template when available instead of inventing one:

def format_example(example, tokenizer):
    messages = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    }

Failure mode to watch for: loss falls, but generation never stops or repeats role tags. That usually means an EOS/chat-template mismatch, not a hardware problem.
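
A cheap guard against this failure is to inspect the serialized training text and the raw generations as plain strings. The sketch below assumes you decode generations with special tokens kept. The token strings in the example are placeholders; in a real run, read them from `tokenizer.eos_token` and from the rendered chat template.

```python
def ends_with_eos(serialized_text, eos_token):
    """True if a serialized training example terminates with the EOS token."""
    return serialized_text.rstrip().endswith(eos_token)

def diagnose_generation(generated_text, eos_token, role_tag, hit_max_new_tokens):
    """Flag the two symptoms of an EOS/chat-template mismatch."""
    problems = []
    if hit_max_new_tokens and eos_token not in generated_text:
        problems.append("no EOS before token limit")
    if role_tag in generated_text:
        problems.append("role tag leaked into output")
    return problems

# Placeholder tokens for illustration; do not assume these match your tokenizer.
EOS = "<|im_end|>"
ROLE_TAG = "<|im_start|>"

print(ends_with_eos("...assistant answer" + EOS + "\n", EOS))
print(diagnose_generation("answer" + ROLE_TAG + "user again", EOS, ROLE_TAG, True))
```

Running `ends_with_eos` over a few rows of the mapped dataset before training is usually enough to catch a template that never emits the terminator.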

2. Environment Setup

Use a fresh virtual environment. Qwen3.5 support may require current transformers and Unsloth packages.

python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install --upgrade unsloth unsloth_zoo
pip install --upgrade datasets trl accelerate peft bitsandbytes torchvision pillow

At the time of writing, Qwen3.5 support requires a recent Unsloth release and Transformers v5. If package resolution installs an older Transformers build, follow Unsloth's current install guidance before debugging training code.

Record the actual versions:

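python - <<'PY'
from importlib.metadata import version  # reads installed metadata without importing the package
import torch, transformers, trl, peft
print("torch", torch.__version__)
print("cuda", torch.version.cuda)
print("transformers", transformers.__version__)
print("trl", trl.__version__)
print("peft", peft.__version__)
print("unsloth", version("unsloth"))
PY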

Put this output in your final report. Fine-tuning results without package versions are hard to reproduce.

3. Train with Unsloth LoRA

Use 16-bit LoRA first. Do not start with 4-bit QLoRA unless you have confirmed that your exact Qwen3.5 target and Unsloth version support it well enough for your quality bar.

# Import unsloth before trl/transformers so its patches are applied first.
from unsloth import FastLanguageModel

import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

MODEL_ID = "Qwen/Qwen3.5-4B-Base"
MAX_SEQ_LENGTH = 2048

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_ID,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    load_in_4bit=False,
    load_in_16bit=True,
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=MAX_SEQ_LENGTH,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)

raw = load_dataset(
    "json",
    data_files={
        "train": "data/qwen35_train.jsonl",
        "validation": "data/qwen35_valid.jsonl",
    },
)

def to_text(example):
    messages = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=False,
        )
    }

dataset = raw.map(to_text, remove_columns=raw["train"].column_names)

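trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,  # if your TRL version rejects this, pass processing_class=tokenizer
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=SFTConfig(
        output_dir="runs/qwen35-4b-unsloth-lora",
        dataset_text_field="text",  # set on SFTConfig; recent TRL no longer takes it on SFTTrainer
        max_seq_length=MAX_SEQ_LENGTH,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,  # effective batch = 1 x 8 = 8
        learning_rate=2e-4,
        warmup_ratio=0.03,
        num_train_epochs=1,
        logging_steps=10,
        eval_strategy="steps",  # without this, eval_steps is ignored and no evaluation runs
        eval_steps=100,
        save_steps=100,
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        optim="adamw_8bit",
        seed=3407,  # fixed so runs are comparable
        report_to="none",
    ),
)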

trainer.train()
trainer.save_model("artifacts/qwen35-4b-unsloth-lora")
tokenizer.save_pretrained("artifacts/qwen35-4b-unsloth-lora")

For a larger run, sweep one variable at a time:

Variable	Starting value	Sweep
LoRA rank	16	8, 16, 32
Sequence length	2048	1024, 2048, 4096
Effective batch	8	8, 16, 32
Learning rate	2e-4	5e-5, 1e-4, 2e-4

Do not compare runs unless the dataset split, seed, and eval prompts are fixed, along with every hyperparameter other than the one being swept.
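
One way to keep a sweep honest is to generate the run list from a single baseline so that each run differs in exactly one setting. The sketch below uses the starting values and sweep ranges from the table; the dictionary keys and helper names are illustrative and are not trainer arguments.

```python
BASELINE = {
    "lora_rank": 16,
    "max_seq_length": 2048,
    "effective_batch": 8,
    "learning_rate": 2e-4,
}
SWEEPS = {
    "lora_rank": [8, 16, 32],
    "max_seq_length": [1024, 2048, 4096],
    "effective_batch": [8, 16, 32],
    "learning_rate": [5e-5, 1e-4, 2e-4],
}

def one_at_a_time(baseline, sweeps):
    """Return the baseline plus runs that each change exactly one variable."""
    runs = [dict(baseline)]
    for name, values in sweeps.items():
        for value in values:
            if value == baseline[name]:
                continue  # the baseline run already covers this point
            run = dict(baseline)
            run[name] = value
            runs.append(run)
    return runs

def grad_accum_steps(effective_batch, per_device_batch=1, num_gpus=1):
    """Accumulation steps needed to reach the requested effective batch."""
    steps, remainder = divmod(effective_batch, per_device_batch * num_gpus)
    if remainder or steps == 0:
        raise ValueError("effective batch must be a multiple of per-device batch x GPUs")
    return steps

runs = one_at_a_time(BASELINE, SWEEPS)
print(len(runs), "runs")
for run in runs:
    print(run, "grad_accum =", grad_accum_steps(run["effective_batch"]))
```

With four variables and three values each, this produces nine runs rather than the 81 of a full grid, at the cost of not measuring interactions between variables.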

4. Evaluate Before You Export

Run the same held-out prompts against:

Qwen/Qwen3.5-4B-Base
base model + LoRA adapter
merged model, if you merge
quantized artifact, if you quantize

Measure both quality and systems behavior:

Metric	Why it matters
Validation loss	Fast sanity check, not final proof
Exact-format pass rate	Catches chat/template/schema regressions
Human or LLM-judge win rate	Measures task improvement
Hallucination / unsupported-claim rate	Important for hardware docs and manuals
Peak VRAM during train	Determines feasible local hardware
Train tokens/s	Captures loader, checkpointing, and GPU efficiency
Inference tokens/s and time to first token (TTFT)	Determines serving cost
Adapter size	Determines over-the-air (OTA) update feasibility
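
The exact-format pass rate is the easiest of these metrics to automate. The sketch below assumes the task expects a JSON object with a fixed set of keys, as in tool-call argument generation. The key names are hypothetical, and other output formats would need a different `passes_format` check.

```python
import json

def passes_format(output, required_keys):
    """True if the output parses as a JSON object containing every required key."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

def pass_rate(outputs, required_keys):
    """Fraction of model outputs that satisfy the format check."""
    if not outputs:
        return 0.0
    return sum(passes_format(o, required_keys) for o in outputs) / len(outputs)

# Hypothetical schema and outputs for illustration.
REQUIRED = ("tool", "args")
outputs = [
    '{"tool": "nvidia-smi", "args": ["-l", "1"]}',
    '{"tool": "tegrastats"}',
    'Sure! Here is the command: nvidia-smi',
    '{"tool": "nvcc", "args": ["--version"]}',
]
print(pass_rate(outputs, REQUIRED))  # 0.5
```

Run the same check on every artifact in the comparison so a drop after merging or quantization shows up as a number rather than an impression.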

Minimal eval table:

Model artifact	Pass rate	Win rate vs base	TTFT	Decode tok/s	Notes
Base
LoRA
Merged
Quantized
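
To fill in the TTFT and decode columns, time any streaming token source the same way for every artifact. The sketch below is runtime-agnostic: `measure_stream` takes any iterable that yields tokens as they are produced. `fake_stream` is a stand-in that sleeps instead of running a model, so the numbers it prints are meaningless beyond showing the shape of the result.

```python
import time

def measure_stream(token_stream, clock=time.perf_counter):
    """Measure time to first token and steady-state decode rate for one request."""
    start = clock()
    first = last = None
    count = 0
    for _ in token_stream:
        last = clock()
        if first is None:
            first = last
        count += 1
    if count == 0:
        return {"tokens": 0, "ttft_s": None, "decode_tok_s": None}
    decode_time = last - first
    # Decode rate covers the tokens after the first one; undefined for one token.
    decode_tok_s = (count - 1) / decode_time if decode_time > 0 else None
    return {"tokens": count, "ttft_s": first - start, "decode_tok_s": decode_tok_s}

def fake_stream(num_tokens, prefill_s=0.05, step_s=0.01):
    """Stand-in for a real streaming generator; sleeps instead of running a model."""
    time.sleep(prefill_s)
    for i in range(num_tokens):
        if i:
            time.sleep(step_s)
        yield i

print(measure_stream(fake_stream(5)))
```

Separating TTFT from decode rate matters because the first is dominated by prompt processing while the second reflects per-token generation cost; a single end-to-end tokens/s figure hides both.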

5. Merge, Export, and Deploy

Save the adapter first. Merge only after the adapter passes evaluation.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="artifacts/qwen35-4b-unsloth-lora",
    max_seq_length=2048,
    load_in_4bit=False,
)

model.save_pretrained_merged(
    "artifacts/qwen35-4b-merged-16bit",
    tokenizer,
    save_method="merged_16bit",
)

For deployment, choose the artifact that matches the runtime:

Runtime path	Artifact	What to check
Transformers / PEFT	base + LoRA adapter	adapter load latency and extra matmul overhead
vLLM / TensorRT-LLM	merged model	version support for Qwen3.5, tokenizer, RoPE, QKV bias, and supported architecture
llama.cpp / GGUF	quantized GGUF	post-quant quality, prompt template, and tok/s
Jetson edge demo	small quantized artifact	RAM pressure, thermals, and sustained decode speed

If you quantize after fine-tuning, evaluate again. A fine-tune that improves BF16 quality can still regress after aggressive quantization.
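
A rough size estimate helps decide which artifact is worth exporting before you spend time on it. The sketch below takes "4B" at face value as the parameter count, which is an assumption; use the real count from the model card. It models group-wise quantization as the weight bits plus one 16-bit scale per group, and ignores zero-points, metadata, and tensors kept at higher precision, so real files will be somewhat larger.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights at a given average bit width."""
    return num_params * bits_per_weight / 8

def effective_bits(weight_bits, group_size, scale_bits=16):
    """Average bits per weight when each group of weights shares one scale."""
    return weight_bits + scale_bits / group_size

NOMINAL_PARAMS = 4e9  # assumption: "4B" read literally

for label, bits in [
    ("merged 16-bit", 16),
    ("8-bit, group 128", effective_bits(8, 128)),
    ("4-bit, group 128", effective_bits(4, 128)),
]:
    print(f"{label:<18} ~{weight_bytes(NOMINAL_PARAMS, bits) / 1e9:.2f} GB")
```

The same byte counts feed directly into the memory-traffic side of inference, since at small batch sizes every generated token has to stream the weights once.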

6. Hardware Note Template

Every student should finish with a one-page hardware note:

# Qwen3.5-4B Unsloth Fine-Tuning Hardware Note

## Training setup
- GPU:
- VRAM:
- CUDA / PyTorch / Unsloth:
- Sequence length:
- LoRA rank:
- Effective batch:

## Results
- Peak train VRAM:
- Train tokens/s:
- Final train loss:
- Validation loss:
- Adapter size:
- Merged model size:

## Inference
- Runtime:
- Precision / quantization:
- Prompt length:
- TTFT:
- Decode tok/s:
- Peak memory:

## Interpretation
- What changed versus the base model?
- Did the adapter create measurable inference overhead?
- What precision is the best deployment compromise?
- What would this imply for SRAM, memory bandwidth, and kernel fusion on a custom accelerator?

The last four questions are the bridge from "I fine-tuned a model" to "I understand what this workload costs in hardware."
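
For the interpretation section, a first-order bandwidth model is often enough to sanity-check measured decode speed. The sketch below assumes batch-1 decode is limited by memory bandwidth and that each token reads all weights plus the KV cache once. The KV-cache formula assumes standard attention with grouped KV heads, which may not match Qwen3.5's actual layer layout. Every number in the example is a placeholder, not a model or datasheet value.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Keys plus values for every layer, under a standard-attention assumption."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch

def decode_upper_bound_tok_s(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling: one full read of weights and KV cache per token."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)

# PLACEHOLDER values for illustration only.
kv = kv_cache_bytes(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=2048)
weights_16bit = 8e9   # e.g. 4e9 parameters at 2 bytes each
weights_4bit = 2e9    # same parameter count at 4 bits, ignoring overhead

for label, bandwidth in [("100 GB/s device", 100e9), ("1 TB/s device", 1e12)]:
    for name, w in [("16-bit", weights_16bit), ("4-bit", weights_4bit)]:
        bound = decode_upper_bound_tok_s(w, kv, bandwidth)
        print(f"{label:<16} {name:<7} <= {bound:.1f} tok/s")
```

If a measured decode rate sits far below this ceiling, the gap points at kernel launch overhead, dequantization cost, or a runtime that is not bandwidth-bound, which is exactly the kind of observation the hardware note should record.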

Common Failure Modes

Symptom	Likely cause	Fix
Training OOMs immediately	sequence length or batch too high	lower sequence length, use gradient checkpointing, reduce batch
Loss falls but eval is worse	bad data, memorization, wrong eval template	inspect examples, deduplicate, fix template
Model repeats role tags	EOS/chat-template mismatch	verify tokenizer template and assistant termination
Exported model differs from adapter	merge/export path bug	compare base+adapter vs merged before quantization
Quantized artifact loses task skill	quantization too aggressive	try higher-bit quant or exempt sensitive tensors
Runtime crashes on Qwen weights	architecture support gap	verify QKV bias, RoPE layout, tokenizer, and model config support
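
For the "exported model differs from adapter" row, a simple numeric comparison makes the check concrete. The sketch below assumes you have already collected greedy-decoded token IDs, and optionally next-token logits, from both artifacts for the same prompt. The helper names and sample values are illustrative.

```python
def max_abs_diff(logits_a, logits_b):
    """Largest element-wise gap between two logit vectors of equal length."""
    if len(logits_a) != len(logits_b):
        raise ValueError("logit vectors must have the same length")
    return max((abs(a - b) for a, b in zip(logits_a, logits_b)), default=0.0)

def token_agreement(ids_a, ids_b):
    """Fraction of positions with identical tokens; length mismatch counts against."""
    longest = max(len(ids_a), len(ids_b))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(ids_a, ids_b))
    return matches / longest

# Invented sample data standing in for real model outputs.
adapter_ids = [101, 7, 42, 9, 2]
merged_ids = [101, 7, 42, 11, 2]
print(token_agreement(adapter_ids, merged_ids))       # 0.8
print(max_abs_diff([0.5, 1.0, -2.0], [0.5, 1.25, -2.0]))  # 0.25
```

Small logit differences are expected from 16-bit rounding during the merge; a low token-agreement rate under greedy decoding is the stronger signal that the export path went wrong.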

Capstone

Fine-tune Qwen/Qwen3.5-4B-Base on a small hardware-engineering instruction dataset, then deploy the result through two inference paths:

base + LoRA adapter in Transformers
merged or quantized artifact in an inference runtime

Deliver:

data_card.md describing source, filters, split, and template
train_config.yaml or equivalent script arguments
saved LoRA adapter
base-vs-finetuned evaluation report
inference benchmark table
hardware note connecting the results to memory, precision, and serving cost

Resources

Resource	Why it matters
Qwen/Qwen3.5-4B-Base model card	Source model, config, license, and usage notes
Unsloth Qwen3.5 fine-tuning guide	Current Unsloth support, recommended precision, and training examples
TRL SFTTrainer documentation	Trainer API used for supervised fine-tuning
PEFT LoRA documentation	Adapter mechanics and tunable parameters
Phase 5 - Qwen Inference Optimization	Companion inference course for profiling the exported model

Next

Continue to Phase 5 - Qwen Inference Optimization once the fine-tuned artifact is ready to benchmark.