Training Your Own LoRA

In this post we walk through efficient LoRA training with three popular frameworks: kohya-ss/sd-scripts, the Ostris AI-Toolkit, and ComfyUI's FluxTrainer nodes. It covers the core concept, environment setup, dataset preparation with BLIP-2 captioning, a detailed walkthrough of each toolchain, infrastructure on Runpod, and best practices for mixed precision, caching, and scheduling.

What Is LoRA?

Low‑Rank Adaptation (LoRA) is a fine‑tuning technique that injects small, trainable low‑rank matrices (the “adapters”) into selected weight layers of a large pre‑trained model, while freezing the original weights. By optimizing only these adapters, LoRA achieves:

  • Parameter Efficiency: Only a small set of adapter parameters is learned per adapted layer instead of updating the full model's weights.

  • Memory Savings: Reduced trainable footprint enables mixed‑precision and 8‑bit optimizers even on mid‑range GPUs.

  • Faster Convergence: A smaller parameter set means fewer gradients to compute and fewer optimizer states to update.
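
Conceptually, LoRA keeps the pretrained weight W frozen and learns an additive low-rank update, so the adapted layer computes W·x plus a scaled B·A·x, where A and B have rank r. The snippet below is a minimal, framework-agnostic PyTorch sketch of that idea; the class name LoRALinear and its parameters are ours for illustration and do not come from any of the toolkits covered later.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # B starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus the scaled low-rank update B @ A
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale

# Only the adapter matrices A and B receive gradients
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable parameters vs. 590,592 frozen in the base layer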

Why Three Frameworks?

Each of the following toolchains offers unique strengths when training LoRA adapters on Stable Diffusion and Flux backbones:

  1. kohya‑ss / sd‑scripts

    • Scriptable CLI & GUI: Extensive flags for optimizer choice (AdamW8bit, Adafactor), caching, checkpointing, and scheduling.

    • TOML‑driven Dataset: Fine control over resolution buckets, shuffling, and caption extensions.

  2. Ostris AI‑Toolkit

    • YAML Job Specs: One file describes the entire pipeline—model, optimizer, layers to adapt, sampling procedure, and EMA.

    • Built‑in Performance Logs: Automated logging, checkpoint rotation, and flowmatch noise scheduling for sharper results.

  3. ComfyUI FluxTrainer Nodes

    • Visual Node Graph: Drag‑and‑drop building blocks for model loading, adapter injection, data pipelines, and saving.

    • Low‑VRAM Mode: Optimized attention and BF16 support for training on 10–12 GB cards without scripting.

Dataset Preparation & Captioning

Download, collect, or scrape images into a single folder. Every image should reflect the concept you want the LoRA to learn; for example, 5-20 different images of a person are enough to train a LoRA on that person. The second step, captioning, is the most important one.

BLIP‑2 Captioning

We leveraged Salesforce's BLIP-2 (blip2-opt-2.7b-coco) for automatic captions; the COCO-finetuned checkpoint is state-of-the-art on COCO captioning (+2.8 CIDEr) (Hugging Face). Dynamic extension filtering and .txt outputs align with sd-scripts' requirements. The basic script below reads each image, generates a caption, and writes the images and captions in the dataset layout kohya-ss expects.

import shutil
from pathlib import Path
from transformers import pipeline
from PIL import Image
import torch
# Directories
IMAGES_DIR = Path("Images")
OUTPUT_DIR = Path("prepared_dataset")
CAPTION_EXT = ".txt"
# Create output and subset directories
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
SUBSET_DIR = OUTPUT_DIR / "Images"
SUBSET_DIR.mkdir(exist_ok=True)
# 1. Retrieve all Pillow-supported image extensions dynamically
#    registered_extensions() returns dict: {'.png': 'PNG', '.jpg': 'JPEG', ...}
exts = Image.registered_extensions()
# 2. Filter for extensions that Pillow can open
SUPPORTED_EXTS = {
    ext.lower() for ext, fmt in exts.items()
    if fmt in Image.OPEN
}
# Initialize the Hugging Face captioning pipeline
captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip2-opt-2.7b-coco",
    torch_dtype=torch.float32,
    device=0  # use -1 for CPU
)
# Process each supported image in the source folder
for img_path in IMAGES_DIR.iterdir():
    if img_path.suffix.lower() in SUPPORTED_EXTS:
        # Copy image to the subset directory
        dest_img = SUBSET_DIR / img_path.name
        shutil.copy(img_path, dest_img)
        # Open and caption the image
        image = Image.open(img_path).convert("RGB")
        result = captioner(image, max_new_tokens=50)[0]["generated_text"]
        # Write caption to .txt file alongside the image
        caption_file = dest_img.with_suffix(CAPTION_EXT)
        with open(caption_file, "w", encoding="utf-8") as f:
            f.write(result.strip() + "\n")
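
After the script runs, each image is copied into prepared_dataset/Images with a caption file of the same name beside it, which is the flat image-plus-.txt layout the kohya-ss dataset loader expects. The filenames below are just placeholders:

prepared_dataset/
└── Images/
    ├── portrait_01.jpg
    ├── portrait_01.txt
    ├── portrait_02.png
    └── portrait_02.txt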

kohya‑ss Walkthrough

This section details how to set up a reproducible Python environment for LoRA training with kohya‑ss’s SD‑Scripts (sd3 branch) and install all necessary dependencies—accelerate, bitsandbytes, xformers, PyTorch for CUDA 12.4, OpenCV‑Python, and Hugging Face Diffusers—along with optional TMUX for session management. By following these steps, you’ll isolate your project in a virtual environment, ensure compatibility with your GPU drivers, and leverage mixed‑precision and 8‑bit optimizers to maximize VRAM efficiency during training.

Environment Setup & Dependency Installation

Clone the SD‑Scripts repository and check out the sd3 branch.
git clone https://github.com/kohya-ss/sd-scripts.git

cd sd-scripts

git checkout sd3

This pulls the kohya-ss sd-scripts code and switches to the sd3 branch, where Flux and Stable Diffusion 3 support reside (GitHub).

Create a Python virtual environment using venv.

python3.10 -m venv venv

The venv module creates an isolated environment so installed packages don’t conflict with the system Python.

Activate the new environment.

source venv/bin/activate

Activation ensures subsequent pip installs target only this venv and not your global site‑packages.

Upgrade pip to the latest version.

pip install --upgrade pip

Keeping pip up to date avoids installer bugs and ensures compatibility with newer wheel formats.

Install the core requirements from the repository.

pip install -r requirements.txt

This brings in SD‑Scripts dependencies such as transformers, tqdm, and safetensors as specified by the project (GitHub).

Install accelerate and bitsandbytes.

pip install accelerate bitsandbytes

accelerate handles mixed‑precision and distributed setups, while bitsandbytes provides 8‑bit AdamW optimizers for reduced VRAM usage.
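
Before the first training launch, it is worth running accelerate's one-time interactive configuration so that later accelerate launch calls pick up your GPU count and precision settings (this is the standard accelerate CLI, nothing specific to sd-scripts):

accelerate config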

Install xformers for memory‑efficient attention.

pip install xformers

XFormers accelerates Transformer attention kernels, significantly lowering GPU memory during training.

Install PyTorch matching CUDA 12.4.

pip install torch==2.4.1 torchvision --index-url https://download.pytorch.org/whl/cu124

Installing the specific PyTorch build for CUDA 12.4 ensures full GPU acceleration and compatibility with the other libraries in this stack.

Install OpenCV‑Python.

pip install --upgrade opencv-python

OpenCV supports image I/O and preprocessing tasks integral to dataset loading in SD‑Scripts.

Install Hugging Face Diffusers with PyTorch extras.

pip install --upgrade diffusers[torch]

This provides the Diffusers library optimized for PyTorch backends, enabling model and scheduler utilities.

(Optional) Install tmux for long‑running sessions.

sudo apt update && sudo apt install tmux -y

tmux lets you detach from and reattach to long-running remote training sessions without interruption.

Model & Encoder Downloads

  1. Set your Hugging Face token.

export HF_TOKEN="hf_YOURTOKEN"

The HF_TOKEN environment variable allows authenticated access to private or gated repositories via the CLI (huggingface.co).

  2. Download the Flux base model and autoencoder.

huggingface-cli download black-forest-labs/FLUX.1-dev \
    flux1-dev.safetensors \
    ae.safetensors \
    --local-dir .

This command fetches the FLUX.1-dev model checkpoint and its autoencoder into the current directory for downstream LoRA training (huggingface.co).

  3. Prepare the sd3 directory and download the text encoders.

mkdir -p sd3
huggingface-cli download comfyanonymous/flux_text_encoders \
  clip_l.safetensors \
  t5xxl_fp16.safetensors \
  --local-dir sd3

The CLIP-L and T5-XXL (FP16) text encoders provide the text-conditioning backbones that the sd3 branch of sd-scripts expects for Flux training.

Dataset Preparation & Captioning

Scraping & Organizing Images

  • Collect and place all training images under my_lora_dataset/Images.

  • Ensure one folder per LoRA project, with consistent naming and file formats (JPEG, PNG).

Then run the captioning script from the previous section to generate a .txt caption next to each image. The training commands below also expect a dataset_config.toml; a minimal sketch follows.
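
The --dataset_config flag used in the training commands points at a TOML file that tells sd-scripts where the images live and how to bucket and repeat them. Below is a minimal sketch following the sd-scripts dataset-config format; the paths, resolution, batch size, and repeat count are placeholders to adapt to your own project.

# dataset_config.toml
[general]
shuffle_caption = false
caption_extension = ".txt"

[[datasets]]
resolution = 1024
batch_size = 1
enable_bucket = true

  [[datasets.subsets]]
  image_dir = "prepared_dataset/Images"  # folder holding the images and their .txt captions
  num_repeats = 10                       # how many times each image is seen per epoch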

Training Execution & Flags for kohya‑ss

LoRA on U-Net Only

export NCCL_BLOCKING_WAIT=1
export NCCL_IB_TIMEOUT=1000
export NCCL_ASYNC_ERROR_HANDLING=1

accelerate launch --num_cpu_threads_per_process 32 \
       flux_train_network.py \
    --mixed_precision bf16 \
    --pretrained_model_name_or_path flux1-dev.safetensors \
    --clip_l sd3/clip_l.safetensors \
    --t5xxl sd3/t5xxl_fp16.safetensors \
    --ae ae.safetensors \
    --save_model_as safetensors \
    --sdpa \
    --persistent_data_loader_workers \
    --max_data_loader_n_workers 16 \
    --seed 42 \
    --save_precision bf16 \
    --network_module networks.lora_flux \
    --network_dim 8 \
    --network_train_unet_only \
    --optimizer_type adamw \
    --learning_rate 1e-4 \
    --gradient_checkpointing \
    --fp8_base \
    --highvram \
    --cache_text_encoder_outputs \
    --cache_latents_to_disk \
    --cache_text_encoder_outputs_to_disk \
    --max_train_epochs 25 \
    --save_every_n_epochs 10 \
    --dataset_config dataset_config.toml \
    --output_dir output_lora \
    --output_name flux_lora \
    --timestep_sampling shift \
    --discrete_flow_shift 3.1582 \
    --model_prediction_type raw \
    --guidance_scale 1.0

  • accelerate launch orchestrates the run across your CPUs/GPUs using the saved accelerate config.

  • --mixed_precision bf16 runs the forward and backward passes in BF16, reducing memory usage.

  • --network_module networks.lora_flux points to the Flux LoRA implementation in the codebase.

  • --network_dim 8 sets the low-rank adapter dimension (r=8).

  • --network_train_unet_only restricts training to the diffusion backbone (the "U-Net" in the flag name; Flux's transformer blocks), leaving the text encoders frozen.

  • The caching flags (cache_text_encoder_outputs, cache_latents_to_disk, etc.) precompute text-encoder outputs and VAE latents once and reuse them, so those components don't have to stay resident in VRAM during training.

  • --gradient_checkpointing recomputes activations during the backward pass, trading extra compute for lower memory.

Full Fine‑Tuning (Flux‑FT)

accelerate launch --num_cpu_threads_per_process 32 \
    flux_train.py \
    --mixed_precision bf16 \
    --pretrained_model_name_or_path flux1-dev.safetensors \
    --clip_l sd3/clip_l.safetensors \
    --t5xxl sd3/t5xxl_fp16.safetensors \
    --ae ae.safetensors \
    --save_model_as safetensors \
    --sdpa \
    --persistent_data_loader_workers \
    --max_data_loader_n_workers 16 \
    --seed 42 \
    --save_precision bf16 \
    --optimizer_type Adafactor \
    --learning_rate 1e-4 \
    --gradient_checkpointing \
    --highvram \
    --cache_text_encoder_outputs_to_disk \
    --cache_latents_to_disk \
    --save_every_n_epochs 10 \
    --max_train_epochs 25 \
    --dataset_config dataset_config.toml \
    --output_dir output_flux_ft \
    --output_name flux_ft \
    --timestep_sampling shift \
    --discrete_flow_shift 3.1582 \
    --model_prediction_type raw \
    --guidance_scale 1.0 \
    --full_bf16 \
    --fused_backward_pass \
    --blocks_to_swap 8

  • --optimizer_type Adafactor uses a lower-memory alternative to AdamW.

  • --full_bf16 keeps all weights and gradients in BF16 (full BF16 training), further reducing memory.

  • --fused_backward_pass fuses the optimizer step into the backward pass so gradients can be freed immediately, lowering peak memory (supported with Adafactor).

  • --blocks_to_swap 8 swaps that many transformer blocks between GPU and CPU memory during training, trading some speed for a lower VRAM footprint.

Resuming & Checkpointing

To resume from an existing checkpoint:

accelerate launch ... --resume

Use the --resume flag and point it at a saved training state. sd-scripts writes these state directories when training runs with --save_state; resuming from one restores the optimizer and scheduler state along with the weights, so training continues where it left off.

Optional Flags

  • --cache_text_encoder_outputs_to_disk: cache text-encoder outputs to disk instead of RAM.

  • --cache_latents_to_disk: cache VAE latents to disk.

  • --cache_text_encoder_outputs: keep text-encoder outputs cached in RAM.

  • --max_data_loader_n_workers: number of data-loader worker processes.

  • --save_precision: precision used when saving checkpoints (fp16/bf16).

  • --sdpa: use PyTorch's scaled dot-product attention, which dispatches to FlashAttention kernels when available.

For scale: training a LoRA for 75k steps took us roughly 27 days on a single 48 GB GPU and 3-4 days on a B200 with about 200 GB of VRAM, so be patient.


Ostris AI‑Toolkit Walkthrough

Ostris AI‑Toolkit is a comprehensive, YAML‑driven suite for diffusion model fine‑tuning—LoRA included—designed to run efficiently on consumer GPUs and integrate seamlessly with cloud deployments like Runpod (GitHub).

YAML Job Configuration

Define your entire pipeline in one Job.yaml (or similar) file:

job: extension
config:
  name: custom
  process:
  - type: sd_trainer
    training_folder: /workspace/outputs/
    performance_log_every: 100
    device: cuda:0
    trigger_word: mani
    network:
      type: lora
      linear: 32
      linear_alpha: 32
      dropout: 0.25
      network_kwargs:
        only_if_contains:
        - transformer.single_transformer_blocks.7.proj_out
        - transformer.single_transformer_blocks.20.proj_out
    save:
      dtype: float16
      save_every: 100
      max_step_saves_to_keep: 2
      save_format: safetensors
    datasets:
    - folder_path: /workspace/custom
      caption_ext: txt
      caption_dropout_rate: 0
      token_dropout_rate: 0
      cache_latents_to_disk: true
      resolution:
      - 1024
    train:
      batch_size: 32
      loss_type: mse
      train_unet: true
      train_text_encoder: false
      steps: 600
      gradient_checkpointing: true
      noise_scheduler: flowmatch
      skip_first_sample: true
      ema_config:
        use_ema: true
        ema_decay: 0.9999
      dtype: bf16
      lr: 1
      optimizer: prodigy
      optimizer_params:
        weight_decay: 0.01
        decouple: true
        d0: 0.0001
        use_bias_correction: false
        safeguard_warmup: false
        d_coef: 1.0
    model:
      name_or_path: black-forest-labs/FLUX.1-dev
      is_flux: true
      quantize: true
    sample:
      sampler: flowmatch
      sample_every: 100
      width: 1024
      height: 1024
      prompts:
      - "[trigger], photo of a bearded guy"
      - "[trigger], photo of a bearded guy wearing suit"
      neg: ''
      seed: 42
      walk_seed: true
      guidance_scale: 4
      sample_steps: 20
meta:
  name: Job
  version: '1.0'

  • sd_trainer orchestrates adapter injection, training, and checkpointing in one process.

  • only_if_contains lets you target specific transformer layers for adapter insertion, reducing compute overhead.

  • save.max_step_saves_to_keep prunes older checkpoints so only the most recent ones stay on disk (and get synced to S3).

Runpod Serverless Deployment

Ostris integrates with Runpod’s serverless framework via a Docker image and a handler script:

  • A Dockerfile based on nvidia/cuda:11.8.0-base-ubuntu22.04 installs Python 3.10, Git, and FFmpeg, then pip-installs torch, the runpod SDK, and the toolkit requirements (YouTube).

  • rp_handler.py receives the job inputs (datasetURL, config_url, config_name), sets the AWS environment variables, and calls getDatasetAndRun.sh to pull the data and launch training (GitHub); a minimal handler sketch follows.
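
For orientation, here is a minimal sketch of what such a serverless handler can look like using Runpod's Python SDK. The input keys match those listed above; the environment-variable names and the return payload are illustrative assumptions, not the toolkit's actual handler code.

import os
import subprocess
import runpod  # Runpod serverless SDK

def handler(job):
    # Job inputs as described above: dataset URL plus config location and name
    inputs = job["input"]
    env = os.environ.copy()
    env["DATASET_URL"] = inputs["datasetURL"]          # variable names chosen for illustration
    env["CONFIG_URL"] = inputs["config_url"]
    env["CONFIG_NAME"] = inputs.get("config_name", "config")

    # Hand off to the shell script that downloads the dataset and starts training
    subprocess.run(["bash", "getDatasetAndRun.sh"], env=env, check=True)
    return {"status": "training finished", "config": env["CONFIG_NAME"]}

# Register the handler with Runpod's serverless runtime
runpod.serverless.start({"handler": handler})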

Shell Script Orchestration (getDatasetAndRun.sh)

Dataset Download & Extraction:

wget -O /tmp/dataset.zip "$DATASET_URL"
unzip -q /tmp/dataset.zip -d /workspace/custom

 Automatically handles single‑folder or multi‑folder zips.

Config Retrieval:

wget -O ai-toolkit/config/config.yaml "$CONFIG_URL"

Optional Diffusers Reinstall:

pip uninstall -y diffusers
pip install git+https://github.com/huggingface/diffusers

Checkpoint Sync to S3:

inotifywait -m /workspace/outputs/custom -e create |
while read -r _ _ file; do
  aws s3 cp "/workspace/outputs/custom/$file" "s3://$BUCKET/$UID/checkpoints/$file"
done

Ensures intermediate .safetensors are backed up in real time.

Key Parameters & Best Practices

  • dtype: bf16 and cache_latents_to_disk slash VRAM usage on 10–16 GB cards. (Replicate)

  • noise_scheduler: flowmatch matches the flow-matching objective Flux was trained with, giving sharper denoising trajectories than a mismatched scheduler. (GitHub)

  • EMA (ema_decay: 0.9999) stabilizes adapter weights over long runs.

  • Checkpoint Pruning via max_step_saves_to_keep prevents S3 flooding.

Resume & Monitoring

  • Automated Logging: performance_log_every outputs loss metrics every N steps to console/S3.

  • Checkpoint Resume: Re-run with the same YAML; AI‑Toolkit auto‑detects the latest checkpoint in /workspace/outputs/ and continues.

  • Real‑Time Dashboard: Use Runpod’s UI or aws s3 sync to fetch logs and checkpoints locally for monitoring.
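
As a concrete example of the last point, a one-off sync of everything the run has uploaded can look like the following; $BUCKET and $UID are the same variables used in getDatasetAndRun.sh above and stand in for your own values.

aws s3 sync "s3://$BUCKET/$UID/checkpoints" ./checkpoints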

This standalone section delivers everything needed to configure, deploy, and manage LoRA training with Ostris AI‑Toolkit.

ComfyUI FluxTrainer Nodes

ComfyUI’s FluxTrainer nodes embed LoRA training directly into the ComfyUI visual workflow, enabling users to fine‑tune Stable Diffusion adapters without leaving the UI (GitHub, RunComfy).

Key Features

  • Low‑VRAM Training (10–12 GB): FluxTrainer leverages optimized attention kernels and BF16 precision to run on consumer GPUs with as little as 10 GB of VRAM.

  • Seamless UI Integration: Training is performed via draggable nodes—no separate scripts—so you can build, modify, and compare workflows entirely within ComfyUI (RunComfy).

  • Kohya‑SS Backend Parity: Under the hood, FluxTrainer wraps kohya‑ss’s sd‑scripts, ensuring feature parity (caching, mixed precision, schedulers) with CLI methods.

  • Layer‑Specific Adapter Injection: The InitFluxLoRATraining node allows you to target specific transformer blocks (e.g., blocks 2 & 7) for adapter placement, conserving compute.

Workflow Setup

  1. Install FluxTrainer Nodes:
    Clone the ComfyUI-FluxTrainer repo into your custom_nodes folder or install via ComfyUI Manager (GitHub).

  2. Load an Example Workflow:
    Import the flux_lora_train_example_01.json workflow to get starter nodes pre‑configured (Reddit).

  3. Configure Dataset Nodes:
    Use TrainDatasetGeneralConfig and TrainDatasetAdd nodes to point at your prepared image folder (with optional captions) (RunComfy).

  4. Initialize Training:
    Wire up the InitFluxLoRATraining node with model checkpoints (transformer, VAE, CLIP_L, T5), set adapter rank (r) and α, and define epochs—mirroring kohya‑ss flags (YouTube).

  5. Execute Training: Click Queue in ComfyUI to start the training loop; monitor VRAM usage and console logs in real time within the UI (YouTube).

Dataset Generation Use Case

We have used this approach when the dataset itself was generated inside ComfyUI, for example using the well-known Flux Kontext pipeline to create 20 variations of a person, and then trained a LoRA on that generated dataset (YouTube). The original workflow was created by Lovis Odin.

Practical Tips

  • Enable split_mode=true: on cards with limited memory, split_mode trains the model in parts so the whole network never has to be resident at once, preventing out-of-memory errors.

  • Cloud Deployment via ComfyAI.run: Run ComfyUI workflows on scalable infrastructure for faster throughput and higher VRAM options.

  • Resume Training Automatically: Re‑run the same workflow; FluxTrainer auto‑detects existing checkpoints and continues from the last saved epoch.

Conclusion

Each framework offers unique strengths:

  • kohya‑ss for scriptable, deeply configurable pipelines;

  • Ostris for consumer‑grade GUI/CLI ease and targeted layer control;

  • ComfyUI FluxTrainer for seamless low‑VRAM, visual experimentation.

When VRAM is ample (>24 GB), kohya-ss's full scripting and advanced flags shine. For budget GPUs or rapid visual experimentation, ComfyUI's FluxTrainer is unbeatable. Ostris AI-Toolkit sits in the middle, giving CLI simplicity plus YAML-driven specificity.

Not able to decide which option is better for you? Let's talk and figure that out together.