# Transformer Models

## PanFormer
Full transformer architecture with cross-attention fusion.
```python
from models import PanFormer

model = PanFormer(
    ms_bands=4,
    embed_dim=128,
    depth=4,
    num_heads=8,
    patch_size=4,
)
```
Architecture:

```text
Input: MS + PAN
        |
        v
Patch Embedding (4x4 patches)
        |
        +---> MS Stream: Self-Attention Blocks
        |
        +---> PAN Stream: Self-Attention Blocks
        |
        v
Cross-Attention Fusion (MS queries, PAN keys/values)
        |
        v
Decoder + Reconstruction Head
        |
        v
Output + Residual Connection
```
Key Components:
- Patch Embedding: Converts images to patch tokens
- Self-Attention: Models long-range dependencies within each stream
- Cross-Attention: Fuses information between MS and PAN streams
- Progressive Decoder: Reconstructs high-resolution output
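The cross-attention fusion step can be sketched in NumPy. This is an illustrative single-head version under assumed shapes, not PanFormer's actual implementation; the projection weights here are random stand-ins for learned parameters:

```python
import numpy as np

def cross_attention(ms_tokens, pan_tokens, d):
    """Single-head cross-attention: MS tokens query PAN tokens,
    so spatial detail flows from the PAN stream into the MS stream."""
    rng = np.random.default_rng(0)
    # Random stand-ins for learned Q/K/V projection matrices
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q = ms_tokens @ Wq           # queries come from the MS stream
    K = pan_tokens @ Wk          # keys come from the PAN stream
    V = pan_tokens @ Wv          # values come from the PAN stream
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over PAN tokens
    return weights @ V           # (num_ms_tokens, d) fused features

ms = np.random.rand(64, 128)     # 64 MS patch tokens, embed_dim=128
pan = np.random.rand(64, 128)    # 64 PAN patch tokens
fused = cross_attention(ms, pan, 128)
print(fused.shape)               # (64, 128)
```

Because the queries come from the MS stream, the output keeps one fused token per MS token, which is what lets the decoder reconstruct an MS-resolution output.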
## PanFormerLite
Lightweight transformer optimized for efficiency.
```python
from models import PanFormerLite

model = PanFormerLite(
    ms_bands=4,
    embed_dim=64,
    depth=2,
    num_heads=4,
    window_size=8,
)
```
Optimizations:

- Smaller embedding dimension (64 vs 128)
- Fewer transformer blocks (2 vs 4)
- Window attention instead of global attention
- ~370K parameters vs ~1M for full PanFormer
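A back-of-envelope estimate shows where most of the savings come from, assuming standard transformer blocks with a 4x MLP expansion (the exact layer shapes in these models may differ, and embedding/decoder parameters are omitted):

```python
def block_params(d: int) -> int:
    """Rough parameter count for one transformer block:
    attention (Q, K, V, output projections) ~ 4*d^2,
    MLP with 4x hidden expansion ~ 8*d^2. Biases and norms omitted."""
    return 4 * d * d + 8 * d * d

full = 4 * block_params(128)  # PanFormer blocks: 786432 (~786K)
lite = 2 * block_params(64)   # PanFormerLite blocks: 98304 (~98K)
print(full, lite, full // lite)  # 786432 98304 8
```

Halving the embedding dimension cuts per-block parameters by 4x, and halving the depth doubles that again, so the transformer blocks alone shrink by roughly 8x.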
## Window Attention
PanFormerLite uses window-based attention for efficiency:
```text
Image (H, W)
  -> Split into windows (window_size x window_size)
  -> Self-attention within each window
  -> Merge windows
```
This reduces attention complexity from O(N^2) to O(N * window_size^2), where N is the number of patch tokens.
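The split/merge steps above can be sketched as reshape-and-transpose operations in NumPy; a minimal sketch, assuming non-overlapping windows and spatial sizes divisible by `window_size` (function names here are illustrative, not the library's API):

```python
import numpy as np

def window_partition(x: np.ndarray, w: int) -> np.ndarray:
    """Split a (H, W, C) feature map into non-overlapping (w*w, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    # Group the two window-grid axes together -> (num_windows, w*w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_merge(windows: np.ndarray, w: int, H: int, W: int) -> np.ndarray:
    """Inverse of window_partition: reassemble the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

x = np.random.rand(32, 32, 64)
wins = window_partition(x, 8)   # (16, 64, 64): 16 windows of 8x8 tokens each
# ...self-attention would run independently inside each window here...
out = window_merge(wins, 8, 32, 32)
assert np.allclose(out, x)      # partition/merge round-trips exactly
```

Each window attends only over its own 64 tokens, so the quadratic cost applies per window rather than over all 1024 tokens at once.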
## Training Tips

**Best Practices**
- Learning Rate: Use lower LR (1e-4 to 5e-5) for transformers
- Warmup: Always use LR warmup (5-10 epochs)
- Epochs: Train for 100+ epochs for best results
- Loss: Use `spectral_focus` or `advanced` loss
- Batch Size: Larger batches help transformer training
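The warmup advice above can be sketched as a simple schedule: linear warmup to the base LR, then cosine decay. This is an assumed, illustrative schedule, not necessarily what the training scripts implement:

```python
import math

def lr_at_epoch(epoch, base_lr=1e-4, warmup_epochs=5, total_epochs=100):
    """Linear warmup over the first epochs, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at_epoch(0))   # 2e-05: ramping up
print(lr_at_epoch(4))   # 0.0001: warmup complete, at base LR
print(lr_at_epoch(99))  # ~0: decayed by the end of training
```

Warmup matters for transformers because attention layers are unstable early in training; starting at the full LR often diverges where a CNN would not.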
## Example Training
```bash
python scripts/run_deep_learning.py \
    --model panformer_lite \
    --loss spectral_focus \
    --epochs 100 \
    --lr 5e-5
```
## Benchmark Results
| Model | PSNR (dB) | SSIM | Parameters | Training Time (relative) |
|---|---|---|---|---|
| PanNet | 30.79 | 0.839 | 340K | 1x |
| PanFormer | 35.0+ | 0.92+ | 1M | 3x |
| PanFormerLite | 34.62 | 0.908 | 370K | 1.5x |