Deep Learning-Based Virtual Try-On System Using Multi-Modal Feature Fusion and Generative Adversarial Networks

Abstract

This paper presents a comprehensive deep learning-based virtual try-on system that addresses the challenge of realistic garment transfer in e-commerce applications. The system leverages multi-modal feature fusion combining cloth-agnostic person representation, pose estimation, and human parsing to enable identity-preserving virtual try-on.

Key Contributions:

• Multi-Modal Input Architecture: A 41-channel input representation combining cloth-agnostic RGB (3 channels), OpenPose Body25 pose heatmaps (18 channels), and LIP human parsing masks (20 channels)
• Advanced Neural Architecture: U-Net generator with self-attention mechanisms (26.4M parameters) and spectral-normalized PatchGAN discriminator (2.8M parameters) for stable adversarial training
• Sophisticated Loss Function: Multi-component objective combining adversarial loss (LSGAN), perceptual loss (VGG19, 5 layers), L1 reconstruction, and feature matching losses
• Complete Pipeline Implementation: End-to-end system from data preprocessing through model training with systematic analysis of each component

Technical Details:

• Dataset: VITON-HD (10,482 training samples, 2,032 test samples)
• Framework: PyTorch
• Architecture: U-Net with self-attention + Spectral-normalized PatchGAN
• Training: Proof-of-concept validation (10 epochs, CPU-based, 256×192 resolution)
• Evaluation: SSIM, PSNR, L1 distance metrics with comprehensive quantitative and qualitative analysis
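As a rough illustration of the 41-channel input described above, the sketch below concatenates the three modalities along the channel axis in PyTorch. The function name `build_generator_input`, the channel ordering, the one-hot encoding of the parsing map, and the reading of 256×192 as height × width are assumptions made for this example. The abstract specifies only the channel counts (3 + 18 + 20 = 41), not how the paper's pipeline assembles them.

```python
import torch
import torch.nn.functional as F

NUM_PARSING_CLASSES = 20  # LIP parsing labels, per the abstract


def build_generator_input(agnostic_rgb, pose_heatmaps, parsing_labels):
    """Fuse the three modalities into one (B, 41, H, W) tensor.

    agnostic_rgb:   (B, 3, H, W) float tensor, cloth-agnostic person image
    pose_heatmaps:  (B, 18, H, W) float tensor, one heatmap per keypoint
    parsing_labels: (B, H, W) integer tensor with values in [0, 20)

    Hypothetical helper: ordering and encoding are assumptions.
    """
    # One-hot encode the label map: (B, H, W) -> (B, H, W, 20) -> (B, 20, H, W)
    parsing_onehot = F.one_hot(parsing_labels.long(), NUM_PARSING_CLASSES)
    parsing_onehot = parsing_onehot.permute(0, 3, 1, 2).float()

    # Channel-wise concatenation: 3 + 18 + 20 = 41 channels
    return torch.cat([agnostic_rgb, pose_heatmaps, parsing_onehot], dim=1)


# Usage with random placeholder data at the 256x192 training resolution
batch = build_generator_input(
    torch.rand(2, 3, 256, 192),
    torch.rand(2, 18, 256, 192),
    torch.randint(0, NUM_PARSING_CLASSES, (2, 256, 192)),
)
```

Concatenating along the channel axis lets a single convolutional encoder see appearance, pose, and segmentation cues at every spatial location, which is one common way to realize this kind of multi-modal fusion.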
