PoseMaster is a comprehensive pipeline built on ComfyUI and Stable Diffusion for transforming images of people: reposing them to match a target pose, modifying their facial features, and changing their clothes. It combines advanced pose estimation, facial conditioning, image generation, and detail refinement to deliver realistic, high-quality results. Key functionalities include:
- Dynamic reposing based on a target pose image.
- Facial feature modification using a reference image.
- Clothes changing through segmentation masks and reference images.
These transformations are achieved by chaining a series of pre-trained models in a modular pipeline covering input processing, image conditioning, and output generation.
To run PoseMaster, you'll need the following dependencies:
- Python 3.7+
- PyTorch
- TorchVision
- Pillow (PIL)
- Numpy
- ComfyUI and its related modules
- Pre-trained models for ControlNet, CLIP, IP-Adapter, FreeU-V2, and more
Clone this repository and navigate to the project directory:
git clone <your-repository-url>
cd <your-repository-directory>
Install the required Python packages as described in the Requirements section.
Ensure you have the necessary pre-trained models and place them in the appropriate directories. Some example models include:
- vae-ft-mse-840000-ema-pruned.safetensors
- realdream.safetensors
- control_v11p_sd15_openpose.pth
- and more...
Modify any paths in the code to match your local setup if needed; a quick path check is sketched below.
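If you are unsure whether the model files ended up in the right place, a small check such as the one below can help. It is a minimal sketch, assuming a standard ComfyUI layout (`models/checkpoints`, `models/vae`, `models/controlnet`); the root path and the listed file names are examples, so adjust them to your installation.

```python
# Quick sanity check that the expected model files are in place.
# ROOT and the file names below are assumptions; edit them to match your setup.
from pathlib import Path

ROOT = Path("ComfyUI/models")  # assumed ComfyUI installation root

expected = {
    "vae": ["vae-ft-mse-840000-ema-pruned.safetensors"],
    "checkpoints": ["realdream.safetensors"],
    "controlnet": ["control_v11p_sd15_openpose.pth"],
}

for folder, files in expected.items():
    for name in files:
        path = ROOT / folder / name
        status = "ok" if path.exists() else "MISSING"
        print(f"[{status}] {path}")
```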
PoseMaster starts by loading the input images, which include the following (a minimal loading sketch follows the list):
- Pose Images: Define the target body positioning.
- Face Images: Used to condition facial features.
- Clothes Images: Provide reference for clothing transformation.
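As a rough illustration, the inputs can be loaded as RGB tensors in the (batch, height, width, channel) layout that ComfyUI uses; the file names below are placeholders, not paths the project expects, and in the actual workflow this is handled by ComfyUI's image-loading nodes.

```python
# Minimal input-loading sketch. File names are placeholders.
from PIL import Image
import numpy as np
import torch

def load_image(path: str) -> torch.Tensor:
    """Load an image as a float tensor in [0, 1] with shape (1, H, W, C)."""
    img = Image.open(path).convert("RGB")
    arr = np.asarray(img).astype(np.float32) / 255.0
    return torch.from_numpy(arr)[None, ...]

pose_image = load_image("inputs/pose.png")        # target body positioning
face_image = load_image("inputs/face.png")        # facial feature reference
clothes_image = load_image("inputs/clothes.png")  # clothing reference
```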
The pipeline initializes several models using ComfyUI nodes (a code sketch follows the list), including:
- VAE Loader: Loads a Variational Autoencoder (VAE) for image encoding and decoding.
- Checkpoint Loader: Loads the main Stable Diffusion checkpoint.
- CLIP Text Encoder: Encodes text prompts for conditioning image generation.
- ControlNet Loader: Applies ControlNet conditioning for pose and other inputs.
- IP-Adapter Loader: Integrates visual and textual embeddings for image conditioning.
- Detector & SAM Model Loader: Used for segmentation and detection.
- Pose Estimation: The target pose is estimated from the pose image using DWPose models, generating keypoints for the body, face, and hands.
- Face Conditioning: The face image is processed using the IP-Adapter, which combines visual and textual embeddings to condition image generation.
- ControlNet Conditioning: The pipeline applies ControlNet conditioning based on the pose, facial features, and other inputs.
- FreeU-V2 Processing: The diffusion model is patched with FreeU-V2, which rescales backbone and skip-connection features to enhance image detail.
- Sampling and Upscaling: Latent-space sampling and upscaling produce the high-resolution generated image.
- Segmentation: GroundingDINO and SAM models are used to segment specific areas, such as clothes, in the input image (a minimal sketch of this step follows the list).
- Clothes Transformation: The segmented areas are modified using a reference clothes image with CAT-VTON.
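The segmentation step can be illustrated with the `segment_anything` library directly. This is a minimal sketch, not the pipeline's actual node graph: the checkpoint path is an assumption, and the bounding box is a placeholder standing in for what GroundingDINO would return for a text prompt like "clothes".

```python
# Illustrative clothes-segmentation sketch using segment_anything.
# In the real pipeline this runs through ComfyUI detector/SAM nodes.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

image = np.asarray(Image.open("inputs/person.png").convert("RGB"))

# Checkpoint path is an assumption; use wherever your SAM weights live.
sam = sam_model_registry["vit_h"](checkpoint="models/sams/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# Placeholder box (x0, y0, x1, y1) around the clothing region, normally
# produced by GroundingDINO from a text prompt such as "clothes".
clothes_box = np.array([120, 200, 520, 780])

masks, scores, _ = predictor.predict(box=clothes_box, multimask_output=False)
clothes_mask = masks[0]  # boolean H x W mask passed on to the CAT-VTON step
```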
The final image is decoded from the latent space, refined using detailers, and saved as output.
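Sketched below, under the assumption that `vae` and the sampled `latent` come from the earlier steps, is roughly what the decode-and-save stage looks like when driven through ComfyUI's VAEDecode node; the output path is a placeholder.

```python
# Sketch of the final decode-and-save step via ComfyUI's VAEDecode node.
# `vae` is the loaded VAE and `latent` is the {"samples": ...} dict from sampling.
import numpy as np
from PIL import Image
from nodes import VAEDecode

(decoded,) = VAEDecode().decode(vae, latent)  # (B, H, W, C) float tensor in [0, 1]

frame = (decoded[0].cpu().numpy() * 255.0).clip(0, 255).astype(np.uint8)
Image.fromarray(frame).save("output/posemaster_result.png")
```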
This project is licensed under the MIT License. See the LICENSE file for more details.