IP‐Adapter‐Face
We build the training set from public datasets (e.g. LAION) plus some AI-synthesized images. We use the face detection model in the insightface library to keep only images that contain exactly one face. To ensure image quality, we only use images with a resolution above 1024, and we also filter out images in which the face is small. (A face quality scoring model might work better for this filtering, but we haven't tried that yet.) We also apply some data augmentation. The most important step is cropping images to different face proportions so that the model can generate images with various face proportions, such as full-body or half-body photos. Before training, we first crop out the face with the following code:
```python
import cv2
from insightface.app import FaceAnalysis
from insightface.utils import face_align

# Load the insightface pipeline (GPU first, CPU as a fallback).
app = FaceAnalysis(providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("person.png")
if img is None:
    raise FileNotFoundError("person.png could not be read")
faces = app.get(img)
if not faces:
    raise ValueError("no face detected in person.png")

# Align on the detected facial keypoints and crop a 224x224 face image.
norm_face = face_align.norm_crop(img, landmark=faces[0].kps, image_size=224)
```
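The filtering criteria described earlier (exactly one face, high resolution, no small faces) could be sketched roughly as follows. This is a minimal sketch under assumptions: the function name `keep_image`, the reading of "resolution" as the shorter image side, and the `min_face_ratio` threshold are all hypothetical, since the exact values are not given here.

```python
def keep_image(image_hw, face_boxes, min_side=1024, min_face_ratio=0.1):
    """Return True if an image passes the single-face, resolution and face-size checks.

    image_hw: (height, width) of the image in pixels.
    face_boxes: list of (x1, y1, x2, y2) boxes, e.g. taken from detected faces.
    min_side, min_face_ratio: assumed thresholds, not the authors' values.
    """
    height, width = image_hw
    # Keep only images with exactly one detected face.
    if len(face_boxes) != 1:
        return False
    # Assumption: "resolution" means the shorter side of the image.
    if min(height, width) < min_side:
        return False
    # Reject images whose face is small relative to the image.
    x1, y1, x2, y2 = face_boxes[0]
    face_side = min(x2 - x1, y2 - y1)
    return face_side / min(height, width) >= min_face_ratio
```

With insightface, the boxes could come from `[f.bbox for f in app.get(img)]`.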
Initially we did not segment the face, but we later found that segmenting the face (removing the background) reduces the model's dependence on the background.
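Background removal on the cropped face can be illustrated with a simple masking step. This is only a sketch: the mask is assumed to come from some face segmentation model (not named here), and both `remove_background` and the white fill value are hypothetical choices.

```python
import numpy as np

def remove_background(face_img, face_mask, fill_value=255):
    """Keep face pixels and overwrite everything else with a constant color.

    face_img: (H, W, 3) uint8 image, e.g. the aligned 224x224 crop.
    face_mask: (H, W) boolean array, True on face pixels (assumed to be given).
    fill_value: background color; an assumption, the actual fill is not stated.
    """
    out = np.full_like(face_img, fill_value)
    out[face_mask] = face_img[face_mask]
    return out
```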
During training we initially used only horizontal flipping, but in later versions we used stronger data augmentation (such as color transformations).
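The two augmentations mentioned so far, horizontal flipping and cropping to different face proportions, might look like the sketch below. The function names, the square crop window, and the definition of face proportion as face side divided by crop side are assumptions made for illustration.

```python
import numpy as np

def random_hflip(img, rng, p=0.5):
    """Flip an (H, W, C) image left-right with probability p."""
    if rng.random() < p:
        return img[:, ::-1].copy()
    return img

def crop_for_face_ratio(img, bbox, face_ratio):
    """Crop a square window around the face so that face side / crop side is about face_ratio.

    bbox: (x1, y1, x2, y2) face box in pixels. The window is clamped to the image,
    so very small ratios fall back to the largest square that fits.
    """
    h, w = img.shape[:2]
    x1, y1, x2, y2 = bbox
    face_side = max(x2 - x1, y2 - y1)
    side = int(min(face_side / face_ratio, h, w))
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    left = int(np.clip(cx - side / 2, 0, w - side))
    top = int(np.clip(cy - side / 2, 0, h - side))
    return img[top:top + side, left:left + side]
```

Sampling `face_ratio` from a range per training example would yield a mix of close-up and wider crops.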
We mainly consider two image encoders:
- CLIP image encoder: here we use OpenCLIP ViT-H. CLIP image embeddings are good at capturing face structure.
- Face recognition model: here we use the ArcFace model from insightface. The normalized ID embedding is good for ID similarity. (Note that the embedding must be normalized. Some of our earliest experiments were done incorrectly, which is also why the FaceID model was released relatively late.)
In addition, we also tried DINO. In our preliminary experiments it performed better than CLIP, but we did not release a specific model.
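To illustrate the note on normalized ID embeddings: L2 normalization puts every embedding on the unit sphere, so a dot product between two embeddings is their cosine similarity. The helper names below are hypothetical. insightface face objects also expose a `normed_embedding` attribute that can be used directly.

```python
import numpy as np

def l2_normalize(embedding, eps=1e-12):
    """Scale an embedding vector to unit L2 norm."""
    embedding = np.asarray(embedding, dtype=np.float32)
    return embedding / max(float(np.linalg.norm(embedding)), eps)

def id_similarity(a, b):
    """Cosine similarity between two ID embeddings."""
    return float(np.dot(l2_normalize(a), l2_normalize(b)))
```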
We use the same training strategy as IP-Adapter:
- SD 1.5: 512x512, batch size = 8*8, lr = 1e-4 (using 8xV100s with 32GB)
- SDXL: (1) 512x512, batch size = 8*8, lr = 1e-4; (2) 1024x1024 (or multi scale), batch size = 4*8, lr = 1e-5 (also use noise offset) (using 8xA100s with 40GB)
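The noise offset mentioned for the second SDXL stage is commonly implemented by adding a per-sample, per-channel constant to the Gaussian training noise. The sketch below shows that common form. The function name and the `noise_offset` value are assumptions, as the value used in training is not stated here.

```python
import numpy as np

def offset_noise(shape, rng, noise_offset=0.1):
    """Sample training noise with a per-sample, per-channel offset.

    shape: (batch, channels, height, width) of the latents.
    noise_offset: strength of the offset (assumed value).
    """
    b, c, h, w = shape
    noise = rng.standard_normal(shape)
    # One extra Gaussian value per (sample, channel), broadcast over H and W.
    noise += noise_offset * rng.standard_normal((b, c, 1, 1))
    return noise
```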
The model is the same as the ip-adapter-plus model, but uses a cropped face image as the condition. Here, we use a Q-Former (16 tokens) to extract face features from CLIP image embeddings.
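The idea of extracting 16 tokens from CLIP image embeddings can be sketched as learned query tokens cross-attending to the CLIP patch embeddings. This is a single-head, single-layer simplification with random weights, not the released architecture. All dimensions other than the 16 tokens are assumed values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to context tokens."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
clip_dim, dim, num_tokens, num_patches = 1280, 64, 16, 257  # assumed sizes
learned_queries = rng.standard_normal((num_tokens, dim)) * 0.02  # trainable in practice
w_q = rng.standard_normal((dim, dim)) * 0.02
w_k = rng.standard_normal((clip_dim, dim)) * 0.02
w_v = rng.standard_normal((clip_dim, dim)) * 0.02
clip_patch_embeds = rng.standard_normal((num_patches, clip_dim))

# 16 output tokens summarize all patch embeddings of the face crop.
face_tokens = cross_attention(learned_queries, clip_patch_embeds, w_q, w_k, w_v)
```

However many patch embeddings go in, the output is always 16 tokens, which is the bottleneck this design imposes.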
We found that 16 tokens are not enough to learn the face structure, so in this version we directly use an MLP to map CLIP image embeddings into new features as input to the IP-Adapter. Therefore, this model is a little better than plus-face.
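In contrast to a fixed number of query tokens, an MLP maps each CLIP embedding token independently, so no token-count bottleneck is introduced. The sketch below uses a two-layer MLP with random weights. The layer sizes, the ReLU activation, and the name `mlp_project` are assumptions.

```python
import numpy as np

def mlp_project(clip_embeds, w1, b1, w2, b2):
    """Apply a two-layer MLP to every token of the CLIP embeddings."""
    hidden = np.maximum(clip_embeds @ w1 + b1, 0.0)  # ReLU (assumed activation)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
clip_dim, cross_attn_dim, num_patches = 1280, 768, 257  # assumed sizes
w1 = rng.standard_normal((clip_dim, clip_dim)) * 0.02
b1 = np.zeros(clip_dim)
w2 = rng.standard_normal((clip_dim, cross_attn_dim)) * 0.02
b2 = np.zeros(cross_attn_dim)
clip_embeds = rng.standard_normal((num_patches, clip_dim))

# Every input token is kept and projected to the adapter's input width.
face_features = mlp_project(clip_embeds, w1, b1, w2, b2)
```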
We use the face ID embedding from a face recognition model instead of the CLIP image embedding. Additionally, we use LoRA to improve ID consistency. Hence, IP-Adapter-FaceID = an IP-Adapter model + a LoRA. Why use LoRA? Because we found that the ID embedding is not as easy to learn as the CLIP embedding, and adding LoRA helps the model learn it. If only portrait photos are used for training, the ID embedding is relatively easy to learn, which gives us IP-Adapter-FaceID-Portrait. This also suggests that the ID embedding can be learned better if appropriate strategies are adopted, such as an ID loss or a GAN loss. (The amount of training data is also critical.)
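As a reminder of what the LoRA part adds, here is a minimal sketch of a linear layer with a low-rank update: the frozen weight `W` is augmented by `scale * B @ A`, and only `A` and `B` are trained. The class name, rank, and scaling are illustrative assumptions. Which layers carry the LoRA is not specified here.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T plus a trainable low-rank update (alpha / rank) * x (B A)^T."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                 # frozen base weight
        self.a = rng.standard_normal((rank, in_dim)) * 0.01  # trainable down-projection
        self.b = np.zeros((out_dim, rank))                   # trainable up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.a.T @ self.b.T)

rng = np.random.default_rng(0)
base_weight = rng.standard_normal((8, 16))
layer = LoRALinear(base_weight, rank=4, alpha=4.0, rng=rng)
x = rng.standard_normal((2, 16))
y = layer(x)  # equals the base layer output until b is trained away from zero
```

Because `b` starts at zero, training begins from the unmodified base model.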
We found that with the ID embedding alone, the generation results were unstable. For example, they were strongly affected by the prompt. We therefore combined the ID embedding with the CLIP embedding. Specifically, we use the ID embedding as the query for the Q-Former.
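Using the ID embedding as the query can be sketched as follows: the normalized ID vector is projected into a few query tokens, which then attend to the projected CLIP embeddings. This is a simplified single-head illustration with random weights. The number of tokens, all dimensions, and the helper `attend` are assumptions.

```python
import numpy as np

def attend(queries, context):
    """Single-head attention where the context serves as both keys and values."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
id_dim, clip_dim, dim, num_tokens = 512, 1280, 64, 4  # assumed sizes
id_embed = rng.standard_normal(id_dim)
id_embed /= np.linalg.norm(id_embed)                  # normalized ID embedding
w_id = rng.standard_normal((id_dim, num_tokens * dim)) * 0.02
w_clip = rng.standard_normal((clip_dim, dim)) * 0.02
clip_embeds = rng.standard_normal((257, clip_dim))

# The queries come from the ID embedding; keys and values come from CLIP.
id_queries = (id_embed @ w_id).reshape(num_tokens, dim)
fused_tokens = attend(id_queries, clip_embeds @ w_clip)
```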
As discussed before, the CLIP embedding is easier to learn than the ID embedding, so IP-Adapter-FaceID-Plus favors the CLIP embedding, which makes the model less editable. In the V2 version, we therefore slightly modified the structure and turned it into a shortcut structure: ID embedding + CLIP embedding (via Q-Former). At the same time, we dropped the CLIP embedding 50% of the time during training. Specifically, we used a drop path training strategy.
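The shortcut structure with drop path can be sketched as a residual sum in which the CLIP branch is zeroed for a random subset of samples during training. The function name and tensor layout are assumptions. Some drop path implementations also rescale the kept branch by `1 / (1 - drop_prob)`. That step is omitted here because it is not stated whether it was used.

```python
import numpy as np

def shortcut_with_drop_path(id_tokens, clip_tokens, rng, drop_prob=0.5, training=True):
    """Return id_tokens + clip_tokens, dropping the CLIP branch per sample while training.

    id_tokens, clip_tokens: arrays of shape (batch, tokens, dim).
    drop_prob: probability of dropping the CLIP branch for a given sample.
    """
    if not training:
        return id_tokens + clip_tokens
    # One keep/drop decision per sample, broadcast over tokens and channels.
    keep = rng.random((id_tokens.shape[0], 1, 1)) >= drop_prob
    return id_tokens + keep * clip_tokens
```

When the CLIP branch is dropped, the output depends on the ID tokens alone, which pushes the model to make use of the ID embedding.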