First of all, great repository sir!!🤯
I was going through the paper, there was this image manipulation method through text difference.
It went like this:
z_i := original image CLIP embedding
z_t := new text CLIP embedding/ embedding of the text for current image manipulation
z_t0 := original image's corresponding text CLIP embedding / text embedding of the text 'a photo' / empty embedding
z_d := l2_norm(z_t - z_t0) <-> text difference vector
z_new / z_theta := spherical_interpolation(z_i, z_d, theta) {where theta is in (0, 0.5)} <-> new image's CLIP embedding vector
What I don't understand is this: a matching CLIP image/text pair is supposed to have similar embedding vectors (since CLIP is trained with cosine similarity), while the difference between the text embeddings of two similar texts will be roughly perpendicular to both of them. The text-difference vector z_d should therefore point in a very different direction from the image embedding z_i, so spherically interpolating between them shouldn't give a meaningful result.
What am I missing? I am unable to understand why this text-difference method works.
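For concreteness, here is a minimal numpy sketch of the manipulation steps described above. The slerp implementation is the standard formula, and the random vectors are toy stand-ins of my own (real z_i, z_t, z_t0 would come from CLIP's image and text encoders, typically 512-d):

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between v0 and v1 (normalized internally)."""
    v0 = v0 / np.linalg.norm(v0)
    v1 = v1 / np.linalg.norm(v1)
    omega = np.arccos(np.clip(np.dot(v0, v1), -1.0, 1.0))  # angle between the vectors
    if np.isclose(omega, 0.0):
        return v0  # (near-)parallel vectors: nothing to interpolate
    return (np.sin((1.0 - t) * omega) * v0 + np.sin(t * omega) * v1) / np.sin(omega)

# Toy stand-ins for CLIP embeddings (assumption: 512-d, as in the public CLIP models).
rng = np.random.default_rng(0)
z_i  = rng.normal(size=512)   # original image CLIP embedding
z_t  = rng.normal(size=512)   # target text CLIP embedding
z_t0 = rng.normal(size=512)   # source/neutral text embedding (e.g. "a photo")

# Normalized text-difference direction, then slerp toward it with a small theta.
z_d = (z_t - z_t0) / np.linalg.norm(z_t - z_t0)
theta = 0.3                   # interpolation weight in (0, 0.5)
z_new = slerp(z_i, z_d, theta)  # manipulated image embedding
```

With a small theta the result stays on the unit sphere at angle theta * omega from the (normalized) original image embedding, i.e. it only nudges z_i toward the text-difference direction rather than replacing it.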