This site contains NSFW material and is only for adults.
This is a weekend experiment in applying diffusion models, the same technology that powers DALL-E 2, to adult-content generation.
Loosely inspired by Denoising Diffusion Probabilistic Models https://arxiv.org/abs/2006.11239, Cascaded Diffusion Models https://arxiv.org/pdf/2106.15282.pdf and
https://github.com/openai/guided-diffusion.
The models were kept small so that training could be done on two 1080 Ti GPUs over a weekend.
Training was done on ~500k video frames (extracted at one frame per second).
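As a rough illustration of the "one frame per second" sampling, here is a minimal sketch that computes which frame indices to keep from a video. The function name `frame_indices` and its parameters are hypothetical and are not taken from the project's code; decoding the video itself is left out.

```python
from typing import List


def frame_indices(total_frames: int, fps: float, every_seconds: float = 1.0) -> List[int]:
    """Return the indices of the frames to keep, one every `every_seconds` seconds.

    Hypothetical helper: `total_frames` and `fps` are assumed to come from
    whatever video decoder is used.
    """
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    indices = []
    k = 0
    while True:
        # Frame closest to the timestamp k * every_seconds.
        idx = round(k * every_seconds * fps)
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices


# A 4-second clip at 25 fps keeps the frames at 0 s, 1 s, 2 s and 3 s.
print(frame_indices(100, 25.0))
```

Passing a larger `every_seconds` value spaces the kept frames further apart.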
The first row shows the sequence of generated pictures at increasing resolution (unconditional 32x32, 2x upsampling to 64x64, 2x upsampling to 128x128, cubic interpolation to 400x400).
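The resolution cascade can be sketched as a chain of stages. This is only a sketch of the data flow, under heavy assumptions: every stage below is a stand-in (random noise for the base model, pixel repetition for the learned 2x upsamplers, nearest-neighbour resizing for the cubic interpolation), and all function names are hypothetical. In a real cascade each upsampling stage would be a diffusion model conditioned on the lower-resolution image.

```python
import numpy as np


def base_sample(rng, size=32):
    # Stand-in for the unconditional 32x32 diffusion model: random pixels in [0, 1).
    return rng.random((size, size, 3))


def upsample_2x(image):
    # Stand-in for a learned 2x upsampling stage: repeat each pixel twice per axis.
    return image.repeat(2, axis=0).repeat(2, axis=1)


def resize_nearest(image, size):
    # Stand-in for the final cubic interpolation: nearest-neighbour resize.
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]


def run_cascade(seed=0):
    rng = np.random.default_rng(seed)
    x32 = base_sample(rng)             # unconditional 32x32
    x64 = upsample_2x(x32)             # 2x upsampling to 64x64
    x128 = upsample_2x(x64)            # 2x upsampling to 128x128
    x400 = resize_nearest(x128, 400)   # final resize to 400x400
    return [x32, x64, x128, x400]


for stage in run_cascade():
    print(stage.shape)
```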
For the animated pictures, the first row is at 32x32 and the second at 64x64 resolution. The same algorithm is applied to generate multiple frames one second apart. The frames are stacked channel-wise, so that the computational cost of generating multiple frames is roughly the same as generating a single frame. (If you have more compute resources, you can stack them side by side as a bigger image and attend to them with an attention module, or use 3D convolutions.)
Training from video allows for better context understanding and reduces issues like motion blur.
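Channel-wise stacking can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming frames are stored as a `(T, H, W, C)` array; the helper names `stack_frames` and `unstack_frames` are hypothetical. Four RGB frames become a single 12-channel image, so the network sees one image with more channels and the spatial size stays the same.

```python
import numpy as np


def stack_frames(frames: np.ndarray) -> np.ndarray:
    """Stack T frames of shape (H, W, C) into one (H, W, T*C) image."""
    t, h, w, c = frames.shape
    # Move the time axis next to the channel axis, then merge the two.
    return frames.transpose(1, 2, 0, 3).reshape(h, w, t * c)


def unstack_frames(stacked: np.ndarray, channels: int = 3) -> np.ndarray:
    """Invert stack_frames: (H, W, T*C) back to (T, H, W, C)."""
    h, w, tc = stacked.shape
    if tc % channels != 0:
        raise ValueError("channel count is not a multiple of `channels`")
    t = tc // channels
    return stacked.reshape(h, w, t, channels).transpose(2, 0, 1, 3)


frames = np.random.default_rng(0).random((4, 32, 32, 3))  # 4 RGB frames
stacked = stack_frames(frames)
print(stacked.shape)  # (32, 32, 12)
```

With this layout, channels `3*t` to `3*t + 2` of the stacked image hold frame `t`, so a generated sample can be split back into frames with `unstack_frames`.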
It is also possible to use virtual reality video as input for better depth perception.
By providing frames separated by 20 seconds, the same module can be used to generate video thumbnails.
The roadmap for future developments:
-Conditioning on past (and optionally future) frames, generate the intermediate frames.
-Conditioning on the title and categories of the video using https://www.sbert.net/, generate mosaic thumbnails of the video, to allow custom video from a text prompt.
-Conditioning on the generated thumbnail, generate intermediate frames.
-Conditioning on the generated video frames, generate an audio spectrogram.
-Improve quality by increasing scale. (Will require more compute resources...)
Training data can be scraped from the internet, as search engines do.
The models build an internal representation of everything they see, compressing it into a kind of average of everything they have seen.
If the dataset is too small, such as a 1k image set, the current model overfits and learns to regenerate the exact dataset. But as the dataset size increases, the model's limited capacity means it can't store everything, and the generated data therefore becomes more and more original.
You can use https://github.com/alex000kim/nsfw_data_scraper to easily scrape adult content.
Once you have an adult content generator that produces original content of good quality, you can use it to train your next model to free yourself from potential copyright issues.